Web Scraper Integration
The web scraper integration in DataQI allows users to ingest content from public web pages. It supports both single-page ingestion and multi-page scraping based on hyperlinks or XML sitemap structures.
Overview
- Supported Content: Public-facing textual content
- Refresh Interval: Every 24 hours
- Crawl Scope: Single page or full site (via discovered links or sitemap)
- Limitations: Only public text content is extracted. Dynamic elements and interactive components are not supported.
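The limitation above can be pictured with a small sketch. This is not DataQI's extractor, only an illustration under the assumption that "public text content" means the text present in a page's static HTML; the class and function names are hypothetical. It uses Python's standard `html.parser` to keep visible text and discard script and style bodies, which is roughly why content that only appears after JavaScript runs would not be captured.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text from static HTML, skipping <script> and <style> bodies.

    Hypothetical helper for illustration; not part of DataQI.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_static_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `extract_static_text` on a page whose body is filled in by a script returns only the text that was already in the markup, not anything the script would have rendered.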
Availability
Web scraping is enabled by default for all users with upload permissions. No setup request is required.
Usage Process
1. Navigate to the Web Scraper area in the DataQI platform.
2. Enter the URL of the webpage to scrape.
3. Select whether to ingest only that page or to crawl the full site, using links discovered on the page or the site's XML sitemap.
4. Submit the request. The content will be processed and indexed within your assistant’s scope.
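To make the two crawl modes in step 3 concrete, here is a minimal sketch of how page URLs could be collected from an XML sitemap or from hyperlinks on a page. It is an illustration only and does not describe how DataQI crawls internally. The function names are hypothetical, and restricting discovered links to the same host is an assumption made here to keep the crawl on one site.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class LinkCollector(HTMLParser):
    """Gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> list:
    """Return absolute, de-duplicated http(s) links on the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        parsed = urlparse(absolute)
        if (parsed.scheme in ("http", "https")
                and parsed.netloc == host
                and absolute not in found):
            found.append(absolute)
    return found
```

A single-page ingest would use only the URL entered in step 2, while a full-site crawl would start from a list like the ones these helpers return.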
Security and Scope
- Scraped content is available only to the assistants permitted by the user who initiated the scrape.
- Scraped content is refreshed automatically every 24 hours to maintain up-to-date information.
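The 24-hour refresh rule can be expressed as a simple timestamp comparison. The sketch below is illustrative only: DataQI performs the refresh automatically, and the `is_refresh_due` helper and its parameters are hypothetical names, not part of the platform.

```python
from datetime import datetime, timedelta, timezone

# Interval taken from the documentation above.
REFRESH_INTERVAL = timedelta(hours=24)


def is_refresh_due(last_scraped_at: datetime, now: datetime = None) -> bool:
    """Return True once at least 24 hours have passed since the last scrape.

    Both datetimes are expected to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_scraped_at >= REFRESH_INTERVAL
```

In practice this means a change published on a source page may take up to a day to appear in the assistant's knowledge base.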
This integration is ideal for teams that want to include public regulatory sources, supplier pages, or other reference sites in their assistant’s knowledge base.