Web Scraper Integration
The web scraper integration in DataQI allows users to ingest content from public web pages. It supports both single-page ingestion and multi-page scraping based on hyperlinks or XML sitemap structures.
Overview
- Supported content: Public-facing textual content from pages within the same domain as the original URL
- Refresh interval:
  - Static (no refresh; data is kept as captured at the first scrape)
  - Tracked (content is refreshed every 24 hours)
- Scraper depth (content scope):
  - Single page
  - Multi-page (via discovered links, one level deep)
  - Deep search (via discovered links, up to five levels deep)
- Limitations: Only public text content is extracted. Dynamic elements, interactive components, content inside `<iframe>` tags, and linked content on other domains are not supported.
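The "public text content" limitation above can be illustrated with a minimal sketch using Python's standard-library `html.parser`: visible text is kept, while anything inside `<script>`, `<style>`, `<iframe>`, or `<noscript>` is dropped. This is an assumption-level illustration of the kind of filtering described, not DataQI's actual implementation.

```python
from html.parser import HTMLParser

# Tags whose inner content is excluded, per the limitations above.
# (Illustrative choice; the real scraper's skip list is not documented.)
SKIPPED_TAGS = {"script", "style", "iframe", "noscript"}

class TextOnlyExtractor(HTMLParser):
    """Collects visible text, ignoring content inside skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_text('<p>Hello</p><iframe>hidden</iframe>')` keeps `Hello` and drops the iframe's content.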
Availability
Web scraping is enabled by default for all users with upload permissions. No setup request is required.
Usage Process
- Navigate to the data sources area in the DataQI platform.
- Select Add website.
- Enter the URL of the webpage to scrape.
- Select the depth of the web scrape (single page, multi-page, or deep search).
- Select whether to refresh the content, or run a static web scrape.
- Name the data source with a unique name.
- Submit the request. The content will be ingested, and after a short time it will be ready to assign to an assistant.
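The depth options chosen above map naturally onto a depth-limited, breadth-first link traversal: single page corresponds to depth 0, multi-page to depth 1, and deep search to depth 5. The sketch below shows that traversal under those assumptions, with link fetching stubbed out via an injected `get_links` function; it is not DataQI's internal crawler.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(root_url, get_links, max_depth):
    """Breadth-first traversal of same-domain links up to max_depth.

    get_links(url) -> list of hrefs found on that page (injected here
    so the traversal logic is shown without any network access).
    """
    root_host = urlparse(root_url).netloc
    seen = {root_url}
    queue = deque([(root_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links past the configured depth
        for link in get_links(url):
            absolute = urljoin(url, link)
            # Links pointing to other domains are out of scope.
            if urlparse(absolute).netloc == root_host and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

With `max_depth=0` only the root page is ingested; raising it to 1 or 5 matches the multi-page and deep-search options.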
Security and Scope
- Data from scraped pages is scoped only to the assistants assigned the data source.
- Data from scraped pages is scoped only to the same domain as the root URL configured.
- Scraped content can be refreshed automatically every 24 hours to maintain up-to-date information.
- This integration is ideal for teams that want to include public regulatory sources, supplier pages, or other reference sites in their assistant’s knowledge base.
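The same-domain scoping rule above amounts to comparing each discovered link's host against the host of the configured root URL. A minimal sketch with Python's `urllib.parse`, assuming an exact host match (how DataQI treats subdomains is not documented here):

```python
from urllib.parse import urljoin, urlparse

def in_scope(root_url: str, link: str) -> bool:
    """Return True if `link` resolves to the same host as `root_url`.

    Relative links are resolved against the root URL, so they always
    stay in scope; absolute links to other hosts are rejected.
    """
    root_host = urlparse(root_url).netloc.lower()
    target = urlparse(urljoin(root_url, link))
    return target.scheme in ("http", "https") and target.netloc.lower() == root_host
```

For example, `in_scope("https://example.com/docs", "/pricing")` is true, while `in_scope("https://example.com/docs", "https://other.org/page")` is false.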