Web Scraper Integration
The web scraper integration in DataQI allows users to ingest content from public web pages. It supports both single-page ingestion and multi-page scraping based on hyperlinks or XML sitemap structures.
Overview
- Supported content: Public-facing textual content from pages within the same domain as the original URL
- Refresh interval:
  - Static (no refresh; data is kept as captured at the first scrape)
  - Tracked (content is refreshed every 24 hours)
- Scraper depth (content scope):
  - Single page
  - Multi-page (via discovered links, one level deep)
  - Deep search (via discovered links, up to five levels deep)
- Limitations: Only public text content is extracted. Dynamic elements, interactive components, content inside `<iframe>` tags, and linked content on other domains are not supported.
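The "public text content" limitation above can be illustrated with a minimal sketch using Python's standard-library `html.parser`: visible text is kept, while anything inside `<script>`, `<style>`, `<iframe>`, or `<noscript>` is dropped. This is an assumption-level illustration of the kind of filtering described, not DataQI's actual implementation.

```python
from html.parser import HTMLParser

# Tags whose inner content is excluded, per the limitations above.
# (Illustrative choice; the real scraper's skip list is not documented.)
SKIPPED_TAGS = {"script", "style", "iframe", "noscript"}

class TextOnlyExtractor(HTMLParser):
    """Collects visible text, ignoring content inside skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_text('<p>Hello</p><iframe>hidden</iframe>')` keeps `Hello` and drops the iframe's content.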
Availability
Web scraping is enabled by default for all users with upload permissions. No setup request is required.
Usage Process
- Navigate to the data sources area in the DataQI platform.
- Select Add website.
- Enter the URL of the webpage to scrape.
- Select the depth of the web scrape (single page, multi-page, or deep search).
- Select whether to refresh the content, or run a static web scrape.
- Name the data source with a unique name.
- Submit the request. The content will be ingested, and after a short time it will be ready to assign to an assistant.
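The depth options chosen above map naturally onto a depth-limited, breadth-first link traversal: single page corresponds to depth 0, multi-page to depth 1, and deep search to depth 5. The sketch below shows that traversal under those assumptions, with link fetching stubbed out via an injected `get_links` function; it is not DataQI's internal crawler.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(root_url, get_links, max_depth):
    """Breadth-first traversal of same-domain links up to max_depth.

    get_links(url) -> list of hrefs found on that page (injected here
    so the traversal logic is shown without any network access).
    """
    root_host = urlparse(root_url).netloc
    seen = {root_url}
    queue = deque([(root_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links past the configured depth
        for link in get_links(url):
            absolute = urljoin(url, link)
            # Links pointing to other domains are out of scope.
            if urlparse(absolute).netloc == root_host and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

With `max_depth=0` only the root page is ingested; raising it to 1 or 5 matches the multi-page and deep-search options.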
Security and Scope
- Data from scraped pages is scoped only to the assistants assigned the data source.
- Data from scraped pages is scoped only to the same domain as the root URL configured.
- Scraped content can be refreshed automatically every 24 hours to maintain up-to-date information.
- This integration is ideal for teams that want to include public regulatory sources, supplier pages, or other reference sites in their assistant’s knowledge base.
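The same-domain scoping rule above amounts to comparing each discovered link's host against the host of the configured root URL. A minimal sketch with Python's `urllib.parse`, assuming an exact host match (how DataQI treats subdomains is not documented here):

```python
from urllib.parse import urljoin, urlparse

def in_scope(root_url: str, link: str) -> bool:
    """Return True if `link` resolves to the same host as `root_url`.

    Relative links are resolved against the root URL, so they always
    stay in scope; absolute links to other hosts are rejected.
    """
    root_host = urlparse(root_url).netloc.lower()
    target = urlparse(urljoin(root_url, link))
    return target.scheme in ("http", "https") and target.netloc.lower() == root_host
```

For example, `in_scope("https://example.com/docs", "/pricing")` is true, while `in_scope("https://example.com/docs", "https://other.org/page")` is false.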