Data sources & files
DataQI uses Retrieval Augmented Generation (RAG) at its core, taking the training of a Large Language Model (LLM) and providing additional context from specific data sources.
These specific data sources can be varied, and can be provided to DataQI in a number of ways. They enable responses based on specific, defined information.
Files are loaded either:
- Directly: using file upload
- Indirectly: using integrations like Google Drive or SharePoint, or the web scraper
- On-demand: using SQL connections
What does supported mean?
General support
There are three general levels of support for file types in DataQI.
| Level of support | Description | Content types |
|---|---|---|
| Fully supported | All possible content of the file may be embedded into DataQI | File formats which can only contain content types that are understood by DataQI (e.g. text-only), such as .txt |
| Partially supported | Some of the content that may be in the file may be embedded into DataQI | File formats which can contain a mixture of understood and not understood content types (e.g. text and images), such as .pdf |
| Not supported | None of the content in the file may be embedded in DataQI | File formats which can only contain content types that are not understood by DataQI, such as .bin |
Only fully supported or partially supported files may be uploaded to DataQI for training (embedding).
Training from supported content
Once DataQI is trained on the data from uploaded files, that data becomes part of the searchable knowledge base. This enables:
- Retrieval of the most relevant content.
- Answers that are grounded in the source documents.
- Response citations (quotes) from source documents.
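The general idea behind retrieval, grounding and citation can be sketched in a few lines of Python. This is a toy illustration only and not DataQI's implementation: the `CHUNKS` data, the function names and the prompt wording are all hypothetical, and simple word overlap stands in for the embedding-based similarity a real RAG system would use.

```python
import re

# Hypothetical knowledge base: (source name, text chunk) pairs.
CHUNKS = [
    ("handbook.txt", "Annual leave requests must be submitted two weeks in advance."),
    ("handbook.txt", "Expenses are reimbursed within thirty days of submission."),
    ("it-policy.txt", "Passwords must be changed every ninety days."),
]


def tokens(text):
    """Lower-case a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=1):
    """Rank chunks by how many words they share with the question.

    Assumption: word overlap is a stand-in for embedding similarity.
    """
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c[1])), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks):
    """Combine retrieved chunks (with their sources) and the question."""
    context = "\n".join(f'[{source}] "{text}"' for source, text in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


top = retrieve("How often must passwords be changed?", CHUNKS)
print(build_prompt("How often must passwords be changed?", top))
```

The prompt keeps the source name next to each retrieved chunk, which is what allows a response to quote the document it came from.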
Assistant-specific support
Some file types are explicitly supported or not when using specific types of assistants. For example:
- The document writer assistant only supports .docx files as templates.
- The file search assistant will only return results for generally supported (and embedded) files.
Supported file types
The table below outlines the file types & content types that are supported in DataQI. Any file type not included in this list can be considered not supported.
| File type | Support level | Supported content | Unsupported content |
|---|---|---|---|
| .txt | Fully supported | Plain text | Simulated tabular data (e.g. using spaces, tabs, pipes & line characters to create the appearance of a tabular display), will be treated as plain text. |
| .rtf | Partially supported | Plain text, headings, lists, tables | Embedded images, hyperlinks, advanced formatting |
| .md / .markdown | | | |
| .pdf | Partially supported | Plain text, headings, lists, tables | Images or audio, hyperlinks, advanced formatting |
| .doc / .docx | Partially supported | Plain text, headings, lists, tables | Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements |
| .odt | | | |
| .gdoc | | | |
| .epub | | | |
| .ppt / .pptx | Partially supported | Slide text, speaker notes, text inside shapes | Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements |
| .xls / .xlsx | Partially supported | Text-based cell contents, row & column headers | Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements |
| .csv | Fully supported | Text-based cell contents, column headers | |
For the purposes of the table above, the term “images” includes raster or vector images of any kind, including but not limited to:
- Photographs & pictures
- Image sequences (animation) and videos
- Graphs and charts
- Barcodes/QR codes
- WordArt, SmartArt, shapes & image icons
- Hand writing & drawings
- Slide transitions
- Text contained within an image
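As an illustration, the support levels in the supported file types table above could be checked before upload with a small lookup like the one below. This is a sketch, not part of DataQI: the function name and the "Unknown" result are assumptions. Extensions that appear in the table without a stated support level are reported as "Unknown" rather than guessed.

```python
from pathlib import Path

# Support levels copied from the table rows that state one.
STATED_LEVELS = {
    ".txt": "Fully supported",
    ".csv": "Fully supported",
    ".rtf": "Partially supported",
    ".pdf": "Partially supported",
    ".doc": "Partially supported",
    ".docx": "Partially supported",
    ".ppt": "Partially supported",
    ".pptx": "Partially supported",
    ".xls": "Partially supported",
    ".xlsx": "Partially supported",
}

# Listed in the table, but the row does not state a level (assumption: unknown).
LISTED_WITHOUT_LEVEL = {".md", ".markdown", ".odt", ".gdoc", ".epub"}


def support_level(filename):
    """Return the support level for a file name, based on its extension."""
    ext = Path(filename).suffix.lower()
    if ext in STATED_LEVELS:
        return STATED_LEVELS[ext]
    if ext in LISTED_WITHOUT_LEVEL:
        return "Unknown"
    # Any file type not in the table is treated as not supported.
    return "Not supported"


for name in ["notes.txt", "Report.PDF", "firmware.bin"]:
    print(name, "->", support_level(name))
```

The check is case-insensitive, so `Report.PDF` and `report.pdf` give the same result.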
Supported content types
Various types of supported content are handled in different ways by DataQI, and results may vary based on the quality and/or format of the source data. This section provides some context on how this is handled, and examples of how results may be impacted by quality or format.
Text
Raw text is the most common type of content that Large Language Models (LLMs) can train against (embed), and it is where their predictive generated responses are strongest. DataQI uses text to augment the LLM’s training to provide answers based on your own text-based sources.
Text formatting
Simple text formatting, such as headers, paragraphs, and unordered or ordered lists, adds structure to the text data. DataQI can understand this structure and use it as additional context for the embedded data.
Tabular data
Structure and consistency are important factors when DataQI is interpreting tabular data, such as that from an Excel spreadsheet. Poorly formatted and organised spreadsheets can result in less valuable responses and insights, particularly for more complex queries.
Excel is a very powerful tool, but many of the file format’s capabilities do not produce content that DataQI can embed or take context from.
Some of the elements that are not extracted from an Excel spreadsheet are listed in the table above. Others include:
- Cell formatting (conditional or manual)
- Links to/from multiple sheets (or other files)
- Charts, pivot tables, sparklines
Tabular data: negatively influencing factors
Some examples of things that can negatively impact on DataQI’s ability to provide good, meaningful answers from tabular data are:
- Missing headers
- Completely empty columns/rows
- Merged cells
- Fields/records are arranged in the inverted orientation (data fields are rows, records are columns)
- Complex custom formatting
- Complex custom layouts
- Inconsistent data types (e.g. a column with predominantly numeric values, but some text values)
- Multiple tables present on individual sheets
- Non-tabular content (e.g. using a worksheet for a “cover page”)
- Cells where meaning is derived from non-textual content (e.g. a cell coloured red/amber/green with no textual content to provide the same meaning)
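A few of the factors listed above can be checked automatically before a file is uploaded. The sketch below is a hypothetical pre-upload check, not a DataQI feature. It assumes the data has been exported to CSV with the header in the first row, and it only covers missing headers, completely empty rows or columns, and columns that mix numbers with text. Factors such as merged cells or cell colours are lost in a CSV export and cannot be detected this way.

```python
import csv
import io


def is_number(value):
    """Return True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def find_table_issues(csv_text):
    """Return a list of human-readable warnings for a CSV string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, records = rows[0], rows[1:]
    issues = []

    # Missing headers: any blank cell in the first row.
    for i, name in enumerate(header):
        if not name.strip():
            issues.append(f"column {i + 1}: missing header")

    # Completely empty rows (row numbers count the header as row 1).
    for r, row in enumerate(records, start=2):
        if not any(cell.strip() for cell in row):
            issues.append(f"row {r}: completely empty")

    # Completely empty columns, and columns mixing numbers and text.
    for i in range(len(header)):
        values = [row[i].strip() for row in records if i < len(row) and row[i].strip()]
        if not values:
            issues.append(f"column {i + 1}: completely empty")
            continue
        numeric = [is_number(v) for v in values]
        if any(numeric) and not all(numeric):
            issues.append(f"column {i + 1}: mixes numbers and text")
    return issues


sample = "name,,score\nAda,,10\n,,\nBob,,n/a\n"
for issue in find_table_issues(sample):
    print(issue)
```

For the sample data this reports a missing header and an empty column for column 2, an empty row 3, and mixed numbers and text in column 3.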
Tabular data: positively influencing factors
Some examples of best practice that can positively impact on DataQI’s ability to provide good, meaningful answers from tabular data are:
- Single row used for headers
- Single sheet used for a single table
- Fields/records are arranged in the standard orientation (data fields are columns, records are rows)
- Text delimiters for zero, null, not applicable are used in cells (rather than them being empty)
- Data types (integers, strings etc.) are consistent within each field (column or row as appropriate)
- Units of measurement are present in the data (either in the header or the values)
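Two of the practices above, standard orientation and explicit markers in place of empty cells, can be applied with a short script. The following is a minimal sketch under stated assumptions: the data is available as CSV text, the function name and the `inverted` and `empty_marker` parameters are hypothetical, and "N/A" is only one possible marker. It is not a DataQI tool.

```python
import csv
import io


def normalise_table(csv_text, inverted=False, empty_marker="N/A"):
    """Return CSV text with fields as columns and no empty data cells."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""

    # Pad short rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    # If fields are rows and records are columns, swap them.
    if inverted:
        rows = [list(column) for column in zip(*rows)]

    # Keep the header as-is; replace empty data cells with the marker.
    cleaned = [rows[0]] + [
        [cell.strip() or empty_marker for cell in row] for row in rows[1:]
    ]

    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(cleaned)
    return out.getvalue()


inverted_sample = "name,Ada,Bob\nheight (cm),170,\n"
print(normalise_table(inverted_sample, inverted=True))
# name,height (cm)
# Ada,170
# Bob,N/A
```

The sample header also carries the unit of measurement, `height (cm)`, so the values keep their meaning when read in isolation.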