Data sources & files

DataQI uses Retrieval Augmented Generation (RAG) at its core: it takes the general training of a Large Language Model (LLM) and supplements it with context from specific data sources.
These data sources can vary, and can be provided to DataQI in a number of ways. They enable responses based on specific, defined information.

Files are loaded in one of three ways:

Directly: using file upload
Indirectly: using integrations like Google Drive or SharePoint, or the web scraper
On-demand: using SQL connections

What does supported mean?

General support

There are three general levels of support for file types in DataQI.

Level of support	Description	Content types
Fully supported	All possible content of the file may be embedded into DataQI	File formats which can only contain content types that are understood by DataQI (e.g. text-only), such as `.txt`
Partially supported	Some of the content that may be in the file may be embedded into DataQI	File formats which can contain a mixture of understood and not understood content types (e.g. text and images), such as `.pdf`
Not supported	None of the content in the file may be embedded in DataQI	File formats which can only contain content types that are not understood by DataQI, such as `.bin`

Only fully supported or partially supported files may be uploaded to DataQI for training (embedding).

Training from supported content

Once DataQI has been trained on the data from uploaded files, that data becomes part of the searchable knowledge base. This enables:

Retrieval of the most relevant content.
Answers that are grounded in the source documents.
Response citations (quotes) from source documents.
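To illustrate the idea of retrieving relevant content and citing its source, here is a toy sketch. It is not DataQI's implementation: it ranks text chunks by simple word overlap rather than by embeddings, and the `Chunk` class, the `retrieve` function and the sample file names are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical file name, kept so the answer can cite it
    text: str


def _words(text: str) -> set[str]:
    """Lower-case the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[Chunk], top_k: int = 2) -> list[Chunk]:
    """Rank chunks by how many words they share with the question."""
    query = _words(question)
    scored = [(len(query & _words(chunk.text)), chunk) for chunk in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop chunks with no overlap at all, then keep the best top_k.
    return [chunk for score, chunk in ranked if score > 0][:top_k]


# Made-up knowledge base for illustration only.
knowledge_base = [
    Chunk("leave-policy.txt", "Employees receive 25 days of annual leave."),
    Chunk("expenses.txt", "Submit expense claims within 30 days."),
]

best = retrieve("How many days of annual leave do employees receive?",
                knowledge_base, top_k=1)
for chunk in best:
    print(f'"{chunk.text}" (source: {chunk.source})')
```

In a real RAG system the retrieved chunks would be passed to the LLM as additional context, and the `source` field is what makes a citation possible.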

Assistant-specific support

Some file types are explicitly supported, or explicitly not supported, by specific types of assistant. For example:

The document writer assistant only supports `.docx` files as templates.
The file search assistant will only return results for generally supported (and embedded) files.

Supported file types

The table below outlines the file types and content types that are supported in DataQI. Any file type not included in this list can be considered not supported.

File type	Support level	Supported content	Unsupported content
`.txt`	Fully supported	Plain text	Simulated tabular data (e.g. using spaces, tabs, pipes & line characters to create the appearance of a tabular display) will be treated as plain text.
`.rtf`	Partially supported	Plain text, headings, lists, tables	Embedded images, hyperlinks, advanced formatting
`.md` / `.markdown`	Partially supported	Plain text, headings, lists, tables	Embedded images, hyperlinks, advanced formatting
`.pdf`	Partially supported	Plain text, headings, lists, tables	Images or audio, hyperlinks, advanced formatting
`.doc` / `.docx`	Partially supported	Plain text, headings, lists, tables	Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements
`.odt`
`.gdoc`
`.epub`
`.ppt` / `.pptx`	Partially supported	Slide text, speaker notes, text inside shapes	Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements
`.xls` / `.xlsx`	Partially supported	Text-based cell contents, row & column headers	Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements
`.csv`	Fully supported	Text-based cell contents, column headers
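The table above can be expressed as a simple lookup, which may be useful for checking files before upload. This is a minimal sketch, not part of DataQI: the names `SUPPORT_LEVELS`, `support_level` and `can_upload` are hypothetical, and the extensions listed above without a stated support level (`.odt`, `.gdoc`, `.epub`) are reported as "Not stated" rather than guessed.

```python
from pathlib import Path

# Support levels copied from the table above.
SUPPORT_LEVELS = {
    ".txt": "Fully supported",
    ".rtf": "Partially supported",
    ".md": "Partially supported",
    ".markdown": "Partially supported",
    ".pdf": "Partially supported",
    ".doc": "Partially supported",
    ".docx": "Partially supported",
    ".ppt": "Partially supported",
    ".pptx": "Partially supported",
    ".xls": "Partially supported",
    ".xlsx": "Partially supported",
    ".csv": "Fully supported",
}

# Listed in the table, but without a support level given there.
NOT_STATED = {".odt", ".gdoc", ".epub"}


def support_level(filename: str) -> str:
    """Return the support level for a file name, based on its extension."""
    extension = Path(filename).suffix.lower()
    if extension in NOT_STATED:
        return "Not stated"
    # Any file type missing from the table is treated as not supported.
    return SUPPORT_LEVELS.get(extension, "Not supported")


def can_upload(filename: str) -> bool:
    """Only fully or partially supported files may be uploaded."""
    return support_level(filename) in ("Fully supported", "Partially supported")


for name in ["report.PDF", "data.csv", "firmware.bin"]:
    print(name, "->", support_level(name))
```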

For the purposes of the table above, the term “images” includes raster or vector images of any kind, including but not limited to:

Photographs & pictures
Image sequences (animation) and videos
Graphs and charts
Barcodes/QR codes
WordArt, SmartArt, shapes & image icons
Handwriting & drawings
Slide transitions
Text contained within an image

Supported content types

Various types of supported content are handled in different ways by DataQI, and results may vary based on the quality and/or format of the source data. This section provides some context on how this is handled, and examples of how results may be impacted by quality or format.

Text

Raw text is the most common type of content that Large Language Models (LLMs) can train against (embed), and it is where their predictive generated responses are most mature. DataQI uses text to augment the LLM’s training to provide answers based on your own text-based sources.

Text formatting

Simple text formatting, such as headers, paragraphs, and unordered or ordered lists, adds structure to the text data. DataQI can understand this structure, and it provides additional context to the embedded data.

Tabular data

Structure and consistency are important factors when DataQI is interpreting tabular data, such as that from an Excel spreadsheet. Poorly formatted and organised spreadsheets can result in less valuable responses and insights, particularly for more complex queries.
Excel is a very powerful tool, but many of the file format’s capabilities produce content that DataQI cannot embed or take context from.
Some examples of the elements that are not extracted from an Excel spreadsheet are listed in the table above, but also include things such as:

Cell formatting (conditional or manual)
Links to/from multiple sheets (or other files)
Charts, pivot tables, sparklines

Tabular data: negatively influencing factors

Some examples of things that can negatively impact DataQI’s ability to provide good, meaningful answers from tabular data are:

Missing headers
Completely empty columns/rows
Merged cells
Fields/records are arranged in the transposed orientation (data fields are rows, records are columns)
Complex custom formatting
Complex custom layouts
Inconsistent data types (e.g. a column with predominantly numeric values, but some text values)
Multiple tables present on individual sheets
Non-tabular content (e.g. using a worksheet for a “cover page”)
Cells where meaning is derived from non-textual content (e.g. a cell coloured red/amber/green with no textual content to provide the same meaning)
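Some of the factors above (missing headers, completely empty rows or columns, and inconsistent data types) can be checked before upload. The following is a rough sketch that assumes the table has been exported as CSV text; `find_table_issues` is a hypothetical helper, not a DataQI function, and it does not cover formatting-based problems such as merged or colour-coded cells.

```python
import csv
import io


def _is_number(value: str) -> bool:
    """Return True if the value can be parsed as a number."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def find_table_issues(csv_text: str) -> list[str]:
    """Return human-readable warnings for a table supplied as CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["table is empty"]
    header, records = rows[0], rows[1:]
    issues = []

    # Missing headers: any blank cell in the first row.
    for index, name in enumerate(header):
        if not name.strip():
            issues.append(f"column {index + 1} has no header")

    # Completely empty rows (row 1 is the header row).
    for number, record in enumerate(records, start=2):
        if not any(cell.strip() for cell in record):
            issues.append(f"row {number} is completely empty")

    # Completely empty columns, and columns mixing numbers with text.
    for index, name in enumerate(header):
        values = [r[index].strip() for r in records if index < len(r)]
        filled = [v for v in values if v]
        label = name.strip() or f"column {index + 1}"
        if not filled:
            issues.append(f"{label} is completely empty")
            continue
        numeric = [v for v in filled if _is_number(v)]
        if 0 < len(numeric) < len(filled):
            issues.append(f"{label} mixes numeric and text values")
    return issues


# Made-up sample with an unnamed empty column, an empty row and a mixed column.
sample = "name,,score\nAda,,10\n,,\nBob,,n/a\n"
for issue in find_table_issues(sample):
    print(issue)
```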

Tabular data: positively influencing factors

Some examples of best practice that can positively impact DataQI’s ability to provide good, meaningful answers from tabular data are:

Single row used for headers
Single sheet used for a single table
Fields/records are arranged in the standard orientation (data fields are columns, records are rows)
Text placeholders for zero, null or not applicable are used in cells (rather than leaving them empty)
Data types (integers, strings etc.) are consistent by record (row or column as appropriate)
Units of measurement are present in the data (either in the header or the values)
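Two of the practices above, the standard orientation and explicit placeholders in place of empty cells, can be applied with a small amount of preprocessing. This sketch assumes the table is already held as a list of rows; the helper names and the `"N/A"` placeholder are illustrative choices, not DataQI requirements.

```python
from itertools import zip_longest


def to_standard_orientation(rows: list[list[str]]) -> list[list[str]]:
    """Transpose a table whose fields are rows and whose records are columns."""
    # zip_longest pads short rows with "" so that no data is silently dropped.
    return [list(column) for column in zip_longest(*rows, fillvalue="")]


def fill_empty_cells(rows: list[list[str]], placeholder: str = "N/A") -> list[list[str]]:
    """Replace blank data cells with an explicit text placeholder."""
    header, records = rows[0], rows[1:]
    filled = [
        [cell if cell.strip() else placeholder for cell in record]
        for record in records
    ]
    return [header] + filled


# Made-up transposed table: each row is a field, each column a record.
# The unit of measurement is carried in the field name (height_cm).
transposed = [
    ["name", "Ada", "Bob"],
    ["height_cm", "170", ""],
]

table = fill_empty_cells(to_standard_orientation(transposed))
for row in table:
    print(row)
```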