Data sources & files

DataQI uses Retrieval Augmented Generation (RAG) at its core, taking the training of a large language model (LLM) and providing additional context from specific data sources.
These specific data sources can be varied, and can be provided to DataQI in a number of ways. They enable responses based on specific, defined information.

Files are loaded in one of three ways:

  1. Directly: using file upload
  2. Indirectly: using integrations like Google Drive or SharePoint, or the web scraper
  3. On-demand: using SQL connections

What does supported mean?

General support

There are three general levels of support for file types in DataQI.

| Level of support | Description | Content types |
| --- | --- | --- |
| Fully supported | All possible content of the file may be embedded into DataQI | File formats which can only contain content types that are understood by DataQI (e.g. text-only), such as .txt |
| Partially supported | Some of the content that may be in the file may be embedded into DataQI | File formats which can contain a mixture of understood and not understood content types (e.g. text and images), such as .pdf |
| Not supported | None of the content in the file may be embedded into DataQI | File formats which can only contain content types that are not understood by DataQI, such as .bin |

Only fully supported or partially supported files may be uploaded to DataQI for training (embedding).

Training from supported content

Once DataQI has been trained on the data from uploaded files, that data becomes part of the searchable knowledge base. This enables:

  • Retrieval of the most relevant content.
  • Answers that are grounded in the source documents.
  • Response citations (quotes) from source documents.
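
The retrieve-then-ground pattern described above can be illustrated with a toy sketch. This is not DataQI's implementation: the `retrieve` and `build_prompt` functions, the chunk structure and the file names are all hypothetical, and a simple bag-of-words similarity stands in for real embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[dict], top_k: int = 1) -> list[dict]:
    """Return the top_k chunks most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query, Counter(tokenize(c["text"]))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Combine the retrieved chunks and the question into a grounded prompt."""
    # Each chunk is quoted with its source file so the answer can cite it.
    context = "\n".join(f'[{h["source"]}] "{h["text"]}"' for h in hits)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"


# Hypothetical knowledge base: each chunk remembers which file it came from.
chunks = [
    {"source": "leave-policy.txt", "text": "Staff receive 25 days of annual leave."},
    {"source": "expenses.txt", "text": "Expense claims are paid within 14 days."},
]

question = "How many days of annual leave do staff receive?"
hits = retrieve(question, chunks)
print(build_prompt(question, hits))
```

Keeping the source file name alongside each chunk is what makes a citation possible later: the prompt passed to the LLM carries both the quoted text and where it came from.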

Assistant-specific support

Some file types are explicitly supported, or explicitly not supported, by specific types of assistant. For example:

  • The document writer assistant only supports .docx files as templates.
  • The file search assistant will only return results for generally supported (and embedded) files.

Supported file types

The table below outlines the file types and content types that are supported in DataQI. Any file type not included in this list can be considered not supported.

| File type | Support level | Supported content | Unsupported content |
| --- | --- | --- | --- |
| .txt | Fully supported | Plain text | Simulated tabular data (e.g. using spaces, tabs, pipes & line characters to create the appearance of a tabular display) will be treated as plain text. |
| .rtf, .md / .markdown | Partially supported | Plain text, headings, lists, tables | Embedded images, hyperlinks, advanced formatting |
| .pdf | Partially supported | Plain text, headings, lists, tables | Images or audio, hyperlinks, advanced formatting |
| .doc / .docx, .odt, .gdoc, .epub | Partially supported | Plain text, headings, lists, tables | Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements |
| .ppt / .pptx | Partially supported | Slide text, speaker notes, text inside shapes | Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements |
| .xls / .xlsx | Partially supported | Text-based cell contents, row & column headers | Embedded images or audio, advanced formatting, comments or tracked changes, hyperlinks, styles, formulae, interactive elements |
| .csv | Fully supported | Text-based cell contents, column headers | |
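
As an illustration of how the table might be applied before uploading, the following sketch looks up a file's support level from its extension. The `support_level` and `can_upload` helpers are hypothetical and are not part of DataQI. Treating .md / .markdown like .rtf, and .odt, .gdoc and .epub like .doc / .docx, is an assumption based on how those rows are grouped in the table.

```python
from pathlib import Path

FULLY = "Fully supported"
PARTIALLY = "Partially supported"
NOT_SUPPORTED = "Not supported"

# Support levels transcribed from the table above.
# Assumption: .md/.markdown share the .rtf row, and .odt/.gdoc/.epub share
# the .doc/.docx row, based on how those rows are grouped in the table.
SUPPORT_LEVELS = {
    ".txt": FULLY,
    ".csv": FULLY,
    ".rtf": PARTIALLY,
    ".md": PARTIALLY,
    ".markdown": PARTIALLY,
    ".pdf": PARTIALLY,
    ".doc": PARTIALLY,
    ".docx": PARTIALLY,
    ".odt": PARTIALLY,
    ".gdoc": PARTIALLY,
    ".epub": PARTIALLY,
    ".ppt": PARTIALLY,
    ".pptx": PARTIALLY,
    ".xls": PARTIALLY,
    ".xlsx": PARTIALLY,
}


def support_level(filename: str) -> str:
    """Look up the support level from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    # Any file type not in the table is considered not supported.
    return SUPPORT_LEVELS.get(suffix, NOT_SUPPORTED)


def can_upload(filename: str) -> bool:
    """Only fully or partially supported files may be uploaded for training."""
    return support_level(filename) != NOT_SUPPORTED


for name in ["notes.txt", "report.PDF", "firmware.bin"]:
    print(name, "->", support_level(name))
```

A check like this only inspects the extension. It says nothing about whether a partially supported file actually contains content that can be embedded, such as a PDF made up entirely of scanned images.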

For the purposes of the table above, the term “images” includes raster or vector images of any kind, including but not limited to:

  • Photographs & pictures
  • Image sequences (animation) and videos
  • Graphs and charts
  • Barcodes/QR codes
  • WordArt, SmartArt, shapes & image icons
  • Handwriting & drawings
  • Slide transitions
  • Text contained within an image

Supported content types

Various types of supported content are handled in different ways by DataQI, and results may vary based on the quality and/or format of the source data. This section provides some context on how this is handled, and examples of how results may be impacted by quality or format.

Text

Raw text is the most common type of content that large language models (LLMs) can train against (embed), and it is where their predictive generated responses are most mature and strongest. DataQI uses text to augment the LLM’s training to provide answers based on your own text-based sources.

Text formatting

Simple text formatting, such as headers, paragraphs, and unordered or ordered lists, gives the text additional structure that DataQI can understand, and this structure provides additional context to the embedded data.

Tabular data

Structure and consistency are important factors when DataQI is interpreting tabular data, such as that from an Excel spreadsheet. Poorly formatted and organised spreadsheets can result in less valuable responses and insights, particularly for more complex queries.
Excel is a very powerful tool, but many of the capabilities of its file format produce content that DataQI cannot embed or take context from.
Some examples of the elements that are not extracted from an Excel spreadsheet are listed in the table above, but also include things such as:

  • Cell formatting (conditional or manual)
  • Links to/from multiple sheets (or other files)
  • Charts, pivot tables, sparklines

Tabular data: negatively influencing factors

Some examples of things that can negatively impact DataQI’s ability to provide good, meaningful answers from tabular data are:

  • Missing headers
  • Completely empty columns/rows
  • Merged cells
  • Fields/records are arranged in the inverted orientation (data fields are rows, records are columns)
  • Complex custom formatting
  • Complex custom layouts
  • Inconsistent data types (e.g. a column with predominantly numeric values, but some text values)
  • Multiple tables present on individual sheets
  • Non-tabular content (e.g. using a worksheet for a “cover page”)
  • Cells where meaning is derived from non-textual content (e.g. a cell whose colour, such as red/amber/green, carries the meaning, but which has no textual content to provide the same meaning)
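
Some of these factors can be checked mechanically before a file is uploaded. The sketch below is a hypothetical pre-upload check, not a DataQI feature: `audit_table` reads CSV text and reports missing headers, completely empty rows or columns, and columns that mix numeric and text values. Factors such as merged cells or colour-coded meaning do not survive in CSV and are outside its scope.

```python
import csv
import io


def looks_numeric(value: str) -> bool:
    """Return True if the value can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def audit_table(csv_text: str) -> list[str]:
    """Return a list of warnings for a table supplied as CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["table is empty"]
    header, records = rows[0], rows[1:]
    warnings = []

    # Missing headers: a blank cell in the first row.
    for index, name in enumerate(header):
        if not name.strip():
            warnings.append(f"column {index + 1}: missing header")

    # Completely empty rows (row numbers count the header as row 1).
    for number, record in enumerate(records, start=2):
        if not any(cell.strip() for cell in record):
            warnings.append(f"row {number}: completely empty")

    # Completely empty columns and inconsistent data types.
    for index, name in enumerate(header):
        label = name.strip() or f"column {index + 1}"
        values = [r[index].strip() for r in records if index < len(r)]
        filled = [v for v in values if v]
        if not filled:
            warnings.append(f"{label}: completely empty column")
            continue
        numeric = sum(looks_numeric(v) for v in filled)
        if 0 < numeric < len(filled):
            warnings.append(f"{label}: mixed numeric and text values")
    return warnings


sample = "region,,sales,notes\nNorth,x,100,\nSouth,y,n/a,\n,,,\n"
for warning in audit_table(sample):
    print(warning)
```

The sample deliberately contains one unnamed column, one blank row, one column mixing `100` with `n/a`, and one column with no values, so each check produces a warning.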

Tabular data: positively influencing factors

Some examples of best practice that can positively impact DataQI’s ability to provide good, meaningful answers from tabular data are:

  • Single row used for headers
  • Single sheet used for a single table
  • Fields/records are arranged in the standard orientation (data fields are columns, records are rows)
  • Explicit text values for zero, null or not applicable are used in cells (rather than the cells being left empty)
  • Data types (integers, strings etc.) are consistent within each field (column or row as appropriate)
  • Units of measurement are present in the data (either in the header or the values)
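
Two of these practices, standard orientation and explicit values in place of empty cells, can be applied to an existing table with a small amount of code. The following is a minimal sketch under stated assumptions: the `transpose` and `fill_empty` helpers and the sample data are hypothetical, and the `"N/A"` placeholder is only one possible choice, since whether an empty cell really means zero, null or not applicable depends on the data.

```python
def transpose(rows: list[list[str]]) -> list[list[str]]:
    """Turn a table with fields as rows into one with fields as columns."""
    width = max(len(row) for row in rows)
    # Pad short rows so that no values are dropped when transposing.
    padded = [row + [""] * (width - len(row)) for row in rows]
    return [list(column) for column in zip(*padded)]


def fill_empty(rows: list[list[str]], placeholder: str = "N/A") -> list[list[str]]:
    """Replace empty data cells with an explicit text placeholder."""
    header, records = rows[0], rows[1:]
    filled = [
        [cell if cell.strip() else placeholder for cell in record]
        for record in records
    ]
    return [header] + filled


# Hypothetical table in the inverted orientation: fields are rows.
# Units of measurement are included in the field names.
inverted = [
    ["product", "Widget", "Gadget"],
    ["weight (kg)", "1.5", ""],
    ["price (GBP)", "10", "12"],
]

tidy = fill_empty(transpose(inverted))
for row in tidy:
    print(row)
```

After the two steps, the table has a single header row, one record per row, and no empty cells.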