A Dataset in TrueState is a self-contained table of structured data—whether imported from a CSV file you uploaded or ingested via an external integration (e.g., Snowflake, S3, Salesforce)—that lives in our managed warehouse. Once created, a dataset’s schema and contents are immutable, serving as a reliable, versioned source for all downstream work: you can preview and query its rows, transform and enrich it in Pipelines, visualize it in Dashboards, and reference it in Reports or Automations.

Dataset Creation

Please see the section ‘Loading Data’ for a detailed guide on how to bring data into the TrueState platform.

Data Schema

The datatype of each column is determined automatically during ingestion. Each column is assigned one of four built-in types:
  • STRING: Text values of arbitrary length (e.g. names, descriptions, categorical codes).
  • NUMBER: Numeric values, including integers and floats (e.g. counts, prices, measurements).
  • BOOLEAN: True/false flags or binary indicators (e.g. “active/inactive”, “yes/no”).
  • DATETIME: Timestamps or dates (e.g. event times, birthdates, log timestamps).
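As an illustration of the inference step, the sketch below pairs a hypothetical row of incoming data with the built-in type each column would map to. The column names and values are invented for this example; the type mapping follows the descriptions above, not actual platform output.

```python
# Hypothetical incoming row and the built-in type each column maps to
# (illustrative only; not real platform output).
sample_row = {
    "name": "Alice",                     # free text        -> STRING
    "price": 19.99,                      # integer or float -> NUMBER
    "active": True,                      # true/false flag  -> BOOLEAN
    "signed_up": "2024-01-15T09:30:00",  # timestamp        -> DATETIME
}

inferred_schema = {
    "name": "STRING",
    "price": "NUMBER",
    "active": "BOOLEAN",
    "signed_up": "DATETIME",
}
```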

AI Description

The dataset is processed by an AI to create an enriched description that considers the datatypes, column names, and distribution of values. This description is provided as context to any AI operations in order to reduce hallucinations, and can be viewed on the Dataset details page.

Data Dictionary

The Data Dictionary is an optional setting that lets you annotate columns directly, providing further context about the content of each column. This additional context is very helpful when asking the bot to perform operations on the data. The Data Dictionary can also be auto-generated by an AI by clicking the ‘Auto-generate’ button on this page.

Dataset Upsert API

Advanced users may wish to upsert their datasets. The upsert API is documented below. Here is an example request body:
{
    "dataset_name": "customers",
    "row_unique_identifier": "id",
    "data": [
        {"id": "1", "name": "name1", "num": 1, "misc": {"a": 1, "b": "1"}},
        {"id": "2", "name": "name2", "num": 2, "misc": {"a": 2, "b": "2"}},
        {"id": "3", "name": "name3", "num": 3, "misc": {"a": 2, "b": "2"}}
    ],
    "data_schema": {"id": "string", "name": "string", "num": "integer", "misc": "json"},
}
  • dataset_name: the name of the dataset to upsert. If the dataset doesn’t already exist, it will be created by the API call.
  • row_unique_identifier: the name of the column that uniquely identifies each row (e.g. id); required for the update operation.
  • data: an array of JSON objects, each object representing one row.
  • data_schema: optional. Specifies the schema of the objects in the ‘data’ field as key-value pairs of column_name -> column_type. column_name must be a top-level field in the JSON objects; do NOT provide the inner nested fields. If omitted, the schema is inferred automatically.
Valid values for column types are:
  • INTEGER
  • FLOAT
  • STRING
  • BOOLEAN
  • JSON
Limitations:
  • When upserting into a new dataset that doesn’t exist yet, the first object in the data array cannot contain a null value in any field, because the new dataset’s schema cannot be inferred from null values.
  • The platform processes at most 20 concurrent events for a single dataset.
  • The maximum size of the data field is ~1 GB.
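The request-body rules above can be sketched as a small client-side helper. Everything beyond this page is an assumption: the helper name, the endpoint URL, and the Bearer-token auth header in the usage comment are hypothetical placeholders, not the documented TrueState API surface; only the body fields, the valid column types, and the null-value limitation come from the documentation.

```python
import json

# Column types accepted by data_schema (compared case-insensitively here).
VALID_TYPES = {"integer", "float", "string", "boolean", "json"}

def build_upsert_body(dataset_name, row_unique_identifier, data,
                      data_schema=None, new_dataset=False):
    """Assemble and sanity-check an upsert request body (hypothetical helper)."""
    if not data:
        raise ValueError("data must contain at least one row")
    if new_dataset and any(v is None for v in data[0].values()):
        # A brand-new dataset's schema is inferred from the first row,
        # so that row may not contain null values.
        raise ValueError("first row of a new dataset cannot contain nulls")
    if data_schema is not None:
        unknown = {t for t in data_schema.values() if t.lower() not in VALID_TYPES}
        if unknown:
            raise ValueError(f"unknown column types: {unknown}")
    body = {
        "dataset_name": dataset_name,
        "row_unique_identifier": row_unique_identifier,
        "data": data,
    }
    if data_schema is not None:
        body["data_schema"] = data_schema
    return json.dumps(body)

# Hypothetical usage; the URL and auth header are placeholders:
# requests.post("https://<your-instance>/api/datasets/upsert",
#               data=build_upsert_body("customers", "id", rows, schema),
#               headers={"Authorization": "Bearer <token>",
#                        "Content-Type": "application/json"})
```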