Datasets are essential components for building knowledge-driven AI applications, allowing you to organize, store, and efficiently retrieve information that powers intelligent conversations and automated workflows. A dataset acts as a centralized repository for structured or unstructured data that can be queried, searched, and referenced by bots, agents, and other AI-powered systems.

Creating a Dataset

Creating a dataset is the foundational step in building a knowledge base for your AI applications. When you create a dataset, you establish a container that can hold records, files, and structured information that will be searchable and retrievable by your conversational agents and applications.

The dataset creation process allows you to configure various storage and retrieval parameters that determine how your data is indexed, searched, and presented to AI models. Careful consideration of these settings during creation ensures optimal performance and relevance in your application's responses.

To create a new dataset, send a POST request with the basic information and optional configuration parameters:
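A minimal sketch in TypeScript using fetch. The base URL, the /v1/dataset/create path, and the bearer-token scheme are assumptions based on common REST conventions, not confirmed specifics of this API; substitute your platform's actual endpoint and authentication. The body fields correspond to the configuration options listed below.

```typescript
// Create a dataset; field names follow the configuration options below.
// Endpoint path and auth scheme are assumptions - adjust to your platform.
const response = await fetch('https://api.example.com/v1/dataset/create', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    name: 'Product Documentation',
    description: 'Help articles and FAQs for the support bot',
    recordMaxTokens: 1000,
    searchMaxRecords: 5,
    searchMaxTokens: 2000,
    visibility: 'private',
  }),
})

const { id } = await response.json()
console.log(id) // e.g. dts_abc123xyz
```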

Key Configuration Options

  • name: A descriptive identifier for your dataset
  • description: Detailed explanation of the dataset's purpose and content
  • store: The underlying storage backend (defaults to platform default)
  • recordMaxTokens: Maximum tokens per record for optimal chunking
  • searchMaxRecords: Maximum number of records returned in search results
  • searchMaxTokens: Maximum total tokens in search results
  • visibility: Access control (private, protected, or public)
  • matchInstruction: Instructions for when records match a query
  • mismatchInstruction: Instructions for when no records match a query

The API returns the newly created dataset's ID upon successful creation:
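A representative response body; the exact envelope may differ on your platform:

```json
{
  "id": "dts_abc123xyz"
}
```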

Important Considerations:

  • Immutable Settings: The store type cannot be changed after creation, so choose carefully based on your performance and scale requirements
  • Token Limits: Setting appropriate token limits helps balance context richness with response time and cost
  • Search Configuration: Fine-tune search parameters based on your use case—more records provide broader context but may introduce noise

Best Practices:

  • Use descriptive names that clearly indicate the dataset's content
  • Set recordMaxTokens based on your content granularity (500-2000 tokens is typical)
  • Consider visibility settings carefully, especially for sensitive data
  • Link datasets to blueprints for organized project management

Deleting a Dataset

Deleting a dataset permanently removes it from your account along with all its records and associated data. The operation is irreversible, so use it carefully, especially for datasets that contain important information or are actively used by bots or other applications.

When you delete a dataset, the entire dataset entity is removed, including its name, description, store configuration, and all records it contains. The operation automatically handles cleanup of related resources, including vector embeddings and indexed data stored in the underlying data store.

Before deleting a dataset, consider whether you need to:

  • Export your data: If you might need the data later, export records first
  • Update bot configurations: Remove or update any bots that reference this dataset
  • Check dependencies: Verify that no active applications depend on this dataset

To delete a dataset, send a POST request with the dataset ID:
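A minimal sketch, assuming a POST /v1/dataset/{datasetId}/delete route; the path and auth scheme are illustrative, not confirmed:

```typescript
// Permanently delete a dataset by ID. The route shape is an assumption.
const datasetId = 'dts_abc123xyz'

const response = await fetch(
  `https://api.example.com/v1/dataset/${datasetId}/delete`,
  {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  }
)

const { id } = await response.json()
console.log(`Deleted dataset ${id}`)
```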

The request returns the ID of the deleted dataset upon successful completion:
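As with creation, the exact envelope may vary; a representative shape:

```json
{
  "id": "dts_abc123xyz"
}
```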

Important Considerations:

  • Permanent deletion: Deleted datasets cannot be recovered
  • Record cleanup: All records within the dataset are also deleted
  • Store cleanup: Vector embeddings and indexed data are removed from the store
  • Authorization: You can only delete datasets that belong to your account

If you only need to take a dataset out of service temporarily, remove it from bot configurations rather than deleting it. If you do decide to delete, export the data first for safekeeping.

Retrieving a Specific Dataset

Fetching detailed information about a specific dataset allows you to access its complete configuration, search parameters, storage settings, and metadata. This is essential for understanding how a dataset is configured, verifying settings before modifications, or displaying dataset information in user interfaces.

The fetch operation returns comprehensive details about the dataset, including all configuration options that were set during creation or subsequent updates. This information can be used to replicate dataset configurations, audit settings, or make informed decisions about dataset usage in your applications.

To retrieve a specific dataset by its ID, send a GET request:
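A minimal sketch, assuming a GET /v1/dataset/{datasetId}/fetch route; adjust the path to your platform's actual API:

```typescript
// Fetch a single dataset's full configuration. Route shape is an assumption.
const datasetId = 'dts_abc123xyz'

const response = await fetch(
  `https://api.example.com/v1/dataset/${datasetId}/fetch`,
  {
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  }
)

const dataset = await response.json()
console.log(dataset.name, dataset.store, dataset.searchMaxRecords)
```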

Replace {datasetId} with the actual dataset identifier (e.g., dts_abc123xyz).

Response Details

The response includes the complete dataset configuration:
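An illustrative response body. The field names follow the list below; the concrete values, null placeholders, and timestamp format are examples only:

```json
{
  "id": "dts_abc123xyz",
  "name": "Product Documentation",
  "description": "Help articles and FAQs for the support bot",
  "store": "pinecone",
  "reranker": null,
  "recordMaxTokens": 1000,
  "searchMinScore": 0.7,
  "searchMaxRecords": 5,
  "searchMaxTokens": 2000,
  "matchInstruction": "Answer using the records below.",
  "mismatchInstruction": "Say that no relevant information was found.",
  "visibility": "private",
  "blueprintId": null,
  "meta": {},
  "createdAt": 1700000000000,
  "updatedAt": 1700000000000
}
```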

Key Fields Explained

  • store: The vector database or storage backend being used
  • reranker: Optional reranking model for improved search relevance
  • recordMaxTokens: Maximum token limit per individual record
  • searchMinScore: Minimum similarity score threshold for search results
  • searchMaxRecords: Maximum number of records returned in searches
  • searchMaxTokens: Total token limit across all search results
  • matchInstruction: System instruction when records are found
  • mismatchInstruction: System instruction when no records match
  • visibility: Access control level (private, protected, public)

Common Use Cases

  • Configuration Auditing: Verify current settings before making updates
  • Dataset Cloning: Retrieve configuration to replicate in new datasets
  • UI Display: Show dataset settings in administrative interfaces
  • Integration Setup: Confirm dataset parameters before connecting to bots
  • Debugging: Diagnose search behavior by reviewing configuration

Authorization Note: You can only fetch datasets that belong to your account. Attempting to access datasets owned by other users will result in an authorization error.

Updating a Dataset

Modifying an existing dataset allows you to refine its configuration, adjust search parameters, update instructions, and change metadata without affecting the underlying data records. This flexibility enables you to optimize dataset performance and behavior as your application requirements evolve.

Dataset updates are ideal for tuning search relevance, adjusting token limits based on performance observations, refining match/mismatch instructions, or updating organizational metadata. The update operation preserves all existing records while applying new configuration settings that will affect future search and retrieval operations.

To update a dataset, send a POST request with the fields you want to modify:
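A minimal sketch, assuming a POST /v1/dataset/{datasetId}/update route; only the fields being changed are sent:

```typescript
// Partially update a dataset; omitted fields keep their current values.
// Route shape is an assumption - adjust to your platform.
const datasetId = 'dts_abc123xyz'

const response = await fetch(
  `https://api.example.com/v1/dataset/${datasetId}/update`,
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      searchMinScore: 0.75,
      searchMaxRecords: 3,
      matchInstruction: 'Answer using only the records provided below.',
    }),
  }
)

const { id } = await response.json()
```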

Replace {datasetId} with your dataset's identifier (e.g., dts_abc123xyz). You only need to include the fields you want to update—unchanged fields will retain their current values.

Updatable Fields

The following properties can be modified after dataset creation:

  • name: Display name for the dataset
  • description: Detailed description of contents and purpose
  • recordMaxTokens: Maximum tokens per record chunk
  • searchMinScore: Minimum similarity threshold for search results
  • searchMaxRecords: Maximum number of records returned per search
  • searchMaxTokens: Total token limit across all search results
  • matchInstruction: Instructions when records are found
  • mismatchInstruction: Instructions when no matching records exist
  • visibility: Access control (private, protected, public)
  • reranker: Reranking model for improving search relevance
  • separators: Custom text separators for record chunking
  • blueprintId: Associated blueprint for organization
  • meta: Custom metadata for flexible categorization

Immutable Properties

Important: The following properties cannot be changed after creation:

  • store: The underlying storage backend (e.g., pinecone, postgres)

Attempting to modify the store type will have no effect. If you need to change the storage backend, you must create a new dataset and migrate your data.

Response

Upon successful update, the API returns the dataset ID:
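A representative shape; the exact envelope may vary:

```json
{
  "id": "dts_abc123xyz"
}
```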

Common Update Scenarios

Tuning Search Relevance: Adjust searchMinScore and searchMaxRecords based on observed result quality. Higher scores increase precision but may reduce recall.

Optimizing Token Usage: Modify recordMaxTokens and searchMaxTokens to balance context richness with API costs and response time.

Refining Instructions: Update matchInstruction and mismatchInstruction to improve how AI models use retrieved records and how they respond when no records match.

Changing Visibility: Adjust access control as your dataset's sensitivity or sharing requirements change over time.

Best Practices:

  • Make incremental changes and test the impact before further adjustments
  • Update instructions to be specific about how information should be used
  • Monitor search performance after configuration changes
  • Keep descriptions current as dataset content evolves
  • Use metadata updates to maintain organizational clarity

Listing Datasets

Retrieving a comprehensive list of all datasets in your account is essential for managing your knowledge bases, monitoring data organization, and accessing dataset configurations programmatically. The list endpoint provides powerful filtering and pagination capabilities to help you efficiently navigate large collections of datasets.

The listing operation returns detailed information about each dataset, including its configuration, storage settings, search parameters, and metadata. This is particularly useful for building administrative interfaces, implementing dataset selection features in applications, or automating dataset management workflows.

To retrieve a list of your datasets, send a GET request:
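A minimal sketch, assuming a GET /v1/dataset/list route and an items array in the response envelope; both are conventions, not confirmed specifics of this API:

```typescript
// List all datasets in the account. Route and envelope are assumptions.
const response = await fetch('https://api.example.com/v1/dataset/list', {
  headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
})

const { items } = await response.json()

for (const dataset of items) {
  console.log(dataset.id, dataset.name, dataset.store)
}
```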

The response includes all datasets associated with your account, returned as an array of dataset objects with their complete configuration and metadata.

Pagination and Ordering

For accounts with many datasets, pagination helps manage the response size and improve performance:
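A sketch of a single paginated request using the parameters described below; how the next-page cursor is obtained (here, assumed to be carried over from the previous response) depends on your platform:

```typescript
// Fetch one page of datasets. Parameter names (cursor, take, order) follow
// the list below; the route and cursor convention are assumptions.
const params = new URLSearchParams({
  cursor: 'dts_abc123xyz', // token or ID taken from the previous page
  take: '25',
  order: 'desc',
})

const response = await fetch(
  `https://api.example.com/v1/dataset/list?${params}`,
  {
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  }
)

const { items } = await response.json()
```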

Available pagination parameters:

  • cursor: Pagination token from previous response to fetch the next page
  • take: Number of datasets to retrieve per request
  • order: Sort order by creation date ("asc" or "desc", defaults to "desc")

Filtering by Blueprint

To retrieve only datasets associated with a specific blueprint or project:
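A sketch assuming the list endpoint accepts a blueprintId query parameter; the parameter name mirrors the dataset field, and the bp_ identifier is hypothetical:

```typescript
// List only datasets linked to a given blueprint. The blueprintId query
// parameter and the example identifier are assumptions.
const params = new URLSearchParams({ blueprintId: 'bp_abc123xyz' })

const response = await fetch(
  `https://api.example.com/v1/dataset/list?${params}`,
  {
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  }
)

const { items } = await response.json()
```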

This is useful when working with organized project structures where datasets are grouped by purpose or workflow.

Filtering by Metadata

Datasets with custom metadata can be filtered using meta queries, enabling sophisticated organizational schemes:
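One common convention for meta queries is meta[key]=value query parameters; this syntax is an assumption, so check your platform's documentation for the exact form:

```typescript
// Filter datasets by a custom metadata field. The meta[key]=value query
// syntax is an assumed convention, not a confirmed part of this API.
const params = new URLSearchParams({ 'meta[team]': 'support' })

const response = await fetch(
  `https://api.example.com/v1/dataset/list?${params}`,
  {
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  }
)

const { items } = await response.json()
```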

Response Structure

Each dataset in the response includes:

  • Core identifiers: id, name, description
  • Storage configuration: store type, reranker settings
  • Search parameters: recordMaxTokens, searchMinScore, searchMaxRecords, searchMaxTokens
  • Instructions: matchInstruction, mismatchInstruction
  • Resource relationships: blueprintId
  • Access control: visibility setting
  • Metadata: Custom meta fields
  • Timestamps: createdAt, updatedAt

Best Practices:

  • Use pagination for large dataset collections to improve API performance
  • Apply filters when searching for specific datasets to reduce response size
  • Leverage metadata filtering for custom organizational structures
  • Store pagination cursors for efficient navigation through results