{"id":"datasets","name":"Datasets","description":"Datasets are structured collections of data that serve as knowledge bases for various applications, enabling efficient storage, retrieval, and management of information.","category":"Resources/Datasets","tags":["dataset"],"index":1,"content":"Datasets are essential components for building knowledge-driven AI applications,\nallowing you to organize, store, and efficiently retrieve information that\npowers intelligent conversations and automated workflows. A dataset acts as a\ncentralized repository for structured or unstructured data that can be queried,\nsearched, and referenced by bots, agents, and other AI-powered systems.\n\n## Creating Datasets\n\nCreating a dataset is the foundational step in building a knowledge base for\nyour AI applications. When you create a dataset, you establish a container\nthat can hold records, files, and structured information that will be\nsearchable and retrievable by your conversational agents and applications.\n\nThe dataset creation process allows you to configure various storage and\nretrieval parameters that determine how your data is indexed, searched, and\npresented to AI models. 
Careful consideration of these settings during creation\nensures optimal performance and relevance in your application's responses.\n\nTo create a new dataset, send a POST request with the basic information and\noptional configuration parameters:\n\n```http\nPOST /api/v1/dataset/create\nContent-Type: application/json\n\n{\n  \"name\": \"SupportFAQs\",\n  \"description\": \"Customer support frequently asked questions and answers\",\n  \"store\": \"pinecone\",\n  \"recordMaxTokens\": 1000,\n  \"searchMaxRecords\": 5,\n  \"visibility\": \"private\"\n}\n```\n\n### Key Configuration Options\n\n- **name**: A descriptive identifier for your dataset\n- **description**: Detailed explanation of the dataset's purpose and content\n- **store**: The underlying storage backend (defaults to the platform default)\n- **recordMaxTokens**: Maximum tokens per record for optimal chunking\n- **searchMaxRecords**: Maximum number of records returned in search results\n- **searchMaxTokens**: Maximum total tokens in search results\n- **visibility**: Access control (private, protected, or public)\n- **matchInstruction**: Instructions for when records match a query\n- **mismatchInstruction**: Instructions for when no records match a query\n\nThe API returns the newly created dataset's ID upon successful creation:\n\n```json\n{\n  \"id\": \"dts_abc123xyz\"\n}\n```\n\n**Important Considerations:**\n\n- **Immutable Settings**: The `store` type cannot be changed after creation,\n  so choose carefully based on your performance and scale requirements\n- **Token Limits**: Setting appropriate token limits helps balance context\n  richness with response time and cost\n- **Search Configuration**: Fine-tune search parameters based on your use\n  case; more records provide broader context but may introduce noise\n\n**Best Practices:**\n\n- Use descriptive names that clearly indicate the dataset's content\n- Set `recordMaxTokens` based on your content granularity (500-2000 tokens\n  is typical)\n- Consider 
visibility settings carefully, especially for sensitive data\n- Link datasets to blueprints for organized project management\n\n## Deleting a Dataset\n\nDeleting a dataset permanently removes it from your account along with all\nits records and associated data. This operation is irreversible, so use it\ncarefully, especially for datasets that contain important information or\nare actively being used by bots or other applications.\n\nWhen you delete a dataset, the entire dataset entity is removed, including\nits name, description, store configuration, and all records it contains. The\noperation automatically handles cleanup of related resources, including vector\nembeddings and indexed data stored in the underlying data store.\n\nBefore deleting a dataset, consider whether you need to:\n\n- **Export your data**: If you might need the data later, export records first\n- **Update bot configurations**: Remove or update any bots that reference this dataset\n- **Check dependencies**: Verify that no active applications depend on this dataset\n\nTo delete a dataset, send a POST request with the dataset ID:\n\n```http\nPOST /api/v1/dataset/{datasetId}/delete\nContent-Type: application/json\n\n{}\n```\n\nThe request returns the ID of the deleted dataset upon successful completion:\n\n```json\n{\n  \"id\": \"dts_abc123xyz\"\n}\n```\n\n**Important Considerations:**\n\n- **Permanent deletion**: Deleted datasets cannot be recovered\n- **Record cleanup**: All records within the dataset are also deleted\n- **Store cleanup**: Vector embeddings and indexed data are removed from the store\n- **Authorization**: You can only delete datasets that belong to your account\n\nIf you only need to disable a dataset temporarily, consider removing it from\nbot configurations instead of deleting it. If you do delete it, export its\ndata first for safekeeping.\n\n## Retrieving a Specific Dataset\n\nFetching detailed information about a specific dataset allows you to access\nits 
complete configuration, search parameters, storage settings, and metadata.\nThis is essential for understanding how a dataset is configured, verifying\nsettings before modifications, or displaying dataset information in user\ninterfaces.\n\nThe fetch operation returns comprehensive details about the dataset, including\nall configuration options that were set during creation or subsequent updates.\nThis information can be used to replicate dataset configurations, audit\nsettings, or make informed decisions about dataset usage in your applications.\n\nTo retrieve a specific dataset by its ID, send a GET request:\n\n```http\nGET /api/v1/dataset/{datasetId}/fetch\n```\n\nReplace `{datasetId}` with the actual dataset identifier (e.g., `dts_abc123xyz`).\n\n### Response Details\n\nThe response includes the complete dataset configuration:\n\n```json\n{\n  \"id\": \"dts_abc123xyz\",\n  \"name\": \"SupportFAQs\",\n  \"description\": \"Customer support frequently asked questions\",\n  \"store\": \"pinecone\",\n  \"reranker\": \"cohere\",\n  \"recordMaxTokens\": 1000,\n  \"searchMinScore\": 0.7,\n  \"searchMaxRecords\": 5,\n  \"searchMaxTokens\": 2000,\n  \"matchInstruction\": \"Use the following information to answer the question\",\n  \"mismatchInstruction\": \"No relevant information found\",\n  \"visibility\": \"private\",\n  \"blueprintId\": \"bp_xyz789\",\n  \"meta\": {},\n  \"createdAt\": \"2024-01-15T10:30:00.000Z\",\n  \"updatedAt\": \"2024-01-15T10:30:00.000Z\"\n}\n```\n\n### Key Fields Explained\n\n- **store**: The vector database or storage backend being used\n- **reranker**: Optional reranking model for improved search relevance\n- **recordMaxTokens**: Maximum token limit per individual record\n- **searchMinScore**: Minimum similarity score threshold for search results\n- **searchMaxRecords**: Maximum number of records returned in searches\n- **searchMaxTokens**: Total token limit across all search results\n- **matchInstruction**: System instruction when records 
are found\n- **mismatchInstruction**: System instruction when no records match\n- **visibility**: Access control level (private, protected, public)\n\n### Common Use Cases\n\n- **Configuration Auditing**: Verify current settings before making updates\n- **Dataset Cloning**: Retrieve configuration to replicate in new datasets\n- **UI Display**: Show dataset settings in administrative interfaces\n- **Integration Setup**: Confirm dataset parameters before connecting to bots\n- **Debugging**: Diagnose search behavior by reviewing configuration\n\n**Authorization Note**: You can only fetch datasets that belong to your\naccount. Attempting to access datasets owned by other users will result in\nan authorization error.\n\n## Updating a Dataset\n\nModifying an existing dataset allows you to refine its configuration, adjust\nsearch parameters, update instructions, and change metadata without affecting\nthe underlying data records. This flexibility enables you to optimize dataset\nperformance and behavior as your application requirements evolve.\n\nDataset updates are ideal for tuning search relevance, adjusting token limits\nbased on performance observations, refining match/mismatch instructions, or\nupdating organizational metadata. 
The update operation preserves all existing\nrecords while applying new configuration settings that will affect future\nsearch and retrieval operations.\n\nTo update a dataset, send a POST request with the fields you want to modify:\n\n```http\nPOST /api/v1/dataset/{datasetId}/update\nContent-Type: application/json\n\n{\n  \"name\": \"Updated FAQ Database\",\n  \"description\": \"Comprehensive customer support FAQ with product information\",\n  \"recordMaxTokens\": 1500,\n  \"searchMaxRecords\": 8,\n  \"searchMinScore\": 0.75,\n  \"matchInstruction\": \"Answer using only the information provided below\",\n  \"mismatchInstruction\": \"No relevant FAQ found for this question\",\n  \"visibility\": \"protected\"\n}\n```\n\nReplace `{datasetId}` with your dataset's identifier (e.g., `dts_abc123xyz`).\nYou only need to include the fields you want to update; unchanged fields\nretain their current values.\n\n### Updatable Fields\n\nThe following properties can be modified after dataset creation:\n\n- **name**: Display name for the dataset\n- **description**: Detailed description of contents and purpose\n- **recordMaxTokens**: Maximum tokens per record chunk\n- **searchMinScore**: Minimum similarity threshold for search results\n- **searchMaxRecords**: Maximum number of records returned per search\n- **searchMaxTokens**: Total token limit across all search results\n- **matchInstruction**: Instructions when records are found\n- **mismatchInstruction**: Instructions when no matching records exist\n- **visibility**: Access control (private, protected, public)\n- **reranker**: Reranking model for improving search relevance\n- **separators**: Custom text separators for record chunking\n- **blueprintId**: Associated blueprint for organization\n- **meta**: Custom metadata for flexible categorization\n\n### Immutable Properties\n\n**Important**: The following properties **cannot be changed** after creation:\n\n- **store**: The underlying storage backend (e.g., pinecone, 
postgres)\n\nAttempting to modify the store type will have no effect. If you need to\nchange the storage backend, you must create a new dataset and migrate your\ndata.\n\n### Response\n\nUpon successful update, the API returns the dataset ID:\n\n```json\n{\n  \"id\": \"dts_abc123xyz\"\n}\n```\n\n### Common Update Scenarios\n\n**Tuning Search Relevance:**\nAdjust `searchMinScore` and `searchMaxRecords` based on observed result\nquality. A higher `searchMinScore` increases precision but may reduce recall.\n\n**Optimizing Token Usage:**\nModify `recordMaxTokens` and `searchMaxTokens` to balance context richness\nwith API costs and response time.\n\n**Refining Instructions:**\nUpdate `matchInstruction` and `mismatchInstruction` to improve how AI\nmodels use matching records and how they respond when none are found.\n\n**Changing Visibility:**\nAdjust access control as your dataset's sensitivity or sharing requirements\nchange over time.\n\n**Best Practices:**\n\n- Make incremental changes and test the impact before further adjustments\n- Update instructions to be specific about how information should be used\n- Monitor search performance after configuration changes\n- Keep descriptions current as dataset content evolves\n- Use metadata updates to maintain organizational clarity\n\n## Listing Datasets\n\nRetrieving a comprehensive list of all datasets in your account is essential\nfor managing your knowledge bases, monitoring data organization, and accessing\ndataset configurations programmatically. The list endpoint provides powerful\nfiltering and pagination capabilities to help you efficiently navigate large\ncollections of datasets.\n\nThe listing operation returns detailed information about each dataset,\nincluding its configuration, storage settings, search parameters, and\nmetadata. 
This is particularly useful for building administrative interfaces,\nimplementing dataset selection features in applications, or automating dataset\nmanagement workflows.\n\nTo retrieve a list of your datasets, send a GET request:\n\n```http\nGET /api/v1/dataset/list\n```\n\nThe response includes all datasets associated with your account, returned as\nan array of dataset objects with their complete configuration and metadata.\n\n### Pagination and Ordering\n\nFor accounts with many datasets, pagination helps manage the response size\nand improve performance:\n\n```http\nGET /api/v1/dataset/list?take=20&order=desc\n```\n\nAvailable pagination parameters:\n\n- **cursor**: Pagination token from previous response to fetch the next page\n- **take**: Number of datasets to retrieve per request\n- **order**: Sort order by creation date (\"asc\" or \"desc\", defaults to \"desc\")\n\n### Filtering by Blueprint\n\nTo retrieve only datasets associated with a specific blueprint or project:\n\n```http\nGET /api/v1/dataset/list?blueprintId=bp_abc123\n```\n\nThis is useful when working with organized project structures where datasets\nare grouped by purpose or workflow.\n\n### Filtering by Metadata\n\nDatasets with custom metadata can be filtered using meta queries, enabling\nsophisticated organizational schemes:\n\n```http\nGET /api/v1/dataset/list?meta[environment]=production&meta[category]=support\n```\n\n### Response Structure\n\nEach dataset in the response includes:\n\n- **Core identifiers**: id, name, description\n- **Storage configuration**: store type, reranker settings\n- **Search parameters**: recordMaxTokens, searchMinScore, searchMaxRecords,\n  searchMaxTokens\n- **Instructions**: matchInstruction, mismatchInstruction\n- **Resource relationships**: blueprintId\n- **Access control**: visibility setting\n- **Metadata**: Custom meta fields\n- **Timestamps**: createdAt, updatedAt\n\n**Best Practices:**\n\n- Use pagination for large dataset collections to improve API 
performance\n- Apply filters when searching for specific datasets to reduce response size\n- Leverage metadata filtering for custom organizational structures\n- Store pagination cursors for efficient navigation through results","link":"https://docs.cbk.ai/datasets","createdAt":1779021375179,"updatedAt":1779021375179}