Sitemap Integration
Sitemap Integration enables you to automatically crawl and index website content into datasets, making web pages searchable and accessible to your AI agents. By providing a sitemap.xml URL or website URL, the integration discovers and syncs all linked pages, extracting text content and metadata for knowledge retrieval.
This is particularly useful for building chatbots that answer questions based on documentation sites, knowledge bases, blogs, or any web content you want to make available to your AI agents.
Creating a Sitemap Integration
To create a sitemap integration, you need to provide the website or sitemap URL and specify which dataset should receive the crawled content. The integration will automatically discover pages through sitemap.xml files or by crawling links, extracting content and storing it as searchable records.
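For illustration, here is a minimal sketch of the create call using fetch. The endpoint path, field names, and bearer-token auth scheme are assumptions, not the documented contract; consult the API reference for the exact request shape.

```ts
// A minimal sketch of creating a sitemap integration over HTTP.
// The endpoint path, field names, and auth header are assumptions
// for illustration; verify against the API reference.
const API_BASE = "https://api.example.com/v1"; // hypothetical base URL

async function createSitemapIntegration(): Promise<{ id: string }> {
  const res = await fetch(`${API_BASE}/integration/sitemap/create`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url: "https://docs.example.com/sitemap.xml", // sitemap or website URL
      datasetId: "my-dataset-id", // dataset that receives the crawled pages
    }),
  });

  if (!res.ok) throw new Error(`Create failed: ${res.status}`);
  return res.json();
}

createSitemapIntegration()
  .then(({ id }) => console.log("integration id:", id))
  .catch(console.error);
```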
Advanced Configuration Options
URL Filtering with Glob Patterns: The glob parameter allows you to filter which pages to crawl using glob patterns. For example, "**/blog/**" only crawls blog posts, while "**/docs/**" focuses on documentation pages.
Content Extraction with CSS Selectors: The selectors parameter specifies which HTML elements to extract content from using CSS selectors. This helps focus on main content and exclude navigation, footers, and other UI elements. Multiple selectors can be comma-separated.
JavaScript Rendering: Set javascript: true to enable JavaScript execution during crawling, which is necessary for single-page applications or dynamic content. This increases crawl time but ensures complete content extraction.
Sync Scheduling: Use cron expressions to control crawl frequency. Daily syncs ("0 0 * * *") work well for most documentation sites, while more frequent syncs may be needed for rapidly changing content.
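For readers new to cron, the five fields read left to right as minute, hour, day-of-month, month, and day-of-week. A few common schedules as an illustration:

```ts
// Cron fields, left to right: minute, hour, day-of-month, month, day-of-week.
const daily = "0 0 * * *";  // every day at 00:00 (the example above)
const hourly = "0 * * * *"; // every hour, on the hour
const weekly = "0 0 * * 0"; // every Sunday at 00:00
```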
Record Expiration: The expiresIn parameter (in milliseconds) determines how long crawled pages are retained. Set it to 90 days (7776000000 ms) for typical documentation, or to null for permanent retention.
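Putting the options above together, a configuration payload might look like the following sketch. The glob, selectors, javascript, and expiresIn names mirror the parameters described in this section; the syncSchedule field name for the cron expression is an assumption.

```ts
// Illustrative create payload combining the advanced options described above.
// Treat the exact field names as assumptions to verify against the API reference.
const sitemapConfig = {
  url: "https://docs.example.com",
  datasetId: "my-dataset-id",
  glob: "**/docs/**",         // crawl documentation pages only
  selectors: "main, article", // extract main content; skip nav and footers
  javascript: true,           // render SPAs and dynamic content (slower crawl)
  syncSchedule: "0 0 * * *",  // hypothetical field name: daily sync at midnight
  expiresIn: 7776000000,      // retain crawled records for 90 days (in ms)
};
```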
Warning: Large websites may take significant time to crawl initially. The integration processes pages incrementally, and subsequent syncs only update changed content for efficiency.
Listing Sitemap Integrations
Retrieve all sitemap integrations configured in your account to manage web crawlers, monitor crawl configurations, and review which websites are being synced into your datasets. This endpoint provides complete visibility into all active website crawling operations.
Each integration entry includes the full crawl configuration: URL patterns, content selectors, JavaScript rendering settings, and sync schedules, allowing you to audit and manage your web content synchronization.
Query Parameters:
cursor: Pagination cursor for retrieving additional results
order: Sort order ("asc" or "desc", default: "desc")
take: Number of integrations to retrieve (default: 25)
meta: Filter by metadata key-value pairs
blueprintId: Filter integrations by blueprint association
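As a sketch, these parameters can be passed as a query string. The endpoint path below is an assumption; the query parameters match those documented above.

```ts
// Sketch of listing sitemap integrations, paginating with the cursor parameter.
// The endpoint path is an assumption; the query parameters match those above.
async function listSitemapIntegrations(cursor?: string) {
  const params = new URLSearchParams({ take: "25", order: "desc" });
  if (cursor) params.set("cursor", cursor);

  const res = await fetch(
    `https://api.example.com/v1/integration/sitemap/list?${params}`,
    { headers: { Authorization: `Bearer ${process.env.API_TOKEN}` } }
  );

  if (!res.ok) throw new Error(`List failed: ${res.status}`);
  return res.json(); // expected to contain the integration entries
}
```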
The response includes complete crawl configurations, enabling you to verify which websites are being monitored, understand their extraction rules, and identify which datasets receive the crawled content for each integration.
Example Response:
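The sketch below reconstructs a plausible response shape from the configuration fields documented above; every property name should be treated as an assumption and checked against the API reference.

```ts
// Hypothetical response shape, inferred from the fields described above.
const exampleResponse = {
  items: [
    {
      id: "integration-id",
      url: "https://docs.example.com/sitemap.xml",
      datasetId: "my-dataset-id",
      glob: "**/docs/**",
      selectors: "main, article",
      javascript: true,
      syncSchedule: "0 0 * * *", // hypothetical field name
      expiresIn: 7776000000,
      meta: {},
    },
  ],
};
```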
Use this endpoint to maintain an inventory of your web crawling operations and ensure your documentation sites, blogs, and knowledge bases are being properly synchronized for AI agent access.