Learn how to import a website's information into your dataset with ChatBotKit's Sitemap feature. Automatically summarise long pages using AI, and access important information easily from your chatbot.

With ChatBotKit's Sitemap feature, you can easily import a website's information into your dataset by simply providing the website's URL.

Step-by-step Guide

To integrate ChatBotKit's Sitemap feature into your dataset, follow these simple steps:

  1. Navigate to "Integrations" in ChatBotKit and click Website Importer.
  2. Enter a name and optional description for this integration.
  3. Select the dataset you want to import information into.
  4. Enter the website URL. You can also provide a sitemap.xml URL to be more specific.
  5. Save the integration by clicking the "Create" button.

There are several advanced options worth considering. You can find them under "Advanced Options".

  1. Glob Patterns: Glob patterns allow you to target specific pages for integration. For instance, if you aim to synchronize only the documentation located at /docs, you should set the glob pattern to /docs/**. This field supports multiple entries, enabling the inclusion of various patterns or the exclusion of specific ones using negative globs (prefixed with !). Negative globs override positive ones, offering a way to exclude particular URL patterns and refine your selection criteria.
  2. Selectors: Use CSS selectors to narrow the importer's focus to designated areas of your website, ensuring only the relevant sections are captured. Additionally, you can enter special selectors such as jsonld (to extract structured data) and skiphtml (to skip HTML) to further fine-tune your import process.
  3. JavaScript Rendering: Activating this option enables the importer to operate with the capabilities of a full browser, essential for capturing content from websites rich in dynamic elements and scripts. This ensures comprehensive content capture, including AJAX-loaded content and other dynamic interactions.
  4. Sync Schedule: This setting controls how often your data is synchronized. You can set the schedule to never, hourly, daily, weekly, or monthly, according to your needs. Choosing "never" pauses automatic syncing, while the time-based options keep your data regularly updated, aligning the integration with your content update cycles and data freshness requirements.
  5. Expires In: This setting allows for the automatic expiration of outdated records, which is particularly beneficial for frequently updated websites. By specifying an expiration period, you can ensure that only the most current records are retained, enhancing data accuracy and relevance.
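To build intuition for how positive and negative globs interact, here is a minimal sketch of the matching logic described above, written in Python with the standard-library fnmatch module. This is an illustrative approximation: ChatBotKit's actual glob matcher may differ in its handling of `**` and other edge cases.

```python
from fnmatch import fnmatch

def url_allowed(path, patterns):
    """Return True if path matches at least one positive glob
    and no negative glob (patterns prefixed with '!')."""
    positives = [p for p in patterns if not p.startswith("!")]
    negatives = [p[1:] for p in patterns if p.startswith("!")]
    # Negative globs override positive ones.
    if any(fnmatch(path, n) for n in negatives):
        return False
    return any(fnmatch(path, p) for p in positives)

patterns = ["/docs/**", "!/docs/internal/**"]
print(url_allowed("/docs/getting-started", patterns))  # True
print(url_allowed("/docs/internal/notes", patterns))   # False
print(url_allowed("/blog/post", patterns))             # False
```

With this pattern set, only pages under /docs are synchronized, except anything under /docs/internal, which the negative glob excludes.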

Once the Sitemap integration is created, ChatBotKit will automatically import the information from the website into your selected dataset.

Imported Dataset Records

To access the information imported from the website, navigate to the dataset you selected during Step 3. This dataset is the repository where all the imported content is stored, organized and structured so it is easy to browse. The imported records, whether text or any other kind of data extracted from the website, can be used to train your chatbot, enabling it to respond to user queries more effectively and accurately.

Structured Data Importing

When you opt to use the jsonld selector, the sitemap importer will also bring in structured data. This is particularly beneficial when importing e-commerce websites, where product information is key, because all the relevant structured data is captured in one pass. The importer is also versatile and customizable: combining jsonld with the skiphtml selector lets you bypass the regular HTML content and concentrate solely on the structured data, further enhancing the specificity and efficiency of your import.
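Structured data of this kind typically lives in script tags of type application/ld+json embedded in the page. The sketch below, using only Python's standard library, shows what extracting that data involves; the page markup and product fields are hypothetical, and ChatBotKit's own jsonld extraction may work differently.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the JSON bodies of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # Parse the accumulated script body as JSON-LD.
            self.items.append(json.loads("".join(self._buffer)))
            self._buffer = []
            self._in_jsonld = False

# A hypothetical e-commerce page with schema.org Product markup.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Widget",
 "offers": {"price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body><p>Regular page content</p></body></html>
"""

parser = JsonLdExtractor()
parser.feed(html)
print(parser.items[0]["name"])  # Example Widget
```

The jsonld selector gives you records like the one above (product name, price, currency) rather than the surrounding page prose, which is why it pairs well with skiphtml for catalog-style sites.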

Events Tracking

Within each integration configuration page, you'll find the "Sitemap Integration Events" section. This area provides a comprehensive overview of all events related to your integration, including detailed information on the URLs covered by the crawler. This feature is invaluable for monitoring and analyzing the scope and success of your integration efforts, offering insights into the extent of your website's content that has been successfully captured and integrated.


Please note that there are some limitations to the Sitemap feature. Currently, a crawl is limited to a maximum of 15 minutes and the maximum number of URLs that can be crawled is 1000. If you need to crawl more than 1000 URLs or require a longer crawl time, please contact our customer support team for advice on how to create a custom solution.