
Real-Time LLM-as-a-Judge with the Extract Integration

Turn ChatBotKit's Data Extraction integration into a real-time LLM-as-a-judge that reads an agent's tool activity and extracts operational metrics: records written, error rate, retries, and where failures originate. The example is built around an agent that manages the crmkit CRM, all from a single blueprint.

You can measure how well your AI agent performs without writing a single line of evaluation code. The trick is to point a second model at each finished conversation and have it grade the exchange against a rubric. This pattern is called LLM-as-a-judge, and ChatBotKit's Data Extraction integration gives you everything you need to run it in real time.

This becomes especially powerful for tool-using agents. A ChatBotKit conversation records every tool call the agent makes along with its result, so the transcript already contains a full operational trace. An LLM judge can read that trace and turn it into hard numbers: how many records the agent wrote, how often calls failed, how many times it retried, and where the failures came from.

In this tutorial you will build an agent that manages crmkit - an agent-first CRM - over MCP, then wire an Extract integration to it that acts as the judge. After every conversation, the judge reads the tool activity and extracts operational metrics you can chart and monitor.

What You'll Build

A three-part system, expressed as a single blueprint:

  • CRM Sync Agent - a bot that loads crmkit's MCP tools and uses them to create and update company records.
  • Operations Monitor - an Extract integration connected to the agent. Its schema is an operational rubric, and the numbers it pulls from each transcript become collected metrics.
  • Operations Metrics chart - an Extract Chart tool linked to the judge, so the metrics are visible right on the blueprint canvas.

In the blueprint you will create, the pieces connect in a chain: the judge attaches to the agent, and the chart attaches to the judge.

Prerequisites

Step 1: Build the CRM Agent

The judge needs an agent that does real work, and a tool-using agent gives it the richest trace to grade. In the blueprint, the agent consists of four connected resources:

  1. Bot (CRM Sync Agent) - the worker. Its backstory tells it to load the crmkit tools, query before writing, retry once on a version conflict, and report failures honestly. That honesty matters: the cleaner the agent is about reporting tool results, the more accurately the judge can count them.
  2. Skillset (CRM Toolkit) - the container for the agent's abilities, linked to the bot via skillsetId.
  3. Ability (Load crmkit Tools) - an mcp/load[crmkit] ability that dynamically pulls in crmkit's MCP toolset (create/read/update companies, contacts, deals) at runtime.
  4. Secret (crmkit) - an OAuth secret pointing at https://api.crmkit.ai/mcp, linked to the ability via secretId, that authorizes the MCP connection.

Each crmkit operation the agent runs - and each failure it hits - lands in the conversation transcript as a tool request and response. That trace is the raw material the judge reads.

Step 2: Define the Operational Rubric

The Extract integration's schema is your rubric. Each property is a thing the judge measures, and the description is the instruction it follows. Two properties turn a field into a tracked metric:

  • collect: true - records the value as a metric (numeric fields only).
  • display - formats the value on the chart: number, percent, or currency/<code>.

The Operations Monitor reads the tool activity and extracts six numbers plus one summary:

Field             Type    Display  What it captures
recordsCreated    number  number   New entries written to the database
recordsUpdated    number  number   Existing records changed
errorRate         0-1     percent  Share of tool calls that failed
retryCount        number  number   How hard the agent had to work to succeed
crmkitErrorCount  number  number   Failures that originated in the CRM
otherErrorCount   number  number   Failures from network, fetch, or the agent
errorSummary      text    -        What failed and where, for spot checks

This is the part that makes the example interesting. The judge is not scoring a vibe - it is parsing a semi-structured trace into operational telemetry. Splitting failures into crmkitErrorCount and otherErrorCount answers the question every on-call engineer asks first: is this our problem or theirs? A spike in crmkit errors means open a ticket with the CRM; a spike in other errors means look at your own agent.

Tip: The percent display treats values as fractions, so 0.25 renders as 25%. In the field description, tell the judge to score errorRate between 0 and 1, matching the table above, and the chart reads in clean percentages.

The errorSummary field is not collected, so it never shows up on a chart. It is stored alongside the numbers in the conversation metadata, which makes auditing easy: when crmkitErrorCount jumps, you read the summaries to see exactly which calls broke and why.
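
Putting Step 2 together, the rubric might look roughly like the sketch below. It is written as a Python dict for illustration only. The JSON-Schema-like layout (type, properties, description) is an assumption and the description wording is invented. Only the field names and the collect and display properties come from this tutorial. The two helper functions are hypothetical and exist just to make the sketch checkable.

```python
# Hypothetical rubric for the Operations Monitor, expressed as a Python dict.
# Assumption: a JSON-Schema-like layout. Only the field names, `collect`,
# and `display` are taken from the tutorial; descriptions are illustrative.
RUBRIC = {
    "type": "object",
    "properties": {
        "recordsCreated": {
            "type": "number",
            "description": "Count of tool calls that successfully created a new CRM record.",
            "collect": True,
            "display": "number",
        },
        "recordsUpdated": {
            "type": "number",
            "description": "Count of tool calls that successfully changed an existing record.",
            "collect": True,
            "display": "number",
        },
        "errorRate": {
            "type": "number",
            "description": "Failed tool calls divided by all tool calls, as a fraction between 0 and 1.",
            "collect": True,
            "display": "percent",
        },
        "retryCount": {
            "type": "number",
            "description": "Number of tool calls that repeated an earlier failed call.",
            "collect": True,
            "display": "number",
        },
        "crmkitErrorCount": {
            "type": "number",
            "description": "Errors that originated inside crmkit, such as 4xx/5xx responses or version conflicts.",
            "collect": True,
            "display": "number",
        },
        "otherErrorCount": {
            "type": "number",
            "description": "Errors from network, fetch, or the agent itself rather than crmkit.",
            "collect": True,
            "display": "number",
        },
        # No `collect` here: the summary is stored but never charted.
        "errorSummary": {
            "type": "string",
            "description": "One or two sentences on which calls failed and where the failure came from.",
        },
    },
}


def collected_fields(schema):
    """Return the names of fields marked collect: true, in schema order."""
    return [
        name
        for name, spec in schema["properties"].items()
        if spec.get("collect") is True
    ]


def format_percent(fraction):
    """Render a 0-1 fraction as a whole-number percentage string."""
    return f"{fraction:.0%}"
```

Calling collected_fields(RUBRIC) returns the six numeric fields and leaves out errorSummary, and format_percent(0.25) gives "25%", which mirrors the fraction-to-percent behavior described in the tip.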

Step 3: Connect the Judge to the Agent

An Extract integration grades whatever bot it is attached to. In the blueprint this is the botId field on the integration pointing at #bot:::crm-agent. On the canvas, you draw a line from the judge to the agent. That single connection is what makes the judge real-time: it now sees every conversation that bot has.

Step 4: Set the Trigger to Automatic

The trigger: automatic setting is what makes this run on its own. With automatic triggering, the judge fires after each conversation completes and logs the metrics without any manual step. This is the "real-time" part - your dashboard reflects the agent's behavior as soon as conversations finish, so a climbing errorRate or a burst of retryCount shows up while you can still act on it.

Already have a backlog of conversations? Use the Trigger button on the integration page to apply the rubric to the most recent 100 conversations and backfill the chart, which gives you a baseline before live scoring takes over.

Choose a capable judge model. The blueprint runs the agent on claude-4.6-sonnet for fast, cost-effective CRM work, and runs the judge on claude-4.8-opus because reading a tool trace and attributing errors rewards stronger reasoning. The judge runs only once per conversation, so the added cost of the stronger model stays small relative to the agent's live traffic.

Step 5: Chart the Metrics

Drop an Extract Chart tool onto the canvas and connect it to the Operations Monitor integration - that is the extractIntegrationId link in the blueprint's tools section. The chart reads the judge's collected fields and draws a daily series for each one, formatted with that field's display setting. Records created and updated plot as counts, error rate plots as a percentage, and the two error-location counts plot side by side so you can see at a glance whether crmkit or your own stack is the bigger source of trouble.

Keeping the chart on the blueprint means the operational signal lives next to the design it measures. Anyone who opens the blueprint sees both the agent and how reliably it is running, with no separate dashboard to hunt for.
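
The links named in Steps 1 through 5 can be pictured as a small reference graph. The sketch below is not blueprint syntax. It is a plain Python dict with made-up resource keys, using only the link fields this tutorial names (skillsetId, secretId, botId, extractIntegrationId). How an ability is attached to its skillset is not spelled out here, so that link is left out. The dangling_links helper is hypothetical and simply checks that every link points at a resource that exists.

```python
# Hypothetical in-memory picture of the blueprint's resources.
# Keys such as "bot:crm-agent" are illustrative, not real blueprint syntax.
RESOURCES = {
    "bot:crm-agent": {
        "name": "CRM Sync Agent",
        "skillsetId": "skillset:crm-toolkit",
    },
    "skillset:crm-toolkit": {"name": "CRM Toolkit"},
    "ability:load-crmkit": {
        "name": "Load crmkit Tools",
        "secretId": "secret:crmkit",
    },
    "secret:crmkit": {"name": "crmkit"},
    "integration:ops-monitor": {
        "name": "Operations Monitor",
        "botId": "bot:crm-agent",       # the judge attaches to the agent
        "trigger": "automatic",
    },
    "tool:ops-chart": {
        "name": "Operations Metrics",
        "extractIntegrationId": "integration:ops-monitor",  # chart attaches to the judge
    },
}

# Link fields named in the tutorial text.
LINK_FIELDS = ("skillsetId", "secretId", "botId", "extractIntegrationId")


def dangling_links(resources):
    """Return (resource, field, target) triples whose target is missing."""
    missing = []
    for key, spec in resources.items():
        for field in LINK_FIELDS:
            target = spec.get(field)
            if target is not None and target not in resources:
                missing.append((key, field, target))
    return missing
```

For the graph above, dangling_links(RESOURCES) returns an empty list. If the judge's botId pointed at a bot that does not exist, the helper would report that one broken link, which is the situation where the judge would have no conversations to read.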

How It Works

The whole system is a feedback loop:

  1. A user asks the CRM Sync Agent to add or update companies, and it runs crmkit tool calls to do so.
  2. Every tool request and response - successes, errors, and retries alike - is recorded in the conversation transcript.
  3. The conversation completes and goes idle.
  4. The Operations Monitor, attached via botId, reads the full transcript including that tool trace.
  5. Guided by the schema descriptions, the judge model counts records, computes the error rate, tallies retries, attributes each failure to crmkit or elsewhere, and writes a summary.
  6. Fields marked collect: true are logged as metrics; the summary lands in conversation metadata.
  7. The Extract Chart renders the accumulating metrics as daily series.

The agent and the judge never share a model call - the judge is a clean second pass over a finished conversation, which keeps its accounting independent of the agent it is grading.
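
To make step 5 concrete, here is a deterministic sketch of the arithmetic the judge is asked to perform. The judge itself does this by reading the transcript with a model, not by running code. The ToolCall record and its fields are hypothetical stand-ins for whatever the transcript actually contains, and the definitions used here (for example, errorRate as failed calls divided by all calls) are one reasonable reading of the rubric. A function like this can be handy for spot-checking the judge against a hand-labelled conversation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    """Hypothetical, hand-labelled view of one tool call in a transcript."""
    action: str                         # "create", "update", or "read"
    ok: bool                            # True if the call succeeded
    is_retry: bool = False              # True if it repeats an earlier failed call
    error_origin: Optional[str] = None  # "crmkit" or "other" when ok is False


def summarize(calls):
    """Compute the six rubric metrics from a list of labelled tool calls."""
    total = len(calls)
    failed = [c for c in calls if not c.ok]
    return {
        "recordsCreated": sum(1 for c in calls if c.ok and c.action == "create"),
        "recordsUpdated": sum(1 for c in calls if c.ok and c.action == "update"),
        # Fraction between 0 and 1; an empty trace counts as no errors.
        "errorRate": len(failed) / total if total else 0.0,
        "retryCount": sum(1 for c in calls if c.is_retry),
        "crmkitErrorCount": sum(1 for c in failed if c.error_origin == "crmkit"),
        "otherErrorCount": sum(1 for c in failed if c.error_origin != "crmkit"),
    }


# Example trace: one create, a version conflict followed by a successful
# retry, and one create that failed on the network.
trace = [
    ToolCall(action="create", ok=True),
    ToolCall(action="update", ok=False, error_origin="crmkit"),
    ToolCall(action="update", ok=True, is_retry=True),
    ToolCall(action="create", ok=False, error_origin="other"),
]
metrics = summarize(trace)
```

For this trace the result is one record created, one updated, an errorRate of 0.5, one retry, one crmkit error, and one other error. Comparing numbers like these with what the judge extracted for the same conversation is a quick way to see whether the rubric descriptions need tightening.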

Why This Matters

Counting tool calls by hand does not scale, and traditional analytics only tell you how many conversations happened, not what the agent actually accomplished or where it struggled. An LLM-as-a-judge closes that gap. Because every conversation is read the same way against the same rubric, the numbers are comparable over time, and that comparability is what lets you monitor an agent in production:

  • Throughput - watch recordsCreated and recordsUpdated to confirm the agent is getting real work done as traffic grows.
  • Reliability - track errorRate and retryCount to catch the agent quietly degrading before users complain.
  • Failure attribution - the crmkitErrorCount versus otherErrorCount split tells you whether to escalate to the CRM provider or fix your own agent, which is usually the slowest question to answer during an incident.
  • Regression detection - after you change a backstory, model, or tool, compare the metric lines before and after to see whether the change actually helped.
  • Automated alerting - set a request URL on the integration to push each scored result to your own endpoint, then fire a Slack alert or open a ticket when error rate crosses a threshold.
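
As a rough illustration of the alerting idea, the sketch below shows the decision an endpoint might make once it receives a scored result. The payload shape (a flat dict keyed by the rubric's field names), the function name, and the thresholds are all assumptions, not part of ChatBotKit's API. Posting the message to Slack or a ticketing system is left out.

```python
def alerts_for(result, max_error_rate=0.2, max_retries=3):
    """Return alert messages for one scored conversation.

    `result` is assumed to be a flat dict of extracted fields, for example
    {"errorRate": 0.5, "retryCount": 1, ...}. Missing fields count as zero.
    """
    messages = []

    error_rate = float(result.get("errorRate", 0.0))
    if error_rate > max_error_rate:
        crmkit = result.get("crmkitErrorCount", 0)
        other = result.get("otherErrorCount", 0)
        # Point the on-call engineer at the likelier source of trouble.
        if crmkit > other:
            source = "crmkit"
        elif other > crmkit:
            source = "agent or network"
        else:
            source = "mixed sources"
        messages.append(
            f"errorRate {error_rate:.0%} exceeds {max_error_rate:.0%}; "
            f"failures mostly from {source}"
        )

    retries = int(result.get("retryCount", 0))
    if retries > max_retries:
        messages.append(f"retryCount {retries} exceeds {max_retries}")

    return messages
```

A result such as {"errorRate": 0.5, "crmkitErrorCount": 2, "otherErrorCount": 1} yields a single message that names crmkit as the main source, while a clean conversation yields an empty list and no alert.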

Tips for Reliable Scoring

  • Make the agent report tool results honestly. The judge can only count what the transcript records. A backstory that insists on reporting failures plainly, like the one above, directly improves metric accuracy.
  • Write the rubric like instructions to an auditor. State exactly what to count and how to classify it. "Errors that originated inside crmkit, such as 4xx/5xx responses or version conflicts" beats a bare "crmkit errors."
  • Keep scales consistent. Counts as plain numbers and proportions as 0-1 fractions make the chart easy to read at a glance.
  • Anchor with a summary field. Forcing the judge to justify its counts improves the counts themselves and gives you something to audit.
  • Start narrow. A few metrics you trust beat a dozen you have to second-guess. Add more once the first set is stable.

Wrapping Up

With one Extract integration acting as a judge, every run of your CRM agent is graded after the conversation ends, and a raw tool trace becomes operational telemetry. Build the agent, write the rubric as a schema, connect the judge with botId, set the trigger to automatic, and chart the result. From there the accumulated metrics become a live read on throughput and reliability you can watch, alert on, and improve against - and the same pattern works for any tool-using agent, not just one that talks to crmkit.

For deeper measurement patterns, see How to Measure ROI with the Data Extraction Integration and the full Data Extraction documentation.