Continuous Agent Evaluation Harness
A meta-agent system that never stops stress-testing your AI agents - generating adversarial inputs, scoring outputs with LLM-as-judge, tracking quality regressions over time, and producing structured reliability reports with Slack alerts when scores drop.
The Continuous Agent Evaluation Harness blueprint addresses the #1 challenge in deploying AI agents to production: knowing whether they are still working correctly.
Unlike traditional software, AI agents degrade silently. A model update, a prompt change, or a subtle shift in user phrasing can cause responses to deteriorate without triggering any error. Teams discover regressions only after users complain - or worse, after the damage is done. Existing observability tools (LangSmith, Arize, Galileo) tell you what happened. This blueprint is an evaluation harness - it proactively tests agents on a schedule, before problems reach users.
The architecture introduces a self-sustaining evaluation loop powered by three agents with distinct temporal roles:
Test Architect (weekly)
The Test Architect generates and curates the test suite. It reads the
target agent's backstory and documentation, then generates diverse test
cases covering happy paths, edge cases, adversarial prompts, and domain-
specific scenarios. Each batch is stored as a structured YAML file under
.tests/ in the Evaluation Workspace space. Over time the test suite
grows organically as the Architect adds scenarios for new features,
emerging attack patterns, and user-reported edge cases.
Evaluation Runner (daily)
The Runner reads all test case files from the shared space, invokes
each test against the configured target agent using bot/call, and
evaluates every response using an LLM-as-judge rubric across four
dimensions: correctness, helpfulness, safety, and format compliance.
Scored results are appended to timestamped JSONL files under .results/
and a rolling baseline is maintained at .results/baseline.json.
Regression Analyst (triggered after each Runner cycle)
The Analyst reads the last 30 days of scored results, calculates
rolling averages by category, and detects statistically significant
drops against the baseline. When regression scores exceed the alert
threshold it generates a structured Markdown report under .reports/
identifying specific failing test cases with root cause hypotheses,
and sends a Slack notification to the team.
Why This Architecture Works
- Persistent test suite files grow over time - the Architect adds new scenarios weekly while the full history is preserved for replay.
- LLM-as-judge pattern is encapsulated in the Runner's backstory with a reusable rubric, making it easy to customise scoring criteria.
- Score time-series stored as JSONL files enable trend analysis across model updates, prompt changes, and configuration drift.
- Cross-agent evaluation - the harness tests any agent on the
platform via
bot/call, making it a general-purpose capability. - Triggered escalation - if regression scores drop past the configured threshold, a Slack alert is sent automatically.
- Compliance audit trail - produces a dated, persistent record of agent evaluations suitable for EU AI Act Article 9 risk management requirements.
Market Context
The Databricks State of AI Agents 2026 report shows that only 22.8% of teams run online evaluations - the rest are flying blind. LangSmith crossed 100k active users in 2025 and Braintrust raised $20M Series A specifically for AI evaluation, demonstrating strong market demand. This blueprint fills a gap in the catalogue by showing how to deploy continuous evaluation using the platform's native multi-agent and scheduling primitives rather than a separate SaaS tool.
Use Cases
- Production regression guard - deploy after every model update to verify core agent behaviors are preserved before rollout.
- Continuous quality benchmarking - track quality trends across months to demonstrate improvement or catch slow degradation.
- Adversarial red-teaming schedule - the Test Architect continuously adds prompt injection and jailbreak scenarios as new attack patterns emerge.
- Multi-agent cross-comparison - configure two versions of an agent and run the same test suite against both to compare quality before promoting a new version.
Getting Started
- Fork this blueprint and configure the target agent: set the
bot/callability'sbotIdto point at the agent you want to evaluate. - Configure the Slack secret for regression alerts.
- Let the Test Architect run to generate the initial test suite.
- The Evaluation Runner will begin daily testing automatically.
- Monitor the Evaluation Workspace file browser for test results, reports, and score trends.
Backstory
Common information about the bot's experience, skills and personality. For more information, see the Backstory documentation.
Skillset
This example uses a dedicated Skillset. Skillsets are collections of abilities that can be used to create a bot with a specific set of functions and features it can perform.
List Files
List files in the Evaluation Workspace to discover existing test suites and resultsRead/Write Files
Read existing test files and write new test batch YAML files to the Evaluation WorkspaceList Files
List test case files and result files in the Evaluation WorkspaceRead/Write Files
Read test case files and write scored JSONL results and baseline to the Evaluation WorkspaceCall Target Agent
Invoke the target agent with a test case input and receive its response for scoringList Files
List result files and report files in the Evaluation WorkspaceRead/Write Files
Read scored result files and write regression reports to the Evaluation WorkspaceSend Slack Alert
Send a regression alert message to the configured Slack channel when quality scores drop
Secrets
This example uses Secrets to store sensitive information such as API keys, passwords, and other credentials.
Slack
Slack OAuth token for sending regression alerts
Terraform Code
This blueprint can be deployed using Terraform, enabling infrastructure-as-code management of your ChatBotKit resources. Use the code below to recreate this example in your own environment.
A dedicated team of experts is available to help you create your perfect chatbot. Reach out via or chat for more information.