Autonomous SRE Agent

A sophisticated Site Reliability Engineering agent with its own persistent shell environment for autonomous troubleshooting, incident investigation, and infrastructure monitoring. Integrates with Sentry for error tracking, PagerDuty for on-call management, and Slack for team communication.


sre

devops

infrastructure


The Autonomous SRE Agent blueprint applies AI to infrastructure operations. It creates an agent with full shell access that can investigate issues autonomously, execute scripts, and monitor systems proactively: essentially a junior SRE that never sleeps.

At its core is a persistent shell workspace that gives the agent true operational capabilities. Unlike simple chatbots that can only provide advice, this agent can actually execute commands, write diagnostic scripts, analyze logs, and store investigation artifacts. The shell environment persists across conversations, allowing the agent to build up a library of runbooks, diagnostic tools, and historical data that improves its effectiveness over time.

The architecture uses a multi-skillset design that organizes capabilities into logical domains:

Shell Operations provides full terminal access with command execution, file read/write operations, and the ability to import external resources. The agent can write Python or Node.js scripts, execute them, and analyze their output, all without human intervention. This enables automation such as parsing JSON logs, generating reports, or implementing custom health checks.
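As a rough illustration of the kind of script the agent might write for JSON log parsing, here is a minimal sketch. It assumes logs arrive as one JSON object per line with `level` and `service` fields; those field names and the function name `count_errors_by_service` are hypothetical and not part of the blueprint.

```python
import json
from collections import Counter


def count_errors_by_service(lines):
    """Count ERROR-level entries per service in JSON-lines log output.

    Returns (counts, skipped), where skipped is the number of lines
    that were not JSON objects.
    """
    counts = Counter()
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # tolerate non-JSON lines such as stack-trace fragments
            continue
        if not isinstance(entry, dict):
            skipped += 1  # valid JSON, but not a log object
            continue
        # "level" and "service" are assumed field names for this sketch
        if str(entry.get("level", "")).upper() == "ERROR":
            counts[entry.get("service", "unknown")] += 1
    return counts, skipped


sample = [
    '{"level": "error", "service": "checkout", "msg": "timeout"}',
    '{"level": "info", "service": "checkout", "msg": "ok"}',
    '{"level": "ERROR", "service": "auth", "msg": "bad token"}',
    "not json at all",
    '{"level": "error", "service": "checkout", "msg": "timeout"}',
]
counts, skipped = count_errors_by_service(sample)
print(counts.most_common(), "skipped:", skipped)
```

Counting malformed lines instead of failing on them keeps the script usable on real log files, which often mix structured entries with plain-text output.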

Monitoring Integration connects to Sentry for error tracking and PagerDuty for incident management. The agent can list recent errors, investigate stack traces, check who's on-call, and even create or acknowledge incidents. When a production issue occurs, it can gather context from multiple sources before escalating to humans.
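Gathering context from multiple sources often comes down to lining up events in time. The sketch below assumes the error and incident data have already been fetched and reduced to plain dictionaries; the field names `last_seen` and `opened_at`, the sample records, and the function `correlate` are simplified assumptions for illustration, not the actual Sentry or PagerDuty payload formats.

```python
from datetime import datetime, timedelta


def correlate(errors, incidents, window_minutes=15):
    """Attach to each incident the errors last seen within the time window."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for incident in incidents:
        opened = datetime.fromisoformat(incident["opened_at"])
        related = [
            error["title"]
            for error in errors
            if abs(datetime.fromisoformat(error["last_seen"]) - opened) <= window
        ]
        matches.append({"incident": incident["title"], "related_errors": related})
    return matches


# Hypothetical, already-simplified records
errors = [
    {"title": "TimeoutError in checkout", "last_seen": "2024-05-01T10:05:00"},
    {"title": "KeyError in profile", "last_seen": "2024-05-01T07:00:00"},
]
incidents = [
    {"title": "Checkout latency high", "opened_at": "2024-05-01T10:10:00"},
]

for match in correlate(errors, incidents):
    print(match["incident"], "->", match["related_errors"])
```

A fixed time window is a crude heuristic; it narrows the candidate list for a human or for further analysis rather than proving causation.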

Dynamic Skillset Installation is a meta-capability that lets the agent expand its own abilities at runtime. Using the conversation/skillset/install ability, the agent can examine available skillsets and install additional capabilities as needed. This makes the agent extensible: add new skillsets to the blueprint, and the agent can discover and use them without reconfiguration.

Research & Documentation enables the agent to search the web and fetch documentation, making it effective at finding solutions to novel problems. When encountering an unfamiliar error, it can research the issue and correlate findings with the actual system state.

The scheduled trigger integration enables autonomous operation. The agent runs periodic health checks, analyzes trends, and proactively identifies issues before they become incidents. Each run produces a timestamped report stored in the workspace, creating an audit trail of system health over time.
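One way a scheduled run could leave a timestamped report in the workspace is sketched below. The function name `write_report` and the filename pattern are assumptions for illustration; the example writes to a temporary directory, whereas the agent would target a reports directory in its own workspace.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_report(report, reports_dir):
    """Write a report as JSON to a file named after its type and the UTC time."""
    reports_dir = Path(reports_dir)
    reports_dir.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc)
    # Add a timestamp unless the report already carries one
    stamped = {"timestamp": now.isoformat(), **report}
    name = f"{report.get('type', 'report')}_{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    path = reports_dir / name
    path.write_text(json.dumps(stamped, indent=2))
    return path


with tempfile.TemporaryDirectory() as workspace:
    path = write_report(
        {"type": "health_check", "status": "healthy", "checks": []},
        Path(workspace) / "reports",
    )
    print(path.name)
    print(path.read_text())
```

Putting the timestamp in the filename means a plain directory listing already reads as a chronological audit trail.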

The Slack integration allows the agent to communicate with the team. After investigating an issue, it can start conversations with on-call engineers, providing context and suggested remediation steps. This creates a seamless handoff from automated investigation to human decision-making.

Practical use cases include:

Incident Investigation: When an alert fires, the agent can immediately begin investigating (checking logs, querying metrics, identifying recent changes) and prepare a summary before the on-call engineer even opens their laptop.
Automated Runbooks: Store runbook scripts in the workspace and have the agent execute diagnostic procedures automatically. Failed deployment? The agent can run the standard rollback verification script and report results.
Trend Analysis: Schedule regular runs to analyze error rates, latency patterns, or resource utilization. The agent can identify anomalies and alert before they become critical.
Knowledge Building: As the agent investigates issues, it builds up a knowledge base in its workspace (previous incidents, successful remediations, custom scripts) that improves its effectiveness over time.
On-Call Support: Integrate with PagerDuty to understand who's on-call and ensure alerts reach the right person with the right context.
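The trend analysis use case can be sketched with a simple rolling z-score over a series of error rates. This is one basic anomaly-detection technique chosen for illustration, not the method the blueprint prescribes; the function name `find_anomalies`, the window size, and the threshold are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=5, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window.

    Each value is compared against the mean and sample standard deviation
    of the baseline_size values immediately before it.
    """
    anomalies = []
    for i in range(baseline_size, len(values)):
        baseline = values[i - baseline_size:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: treat any change at all as an anomaly
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical error rates (percent) from consecutive scheduled runs
error_rates = [1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 6.5, 1.0]
print(find_anomalies(error_rates))
```

Note that a spike stays in the baseline for the next few runs and inflates the standard deviation, so a production version would likely exclude flagged points or use a more robust statistic such as the median.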

To extend this blueprint, add abilities for your specific infrastructure: Kubernetes cluster access, database query capabilities, cloud provider APIs, or custom internal tools. The dynamic skillset installation pattern means you can add capabilities incrementally and the agent will discover them automatically.

This blueprint showcases several advanced ChatBotKit features: persistent shell workspaces for real execution, multi-skillset architecture for organized capabilities, dynamic skillset installation for runtime extensibility, scheduled triggers for autonomous operation, and multi-integration patterns connecting Slack, Sentry, and PagerDuty.

Backstory

Common information about the bot's experience, skills and personality. For more information, see the Backstory documentation.

You are an autonomous Site Reliability Engineering (SRE) Agent with full access to a persistent shell environment. Your mission is to investigate, diagnose, and help resolve infrastructure issues while maintaining comprehensive documentation of your actions.

CORE CAPABILITIES:

1. SHELL ENVIRONMENT
- You have a persistent shell workspace that persists across sessions
- Execute bash commands to investigate issues (shell/exec)
- Read and write files for scripts and reports (shell/read, shell/write)
- Import external resources like log files or configs (shell/import)
- Write and execute Python scripts for complex analysis
- Store artifacts, scripts, and runbooks in your workspace

2. MONITORING INTEGRATION
- Query Sentry for errors, issues, and stack traces
- Check PagerDuty for active incidents and on-call schedules
- Correlate errors across systems to identify root causes

3. DYNAMIC EXPANSION
- List available skillsets using blueprint/resource/list
- Install additional skillsets at runtime for expanded capabilities
- Adapt to new requirements by discovering and using new abilities

4. RESEARCH & DOCUMENTATION
- Search the web for solutions to unfamiliar problems
- Fetch documentation from external sources
- Maintain runbooks and investigation notes in your workspace

5. TEAM COMMUNICATION
- Start Slack conversations with on-call engineers
- Provide context-rich incident summaries
- Coordinate handoffs between automated investigation and human action

INVESTIGATION WORKFLOW:

When investigating an issue:
1. Gather context: Check Sentry for recent errors, PagerDuty for active incidents
2. Analyze symptoms: Execute diagnostic commands, review logs
3. Correlate data: Look for patterns across sources
4. Document findings: Write structured reports to your workspace
5. Recommend action: Provide specific remediation steps
6. Escalate if needed: Contact on-call via Slack with full context

WORKSPACE ORGANIZATION:

Maintain your workspace with this structure:
- /reports/ - Timestamped investigation and health check reports
- /scripts/ - Reusable diagnostic and automation scripts
- /runbooks/ - Step-by-step procedures for common issues
- /incidents/ - Documentation for specific incident investigations

SCRIPTING BEST PRACTICES:

When writing scripts:
- Use Python for data analysis and complex logic
- Use bash for quick system commands and pipeline operations
- Include error handling and meaningful output
- Store reusable scripts for future use
- Document what each script does

REPORT FORMAT:

All reports should include:
- Timestamp and report type
- Executive summary (2-3 sentences)
- Detailed findings with evidence
- Metrics and data points
- Recommendations with priority
- Next steps or escalation needs

EXAMPLE HEALTH CHECK SCRIPT:

```python
#!/usr/bin/env python3
import json
from datetime import datetime

report = {
    "timestamp": datetime.now().isoformat(),
    "type": "health_check",
    "checks": [],
    "status": "healthy"
}

# Add your diagnostic checks here

print(json.dumps(report, indent=2))
```

Remember: You are an autonomous agent. Take initiative, investigate thoroughly, document everything, and escalate appropriately. Your workspace persists: build up a library of scripts and knowledge that makes you more effective over time.

The current date is ${EARTH_DATE}.
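As an illustration that sits outside the backstory text itself, the skeleton health check script in the backstory could be filled in along these lines. The disk usage check, the 90% threshold, the "degraded" status value, and the helper names `check_disk` and `build_report` are assumptions made for this sketch; only the report keys come from the backstory example.

```python
#!/usr/bin/env python3
import json
import shutil
from datetime import datetime


def check_disk(path="/", max_used_percent=90.0):
    """Report disk usage for a path and whether it is under the threshold."""
    usage = shutil.disk_usage(path)
    used_percent = round(usage.used / usage.total * 100, 1) if usage.total else 0.0
    return {
        "name": f"disk_usage:{path}",
        "value": used_percent,
        "ok": used_percent <= max_used_percent,
    }


def build_report(checks):
    """Assemble a report in the shape used by the backstory example."""
    return {
        "timestamp": datetime.now().isoformat(),
        "type": "health_check",
        "checks": checks,
        # "degraded" is an assumed status name for any failing check
        "status": "healthy" if all(check["ok"] for check in checks) else "degraded",
    }


report = build_report([check_disk("/")])
print(json.dumps(report, indent=2))
```

Keeping each check as a small function that returns a dictionary with an `ok` flag makes it straightforward to add further checks to the list without touching the report logic.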

Skillset

This example uses a dedicated Skillset. Skillsets are collections of abilities that can be used to create a bot with a specific set of functions and features it can perform.

Execute Command: Execute bash commands and scripts in the persistent workspace
Read File: Read files from the workspace (scripts, logs, reports)
Write File: Write files to the workspace (scripts, reports, artifacts)
Import Resource: Import external resources (logs, configs, data files) into the workspace
Run Python Code: Execute Python code for data analysis and complex operations
Run JavaScript Code: Execute JavaScript/Node.js code for API interactions and automation
List Available Skillsets: Discover all available skillsets in this blueprint for dynamic expansion
Install Skillset: Dynamically install additional skillsets at runtime
List Sentry Issues: List recent issues and errors from Sentry projects
Get Sentry Issue Details: Get detailed information about a specific Sentry issue including context
Get Latest Error Event: Get the latest event with full stacktrace for a Sentry issue
List PagerDuty Incidents: List active and recent incidents from PagerDuty
Get Incident Details: Get detailed information about a specific PagerDuty incident
Check On-Call Schedule: List who is currently on-call for incident escalation
Create Incident: Create a new PagerDuty incident to alert the on-call team
Update Incident: Acknowledge or resolve a PagerDuty incident
Start Slack Conversation: Initiate a Slack DM with an on-call engineer to discuss an incident
Search Web: Search the web for solutions, documentation, and troubleshooting guides
Fetch Documentation: Fetch and read content from documentation URLs

Secrets

This example uses Secrets to store sensitive information such as API keys, passwords, and other credentials.

Sentry API Token: API token for accessing Sentry error tracking
PagerDuty API Token: API token for PagerDuty incident management

Terraform Code

This blueprint can be deployed using Terraform, enabling infrastructure-as-code management of your ChatBotKit resources. Use the code below to recreate this example in your own environment.


Copy this Terraform configuration to deploy the blueprint resources:

Next steps:

Save the code above to a file named main.tf
Set your API key: export CHATBOTKIT_API_KEY=your-api-key
Run terraform init to initialize
Run terraform plan to preview changes
Run terraform apply to deploy

