Autonomous SRE Agent
A sophisticated Site Reliability Engineering agent with its own persistent shell environment for autonomous troubleshooting, incident investigation, and infrastructure monitoring. Integrates with Sentry for error tracking, PagerDuty for on-call management, and Slack for team communication.
The Autonomous SRE Agent represents the cutting edge of AI-powered infrastructure operations. This blueprint creates an intelligent agent with full shell access, capable of autonomous investigation, script execution, and proactive system monitoring—essentially a junior SRE that never sleeps.
At its core is a persistent shell workspace that gives the agent true operational capabilities. Unlike simple chatbots that can only provide advice, this agent can actually execute commands, write diagnostic scripts, analyze logs, and store investigation artifacts. The shell environment persists across conversations, allowing the agent to build up a library of runbooks, diagnostic tools, and historical data that improves its effectiveness over time.
The architecture uses a multi-skillset design that organizes capabilities into logical domains:
Shell Operations provides full terminal access with command execution, file read/write operations, and the ability to import external resources. The agent can write Python or Node.js scripts, execute them, and analyze their output—all without human intervention. This enables sophisticated automation like parsing JSON logs, generating reports, or implementing custom health checks.
Monitoring Integration connects to Sentry for error tracking and PagerDuty for incident management. The agent can list recent errors, investigate stack traces, check who's on-call, and even create or acknowledge incidents. When a production issue occurs, it can gather context from multiple sources before escalating to humans.
Dynamic Skillset Installation is a powerful meta-capability that lets the agent expand its own abilities at runtime. Using the conversation/skillset/install ability, the agent can examine available skillsets and install additional capabilities as needed. This makes the agent extensible—add new skillsets to the blueprint, and the agent can discover and use them without reconfiguration.
Research & Documentation enables the agent to search the web and fetch documentation, making it effective at finding solutions to novel problems. When encountering an unfamiliar error, it can research the issue and correlate findings with the actual system state.
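As an illustration of the kind of script the agent might write in its workspace for the JSON log parsing mentioned under Shell Operations, here is a minimal Python sketch. The field names `level` and `service`, the function name `summarize_errors`, and the sample data are assumptions made for this example; they are not defined by the blueprint.

```python
import json
from collections import Counter


def summarize_errors(log_lines):
    """Count error-level entries per service in JSON-lines log output.

    Assumes each log line is a JSON object with hypothetical "level"
    and "service" fields. Returns (counts_by_service, skipped_lines).
    """
    counts = Counter()
    skipped = 0
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Tolerate non-JSON lines such as stack-trace fragments.
            skipped += 1
            continue
        if not isinstance(entry, dict):
            skipped += 1
            continue
        if str(entry.get("level", "")).lower() == "error":
            counts[entry.get("service", "unknown")] += 1
    return dict(counts), skipped


# Hypothetical sample input, for illustration only.
SAMPLE_LOG = [
    '{"level": "error", "service": "api", "msg": "timeout"}',
    '{"level": "info", "service": "api", "msg": "ok"}',
    '{"level": "ERROR", "service": "worker", "msg": "oom"}',
    "Traceback (most recent call last):",
    '{"level": "error", "service": "api", "msg": "500"}',
]

if __name__ == "__main__":
    print(summarize_errors(SAMPLE_LOG))
```

Counting malformed lines instead of failing on them keeps the script usable on mixed log output, which is common when application logs and raw tracebacks share a stream.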
The scheduled trigger integration enables autonomous operation. The agent runs periodic health checks, analyzes trends, and proactively identifies issues before they become incidents. Each run produces a timestamped report stored in the workspace, creating an audit trail of system health over time.
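One way a scheduled run could produce the timestamped report described above is sketched below in Python. The `health-<timestamp>.json` naming scheme, the `reports` directory, and the shape of the `results` dictionary are assumptions for illustration; the blueprint does not prescribe a report format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_report(results, now):
    """Return (filename, payload) for one health-check run.

    `results` maps a check name to a dict with at least an "ok" boolean
    (a hypothetical shape chosen for this sketch). `now` should be a
    timezone-aware datetime; it is normalized to UTC.
    """
    now = now.astimezone(timezone.utc)
    filename = f"health-{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    payload = {
        "generated_at": now.isoformat(),
        "healthy": all(check["ok"] for check in results.values()),
        "checks": results,
    }
    return filename, payload


def write_health_report(results, report_dir="reports", now=None):
    """Write the report into the workspace and return its path."""
    now = now or datetime.now(timezone.utc)
    filename, payload = build_report(results, now)
    directory = Path(report_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path
```

Because the timestamp in the filename sorts lexicographically, listing the report directory yields the runs in chronological order, which makes the audit trail easy to scan.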
The Slack integration allows the agent to communicate with the team. After investigating an issue, it can start conversations with on-call engineers, providing context and suggested remediation steps. This creates a seamless handoff from automated investigation to human decision-making.
Practical use cases include:
- Incident Investigation: When an alert fires, the agent can immediately begin investigation (checking logs, querying metrics, identifying recent changes) and prepare a summary before the on-call engineer even opens their laptop.
- Automated Runbooks: Store runbook scripts in the workspace and have the agent execute diagnostic procedures automatically. Failed deployment? The agent can run the standard rollback verification script and report results.
- Trend Analysis: Schedule regular runs to analyze error rates, latency patterns, or resource utilization. The agent can identify anomalies and alert before they become critical.
- Knowledge Building: As the agent investigates issues, it builds up a knowledge base in its workspace (previous incidents, successful remediations, custom scripts) that improves its effectiveness over time.
- On-Call Support: Integrate with PagerDuty to understand who's on-call and ensure alerts reach the right person with the right context.
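The Trend Analysis use case can be made concrete with a small Python sketch. It flags points that deviate strongly from a trailing baseline using a z-score; this is one simple technique chosen for illustration, not necessarily what the agent does, and the function name, window size, and threshold are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates from the trailing window.

    Each point is compared with the mean and sample standard deviation
    of the `window` points before it. A point is flagged when its
    z-score exceeds `threshold`. If the baseline is perfectly flat,
    any change at all is flagged.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical errors-per-minute series with one spike.
    print(find_anomalies([10, 11, 9, 10, 12, 11, 10, 50, 11]))
```

A trailing window keeps the baseline local, so slow drifts in error rate do not trigger alerts while sudden spikes do. Note that a spike also inflates the baseline for the next few points, which can mask a second anomaly shortly after the first.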
To extend this blueprint, add abilities for your specific infrastructure: Kubernetes cluster access, database query capabilities, cloud provider APIs, or custom internal tools. The dynamic skillset installation pattern means you can add capabilities incrementally and the agent will discover them automatically.
This blueprint showcases several advanced ChatBotKit features: persistent shell workspaces for real execution, multi-skillset architecture for organized capabilities, dynamic skillset installation for runtime extensibility, scheduled triggers for autonomous operation, and multi-integration patterns connecting Slack, Sentry, and PagerDuty.
Backstory
Common information about the bot's experience, skills and personality. For more information, see the Backstory documentation.
Skillset
This example uses a dedicated Skillset. Skillsets are collections of abilities that can be used to create a bot with a specific set of functions and features it can perform.
- Execute Command: Execute bash commands and scripts in the persistent workspace
- Read File: Read files from the workspace (scripts, logs, reports)
- Write File: Write files to the workspace (scripts, reports, artifacts)
- Import Resource: Import external resources (logs, configs, data files) into the workspace
- Run Python Code: Execute Python code for data analysis and complex operations
- Run JavaScript Code: Execute JavaScript/Node.js code for API interactions and automation
- List Available Skillsets: Discover all available skillsets in this blueprint for dynamic expansion
- Install Skillset: Dynamically install additional skillsets at runtime
- List Sentry Issues: List recent issues and errors from Sentry projects
- Get Sentry Issue Details: Get detailed information about a specific Sentry issue including context
- Get Latest Error Event: Get the latest event with full stacktrace for a Sentry issue
- List PagerDuty Incidents: List active and recent incidents from PagerDuty
- Get Incident Details: Get detailed information about a specific PagerDuty incident
- Check On-Call Schedule: List who is currently on-call for incident escalation
- Create Incident: Create a new PagerDuty incident to alert the on-call team
- Update Incident: Acknowledge or resolve a PagerDuty incident
- Start Slack Conversation: Initiate a Slack DM with an on-call engineer to discuss an incident
- Search Web: Search the web for solutions, documentation, and troubleshooting guides
- Fetch Documentation: Fetch and read content from documentation URLs
Secrets
This example uses Secrets to store sensitive information such as API keys, passwords, and other credentials.
- Sentry API Token: API token for accessing Sentry error tracking
- PagerDuty API Token: API token for PagerDuty incident management
Terraform Code
This blueprint can be deployed using Terraform, enabling infrastructure-as-code management of your ChatBotKit resources. Use the code below to recreate this example in your own environment.