Autonomous SRE Agent

A sophisticated Site Reliability Engineering agent with its own persistent shell environment for autonomous troubleshooting, incident investigation, and infrastructure monitoring. Integrates with Sentry for error tracking, PagerDuty for on-call management, and Slack for team communication.

sre
devops
infrastructure

The Autonomous SRE Agent blueprint creates an agent with full shell access that can investigate issues, execute scripts, and monitor systems proactively on its own: in effect, a junior SRE that never sleeps.

At its core is a persistent shell workspace that gives the agent true operational capabilities. Unlike simple chatbots that can only provide advice, this agent can actually execute commands, write diagnostic scripts, analyze logs, and store investigation artifacts. The shell environment persists across conversations, allowing the agent to build up a library of runbooks, diagnostic tools, and historical data that improves its effectiveness over time.

The architecture uses a multi-skillset design that organizes capabilities into logical domains:

Shell Operations provides full terminal access with command execution, file read/write operations, and the ability to import external resources. The agent can write Python or Node.js scripts, execute them, and analyze their output—all without human intervention. This enables sophisticated automation like parsing JSON logs, generating reports, or implementing custom health checks.
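
As an illustration of the JSON log parsing mentioned above, here is a minimal sketch of the kind of script the agent might write in its workspace. The one-object-per-line layout and the "level" and "message" field names are assumptions made for this example; real logs will need their own field mapping.

```python
import json
from collections import Counter


def summarize_json_logs(lines):
    """Count log entries per level and collect error messages.

    Assumes one JSON object per line with hypothetical "level" and
    "message" fields. Malformed lines are counted rather than raised.
    """
    levels = Counter()
    errors = []
    malformed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            malformed += 1
            continue
        if not isinstance(entry, dict):
            malformed += 1
            continue
        # Normalize the level so "ERROR" and "error" are counted together
        level = str(entry.get("level", "unknown")).lower()
        levels[level] += 1
        if level == "error":
            errors.append(entry.get("message", ""))
    return {"levels": dict(levels), "errors": errors, "malformed": malformed}


# Sample data invented for the example
sample = [
    '{"level": "info", "message": "service started"}',
    '{"level": "error", "message": "db connection refused"}',
    'not json at all',
    '{"level": "ERROR", "message": "timeout calling payments"}',
]
summary = summarize_json_logs(sample)
print(json.dumps(summary, indent=2))
```

Counting malformed lines instead of failing on them keeps the script usable on partially corrupted log files, which are common during incidents.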

Monitoring Integration connects to Sentry for error tracking and PagerDuty for incident management. The agent can list recent errors, investigate stack traces, check who's on-call, and even create or acknowledge incidents. When a production issue occurs, it can gather context from multiple sources before escalating to humans.

Dynamic Skillset Installation is a powerful meta-capability that lets the agent expand its own abilities at runtime. Using the conversation/skillset/install ability, the agent can examine available skillsets and install additional capabilities as needed. This makes the agent extensible—add new skillsets to the blueprint, and the agent can discover and use them without reconfiguration.

Research & Documentation enables the agent to search the web and fetch documentation, making it effective at finding solutions to novel problems. When encountering an unfamiliar error, it can research the issue and correlate findings with the actual system state.

The scheduled trigger integration enables autonomous operation. The agent runs periodic health checks, analyzes trends, and proactively identifies issues before they become incidents. Each run produces a timestamped report stored in the workspace, creating an audit trail of system health over time.
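
A timestamped report of this kind could be written with a few lines of Python. This is a sketch only: the file naming scheme and the `write_report` helper are hypothetical, and a temporary directory stands in for the agent's workspace.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_report(workspace, report_type, findings):
    """Write a JSON report named by UTC timestamp under <workspace>/reports."""
    now = datetime.now(timezone.utc)
    reports_dir = Path(workspace) / "reports"
    reports_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp-first names sort chronologically in a directory listing
    filename = f"{now.strftime('%Y%m%dT%H%M%SZ')}-{report_type}.json"
    path = reports_dir / filename
    report = {
        "timestamp": now.isoformat(),
        "type": report_type,
        "findings": findings,
    }
    path.write_text(json.dumps(report, indent=2))
    return path


# A temporary directory stands in for the persistent workspace here
workspace = tempfile.mkdtemp()
report_path = write_report(workspace, "health_check", ["disk usage nominal"])
print(report_path.name)
```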

The Slack integration allows the agent to communicate with the team. After investigating an issue, it can start conversations with on-call engineers, providing context and suggested remediation steps. This creates a seamless handoff from automated investigation to human decision-making.
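
The handoff message itself can be as simple as a structured plain-text summary. The sketch below shows one possible layout; the `format_handoff` helper, its parameters, and the sample incident are all invented for illustration and are not part of the blueprint.

```python
def format_handoff(title, severity, findings, recommendations):
    """Render an incident summary as plain text suitable for a chat message."""
    lines = [f"[{severity.upper()}] {title}", "", "Findings:"]
    lines += [f"- {finding}" for finding in findings]
    lines += ["", "Suggested remediation:"]
    # Number the steps so the engineer can refer to them in replies
    lines += [f"{i}. {step}" for i, step in enumerate(recommendations, start=1)]
    return "\n".join(lines)


# Sample incident invented for the example
message = format_handoff(
    title="Checkout API error spike",
    severity="high",
    findings=["Error rate rose after the latest deploy", "Stack traces point to the payments client"],
    recommendations=["Roll back the latest deploy", "Verify payments client timeouts"],
)
print(message)
```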

Practical use cases include:

  • Incident Investigation: When an alert fires, the agent can immediately begin investigation—checking logs, querying metrics, identifying recent changes—and prepare a summary before the on-call engineer even opens their laptop.

  • Automated Runbooks: Store runbook scripts in the workspace and have the agent execute diagnostic procedures automatically. Failed deployment? The agent can run the standard rollback verification script and report results.

  • Trend Analysis: Schedule regular runs to analyze error rates, latency patterns, or resource utilization. The agent can identify anomalies and alert before they become critical.

  • Knowledge Building: As the agent investigates issues, it builds up a knowledge base in its workspace—previous incidents, successful remediations, custom scripts—that improves its effectiveness over time.

  • On-Call Support: Integrate with PagerDuty to understand who's on-call and ensure alerts reach the right person with the right context.
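
For the trend analysis use case, one simple way to flag anomalies is to compare each data point against a trailing window of earlier values. The following sketch uses a basic standard-deviation test; the window size, the threshold, and the sample error counts are arbitrary assumptions, and a production setup would likely need a more robust method.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical errors-per-minute samples with one obvious spike
error_rates = [2, 3, 2, 4, 3, 3, 2, 25, 3, 2]
print(find_anomalies(error_rates))  # prints [7]
```

Note that a spike inflates the standard deviation of the windows that follow it, so points right after an anomaly are judged less strictly until the spike leaves the window.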

To extend this blueprint, add abilities for your specific infrastructure: Kubernetes cluster access, database query capabilities, cloud provider APIs, or custom internal tools. The dynamic skillset installation pattern means you can add capabilities incrementally and the agent will discover them automatically.

This blueprint showcases several advanced ChatBotKit features: persistent shell workspaces for real execution, multi-skillset architecture for organized capabilities, dynamic skillset installation for runtime extensibility, scheduled triggers for autonomous operation, and multi-integration patterns connecting Slack, Sentry, and PagerDuty.

Backstory

Common information about the bot's experience, skills and personality. For more information, see the Backstory documentation.

You are an autonomous Site Reliability Engineering (SRE) Agent with full access to a persistent shell environment. Your mission is to investigate, diagnose, and help resolve infrastructure issues while maintaining comprehensive documentation of your actions.

CORE CAPABILITIES:

1. SHELL ENVIRONMENT
   - You have a persistent shell workspace that persists across sessions
   - Execute bash commands to investigate issues (shell/exec)
   - Read and write files for scripts and reports (shell/read, shell/write)
   - Import external resources like log files or configs (shell/import)
   - Write and execute Python scripts for complex analysis
   - Store artifacts, scripts, and runbooks in your workspace

2. MONITORING INTEGRATION
   - Query Sentry for errors, issues, and stack traces
   - Check PagerDuty for active incidents and on-call schedules
   - Correlate errors across systems to identify root causes

3. DYNAMIC EXPANSION
   - List available skillsets using blueprint/resource/list
   - Install additional skillsets at runtime for expanded capabilities
   - Adapt to new requirements by discovering and using new abilities

4. RESEARCH & DOCUMENTATION
   - Search the web for solutions to unfamiliar problems
   - Fetch documentation from external sources
   - Maintain runbooks and investigation notes in your workspace

5. TEAM COMMUNICATION
   - Start Slack conversations with on-call engineers
   - Provide context-rich incident summaries
   - Coordinate handoffs between automated investigation and human action

INVESTIGATION WORKFLOW:

When investigating an issue:
1. Gather context: Check Sentry for recent errors, PagerDuty for active incidents
2. Analyze symptoms: Execute diagnostic commands, review logs
3. Correlate data: Look for patterns across sources
4. Document findings: Write structured reports to your workspace
5. Recommend action: Provide specific remediation steps
6. Escalate if needed: Contact on-call via Slack with full context

WORKSPACE ORGANIZATION:

Maintain your workspace with this structure:
- /reports/ - Timestamped investigation and health check reports
- /scripts/ - Reusable diagnostic and automation scripts
- /runbooks/ - Step-by-step procedures for common issues
- /incidents/ - Documentation for specific incident investigations

SCRIPTING BEST PRACTICES:

When writing scripts:
- Use Python for data analysis and complex logic
- Use bash for quick system commands and pipeline operations
- Include error handling and meaningful output
- Store reusable scripts for future use
- Document what each script does

REPORT FORMAT:

All reports should include:
- Timestamp and report type
- Executive summary (2-3 sentences)
- Detailed findings with evidence
- Metrics and data points
- Recommendations with priority
- Next steps or escalation needs

EXAMPLE HEALTH CHECK SCRIPT:

```python
#!/usr/bin/env python3
import json
from datetime import datetime

report = {
    "timestamp": datetime.now().isoformat(),
    "type": "health_check",
    "checks": [],
    "status": "healthy"
}

# Add your diagnostic checks here

print(json.dumps(report, indent=2))
```

Remember: You are an autonomous agent. Take initiative, investigate thoroughly, document everything, and escalate appropriately. Your workspace persists; build up a library of scripts and knowledge that makes you more effective over time.

The current date is ${EARTH_DATE}.
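
The example health check script in the backstory leaves its diagnostic checks as a placeholder. As a sketch of how that placeholder might be filled in, the version below adds a single disk usage check and derives the overall status from the check results. The `check_disk` and `build_report` helpers, the 90% threshold, and the "degraded" status value are assumptions for illustration, not part of the blueprint.

```python
#!/usr/bin/env python3
import json
import shutil
from datetime import datetime


def check_disk(path="/", max_used_pct=90.0):
    """Report disk usage for `path`; the 90% threshold is an arbitrary example."""
    usage = shutil.disk_usage(path)
    used_pct = round(usage.used / usage.total * 100, 1)
    return {
        "name": f"disk:{path}",
        "value": used_pct,
        "ok": used_pct < max_used_pct,
    }


def build_report(checks):
    """Assemble the report; one failing check marks the whole report degraded."""
    return {
        "timestamp": datetime.now().isoformat(),
        "type": "health_check",
        "checks": checks,
        # "degraded" is a hypothetical status name chosen for this sketch
        "status": "healthy" if all(c["ok"] for c in checks) else "degraded",
    }


report = build_report([check_disk("/")])
print(json.dumps(report, indent=2))
```

Further checks can follow the same shape: return a dictionary with a name, a measured value, and an `ok` flag, and pass it to `build_report`.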

Skillset

This example uses a dedicated Skillset. Skillsets are collections of abilities that can be used to create a bot with a specific set of functions and features it can perform.

  • Execute Command

    Execute bash commands and scripts in the persistent workspace
  • Read File

    Read files from the workspace (scripts, logs, reports)
  • Write File

    Write files to the workspace (scripts, reports, artifacts)
  • Import Resource

    Import external resources (logs, configs, data files) into the workspace
  • Run Python Code

    Execute Python code for data analysis and complex operations
  • Run JavaScript Code

    Execute JavaScript/Node.js code for API interactions and automation
  • List Available Skillsets

    Discover all available skillsets in this blueprint for dynamic expansion
  • Install Skillset

    Dynamically install additional skillsets at runtime
  • List Sentry Issues

    List recent issues and errors from Sentry projects
  • Get Sentry Issue Details

    Get detailed information about a specific Sentry issue including context
  • Get Latest Error Event

    Get the latest event with full stacktrace for a Sentry issue
  • List PagerDuty Incidents

    List active and recent incidents from PagerDuty
  • Get Incident Details

    Get detailed information about a specific PagerDuty incident
  • Check On-Call Schedule

    List who is currently on-call for incident escalation
  • Create Incident

    Create a new PagerDuty incident to alert the on-call team
  • Update Incident

    Acknowledge or resolve a PagerDuty incident
  • Start Slack Conversation

    Initiate a Slack DM with an on-call engineer to discuss an incident
  • Search Web

    Search the web for solutions, documentation, and troubleshooting guides
  • Fetch Documentation

    Fetch and read content from documentation URLs

Secrets

This example uses Secrets to store sensitive information such as API keys, passwords, and other credentials.

  • Sentry API Token

    API token for accessing Sentry error tracking
  • PagerDuty API Token

    API token for PagerDuty incident management

Terraform Code

This blueprint can be deployed using Terraform, enabling infrastructure-as-code management of your ChatBotKit resources. Use the code below to recreate this example in your own environment.

Copy this Terraform configuration to deploy the blueprint resources:

Next steps:

  1. Save the code above to a file named main.tf
  2. Set your API key: export CHATBOTKIT_API_KEY=your-api-key
  3. Run terraform init to initialize
  4. Run terraform plan to preview changes
  5. Run terraform apply to deploy

Learn more about the Terraform provider
