Skip to content

Lethe044/hermes-incident-commander

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

βš• Hermes Incident Commander

An autonomous SRE agent that detects, diagnoses, and heals production infrastructure β€” then learns from every incident it resolves.

Built on Hermes Agent by NousResearch. Submitted for the "Show us what Hermes Agent can do" challenge.


The Problem

When a production server goes down at 3 AM, an on-call engineer has to:

  1. Wake up, check alerts
  2. SSH in, run diagnostics manually
  3. Piece together root cause from logs
  4. Apply a fix - hopefully the right one
  5. Verify it worked
  6. Write a post-mortem nobody will read

Mean time to resolve (MTTR) for P0 incidents averages 45–60 minutes. Much of that is humans doing things a sufficiently capable agent could do faster and better.

Hermes Incident Commander does all of it - autonomously, in minutes, getting smarter with each incident it handles.


Demo

# Install dependencies
pip install anthropic rich

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run a demo incident (disk full scenario)
python demo/demo_incident.py --scenario disk-full-logs

# Try other scenarios
python demo/demo_incident.py --scenario svc-crash-nginx
python demo/demo_incident.py --scenario cpu-runaway-process

What you'll see:

  • Hermes detects the incident and classifies severity (P0/P1/P2/P3)
  • Runs parallel diagnostics across CPU, memory, disk, and services
  • Identifies root cause with explicit reasoning
  • Applies the safest effective fix
  • Verifies the fix worked
  • Writes a structured post-incident report to ~/.hermes/incidents/
  • Creates a new prevention skill in ~/.hermes/skills/ so it handles this faster next time

How It Uses Every Hermes Feature

This project was designed to push every capability of Hermes Agent:

Hermes Feature How It's Used
Persistent Memory Builds a system topology map over time. Learns which services fail together, time-of-day patterns, and which remediations work on YOUR infrastructure.
Skill Auto-Creation After every novel incident, writes a new SKILL.md prevention playbook. Hermes gets measurably better at your stack over weeks.
Cron Scheduler Every 5 min: critical health check. Every hour: full audit. Daily 08:00: morning briefing to Telegram.
Gateway (Telegram/Discord) Real-time P0 alerts, resolution notices, and daily briefings delivered to your phone.
Subagent Spawning For multi-service environments, spawns parallel subagents to investigate nginx, database, and application layers simultaneously.
Session Search (FTS5) "Have we seen this error before?" - searches past incidents for matching patterns.
execute_code Collapses multi-step diagnostic pipelines into single inference turns, dramatically reducing latency.
MCP Integration Connects to cloud provider APIs (AWS/GCP/Azure MCP servers) for auto-scaling and cloud-native remediation.

Architecture

flowchart TD
    ALERT([🚨 Incident Alert]) --> DETECT

    DETECT["πŸ” DETECT<br/>Gather system vitals<br/>CPU β€’ Memory β€’ Disk β€’ Services"]
    TRIAGE["βš–οΈ TRIAGE<br/>Classify severity<br/>P0 Β· P1 Β· P2 Β· P3"]
    DIAGNOSE["πŸ”¬ DIAGNOSE<br/>Root cause analysis<br/>Logs Β· Processes Β· Stack traces"]
    REMEDIATE["πŸ”§ REMEDIATE<br/>Apply safest fix<br/>Tier 1 β†’ 2 β†’ 3"]
    VERIFY["βœ… VERIFY<br/>Confirm resolution<br/>Before vs after metrics"]

    DETECT --> TRIAGE --> DIAGNOSE --> REMEDIATE --> VERIFY

    CRON["⏱️ CRON<br/>Every 5 min: health check<br/>Every hour: full audit<br/>Daily 08:00: briefing"]
    CRON -->|triggers| DETECT

    LEARN["🧠 LEARN<br/>Write post-incident report<br/>Create prevention SKILL.md<br/>Update MEMORY.md<br/>Search past incidents (FTS5)"]
    VERIFY --> LEARN

    GATEWAY["πŸ“² GATEWAY<br/>Telegram Β· Discord Β· Slack"]
    TRIAGE -->|"🚨 P0/P1 alert"| GATEWAY
    VERIFY -->|"βœ… resolved"| GATEWAY
    CRON -->|"πŸ“‹ daily briefing"| GATEWAY

    style DETECT fill:#1e3a5f,color:#fff
    style TRIAGE fill:#7b2d00,color:#fff
    style DIAGNOSE fill:#1e3a5f,color:#fff
    style REMEDIATE fill:#1a4731,color:#fff
    style VERIFY fill:#1a4731,color:#fff
    style LEARN fill:#3d2068,color:#fff
    style CRON fill:#2d2d2d,color:#fff
    style GATEWAY fill:#2d2d2d,color:#fff
    style ALERT fill:#7b2d00,color:#fff
Loading

Project Structure

graph LR
    ROOT["πŸ“ hermes-incident-commander"]

    ROOT --> SKILLS["πŸ“ skills/"]
    ROOT --> ENVS["πŸ“ environments/"]
    ROOT --> DEMO["πŸ“ demo/"]
    ROOT --> TESTS["πŸ“ tests/"]
    ROOT --> DOCS["πŸ“ docs/"]
    ROOT --> REQ["πŸ“„ requirements.txt"]

    SKILLS --> SKILL_MD["πŸ“„ incident-commander/SKILL.md<br/>← install into ~/.hermes/skills/"]

    ENVS --> ENV_PY["🐍 incident_env.py<br/>← Atropos RL environment"]
    ENVS --> ENV_CFG["βš™οΈ incident_config.yaml<br/>← training configuration"]

    DEMO --> DEMO_PY["🐍 demo_incident.py<br/>← standalone demo"]

    TESTS --> TEST_PY["🐍 test_incident_env.py<br/>← pytest test suite"]

    DOCS --> SETUP["πŸ“„ SETUP.md"]
    DOCS --> WRITEUP["πŸ“„ WRITEUP.md"]

    style ROOT fill:#1e3a5f,color:#fff
    style SKILL_MD fill:#1a4731,color:#fff
    style ENV_PY fill:#3d2068,color:#fff
    style DEMO_PY fill:#7b2d00,color:#fff
    style TEST_PY fill:#2d2d2d,color:#fff
Loading

Installation (Full Hermes Setup)

1. Install Hermes Agent

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

2. Configure Hermes

hermes setup        # Interactive setup wizard
hermes model        # Choose your model (Nous Portal recommended)
hermes gateway setup  # Connect Telegram/Discord for alerts

3. Install the Incident Commander Skill

# Copy the skill to Hermes's skills directory
cp -r skills/incident-commander ~/.hermes/skills/

# Verify it's loaded
hermes
> /skills

4. Set Up Monitoring Cron Jobs

In your Hermes conversation:

Set up incident monitoring: run a health check every 5 minutes and alert me
on Telegram if anything is P0 or P1. Send me a daily briefing at 08:00.

Hermes will install the cron jobs automatically.

5. Run the RL Training Environment (Optional)

# Install Atropos
pip install atroposlib

# Generate SFT training data
python environments/incident_env.py process --config environments/incident_config.yaml

# Full RL training (requires VLLM)
python environments/incident_env.py serve --config environments/incident_config.yaml

Reward Function (for RL Training)

The training environment uses a multi-component reward that captures real SRE quality:

pie title Reward Components
    "Resolution β€” Did the incident get fixed?" : 50
    "RCA Quality β€” Root cause explained?" : 15
    "Report Quality β€” Post-mortem written?" : 15
    "Skill Created β€” Prevention skill added?" : 10
    "Response Speed β€” Fast MTTR?" : 5
    "Tool Efficiency β€” Minimal tool calls?" : 5
Loading

Incident Scenarios (Training Scenarios)

ID Severity Category Description
svc-crash-nginx P0 service nginx crashed, website unreachable
disk-full-logs P1 disk 95% disk usage from exploded log files
memory-leak-process P1 memory Mystery process eating 150MB+
cpu-runaway-process P2 cpu 95% CPU from runaway computation
failed-systemd-unit P2 service Custom worker service in failed state

Running Tests

# Install test dependencies
pip install pytest pytest-asyncio

# Run full test suite
pytest tests/ -v

# Run specific test classes
pytest tests/test_incident_env.py::TestScenarioDefinitions -v
pytest tests/test_incident_env.py::TestRewardFunction -v
pytest tests/test_incident_env.py::TestSkillFile -v

Why This Wins

  1. Real problem, real impact. P0 incidents cost companies thousands of dollars per minute. Shaving 30 minutes off MTTR with an autonomous agent is immediately valuable.

  2. Uses every Hermes capability. Memory, skills, cron, gateway, subagents, session search, execute_code - all integrated into a coherent, meaningful workflow.

  3. Self-improving. The longer Hermes runs, the better it gets at your specific infrastructure. This is Hermes's core promise - "the agent that grows with you" - demonstrated concretely.

  4. Closes the training loop. The Atropos RL environment means this isn't just a demo - it's a path to training models that are genuinely better at agentic SRE tasks.

  5. Ships with working code. The demo runs standalone, the tests pass, and the skill file installs in one command.


License

MIT


Built with Hermes Agent - the agent that grows with you.

About

Autonomous SRE agent built on Hermes - detects, heals, and learns from production incidents. Uses Memory + Skills + Cron + Gateway + Subagents + Atropos RL.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages