opensre-distributed-rca

A self-contained experiment that exercises OpenSRE — an AI-driven SRE agent — against a realistic distributed system. The system under test is a Combat Management System (CMS) simulation that runs on a single host in Docker, instrumented end-to-end so OpenSRE has the same telemetry surfaces a human SRE would see.


1. Purpose

CMS is a representative example of a hard-to-debug distributed system: many loosely-coupled services exchange messages over a transport (DDS) with strict QoS contracts, mission-critical SLOs, and failure modes that span process, transport, and content layers. This experiment gives OpenSRE a concrete environment in which to:

  1. observe a system that produces real metrics, structured logs, and distributed traces;
  2. ingest a Grafana-style alert that points at a fault we have ground truth for;
  3. carry out an investigation against the same telemetry surfaces a real SRE would use (Prometheus, Loki, Tempo, Grafana), and emit a structured RCA;
  4. let us compare its conclusion to the known root cause stored next to the chaos injector, so we can score the agent across a matrix of scenarios.

Six chaos scenarios cover the full spectrum from "a service crashed" to "the transport silently drops messages because of a QoS mismatch" — the last is intentionally hard, because that class of failure has no application-level exception and tests whether an AI SRE can correlate across signals.


2. Reference hardware

These results were produced on:

| Component | Value |
| --- | --- |
| OS | Ubuntu 24.04.3 LTS, kernel 6.17.0-22 |
| CPU | AMD Ryzen 9 4900H, 8 cores / 16 threads |
| RAM | 16 GB |
| GPU | NVIDIA GeForce GTX 1660 Ti (6 GB VRAM), driver 580.126.09 |
| Docker | snap-installed Docker 28.4.0 |
| Local LLM | Ollama on the host, model qwen2.5:7b |

Ollama runs on the host (systemd service) so it can drive the GPU natively. The OpenSRE container reaches it via host.docker.internal:11434.


3. Repository layout

opensre-distributed-rca/
├── README.md
├── docker-compose.yml         # Full stack
├── Dockerfile                 # CMS service image, parameterised by SERVICE_NAME
├── Dockerfile.opensre         # OpenSRE CLI image
├── Makefile                   # make bootstrap | up-sre | run-all | ...
├── pyproject.toml
├── shared/
│   ├── messages.idl           # Canonical DDS message reference
│   └── cyclonedds.xml         # Domain-wide CycloneDDS config
├── shared_lib/                # Common Python: messages, DDS facade, telemetry,
│                              # logging, runtime, trace propagation, world model
├── services/                  # Eight runnable services (see §4.2)
├── observability/
│   ├── otel-collector/config.yaml
│   ├── prometheus/prometheus.yml
│   ├── loki/loki-config.yaml
│   ├── tempo/tempo.yaml
│   └── grafana/
│       ├── dashboards/cms-overview.json
│       └── provisioning/{datasources,dashboards,alerting}/
├── scenarios/                 # 6 chaos scenarios + alert payloads
├── scripts/run-all-investigations.sh
├── opensre-config/integrations.json
└── results/                   # Artefacts from each opensre investigate run

4. Architecture under test

4.1 System topology

An air-defence Combat Management System with three sensor classes that fuse their detections into a single tactical picture, classify threats, issue mission orders, and engage targets. Three hostile aircraft (one classified MISSILE, two AIRCRAFT) fly deterministic circular paths at altitude. Each sensor independently observes them with its own noise and detection profile, and track-fusion correlates the noisy detections into a single tactical track per real target.

 radar-sensor ─┐
 esm-sensor   ─┼─► track-fusion ─► threat-evaluator ─► command-center ─► effector-manager
 eo-sensor    ─┘                                              │                  │
                                                              ▼                  ▼
                                                       tactical-display ◄─── (events)

4.2 Services

Eight Python services communicate over CycloneDDS. Every published message carries a cms.TraceContext field so distributed traces survive the DDS hop and reconstruct end-to-end in Tempo.

| Service | Rate / role | Publishes | Subscribes |
| --- | --- | --- | --- |
| radar-sensor | 10 Hz active radar; low position noise; no fine classification | SourceTrack | |
| esm-sensor | 5 Hz passive RF; wider position noise; classifies emitter | SourceTrack | |
| eo-sensor | 2 Hz electro-optical; tight position when it detects (70%) | SourceTrack | |
| track-fusion | Nearest-neighbour correlation, queue-based | TacticalTrack | SourceTrack |
| threat-evaluator | Rule-based scoring (speed, altitude, classification) | ThreatAssessment | TacticalTrack |
| command-center | Issues ENGAGE / MONITOR orders against threats | MissionOrder | ThreatAssessment |
| effector-manager | Simulates engagement lifecycle (ACCEPTED → IN_FLIGHT → HIT/MISS) | EngagementStatus | MissionOrder |
| tactical-display | Operator REST surface; aggregates the live tactical picture | (HTTP) | every CMS topic |

Common signals (every service)

Wired by shared_lib/runtime.py and shared_lib/telemetry.py, so they fire from every service regardless of business logic.

  • Logs: service_starting, signal_received, service_stopping, dds_writer_ready (with env_qos_override flag), dds_subscriber_ready.
  • Metrics: each service exposes a Prometheus /metrics endpoint on port 910x (sketched after this list); the OTel SDK additionally pushes the same metrics over OTLP to the collector for redundancy.
  • Traces: every span carries service.name (the resource attribute) and the inbound trace context is continued on every DDS hop.
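
A minimal sketch of what the metrics bullet amounts to, assuming the standard prometheus_client library; the real wiring lives in shared_lib/runtime.py and shared_lib/telemetry.py and may differ in detail:

```python
# Sketch only: expose a /metrics endpoint on the service's 910x port and
# register one of the counters listed in §4.2. Not the repo's actual code.
from prometheus_client import Counter, start_http_server

published = Counter(
    "cms_sensor_published_total",      # matches the sensor signal table below
    "Source tracks written",
    ["sensor_id"],
)

start_http_server(9101)                # radar-sensor's metrics port
published.labels(sensor_id="RADAR-1").inc()
```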

Per-service signals

Beyond the common floor, each service emits the following:

radar-sensor / esm-sensor / eo-sensor

The three sensors share shared_lib/sensor_runtime.py, so their signal sets are identical (only the values of sensor_id and sensor_type differ).

| Signal | Type | Notes |
| --- | --- | --- |
| cms_sensor_published_total{sensor_id} | Counter | Source tracks written |
| cms_sensor_publish_seconds{sensor_id} | Histogram | Wall-clock per write |
| cms_sensor_undetected_total{sensor_id} | Counter | Targets missed by detection probability |
| sensor_started | Log | One-shot on boot; logs target rate + detection probability |
| sensor_heartbeat | Log | Every 30 s; logs publish_rate_hz_observed + drop_rate_hz_observed |
| sensor_loop_overrun | Log | Warning when a scan can't keep cadence |
| sensor.scan.<sensor_id> | Span | One per scan; attributes: sensor.id, sensor.type, scan.target_count |

track-fusion

Most heavily instrumented service — it sits on the critical correlation path and is also the chaos target for half the scenarios.

| Signal | Type | Notes |
| --- | --- | --- |
| cms_fusion_ingested_total{sensor_id} | Counter | Source tracks ingested per sensor |
| cms_fusion_published_total | Counter | Tactical tracks published |
| cms_fusion_processing_seconds | Histogram | Per-track correlation wall-clock |
| cms_fusion_queue_depth | Gauge | Pending source tracks awaiting fusion |
| cms_fusion_active_tracks | Gauge | Currently maintained tactical tracks |
| cms_fusion_memory_bytes | Gauge | Process RSS, sampled at ~1 Hz at full ingest |
| cms_fusion_contributors_changed_total{change} | Counter | change ∈ {added, removed} |
| fusion_queue_full_dropping | Log | Warning when inbox is saturated |
| tactical_track_contributors_changed | Log | Info; emitted when a tactical track's contributing-sensor set changes between publishes |
| fusion.ingest | Span | Continues the inbound sensor span; attribute: tactical_track_id |
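
For intuition, a toy version of the nearest-neighbour correlation step named in the services table; the gate value and track attributes are illustrative, not taken from the repo:

```python
import math

GATE_M = 500.0  # association gate in metres (illustrative value)

def correlate(source, tactical_tracks: dict):
    """Associate a SourceTrack with the closest tactical track inside the
    gate, or return None so the caller opens a new tactical track."""
    best_id, best_dist = None, GATE_M
    for track_id, track in tactical_tracks.items():
        d = math.dist((source.x, source.y), (track.x, track.y))
        if d < best_dist:
            best_id, best_dist = track_id, d
    return best_id
```
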
threat-evaluator

| Signal | Type | Notes |
| --- | --- | --- |
| cms_threat_evaluated_total{level} | Counter | Per-level evaluations |
| cms_threat_processing_seconds | Histogram | Per-track evaluation wall-clock |
| cms_threat_level_transitions_total{from_level, to_level} | Counter | Per-track level transitions |
| threat_level_changed | Log | Info; track + prior level + new level + rationale |
| threat.evaluate | Span | Continues the fusion span; attributes: threat.level, threat.priority |

command-center

| Signal | Type | Notes |
| --- | --- | --- |
| cms_orders_issued_total{action} | Counter | action ∈ {ENGAGE, MONITOR} |
| cms_command_processing_seconds | Histogram | Per-order build + write wall-clock |
| mission_order_issued | Log | Info; order_id + action + target + threat_level |
| command.issue_order | Span | Continues the threat-evaluator span |

effector-manager

| Signal | Type | Notes |
| --- | --- | --- |
| cms_orders_received_total{action} | Counter | Mission orders received |
| cms_engagement_outcomes_total{outcome} | Counter | outcome ∈ {HIT, MISS} |
| cms_engagement_transitions_total{status} | Counter | status ∈ {ACCEPTED, IN_FLIGHT, HIT, MISS} |
| cms_engagement_duration_seconds | Histogram | Total lifecycle wall-clock; custom buckets up to 30 s |
| engagement_status_changed | Log | Info on every transition |
| engagement_resolved | Log | Info on the final HIT/MISS |
| effector.engage | Span | Continues the command-center span; attributes: order.id, order.action |

tactical-display

| Signal | Type | Notes |
| --- | --- | --- |
| cms_display_received_total{topic} | Counter | Per-topic ingest count |
| cms_display_handler_seconds{topic} | Histogram | Per-message handler wall-clock |
| cms_display_active_tracks | Gauge | Tracks visible in the latest snapshot |
| http_starting | Log | Info on boot with HTTP port |
| display.ingest.<topic> | Span | Continues the inbound trace; one per ingested message |

4.3 DDS QoS profiles

QoS profiles are centralised in shared_lib/dds_io.py. The dds-qos-mismatch scenario uses an env-var hook (CMS_FORCE_QOS=RELIABLE_KEEP_LAST) to override one writer's profile so it stops matching its consumer's reader.

| Topic | Reliability | History | Durability |
| --- | --- | --- | --- |
| SourceTrackTopic | BEST_EFFORT | KEEP_LAST(10) | VOLATILE |
| TacticalTrackTopic | RELIABLE | KEEP_LAST(50) | VOLATILE |
| ThreatAssessmentTopic | RELIABLE | KEEP_LAST(20) | VOLATILE |
| MissionOrderTopic | RELIABLE | KEEP_ALL | TRANSIENT_LOCAL |
| EngagementStatusTopic | RELIABLE | KEEP_LAST(20) | VOLATILE |
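
A minimal sketch of how such an env-var hook can be implemented with the cyclonedds-python QoS API; the actual logic in shared_lib/dds_io.py may differ in detail:

```python
import os

from cyclonedds.qos import Qos, Policy
from cyclonedds.util import duration

# Canonical writer profile for SourceTrackTopic (from the table above).
SOURCE_TRACK_QOS = Qos(
    Policy.Reliability.BestEffort,
    Policy.History.KeepLast(10),
    Policy.Durability.Volatile,
)

def source_track_writer_qos() -> Qos:
    """Return the canonical profile unless the chaos override is set."""
    if os.environ.get("CMS_FORCE_QOS") == "RELIABLE_KEEP_LAST":
        # Forced RELIABLE writer against track-fusion's BEST_EFFORT reader:
        # per §7.3, discovery still succeeds but no samples are delivered.
        return Qos(
            Policy.Reliability.Reliable(max_blocking_time=duration(seconds=1)),
            Policy.History.KeepLast(10),
            Policy.Durability.Volatile,
        )
    return SOURCE_TRACK_QOS
```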

4.4 Observability stack

The observability layer is what OpenSRE queries during an investigation.

OpenTelemetry is the in-process instrumentation contract for every CMS service. The OTel Python SDK (initialised once in shared_lib/telemetry.py) emits three signal types from inside the service:

  • Traces — every DDS hand-off (sensor scan → fusion ingest → threat evaluate → mission order → engagement) is a span; the W3C trace context is serialised into the cms.TraceContext field of each DDS message so the trace is not severed by the transport (see shared_lib/trace_propagation.py; a sketch follows this list).
  • Metrics — service counters, latency histograms, and queue-depth gauges defined alongside the business logic.
  • Logs — Python's stdlib logging is wrapped by structlog, then forwarded as OTel LogRecords (shared_lib/logging_config.py).
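
A minimal sketch of that propagation step using the OTel propagation API; the repo's actual helpers in shared_lib/trace_propagation.py may be shaped differently:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("cms")

def trace_context_field() -> dict:
    """Publisher side: serialise the current context (traceparent, plus
    tracestate if present) for the cms.TraceContext message field."""
    carrier: dict = {}
    inject(carrier)
    return carrier

def continue_span(field: dict, name: str):
    """Subscriber side: resume the publisher's trace across the DDS hop."""
    ctx = extract(field)
    return tracer.start_as_current_span(name, context=ctx)
```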

All three signals are pushed over OTLP gRPC to the OpenTelemetry Collector, which is the only ingress point for telemetry. The collector then fans out to the storage backends:

  • traces → Tempo
  • metrics → Prometheus via remote-write (and scraped directly from each service's /metrics for redundancy)
  • logs → Loki via OTLP/HTTP
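
From the service side, the single OTLP ingress looks roughly like the following sketch of the kind of initialisation shared_lib/telemetry.py performs (the endpoint name and details are assumptions):

```python
# Sketch: push traces and metrics over OTLP gRPC to the collector.
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def init_telemetry(service_name: str, endpoint: str = "otel-collector:4317"):
    resource = Resource.create({"service.name": service_name})

    tp = TracerProvider(resource=resource)
    tp.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, insecure=True))
    )
    trace.set_tracer_provider(tp)

    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=endpoint, insecure=True)
    )
    metrics.set_meter_provider(
        MeterProvider(resource=resource, metric_readers=[reader])
    )
```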

Grafana sits on top with provisioned datasources, a CMS dashboard, and the alert rules that drive each scenario's investigation. Alert queries are listed in §4.5.

Retention windows (matter for OpenSRE's time-bounded queries):

| Backend | Retention |
| --- | --- |
| Prometheus TSDB | 24 h (--storage.tsdb.retention.time=24h) |
| Tempo | 24 h (block_retention: 24h) |
| Loki | 168 h (the Loki default; reject_old_samples_max_age: 168h) |

4.5 Alert rules

Provisioned in observability/grafana/provisioning/alerting/alerts.yaml. All seven rules query Prometheus over a 2 min look-back window. Most chaos scenarios are covered by a single rule; dds-qos-mismatch and memory-leak fire two complementary rules at different signal strengths (a leading "early" rule plus a lagging "cascade" rule), so OpenSRE can be tested both on the early-warning signal and on the post-cascade evidence.

| Rule (uid) | PromQL | For | Severity | Scenario | Signal kind |
| --- | --- | --- | --- | --- | --- |
| sensor-down | `(1 - up{job="cms-services", instance=~"(radar\|esm\|eo)-sensor:.+"}) > 0` | 30 s | critical | sensor-down, network-partition | scrape-failure (process down OR unreachable) |
| fusion-latency-high | `histogram_quantile(0.95, sum(rate(cms_fusion_processing_seconds_bucket[1m])) by (le)) > 0.5` | 1 m | critical | fusion-cpu-starvation | SLO breach |
| tactical-tracks-collapsed | `(3 - cms_fusion_active_tracks) > 0` | 1 m | warning | dds-qos-mismatch | cascade — fewer fused tracks than expected |
| sensor-flood | `sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100` | 30 s | critical | message-flood | per-sensor rate anomaly |
| fusion-queue-deep | `cms_fusion_queue_depth > 1000` | 1 m | warning | memory-leak | cascade — back-pressure |
| fusion-contributors-dropped | `sum(increase(cms_fusion_contributors_changed_total{change="removed"}[2m])) > 0` | 30 s | warning | dds-qos-mismatch | early — a sensor disappeared from a fused track |
| fusion-memory-elevated | `cms_fusion_memory_bytes > 200000000` | 1 m | warning | memory-leak | early — RSS approaching the 256 MiB chaos cap |
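
Any of these rule expressions can be evaluated by hand against the remapped Prometheus port from §6, which is handy when checking whether an alert should be firing; a small example:

```python
import requests

# tactical-tracks-collapsed expression from the table above
QUERY = "(3 - cms_fusion_active_tracks) > 0"

resp = requests.get(
    "http://localhost:19090/api/v1/query",  # host-remapped port, see §6
    params={"query": QUERY},
    timeout=5,
)
resp.raise_for_status()
series = resp.json()["data"]["result"]   # non-empty => condition holds
print("firing" if series else "not firing", series)
```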

5. Setup (one-time)

Each make target is idempotent — re-running is safe.

make ollama-install     # install Ollama on the host (sudo prompts)
make ollama-config      # bind Ollama to 0.0.0.0:11434 + restart (sudo)
make ollama-pull        # pull the local LLM (qwen2.5:7b, ~5 GB)
make opensre-image      # clone upstream OpenSRE + build the CLI image

Or all four in one shot:

make bootstrap

make opensre-image clones Tracer-Cloud/opensre at v2026.4.25 by default. Override the repo or ref to build from a fork or branch:

make opensre-image OPENSRE_REPO=https://github.com/<fork>/opensre.git OPENSRE_REF=main

6. Lifecycle & URLs

make build              # build the 8 CMS service images (first run only)
make up-sre             # CMS + observability + OpenSRE
make ps                 # container status
make logs S=track-fusion
make down               # stop everything
make clean              # stop + wipe volumes

Host ports are remapped into high 1xxxx ranges (13xxx, 14xxx, 18xxx, 19xxx) to avoid clashing with other developer-machine observability stacks. Container-internal ports are unchanged.

| Surface | URL |
| --- | --- |
| Grafana | http://localhost:13000 (anonymous Admin) |
| Prometheus | http://localhost:19090 |
| Tempo (HTTP) | http://localhost:13200 |
| Loki (HTTP) | http://localhost:13100 |
| Tactical display API | http://localhost:18080/api/picture |
| Per-service /metrics | http://localhost:19101..19108/metrics |
| OTLP (collector) | gRPC localhost:14317 / HTTP localhost:14318 |

7. Chaos scenarios

Each scenario lives in scenarios/<name>/ with:

  • inject.sh — applies the failure
  • cleanup.sh — reverts it
  • alert.json — the Grafana-style alert payload fed to OpenSRE
  • expected_rca.md — ground-truth root cause + scoring rubric

alert.json schema and non-obvious fields

Three fields look optional on the surface but materially change how OpenSRE plans the investigation:

  • alert_source: "grafana" — required. Without it, app/nodes/plan_actions/detect_sources.py suppresses the Grafana tools for "non-Grafana alerts" and the agent ends up with no Loki / Tempo / Mimir handles at all. Every scenario's payload sets this explicitly.
  • pipeline_name — used by app/tools/GrafanaLogsTool as the default service_name when building the LogQL query. If it is missing or set to a value that doesn't exist as a Loki label (e.g. the OpenSRE template default events_fact), the resulting {service_name="events_fact"} query returns nothing or 400s. Each scenario sets this to one of our actual service_name labels (radar-sensor, track-fusion, …).
  • commonAnnotations.scenario — not interpreted by OpenSRE; this is a tag the runner reads when matching a result back to its ground-truth folder. Keep it equal to the directory name.

Other fields (title, alert_name, severity, commonLabels.*) are the standard Grafana webhook shape; OpenSRE's app/nodes/extract_alert/ reads what it can and falls back to unknown for missing fields.
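
Putting those fields together, a minimal payload consistent with the schema described above (the concrete values are illustrative, not copied from a scenario folder):

```python
# Illustrative alert payload; see scenarios/<name>/alert.json for real ones.
alert = {
    "alert_source": "grafana",        # required, or Grafana tools are suppressed
    "alert_name": "fusion-latency-high",
    "title": "fusion p95 latency over SLO",
    "severity": "critical",
    "pipeline_name": "track-fusion",  # must be a real Loki service_name label
    "commonLabels": {"job": "cms-services"},
    "commonAnnotations": {
        "scenario": "fusion-cpu-starvation",  # runner tag; equals the dir name
    },
}
```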

| # | Scenario | One-liner |
| --- | --- | --- |
| 1 | sensor-down | Process gone — find the missing service |
| 2 | fusion-cpu-starvation | Consumer is the bottleneck while inputs look healthy |
| 3 | dds-qos-mismatch | Transport-layer silent message loss; hardest to diagnose |
| 4 | message-flood | Anomalous publish rate; fusion is the symptom, not the cause |
| 5 | memory-leak | Slow degradation, eventual OOM and cascade |
| 6 | network-partition | Container alive but unreachable — distinct from sensor-down |

7.1 sensor-down

Inject — `docker stop radar-sensor`. Sends SIGTERM to the radar container; the process exits cleanly (the runtime registers a SIGTERM handler that flips the stop event). The container is left in Exited state, name still occupied, so cleanup can simply start it again.

Cleanup — `docker start radar-sensor` brings the same container back.

Signal that fires the alert — Prometheus scrape against radar-sensor:9101 fails for more than 30 s; the rule (1 - up{job="cms-services", instance=~"(radar|esm|eo)-sensor:.+"}) > 0 becomes truthy with instance="radar-sensor:9101".

7.2 fusion-cpu-starvation

Inject — `docker update --cpus 0.05 track-fusion`. Applies a cgroup CPU quota to the running container; no restart. Fusion still processes, just at ~5% of one core, so the queue starts to back up under the 45 msg/s aggregate sensor inflow.

Cleanup — `docker update --cpus 0 track-fusion` removes the quota (0 means unlimited).

Signal that fires the alert — `histogram_quantile(0.95, sum(rate(cms_fusion_processing_seconds_bucket[1m])) by (le)) > 0.5`, i.e. fusion's p95 processing latency crosses the 500 ms SLO.

7.3 dds-qos-mismatch

Inject — `docker rm -f esm-sensor`, then `docker run` a fresh ESM container with the same name and image but with CMS_FORCE_QOS=RELIABLE_KEEP_LAST. The shared library (shared_lib/dds_io.py) reads that env var when constructing writers and overrides the canonical SourceTrackTopic profile (which is BEST_EFFORT). The reader on track-fusion still asks for BEST_EFFORT, so CycloneDDS sees an incompatible reliability pair: discovery succeeds, but no samples are delivered. No application-level exception is raised.

Cleanup — `docker stop esm-sensor`, then `docker compose up -d esm-sensor`. Compose recreates ESM under its canonical config (no env-var override), restoring the matched QoS pair.

Signal that fires the alert — track-fusion still publishes tactical tracks but their contributing_sensors list no longer contains ESM-2; once the dropout is large enough, the (3 - cms_fusion_active_tracks) > 0 rule fires (active tracks fall below the steady-state of 3). The transport-layer hint (Cyclone's INCOMPATIBLE_QOS warning) shows up only on container stderr.

7.4 message-flood

Inject — `docker rm -f radar-sensor`, then `docker run` a fresh radar container with PUBLISH_RATE_HZ=200. Twenty times the designed 10 Hz cadence; combined with three targets per scan that's ~600 msg/s instead of ~30 from radar alone, roughly 13× the ~45 msg/s steady-state aggregate inflow.

Cleanup — `docker stop radar-sensor`, then `docker compose up -d radar-sensor`. Compose recreates radar under its canonical 10 Hz cadence.

Signal that fires the alert — `sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100` fires for sensor_id="RADAR-1" once the rate window catches up. Other sensors stay at baseline, which is what makes the diagnosis "radar is the cause, fusion is the symptom" rather than the other way around.

7.5 memory-leak

Inject — `docker rm -f track-fusion`, then `docker run` fresh with --memory 256m and CMS_LEAK_RATE_BYTES=10240. Track-fusion's main.py has a clearly-marked chaos hook: when the env var is set, it appends a 10 KiB byte buffer to a long-lived list for every ingested SourceTrack. With sensors aggregating ~45 msg/s, that's ~27 MB/min of unfreed memory — a few minutes from boot to OOM under the 256 MiB cap.
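
The hook reduces to something like the following sketch (names are illustrative; the clearly-marked original lives in track-fusion's main.py):

```python
import os

_LEAK_RATE = int(os.environ.get("CMS_LEAK_RATE_BYTES", "0"))
_leaked: list = []   # long-lived list that is never cleared

def on_source_track(track) -> None:
    if _LEAK_RATE:
        # 10 KiB per ingested SourceTrack at the documented setting;
        # at ~45 msg/s aggregate that is ~27 MB/min of unfreed memory.
        _leaked.append(b"\x00" * _LEAK_RATE)
    # ...normal correlation work continues here...
```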

Cleanup — `docker stop track-fusion`, then `docker compose up -d track-fusion`. Compose recreates fusion under its canonical config with no leak hook and no memory limit.

Signal that fires the alert — `cms_fusion_queue_depth > 1000` when the inbound queue starts climbing as ingest stalls under memory pressure. (The OOM kill itself produces a container restart; that event is observable as a gap in fusion-side metrics.)

7.6 network-partition

Inject — `docker network disconnect opensre-distributed-rca_cms esm-sensor`. The container keeps running with all its sockets, but it is detached from the Docker network, so:

  • Prometheus can't scrape esm-sensor:9102.
  • The OpenTelemetry SDK can't reach the collector, so no logs / traces / metrics from ESM after the partition point.
  • Other DDS participants on the bridge see ESM disappear from discovery and stop receiving samples.

Together with sensor-down, this is exactly the pair that distinguishes "process crashed" from "network detached": same external symptom (no data from ESM), very different remediation.

Cleanup — `docker network connect opensre-distributed-rca_cms esm-sensor` re-attaches the container to the bridge; CycloneDDS rediscovers the participant within ~10 s.

Signal that fires the alert — same up == 0 rule as sensor-down, but for instance="esm-sensor:9102". The whole point of the scenario is that a careless RCA will conflate the two while a rigorous one will notice the container is still in running state.

Why dds-qos-mismatch is the centrepiece

DDS QoS mismatches are notorious in production: there is no application-level exception, no log line saying "I lost 100 messages". The only signals are transport-layer Cyclone warnings and the absence of expected sensors in the fused output. An AI SRE that correctly identifies this scenario is doing real distributed-systems debugging, not pattern matching on stack traces.


8. Running scenarios through OpenSRE

make run-all                              # all six scenarios, isolated
make run NAME=fusion-cpu-starvation       # one scenario

Both targets go through scripts/run-all-investigations.sh, which enforces strict isolation: every scenario starts with a fresh docker compose down --volumes && up -d, waits for the sensors to publish ≥ 1 msg/s (so the baseline is observable), injects the fault, waits 75 s for the alert window to fill, calls opensre investigate against the scenario's alert.json, copies the artefacts to results/<scenario>/, and runs cleanup.sh.

The per-scenario folder is wiped at the start of each run, so the artefacts always reflect the most recent execution. The investigation calls into:

  • Prometheus for metric queries
  • Loki for log search
  • Tempo for trace lookup
  • Grafana for alert metadata
  • Ollama (host) for local LLM inference — no cloud calls

See opensre-config/integrations.json for how OpenSRE is wired to the four data backends.

8.1 What the investigation does (run.log key phases)

A single opensre investigate run is an iterative loop. The phase markers in results/<scenario>/run.log show what is happening:

Reading alert         → LLM extracts (alert_name, pipeline_name, severity)
                        from the JSON payload
Loading integrations  → opensre reads ~/.tracer/integrations.json,
                        validates Grafana endpoint, classifies into
                        grafana / grafana_local / etc.
Planning              → LLM proposes a list of tool calls to make
                        (Grafana alerts, Grafana Loki, Mimir, Tempo,
                        run-diagnostic-code, get-sre-guidance)
Gathering evidence    → opensre dispatches each tool call concurrently,
                        collects results (n logs / m traces / k metrics)
Diagnosing            → LLM reasons over collected evidence and emits a
                        confidence score (0-100)
                        ↳ if confidence is low or evidence is thin,
                          the loop returns to Planning with new tool
                          choices, up to a configured iteration limit
Investigation complete→ final RCA report (root_cause + findings + cited
                        evidence with reproducible Grafana links) is
                        written to /tmp/rca.json

Typical durations on the reference hardware: Reading alert 15-20 s, each Planning 20-45 s, each Diagnosing 15-30 s. OpenSRE caps the loop at 5 Planning/Diagnosing iterations; in this experiment every scenario hit that cap (one of the iterations is consumed by the run_diagnostic_code failure described in §8.2). Total wall-clock per investigation lands at 217-249 s. With qwen2.5:7b the GPU stays at ~5 GB VRAM and ~30-90% utilisation throughout.
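
For orientation, an illustrative shape for /tmp/rca.json based on the fields named above (root cause, findings, cited evidence with Grafana links); OpenSRE's exact schema is not reproduced here:

```python
# Illustrative only — not OpenSRE's actual schema.
rca = {
    "root_cause": "track-fusion p95 processing latency breached the 500 ms SLO",
    "confidence": 87,   # 0-100, from the Diagnosing phase
    "findings": [
        {
            "summary": "fusion queue depth climbing while sensor inflow stayed at baseline",
            "evidence": {
                "kind": "metric",
                "query": "cms_fusion_queue_depth",
                "grafana_link": "http://localhost:13000/explore",  # reproducible link
            },
        },
    ],
}
```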

8.2 Known issues

OpenSRE version under test: git v2026.4.25-5-g4e6c051 (HEAD a few documentation commits past the v2026.4.25 tag); pyproject.toml still self-reports 2026.4.5 because the release pipeline does not bump that field on every tag.

  • run_diagnostic_code consistently fails with TypeError: ... missing 1 required positional argument during the Gathering phase. Reproduces under both llama3.1:8b and qwen2.5:7b, so the root cause is in OpenSRE's tool registry / decorator wiring rather than the model: the tool is registered via @tool(...) without is_available / extract_params, so it leaks into the planner's choice list and then crashes on dispatch when the runner has no LLM-supplied args to pass. Investigations still complete (the agent keeps planning around the failure), but it inflates iteration count and shows up as a [WARNING] Action failed line in every run.log.

9. Per-scenario results

For each run, the artefacts are at results/<scenario>/: inject.log, run.log (full investigation transcript), rca.json (structured RCA), and cleanup.log.

| Scenario | Expected | OpenSRE actual | Iter. | Conf. | Score |
| --- | --- | --- | --- | --- | --- |
| sensor-down | radar-sensor stopped → restart | "service started and signal received… no further log entries or metrics indicating when the problem occurred" | 5 | 100% | 1/3: ⚠️ saw signal_received but didn't query heartbeats / metrics that would prove the gap; no remediation |
| fusion-cpu-starvation | fusion CPU-bound → relax CPU quota or scale | "tactical_track_contributors_changed events indicate dynamic track adjustments, but lack of performance metrics… does not reveal a definitive root cause" | 5 | 87% | 1/4: ✅ named track-fusion, used the new tactical_track_contributors_changed event we added but misinterpreted it as the cause; never queried cms_fusion_processing_seconds |
| message-flood | radar publishing > 100 msg/s → revert config | "the publish rate of sensor data increased rapidly beyond the target, leading to overruns and heartbeats indicating dropped messages" | 5 | 100% | 3/5: ✅ correct service + ✅ correctly compared sensor_started target rate vs sensor_heartbeat observed rate (added by this experiment); ❌ no remediation |
| network-partition | esm container alive but network detached → reconnect | "the sensor service failed to initialize properly due to missing configuration" | 5 | 100% | 1/4 with misattribution: ✅ named esm-sensor; ❌ inferred "init failure" instead of network detach — agent saw the alert hint "container itself reports running" in the JSON but did not weight it |
| dds-qos-mismatch | DDS QoS mismatch on esm publisher → align QoS | "Change in Tactical Track Contributors" | 5 | 100% | 2/5, no hallucinations: ✅ named track-fusion + ✅ cited the tactical_track_contributors_changed event we added (the canonical signal for this scenario); ❌ stopped at "contributors changed" rather than reaching "QoS mismatch" |
| memory-leak | track-fusion memory grows linearly → profile / rollback | "there is an imbalance in track contributions that leads to queue overflow" — also cited aws_batch_jobs as evidence (hallucinated source) | 5 | 75% | 1/5 + 1 hallucination: 75 s wait window is too short for the 10 KiB-per-track leak to push RSS to the 256 MiB cap, so the actual signature was absent; agent fell back to fitting whatever signal was there |

Scoring rubric (also in each expected_rca.md):

  • +1 for naming the right service
  • +1 for citing the right evidence (metric / log / trace)
  • +1 for suggesting the right remediation
  • −1 or −2 for misattributions

10. Discussion & findings

Caveat — this is one run, not a study. Each of the six scenarios was injected and investigated exactly once under the conditions described in §6 and §8. The findings below describe what a single end-to-end execution produced; nothing here is a statistical claim. A later iteration of this experiment would re-run the matrix multiple times per scenario and per model to get distributions instead of point observations.

Headline numbers

  • 6/6 scenarios completed end-to-end without infrastructure errors (clean inject/cleanup logs, no container restart conflicts, every artefact written, every reset_lab gate passed).
  • 4/6 scenarios returned a final confidence of 100% (sensor-down, message-flood, network-partition, dds-qos-mismatch); 87% for fusion-cpu-starvation, 75% for memory-leak. Confidence is poorly calibrated — high confidence does not imply the conclusion is right (network-partition was 100% confident in a wrong answer).
  • Each scenario hit the OpenSRE-internal 5-iteration planning cap, with run_diagnostic_code consuming one of those iterations per run on the TypeError described in §8.2.

What the instrumentation enabled

Two scenarios produced substantive RCAs that depended directly on signals this experiment adds:

  • message-flood — the agent put together the sensor_started log (which records the target rate, e.g. 10 Hz) and the sensor_heartbeat log (observed rate every 30 s). Its conclusion cites "publish rate increased rapidly beyond the target, leading to overruns and heartbeats indicating dropped messages." Without those two structured events the agent would have had only a scrape rate to look at, which it has shown elsewhere it does not query.
  • dds-qos-mismatch — the agent cited the tactical_track_contributors_changed log (a contributing-sensor set changed between consecutive publishes) and named track-fusion correctly. This is exactly the signal we added for this scenario. The conclusion stops at "contributors changed" rather than reaching "QoS mismatch", but the agent is now at least chasing the right thread.

Where the model fell short

This experiment generates rich, current evidence in all four backends (Loki, Tempo, Prometheus, Grafana alerts). The pattern in this single run is that the bottleneck is the model's reasoning, not the data:

  • Stops at the first plausible cause. dds-qos-mismatch reaches "contributors changed" but never asks "why did they change?". A larger or more capable model would likely chain one more inference step into the QoS layer.
  • Heavy log preference, neglects metrics. fusion-cpu-starvation has a histogram (cms_fusion_processing_seconds) that would prove the SLO breach in one query, and we exposed it specifically for this scenario. The agent never asked Mimir for it. Logs are the dominant tool the planner reaches for; histograms / gauges are selected almost as an afterthought.
  • Honours alert payload structured fields, not its prose. The commonAnnotations.summary and free-text context field in alert.json carry useful hints (e.g. "container itself reports running but unreachable" for network-partition), but qwen2.5:7b collapses these into the basic (alert_name, pipeline, severity) triple before reasoning, so the prose hint never reaches the planner.
  • One hallucinated evidence source. Memory-leak's report references aws_batch_jobs, which does not exist anywhere in this experiment. This is the only fabricated source in the run.
  • Slow-degradation scenarios need their own time budget. memory-leak is a 5-10 min phenomenon but our 75 s alert-window wait is shared across all scenarios. Without a longer wait the leak's signature simply is not yet present when the agent looks.

Suggested next steps

  1. Try a larger / more capable model. Same setup, same scenarios, swap OLLAMA_MODEL to qwen2.5:14b or wire up a cloud Anthropic / OpenAI key. The instrumentation in this experiment gives any model a fighting chance; the open question is whether a bigger one actually uses it. Highest-leverage next step.
  2. Run each scenario several times. This run is a single point-observation per scenario; a robust evaluation needs distributions. With the runner already in place, looping the matrix 3-5× per model is mechanical.
  3. Per-scenario wait window. Let the runner read a wait_seconds from scenarios/<name>/inject.sh (or a sibling meta.yaml) instead of a fixed 75 s, so slow-degradation scenarios (memory-leak) actually surface their signature before the agent looks.
  4. Keep observability state between scenarios. Strict isolation improves reproducibility but starves the agent of the cumulative historical context that real investigations rely on. Resetting only the CMS-service in-memory state (not Loki/Prom/Tempo volumes) is closer to a real production deployment.

11. Non-goals

  • Not production hardened (no HA, no secrets management, no RBAC).
  • Not a real CMS — physics, sensor models, and engagement logic are deliberately small for clarity.
  • Not multi-host — DDS discovery is configured for the local docker bridge.
