A self-contained experiment that exercises OpenSRE — an AI-driven SRE agent — against a realistic distributed system. The system under test is a Combat Management System (CMS) simulation that runs on a single host in Docker, instrumented end-to-end so OpenSRE has the same telemetry surfaces a human SRE would see.
A CMS is a representative example of a hard-to-debug distributed system: many loosely coupled services exchange messages over a transport (DDS) under strict QoS contracts, with mission-critical SLOs and failure modes that span the process, transport, and content layers. This experiment gives OpenSRE a concrete environment in which to:
- observe a system that produces real metrics, structured logs, and distributed traces;
- ingest a Grafana-style alert that points at a fault we have ground truth for;
- carry out an investigation against the same telemetry surfaces a real SRE would use (Prometheus, Loki, Tempo, Grafana), and emit a structured RCA;
- let us compare its conclusion to the known root cause stored next to the chaos injector, so we can score the agent across a matrix of scenarios.
Six chaos scenarios cover the full spectrum from "a service crashed" to "the transport silently drops messages because of a QoS mismatch" — the last is intentionally hard, because that class of failure has no application-level exception and tests whether an AI SRE can correlate across signals.
These results were produced on:
| Component | Value |
|---|---|
| OS | Ubuntu 24.04.3 LTS, kernel 6.17.0-22 |
| CPU | AMD Ryzen 9 4900H, 8 cores / 16 threads |
| RAM | 16 GB |
| GPU | NVIDIA GeForce GTX 1660 Ti (6 GB VRAM), driver 580.126.09 |
| Docker | snap-installed Docker 28.4.0 |
| Local LLM | Ollama on the host, model qwen2.5:7b |
Ollama runs on the host (systemd service) so it can drive the GPU
natively. The OpenSRE container reaches it via host.docker.internal:11434.
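A quick way to confirm that wiring from inside any container is to hit Ollama's model-listing endpoint. A minimal sketch, assuming Ollama's default REST API (`GET /api/tags`); the URL simply restates the host/port above:

```python
# Sanity-check that a container can reach the host-side Ollama instance.
# Uses only the standard library; adjust the host/port if your mapping differs.
import json
import urllib.request

OLLAMA_URL = "http://host.docker.internal:11434/api/tags"

with urllib.request.urlopen(OLLAMA_URL, timeout=5) as resp:
    models = json.load(resp).get("models", [])

print("reachable, models available:", [m.get("name") for m in models])
```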
```
opensre-distributed-rca/
├── README.md
├── docker-compose.yml                  # Full stack
├── Dockerfile                          # CMS service image, parameterised by SERVICE_NAME
├── Dockerfile.opensre                  # OpenSRE CLI image
├── Makefile                            # make bootstrap | up-sre | run-all | ...
├── pyproject.toml
├── shared/
│   ├── messages.idl                    # Canonical DDS message reference
│   └── cyclonedds.xml                  # Domain-wide CycloneDDS config
├── shared_lib/                         # Common Python: messages, DDS facade, telemetry,
│                                       #   logging, runtime, trace propagation, world model
├── services/                           # Eight runnable services (see §4.2)
├── observability/
│   ├── otel-collector/config.yaml
│   ├── prometheus/prometheus.yml
│   ├── loki/loki-config.yaml
│   ├── tempo/tempo.yaml
│   └── grafana/
│       ├── dashboards/cms-overview.json
│       └── provisioning/{datasources,dashboards,alerting}/
├── scenarios/                          # 6 chaos scenarios + alert payloads
├── scripts/run-all-investigations.sh
├── opensre-config/integrations.json
└── results/                            # Artefacts from each opensre investigate run
```
An air-defence Combat Management System with three sensor classes that fuse their detections into a single tactical picture, classify threats, issue mission orders, and engage targets. Three hostile aircraft (one classified MISSILE, two AIRCRAFT) fly deterministic circular paths at altitude. Each sensor independently observes them with its own noise and detection profile, and track-fusion correlates the noisy detections into a single tactical track per real target.
```
radar-sensor ─┐
esm-sensor   ─┼─► track-fusion ─► threat-evaluator ─► command-center ─► effector-manager
eo-sensor    ─┘         │                                                      │
                        ▼                                                      ▼
               tactical-display ◄───────────────────────────────────────── (events)
```
Eight Python services communicate over CycloneDDS. Every published message
carries a cms.TraceContext field so distributed traces survive the DDS
hop and reconstruct end-to-end in Tempo.
| Service | Rate / role | Publishes | Subscribes |
|---|---|---|---|
| `radar-sensor` | 10 Hz active radar; low position noise; no fine classification | SourceTrack | — |
| `esm-sensor` | 5 Hz passive RF; wider position noise; classifies emitter | SourceTrack | — |
| `eo-sensor` | 2 Hz electro-optical; tight position when it detects (70%) | SourceTrack | — |
| `track-fusion` | Nearest-neighbour correlation, queue-based | TacticalTrack | SourceTrack |
| `threat-evaluator` | Rule-based scoring (speed, altitude, classification) | ThreatAssessment | TacticalTrack |
| `command-center` | Issues ENGAGE / MONITOR orders against threats | MissionOrder | ThreatAssessment |
| `effector-manager` | Simulates engagement lifecycle (ACCEPTED → IN_FLIGHT → HIT/MISS) | EngagementStatus | MissionOrder |
| `tactical-display` | Operator REST surface; aggregates the live tactical picture | (HTTP) | every CMS topic |
Wired by shared_lib/runtime.py and shared_lib/telemetry.py, so they
fire from every service regardless of business logic.
- Logs: `service_starting`, `signal_received`, `service_stopping`, `dds_writer_ready` (with `env_qos_override` flag), `dds_subscriber_ready`.
- Metrics: each service exposes a Prometheus `/metrics` endpoint on port `910x`; the OTel SDK additionally pushes the same metrics over OTLP to the collector for redundancy.
- Traces: every span carries `service.name` (the resource attribute) and the inbound trace context is continued on every DDS hop.
Beyond the common floor, each service emits the following:
Three sensors share shared_lib/sensor_runtime.py, so their signal set
is identical (only the values of sensor_id and sensor_type differ).
| Signal | Type | Notes |
|---|---|---|
| `cms_sensor_published_total{sensor_id}` | Counter | Source tracks written |
| `cms_sensor_publish_seconds{sensor_id}` | Histogram | Wall-clock per write |
| `cms_sensor_undetected_total{sensor_id}` | Counter | Targets missed by detection probability |
| `sensor_started` | Log | One-shot on boot; logs target rate + detection probability |
| `sensor_heartbeat` | Log | Every 30 s; logs publish_rate_hz_observed + drop_rate_hz_observed |
| `sensor_loop_overrun` | Log | Warning when a scan can't keep cadence |
| `sensor.scan.<sensor_id>` | Span | One per scan; attributes: sensor.id, sensor.type, scan.target_count |
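For orientation, a minimal sketch of how a heartbeat like this could be computed in a sensor loop. It does not reproduce `shared_lib/sensor_runtime.py`; only the `sensor_heartbeat` event name and its two rate fields come from the table above, while the `HeartbeatTracker` class and its wiring are hypothetical:

```python
# Illustrative heartbeat bookkeeping, not the actual shared_lib/sensor_runtime.py.
import time
import structlog

log = structlog.get_logger()
HEARTBEAT_INTERVAL_S = 30.0

class HeartbeatTracker:
    """Counts publishes/drops and emits a sensor_heartbeat log every 30 s."""

    def __init__(self, sensor_id: str):
        self.sensor_id = sensor_id
        self.published = 0
        self.dropped = 0
        self.window_start = time.monotonic()

    def record(self, published: bool) -> None:
        if published:
            self.published += 1
        else:
            self.dropped += 1
        elapsed = time.monotonic() - self.window_start
        if elapsed >= HEARTBEAT_INTERVAL_S:
            log.info(
                "sensor_heartbeat",
                sensor_id=self.sensor_id,
                publish_rate_hz_observed=round(self.published / elapsed, 2),
                drop_rate_hz_observed=round(self.dropped / elapsed, 2),
            )
            self.published = self.dropped = 0
            self.window_start = time.monotonic()
```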
`track-fusion` is the most heavily instrumented service — it sits on the critical correlation path and is also the chaos target for half the scenarios.
| Signal | Type | Notes |
|---|---|---|
| `cms_fusion_ingested_total{sensor_id}` | Counter | Source tracks ingested per sensor |
| `cms_fusion_published_total` | Counter | Tactical tracks published |
| `cms_fusion_processing_seconds` | Histogram | Per-track correlation wall-clock |
| `cms_fusion_queue_depth` | Gauge | Pending source tracks awaiting fusion |
| `cms_fusion_active_tracks` | Gauge | Currently maintained tactical tracks |
| `cms_fusion_memory_bytes` | Gauge | Process RSS, sampled ~1 Hz at full ingest |
| `cms_fusion_contributors_changed_total{change}` | Counter | change ∈ {added, removed} |
| `fusion_queue_full_dropping` | Log | Warning when inbox is saturated |
| `tactical_track_contributors_changed` | Log | Info; emitted when a tactical track's contributing-sensor set changes between publishes |
| `fusion.ingest` | Span | Continues the inbound sensor span; attribute: tactical_track_id |
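The metric names above map onto standard Prometheus instrument types. A hedged sketch of how they could be declared with `prometheus_client` is shown below; the real services wire metrics through `shared_lib/telemetry.py` and the OTel SDK, so treat this purely as a naming/type illustration, and note the port is a hypothetical pick from the 910x range:

```python
# Illustration of the fusion metric surface using prometheus_client.
# Only the metric names, types, and labels are taken from the table above;
# the actual services define these via shared_lib/telemetry.py (OTel SDK).
from prometheus_client import Counter, Gauge, Histogram, start_http_server

INGESTED = Counter("cms_fusion_ingested_total", "Source tracks ingested", ["sensor_id"])
PUBLISHED = Counter("cms_fusion_published_total", "Tactical tracks published")
PROCESSING = Histogram("cms_fusion_processing_seconds", "Per-track correlation wall-clock")
QUEUE_DEPTH = Gauge("cms_fusion_queue_depth", "Pending source tracks awaiting fusion")
ACTIVE_TRACKS = Gauge("cms_fusion_active_tracks", "Currently maintained tactical tracks")
MEMORY_BYTES = Gauge("cms_fusion_memory_bytes", "Process RSS in bytes")
CONTRIBUTORS_CHANGED = Counter(
    "cms_fusion_contributors_changed_total", "Contributor set changes", ["change"]
)

if __name__ == "__main__":
    start_http_server(9104)          # hypothetical port in the 910x range
    with PROCESSING.time():          # records one correlation pass
        INGESTED.labels(sensor_id="RADAR-1").inc()
        QUEUE_DEPTH.set(0)
```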
`threat-evaluator`:

| Signal | Type | Notes |
|---|---|---|
| `cms_threat_evaluated_total{level}` | Counter | Per-level evaluations |
| `cms_threat_processing_seconds` | Histogram | Per-track evaluation wall-clock |
| `cms_threat_level_transitions_total{from_level, to_level}` | Counter | Per-track level transitions |
| `threat_level_changed` | Log | Info; track + prior level + new level + rationale |
| `threat.evaluate` | Span | Continues the fusion span; attributes: threat.level, threat.priority |
`command-center`:

| Signal | Type | Notes |
|---|---|---|
| `cms_orders_issued_total{action}` | Counter | action ∈ {ENGAGE, MONITOR} |
| `cms_command_processing_seconds` | Histogram | Per-order build + write wall-clock |
| `mission_order_issued` | Log | Info; order_id + action + target + threat_level |
| `command.issue_order` | Span | Continues the threat-evaluator span |
`effector-manager`:

| Signal | Type | Notes |
|---|---|---|
| `cms_orders_received_total{action}` | Counter | Mission orders received |
| `cms_engagement_outcomes_total{outcome}` | Counter | outcome ∈ {HIT, MISS} |
| `cms_engagement_transitions_total{status}` | Counter | status ∈ {ACCEPTED, IN_FLIGHT, HIT, MISS} |
| `cms_engagement_duration_seconds` | Histogram | Total lifecycle wall-clock; custom buckets up to 30 s |
| `engagement_status_changed` | Log | Info on every transition |
| `engagement_resolved` | Log | Info on the final HIT/MISS |
| `effector.engage` | Span | Continues the command-center span; attributes: order.id, order.action |
`tactical-display`:

| Signal | Type | Notes |
|---|---|---|
| `cms_display_received_total{topic}` | Counter | Per-topic ingest count |
| `cms_display_handler_seconds{topic}` | Histogram | Per-message handler wall-clock |
| `cms_display_active_tracks` | Gauge | Tracks visible in the latest snapshot |
| `http_starting` | Log | Info on boot with HTTP port |
| `display.ingest.<topic>` | Span | Continues the inbound trace; one per ingested message |
QoS profiles are centralised in shared_lib/dds_io.py.
The dds-qos-mismatch scenario uses an env-var hook (CMS_FORCE_QOS=RELIABLE_KEEP_LAST)
to override one writer's profile so it stops matching its consumer's reader.
| Topic | Reliability | History | Durability |
|---|---|---|---|
| `SourceTrackTopic` | BEST_EFFORT | KEEP_LAST(10) | VOLATILE |
| `TacticalTrackTopic` | RELIABLE | KEEP_LAST(50) | VOLATILE |
| `ThreatAssessmentTopic` | RELIABLE | KEEP_LAST(20) | VOLATILE |
| `MissionOrderTopic` | RELIABLE | KEEP_ALL | TRANSIENT_LOCAL |
| `EngagementStatusTopic` | RELIABLE | KEEP_LAST(20) | VOLATILE |
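A minimal sketch of the profile selection that the env-var hook implies. It deliberately avoids the CycloneDDS Python API and is not the real `shared_lib/dds_io.py`; only the profile contents and the `CMS_FORCE_QOS` value come from this document, and the forced history depth is illustrative:

```python
# Illustrative writer-profile selection; not the real shared_lib/dds_io.py.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class QosProfile:
    reliability: str   # "BEST_EFFORT" | "RELIABLE"
    history: str       # "KEEP_LAST(n)" | "KEEP_ALL"
    durability: str    # "VOLATILE" | "TRANSIENT_LOCAL"

CANONICAL = {
    "SourceTrackTopic":      QosProfile("BEST_EFFORT", "KEEP_LAST(10)", "VOLATILE"),
    "TacticalTrackTopic":    QosProfile("RELIABLE",    "KEEP_LAST(50)", "VOLATILE"),
    "ThreatAssessmentTopic": QosProfile("RELIABLE",    "KEEP_LAST(20)", "VOLATILE"),
    "MissionOrderTopic":     QosProfile("RELIABLE",    "KEEP_ALL",      "TRANSIENT_LOCAL"),
    "EngagementStatusTopic": QosProfile("RELIABLE",    "KEEP_LAST(20)", "VOLATILE"),
}

def writer_profile(topic: str) -> QosProfile:
    """Return the canonical profile unless the chaos hook forces an override."""
    if os.environ.get("CMS_FORCE_QOS") == "RELIABLE_KEEP_LAST":
        # dds-qos-mismatch scenario: the writer stops matching its consumer's reader.
        return QosProfile("RELIABLE", "KEEP_LAST(10)", CANONICAL[topic].durability)
    return CANONICAL[topic]
```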
The observability layer is what OpenSRE queries during an investigation.
OpenTelemetry is the in-process instrumentation contract for every CMS
service. The OTel Python SDK (initialised once in shared_lib/telemetry.py)
emits three signal types from inside the service:
- Traces — every DDS hand-off (sensor scan → fusion ingest → threat evaluate → mission order → engagement) is a span; the W3C trace context is serialised into the `cms.TraceContext` field of each DDS message so the trace is not severed by the transport (see `shared_lib/trace_propagation.py`, and the sketch below).
- Metrics — service counters, latency histograms, and queue-depth gauges defined alongside the business logic.
- Logs — Python's stdlib logging is wrapped by structlog, then forwarded as OTel LogRecords (`shared_lib/logging_config.py`).
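The carriage pattern is standard OpenTelemetry context propagation with the DDS message acting as the carrier. A minimal sketch, assuming a dict-shaped stand-in for the `cms.TraceContext` field; the real serialisation lives in `shared_lib/trace_propagation.py` and may differ:

```python
# Illustrative W3C trace-context carriage over a DDS message field.
# The propagate API is standard OpenTelemetry; the helper names and the
# dict-shaped cms.TraceContext stand-in are assumptions.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("cms.example")

def publish_with_trace_context(payload: dict) -> dict:
    """Publisher side: serialise the current span context into the message."""
    carrier: dict[str, str] = {}
    inject(carrier)                        # writes traceparent/tracestate entries
    payload["trace_context"] = carrier     # stand-in for the cms.TraceContext field
    return payload

def handle_with_trace_context(message: dict) -> None:
    """Subscriber side: continue the publisher's trace instead of starting a new one."""
    ctx = extract(message.get("trace_context", {}))
    with tracer.start_as_current_span("fusion.ingest", context=ctx):
        pass  # correlation work happens here
```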
All three signals are pushed over OTLP gRPC to the OpenTelemetry Collector, which is the only ingress point for telemetry. The collector then fans out to the storage backends:
- traces → Tempo
- metrics → Prometheus via remote-write (and scraped directly from each service's `/metrics` for redundancy)
- logs → Loki via OTLP/HTTP
Grafana sits on top with provisioned datasources, a CMS dashboard, and the alert rules that drive each scenario's investigation. Alert queries are listed in §4.5.
Retention windows (matter for OpenSRE's time-bounded queries):
| Backend | Retention |
|---|---|
| Prometheus TSDB | 24 h (--storage.tsdb.retention.time=24h) |
| Tempo | 24 h (block_retention: 24h) |
| Loki | 168 h (the Loki default; reject_old_samples_max_age: 168h) |
Provisioned in
observability/grafana/provisioning/alerting/alerts.yaml.
All seven rules query Prometheus over a 2 min look-back window. Most chaos
scenarios are covered by a single rule; dds-qos-mismatch and memory-leak
fire two complementary rules at different signal strengths (a leading
"early" rule plus a lagging "cascade" rule), so OpenSRE can be tested both
on the early-warning signal and on the post-cascade evidence.
| Rule (uid) | PromQL | for | Severity | Scenario | Signal kind |
|---|---|---|---|---|---|
| `sensor-down` | `(1 - up{job="cms-services", instance=~"(radar\|esm\|eo)-sensor:.+"}) > 0` | 30 s | critical | sensor-down, network-partition | scrape-failure (process down OR unreachable) |
| `fusion-latency-high` | `histogram_quantile(0.95, sum(rate(cms_fusion_processing_seconds_bucket[1m])) by (le)) > 0.5` | 1 m | critical | fusion-cpu-starvation | SLO breach |
| `tactical-tracks-collapsed` | `(3 - cms_fusion_active_tracks) > 0` | 1 m | warning | dds-qos-mismatch | cascade — fewer fused tracks than expected |
| `sensor-flood` | `sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100` | 30 s | critical | message-flood | per-sensor rate anomaly |
| `fusion-queue-deep` | `cms_fusion_queue_depth > 1000` | 1 m | warning | memory-leak | cascade — back-pressure |
| `fusion-contributors-dropped` | `sum(increase(cms_fusion_contributors_changed_total{change="removed"}[2m])) > 0` | 30 s | warning | dds-qos-mismatch | early — a sensor disappeared from a fused track |
| `fusion-memory-elevated` | `cms_fusion_memory_bytes > 200000000` | 1 m | warning | memory-leak | early — RSS approaching the 256 MiB chaos cap |
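Any of these expressions can be evaluated by hand against Prometheus's HTTP query API, which is handy for confirming a rule is truthy before (or after) pointing the agent at it. A small sketch, assuming a hypothetical host-mapped port of 19090; the actual port comes from the 13xxx/19xxx remapping in docker-compose.yml:

```python
# Evaluate an alert expression against Prometheus's /api/v1/query endpoint.
# PROM_URL is an assumption: substitute whatever port compose maps Prometheus to.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:19090"
EXPR = 'sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100'

url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

# A non-empty result means the sensor-flood expression is currently firing.
for series in result:
    print(series["metric"].get("sensor_id"), series["value"][1])
```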
Each make target is idempotent — re-running is safe.
```
make ollama-install    # install Ollama on the host (sudo prompts)
make ollama-config     # bind Ollama to 0.0.0.0:11434 + restart (sudo)
make ollama-pull       # pull the local LLM (qwen2.5:7b, ~5 GB)
make opensre-image     # clone upstream OpenSRE + build the CLI image
```

Or all four in one shot:

```
make bootstrap
```

`make opensre-image` clones Tracer-Cloud/opensre at v2026.4.25 by default. Override the repo or ref to build from a fork or branch:

```
make opensre-image OPENSRE_REPO=https://github.com/<fork>/opensre.git OPENSRE_REF=main
```

```
make build             # build the 8 CMS service images (first run only)
make up-sre            # CMS + observability + OpenSRE
make ps                # container status
make logs S=track-fusion
make down              # stop everything
make clean             # stop + wipe volumes
```

Host ports are remapped into the 13xxx/19xxx range to avoid clashing with other developer-machine observability stacks. Container-internal ports are unchanged.
Each scenario lives in scenarios/<name>/ with:
- `inject.sh` — applies the failure
- `cleanup.sh` — reverts it
- `alert.json` — the Grafana-style alert payload fed to OpenSRE
- `expected_rca.md` — ground-truth root cause + scoring rubric
Three fields look optional on the surface but materially change how OpenSRE plans the investigation:
- `alert_source: "grafana"` — required. Without it, `app/nodes/plan_actions/detect_sources.py` suppresses the Grafana tools for "non-Grafana alerts" and the agent ends up with no Loki / Tempo / Mimir handles at all. Every scenario's payload sets this explicitly.
- `pipeline_name` — used by `app/tools/GrafanaLogsTool` as the default `service_name` when building the LogQL query. If it is missing or set to a value that doesn't exist as a Loki label (e.g. the OpenSRE template default `events_fact`), the resulting `{service_name="events_fact"}` query returns nothing or 400s. Each scenario sets this to one of our actual `service_name` labels (`radar-sensor`, `track-fusion`, …).
- `commonAnnotations.scenario` — not interpreted by OpenSRE; this is a tag the runner reads when matching a result back to its ground-truth folder. Keep it equal to the directory name.
Other fields (title, alert_name, severity, commonLabels.*) are
the standard Grafana webhook shape; OpenSRE's
app/nodes/extract_alert/ reads what it can and falls back to
unknown for missing fields.
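Put together, an abbreviated payload has roughly this shape (shown as a Python literal rather than JSON; the field names are the ones discussed above, the values are illustrative and not copied from any shipped alert.json):

```python
# Abbreviated alert payload shape; illustrative values only.
alert_payload = {
    "alert_source": "grafana",            # required, or the Grafana tools are suppressed
    "pipeline_name": "track-fusion",      # must be a real Loki service_name label
    "title": "Tactical tracks collapsed",
    "alert_name": "tactical-tracks-collapsed",
    "severity": "warning",
    "commonLabels": {"job": "cms-services"},
    "commonAnnotations": {
        "scenario": "dds-qos-mismatch",   # read only by the runner; must equal the dir name
    },
}
```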
| # | Scenario | One-liner |
|---|---|---|
| 1 | `sensor-down` | Process gone — find the missing service |
| 2 | `fusion-cpu-starvation` | Consumer is the bottleneck while inputs look healthy |
| 3 | `dds-qos-mismatch` | Transport-layer silent message loss; hardest to diagnose |
| 4 | `message-flood` | Anomalous publish rate; fusion is the symptom, not the cause |
| 5 | `memory-leak` | Slow degradation, eventual OOM and cascade |
| 6 | `network-partition` | Container alive but unreachable — distinct from sensor-down |
Inject — docker stop radar-sensor. Sends SIGTERM to the radar
container; the process exits cleanly (the runtime registers a SIGTERM
handler that flips the stop event). The container is left in Exited
state, name still occupied, so cleanup can simply start it again.
Cleanup — docker start radar-sensor brings the same container back.
Signal that fires the alert — Prometheus scrape against
radar-sensor:9101 fails for more than 30 s; the rule
(1 - up{job="cms-services", instance=~"(radar|esm|eo)-sensor:.+"}) > 0
becomes truthy with instance="radar-sensor:9101".
Inject — docker update --cpus 0.05 track-fusion. Applies a new cgroup CPU
quota to the running container without a restart. Fusion still
processes, just at ~5% of one core, so the queue starts to back up
under the 45 msg/s aggregate sensor inflow.
Cleanup — docker update --cpus 0 track-fusion removes the quota
(0 means unlimited).
Signal that fires the alert —
histogram_quantile(0.95, sum(rate(cms_fusion_processing_seconds_bucket[1m])) by (le)) > 0.5,
i.e. fusion's p95 processing latency crosses the 500 ms SLO.
Inject — docker rm -f esm-sensor, then docker run a fresh ESM
container with the same name and image but with
CMS_FORCE_QOS=RELIABLE_KEEP_LAST. The shared library
(shared_lib/dds_io.py) reads that env var when
constructing writers and overrides the canonical
SourceTrackTopic profile (which is BEST_EFFORT). The reader on
track-fusion still asks for BEST_EFFORT, so CycloneDDS sees an
incompatible reliability pair: discovery succeeds, but no samples are
delivered. No application-level exception is raised.
Cleanup — docker stop esm-sensor, then
docker compose up -d esm-sensor. Compose recreates ESM under its
canonical config (no env-var override), restoring the matched QoS pair.
Signal that fires the alert — track-fusion still publishes tactical
tracks but their contributing_sensors list no longer contains
ESM-2; once the dropout is large enough, the
(3 - cms_fusion_active_tracks) > 0 rule fires (active tracks fall
below the steady-state of 3). The transport-layer hint (Cyclone's
INCOMPATIBLE_QOS warning) shows up only on container stderr.
Inject — docker rm -f radar-sensor, then docker run a fresh
radar container with PUBLISH_RATE_HZ=200. Twenty times the designed
10 Hz cadence; combined with three targets per scan that's ~600 msg/s
from radar instead of ~30, roughly 13× the steady-state aggregate sensor inflow.
Cleanup — docker stop radar-sensor, then
docker compose up -d radar-sensor. Compose recreates radar under its
canonical 10 Hz cadence.
Signal that fires the alert —
sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100
fires for sensor_id="RADAR-1" once the rate window catches up. Other
sensors stay at baseline, which is what makes the diagnosis "radar is
the cause, fusion is the symptom" rather than the other way around.
Inject — docker rm -f track-fusion, then docker run fresh with
--memory 256m and CMS_LEAK_RATE_BYTES=10240. Track-fusion's
main.py has a clearly-marked chaos hook: when the env var is set, it
appends a 10 KiB byte buffer to a long-lived list for every ingested
SourceTrack. With sensors aggregating ~45 msg/s, that's ~27 MB/min of
unfreed memory — a few minutes from boot to OOM under the 256 MiB cap.
Cleanup — docker stop track-fusion, then
docker compose up -d track-fusion. Compose recreates fusion under its
canonical config with no leak hook and no memory limit.
Signal that fires the alert — cms_fusion_queue_depth > 1000
when the inbound queue starts climbing as ingest stalls under memory
pressure. (The OOM kill itself produces a container restart; that
event is observable as a gap in fusion-side metrics.)
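For reference, the hook described above amounts to appending a fixed-size buffer per ingested track. A minimal sketch follows; it is not track-fusion's actual main.py, and only `CMS_LEAK_RATE_BYTES` and the append-per-ingest behaviour are taken from the description above:

```python
# Illustrative chaos hook; the real one lives in track-fusion's main.py.
import os

_LEAK_RATE_BYTES = int(os.environ.get("CMS_LEAK_RATE_BYTES", "0"))
_leaked_buffers: list[bytes] = []   # long-lived list that is never trimmed

def on_source_track_ingested(track) -> None:
    if _LEAK_RATE_BYTES:
        # ~10 KiB per SourceTrack at ~45 msg/s aggregate ≈ 27 MB/min of unfreed RSS
        _leaked_buffers.append(b"\x00" * _LEAK_RATE_BYTES)
    # ...normal correlation work continues here...
```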
Inject —
docker network disconnect opensre-distributed-rca_cms esm-sensor. The container
keeps running with all its sockets, but it is detached from the
docker network, so:
- Prometheus can't scrape `esm-sensor:9102`.
- The OpenTelemetry SDK can't reach the collector, so no logs / traces / metrics from ESM after the partition point.
- Other DDS participants on the bridge see ESM disappear from discovery and stop receiving samples.
This is the exact pair that distinguishes "process crashed" from "network detached": same external symptom (no data from ESM), very different remediation.
Cleanup — docker network connect opensre-distributed-rca_cms esm-sensor
re-attaches the container to the bridge; CycloneDDS rediscovers the
participant within ~10 s.
Signal that fires the alert — same up == 0 rule as sensor-down,
but for instance="esm-sensor:9102". The whole point of the scenario
is that a careless RCA will conflate the two while a rigorous one will
notice the container is still in running state.
DDS QoS mismatches are notorious in production: there is no application-level exception, no log line saying "I lost 100 messages". The only signals are transport-layer Cyclone warnings and the absence of expected sensors in the fused output. An AI SRE that correctly identifies this scenario is doing real distributed-systems debugging, not pattern matching on stack traces.
```
make run-all                           # all six scenarios, isolated
make run NAME=fusion-cpu-starvation    # one scenario
```

Both targets go through `scripts/run-all-investigations.sh`, which enforces
strict isolation: every scenario starts with a fresh
`docker compose down --volumes && up -d`, waits for the sensors to publish
1 msg/s (so the baseline is observable), injects the fault, waits 75 s for the alert window to fill, calls `opensre investigate` against the scenario's `alert.json`, copies the artefacts to `results/<scenario>/`, and runs `cleanup.sh`.
The per-scenario folder is wiped at the start of each run, so the artefacts always reflect the most recent execution. The investigation calls into:
- Prometheus for metric queries
- Loki for log search
- Tempo for trace lookup
- Grafana for alert metadata
- Ollama (host) for local LLM inference — no cloud calls
See opensre-config/integrations.json for how OpenSRE is wired to the four data backends.
A single opensre investigate run is an iterative loop. The phase
markers in results/<scenario>/run.log show what is happening:
```
Reading alert          → LLM extracts (alert_name, pipeline_name, severity)
                         from the JSON payload
Loading integrations   → opensre reads ~/.tracer/integrations.json,
                         validates Grafana endpoint, classifies into
                         grafana / grafana_local / etc.
Planning               → LLM proposes a list of tool calls to make
                         (Grafana alerts, Grafana Loki, Mimir, Tempo,
                         run-diagnostic-code, get-sre-guidance)
Gathering evidence     → opensre dispatches each tool call concurrently,
                         collects results (n logs / m traces / k metrics)
Diagnosing             → LLM reasons over collected evidence and emits a
                         confidence score (0-100)
                         ↳ if confidence is low or evidence is thin,
                           the loop returns to Planning with new tool
                           choices, up to a configured iteration limit
Investigation complete → final RCA report (root_cause + findings + cited
                         evidence with reproducible Grafana links) is
                         written to /tmp/rca.json
```
Typical durations on the reference hardware: Reading alert 15-20 s, each
Planning 20-45 s, each Diagnosing 15-30 s. OpenSRE caps the loop at 5
Planning/Diagnosing iterations; in this experiment every scenario hit
that cap (one of the iterations is consumed by the
run_diagnostic_code failure described in §8.2). Total wall-clock per
investigation lands at 217-249 s. With qwen2.5:7b the GPU stays at
~5 GB VRAM and ~30-90% utilisation throughout.
OpenSRE version under test: git v2026.4.25-5-g4e6c051 (HEAD a few
documentation commits past the v2026.4.25 tag); pyproject.toml
still self-reports 2026.4.5 because the release pipeline does not
bump that field on every tag.
`run_diagnostic_code` consistently fails with `TypeError: ... missing 1 required positional argument` during the Gathering phase. It reproduces under both `llama3.1:8b` and `qwen2.5:7b`, so the root cause is in OpenSRE's tool registry / decorator wiring rather than the model: the tool is registered via `@tool(...)` without `is_available` / `extract_params`, so it leaks into the planner's choice list and then crashes on dispatch when the runner has no LLM-supplied args to pass. Investigations still complete (the agent keeps planning around the failure), but it inflates the iteration count and shows up as a `[WARNING] Action failed` line in every `run.log`.
For each run, the artefacts are at results/<scenario>/:
inject.log, run.log (full investigation transcript), rca.json
(structured RCA), and cleanup.log.
| Scenario | Expected | OpenSRE actual | Iter. | Conf. | Score |
|---|---|---|---|---|---|
| `sensor-down` | radar-sensor stopped → restart | "service started and signal received… no further log entries or metrics indicating when the problem occurred" | 5 | 100% | 1/3: cited signal_received but didn't query heartbeats / metrics that would prove the gap; no remediation |
| `fusion-cpu-starvation` | fusion CPU-bound → relax CPU quota or scale | "tactical_track_contributors_changed events indicate dynamic track adjustments, but lack of performance metrics… does not reveal a definitive root cause" | 5 | 87% | 1/4: ✅ named track-fusion, used the new tactical_track_contributors_changed event we added but misinterpreted it as the cause; never queried cms_fusion_processing_seconds |
| `message-flood` | radar publishing > 100 msg/s → revert config | "the publish rate of sensor data increased rapidly beyond the target, leading to overruns and heartbeats indicating dropped messages" | 5 | 100% | 3/5: ✅ correct service + ✅ correctly compared sensor_started target rate vs sensor_heartbeat observed rate (added by this experiment); ❌ no remediation |
| `network-partition` | esm container alive but network detached → reconnect | "the sensor service failed to initialize properly due to missing configuration" | 5 | 100% | 1/4 with misattribution: ✅ named esm-sensor; ❌ inferred "init failure" instead of network detach — agent saw the alert hint "container itself reports running" in the JSON but did not weight it |
| `dds-qos-mismatch` | DDS QoS mismatch on esm publisher → align QoS | "Change in Tactical Track Contributors" | 5 | 100% | 2/5, no hallucinations: ✅ named track-fusion + ✅ cited the tactical_track_contributors_changed event we added (the canonical signal for this scenario); ❌ stopped at "contributors changed" rather than reaching "QoS mismatch" |
| `memory-leak` | track-fusion memory grows linearly → profile / rollback | "there is an imbalance in track contributions that leads to queue overflow" — also cited aws_batch_jobs as evidence (hallucinated source) | 5 | 75% | 1/5 + 1 hallucination: 75 s wait window is too short for the 10 KiB-per-track leak to push RSS to the 256 MiB cap, so the actual signature was absent; agent fell back to fitting whatever signal was there |
Scoring rubric (also in each expected_rca.md):
- +1 for naming the right service
- +1 for citing the right evidence (metric / log / trace)
- +1 for suggesting the right remediation
- −1 or −2 for misattributions
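Applied mechanically, the rubric is a three-term sum minus a penalty. A small helper for reproducing the per-scenario numerators; the function and parameter names are ours, not part of the experiment's tooling:

```python
# Mechanical application of the scoring rubric above (names are illustrative).
def score_rca(named_right_service: bool,
              cited_right_evidence: bool,
              suggested_right_remediation: bool,
              misattribution_penalty: int = 0) -> int:
    """misattribution_penalty is 0, 1 or 2 depending on severity."""
    score = named_right_service + cited_right_evidence + suggested_right_remediation
    return score - misattribution_penalty

# Example: right service and right evidence, no remediation, no misattribution → 2.
assert score_rca(True, True, False) == 2
```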
Caveat — this is one run, not a study. Each of the six scenarios was injected and investigated exactly once under the conditions described in §6 and §8. The findings below describe what a single end-to-end execution produced; nothing here is a statistical claim. A later iteration of this experiment would re-run the matrix multiple times per scenario and per model to get distributions instead of point observations.
- 6/6 scenarios completed end-to-end without infrastructure errors (clean inject/cleanup logs, no container restart conflicts, every artefact written, every reset_lab gate passed).
- 4/6 scenarios returned a final confidence of 100% (`sensor-down`, `message-flood`, `network-partition`, `dds-qos-mismatch`); 87% for `fusion-cpu-starvation`, 75% for `memory-leak`. Confidence is poorly calibrated — high confidence does not imply the conclusion is right (network-partition was 100% confident in a wrong answer).
- Each scenario hit the OpenSRE-internal 5-iteration planning cap, with `run_diagnostic_code` consuming one of those iterations per run on the TypeError described in §8.2.
Two scenarios produced substantive RCAs that depended directly on signals this experiment adds:
- `message-flood` — the agent put together the `sensor_started` log (which records the target rate, e.g. 10 Hz) and the `sensor_heartbeat` log (observed rate every 30 s). Its conclusion cites "publish rate increased rapidly beyond the target, leading to overruns and heartbeats indicating dropped messages." Without those two structured events the agent would have had only a scrape rate to look at, which it has shown elsewhere it does not query.
- `dds-qos-mismatch` — the agent cited the `tactical_track_contributors_changed` log (a contributing-sensor set changed between consecutive publishes) and named track-fusion correctly. This is exactly the signal we added for this scenario. The conclusion stops at "contributors changed" rather than reaching "QoS mismatch", but the agent is now at least chasing the right thread.
This experiment generates rich, current evidence in all four backends (Loki, Tempo, Prometheus, Grafana alerts). The pattern in this single run is that the bottleneck is the model's reasoning, not the data:
- Stops at the first plausible cause. `dds-qos-mismatch` reaches "contributors changed" but never asks "why did they change?". A larger or more capable model would likely chain one more inference step into the QoS layer.
- Heavy log preference, neglects metrics. `fusion-cpu-starvation` has a histogram (`cms_fusion_processing_seconds`) that would prove the SLO breach in one query, and we exposed it specifically for this scenario. The agent never asked Mimir for it. Logs are the dominant tool the planner reaches for; histograms / gauges are selected almost as an afterthought.
- Honours alert payload structured fields, not its prose. The `commonAnnotations.summary` and free-text `context` field in `alert.json` carry useful hints (e.g. "container itself reports running but unreachable" for network-partition), but qwen2.5:7b collapses these into the basic (alert_name, pipeline, severity) triple before reasoning, so the prose hint never reaches the planner.
- One hallucinated evidence source. Memory-leak's report references `aws_batch_jobs`, which does not exist anywhere in this experiment. This is the only fabricated source in the run.
- Slow-degradation scenarios need their own time budget. `memory-leak` is a 5-10 min phenomenon but our 75 s alert-window wait is shared across all scenarios. Without a longer wait the leak's signature simply is not yet present when the agent looks.
- Try a larger / more capable model. Same setup, same scenarios, swap `OLLAMA_MODEL` to `qwen2.5:14b` or wire up a cloud Anthropic / OpenAI key. The instrumentation in this experiment gives any model a fighting chance; the open question is whether a bigger one actually uses it. Highest-leverage next step.
- Run each scenario several times. This run is a single point-observation per scenario; a robust evaluation needs distributions. With the runner already in place, looping the matrix 3-5× per model is mechanical.
- Per-scenario wait window. Let the runner read a `wait_seconds` from `scenarios/<name>/inject.sh` (or a sibling `meta.yaml`) instead of a fixed 75 s, so slow-degradation scenarios (memory-leak) actually surface their signature before the agent looks.
- Keep observability state between scenarios. Strict isolation improves reproducibility but starves the agent of the cumulative historical context that real investigations rely on. Resetting only the CMS-service in-memory state (not Loki/Prom/Tempo volumes) is closer to a real production deployment.
- Not production hardened (no HA, no secrets management, no RBAC).
- Not a real CMS — physics, sensor models, and engagement logic are deliberately small for clarity.
- Not multi-host — DDS discovery is configured for the local docker bridge.