A self-contained experiment that exercises OpenSRE — an AI-driven SRE agent — against a realistic distributed system. The system under test is a Combat Management System (CMS) simulation that runs on a single host in Docker, instrumented end-to-end so OpenSRE has the same telemetry surfaces a human SRE would see.
A CMS is a representative example of a hard-to-debug distributed system: many loosely coupled services exchange messages over a transport (DDS) under strict QoS contracts, with mission-critical SLOs and failure modes that span the process, transport, and content layers. This experiment gives OpenSRE a concrete environment in which to:
- observe a system that produces real metrics, structured logs, and distributed traces;
- ingest a Grafana-style alert that points at a fault we have ground truth for;
- carry out an investigation against the same telemetry surfaces a real SRE would use (Prometheus, Loki, Tempo, Grafana), and emit a structured RCA;
- let us compare its conclusion to the known root cause stored next to the chaos injector, so we can score the agent across a matrix of scenarios.
Six chaos scenarios cover the full spectrum from "a service crashed" to "the transport silently drops messages because of a QoS mismatch" — the last is intentionally hard, because that class of failure has no application-level exception and tests whether an AI SRE can correlate across signals.
These results were produced on:
| Component | Value |
|---|---|
| OS | Ubuntu 24.04.3 LTS, kernel 6.17.0-22 |
| CPU | AMD Ryzen 9 4900H, 8 cores / 16 threads |
| RAM | 16 GB |
| GPU | NVIDIA GeForce GTX 1660 Ti (6 GB VRAM), driver 580.126.09 |
| Docker | snap-installed Docker 28.4.0 |
| Local LLM | Ollama on the host, model qwen2.5:7b |
Ollama runs on the host (systemd service) so it can drive the GPU
natively. The OpenSRE container reaches it via host.docker.internal:11434.
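A quick way to confirm that wiring from inside any container is to hit Ollama's model-listing endpoint. A minimal sketch, assuming Ollama's default REST API (`GET /api/tags`); the URL simply restates the host/port above:

```python
# Sanity-check that a container can reach the host-side Ollama instance.
# Uses only the standard library; adjust the host/port if your mapping differs.
import json
import urllib.request

OLLAMA_URL = "http://host.docker.internal:11434/api/tags"

with urllib.request.urlopen(OLLAMA_URL, timeout=5) as resp:
    models = json.load(resp).get("models", [])

print("reachable, models available:", [m.get("name") for m in models])
```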
```
opensre-distributed-rca/
├── README.md
├── docker-compose.yml                  # Full stack
├── Dockerfile                          # CMS service image, parameterised by SERVICE_NAME
├── Dockerfile.opensre                  # OpenSRE CLI image
├── Makefile                            # make bootstrap | up-sre | run-all | ...
├── pyproject.toml
├── shared/
│   ├── messages.idl                    # Canonical DDS message reference
│   └── cyclonedds.xml                  # Domain-wide CycloneDDS config
├── shared_lib/                         # Common Python: messages, DDS facade, telemetry,
│                                       #   logging, runtime, trace propagation, world model
├── services/                           # Eight runnable services (see §4.2)
├── observability/
│   ├── otel-collector/config.yaml
│   ├── prometheus/prometheus.yml
│   ├── loki/loki-config.yaml
│   ├── tempo/tempo.yaml
│   └── grafana/
│       ├── dashboards/cms-overview.json
│       └── provisioning/{datasources,dashboards,alerting}/
├── scenarios/                          # 6 chaos scenarios + alert payloads
├── scripts/run-all-investigations.sh
├── opensre-config/integrations.json
└── results/                            # Artefacts from each opensre investigate run
```
An air-defence Combat Management System with three sensor classes that fuse their detections into a single tactical picture, classify threats, issue mission orders, and engage targets. Three hostile aircraft (one classified MISSILE, two AIRCRAFT) fly deterministic circular paths at altitude. Each sensor independently observes them with its own noise and detection profile, and track-fusion correlates the noisy detections into a single tactical track per real target.
```
radar-sensor ─┐
esm-sensor   ─┼─► track-fusion ─► threat-evaluator ─► command-center ─► effector-manager
eo-sensor    ─┘         │                                                      │
                        ▼                                                      ▼
               tactical-display ◄───────────────────────────────────────── (events)
```
Eight Python services communicate over CycloneDDS. Every published message
carries a cms.TraceContext field so distributed traces survive the DDS
hop and reconstruct end-to-end in Tempo.
| Service | Rate / role | Publishes | Subscribes |
|---|---|---|---|
| `radar-sensor` | 10 Hz active radar; low position noise; no fine classification | SourceTrack | — |
| `esm-sensor` | 5 Hz passive RF; wider position noise; classifies emitter | SourceTrack | — |
| `eo-sensor` | 2 Hz electro-optical; tight position when it detects (70%) | SourceTrack | — |
| `track-fusion` | Nearest-neighbour correlation, queue-based | TacticalTrack | SourceTrack |
| `threat-evaluator` | Rule-based scoring (speed, altitude, classification) | ThreatAssessment | TacticalTrack |
| `command-center` | Issues ENGAGE / MONITOR orders against threats | MissionOrder | ThreatAssessment |
| `effector-manager` | Simulates engagement lifecycle (ACCEPTED → IN_FLIGHT → HIT/MISS) | EngagementStatus | MissionOrder |
| `tactical-display` | Operator REST surface; aggregates the live tactical picture | (HTTP) | every CMS topic |
Wired by shared_lib/runtime.py and shared_lib/telemetry.py, so they
fire from every service regardless of business logic.
- Logs: `service_starting`, `signal_received`, `service_stopping`, `dds_writer_ready` (with `env_qos_override` flag), `dds_subscriber_ready`.
- Metrics: each service exposes a Prometheus `/metrics` endpoint on port `910x`; the OTel SDK additionally pushes the same metrics over OTLP to the collector for redundancy.
- Traces: every span carries `service.name` (the resource attribute) and the inbound trace context is continued on every DDS hop.
Beyond the common floor, each service emits the following:
Three sensors share shared_lib/sensor_runtime.py, so their signal set
is identical (only the values of sensor_id and sensor_type differ).
| Signal | Type | Notes |
|---|---|---|
| `cms_sensor_published_total{sensor_id}` | Counter | Source tracks written |
| `cms_sensor_publish_seconds{sensor_id}` | Histogram | Wall-clock per write |
| `cms_sensor_undetected_total{sensor_id}` | Counter | Targets missed by detection probability |
| `sensor_started` | Log | One-shot on boot; logs target rate + detection probability |
| `sensor_heartbeat` | Log | Every 30 s; logs publish_rate_hz_observed + drop_rate_hz_observed |
| `sensor_loop_overrun` | Log | Warning when a scan can't keep cadence |
| `sensor.scan.<sensor_id>` | Span | One per scan; attributes: sensor.id, sensor.type, scan.target_count |
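For orientation, a minimal sketch of how a heartbeat like this could be computed in a sensor loop. It does not reproduce `shared_lib/sensor_runtime.py`; only the `sensor_heartbeat` event name and its two rate fields come from the table above, while the `HeartbeatTracker` class and its wiring are hypothetical:

```python
# Illustrative heartbeat bookkeeping, not the actual shared_lib/sensor_runtime.py.
import time
import structlog

log = structlog.get_logger()
HEARTBEAT_INTERVAL_S = 30.0

class HeartbeatTracker:
    """Counts publishes/drops and emits a sensor_heartbeat log every 30 s."""

    def __init__(self, sensor_id: str):
        self.sensor_id = sensor_id
        self.published = 0
        self.dropped = 0
        self.window_start = time.monotonic()

    def record(self, published: bool) -> None:
        if published:
            self.published += 1
        else:
            self.dropped += 1
        elapsed = time.monotonic() - self.window_start
        if elapsed >= HEARTBEAT_INTERVAL_S:
            log.info(
                "sensor_heartbeat",
                sensor_id=self.sensor_id,
                publish_rate_hz_observed=round(self.published / elapsed, 2),
                drop_rate_hz_observed=round(self.dropped / elapsed, 2),
            )
            self.published = self.dropped = 0
            self.window_start = time.monotonic()
```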
`track-fusion` is the most heavily instrumented service — it sits on the critical correlation path and is also the chaos target for half the scenarios.
| Signal | Type | Notes |
|---|---|---|
| `cms_fusion_ingested_total{sensor_id}` | Counter | Source tracks ingested per sensor |
| `cms_fusion_published_total` | Counter | Tactical tracks published |
| `cms_fusion_processing_seconds` | Histogram | Per-track correlation wall-clock |
| `cms_fusion_queue_depth` | Gauge | Pending source tracks awaiting fusion |
| `cms_fusion_active_tracks` | Gauge | Currently maintained tactical tracks |
| `cms_fusion_memory_bytes` | Gauge | Process RSS, sampled ~1 Hz at full ingest |
| `cms_fusion_contributors_changed_total{change}` | Counter | change ∈ {added, removed} |
| `fusion_queue_full_dropping` | Log | Warning when inbox is saturated |
| `tactical_track_contributors_changed` | Log | Info; emitted when a tactical track's contributing-sensor set changes between publishes |
| `fusion.ingest` | Span | Continues the inbound sensor span; attribute: tactical_track_id |
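The metric names above map onto standard Prometheus instrument types. A hedged sketch of how they could be declared with `prometheus_client` is shown below; the real services wire metrics through `shared_lib/telemetry.py` and the OTel SDK, so treat this purely as a naming/type illustration, and note the port is a hypothetical pick from the 910x range:

```python
# Illustration of the fusion metric surface using prometheus_client.
# Only the metric names, types, and labels are taken from the table above;
# the actual services define these via shared_lib/telemetry.py (OTel SDK).
from prometheus_client import Counter, Gauge, Histogram, start_http_server

INGESTED = Counter("cms_fusion_ingested_total", "Source tracks ingested", ["sensor_id"])
PUBLISHED = Counter("cms_fusion_published_total", "Tactical tracks published")
PROCESSING = Histogram("cms_fusion_processing_seconds", "Per-track correlation wall-clock")
QUEUE_DEPTH = Gauge("cms_fusion_queue_depth", "Pending source tracks awaiting fusion")
ACTIVE_TRACKS = Gauge("cms_fusion_active_tracks", "Currently maintained tactical tracks")
MEMORY_BYTES = Gauge("cms_fusion_memory_bytes", "Process RSS in bytes")
CONTRIBUTORS_CHANGED = Counter(
    "cms_fusion_contributors_changed_total", "Contributor set changes", ["change"]
)

if __name__ == "__main__":
    start_http_server(9104)          # hypothetical port in the 910x range
    with PROCESSING.time():          # records one correlation pass
        INGESTED.labels(sensor_id="RADAR-1").inc()
        QUEUE_DEPTH.set(0)
```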
`threat-evaluator`:

| Signal | Type | Notes |
|---|---|---|
| `cms_threat_evaluated_total{level}` | Counter | Per-level evaluations |
| `cms_threat_processing_seconds` | Histogram | Per-track evaluation wall-clock |
| `cms_threat_level_transitions_total{from_level, to_level}` | Counter | Per-track level transitions |
| `threat_level_changed` | Log | Info; track + prior level + new level + rationale |
| `threat.evaluate` | Span | Continues the fusion span; attributes: threat.level, threat.priority |
`command-center`:

| Signal | Type | Notes |
|---|---|---|
| `cms_orders_issued_total{action}` | Counter | action ∈ {ENGAGE, MONITOR} |
| `cms_command_processing_seconds` | Histogram | Per-order build + write wall-clock |
| `mission_order_issued` | Log | Info; order_id + action + target + threat_level |
| `command.issue_order` | Span | Continues the threat-evaluator span |
`effector-manager`:

| Signal | Type | Notes |
|---|---|---|
| `cms_orders_received_total{action}` | Counter | Mission orders received |
| `cms_engagement_outcomes_total{outcome}` | Counter | outcome ∈ {HIT, MISS} |
| `cms_engagement_transitions_total{status}` | Counter | status ∈ {ACCEPTED, IN_FLIGHT, HIT, MISS} |
| `cms_engagement_duration_seconds` | Histogram | Total lifecycle wall-clock; custom buckets up to 30 s |
| `engagement_status_changed` | Log | Info on every transition |
| `engagement_resolved` | Log | Info on the final HIT/MISS |
| `effector.engage` | Span | Continues the command-center span; attributes: order.id, order.action |
`tactical-display`:

| Signal | Type | Notes |
|---|---|---|
| `cms_display_received_total{topic}` | Counter | Per-topic ingest count |
| `cms_display_handler_seconds{topic}` | Histogram | Per-message handler wall-clock |
| `cms_display_active_tracks` | Gauge | Tracks visible in the latest snapshot |
| `http_starting` | Log | Info on boot with HTTP port |
| `display.ingest.<topic>` | Span | Continues the inbound trace; one per ingested message |
QoS profiles are centralised in shared_lib/dds_io.py.
The dds-qos-mismatch scenario uses an env-var hook (CMS_FORCE_QOS=RELIABLE_KEEP_LAST)
to override one writer's profile so it stops matching its consumer's reader.
| Topic | Reliability | History | Durability |
|---|---|---|---|
| `SourceTrackTopic` | BEST_EFFORT | KEEP_LAST(10) | VOLATILE |
| `TacticalTrackTopic` | RELIABLE | KEEP_LAST(50) | VOLATILE |
| `ThreatAssessmentTopic` | RELIABLE | KEEP_LAST(20) | VOLATILE |
| `MissionOrderTopic` | RELIABLE | KEEP_ALL | TRANSIENT_LOCAL |
| `EngagementStatusTopic` | RELIABLE | KEEP_LAST(20) | VOLATILE |
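A minimal sketch of the profile selection that the env-var hook implies. It deliberately avoids the CycloneDDS Python API and is not the real `shared_lib/dds_io.py`; only the profile contents and the `CMS_FORCE_QOS` value come from this document, and the forced history depth is illustrative:

```python
# Illustrative writer-profile selection; not the real shared_lib/dds_io.py.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class QosProfile:
    reliability: str   # "BEST_EFFORT" | "RELIABLE"
    history: str       # "KEEP_LAST(n)" | "KEEP_ALL"
    durability: str    # "VOLATILE" | "TRANSIENT_LOCAL"

CANONICAL = {
    "SourceTrackTopic":      QosProfile("BEST_EFFORT", "KEEP_LAST(10)", "VOLATILE"),
    "TacticalTrackTopic":    QosProfile("RELIABLE",    "KEEP_LAST(50)", "VOLATILE"),
    "ThreatAssessmentTopic": QosProfile("RELIABLE",    "KEEP_LAST(20)", "VOLATILE"),
    "MissionOrderTopic":     QosProfile("RELIABLE",    "KEEP_ALL",      "TRANSIENT_LOCAL"),
    "EngagementStatusTopic": QosProfile("RELIABLE",    "KEEP_LAST(20)", "VOLATILE"),
}

def writer_profile(topic: str) -> QosProfile:
    """Return the canonical profile unless the chaos hook forces an override."""
    if os.environ.get("CMS_FORCE_QOS") == "RELIABLE_KEEP_LAST":
        # dds-qos-mismatch scenario: the writer stops matching its consumer's reader.
        return QosProfile("RELIABLE", "KEEP_LAST(10)", CANONICAL[topic].durability)
    return CANONICAL[topic]
```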
The observability layer is what OpenSRE queries during an investigation.
OpenTelemetry is the in-process instrumentation contract for every CMS
service. The OTel Python SDK (initialised once in shared_lib/telemetry.py)
emits three signal types from inside the service:
- Traces — every DDS hand-off (sensor scan → fusion ingest → threat evaluate → mission order → engagement) is a span; the W3C trace context is serialised into the `cms.TraceContext` field of each DDS message so the trace is not severed by the transport (see `shared_lib/trace_propagation.py`, and the sketch below).
- Metrics — service counters, latency histograms, and queue-depth gauges defined alongside the business logic.
- Logs — Python's stdlib logging is wrapped by structlog, then forwarded as OTel LogRecords (`shared_lib/logging_config.py`).
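The carriage pattern is standard OpenTelemetry context propagation with the DDS message acting as the carrier. A minimal sketch, assuming a dict-shaped stand-in for the `cms.TraceContext` field; the real serialisation lives in `shared_lib/trace_propagation.py` and may differ:

```python
# Illustrative W3C trace-context carriage over a DDS message field.
# The propagate API is standard OpenTelemetry; the helper names and the
# dict-shaped cms.TraceContext stand-in are assumptions.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("cms.example")

def publish_with_trace_context(payload: dict) -> dict:
    """Publisher side: serialise the current span context into the message."""
    carrier: dict[str, str] = {}
    inject(carrier)                        # writes traceparent/tracestate entries
    payload["trace_context"] = carrier     # stand-in for the cms.TraceContext field
    return payload

def handle_with_trace_context(message: dict) -> None:
    """Subscriber side: continue the publisher's trace instead of starting a new one."""
    ctx = extract(message.get("trace_context", {}))
    with tracer.start_as_current_span("fusion.ingest", context=ctx):
        pass  # correlation work happens here
```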
All three signals are pushed over OTLP gRPC to the OpenTelemetry Collector, which is the only ingress point for telemetry. The collector then fans out to the storage backends:
- traces → Tempo
- metrics → Prometheus via remote-write (and scraped directly from each service's `/metrics` for redundancy)
- logs → Loki via OTLP/HTTP
Grafana sits on top with provisioned datasources, a CMS dashboard, and the alert rules that drive each scenario's investigation. Alert queries are listed in §4.5.
Retention windows (matter for OpenSRE's time-bounded queries):
| Backend | Retention |
|---|---|
| Prometheus TSDB | 24 h (--storage.tsdb.retention.time=24h) |
| Tempo | 24 h (block_retention: 24h) |
| Loki | 168 h (the Loki default; reject_old_samples_max_age: 168h) |
Provisioned in
observability/grafana/provisioning/alerting/alerts.yaml.
All seven rules query Prometheus over a 2 min look-back window. Most chaos
scenarios are covered by a single rule; dds-qos-mismatch and memory-leak
fire two complementary rules at different signal strengths (a leading
"early" rule plus a lagging "cascade" rule), so OpenSRE can be tested both
on the early-warning signal and on the post-cascade evidence.
| Rule (uid) | PromQL | for | Severity | Scenario | Signal kind |
|---|---|---|---|---|---|
| `sensor-down` | `(1 - up{job="cms-services", instance=~"(radar\|esm\|eo)-sensor:.+"}) > 0` | 30 s | critical | sensor-down, network-partition | scrape-failure (process down OR unreachable) |
| `fusion-latency-high` | `histogram_quantile(0.95, sum(rate(cms_fusion_processing_seconds_bucket[1m])) by (le)) > 0.5` | 1 m | critical | fusion-cpu-starvation | SLO breach |
| `tactical-tracks-collapsed` | `(3 - cms_fusion_active_tracks) > 0` | 1 m | warning | dds-qos-mismatch | cascade — fewer fused tracks than expected |
| `sensor-flood` | `sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100` | 30 s | critical | message-flood | per-sensor rate anomaly |
| `fusion-queue-deep` | `cms_fusion_queue_depth > 1000` | 1 m | warning | memory-leak | cascade — back-pressure |
| `fusion-contributors-dropped` | `sum(increase(cms_fusion_contributors_changed_total{change="removed"}[2m])) > 0` | 30 s | warning | dds-qos-mismatch | early — a sensor disappeared from a fused track |
| `fusion-memory-elevated` | `cms_fusion_memory_bytes > 200000000` | 1 m | warning | memory-leak | early — RSS approaching the 256 MiB chaos cap |
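Any of these expressions can be evaluated by hand against Prometheus's HTTP query API, which is handy for confirming a rule is truthy before (or after) pointing the agent at it. A small sketch, assuming a hypothetical host-mapped port of 19090; the actual port comes from the 13xxx/19xxx remapping in docker-compose.yml:

```python
# Evaluate an alert expression against Prometheus's /api/v1/query endpoint.
# PROM_URL is an assumption: substitute whatever port compose maps Prometheus to.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:19090"
EXPR = 'sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100'

url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": EXPR})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

# A non-empty result means the sensor-flood expression is currently firing.
for series in result:
    print(series["metric"].get("sensor_id"), series["value"][1])
```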
Each make target is idempotent — re-running is safe.
```
make ollama-install    # install Ollama on the host (sudo prompts)
make ollama-config     # bind Ollama to 0.0.0.0:11434 + restart (sudo)
make ollama-pull       # pull the local LLM (qwen2.5:7b, ~5 GB)
make opensre-image     # clone upstream OpenSRE + build the CLI image
```

Or all four in one shot:

```
make bootstrap
```

`make opensre-image` clones Tracer-Cloud/opensre at v2026.4.25 by default. Override the repo or ref to build from a fork or branch:

```
make opensre-image OPENSRE_REPO=https://github.com/<fork>/opensre.git OPENSRE_REF=main
```

```
make build             # build the 8 CMS service images (first run only)
make up-sre            # CMS + observability + OpenSRE
make ps                # container status
make logs S=track-fusion
make down              # stop everything
make clean             # stop + wipe volumes
```

Host ports are remapped into the 13xxx/19xxx range to avoid clashing with other developer-machine observability stacks. Container-internal ports are unchanged.
Each scenario lives in scenarios/<name>/ with:
- `inject.sh` — applies the failure
- `cleanup.sh` — reverts it
- `alert.json` — the Grafana-style alert payload fed to OpenSRE
- `expected_rca.md` — ground-truth root cause + scoring rubric
Three fields look optional on the surface but materially change how OpenSRE plans the investigation:
- `alert_source: "grafana"` — required. Without it, `app/nodes/plan_actions/detect_sources.py` suppresses the Grafana tools for "non-Grafana alerts" and the agent ends up with no Loki / Tempo / Mimir handles at all. Every scenario's payload sets this explicitly.
- `pipeline_name` — used by `app/tools/GrafanaLogsTool` as the default `service_name` when building the LogQL query. If it is missing or set to a value that doesn't exist as a Loki label (e.g. the OpenSRE template default `events_fact`), the resulting `{service_name="events_fact"}` query returns nothing or 400s. Each scenario sets this to one of our actual `service_name` labels (`radar-sensor`, `track-fusion`, …).
- `commonAnnotations.scenario` — not interpreted by OpenSRE; this is a tag the runner reads when matching a result back to its ground-truth folder. Keep it equal to the directory name.
Other fields (title, alert_name, severity, commonLabels.*) are
the standard Grafana webhook shape; OpenSRE's
app/nodes/extract_alert/ reads what it can and falls back to
unknown for missing fields.
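Put together, an abbreviated payload has roughly this shape (shown as a Python literal rather than JSON; the field names are the ones discussed above, the values are illustrative and not copied from any shipped alert.json):

```python
# Abbreviated alert payload shape; illustrative values only.
alert_payload = {
    "alert_source": "grafana",            # required, or the Grafana tools are suppressed
    "pipeline_name": "track-fusion",      # must be a real Loki service_name label
    "title": "Tactical tracks collapsed",
    "alert_name": "tactical-tracks-collapsed",
    "severity": "warning",
    "commonLabels": {"job": "cms-services"},
    "commonAnnotations": {
        "scenario": "dds-qos-mismatch",   # read only by the runner; must equal the dir name
    },
}
```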
| # | Scenario | One-liner |
|---|---|---|
| 1 | `sensor-down` | Process gone — find the missing service |
| 2 | `fusion-cpu-starvation` | Consumer is the bottleneck while inputs look healthy |
| 3 | `dds-qos-mismatch` | Transport-layer silent message loss; hardest to diagnose |
| 4 | `message-flood` | Anomalous publish rate; fusion is the symptom, not the cause |
| 5 | `memory-leak` | Slow degradation, eventual OOM and cascade |
| 6 | `network-partition` | Container alive but unreachable — distinct from sensor-down |
Inject — docker stop radar-sensor. Sends SIGTERM to the radar
container; the process exits cleanly (the runtime registers a SIGTERM
handler that flips the stop event). The container is left in Exited
state, name still occupied, so cleanup can simply start it again.
Cleanup — docker start radar-sensor brings the same container back.
Signal that fires the alert — Prometheus scrape against
radar-sensor:9101 fails for more than 30 s; the rule
(1 - up{job="cms-services", instance=~"(radar|esm|eo)-sensor:.+"}) > 0
becomes truthy with instance="radar-sensor:9101".
Inject — docker update --cpus 0.05 track-fusion. Applies a new cgroup CPU
quota to the running container without a restart. Fusion still
processes, just at ~5% of one core, so the queue starts to back up
under the 45 msg/s aggregate sensor inflow.
Cleanup — docker update --cpus 0 track-fusion removes the quota
(0 means unlimited).
Signal that fires the alert —
histogram_quantile(0.95, sum(rate(cms_fusion_processing_seconds_bucket[1m])) by (le)) > 0.5,
i.e. fusion's p95 processing latency crosses the 500 ms SLO.
Inject — docker rm -f esm-sensor, then docker run a fresh ESM
container with the same name and image but with
CMS_FORCE_QOS=RELIABLE_KEEP_LAST. The shared library
(shared_lib/dds_io.py) reads that env var when
constructing writers and overrides the canonical
SourceTrackTopic profile (which is BEST_EFFORT). The reader on
track-fusion still asks for BEST_EFFORT, so CycloneDDS sees an
incompatible reliability pair: discovery succeeds, but no samples are
delivered. No application-level exception is raised.
Cleanup — docker stop esm-sensor, then
docker compose up -d esm-sensor. Compose recreates ESM under its
canonical config (no env-var override), restoring the matched QoS pair.
Signal that fires the alert — track-fusion still publishes tactical
tracks but their contributing_sensors list no longer contains
ESM-2; once the dropout is large enough, the
(3 - cms_fusion_active_tracks) > 0 rule fires (active tracks fall
below the steady-state of 3). The transport-layer hint (Cyclone's
INCOMPATIBLE_QOS warning) shows up only on container stderr.
Inject — docker rm -f radar-sensor, then docker run a fresh
radar container with PUBLISH_RATE_HZ=200. Twenty times the designed
10 Hz cadence; combined with three targets per scan that's ~600 msg/s
from radar instead of ~30, roughly 13× the steady-state aggregate sensor inflow.
Cleanup — docker stop radar-sensor, then
docker compose up -d radar-sensor. Compose recreates radar under its
canonical 10 Hz cadence.
Signal that fires the alert —
sum by (sensor_id) (rate(cms_sensor_published_total[1m])) > 100
fires for sensor_id="RADAR-1" once the rate window catches up. Other
sensors stay at baseline, which is what makes the diagnosis "radar is
the cause, fusion is the symptom" rather than the other way around.
Inject — docker rm -f track-fusion, then docker run fresh with
--memory 256m and CMS_LEAK_RATE_BYTES=10240. Track-fusion's
main.py has a clearly-marked chaos hook: when the env var is set, it
appends a 10 KiB byte buffer to a long-lived list for every ingested
SourceTrack. With sensors aggregating ~45 msg/s, that's ~27 MB/min of
unfreed memory — a few minutes from boot to OOM under the 256 MiB cap.
Cleanup — docker stop track-fusion, then
docker compose up -d track-fusion. Compose recreates fusion under its
canonical config with no leak hook and no memory limit.
Signal that fires the alert — cms_fusion_queue_depth > 1000
when the inbound queue starts climbing as ingest stalls under memory
pressure. (The OOM kill itself produces a container restart; that
event is observable as a gap in fusion-side metrics.)
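For reference, the hook described above amounts to appending a fixed-size buffer per ingested track. A minimal sketch follows; it is not track-fusion's actual main.py, and only `CMS_LEAK_RATE_BYTES` and the append-per-ingest behaviour are taken from the description above:

```python
# Illustrative chaos hook; the real one lives in track-fusion's main.py.
import os

_LEAK_RATE_BYTES = int(os.environ.get("CMS_LEAK_RATE_BYTES", "0"))
_leaked_buffers: list[bytes] = []   # long-lived list that is never trimmed

def on_source_track_ingested(track) -> None:
    if _LEAK_RATE_BYTES:
        # ~10 KiB per SourceTrack at ~45 msg/s aggregate ≈ 27 MB/min of unfreed RSS
        _leaked_buffers.append(b"\x00" * _LEAK_RATE_BYTES)
    # ...normal correlation work continues here...
```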
Inject —
docker network disconnect opensre-distributed-rca_cms esm-sensor. The container
keeps running with all its sockets, but it is detached from the
docker network, so:
- Prometheus can't scrape `esm-sensor:9102`.
- The OpenTelemetry SDK can't reach the collector, so no logs / traces / metrics from ESM after the partition point.
- Other DDS participants on the bridge see ESM disappear from discovery and stop receiving samples.
This is the exact pair that distinguishes "process crashed" from "network detached": same external symptom (no data from ESM), very different remediation.
Cleanup — docker network connect opensre-distributed-rca_cms esm-sensor
re-attaches the container to the bridge; CycloneDDS rediscovers the
participant within ~10 s.
Signal that fires the alert — same up == 0 rule as sensor-down,
but for instance="esm-sensor:9102". The whole point of the scenario
is that a careless RCA will conflate the two while a rigorous one will
notice the container is still in running state.
DDS QoS mismatches are notorious in production: there is no application-level exception, no log line saying "I lost 100 messages". The only signals are transport-layer Cyclone warnings and the absence of expected sensors in the fused output. An AI SRE that correctly identifies this scenario is doing real distributed-systems debugging, not pattern matching on stack traces.
```
make run-all                           # all six scenarios, isolated
make run NAME=fusion-cpu-starvation    # one scenario
```

Both targets go through `scripts/run-all-investigations.sh`, which enforces
strict isolation: every scenario starts with a fresh
`docker compose down --volumes && up -d`, waits for the sensors to publish
1 msg/s (so the baseline is observable), injects the fault, waits 75 s for the alert window to fill, calls `opensre investigate` against the scenario's `alert.json`, copies the artefacts to `results/<scenario>/`, and runs `cleanup.sh`.
The per-scenario folder is wiped at the start of each run, so the artefacts always reflect the most recent execution. The investigation calls into:
- Prometheus for metric queries
- Loki for log search
- Tempo for trace lookup
- Grafana for alert metadata
- Ollama (host) for local LLM inference — no cloud calls
See opensre-config/integrations.json for how OpenSRE is wired to the four data backends.
A single opensre investigate run is an iterative loop. The phase
markers in results/<scenario>/run.log show what is happening:
```
Reading alert          → LLM extracts (alert_name, pipeline_name, severity)
                         from the JSON payload
Loading integrations   → opensre reads ~/.tracer/integrations.json,
                         validates Grafana endpoint, classifies into
                         grafana / grafana_local / etc.
Planning               → LLM proposes a list of tool calls to make
                         (Grafana alerts, Grafana Loki, Mimir, Tempo,
                         run-diagnostic-code, get-sre-guidance)
Gathering evidence     → opensre dispatches each tool call concurrently,
                         collects results (n logs / m traces / k metrics)
Diagnosing             → LLM reasons over collected evidence and emits a
                         confidence score (0-100)
                         ↳ if confidence is low or evidence is thin,
                           the loop returns to Planning with new tool
                           choices, up to a configured iteration limit
Investigation complete → final RCA report (root_cause + findings + cited
                         evidence with reproducible Grafana links) is
                         written to /tmp/rca.json
```
Typical durations on the reference hardware: Reading alert 15-20 s, each
Planning 20-45 s, each Diagnosing 15-30 s. OpenSRE caps the loop at 5
Planning/Diagnosing iterations; in this experiment every scenario hit
that cap (one of the iterations is consumed by the
run_diagnostic_code failure described in §8.2). Total wall-clock per
investigation lands at 217-249 s. With qwen2.5:7b the GPU stays at
~5 GB VRAM and ~30-90% utilisation throughout.
OpenSRE version under test: git v2026.4.25-5-g4e6c051 (HEAD a few
documentation commits past the v2026.4.25 tag); pyproject.toml
still self-reports 2026.4.5 because the release pipeline does not
bump that field on every tag.
`run_diagnostic_code` consistently fails with `TypeError: ... missing 1 required positional argument` during the Gathering phase. It reproduces under both `llama3.1:8b` and `qwen2.5:7b`, so the root cause is in OpenSRE's tool registry / decorator wiring rather than the model: the tool is registered via `@tool(...)` without `is_available` / `extract_params`, so it leaks into the planner's choice list and then crashes on dispatch when the runner has no LLM-supplied args to pass. Investigations still complete (the agent keeps planning around the failure), but it inflates the iteration count and shows up as a `[WARNING] Action failed` line in every `run.log`.
For each run, the artefacts are at results/<scenario>/:
inject.log, run.log (full investigation transcript), rca.json
(structured RCA), and cleanup.log.
| Scenario | Expected | OpenSRE actual | Iter. | Conf. | Score |
|---|---|---|---|---|---|
| `sensor-down` | radar-sensor stopped → restart | "service started and signal received… no further log entries or metrics indicating when the problem occurred" | 5 | 100% | 1/3: cited signal_received but didn't query heartbeats / metrics that would prove the gap; no remediation |
| `fusion-cpu-starvation` | fusion CPU-bound → relax CPU quota or scale | "tactical_track_contributors_changed events indicate dynamic track adjustments, but lack of performance metrics… does not reveal a definitive root cause" | 5 | 87% | 1/4: ✅ named track-fusion, used the new tactical_track_contributors_changed event we added but misinterpreted it as the cause; never queried cms_fusion_processing_seconds |
| `message-flood` | radar publishing > 100 msg/s → revert config | "the publish rate of sensor data increased rapidly beyond the target, leading to overruns and heartbeats indicating dropped messages" | 5 | 100% | 3/5: ✅ correct service + ✅ correctly compared sensor_started target rate vs sensor_heartbeat observed rate (added by this experiment); ❌ no remediation |
| `network-partition` | esm container alive but network detached → reconnect | "the sensor service failed to initialize properly due to missing configuration" | 5 | 100% | 1/4 with misattribution: ✅ named esm-sensor; ❌ inferred "init failure" instead of network detach — agent saw the alert hint "container itself reports running" in the JSON but did not weight it |
| `dds-qos-mismatch` | DDS QoS mismatch on esm publisher → align QoS | "Change in Tactical Track Contributors" | 5 | 100% | 2/5, no hallucinations: ✅ named track-fusion + ✅ cited the tactical_track_contributors_changed event we added (the canonical signal for this scenario); ❌ stopped at "contributors changed" rather than reaching "QoS mismatch" |
| `memory-leak` | track-fusion memory grows linearly → profile / rollback | "there is an imbalance in track contributions that leads to queue overflow" — also cited aws_batch_jobs as evidence (hallucinated source) | 5 | 75% | 1/5 + 1 hallucination: 75 s wait window is too short for the 10 KiB-per-track leak to push RSS to the 256 MiB cap, so the actual signature was absent; agent fell back to fitting whatever signal was there |
Scoring rubric (also in each expected_rca.md):
- +1 for naming the right service
- +1 for citing the right evidence (metric / log / trace)
- +1 for suggesting the right remediation
- −1 or −2 for misattributions
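Applied mechanically, the rubric is a three-term sum minus a penalty. A small helper for reproducing the per-scenario numerators; the function and parameter names are ours, not part of the experiment's tooling:

```python
# Mechanical application of the scoring rubric above (names are illustrative).
def score_rca(named_right_service: bool,
              cited_right_evidence: bool,
              suggested_right_remediation: bool,
              misattribution_penalty: int = 0) -> int:
    """misattribution_penalty is 0, 1 or 2 depending on severity."""
    score = named_right_service + cited_right_evidence + suggested_right_remediation
    return score - misattribution_penalty

# Example: right service and right evidence, no remediation, no misattribution → 2.
assert score_rca(True, True, False) == 2
```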
Caveat — this is one run, not a study. Each of the six scenarios was injected and investigated exactly once under the conditions described in §6 and §8. The findings below describe what a single end-to-end execution produced; nothing here is a statistical claim. A later iteration of this experiment would re-run the matrix multiple times per scenario and per model to get distributions instead of point observations.
- 6/6 scenarios completed end-to-end without infrastructure errors (clean inject/cleanup logs, no container restart conflicts, every artefact written, every reset_lab gate passed).
- 4/6 scenarios returned a final confidence of 100% (`sensor-down`, `message-flood`, `network-partition`, `dds-qos-mismatch`); 87% for `fusion-cpu-starvation`, 75% for `memory-leak`. Confidence is poorly calibrated — high confidence does not imply the conclusion is right (network-partition was 100% confident in a wrong answer).
- Each scenario hit the OpenSRE-internal 5-iteration planning cap, with `run_diagnostic_code` consuming one of those iterations per run on the TypeError described in §8.2.
Two scenarios produced substantive RCAs that depended directly on signals this experiment adds:
- `message-flood` — the agent put together the `sensor_started` log (which records the target rate, e.g. 10 Hz) and the `sensor_heartbeat` log (observed rate every 30 s). Its conclusion cites "publish rate increased rapidly beyond the target, leading to overruns and heartbeats indicating dropped messages." Without those two structured events the agent would have had only a scrape rate to look at, which it has shown elsewhere it does not query.
- `dds-qos-mismatch` — the agent cited the `tactical_track_contributors_changed` log (a contributing-sensor set changed between consecutive publishes) and named track-fusion correctly. This is exactly the signal we added for this scenario. The conclusion stops at "contributors changed" rather than reaching "QoS mismatch", but the agent is now at least chasing the right thread.
This experiment generates rich, current evidence in all four backends (Loki, Tempo, Prometheus, Grafana alerts). The pattern in this single run is that the bottleneck is the model's reasoning, not the data:
- Stops at the first plausible cause. `dds-qos-mismatch` reaches "contributors changed" but never asks "why did they change?". A larger or more capable model would likely chain one more inference step into the QoS layer.
- Heavy log preference, neglects metrics. `fusion-cpu-starvation` has a histogram (`cms_fusion_processing_seconds`) that would prove the SLO breach in one query, and we exposed it specifically for this scenario. The agent never asked Mimir for it. Logs are the dominant tool the planner reaches for; histograms / gauges are selected almost as an afterthought.
- Honours alert payload structured fields, not its prose. The `commonAnnotations.summary` and free-text `context` field in `alert.json` carry useful hints (e.g. "container itself reports running but unreachable" for network-partition), but qwen2.5:7b collapses these into the basic (alert_name, pipeline, severity) triple before reasoning, so the prose hint never reaches the planner.
- One hallucinated evidence source. Memory-leak's report references `aws_batch_jobs`, which does not exist anywhere in this experiment. This is the only fabricated source in the run.
- Slow-degradation scenarios need their own time budget. `memory-leak` is a 5-10 min phenomenon but our 75 s alert-window wait is shared across all scenarios. Without a longer wait the leak's signature simply is not yet present when the agent looks.
- Try a larger / more capable model. Same setup, same scenarios, swap `OLLAMA_MODEL` to `qwen2.5:14b` or wire up a cloud Anthropic / OpenAI key. The instrumentation in this experiment gives any model a fighting chance; the open question is whether a bigger one actually uses it. Highest-leverage next step.
- Run each scenario several times. This run is a single point-observation per scenario; a robust evaluation needs distributions. With the runner already in place, looping the matrix 3-5× per model is mechanical.
- Per-scenario wait window. Let the runner read a `wait_seconds` from `scenarios/<name>/inject.sh` (or a sibling `meta.yaml`) instead of a fixed 75 s, so slow-degradation scenarios (memory-leak) actually surface their signature before the agent looks.
- Keep observability state between scenarios. Strict isolation improves reproducibility but starves the agent of the cumulative historical context that real investigations rely on. Resetting only the CMS-service in-memory state (not Loki/Prom/Tempo volumes) is closer to a real production deployment.
- Not production hardened (no HA, no secrets management, no RBAC).
- Not a real CMS — physics, sensor models, and engagement logic are deliberately small for clarity.
- Not multi-host — DDS discovery is configured for the local docker bridge.