Commit 9a1b5c3

add synthetic RDS postgres RCA suite (#194)
* add synthetic RDS postgres RCA suite: add bundled synthetic RDS PostgreSQL scenarios and extend RCA evidence handling so the benchmark suite can exercise database-specific failure modes end to end.
* Move alert templates into the CLI module: keep starter alert payload templates scoped to the CLI surface so the top-level app package stays focused on shared runtime logic.
* consolidating versions
* updated towards aws model
* refactoring the rl environment
* adding healthy model
* attempt at hooking up the synthetic test suite to the main agentic pipeline
* synthetic healthy base case creation
* implementing changes
* Potential fix for pull request finding "Statement has no effect"
* codeql improvements

Made-with: Cursor
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
1 parent 53f017b commit 9a1b5c3

58 files changed

Lines changed: 3144 additions & 36 deletions


Makefile

Lines changed: 6 additions & 1 deletion
@@ -1,7 +1,7 @@
 -include .env
 export
 
-.PHONY: install install-hooks onboard test test-full demo local-rca-demo alert-template investigate-alert verify-integrations check-docker check-langgraph check-langsmith-api-key grafana-local-up grafana-local-down grafana-local-seed local-grafana-live langgraph-build langgraph-deploy clean lint format build deploy deploy-lambda deploy-prefect deploy-flink destroy destroy-lambda destroy-prefect destroy-flink prefect-local-test simulate-k8s-alert test-k8s-local test-k8s test-k8s-datadog deploy-dd-monitors cleanup-dd-monitors deploy-eks destroy-eks test-k8s-eks datadog-demo crashloop-demo regen-trigger-config test-rca test-rca-grafana
+.PHONY: install install-hooks onboard test test-full demo local-rca-demo alert-template investigate-alert verify-integrations check-docker check-langgraph check-langsmith-api-key grafana-local-up grafana-local-down grafana-local-seed local-grafana-live langgraph-build langgraph-deploy clean lint format deploy deploy-lambda deploy-prefect deploy-flink destroy destroy-lambda destroy-prefect destroy-flink prefect-local-test simulate-k8s-alert test-k8s-local test-k8s test-k8s-datadog deploy-dd-monitors cleanup-dd-monitors deploy-eks destroy-eks test-k8s-eks datadog-demo crashloop-demo regen-trigger-config test-rca test-rca-grafana test-rds-synthetic
 
 ifneq ($(wildcard .venv/bin/python),)
 	PYTHON = .venv/bin/python
@@ -96,6 +96,10 @@ prefect-demo:
 test-rca:
 	$(PYTHON) -m tests.rca.run_rca_test $(FILE)
 
+# Run synthetic RDS PostgreSQL RCA benchmark suite
+test-rds-synthetic:
+	$(PYTHON) -m tests.synthetic_testing.rds_postgres.run_suite $(if $(SCENARIO),--scenario $(SCENARIO),)
+
 # Boot local Grafana+Loki, seed deterministic test logs, then run the RCA pipeline
 # Requires GRAFANA_INSTANCE_URL + GRAFANA_READ_TOKEN in .env (see .env.example for local defaults)
 test-rca-grafana: grafana-local-up grafana-local-seed
@@ -324,6 +328,7 @@ help:
 	@echo " make test-grafana - Run Grafana integration tests"
 	@echo " make test-rca - Run all RCA markdown alert tests in tests/rca/"
 	@echo " make test-rca FILE=pipeline_error_in_logs - Run a single RCA alert test"
+	@echo " make test-rds-synthetic - Run the synthetic RDS PostgreSQL RCA suite"
 	@echo " make clean - Clean up cache files"
 	@echo " make lint - Lint code with ruff"
 	@echo " make format - Format code with ruff"
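The `$(if $(SCENARIO),--scenario $(SCENARIO),)` expansion appends `--scenario <name>` only when the `SCENARIO` variable is set, so one target runs either the whole suite or a single scenario. On the Python side this maps naturally onto an optional argparse flag; the sketch below is a hypothetical mirror of that interface (the real `run_suite` CLI is not shown in this diff):

```python
import argparse


def parse_args(argv=None):
    # Hypothetical sketch of the run_suite CLI; the real module may differ.
    parser = argparse.ArgumentParser(
        description="Run the synthetic RDS PostgreSQL RCA suite"
    )
    parser.add_argument(
        "--scenario",
        default=None,
        help="Run a single named scenario; omit to run every bundled scenario",
    )
    return parser.parse_args(argv)


# `make test-rds-synthetic` expands to no extra flag -> full suite
assert parse_args([]).scenario is None
# `make test-rds-synthetic SCENARIO=replica_lag` adds `--scenario replica_lag`
assert parse_args(["--scenario", "replica_lag"]).scenario == "replica_lag"
```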

app/agent/__init__.py

Lines changed: 12 additions & 4 deletions
@@ -1,7 +1,15 @@
 """Agent core - LangGraph investigation and report generation."""
 
-from app.agent.runners import run_investigation
+from __future__ import annotations
 
-__all__ = [
-    "run_investigation",
-]
+from typing import Any
+
+
+def run_investigation(*args: Any, **kwargs: Any):
+    """Lazily import the full runner stack to avoid optional dependency churn at import time."""
+    from app.agent.runners import run_investigation as _run_investigation
+
+    return _run_investigation(*args, **kwargs)
+
+
+__all__ = ["run_investigation"]
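The rewrite above replaces an eager re-export with a thin wrapper that imports `app.agent.runners` only on first call, so `import app.agent` no longer pulls in the runner stack's optional dependencies. The same pattern can be sketched generically with the standard library (names here are illustrative, not from the repo):

```python
import importlib


def make_lazy(module_name: str, attr: str):
    """Return a callable that defers importing `module_name` until first use."""

    def wrapper(*args, **kwargs):
        # The import happens here, at call time, not when the wrapper is defined.
        module = importlib.import_module(module_name)
        return getattr(module, attr)(*args, **kwargs)

    return wrapper


# Defining the wrapper imports nothing; calling it imports json on demand.
dumps = make_lazy("json", "dumps")
assert dumps({"ok": True}) == '{"ok": true}'
```

The trade-off is that import errors surface at first call rather than at package import, which is exactly what a test harness with optional integrations wants.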

app/agent/nodes/plan_actions/detect_sources.py

Lines changed: 13 additions & 4 deletions
@@ -325,21 +325,28 @@ def detect_sources(
         or annotations.get("correlation_id")
     )
 
-    # Only include Grafana when alert came from Grafana, or when source is truly unknown
+    # Only include Grafana when alert came from Grafana, or when source is truly unknown,
+    # or when a pre-injected backend is present (e.g. FixtureGrafanaBackend for synthetic tests).
     grafana_int = None
     grafana_local = False
-    if resolved_integrations and alert_source in ("grafana", ""):
+    if resolved_integrations:
         if resolved_integrations.get("grafana_local"):
             grafana_int = resolved_integrations["grafana_local"]
             grafana_local = True
         elif resolved_integrations.get("grafana"):
             grafana_int = resolved_integrations["grafana"]
 
+    # When a _backend is injected we allow any alert_source; otherwise restrict to Grafana/unknown.
+    _has_injected_backend = bool(grafana_int and "_backend" in grafana_int)
+    if grafana_int and not (_has_injected_backend or alert_source in ("grafana", "")):
+        grafana_int = None  # suppress real Grafana for non-Grafana alerts
+
     if grafana_int:
         endpoint = grafana_int.get("endpoint", "")
         api_key = grafana_int.get("api_key", "")
-        # Local Grafana uses anonymous auth (empty api_key is valid for localhost)
-        if endpoint and (api_key or grafana_local):
+        has_backend = "_backend" in grafana_int
+        # Local Grafana uses anonymous auth; injected backends don't need credentials at all.
+        if has_backend or (endpoint and (api_key or grafana_local)):
             service_name = _map_pipeline_to_service_name(pipeline_name) if pipeline_name else ""
 
             grafana_params: dict[str, Any] = {
@@ -353,6 +360,8 @@ def detect_sources(
             }
             if execution_run_id:
                 grafana_params["execution_run_id"] = execution_run_id
+            if has_backend:
+                grafana_params["_backend"] = grafana_int["_backend"]
             sources["grafana"] = grafana_params
 
     # Only include Datadog when alert came from Datadog, or when source is truly unknown
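The gating above can be condensed into a single predicate: an injected `_backend` bypasses both the alert-source restriction and the credential check, while a real Grafana integration still requires a Grafana/unknown source plus an endpoint and credentials. A simplified, self-contained restatement (an assumption-level sketch, not the repo's actual helper):

```python
def should_use_grafana(grafana_int, alert_source, grafana_local=False):
    """Sketch of the source-detection gate: injected _backend wins everywhere."""
    if not grafana_int:
        return False
    has_backend = "_backend" in grafana_int
    # Without an injected backend, only Grafana-sourced or unknown-source alerts qualify.
    if not (has_backend or alert_source in ("grafana", "")):
        return False
    endpoint = grafana_int.get("endpoint", "")
    api_key = grafana_int.get("api_key", "")
    # Local Grafana uses anonymous auth; injected backends need no credentials at all.
    return has_backend or bool(endpoint and (api_key or grafana_local))


# Fixture backend: allowed even for a Datadog-sourced alert, no credentials needed.
assert should_use_grafana({"_backend": object()}, alert_source="datadog")
# Real Grafana: credentials required...
assert not should_use_grafana({"endpoint": "http://g", "api_key": ""}, alert_source="grafana")
assert should_use_grafana({"endpoint": "http://g", "api_key": "k"}, alert_source="")
# ...and non-Grafana alerts are still suppressed.
assert not should_use_grafana({"endpoint": "http://g", "api_key": "k"}, alert_source="datadog")
```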

app/agent/nodes/plan_actions/extract_keywords.py

Lines changed: 7 additions & 0 deletions
@@ -27,6 +27,13 @@
     "trace",
     "debug",
     "metrics",
+    "rds",
+    "postgres",
+    "database",
+    "replication",
+    "connections",
+    "storage",
+    "failover",
     "cpu",
     "disk",
     "resource",

app/agent/nodes/publish_findings/node.py

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ def generate_report(state: InvestigationState) -> dict:
         logger.warning("[publish] ingest url update failed: %s", exc)
 
     all_blocks = build_slack_blocks(ctx) + build_action_blocks(investigation_url, investigation_id)
-    render_report(slack_message)
+    render_report(slack_message, root_cause_category=state.get("root_cause_category"))
     open_in_editor(slack_message)
 
     slack_ctx = state.get("slack_context", {})

app/agent/nodes/publish_findings/renderers/terminal.py

Lines changed: 16 additions & 9 deletions
@@ -106,7 +106,7 @@ def _render_rich_evidence_item(console: Console, line: str) -> None:
 # Main render entry points
 # ─────────────────────────────────────────────────────────────────────────────
 
-def render_report(slack_message: str) -> None:
+def render_report(slack_message: str, root_cause_category: str | None = None) -> None:
     """Render the final RCA report to terminal."""
     fmt = get_output_format()
 
@@ -118,19 +118,23 @@ def render_report(slack_message: str) -> None:
         return
 
     if fmt == "rich":
-        _render_rich_report(slack_message)
+        _render_rich_report(slack_message, root_cause_category=root_cause_category)
     else:
-        _render_plain_report(slack_message)
+        _render_plain_report(slack_message, root_cause_category=root_cause_category)
 
 
-def _render_rich_report(slack_message: str) -> None:
+def _render_rich_report(slack_message: str, root_cause_category: str | None = None) -> None:
     console = Console()
     console.print()
 
-    # Completion dot at the top
+    # Header varies by outcome
     done = Text()
-    done.append(" ● ", style="bold green")
-    done.append("Investigation complete", style="bold white")
+    if root_cause_category == "healthy":
+        done.append(" ✓ ", style="bold green")
+        done.append("Systems healthy", style="bold green")
+    else:
+        done.append(" ● ", style="bold green")
+        done.append("Investigation complete", style="bold white")
     console.print(done)
     console.print()
 
@@ -196,9 +200,12 @@ def _render_rich_report(slack_message: str) -> None:
     console.print()
 
 
-def _render_plain_report(slack_message: str) -> None:
+def _render_plain_report(slack_message: str, root_cause_category: str | None = None) -> None:
     print()
-    print("Investigation complete")
+    if root_cause_category == "healthy":
+        print("✓ Systems healthy")
+    else:
+        print("Investigation complete")
     print()
     clean = _strip_slack_links(_strip_mrkdwn(slack_message))
     print(clean)

app/agent/nodes/root_cause_diagnosis/claim_validator.py

Lines changed: 45 additions & 0 deletions
@@ -23,6 +23,20 @@ def _has_any_logs(evidence: dict[str, Any]) -> bool:
     )
 
 
+def _has_rds_metrics(evidence: dict[str, Any]) -> bool:
+    metrics = evidence.get("rds_metrics", {})
+    return bool(metrics and (metrics.get("metrics") or metrics.get("observations")))
+
+
+def _has_rds_events(evidence: dict[str, Any]) -> bool:
+    return bool(evidence.get("rds_events"))
+
+
+def _has_performance_insights(evidence: dict[str, Any]) -> bool:
+    insights = evidence.get("performance_insights", {})
+    return bool(insights and (insights.get("top_sql") or insights.get("wait_events") or insights.get("observations")))
+
+
 def _datadog_logs_contain(evidence: dict[str, Any], keywords: tuple[str, ...]) -> bool:
     """Check if any Datadog log message contains at least one of the given keywords."""
     for log in evidence.get("datadog_error_logs", []) + evidence.get("datadog_logs", []):
@@ -43,10 +57,32 @@ def validate_claim(claim: str, evidence: dict[str, Any]) -> bool:
 
     if ("memory" in claim_lower or "cpu" in claim_lower) and not (
         evidence.get("host_metrics", {}).get("data")
+        or _has_rds_metrics(evidence)
+        or _has_performance_insights(evidence)
         or any(kw in claim_lower for kw in ("monitor", "datadog")) and has_dd
     ):
         return False
 
+    if any(
+        kw in claim_lower
+        for kw in (
+            "rds",
+            "postgres",
+            "database",
+            "replica",
+            "replication lag",
+            "connection",
+            "storage",
+            "disk",
+            "failover",
+            "reboot",
+        )
+    ) and not (_has_rds_metrics(evidence) or _has_rds_events(evidence) or _has_performance_insights(evidence)):
+        return False
+
+    if ("query" in claim_lower or "sql" in claim_lower or "wait event" in claim_lower) and not _has_performance_insights(evidence):
+        return False
+
     if ("job" in claim_lower or "batch" in claim_lower) and not (
         evidence.get("failed_jobs") or has_dd
     ):
@@ -104,6 +140,15 @@ def extract_evidence_sources(claim: str, evidence: dict[str, Any]) -> list[str]:
         sources.append("tracer_tools")
     if ("metric" in claim_lower or "memory" in claim_lower or "cpu" in claim_lower) and evidence.get("host_metrics", {}).get("data"):
         sources.append("host_metrics")
+    if any(
+        kw in claim_lower
+        for kw in ("metric", "replica", "replication lag", "connection", "storage", "disk", "database", "rds")
+    ) and _has_rds_metrics(evidence):
+        sources.append("rds_metrics")
+    if any(kw in claim_lower for kw in ("event", "failover", "reboot", "promotion", "availability zone")) and _has_rds_events(evidence):
+        sources.append("rds_events")
+    if any(kw in claim_lower for kw in ("query", "sql", "db load", "wait event", "cpu", "load")) and _has_performance_insights(evidence):
+        sources.append("performance_insights")
     if ("lambda" in claim_lower or "function" in claim_lower) and (
         evidence.get("lambda_logs") or evidence.get("lambda_function")
     ):
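The new validator rule follows a keyword-gate pattern: a claim that mentions a database concept is rejected outright unless at least one RDS evidence bucket is populated, which stops the LLM from asserting unobserved database failures. A simplified, self-contained restatement of that gate (evidence-presence checks collapsed to plain truthiness, unlike the stricter helpers above):

```python
def validate_db_claim(claim: str, evidence: dict) -> bool:
    """Simplified sketch of the RDS evidence gate in validate_claim."""
    claim_lower = claim.lower()
    db_keywords = (
        "rds", "postgres", "database", "replica", "replication lag",
        "connection", "storage", "disk", "failover", "reboot",
    )
    # Any populated RDS bucket counts as supporting evidence here.
    has_rds_evidence = any(
        evidence.get(k) for k in ("rds_metrics", "rds_events", "performance_insights")
    )
    if any(kw in claim_lower for kw in db_keywords) and not has_rds_evidence:
        return False
    return True


# A replication-lag claim with no RDS evidence is rejected...
assert not validate_db_claim("Replica is lagging behind primary", {})
# ...but accepted once rds_metrics carries data.
assert validate_db_claim("Replica is lagging behind primary", {"rds_metrics": {"metrics": [1]}})
# Non-database claims pass through this gate untouched.
assert validate_db_claim("Deploy pipeline succeeded", {})
```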

app/agent/nodes/root_cause_diagnosis/evidence_checker.py

Lines changed: 74 additions & 0 deletions
@@ -2,6 +2,29 @@
 
 from typing import Any
 
+# Alert state strings that indicate no active incident across common monitoring platforms.
+_HEALTHY_STATES = frozenset({"normal", "resolved", "ok"})
+
+# Severity levels that are non-actionable (i.e. scheduled checks, informational only).
+_HEALTHY_SEVERITIES = frozenset({"info", "none", ""})
+
+# Annotation keys whose non-empty presence signals an active error condition.
+_ERROR_ANNOTATION_KEYS = ("error", "error_message", "log_excerpt", "failed_steps")
+
+# Evidence keys whose presence (even with empty values) confirms investigation was attempted.
+# An empty grafana_logs list is itself a healthy signal: no errors found during investigation.
+_INVESTIGATED_EVIDENCE_KEYS = frozenset({
+    "grafana_logs",
+    "grafana_metrics",
+    "grafana_alert_rules",
+    "rds_metrics",
+    "rds_events",
+    "performance_insights",
+    "cloudwatch_logs",
+    "datadog_logs",
+    "datadog_monitors",
+})
+
 
 def check_evidence_availability(
     context: dict[str, Any], evidence: dict[str, Any], raw_alert: dict | str
@@ -34,6 +57,9 @@ def check_evidence_availability(
         or evidence.get("s3_marker")
         or evidence.get("lambda_function")
        or evidence.get("lambda_logs")
+        or evidence.get("rds_metrics")
+        or evidence.get("rds_events")
+        or evidence.get("performance_insights")
     )
 
     # Check for evidence in alert annotations or raw text
@@ -54,6 +80,54 @@ def check_evidence_availability(
     return has_tracer_evidence, has_cloudwatch_evidence, has_alert_evidence
 
 
+def is_clearly_healthy(raw_alert: dict[str, Any] | str, evidence: dict[str, Any]) -> bool:
+    """Return True only when all four conditions confirm no active incident.
+
+    Conditions (all must hold):
+    1. Alert ``state`` is in {"normal", "resolved", "ok"} — covers Grafana, CloudWatch,
+       PagerDuty, and most other monitoring platforms.
+    2. Alert ``severity`` is in {"info", "none", ""} — rules out a resolved-critical that
+       still warrants investigation.
+    3. No error-signal annotation keys (``error``, ``error_message``, ``log_excerpt``,
+       ``failed_steps``) are non-empty.
+    4. At least one evidence key is populated — distinguishes "healthy evidence" from
+       "no evidence gathered yet".
+
+    Blast radius if this misfires (false-healthy): the short-circuit returns
+    root_cause_category="healthy" without an LLM call. A real incident would receive a
+    "healthy" report. This is mitigated by:
+    - The severity gate: firing critical/high/warning alerts never satisfy condition 2.
+    - The HEALTHY_SHORT_CIRCUIT env flag (default "true") — set to "false" to disable
+      without a deploy.
+    """
+    if not isinstance(raw_alert, dict):
+        return False
+
+    # Condition 1: alert state signals no active incident.
+    state = str(raw_alert.get("state", "")).lower().strip()
+    if state not in _HEALTHY_STATES:
+        return False
+
+    # Condition 2: severity is non-actionable.
+    labels = raw_alert.get("commonLabels", raw_alert.get("labels", {})) or {}
+    severity = str(labels.get("severity", raw_alert.get("severity", ""))).lower().strip()
+    if severity not in _HEALTHY_SEVERITIES:
+        return False
+
+    # Condition 3: no error-signal annotations.
+    annotations = (
+        raw_alert.get("commonAnnotations", raw_alert.get("annotations", {})) or {}
+    )
+    if any(annotations.get(key) for key in _ERROR_ANNOTATION_KEYS):
+        return False
+
+    # Condition 4: at least one known investigation key exists in evidence (even if empty).
+    # An empty grafana_logs / grafana_metrics / etc. after a completed investigation is itself
+    # a health signal — it means no errors were found. We only require that the key is present
+    # (investigation was attempted), not that it contains data.
+    return any(k in evidence for k in _INVESTIGATED_EVIDENCE_KEYS)
+
+
 def check_vendor_evidence_missing(evidence: dict[str, Any]) -> bool:
     """
     Check if vendor/external API evidence is missing.
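The four gates of `is_clearly_healthy` compose with AND semantics, so any single failing gate keeps the investigation on the normal path. A condensed, self-contained restatement (same key names as the diff, but with a reduced evidence-key set for brevity):

```python
def is_clearly_healthy(raw_alert, evidence) -> bool:
    """Condensed restatement of the four healthy gates (reduced evidence-key set)."""
    if not isinstance(raw_alert, dict):
        return False
    # Gate 1: state signals no active incident.
    if str(raw_alert.get("state", "")).lower().strip() not in {"normal", "resolved", "ok"}:
        return False
    # Gate 2: severity is non-actionable.
    labels = raw_alert.get("commonLabels", raw_alert.get("labels", {})) or {}
    severity = str(labels.get("severity", raw_alert.get("severity", ""))).lower().strip()
    if severity not in {"info", "none", ""}:
        return False
    # Gate 3: no error-signal annotations.
    annotations = raw_alert.get("commonAnnotations", raw_alert.get("annotations", {})) or {}
    if any(annotations.get(k) for k in ("error", "error_message", "log_excerpt", "failed_steps")):
        return False
    # Gate 4: investigation was at least attempted (key presence, not key contents).
    return any(k in evidence for k in {"grafana_logs", "rds_metrics", "datadog_logs"})


# All four gates pass: resolved state, empty severity, no error annotations,
# and an (empty!) grafana_logs key proving the investigation ran.
assert is_clearly_healthy({"state": "resolved"}, {"grafana_logs": []})
# A resolved-critical still warrants investigation (gate 2 fails).
assert not is_clearly_healthy(
    {"state": "resolved", "labels": {"severity": "critical"}}, {"grafana_logs": []}
)
# No evidence keys at all means "not investigated yet", never "healthy" (gate 4 fails).
assert not is_clearly_healthy({"state": "ok"}, {})
```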
