fix: emit healthy-short-circuit claims from investigation keys only#777
Conversation
When is_clearly_healthy() trips, _handle_healthy_finding built the
validated_claims list by iterating every evidence key and filtering by
truthiness. Two problems fall out of that:
1. Query-string metadata (grafana_logs_query, datadog_logs_query), counts
(eks_total_pods, datadog_monitors_count), and other non-data fields are
truthy and leak into the findings as e.g.
"datadog_logs_query data confirmed within normal operating bounds"
which is nonsensical — that field is a query, not data.
2. An empty grafana_logs/eks_events list after a completed investigation is
itself the healthy signal (per is_clearly_healthy's docstring and
condition 4), but an empty list is falsy and silently dropped, so the
claim that actually triggered the short-circuit never appears.
Iterate _INVESTIGATED_EVIDENCE_KEYS intersected with evidence.keys() (the
same gating is_clearly_healthy uses) and sort for deterministic output.
E2E validated on the synthetic healthy scenarios
(HEALTHY_SHORT_CIRCUIT=true, mocked backends, real Anthropic LLM for the
upstream graph steps):
tests/synthetic/eks/000-healthy passed, findings trimmed from
8 claims (4 noise) to 5 clean
claims; eks_events: [] now
surfaces as a healthy claim
instead of being dropped.
tests/synthetic/rds_postgres/000-healthy passed, efficiency_score=1.0.
Unit coverage added (24 tests): empty-list produces a claim, non-
investigation keys stay out, order is deterministic, every investigation
key is recognised (parametrized), and diagnose_root_cause short-circuits
without invoking the LLM.
Greptile SummaryThis PR fixes two related bugs in the healthy short-circuit path of Key changes:
Confidence Score: 5/5Safe to merge — minimal, targeted fix with comprehensive regression tests and no behavior changes outside the healthy short-circuit path. The change is 8 lines in one production file and is logically correct: it precisely mirrors the evidence-key gating already used by is_clearly_healthy, fixing both reported bugs (metadata leakage and dropped empty-list signals). Tests are thorough (24 cases including parametric full-key coverage and end-to-end LLM-bypass assertion). The PR description confirms all CI checks (pytest 2791 pass, ruff, mypy) are clean. No public API or state-shape changes. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[diagnose_root_cause] --> B{_short_circuit_enabled\nAND is_clearly_healthy?}
B -- Yes --> C[_handle_healthy_finding]
B -- No --> D{has_tracer OR\nhas_cloudwatch\nOR has_alert?}
D -- No --> E[_handle_insufficient_evidence]
D -- Yes --> F[LLM call → parse_root_cause]
C --> G["for k in sorted(_INVESTIGATED_EVIDENCE_KEYS)\n if k in evidence"]
G --> H["validated_claims: k + ' data confirmed within normal operating bounds'"]
H --> I["Return healthy finding\nroot_cause_category='healthy'\nvalidity_score=1.0\nno LLM call"]
Reviews (1): Last reviewed commit: "fix: emit healthy-short-circuit claims f..." | Re-trigger Greptile |
Follow-up to address two completeness concerns with the previous commit.
1. **Back-compat for truthy non-investigation data keys.** The previous
revision iterated only ``INVESTIGATED_EVIDENCE_KEYS``, which silently
dropped claims for truthy evidence entries like ``datadog_events``,
``datadog_error_logs``, ``honeycomb_traces``, ``eks_failing_pods``, and
similar adjacent data keys written by the investigate mappers.
Claim generation now unions two rules:
- Investigation keys present in evidence always produce a claim (empty
lists included — mirrors ``is_clearly_healthy`` condition 4).
- Other evidence keys produce a claim when they carry non-empty data
*and* do not match a metadata shape (query string, count, timing,
resource name, total, by-pipeline aggregate).
Metadata is detected via ``str.endswith`` on a small suffix tuple
(``_query``, ``_count``, ``_ms``, ``_source``, ``_name``, ``_namespace``,
``_total``, ``_by_pipeline``), plus ``total_`` prefix and ``_total_``
infix for keys like ``eks_total_pods`` / ``lambda_total_logs``. Strict
``endswith`` avoids false positives like ``grafana_service_names``
(plural — list of service names, real data).
2. **Rename ``_INVESTIGATED_EVIDENCE_KEYS`` to ``INVESTIGATED_EVIDENCE_KEYS``.**
The set is now imported from another module in this package, so the
leading underscore is misleading. Public name matches its actual API
surface. Internal references in ``evidence_checker.py`` and the two
test modules updated in the same commit.
Coverage additions (91 tests total across the two test modules, up from
the previous 24):
- ``TestTruthyNonInvestigationDataClaims`` parametrised over 18 adjacent
data keys — confirms the union rule keeps them as claims.
- ``TestMetadataKeyFiltering`` parametrised over 22 metadata shapes —
confirms queries, counts, totals, timings, and resource names are
excluded even when the adjacent investigation key is also present.
- Forward-compat test: a new ``my_new_source_data`` key not in the
investigation set is claimed automatically once populated by any
future mapper, without a code change.
E2E re-validated with real Anthropic LLM for upstream graph:
- ``tests/synthetic/eks/000-healthy`` — short-circuit still hits at 0ms,
5 clean claims (``datadog_logs``, ``datadog_monitors``, ``eks_deployments``,
``eks_events``, ``eks_pods``). Scenario passed.
- ``tests/synthetic/eks/001-oomkilled-crashloop`` — non-healthy path,
full LLM investigation runs (2× diagnose, 192s total), produces the
expected OOM/SIGKILL root cause with confidence scores. Scenario
passed, ``category=resource_exhaustion``. Confirms the change does
not leak into the non-healthy branch.
Verification:
- ``pytest tests/``: 2834 pass, 1 skipped, 0 failures.
- ``ruff check`` and ``ruff format --check``: clean.
- ``mypy app/``: unchanged pre-existing pymysql-stub warnings only.
…whitelist The previous revision used a ``str.endswith`` heuristic on a suffix tuple to separate data keys from metadata. Auditing every mapper in ``app/nodes/investigate/processing/post_process.py`` surfaced seven key shapes the heuristic did not cover: ``_service``, ``_id``, ``_url``, ``_text``, ``_limit``, ``_window``, and ``_dataset`` — each of which is metadata written alongside a real data key (e.g. ``grafana_logs_service`` next to ``grafana_logs``, ``honeycomb_query_url`` next to ``honeycomb_traces``, ``github_file_text`` next to ``github_file``). Replace the heuristic with an explicit ``CLAIM_EVIDENCE_KEYS`` whitelist in ``evidence_checker.py``, populated from a full audit of every mapper's primary data key output. The whitelist is a superset of ``INVESTIGATED_EVIDENCE_KEYS``; investigation keys qualify even when empty (an empty list after a completed investigation is the healthy signal), non-investigation whitelist entries qualify only when truthy. Everything else — metadata, unknown keys — is ignored. Trade-off: a new mapper whose primary data key is not added to the whitelist will not produce a healthy-short-circuit claim for that key until the whitelist is updated. That is the deliberate, safe default — silent inclusion risked the metadata-leak bug this PR is fixing. Test suite expanded to 188 tests in tests/nodes/root_cause_diagnosis/ covering: - Every member of INVESTIGATED_EVIDENCE_KEYS (parametrised; empty list produces a claim). - Every member of CLAIM_EVIDENCE_KEYS - INVESTIGATED_EVIDENCE_KEYS (parametrised twice: truthy produces a claim, empty does not). - 51 exhaustive metadata shapes from the mapper audit — none leak into claims in isolation or alongside an adjacent investigation key. - Guardrail assertions on the whitelist itself: investigation-subset property and no obvious-metadata-suffix offenders. - End-to-end ``diagnose_root_cause`` short-circuit test with a mixed evidence dict — LLM factory not invoked. Verification: - ``pytest``: 2931 pass, 1 skipped, 0 failures (was 2767 on main; +164). - ``ruff check`` / ``ruff format --check``: clean. - ``mypy app/``: unchanged pre-existing pymysql-stub warnings only. - EKS synthetic ``000-healthy`` e2e: 5 clean claims, short-circuit at 0ms.
|
Verified locally:
Fix is well-scoped 👍 |
Summary
_handle_healthy_findingbuilt thevalidated_claimslist by iterating every evidence key and filtering by truthiness. Two problems fall out of that:Metadata leaks into findings.
grafana_logs_query,datadog_logs_query, counts (datadog_monitors_count,eks_total_pods), timings (datadog_fetch_ms), service names (grafana_logs_service), trace IDs (honeycomb_trace_id), URLs (honeycomb_query_url) — all truthy, all surfaced as e.g.Empty-list evidence is silently dropped. An empty
grafana_logs/eks_eventsafter a completed investigation is itself the healthy signal (peris_clearly_healthy's condition 4 and docstring). Empty lists are falsy, so the claim that actually triggered the short-circuit never appears.Fix
Introduce an explicit
CLAIM_EVIDENCE_KEYSwhitelist inevidence_checker.py, populated by auditing every mapper inapp/nodes/investigate/processing/post_process.py. Replace the old truthy-filter with:CLAIM_EVIDENCE_KEYS(58 entries) is a superset ofINVESTIGATED_EVIDENCE_KEYS(16 entries) listing every primary data key across the 39 mappers. Metadata keys — queries, counts, timings, names, IDs, URLs, text summaries — are authoritatively excluded by omission. Also renames_INVESTIGATED_EVIDENCE_KEYS→INVESTIGATED_EVIDENCE_KEYS(cross-module import, leading underscore was misleading).Deliberate trade-off: a future mapper whose primary data key is not added to
CLAIM_EVIDENCE_KEYSwill not produce a healthy-short-circuit claim for that key until someone updates the whitelist. That's the safe default — silent inclusion is what produced the metadata-leak bug this PR fixes.E2E validation — full synthetic suites, real Anthropic LLM
HEALTHY_SHORT_CIRCUIT=true, mocked EKS/Grafana/Datadog backends, realclaude-sonnet-4-6reasoning +claude-haiku-4-5toolcall for the upstream graph.tests/synthetic/eks/000-healthy(short-circuit path — this PR's scope)Before (
main) — 8 claims, 4 metadata noise,eks_eventssilently dropped:After (this PR) — 5 clean claims,
eks_events: []now surfaced:Passed,
actual_category: healthy,matched_keywords: [normal, no failure]. Diagnose at0ms,healthy_short_circuit=true, LLM not invoked.Full EKS suite — 14 scenarios, all mocked backends
state: alerting⇒is_clearly_healthy=False⇒ short-circuit never fires, LLM categorization drift (e.g.got 'infrastructure', expected 'healthy')Full RDS suite — 15 scenarios,
--mock-grafanastate: alerting⇒ short-circuit never fires, LLM categorization / keyword-match driftConfirmed every failure is in
state: alertingwhere_handle_healthy_findingis unreachable (grepped viajson.load(alert.json)['state']for each). Both000-healthyscenarios — the only short-circuit exercises in the suites — PASS with clean output.Non-healthy regression check
tests/synthetic/eks/001-oomkilled-crashloop: full LLM investigation (2× diagnose, 192s), produced OOM/SIGKILL root cause,category=resource_exhaustion. Confirms the change is confined to the healthy branch.Unit coverage — 188 tests in
tests/nodes/root_cause_diagnosis/TestInvestigationKeyClaims— parametrised over every member ofINVESTIGATED_EVIDENCE_KEYS; empty list produces a claim.TestAdjacentDataKeys— parametrised twice overCLAIM_EVIDENCE_KEYS - INVESTIGATED_EVIDENCE_KEYS: truthy produces a claim, empty does not.TestNonWhitelistedKeysFiltered— 51 exhaustive metadata shapes from the mapper audit (counts, totals, queries, timings, names, IDs, URLs, limits, windows, text summaries); none leak in isolation or alongside adjacent investigation keys.TestWhitelistIntegrity—INVESTIGATED_EVIDENCE_KEYS ⊆ CLAIM_EVIDENCE_KEYS; no whitelist entry ends in an obvious-metadata suffix.TestHealthyFindingShape— shape, validity score, loop-count preservation, tracker completion, deterministic claim order.diagnose_root_causeshort-circuits on a mixed evidence dict, assertsmock_llm_factory.assert_not_called().Verification
test (ubuntu-latest)PASS,typecheckPASS,qualityPASS,CodeQLPASS,Analyze (python)PASSpytest tests/(Python 3.14, no coverage): 2931 pass, 1 skipped, 0 failures (was 2767 onmain; +164 new)ruff check app/ tests/: cleanruff format --check: cleanmypy app/: same 3 pre-existingtypes-PyMySQLstub warnings, nothing newScope
Two
app/files (evidence_checker.py,node.py), one test file. No public API changes aside from renaming the already-exported-by-behavior constant. No state-shape changes. No behavior change for the non-healthy branch. Orthogonal to open PR #672 (which extendsINVESTIGATED_EVIDENCE_KEYSitself for alertmanager/coralogix/honeycomb) — this PR composes cleanly regardless of which keys that set contains.