Skip to content

fix: emit healthy-short-circuit claims from investigation keys only#777

Merged
rrajan94 merged 3 commits intoTracer-Cloud:mainfrom
mayankbharati-ops:fix/healthy-short-circuit-claims-from-investigation-keys
Apr 23, 2026
Merged

fix: emit healthy-short-circuit claims from investigation keys only#777
rrajan94 merged 3 commits intoTracer-Cloud:mainfrom
mayankbharati-ops:fix/healthy-short-circuit-claims-from-investigation-keys

Conversation

@mayankbharati-ops
Copy link
Copy Markdown
Contributor

@mayankbharati-ops mayankbharati-ops commented Apr 22, 2026

Summary

_handle_healthy_finding built the validated_claims list by iterating every evidence key and filtering by truthiness. Two problems fall out of that:

  1. Metadata leaks into findings. grafana_logs_query, datadog_logs_query, counts (datadog_monitors_count, eks_total_pods), timings (datadog_fetch_ms), service names (grafana_logs_service), trace IDs (honeycomb_trace_id), URLs (honeycomb_query_url) — all truthy, all surfaced as e.g.

    "datadog_logs_query data confirmed within normal operating bounds"

  2. Empty-list evidence is silently dropped. An empty grafana_logs / eks_events after a completed investigation is itself the healthy signal (per is_clearly_healthy's condition 4 and docstring). Empty lists are falsy, so the claim that actually triggered the short-circuit never appears.

Fix

Introduce an explicit CLAIM_EVIDENCE_KEYS whitelist in evidence_checker.py, populated by auditing every mapper in app/nodes/investigate/processing/post_process.py. Replace the old truthy-filter with:

def _is_healthy_claim_key(key, value):
    if key in INVESTIGATED_EVIDENCE_KEYS:
        return True           # empty lists count — the healthy signal
    if key in CLAIM_EVIDENCE_KEYS:
        return bool(value)    # adjacent data keys claim when non-empty
    return False              # metadata / unknown keys never claim

CLAIM_EVIDENCE_KEYS (58 entries) is a superset of INVESTIGATED_EVIDENCE_KEYS (16 entries) listing every primary data key across the 39 mappers. Metadata keys — queries, counts, timings, names, IDs, URLs, text summaries — are authoritatively excluded by omission. Also renames _INVESTIGATED_EVIDENCE_KEYSINVESTIGATED_EVIDENCE_KEYS (cross-module import, leading underscore was misleading).

Deliberate trade-off: a future mapper whose primary data key is not added to CLAIM_EVIDENCE_KEYS will not produce a healthy-short-circuit claim for that key until someone updates the whitelist. That's the safe default — silent inclusion is what produced the metadata-leak bug this PR fixes.

E2E validation — full synthetic suites, real Anthropic LLM

HEALTHY_SHORT_CIRCUIT=true, mocked EKS/Grafana/Datadog backends, real claude-sonnet-4-6 reasoning + claude-haiku-4-5 toolcall for the upstream graph.

tests/synthetic/eks/000-healthy (short-circuit path — this PR's scope)

Before (main) — 8 claims, 4 metadata noise, eks_events silently dropped:

eks_pods, eks_total_pods (count), eks_deployments, eks_total_deployments (count),
datadog_monitors, datadog_monitors_count (count), datadog_logs, datadog_logs_query (query)

After (this PR) — 5 clean claims, eks_events: [] now surfaced:

datadog_logs, datadog_monitors, eks_deployments, eks_events, eks_pods

Passed, actual_category: healthy, matched_keywords: [normal, no failure]. Diagnose at 0ms, healthy_short_circuit=true, LLM not invoked.

Full EKS suite — 14 scenarios, all mocked backends

Result Count Details
PASS 10 000-healthy (short-circuit), 001-005, 007-010 (all non-short-circuit LLM path)
FAIL 4 006, 011, 012, 013 — all state: alertingis_clearly_healthy=False ⇒ short-circuit never fires, LLM categorization drift (e.g. got 'infrastructure', expected 'healthy')

Full RDS suite — 15 scenarios, --mock-grafana

Result Count Details
PASS 11 000-healthy (short-circuit), 001, 003-006, 008-011, 014
FAIL 4 002, 007, 012, 013 — all state: alerting ⇒ short-circuit never fires, LLM categorization / keyword-match drift

Confirmed every failure is in state: alerting where _handle_healthy_finding is unreachable (grepped via json.load(alert.json)['state'] for each). Both 000-healthy scenarios — the only short-circuit exercises in the suites — PASS with clean output.

Non-healthy regression check

tests/synthetic/eks/001-oomkilled-crashloop: full LLM investigation (2× diagnose, 192s), produced OOM/SIGKILL root cause, category=resource_exhaustion. Confirms the change is confined to the healthy branch.

Unit coverage — 188 tests in tests/nodes/root_cause_diagnosis/

  • TestInvestigationKeyClaims — parametrised over every member of INVESTIGATED_EVIDENCE_KEYS; empty list produces a claim.
  • TestAdjacentDataKeys — parametrised twice over CLAIM_EVIDENCE_KEYS - INVESTIGATED_EVIDENCE_KEYS: truthy produces a claim, empty does not.
  • TestNonWhitelistedKeysFiltered — 51 exhaustive metadata shapes from the mapper audit (counts, totals, queries, timings, names, IDs, URLs, limits, windows, text summaries); none leak in isolation or alongside adjacent investigation keys.
  • TestWhitelistIntegrityINVESTIGATED_EVIDENCE_KEYS ⊆ CLAIM_EVIDENCE_KEYS; no whitelist entry ends in an obvious-metadata suffix.
  • TestHealthyFindingShape — shape, validity score, loop-count preservation, tracker completion, deterministic claim order.
  • End-to-enddiagnose_root_cause short-circuits on a mixed evidence dict, asserts mock_llm_factory.assert_not_called().

Verification

  • CI (Python 3.13): test (ubuntu-latest) PASS, typecheck PASS, quality PASS, CodeQL PASS, Analyze (python) PASS
  • Local pytest tests/ (Python 3.14, no coverage): 2931 pass, 1 skipped, 0 failures (was 2767 on main; +164 new)
  • ruff check app/ tests/: clean
  • ruff format --check: clean
  • mypy app/: same 3 pre-existing types-PyMySQL stub warnings, nothing new

Scope

Two app/ files (evidence_checker.py, node.py), one test file. No public API changes aside from renaming the already-exported-by-behavior constant. No state-shape changes. No behavior change for the non-healthy branch. Orthogonal to open PR #672 (which extends INVESTIGATED_EVIDENCE_KEYS itself for alertmanager/coralogix/honeycomb) — this PR composes cleanly regardless of which keys that set contains.

When is_clearly_healthy() trips, _handle_healthy_finding built the
validated_claims list by iterating every evidence key and filtering by
truthiness. Two problems fall out of that:

1. Query-string metadata (grafana_logs_query, datadog_logs_query), counts
   (eks_total_pods, datadog_monitors_count), and other non-data fields are
   truthy and leak into the findings as e.g.
       "datadog_logs_query data confirmed within normal operating bounds"
   which is nonsensical — that field is a query, not data.

2. An empty grafana_logs/eks_events list after a completed investigation is
   itself the healthy signal (per is_clearly_healthy's docstring and
   condition 4), but an empty list is falsy and silently dropped, so the
   claim that actually triggered the short-circuit never appears.

Iterate _INVESTIGATED_EVIDENCE_KEYS intersected with evidence.keys() (the
same gating is_clearly_healthy uses) and sort for deterministic output.

E2E validated on the synthetic healthy scenarios
(HEALTHY_SHORT_CIRCUIT=true, mocked backends, real Anthropic LLM for the
upstream graph steps):

  tests/synthetic/eks/000-healthy        passed, findings trimmed from
                                          8 claims (4 noise) to 5 clean
                                          claims; eks_events: [] now
                                          surfaces as a healthy claim
                                          instead of being dropped.
  tests/synthetic/rds_postgres/000-healthy passed, efficiency_score=1.0.

Unit coverage added (24 tests): empty-list produces a claim, non-
investigation keys stay out, order is deterministic, every investigation
key is recognised (parametrized), and diagnose_root_cause short-circuits
without invoking the LLM.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR fixes two related bugs in the healthy short-circuit path of _handle_healthy_finding in the root-cause-diagnosis node. The old code built validated_claims by iterating all evidence keys filtered by truthiness, which caused query-string metadata keys (grafana_logs_query, datadog_logs_query, count fields) to incorrectly surface as data-backed claims, while empty-list evidence (the actual healthy signal per is_clearly_healthy condition 4) was silently dropped. The fix mirrors is_clearly_healthy's own gating by iterating sorted(_INVESTIGATED_EVIDENCE_KEYS) ∩ evidence.keys().

Key changes:

  • app/nodes/root_cause_diagnosis/node.py: 8-line listcomp rewrite — iterates sorted(_INVESTIGATED_EVIDENCE_KEYS) with k in evidence guard instead of iterating all evidence keys filtered by truthiness.
  • tests/nodes/root_cause_diagnosis/test_healthy_short_circuit.py: 24 new unit tests covering the regression (empty-list evidence → claim emitted), exclusion of non-investigation keys, deterministic ordering, full parametric coverage of every _INVESTIGATED_EVIDENCE_KEYS member, shape invariants, tracker assertions, and end-to-end LLM-bypass assertion.

Confidence Score: 5/5

Safe to merge — minimal, targeted fix with comprehensive regression tests and no behavior changes outside the healthy short-circuit path.

The change is 8 lines in one production file and is logically correct: it precisely mirrors the evidence-key gating already used by is_clearly_healthy, fixing both reported bugs (metadata leakage and dropped empty-list signals). Tests are thorough (24 cases including parametric full-key coverage and end-to-end LLM-bypass assertion). The PR description confirms all CI checks (pytest 2791 pass, ruff, mypy) are clean. No public API or state-shape changes.

No files require special attention.

Important Files Changed

Filename Overview
app/nodes/root_cause_diagnosis/node.py Core fix: replaces truthy-evidence iteration with sorted _INVESTIGATED_EVIDENCE_KEYS intersection guard, correctly including empty-list healthy signals and excluding query-string metadata keys.
tests/nodes/root_cause_diagnosis/test_healthy_short_circuit.py New test file with 24 tests covering the regression, non-investigation key exclusion, deterministic ordering, full parametric key coverage, shape invariants, and LLM-bypass assertion.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[diagnose_root_cause] --> B{_short_circuit_enabled\nAND is_clearly_healthy?}
    B -- Yes --> C[_handle_healthy_finding]
    B -- No --> D{has_tracer OR\nhas_cloudwatch\nOR has_alert?}
    D -- No --> E[_handle_insufficient_evidence]
    D -- Yes --> F[LLM call → parse_root_cause]
    C --> G["for k in sorted(_INVESTIGATED_EVIDENCE_KEYS)\n    if k in evidence"]
    G --> H["validated_claims: k + ' data confirmed within normal operating bounds'"]
    H --> I["Return healthy finding\nroot_cause_category='healthy'\nvalidity_score=1.0\nno LLM call"]
Loading

Reviews (1): Last reviewed commit: "fix: emit healthy-short-circuit claims f..." | Re-trigger Greptile

Follow-up to address two completeness concerns with the previous commit.

1. **Back-compat for truthy non-investigation data keys.**  The previous
   revision iterated only ``INVESTIGATED_EVIDENCE_KEYS``, which silently
   dropped claims for truthy evidence entries like ``datadog_events``,
   ``datadog_error_logs``, ``honeycomb_traces``, ``eks_failing_pods``, and
   similar adjacent data keys written by the investigate mappers.

   Claim generation now unions two rules:

   - Investigation keys present in evidence always produce a claim (empty
     lists included — mirrors ``is_clearly_healthy`` condition 4).
   - Other evidence keys produce a claim when they carry non-empty data
     *and* do not match a metadata shape (query string, count, timing,
     resource name, total, by-pipeline aggregate).

   Metadata is detected via ``str.endswith`` on a small suffix tuple
   (``_query``, ``_count``, ``_ms``, ``_source``, ``_name``, ``_namespace``,
   ``_total``, ``_by_pipeline``), plus ``total_`` prefix and ``_total_``
   infix for keys like ``eks_total_pods`` / ``lambda_total_logs``. Strict
   ``endswith`` avoids false positives like ``grafana_service_names``
   (plural — list of service names, real data).

2. **Rename ``_INVESTIGATED_EVIDENCE_KEYS`` to ``INVESTIGATED_EVIDENCE_KEYS``.**
   The set is now imported from another module in this package, so the
   leading underscore is misleading. Public name matches its actual API
   surface. Internal references in ``evidence_checker.py`` and the two
   test modules updated in the same commit.

Coverage additions (91 tests total across the two test modules, up from
the previous 24):

- ``TestTruthyNonInvestigationDataClaims`` parametrised over 18 adjacent
  data keys — confirms the union rule keeps them as claims.
- ``TestMetadataKeyFiltering`` parametrised over 22 metadata shapes —
  confirms queries, counts, totals, timings, and resource names are
  excluded even when the adjacent investigation key is also present.
- Forward-compat test: a new ``my_new_source_data`` key not in the
  investigation set is claimed automatically once populated by any
  future mapper, without a code change.

E2E re-validated with real Anthropic LLM for upstream graph:

- ``tests/synthetic/eks/000-healthy`` — short-circuit still hits at 0ms,
  5 clean claims (``datadog_logs``, ``datadog_monitors``, ``eks_deployments``,
  ``eks_events``, ``eks_pods``). Scenario passed.
- ``tests/synthetic/eks/001-oomkilled-crashloop`` — non-healthy path,
  full LLM investigation runs (2× diagnose, 192s total), produces the
  expected OOM/SIGKILL root cause with confidence scores. Scenario
  passed, ``category=resource_exhaustion``. Confirms the change does
  not leak into the non-healthy branch.

Verification:

- ``pytest tests/``: 2834 pass, 1 skipped, 0 failures.
- ``ruff check`` and ``ruff format --check``: clean.
- ``mypy app/``: unchanged pre-existing pymysql-stub warnings only.
…whitelist

The previous revision used a ``str.endswith`` heuristic on a suffix tuple
to separate data keys from metadata. Auditing every mapper in
``app/nodes/investigate/processing/post_process.py`` surfaced seven key
shapes the heuristic did not cover: ``_service``, ``_id``, ``_url``,
``_text``, ``_limit``, ``_window``, and ``_dataset`` — each of which is
metadata written alongside a real data key (e.g. ``grafana_logs_service``
next to ``grafana_logs``, ``honeycomb_query_url`` next to
``honeycomb_traces``, ``github_file_text`` next to ``github_file``).

Replace the heuristic with an explicit ``CLAIM_EVIDENCE_KEYS`` whitelist in
``evidence_checker.py``, populated from a full audit of every mapper's
primary data key output. The whitelist is a superset of
``INVESTIGATED_EVIDENCE_KEYS``; investigation keys qualify even when empty
(an empty list after a completed investigation is the healthy signal),
non-investigation whitelist entries qualify only when truthy. Everything
else — metadata, unknown keys — is ignored.

Trade-off: a new mapper whose primary data key is not added to the
whitelist will not produce a healthy-short-circuit claim for that key
until the whitelist is updated. That is the deliberate, safe default —
silent inclusion risked the metadata-leak bug this PR is fixing.

Test suite expanded to 188 tests in tests/nodes/root_cause_diagnosis/
covering:

- Every member of INVESTIGATED_EVIDENCE_KEYS (parametrised; empty list
  produces a claim).
- Every member of CLAIM_EVIDENCE_KEYS - INVESTIGATED_EVIDENCE_KEYS
  (parametrised twice: truthy produces a claim, empty does not).
- 51 exhaustive metadata shapes from the mapper audit — none leak into
  claims in isolation or alongside an adjacent investigation key.
- Guardrail assertions on the whitelist itself: investigation-subset
  property and no obvious-metadata-suffix offenders.
- End-to-end ``diagnose_root_cause`` short-circuit test with a mixed
  evidence dict — LLM factory not invoked.

Verification:
- ``pytest``: 2931 pass, 1 skipped, 0 failures (was 2767 on main; +164).
- ``ruff check`` / ``ruff format --check``: clean.
- ``mypy app/``: unchanged pre-existing pymysql-stub warnings only.
- EKS synthetic ``000-healthy`` e2e: 5 clean claims, short-circuit at 0ms.
@rrajan94
Copy link
Copy Markdown
Collaborator

Verified locally:

  • Whitelist completeness: every key post_process.py mappers write is covered by CLAIM_EVIDENCE_KEYS (160 mapper keys audited against 58 whitelist entries).
  • pytest tests/nodes/root_cause_diagnosis/test_healthy_short_circuit.py: 164 passed.
  • make lint: clean.
  • make typecheck: 379 files, no issues.

Fix is well-scoped 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants