Skip to content

fix: recognise Alertmanager, Coralogix, and Honeycomb keys in is_clearly_healthy short-circuit#672

Closed
kaushal-bakrania wants to merge 2 commits intoTracer-Cloud:mainfrom
kaushal-bakrania:fix/is-clearly-healthy-recognise-alertmanager-coralogix-honeycomb
Closed

fix: recognise Alertmanager, Coralogix, and Honeycomb keys in is_clearly_healthy short-circuit#672
kaushal-bakrania wants to merge 2 commits intoTracer-Cloud:mainfrom
kaushal-bakrania:fix/is-clearly-healthy-recognise-alertmanager-coralogix-honeycomb

Conversation

@kaushal-bakrania
Copy link
Copy Markdown
Contributor

Summary

  • Same drift class as [BUG] is_clearly_healthy short-circuit never fires for pure-EKS healthy states — eks_* keys missing from _INVESTIGATED_EVIDENCE_KEYS #582 (EKS fix): _INVESTIGATED_EVIDENCE_KEYS in evidence_checker.py was missing alertmanager_alerts, alertmanager_silences, coralogix_logs, coralogix_error_logs, and honeycomb_traces — keys that the investigation merge step actually writes via the matching _map_* mappers in post_process.py.
  • Consequence: condition 4 of is_clearly_healthy never fires for pure-Alertmanager / pure-Coralogix / pure-Honeycomb healthy states, so every resolved low-severity alert from those stacks pays a full LLM RCA round-trip instead of taking the fast path. Cost + latency paper-cut, not a correctness bug.
  • Fix adds the five missing keys to the frozenset. The severity gate (condition 2) still rejects firing critical alerts when these evidence keys are populated, so the safety properties of the short-circuit are unchanged.

Fixes #670.

Test plan

  • make lint — clean
  • make typecheck — clean
  • pytest tests/nodes/root_cause_diagnosis/test_evidence_checker.py — 28 passed (14 new + 14 existing)
  • Added parameterised tests covering all five new keys plus a firing-critical safety test
  • Reproduction from the issue now prints missing: [] and all four keys print short-circuit -> True

…n is_clearly_healthy short-circuit

Same drift class as Tracer-Cloud#582 (which fixed it for EKS): the investigation
merge step writes alertmanager_alerts / alertmanager_silences /
coralogix_logs / coralogix_error_logs / honeycomb_traces into the
evidence dict via the corresponding _map_* mappers in post_process.py,
but these keys were not added to _INVESTIGATED_EVIDENCE_KEYS. As a
result, condition 4 of is_clearly_healthy never fires for
pure-Alertmanager / pure-Coralogix / pure-Honeycomb healthy states
and every resolved low-severity alert from those stacks pays a full
LLM RCA round-trip instead of taking the fast path.

Add the five missing keys to the frozenset. The severity gate
(condition 2) still rejects firing critical alerts even when these
evidence keys are populated, so the safety properties of the
short-circuit are unchanged.

Fixes Tracer-Cloud#670
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 19, 2026

Greptile Summary

This PR fixes a recurring drift pattern where _INVESTIGATED_EVIDENCE_KEYS in evidence_checker.py was not updated when new integrations (Alertmanager, Coralogix, Honeycomb) were added, causing is_clearly_healthy condition 4 to never fire for those stacks and forcing a full LLM RCA round-trip on every resolved low-severity alert from them. The fix adds the five missing keys (alertmanager_alerts, alertmanager_silences, coralogix_logs, coralogix_error_logs, honeycomb_traces) and backs them with parameterized tests plus a severity-gate safety check.

Confidence Score: 5/5

Safe to merge — the change is additive, well-tested, and the safety properties of the short-circuit are unchanged.

No P0/P1 issues found. The added keys are verified against what the post_process.py mappers actually write. Tests cover all five new keys plus the critical-severity rejection path. The only finding is a P2 observation about potentially similar gaps for Vercel/GitHub integrations, which is pre-existing and out of scope for this PR.

No files require special attention.

Important Files Changed

Filename Overview
app/nodes/root_cause_diagnosis/evidence_checker.py Adds 5 missing mapper-output keys to _INVESTIGATED_EVIDENCE_KEYS and check_evidence_availability; keys correctly match what map_alertmanager*, _map_coralogix_logs, and _map_honeycomb_traces write in post_process.py.
tests/nodes/root_cause_diagnosis/test_evidence_checker.py Adds parameterized tests for all 5 new keys plus a safety test confirming critical-severity alerts still reject; good coverage and follows existing test patterns.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Incoming resolved alert] --> B{Condition 1\nstate in healthy_states?}
    B -- No --> FULL[Full LLM RCA round-trip]
    B -- Yes --> C{Condition 2\nseverity in healthy_severities?}
    C -- No --> FULL
    C -- Yes --> D{Condition 3\nno error annotations?}
    D -- No --> FULL
    D -- Yes --> E{Condition 4\nany key in _INVESTIGATED_EVIDENCE_KEYS?}
    E -- No --> FULL
    E -- Yes --> SC[Short-circuit → healthy]

    subgraph "Keys added by this PR"
        K1[alertmanager_alerts]
        K2[alertmanager_silences]
        K3[coralogix_logs]
        K4[coralogix_error_logs]
        K5[honeycomb_traces]
    end

    K1 & K2 & K3 & K4 & K5 --> E
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: app/nodes/root_cause_diagnosis/evidence_checker.py
Line: 16-44

Comment:
**Potential same-drift gap for Vercel and GitHub integrations**

`_map_vercel_deployment_status` writes `vercel_deployments`, `_map_vercel_deployment_logs` writes `vercel_deployment` / `vercel_events`, and the three GitHub mappers write `github_code_matches`, `github_file`, and `github_commits` — none of which appear in `_INVESTIGATED_EVIDENCE_KEYS`. If a pure-Vercel or pure-GitHub healthy investigation runs, condition 4 of `is_clearly_healthy` would still not fire, forcing the same full LLM round-trip this PR was designed to eliminate. Worth checking whether those integrations should be included here too.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix: recognise alertmanager/coralogix/ho..." | Re-trigger Greptile

…e_availability

The same drift class the frozenset fix addressed: _map_* mappers write
these keys into the evidence dict, but the has_cloudwatch_evidence OR-
chain was not updated alongside. A firing-critical alert from these
stacks with a sparse payload (no tracer_web_run, no annotation body)
would fall into _handle_insufficient_evidence and skip the reasoning
LLM — the mirror image of the short-circuit miss this PR already fixes.

Review feedback from Greptile on PR Tracer-Cloud#672.
kaushal-bakrania added a commit to kaushal-bakrania/opensre that referenced this pull request Apr 19, 2026
…acer-Cloud#684)

The key is already in _INVESTIGATED_EVIDENCE_KEYS and emitted by the
_map_grafana_alert_rules mapper, but check_evidence_availability's
has_cloudwatch_evidence OR-chain never probed it. A healthy alert whose
only evidence was grafana_alert_rules fell through to
_handle_insufficient_evidence instead of satisfying the evidence gate.

Same drift-bug class as commit 3dbfb52 (EKS) and open PR Tracer-Cloud#672 (AM/CL/HC).
VaibhavUpreti pushed a commit that referenced this pull request Apr 19, 2026
…) (#685)

The key is already in _INVESTIGATED_EVIDENCE_KEYS and emitted by the
_map_grafana_alert_rules mapper, but check_evidence_availability's
has_cloudwatch_evidence OR-chain never probed it. A healthy alert whose
only evidence was grafana_alert_rules fell through to
_handle_insufficient_evidence instead of satisfying the evidence gate.

Same drift-bug class as commit 3dbfb52 (EKS) and open PR #672 (AM/CL/HC).
@VaibhavUpreti
Copy link
Copy Markdown
Member

@copilot resolve the merge conflicts in this pull request

@Devesh36
Copy link
Copy Markdown
Collaborator

Closing in favour of #777 which has been merged and covers this fix as part of a broader whitelist approach. The missing keys (alertmanager, coralogix, honeycomb) are now handled via CLAIM_EVIDENCE_KEYS.

@Devesh36 Devesh36 closed this Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] is_clearly_healthy short-circuit ignores Alertmanager, Coralogix, and Honeycomb evidence keys

3 participants