Skip to content

synthetic: prevent overdiagnosis in noisy-but-healthy scenario (007)#978

Merged
muddlebee merged 2 commits intoTracer-Cloud:mainfrom
cerencamkiran:fix-007-noisy-healthy-validation
Apr 28, 2026
Merged

synthetic: prevent overdiagnosis in noisy-but-healthy scenario (007)#978
muddlebee merged 2 commits intoTracer-Cloud:mainfrom
cerencamkiran:fix-007-noisy-healthy-validation

Conversation

@cerencamkiran
Copy link
Copy Markdown
Collaborator

@cerencamkiran cerencamkiran commented Apr 26, 2026

Fixes #603

Summary

This PR improves synthetic QA behavior for scenario 007 (noisy-but-healthy system) by fixing a core reasoning issue where the agent overdiagnosed failures from normal metric patterns.

Problem

The agent incorrectly inferred root causes such as connection pool leaks or resource exhaustion when observing:

  • oscillating connection counts (55–65% of max)
  • moderate CPU utilization (40–70%)
  • short-lived latency spikes
  • no error logs or failure events

This is a common LLM failure mode: interpreting trends and noise as signals of failure, instead of applying threshold-based reasoning.

As a result, scenario 007 failed due to false positives and hallucinated root causes.

Changes

  • Added explicit Healthy System Detection rules to the database directive
  • Enforced relative (not strict) threshold-based classification:
    • connections not near exhaustion
    • CPU not near saturation
    • no errors → no failure
  • Added scoped prohibitions under healthy-system conditions
  • Enforced early termination when no failure conditions are met
  • Restored stale-alert/autoscaling recovery logic
  • Updated answer.yml to align with evaluator behavior
  • Added QA_VALIDATION.md documenting expected reasoning and failure modes

Result

PASS 007-connection-pressure-noisy-healthy category=healthy

The agent now:

  • correctly classifies the system as healthy
  • avoids hallucinated root causes
  • distinguishes noise vs real failure
  • stops investigation early when appropriate

Validation

Run:

python -m tests.synthetic.rds_postgres.run_suite --scenario 007-connection-pressure-noisy-healthy --mock-grafana

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

This PR adds explicit healthy-system detection thresholds and strict prohibitions to the database directive in prompt_builder.py to fix over-diagnosis in scenario 007, and ships a new QA_VALIDATION.md for documentation. The fix correctly addresses the noisy-but-healthy scenario, but two changes raise regression risk against the broader scenario suite:

  • The STRICT PROHIBITIONS block uses all-caps section-header formatting that LLMs are likely to treat as globally scoped, potentially suppressing valid connection-leak/resource-exhaustion diagnoses in scenarios like 002 where connections are genuinely near 100%.
  • The stale-alert / autoscaling recovery guidance ("threshold was briefly crossed but autoscaling recovered…") was silently removed, which is specifically required by scenario 013's required_keywords (recovered, autoscal).

Confidence Score: 3/5

Two P1 findings — ambiguous prohibition scope and missing stale-alert guidance — risk breaking existing passing scenarios before this can safely merge.

Both P1 findings are present-defect concerns: the unconditional formatting of STRICT PROHIBITIONS could suppress correct diagnoses in other scenarios, and the removed autoscaling recovery guidance could break scenario 013's keyword checks. These are existing, passing parts of the synthetic test suite that this PR could inadvertently regress.

app/nodes/root_cause_diagnosis/prompt_builder.py — specifically the scoping of STRICT PROHIBITIONS (lines 251–263) and the removed stale-alert guidance (line 238).

Important Files Changed

Filename Overview
app/nodes/root_cause_diagnosis/prompt_builder.py Expanded healthy-system detection rules with explicit thresholds and STRICT PROHIBITIONS; the prohibitions' formatting may make them appear globally scoped, risking regression on connection-exhaustion and similar scenarios; the stale-alert/autoscaling recovery guidance was also silently dropped.
tests/synthetic/rds_postgres/007-connection-pressure-noisy-healthy/answer.yml required_keywords reduced from ["operating bounds","no failure"] to ["healthy"], making the evaluator keyword check trivially satisfied by the category field alone.
tests/synthetic/rds_postgres/007-connection-pressure-noisy-healthy/QA_VALIDATION.md New documentation file clearly describing expected agent reasoning, failure modes, and reviewer checklist for scenario 007; no issues found.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Database evidence detected] --> B{Is it a DB incident?}
    B -- No --> Z[Skip directive]
    B -- Yes --> C[Apply database directive]
    C --> D{ALL healthy conditions met?\n- connections < 70%\n- CPU not sustained > 80-85%\n- no errors}
    D -- Yes --> E[Classify: healthy\nStop investigation early]
    D -- No --> F{Check specific failure pattern}
    F --> G[connections near 100%?\n→ resource_exhaustion\nconnection pool leak]
    F --> H[FreeStorageSpace → 0?\n→ resource_exhaustion\nstorage exhaustion]
    F --> I[CPU near 100% + bad query?\n→ resource_exhaustion\nCPU saturation]
    F --> J[VACUUM/checkpoint storm?\n→ resource_exhaustion\nI/O saturation]
    F --> K[Threshold briefly crossed\nbut autoscaling recovered?\n→ healthy stale alert]
    style E fill:#90ee90
    style K stroke:#ff8c00,stroke-width:2px,stroke-dasharray:5
Loading

Reviews (1): Last reviewed commit: "fix: prevent overdiagnosis in healthy/no..." | Re-trigger Greptile

Comment on lines +251 to +263
STRICT PROHIBITIONS:
- Do NOT infer connection pool leaks
- Do NOT infer resource exhaustion
- Do NOT generate speculative root causes
- Do NOT interpret monotonic increase or oscillation as failure

IMPORTANT:
- Trend ≠ failure
- Oscillation ≠ instability
- Moderate utilization ≠ degradation

If an alert fires without errors or threshold breaches:
- treat it as a noisy or warning-level alert, NOT a real incident
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 STRICT PROHIBITIONS scope is ambiguous — may suppress valid diagnoses

The STRICT PROHIBITIONS and IMPORTANT: blocks are visually formatted as section-level headers (no bullet prefix, ALL-CAPS), so an LLM is likely to interpret them as global rules rather than conditionals scoped to the preceding If ALL of the following are true: block. "Do NOT infer connection pool leaks" and "Do NOT infer resource exhaustion" applied globally would directly contradict the positive guidance at lines 231–232 and could cause scenarios like 002-connection-exhaustion (connections at 490/500 max) to mis-classify as healthy.

Scoping the prohibitions explicitly fixes this — e.g., nest them under the conditional block or add a preamble like "When the above healthy conditions are ALL met: ..."

- Compositional Faults: If two completely independent workloads cause two separate faults simultaneously (e.g., CPU saturation from an analytics SELECT AND storage exhaustion from an audit_log INSERT), explicitly identify BOTH as independent root causes. Use `resource_exhaustion` as ROOT_CAUSE_CATEGORY and describe both causes clearly in ROOT_CAUSE (e.g., "Two independent root causes: ..."). Trace each causal chain separately in CAUSAL_CHAIN. Connection spikes and ReplicaLag are often just downstream symptoms of the blocked writers.
- Misleading Context: Check RDS event timestamps carefully! Ignore historical events (maintenance, failovers, replica promotions) that completed hours before the current incident started.
- Healthy Systems / Stale Alerts: If metrics are oscillating but remain within normal operating bounds (e.g. connections at 55-65%, CPU at 40-70%, no error logs), the system is `healthy`. If a threshold was briefly crossed (e.g. low FreeStorageSpace) but autoscaling successfully expanded the volume and fully recovered the system before the investigation, the system is `healthy` and the alert is stale.
- Healthy Systems / Stale Alerts: If metrics are oscillating but remain within normal operating bounds (e.g. connections at 55-65%, CPU at 40-70%, no error logs), the system is `healthy`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Stale-alert / autoscaling recovery guidance was removed — risks breaking scenario 013

The original line also contained: "If a threshold was briefly crossed (e.g. low FreeStorageSpace) but autoscaling successfully expanded the volume and fully recovered the system before the investigation, the system is healthy and the alert is stale." This was the only prompt-level hint guiding the agent to reason about autoscaling-driven recovery as a stale-alert pattern. Scenario 013 (answer.yml) requires the response to include the keywords recovered and autoscal — without this guidance the agent may correctly output healthy via the new Detection block but omit the storage-autoscaling reasoning, failing those keyword checks.

Comment on lines +2 to +3
required_keywords:
- operating bounds
- no failure
- healthy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 required_keywords reduced to a trivially satisfied value

healthy will always appear in ROOT_CAUSE_CATEGORY: healthy, making the keyword check redundant with the category check. Other scenarios use more discriminative phrases (e.g., "no failure", "normal bounds", "operating bounds") that verify the agent expressed the right reasoning, not just produced the right category label. Consider retaining at least one reasoning-level keyword (e.g., "no failure", "normal", or "operating bounds") alongside "healthy" to preserve meaningful test coverage.

@cerencamkiran
Copy link
Copy Markdown
Collaborator Author

BEFORE
Ekran görüntüsü 2026-04-26 170746
Ekran görüntüsü 2026-04-26 170757

@cerencamkiran
Copy link
Copy Markdown
Collaborator Author

AFTER
Ekran görüntüsü 2026-04-26 190902
Ekran görüntüsü 2026-04-26 190926

@cerencamkiran
Copy link
Copy Markdown
Collaborator Author

hey @muddlebee, i guess everthing is ok

@muddlebee muddlebee merged commit a26d2fb into Tracer-Cloud:main Apr 28, 2026
7 checks passed
@github-actions
Copy link
Copy Markdown
Contributor

🎯 Bullseye. @cerencamkiran opened a PR, kept the vibes clean, and got it merged. Absolute cinema. 🎬


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[synthetic-qa] 007-connection-pressure-noisy-healthy: Validate agent resists diagnosing noisy-but-healthy system

3 participants