synthetic: prevent overdiagnosis in noisy-but-healthy scenario (007)#978
Conversation
Greptile SummaryThis PR adds explicit healthy-system detection thresholds and strict prohibitions to the database directive in
Confidence Score: 3/5Two P1 findings — ambiguous prohibition scope and missing stale-alert guidance — risk breaking existing passing scenarios before this can safely merge. Both P1 findings are present-defect concerns: the unconditional formatting of STRICT PROHIBITIONS could suppress correct diagnoses in other scenarios, and the removed autoscaling recovery guidance could break scenario 013's keyword checks. These are existing, passing parts of the synthetic test suite that this PR could inadvertently regress. app/nodes/root_cause_diagnosis/prompt_builder.py — specifically the scoping of STRICT PROHIBITIONS (lines 251–263) and the removed stale-alert guidance (line 238). Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Database evidence detected] --> B{Is it a DB incident?}
B -- No --> Z[Skip directive]
B -- Yes --> C[Apply database directive]
C --> D{ALL healthy conditions met?\n- connections < 70%\n- CPU not sustained > 80-85%\n- no errors}
D -- Yes --> E[Classify: healthy\nStop investigation early]
D -- No --> F{Check specific failure pattern}
F --> G[connections near 100%?\n→ resource_exhaustion\nconnection pool leak]
F --> H[FreeStorageSpace → 0?\n→ resource_exhaustion\nstorage exhaustion]
F --> I[CPU near 100% + bad query?\n→ resource_exhaustion\nCPU saturation]
F --> J[VACUUM/checkpoint storm?\n→ resource_exhaustion\nI/O saturation]
F --> K[Threshold briefly crossed\nbut autoscaling recovered?\n→ healthy stale alert]
style E fill:#90ee90
style K stroke:#ff8c00,stroke-width:2px,stroke-dasharray:5
Reviews (1): Last reviewed commit: "fix: prevent overdiagnosis in healthy/no..." | Re-trigger Greptile |
| STRICT PROHIBITIONS: | ||
| - Do NOT infer connection pool leaks | ||
| - Do NOT infer resource exhaustion | ||
| - Do NOT generate speculative root causes | ||
| - Do NOT interpret monotonic increase or oscillation as failure | ||
|
|
||
| IMPORTANT: | ||
| - Trend ≠ failure | ||
| - Oscillation ≠ instability | ||
| - Moderate utilization ≠ degradation | ||
|
|
||
| If an alert fires without errors or threshold breaches: | ||
| - treat it as a noisy or warning-level alert, NOT a real incident |
There was a problem hiding this comment.
STRICT PROHIBITIONS scope is ambiguous — may suppress valid diagnoses
The STRICT PROHIBITIONS and IMPORTANT: blocks are visually formatted as section-level headers (no bullet prefix, ALL-CAPS), so an LLM is likely to interpret them as global rules rather than conditionals scoped to the preceding If ALL of the following are true: block. "Do NOT infer connection pool leaks" and "Do NOT infer resource exhaustion" applied globally would directly contradict the positive guidance at lines 231–232 and could cause scenarios like 002-connection-exhaustion (connections at 490/500 max) to mis-classify as healthy.
Scoping the prohibitions explicitly fixes this — e.g., nest them under the conditional block or add a preamble like "When the above healthy conditions are ALL met: ..."
| - Compositional Faults: If two completely independent workloads cause two separate faults simultaneously (e.g., CPU saturation from an analytics SELECT AND storage exhaustion from an audit_log INSERT), explicitly identify BOTH as independent root causes. Use `resource_exhaustion` as ROOT_CAUSE_CATEGORY and describe both causes clearly in ROOT_CAUSE (e.g., "Two independent root causes: ..."). Trace each causal chain separately in CAUSAL_CHAIN. Connection spikes and ReplicaLag are often just downstream symptoms of the blocked writers. | ||
| - Misleading Context: Check RDS event timestamps carefully! Ignore historical events (maintenance, failovers, replica promotions) that completed hours before the current incident started. | ||
| - Healthy Systems / Stale Alerts: If metrics are oscillating but remain within normal operating bounds (e.g. connections at 55-65%, CPU at 40-70%, no error logs), the system is `healthy`. If a threshold was briefly crossed (e.g. low FreeStorageSpace) but autoscaling successfully expanded the volume and fully recovered the system before the investigation, the system is `healthy` and the alert is stale. | ||
| - Healthy Systems / Stale Alerts: If metrics are oscillating but remain within normal operating bounds (e.g. connections at 55-65%, CPU at 40-70%, no error logs), the system is `healthy`. |
There was a problem hiding this comment.
Stale-alert / autoscaling recovery guidance was removed — risks breaking scenario 013
The original line also contained: "If a threshold was briefly crossed (e.g. low FreeStorageSpace) but autoscaling successfully expanded the volume and fully recovered the system before the investigation, the system is healthy and the alert is stale." This was the only prompt-level hint guiding the agent to reason about autoscaling-driven recovery as a stale-alert pattern. Scenario 013 (answer.yml) requires the response to include the keywords recovered and autoscal — without this guidance the agent may correctly output healthy via the new Detection block but omit the storage-autoscaling reasoning, failing those keyword checks.
| required_keywords: | ||
| - operating bounds | ||
| - no failure | ||
| - healthy |
There was a problem hiding this comment.
required_keywords reduced to a trivially satisfied value
healthy will always appear in ROOT_CAUSE_CATEGORY: healthy, making the keyword check redundant with the category check. Other scenarios use more discriminative phrases (e.g., "no failure", "normal bounds", "operating bounds") that verify the agent expressed the right reasoning, not just produced the right category label. Consider retaining at least one reasoning-level keyword (e.g., "no failure", "normal", or "operating bounds") alongside "healthy" to preserve meaningful test coverage.
|
hey @muddlebee, i guess everthing is ok |
|
🎯 Bullseye. @cerencamkiran opened a PR, kept the vibes clean, and got it merged. Absolute cinema. 🎬 👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome. |





Fixes #603
Summary
This PR improves synthetic QA behavior for scenario 007 (noisy-but-healthy system) by fixing a core reasoning issue where the agent overdiagnosed failures from normal metric patterns.
Problem
The agent incorrectly inferred root causes such as connection pool leaks or resource exhaustion when observing:
This is a common LLM failure mode: interpreting trends and noise as signals of failure, instead of applying threshold-based reasoning.
As a result, scenario 007 failed due to false positives and hallucinated root causes.
Changes
answer.ymlto align with evaluator behaviorQA_VALIDATION.mddocumenting expected reasoning and failure modesResult
The agent now:
healthyValidation
Run: