synthetic: tighten validation and fix RCA reasoning for replication lag red-herring scenario (006) by cerencamkiran · Pull Request #843 · Tracer-Cloud/opensre

cerencamkiran · 2026-04-24T15:41:01Z

Fixes #602

Summary

Improves both evaluation quality and RCA reasoning for scenario 006 (replication lag with CPU red herring).

Motivation

Previously, the agent could pass this scenario while incorrectly describing CPU as a second independent or contributing root cause. This violated the scenario’s intended behavior, where CPU is explicitly a red herring and causally unrelated to replication lag.

This PR ensures that:

- incorrect multi-root-cause reasoning fails validation
- correct causal reasoning (single root cause + red herring) consistently passes

Changes

1. Validation tightening

- Refined required_keywords to focus on core reasoning signals instead of rigid phrasing
- Removed overly strict keywords that caused false negatives
- Added QA_VALIDATION.md to clearly define:
  - expected reasoning
  - strict failure modes
  - human review expectations

2. Prompt improvement

Updated the replication lag directive in prompt_builder.py to:

- enforce single root cause identification
- explicitly classify unrelated CPU signals as red herrings
- prevent treating independent signals as:
  - second root causes
  - contributing causes
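
As a rough illustration of the directive described above, the sketch below assembles the three rules into a plain instruction string. The function name `build_replication_lag_directive` and the exact wording are assumptions made for illustration; the actual text in prompt_builder.py is not quoted in this description.

```python
def build_replication_lag_directive() -> str:
    """Hypothetical helper: returns directive text for replication-lag scenarios."""
    rules = [
        # Rule 1: a single root cause only.
        "Identify exactly one root cause for the replication lag.",
        # Rule 2: unrelated CPU signals are classified, not explained away.
        "If a CPU signal is causally unrelated to the lag, label it explicitly as a red herring.",
        # Rule 3: independent signals must not be promoted to causes.
        "Do not describe an independent signal as a second root cause or as a contributing cause.",
    ]
    header = "Replication lag directive:"
    return "\n".join([header] + [f"- {rule}" for rule in rules])


if __name__ == "__main__":
    print(build_replication_lag_directive())
```

Keeping the rules in a list makes each constraint easy to review on its own, which fits the goal of blocking "second root cause" and "contributing cause" framings separately.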

Result

The agent now:
- correctly identifies WAL-driven replication lag as the only root cause
- explicitly labels CPU as a red herring
- separates independent workloads (UPDATE vs SELECT)

Validation now:
- fails incorrect causal reasoning
- passes only when reasoning matches scenario intent

Repro

python -m tests.synthetic.rds_postgres.run_suite --scenario 006-replication-lag-cpu-redherring --mock-grafana

greptile-apps · 2026-04-24T15:42:59Z

Greptile Summary

This PR tightens scenario 006 validation (replication lag + CPU red herring) by refining required_keywords, adding a forbidden_phrases list in answer.yml, updating the LLM prompt to explicitly classify CPU as a red herring, and adding QA_VALIDATION.md. The prompt change and keyword additions look correct, but the new forbidden_phrases key is not wired into the evaluation pipeline and will be silently ignored at runtime.

forbidden_phrases in answer.yml is never parsed or checked: ScenarioAnswerKey, AnswerKeySchema, _parse_answer_yaml(), and run_suite.py's evaluator all have no knowledge of this field. The intended failure modes ("two root causes", "contributing cause", etc.) are unenforced. The existing forbidden_keywords field is the wired-up equivalent and should be used instead (or forbidden_phrases must be added to all four touch-points).

Confidence Score: 4/5

Safe to merge once forbidden_phrases is replaced with (or wired to) forbidden_keywords; the prompt improvement is solid.

One P1 finding: the PR's stated goal of failing incorrect multi-root-cause reasoning is not actually enforced because forbidden_phrases is silently ignored. The fix is straightforward (rename to forbidden_keywords or add the field to the data pipeline), but until it's in place the new validation rules are entirely dead code.

Files to check: answer.yml (forbidden_phrases not enforced), and the three support files that hold the four touch-points needing updates if a new field is introduced: schemas.py, scenario_loader.py, run_suite.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/synthetic/rds_postgres/006-replication-lag-cpu-redherring/answer.yml | Adds `forbidden_phrases` key that is never read by the validation engine, so the core enforcement goal of the PR is silently a no-op; also strengthens `required_keywords` and adds documentation. |
| app/nodes/root_cause_diagnosis/prompt_builder.py | Replication lag directive updated to explicitly label CPU as a red herring and block multi-root-cause reasoning; change is clear and internally consistent with the compositional faults directive. |
| tests/synthetic/rds_postgres/006-replication-lag-cpu-redherring/QA_VALIDATION.md | New documentation file correctly captures expected behaviour, failure modes, and passing criteria for scenario 006; content is accurate and well-structured. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_suite.py evaluator] --> B{Check required_keywords}
    B -->|pass| C{Check forbidden_categories}
    C -->|pass| D{Check forbidden_keywords}
    D -->|pass| E[PASS]
    B -->|fail| F[FAIL: missing keywords]
    C -->|fail| G[FAIL: forbidden category]
    D -->|fail| H[FAIL: forbidden keyword found]

    subgraph answer_yml["answer.yml (006)"]
        K[required_keywords ✅ enforced]
        L[forbidden_categories ✅ enforced]
        M[forbidden_phrases ❌ silently ignored]
        N[forbidden_keywords — field exists but unused in 006]
    end

    K -.-> B
    L -.-> C
    M -. never parsed .-> X((dead code))
    N -.-> D
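
The check order in the flowchart can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code in run_suite.py: the `AnswerKey` class, the `evaluate` function, the case-insensitive substring matching, and the `cpu_saturation` category value are all hypothetical. Only the three field names and the order of the checks come from the flowchart.

```python
from dataclasses import dataclass, field


@dataclass
class AnswerKey:
    """Hypothetical stand-in for the scenario answer key."""
    required_keywords: list[str] = field(default_factory=list)
    forbidden_categories: list[str] = field(default_factory=list)
    forbidden_keywords: list[str] = field(default_factory=list)


def evaluate(answer_text: str, category: str, key: AnswerKey) -> tuple[bool, str]:
    """Apply the checks in flowchart order and stop at the first failure."""
    text = answer_text.lower()

    # 1. Every required keyword must appear in the answer.
    missing = [kw for kw in key.required_keywords if kw.lower() not in text]
    if missing:
        return False, f"missing keywords: {missing}"

    # 2. The predicted category must not be a forbidden one.
    if category in key.forbidden_categories:
        return False, f"forbidden category: {category}"

    # 3. No forbidden keyword may appear in the answer.
    found = [kw for kw in key.forbidden_keywords if kw.lower() in text]
    if found:
        return False, f"forbidden keyword found: {found}"

    return True, "PASS"


if __name__ == "__main__":
    key = AnswerKey(
        required_keywords=["replication lag", "red herring"],
        forbidden_categories=["cpu_saturation"],  # assumed category name
        forbidden_keywords=["second root cause", "contributing cause"],
    )
    good = "WAL-driven replication lag is the only root cause; the CPU spike is a red herring."
    bad = "Replication lag is one root cause; CPU is a contributing cause, not a red herring."
    print(evaluate(good, "replication_lag", key))
    print(evaluate(bad, "replication_lag", key))
```

Note that plain substring matching of this kind would also flag a negated phrase such as "not a contributing cause"; whether the real evaluator behaves that way is not stated in this thread.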

Reviews (1): Last reviewed commit: "synthetic: tighten validation for replic..."

greptile-apps · 2026-04-24T15:43:03Z

+forbidden_phrases:
+  - two root causes
+  - multiple root causes
+  - independent root causes
+  - second root cause
+  - contributing cause


forbidden_phrases is never read or enforced

ScenarioAnswerKey (in scenario_loader.py) has no forbidden_phrases field, _parse_answer_yaml() never reads it, AnswerKeySchema in schemas.py doesn't declare it, and run_suite.py's evaluation logic has no corresponding check. The YAML key is silently dropped on load, so phrases like "two root causes" and "contributing cause" will never actually fail a run — the central goal of this PR's "validation tightening" is not enforced at runtime.

The existing forbidden_keywords field is wired end-to-end (TypedDict → dataclass → parser → evaluator). These phrases should either be moved to forbidden_keywords, or forbidden_phrases needs to be integrated into schemas.py, scenario_loader.py, and run_suite.py.
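
To make the failure mode concrete, here is a small sketch of how a parser that reads only known keys drops an unrecognised one without any error. The `AnswerKey` dataclass and `parse_answer` function are hypothetical stand-ins; the real ScenarioAnswerKey and _parse_answer_yaml() are not shown in this thread, and the input is a plain dict standing in for the loaded YAML.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AnswerKey:
    """Hypothetical stand-in for the parsed answer key."""
    required_keywords: list[str] = field(default_factory=list)
    forbidden_keywords: list[str] = field(default_factory=list)


def parse_answer(raw: dict[str, Any]) -> AnswerKey:
    """Read only the keys the dataclass knows about; everything else is ignored."""
    return AnswerKey(
        required_keywords=list(raw.get("required_keywords", [])),
        forbidden_keywords=list(raw.get("forbidden_keywords", [])),
    )


if __name__ == "__main__":
    phrases = ["two root causes", "contributing cause"]

    # Unknown key: the phrases never reach the evaluator.
    dropped = parse_answer({"required_keywords": ["replication lag"],
                            "forbidden_phrases": phrases})
    print(dropped.forbidden_keywords)

    # Known key: the same phrases are carried through.
    wired = parse_answer({"required_keywords": ["replication lag"],
                          "forbidden_keywords": phrases})
    print(wired.forbidden_keywords)
```

Under these assumptions, renaming the YAML key is enough to get the phrases enforced, which matches the first of the two options suggested above.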

cerencamkiran · 2026-04-24T16:05:51Z

BEFORE

cerencamkiran · 2026-04-24T16:06:31Z

AFTER

muddlebee · 2026-04-27T15:38:04Z

@cerencamkiran pls fix the conflicts!

cerencamkiran · 2026-04-28T17:05:13Z

Conflicts are fixed now @muddlebee👍

github-actions · 2026-04-28T17:08:00Z

🧑‍💻 @cerencamkiran has entered the contributor hall of fame. Merged. Done. Shipped. Go touch grass (then come back with another PR). 🌱

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

muddlebee · 2026-04-28T17:08:34Z

contributor hall of fame.

congrats 😆

synthetic: tighten validation for replication lag red herring

bf048ad

greptile-apps Bot reviewed Apr 24, 2026


fix: enforce multi-root-cause failure via forbidden_keywords

9d869cb

Merge branch 'main' into fix-006-red-herring-validation

beabc6b

muddlebee merged commit 09dc355 into Tracer-Cloud:main Apr 28, 2026
7 checks passed

muddlebee merged 3 commits into Tracer-Cloud:main from cerencamkiran:fix-006-red-herring-validation
