synthetic: tighten validation and fix RCA reasoning for replication lag red-herring scenario (006) #843
Greptile Summary
This PR tightens scenario 006 validation (replication lag + CPU red herring) by refining
Confidence Score: 4/5
Safe to merge once
One P1 finding: the PR's stated goal of failing incorrect multi-root-cause reasoning is not actually enforced because
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[run_suite.py evaluator] --> B{Check required_keywords}
B -->|pass| C{Check forbidden_categories}
C -->|pass| D{Check forbidden_keywords}
D -->|pass| E[PASS]
B -->|fail| F[FAIL: missing keywords]
C -->|fail| G[FAIL: forbidden category]
D -->|fail| H[FAIL: forbidden keyword found]
subgraph answer_yml["answer.yml (006)"]
K[required_keywords ✅ enforced]
L[forbidden_categories ✅ enforced]
M[forbidden_phrases ❌ silently ignored]
N[forbidden_keywords — field exists but unused in 006]
end
K -.-> B
L -.-> C
M -. never parsed .-> X((dead code))
N -.-> D
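The check order in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the actual `run_suite.py` code: the `AnswerKey` class, the `evaluate` function, and the case-insensitive substring matching are all assumptions made for the example. Only the three field names and the failure messages come from the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class AnswerKey:
    # Hypothetical stand-in for the real answer key; only these three
    # field names are taken from the flowchart above.
    required_keywords: list[str] = field(default_factory=list)
    forbidden_categories: list[str] = field(default_factory=list)
    forbidden_keywords: list[str] = field(default_factory=list)


def evaluate(answer: str, category: str, key: AnswerKey) -> str:
    """Run the three checks in the order shown in the flowchart."""
    text = answer.lower()

    # 1. Every required keyword must appear somewhere in the answer.
    missing = [k for k in key.required_keywords if k.lower() not in text]
    if missing:
        return "FAIL: missing keywords"

    # 2. The predicted root-cause category must not be a forbidden one.
    if category in key.forbidden_categories:
        return "FAIL: forbidden category"

    # 3. No forbidden keyword may appear in the answer.
    if any(k.lower() in text for k in key.forbidden_keywords):
        return "FAIL: forbidden keyword found"

    return "PASS"
```

Because no step consults `forbidden_phrases`, an answer containing "two root causes" still reaches `PASS` under this model whenever `forbidden_keywords` is empty, which matches the "dead code" node in the diagram.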
Reviews (1): Last reviewed commit: "synthetic: tighten validation for replic..."
forbidden_phrases:
  - two root causes
  - multiple root causes
  - independent root causes
  - second root cause
  - contributing cause
forbidden_phrases is never read or enforced
ScenarioAnswerKey (in scenario_loader.py) has no forbidden_phrases field, _parse_answer_yaml() never reads it, AnswerKeySchema in schemas.py doesn't declare it, and run_suite.py's evaluation logic has no corresponding check. The YAML key is silently dropped on load, so phrases like "two root causes" and "contributing cause" will never actually fail a run — the central goal of this PR's "validation tightening" is not enforced at runtime.
The existing forbidden_keywords field is wired end-to-end (TypedDict → dataclass → parser → evaluator). These phrases should either be moved to forbidden_keywords, or forbidden_phrases needs to be integrated into schemas.py, scenario_loader.py, and run_suite.py.
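As a rough sketch of the second option, wiring `forbidden_phrases` end-to-end might look like the following. The real `ScenarioAnswerKey`, `_parse_answer_yaml()`, and evaluator are not shown in this thread, so the reduced dataclass, `parse_answer`, and `find_forbidden_phrases` here are hypothetical names, and the case-insensitive, whitespace-normalized substring match is an assumed matching rule.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioAnswerKey:
    # Hypothetical reduced version; the real dataclass has other fields.
    forbidden_keywords: list[str] = field(default_factory=list)
    forbidden_phrases: list[str] = field(default_factory=list)  # proposed field


def parse_answer(raw: dict) -> ScenarioAnswerKey:
    """Build the key from an already-loaded answer.yml mapping.

    Both keys are read explicitly so neither is silently dropped.
    """
    return ScenarioAnswerKey(
        forbidden_keywords=list(raw.get("forbidden_keywords") or []),
        forbidden_phrases=list(raw.get("forbidden_phrases") or []),
    )


def find_forbidden_phrases(answer: str, key: ScenarioAnswerKey) -> list[str]:
    """Return the forbidden phrases found in the answer (assumed rule:
    case-insensitive substring match after collapsing whitespace)."""
    text = " ".join(answer.lower().split())
    return [p for p in key.forbidden_phrases if p.lower() in text]
```

An evaluator could then fail the run whenever `find_forbidden_phrases` returns a non-empty list, alongside the existing `forbidden_keywords` check.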
@cerencamkiran pls fix the conflicts!
Conflicts are fixed now @muddlebee 👍
🧑‍💻 @cerencamkiran has entered the contributor hall of fame. Merged. Done. Shipped. Go touch grass (then come back with another PR). 🌱 👋 Join us on Discord - OpenSRE: hang out, contribute, or hunt for features and issues. Everyone's welcome.
congrats 😆

Fixes #602
Summary
Improves both evaluation quality and RCA reasoning for scenario 006 (replication lag with CPU red herring).
Motivation
Previously, the agent could pass this scenario while incorrectly describing CPU as a second independent or contributing root cause. This violated the scenario’s intended behavior, where CPU is explicitly a red herring and causally unrelated to replication lag.
This PR ensures that:
Changes
1. Validation tightening
required_keywords to focus on core reasoning signals instead of rigid phrasing
QA_VALIDATION.md to clearly define:
2. Prompt improvement
Updated the replication lag directive in prompt_builder.py to:
Result
The agent now:
Validation now:
Repro