synthetic: tighten validation and fix RCA reasoning for replication lag red-herring scenario (006)#843

Merged
muddlebee merged 3 commits into Tracer-Cloud:main from cerencamkiran:fix-006-red-herring-validation
Apr 28, 2026

Conversation

@cerencamkiran
Contributor

Fixes #602

Summary

Improves both evaluation quality and RCA reasoning for scenario 006 (replication lag with CPU red herring).

Motivation

Previously, the agent could pass this scenario while incorrectly describing CPU as a second independent or contributing root cause. This violated the scenario’s intended behavior, where CPU is explicitly a red herring and causally unrelated to replication lag.

This PR ensures that:

  • incorrect multi-root-cause reasoning fails validation
  • correct causal reasoning (single root cause + red herring) consistently passes

Changes

1. Validation tightening

  • Refined required_keywords to focus on core reasoning signals instead of rigid phrasing
  • Removed overly strict keywords that caused false negatives
  • Added QA_VALIDATION.md to clearly define:
    • expected reasoning
    • strict failure modes
    • human review expectations
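
As a rough illustration of how keyword-based validation of this kind can work, the sketch below checks an RCA answer against a required list and a forbidden list. The names `KeywordResult` and `evaluate_answer`, the case-insensitive substring matching, and the sample keywords in the usage note are assumptions for illustration; they are not the actual logic in run_suite.py or the actual contents of answer.yml.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KeywordResult:
    """Outcome of a keyword check (hypothetical structure)."""

    passed: bool
    missing: list[str] = field(default_factory=list)
    forbidden_hits: list[str] = field(default_factory=list)


def evaluate_answer(
    answer: str,
    required_keywords: list[str],
    forbidden_keywords: list[str],
) -> KeywordResult:
    """Pass only if every required keyword appears and no forbidden one does.

    Matching is a plain case-insensitive substring test, which is an
    assumption made for this sketch.
    """
    text = answer.lower()
    missing = [k for k in required_keywords if k.lower() not in text]
    hits = [k for k in forbidden_keywords if k.lower() in text]
    return KeywordResult(
        passed=not missing and not hits,
        missing=missing,
        forbidden_hits=hits,
    )
```

With hypothetical required keywords such as "replication lag", "wal", and "red herring", an answer that calls CPU "a contributing cause" would fail twice: once for the missing "red herring" and once for the forbidden hit. Plain substring matching also rejects negated wording such as "CPU is not a contributing cause", which is one reason human review expectations are worth documenting alongside the keyword lists.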

2. Prompt improvement

Updated the replication lag directive in prompt_builder.py to:

  • enforce single root cause identification
  • explicitly classify unrelated CPU signals as red herrings
  • prevent treating independent signals as:
    • second root causes
    • contributing causes

Result

  • The agent now:

    • correctly identifies WAL-driven replication lag as the only root cause
    • explicitly labels CPU as a red herring
    • separates independent workloads (UPDATE vs SELECT)
  • Validation now:

    • fails incorrect causal reasoning
    • passes only when reasoning matches scenario intent

Repro

python -m tests.synthetic.rds_postgres.run_suite --scenario 006-replication-lag-cpu-redherring --mock-grafana

@greptile-apps
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR tightens scenario 006 validation (replication lag + CPU red herring) by refining required_keywords, adding a forbidden_phrases list in answer.yml, updating the LLM prompt to explicitly classify CPU as a red herring, and adding QA_VALIDATION.md. The prompt change and keyword additions look correct, but the new forbidden_phrases key is not wired into the evaluation pipeline and will be silently ignored at runtime.

  • forbidden_phrases in answer.yml is never parsed or checked: ScenarioAnswerKey, AnswerKeySchema, _parse_answer_yaml(), and run_suite.py's evaluator all have no knowledge of this field. The intended failure modes ("two root causes", "contributing cause", etc.) are unenforced. The existing forbidden_keywords field is the wired-up equivalent and should be used instead (or forbidden_phrases must be added to all four touch-points).
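
One way a silent drop like this could be surfaced early is to reject unrecognised keys when the answer file is loaded. The sketch below is only an illustration: `KNOWN_KEYS` and `check_known_keys` are hypothetical, and the key set contains just the three fields named in this review, whereas the real schema in schemas.py and scenario_loader.py very likely declares more.

```python
from __future__ import annotations

# Hypothetical allow-list: only the fields mentioned in this review.
KNOWN_KEYS = frozenset(
    {"required_keywords", "forbidden_categories", "forbidden_keywords"}
)


def check_known_keys(raw: dict, known: frozenset = KNOWN_KEYS) -> None:
    """Raise if the loaded answer mapping contains keys the loader ignores."""
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(
            "unrecognised answer.yml keys: " + ", ".join(unknown)
        )


if __name__ == "__main__":
    loaded = {
        "required_keywords": ["replication lag"],
        "forbidden_phrases": ["two root causes"],  # not in the allow-list
    }
    try:
        check_known_keys(loaded)
    except ValueError as exc:
        print(exc)
```

A guard of this shape turns a typo or an unwired field into a load-time failure instead of a check that never runs.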

Confidence Score: 4/5

Safe to merge once forbidden_phrases is replaced with (or wired to) forbidden_keywords; the prompt improvement is solid.

One P1 finding: the PR's stated goal of failing incorrect multi-root-cause reasoning is not actually enforced because forbidden_phrases is silently ignored. The fix is straightforward (rename to forbidden_keywords or add the field to the data pipeline), but until it's in place the new validation rules are entirely dead code.

Files needing attention: answer.yml (forbidden_phrases not enforced), plus the three support files that need updating if a new field is introduced: schemas.py, scenario_loader.py, and run_suite.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/synthetic/rds_postgres/006-replication-lag-cpu-redherring/answer.yml | Adds a forbidden_phrases key that is never read by the validation engine, so the core enforcement goal of the PR is silently a no-op; also strengthens required_keywords and adds documentation. |
| app/nodes/root_cause_diagnosis/prompt_builder.py | Replication lag directive updated to explicitly label CPU as a red herring and block multi-root-cause reasoning; change is clear and internally consistent with the compositional faults directive. |
| tests/synthetic/rds_postgres/006-replication-lag-cpu-redherring/QA_VALIDATION.md | New documentation file correctly captures expected behaviour, failure modes, and passing criteria for scenario 006; content is accurate and well-structured. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_suite.py evaluator] --> B{Check required_keywords}
    B -->|pass| C{Check forbidden_categories}
    C -->|pass| D{Check forbidden_keywords}
    D -->|pass| E[PASS]
    B -->|fail| F[FAIL: missing keywords]
    C -->|fail| G[FAIL: forbidden category]
    D -->|fail| H[FAIL: forbidden keyword found]

    subgraph answer_yml["answer.yml (006)"]
        K[required_keywords ✅ enforced]
        L[forbidden_categories ✅ enforced]
        M[forbidden_phrases ❌ silently ignored]
        N[forbidden_keywords — field exists but unused in 006]
    end

    K -.-> B
    L -.-> C
    M -. never parsed .-> X((dead code))
    N -.-> D

Reviews (1): Last reviewed commit: "synthetic: tighten validation for replic..."

Comment on lines +12 to +17
forbidden_phrases:
- two root causes
- multiple root causes
- independent root causes
- second root cause
- contributing cause

P1 forbidden_phrases is never read or enforced

ScenarioAnswerKey (in scenario_loader.py) has no forbidden_phrases field, _parse_answer_yaml() never reads it, AnswerKeySchema in schemas.py doesn't declare it, and run_suite.py's evaluation logic has no corresponding check. The YAML key is silently dropped on load, so phrases like "two root causes" and "contributing cause" will never actually fail a run — the central goal of this PR's "validation tightening" is not enforced at runtime.

The existing forbidden_keywords field is wired end-to-end (TypedDict → dataclass → parser → evaluator). These phrases should either be moved to forbidden_keywords, or forbidden_phrases needs to be integrated into schemas.py, scenario_loader.py, and run_suite.py.
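
For the second option, the four touch-points could be extended roughly as sketched below. This is a minimal, self-contained illustration under stated assumptions: the class names mirror those mentioned above, but their real definitions are not shown in this thread; `parse_answer_dict` is a hypothetical stand-in for _parse_answer_yaml() that takes an already-loaded mapping; and `find_forbidden_phrases` is a hypothetical evaluator helper.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TypedDict


class AnswerKeySchema(TypedDict, total=False):
    """Shape of the raw answer mapping (assumed, trimmed to three fields)."""

    required_keywords: list[str]
    forbidden_keywords: list[str]
    forbidden_phrases: list[str]  # touch-point 1: declare the new field


@dataclass(frozen=True)
class ScenarioAnswerKey:
    required_keywords: tuple[str, ...] = ()
    forbidden_keywords: tuple[str, ...] = ()
    forbidden_phrases: tuple[str, ...] = ()  # touch-point 2: carry it


def parse_answer_dict(raw: AnswerKeySchema) -> ScenarioAnswerKey:
    """Touch-point 3: read the new key instead of dropping it."""
    return ScenarioAnswerKey(
        required_keywords=tuple(raw.get("required_keywords", [])),
        forbidden_keywords=tuple(raw.get("forbidden_keywords", [])),
        forbidden_phrases=tuple(raw.get("forbidden_phrases", [])),
    )


def find_forbidden_phrases(answer: str, key: ScenarioAnswerKey) -> list[str]:
    """Touch-point 4: return every forbidden phrase present in the answer.

    Whitespace is collapsed and case is ignored so that line wraps or
    double spaces do not hide a phrase.
    """
    text = " ".join(answer.lower().split())
    return [p for p in key.forbidden_phrases if p.lower() in text]
```

An evaluator would then fail the run whenever `find_forbidden_phrases` returns a non-empty list, in the same way a forbidden keyword hit fails it today. Moving the phrases into forbidden_keywords avoids all of this plumbing, which is why it is the smaller change.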

@cerencamkiran
Contributor Author

BEFORE
Screenshot 2026-04-24 171524
Screenshot 2026-04-24 171543

@cerencamkiran
Contributor Author

AFTER
Screenshot 2026-04-24 190433
Screenshot 2026-04-24 190455

@muddlebee
Collaborator

@cerencamkiran pls fix the conflicts!

@cerencamkiran
Contributor Author

Conflicts are fixed now @muddlebee👍

@muddlebee merged commit 09dc355 into Tracer-Cloud:main on Apr 28, 2026
7 checks passed
@github-actions
Contributor

🧑‍💻 @cerencamkiran has entered the contributor hall of fame. Merged. Done. Shipped. Go touch grass (then come back with another PR). 🌱


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

@muddlebee
Collaborator

contributor hall of fame.

congrats 😆

Development

Successfully merging this pull request may close these issues.

[synthetic-qa] 006-replication-lag-cpu-redherring: Validate agent ignores CPU red herring