fix: improve WAL-driven replication lag RCA (scenario 001) + add QA validation #1214
Conversation
Force-pushed: 72eef1f → 3e022b1
Greptile Summary

This PR tightens the synthetic test for scenario 001. Confidence Score: 5/5. Safe to merge; all findings are P2 design notes with no current runtime breakage. No P0/P1 defects found. The two P2 observations (global token list, text-vs-evidence asymmetry) are limitations of the current architecture rather than bugs that cause wrong pass/fail results today.

Important Files Changed: `tests/synthetic/rds_postgres/run_suite.py`
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[score_result called] --> B{root_cause present?}
    B -- No --> FAIL1[failure: no root cause]
    B -- Yes --> C{category matches?}
    C -- No --> FAIL2[failure: wrong category]
    C -- Yes --> D{missing keywords?}
    D -- Yes --> FAIL3[failure: missing keywords]
    D -- No --> E{forbidden category?}
    E -- Yes --> FAIL4[failure: forbidden category]
    E -- No --> F{forbidden keywords?}
    F -- Yes --> FAIL5[failure: forbidden keywords]
    F -- No --> G{required_evidence_sources?}
    G -- No --> PASS
    G -- Yes --> H{source == aws_performance_insights?}
    H -- Yes --> I{PI tokens in normalized_output?}
    I -- No --> FAIL6[failure: PI evidence not gathered]
    I -- Yes --> J[continue to next source]
    H -- No --> K{evidence dict has state_key?}
    K -- No --> FAIL7[failure: evidence not gathered]
    K -- Yes --> J
    J --> G
    G --> L{failover event reasoning required?}
    L -- Yes --> M{RDS event sequence present?}
    M -- No --> FAIL8[failure: missing event sequence]
    M -- Yes --> PASS[passed = True]
    L -- No --> PASS
```
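The scoring flow in the chart above can be sketched in Python. This is a hypothetical reconstruction for illustration only: the function signature, `EVIDENCE_KEY_MAP`, `PI_TOKENS`, and the answer-key field names are assumptions, not the actual `run_suite.py` code.

```python
# Illustrative sketch of the score_result decision flow shown in the chart.
# All names here (EVIDENCE_KEY_MAP, PI_TOKENS, answer-key fields) are assumed.

EVIDENCE_KEY_MAP = {"aws_cloudwatch_metrics": "grafana_metrics"}
PI_TOKENS = ("db load", "average active sessions")  # assumed PI keyword list

def score_result(answer_key, result, evidence, normalized_output):
    if not result.get("root_cause"):
        return False, "no root cause"
    if result.get("category") != answer_key["category"]:
        return False, "wrong category"
    if any(k not in normalized_output for k in answer_key.get("required_keywords", [])):
        return False, "missing keywords"
    if result.get("category") in answer_key.get("forbidden_categories", []):
        return False, "forbidden category"
    if any(k in normalized_output for k in answer_key.get("forbidden_keywords", [])):
        return False, "forbidden keywords"
    for source in answer_key.get("required_evidence_sources", []):
        if source == "aws_performance_insights":
            # PI is validated against the agent's text output, not state
            if not any(t in normalized_output for t in PI_TOKENS):
                return False, "PI evidence not gathered"
        elif not evidence.get(EVIDENCE_KEY_MAP.get(source, source)):
            return False, "evidence not gathered"
    if answer_key.get("require_failover_reasoning") and "rds event" not in normalized_output:
        return False, "missing event sequence"
    return True, "passed"
```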
Reviews (4). Last reviewed commit: "fix: validate performance insights evide..."
```python
if isinstance(raw, list):
    msg = "Judge response JSON must be an object"
    raise ValueError(msg)
```
Array fence causes early bail-out, skipping valid earlier fences
When iterating fences in reverse, hitting a JSON array immediately raises ValueError without giving the remaining (earlier) fences a chance to return a valid dict. A real LLM response that includes an example array fence after the result fence — e.g. an answer fence followed by a prose explanation with a ["item1"] fence — would incorrectly fail here instead of returning the already-seen correct answer.
The fix is to continue past array fences and only raise after the entire loop is exhausted with no dict found. There is no existing test covering the case: "last fence is array, earlier fence is valid dict".
```diff
-if isinstance(raw, list):
-    msg = "Judge response JSON must be an object"
-    raise ValueError(msg)
+if isinstance(raw, dict):
+    return cast(dict[str, Any], raw)
+# array fence — skip and keep trying earlier fences
+# (do not raise here; a previous fence may contain the answer)
```
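The suggested fix can be sketched as a complete fence-scanning helper. The regex, function name, and fence format are assumptions for illustration; the actual parser may extract fences differently. The point is the control flow: skip non-dict fences and raise only after the loop is exhausted.

```python
import json
import re
from typing import Any

# Assumed fence pattern: ``` or ```json, capturing the body non-greedily
FENCE_RE = re.compile(r"```(?:json)?\s*\n(.*?)```", re.DOTALL)

def extract_judge_dict(text: str) -> dict[str, Any]:
    """Scan fenced JSON blocks last-to-first; skip arrays instead of raising early."""
    for block in reversed(FENCE_RE.findall(text)):
        try:
            raw = json.loads(block)
        except json.JSONDecodeError:
            continue  # not valid JSON at all; try an earlier fence
        if isinstance(raw, dict):
            return raw
        # array (or scalar) fence: keep trying earlier fences
    raise ValueError("Judge response JSON must be an object")
```

With this shape, the "last fence is array, earlier fence is valid dict" case the reviewer calls out returns the earlier dict instead of raising.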
Force-pushed: 3e022b1 → e367fb2
@greptile-apps re-review
```yaml
required_evidence_sources:
  - aws_cloudwatch_metrics
  - aws_performance_insights
```
required_evidence_sources enforcement is ineffective for aws_performance_insights
run_suite.py defines _EVIDENCE_KEY_MAP which maps both aws_cloudwatch_metrics and aws_performance_insights to the same internal state key grafana_metrics. When score_result iterates required_evidence_sources, it resolves each source to its state key before checking evidence.get(state_key). Since both entries resolve to grafana_metrics, an agent that uses CloudWatch metrics but never consults Performance Insights still satisfies both checks — the aws_performance_insights requirement is silently a no-op. The PR intent to "require both metric-level and query-level evidence" is not enforced.
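The no-op described above can be demonstrated in a few lines. The map and key names below are assumed from this comment, not copied from `run_suite.py`, and the "fix" is one possible approach: give each logical source its own state key.

```python
# Sketch of the reviewer's point: two logical sources sharing one state key
# make the second requirement a silent no-op. Names are illustrative.

EVIDENCE_KEY_MAP = {
    "aws_cloudwatch_metrics": "grafana_metrics",
    "aws_performance_insights": "grafana_metrics",  # collision: same key
}

def sources_satisfied(required, evidence):
    return all(bool(evidence.get(EVIDENCE_KEY_MAP[s])) for s in required)

# One possible fix: a distinct state key per source, so PI is enforced.
FIXED_MAP = {
    "aws_cloudwatch_metrics": "grafana_metrics",
    "aws_performance_insights": "performance_insights",
}

def sources_satisfied_fixed(required, evidence):
    return all(bool(evidence.get(FIXED_MAP[s])) for s in required)
```

An agent that only gathered CloudWatch metrics satisfies both checks under the colliding map, but fails the PI requirement under the fixed one.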
@greptile-apps re-review
P1 is addressed.
The text vs evidence asymmetry can be improved separately.
```python
for source_key in answer_key.required_evidence_sources:
    if source_key == "aws_performance_insights":
        if not any(token in normalized_output for token in performance_insights_tokens):
```
Keep the state evidence check for aws_performance_insights. Keyword-only output matching can pass on hallucinated AAS/db load mentions even when Performance Insights was never queried.
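A sketch of the combined check this comment asks for: require both the state-level evidence entry and the output tokens, so a hallucinated "db load" mention cannot pass without an actual Performance Insights query. The key and token names are assumptions for illustration.

```python
# Assumed token list and state key; the real run_suite.py may differ.
PI_TOKENS = ("db load", "average active sessions", "aas")

def pi_evidence_ok(evidence: dict, normalized_output: str) -> bool:
    # Was PI actually queried? (state-level evidence entry must exist)
    queried = bool(evidence.get("performance_insights"))
    # Did the agent's output actually discuss PI-level findings?
    mentioned = any(t in normalized_output for t in PI_TOKENS)
    return queried and mentioned
```

Under this check, keyword mentions alone are insufficient, and a silent PI query with no reasoning in the output is also insufficient.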
Thanks for your feedback @muddlebee. Tested locally with both positive and negative cases.
So
⚡ LGTM → Merged. @cerencamkiran, your work is in. Every commit counts — thank you for this one.
👋 Join us on Discord - OpenSRE: hang out, contribute, or hunt for features and issues. Everyone's welcome.



Fixes #597
The changes
This PR improves root cause analysis (RCA) quality for the `001-replication-lag` synthetic scenario by enforcing correct WAL-driven replication reasoning.

Key improvements

- Improves replication lag reasoning
- Enforces mechanism-level RCA: `replay` to enforce WAL vs replica replay mechanism
- Strengthens evidence usage: (`ReplicaLag`, `WriteIOPS`, `TransactionLogsGeneration`)
- Prevents common misdiagnoses: `cpu_saturation` and `connection_exhaustion` as root causes
- Adds QA validation: `QA_VALIDATION.md` for scenario 001

Result

Scenario `001-replication-lag` now produces stronger, mechanism-level RCA output with correct attribution and evidence grounding.

```
PASS 001-replication-lag category=resource_exhaustion
Results: 1/1 passed
```