
fix: improve WAL-driven replication lag RCA (scenario 001) + add QA validation#1214

Merged
muddlebee merged 5 commits into Tracer-Cloud:main from cerencamkiran:synthetic-001-replication-lag-validation on May 5, 2026

Conversation

cerencamkiran (Contributor) commented May 2, 2026

Fixes #597

The changes

This PR improves root cause analysis (RCA) quality for the 001-replication-lag synthetic scenario by enforcing correct WAL-driven replication reasoning.

Key improvements

Improves replication lag reasoning:

  • Ensures the agent identifies replication lag as a downstream effect of a write-heavy workload on the primary
  • Enforces correct causal chain: write workload → WAL generation → replica replay lag

Enforces mechanism-level RCA:

  • Requires explanation of WAL generation vs replica replay mismatch
  • Prevents shallow “replication lag” diagnoses without causal explanation
  • Adds replay as a required answer keyword, so the WAL-generation vs replica-replay mechanism must be named explicitly (sketched after these lists)

Strengthens evidence usage:

  • Requires CloudWatch metrics evidence (ReplicaLag, WriteIOPS, TransactionLogsGeneration)
  • Requires Performance Insights evidence (write-heavy SQL + WAL waits)

Prevents common misdiagnoses:

  • Disallows cpu_saturation and connection_exhaustion as root causes
  • Ensures CPU is treated as a downstream effect, not the initiating cause
  • Prevents blaming the replica instead of the primary workload

Adds QA validation:

  • Introduces QA_VALIDATION.md for scenario 001
  • Documents expected reasoning, evidence usage, and failure modes
  • Helps prevent regressions in replication-related RCA
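
In scorer terms, the tightened checks amount to roughly the sketch below. The category and keyword names mirror the answer.yml additions described in this PR; the exact required-keyword list beyond replay is an assumption for illustration, not the real run_suite.py code.

```python
# Rough sketch of the tightened category/keyword gate (not verbatim
# run_suite.py code; "wal" as a required keyword is assumed).
REQUIRED_KEYWORDS = ("wal", "replay")  # "replay" is the keyword added by this PR
FORBIDDEN_CATEGORIES = ("cpu_saturation", "connection_exhaustion")


def category_and_keywords_ok(category: str, normalized_output: str) -> bool:
    """Reject shallow or misattributed diagnoses before evidence checks run."""
    if category in FORBIDDEN_CATEGORIES:
        return False  # e.g. blaming CPU, which is a downstream effect here
    return all(kw in normalized_output for kw in REQUIRED_KEYWORDS)
```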

Result

Scenario 001-replication-lag now produces stronger, mechanism-level RCA output with correct attribution and evidence grounding.

```
PASS 001-replication-lag category=resource_exhaustion

Results: 1/1 passed
```

Run

```bash
python -m tests.synthetic.rds_postgres.run_suite --scenario 001-replication-lag --mock-grafana
```

cerencamkiran force-pushed the synthetic-001-replication-lag-validation branch from 72eef1f to 3e022b1 on May 2, 2026 14:13
greptile-apps (Bot) commented May 2, 2026

Greptile Summary

This PR tightens the synthetic test for scenario 001-replication-lag by adding the replay keyword, forbidding connection_exhaustion as a category, requiring both CloudWatch and Performance Insights evidence sources, and adding a QA_VALIDATION.md document. The core scorer in run_suite.py gains a special branch that validates PI evidence via output-token matching rather than the evidence-dict check used for all other sources.
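
For orientation, that branch plausibly has the following shape (a sketch under assumed names; the token values are illustrative, not the actual run_suite.py list):

```python
# Sketch of the score_result special case (assumed shape, not verbatim).
_EVIDENCE_KEY_MAP = {"aws_cloudwatch_metrics": "grafana_metrics"}  # simplified
performance_insights_tokens = ("top sql", "aas", "walwrite")  # assumed values


def check_required_sources(required, evidence, normalized_output, failures):
    """Every other source is checked via the evidence dict; PI via output text."""
    for source_key in required:
        if source_key == "aws_performance_insights":
            if not any(t in normalized_output for t in performance_insights_tokens):
                failures.append("performance insights evidence not gathered")
        else:
            state_key = _EVIDENCE_KEY_MAP[source_key]
            if not evidence.get(state_key):
                failures.append(f"evidence not gathered: {source_key}")
```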

Confidence Score: 5/5

Safe to merge; all findings are P2 design notes with no current runtime breakage.

No P0/P1 defects found. The two P2 observations (global token list, text-vs-evidence asymmetry) are limitations of the current architecture rather than bugs that cause wrong pass/fail results today.

tests/synthetic/rds_postgres/run_suite.py — the new performance_insights_tokens branch warrants attention when future scenarios add aws_performance_insights to their evidence requirements.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/synthetic/rds_postgres/run_suite.py | Adds special-case handling for aws_performance_insights in score_result using hardcoded output-token matching instead of evidence-dict lookup; the token list is global/WAL-specific and the check is weaker (text-based) than the corresponding CloudWatch check. |
| tests/synthetic/rds_postgres/001-replication-lag/answer.yml | Adds replay keyword, required_evidence_sources (CloudWatch + PI), and connection_exhaustion to forbidden_categories; straightforward tightening of the scenario answer key. |
| tests/synthetic/rds_postgres/001-replication-lag/QA_VALIDATION.md | New documentation file describing expected reasoning, evidence, and failure modes for the WAL-driven replication lag scenario; no executable code. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[score_result called] --> B{root_cause present?}
    B -- No --> FAIL1[failure: no root cause]
    B -- Yes --> C{category matches?}
    C -- No --> FAIL2[failure: wrong category]
    C -- Yes --> D{missing keywords?}
    D -- Yes --> FAIL3[failure: missing keywords]
    D -- No --> E{forbidden category?}
    E -- Yes --> FAIL4[failure: forbidden category]
    E -- No --> F{forbidden keywords?}
    F -- Yes --> FAIL5[failure: forbidden keywords]
    F -- No --> G{required_evidence_sources?}
    G -- No --> PASS
    G -- Yes --> H{source == aws_performance_insights?}
    H -- Yes --> I{PI tokens in normalized_output?}
    I -- No --> FAIL6[failure: PI evidence not gathered]
    I -- Yes --> J[continue to next source]
    H -- No --> K{evidence dict has state_key?}
    K -- No --> FAIL7[failure: evidence not gathered]
    K -- Yes --> J
    J --> G
    G --> L{failover event reasoning required?}
    L -- Yes --> M{RDS event sequence present?}
    M -- No --> FAIL8[failure: missing event sequence]
    M -- Yes --> PASS[passed = True]
    L -- No --> PASS
```

Reviews (4): Last reviewed commit: "fix: validate performance insights evide..."

Comment on lines +62 to +64
```python
if isinstance(raw, list):
    msg = "Judge response JSON must be an object"
    raise ValueError(msg)
```
P1 Array fence causes early bail-out, skipping valid earlier fences

When iterating fences in reverse, hitting a JSON array immediately raises ValueError without giving the remaining (earlier) fences a chance to return a valid dict. A real LLM response that includes an example array fence after the result fence (e.g. an answer fence followed by a prose explanation containing a ["item1"] fence) would incorrectly fail here instead of returning the earlier, correct answer.

The fix is to continue past array fences and only raise after the entire loop is exhausted with no dict found. There is no existing test covering the case: "last fence is array, earlier fence is valid dict".

Suggested change

```diff
-if isinstance(raw, list):
-    msg = "Judge response JSON must be an object"
-    raise ValueError(msg)
+if isinstance(raw, dict):
+    return cast(dict[str, Any], raw)
+# array fence: skip and keep trying earlier fences
+# (do not raise here; a previous fence may contain the answer)
```
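
A regression test for that uncovered case could look like the sketch below; parse_judge_fences is a hypothetical stand-in for the real parsing function, which this thread does not name.

```python
import json
import re
from typing import Any, cast


def parse_judge_fences(text: str) -> dict[str, Any]:
    """Hypothetical stand-in: scan fenced JSON blocks from last to first."""
    fences = re.findall(r"```(?:json)?\n(.*?)```", text, flags=re.DOTALL)
    for fence in reversed(fences):
        raw = json.loads(fence)
        if isinstance(raw, dict):
            return cast(dict[str, Any], raw)
        # array fence: skip and keep trying earlier fences
    raise ValueError("Judge response JSON must be an object")


def test_array_fence_after_valid_dict_fence() -> None:
    f = "```"  # fence delimiter, built here so the example stays readable
    response = (
        f"Answer:\n{f}json\n{{\"verdict\": \"pass\"}}\n{f}\n"
        f"Example list:\n{f}json\n[\"item1\"]\n{f}\n"
    )
    assert parse_judge_fences(response) == {"verdict": "pass"}
```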

cerencamkiran force-pushed the synthetic-001-replication-lag-validation branch from 3e022b1 to e367fb2 on May 2, 2026 14:21
cerencamkiran (Contributor, Author):

@greptile-apps re-review

Comment on lines +8 to +10
```yaml
required_evidence_sources:
  - aws_cloudwatch_metrics
  - aws_performance_insights
```

P1 required_evidence_sources enforcement is ineffective for aws_performance_insights

run_suite.py defines _EVIDENCE_KEY_MAP which maps both aws_cloudwatch_metrics and aws_performance_insights to the same internal state key grafana_metrics. When score_result iterates required_evidence_sources, it resolves each source to its state key before checking evidence.get(state_key). Since both entries resolve to grafana_metrics, an agent that uses CloudWatch metrics but never consults Performance Insights still satisfies both checks — the aws_performance_insights requirement is silently a no-op. The PR intent to "require both metric-level and query-level evidence" is not enforced.
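
To make the no-op concrete, a minimal illustration (the mapping values come from the comment above; the evidence payload is invented for the example):

```python
# Both sources resolve to the same state key, so one piece of CloudWatch
# evidence satisfies both required_evidence_sources entries.
_EVIDENCE_KEY_MAP = {
    "aws_cloudwatch_metrics": "grafana_metrics",
    "aws_performance_insights": "grafana_metrics",  # same key: check is a no-op
}

evidence = {"grafana_metrics": {"ReplicaLag": 120.0}}  # PI never queried

for source in ("aws_cloudwatch_metrics", "aws_performance_insights"):
    assert evidence.get(_EVIDENCE_KEY_MAP[source])  # passes for BOTH sources
```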

cerencamkiran (Contributor, Author):

@greptile-apps re-review

cerencamkiran (Contributor, Author):

[Screenshot 2026-05-02 193636] [Screenshot 2026-05-02 193708]

cerencamkiran (Contributor, Author):

P1 is addressed.

aws_performance_insights is now validated via PI-specific signals (Top SQL, AAS, WALWrite), not generic grafana_metrics.

The text vs evidence asymmetry can be improved separately.


```python
for source_key in answer_key.required_evidence_sources:
    if source_key == "aws_performance_insights":
        if not any(token in normalized_output for token in performance_insights_tokens):
```
muddlebee (Collaborator):
Keep the state evidence check for aws_performance_insights. Keyword-only output matching can pass on hallucinated AAS/db load mentions even when Performance Insights was never queried.

cerencamkiran (Contributor, Author):

Thanks for your feedback @muddlebee. Tested locally with both positive and negative cases.

  • Metrics only (no PI signals) → correctly fails
  • PI-like text without state evidence → correctly fails

So aws_performance_insights now requires both state evidence and PI-specific signals, preventing hallucinated passes.
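
A minimal sketch of the resulting combined check, reusing names from the snippets above (the state key and token values for PI evidence are assumptions):

```python
def performance_insights_satisfied(evidence: dict, normalized_output: str) -> bool:
    """PI counts as gathered only if it was queried AND its signals are cited."""
    performance_insights_tokens = ("top sql", "aas", "walwrite")  # assumed values
    queried = bool(evidence.get("grafana_metrics"))  # assumed state key
    cited = any(t in normalized_output for t in performance_insights_tokens)
    return queried and cited
```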

muddlebee merged commit 7bb92a9 into Tracer-Cloud:main on May 5, 2026
13 checks passed
github-actions (Bot) commented May 5, 2026

LGTM → Merged. @cerencamkiran, your work is in. Every commit counts — thank you for this one.


👋 Join us on Discord - OpenSRE: hang out, contribute, or hunt for features and issues. Everyone's welcome.



Development

Successfully merging this pull request may close these issues.

[synthetic-qa] 001-replication-lag: Validate agent identifies WAL-driven replication lag
