fix: improve WAL-driven replication lag RCA (scenario 001) + add QA validation #1214
Conversation
Force-pushed: 72eef1f → 3e022b1
Greptile Summary

This PR tightens the synthetic test for scenario 001. Confidence Score: 5/5. Safe to merge; all findings are P2 design notes with no current runtime breakage. No P0/P1 defects found. The two P2 observations (global token list, text-vs-evidence asymmetry) are limitations of the current architecture rather than bugs that cause wrong pass/fail results today.

Important Files Changed: `tests/synthetic/rds_postgres/run_suite.py`
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[score_result called] --> B{root_cause present?}
    B -- No --> FAIL1[failure: no root cause]
    B -- Yes --> C{category matches?}
    C -- No --> FAIL2[failure: wrong category]
    C -- Yes --> D{missing keywords?}
    D -- Yes --> FAIL3[failure: missing keywords]
    D -- No --> E{forbidden category?}
    E -- Yes --> FAIL4[failure: forbidden category]
    E -- No --> F{forbidden keywords?}
    F -- Yes --> FAIL5[failure: forbidden keywords]
    F -- No --> G{required_evidence_sources?}
    G -- No --> PASS
    G -- Yes --> H{source == aws_performance_insights?}
    H -- Yes --> I{PI tokens in normalized_output?}
    I -- No --> FAIL6[failure: PI evidence not gathered]
    I -- Yes --> J[continue to next source]
    H -- No --> K{evidence dict has state_key?}
    K -- No --> FAIL7[failure: evidence not gathered]
    K -- Yes --> J
    J --> G
    G --> L{failover event reasoning required?}
    L -- Yes --> M{RDS event sequence present?}
    M -- No --> FAIL8[failure: missing event sequence]
    M -- Yes --> PASS[passed = True]
    L -- No --> PASS
```
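The scoring flow in the chart above can be sketched in Python. This is a hypothetical reconstruction for illustration only: the function signature, `EVIDENCE_KEY_MAP`, `PI_TOKENS`, and the answer-key field names are assumptions, not the actual `run_suite.py` code.

```python
# Illustrative sketch of the score_result decision flow shown in the chart.
# All names here (EVIDENCE_KEY_MAP, PI_TOKENS, answer-key fields) are assumed.

EVIDENCE_KEY_MAP = {"aws_cloudwatch_metrics": "grafana_metrics"}
PI_TOKENS = ("db load", "average active sessions")  # assumed PI keyword list

def score_result(answer_key, result, evidence, normalized_output):
    if not result.get("root_cause"):
        return False, "no root cause"
    if result.get("category") != answer_key["category"]:
        return False, "wrong category"
    if any(k not in normalized_output for k in answer_key.get("required_keywords", [])):
        return False, "missing keywords"
    if result.get("category") in answer_key.get("forbidden_categories", []):
        return False, "forbidden category"
    if any(k in normalized_output for k in answer_key.get("forbidden_keywords", [])):
        return False, "forbidden keywords"
    for source in answer_key.get("required_evidence_sources", []):
        if source == "aws_performance_insights":
            # PI is validated against the agent's text output, not state
            if not any(t in normalized_output for t in PI_TOKENS):
                return False, "PI evidence not gathered"
        elif not evidence.get(EVIDENCE_KEY_MAP.get(source, source)):
            return False, "evidence not gathered"
    if answer_key.get("require_failover_reasoning") and "rds event" not in normalized_output:
        return False, "missing event sequence"
    return True, "passed"
```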
Reviews (4). Last reviewed commit: "fix: validate performance insights evide..."
```python
if isinstance(raw, list):
    msg = "Judge response JSON must be an object"
    raise ValueError(msg)
```
Array fence causes early bail-out, skipping valid earlier fences
When iterating fences in reverse, hitting a JSON array immediately raises ValueError without giving the remaining (earlier) fences a chance to return a valid dict. A real LLM response that includes an example array fence after the result fence — e.g. an answer fence followed by a prose explanation with a ["item1"] fence — would incorrectly fail here instead of returning the already-seen correct answer.
The fix is to continue past array fences and only raise after the entire loop is exhausted with no dict found. There is no existing test covering the case: "last fence is array, earlier fence is valid dict".
```diff
-if isinstance(raw, list):
-    msg = "Judge response JSON must be an object"
-    raise ValueError(msg)
+if isinstance(raw, dict):
+    return cast(dict[str, Any], raw)
+# array fence — skip and keep trying earlier fences
+# (do not raise here; a previous fence may contain the answer)
```
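The suggested fix can be sketched as a complete fence-scanning helper. The regex, function name, and fence format are assumptions for illustration; the actual parser may extract fences differently. The point is the control flow: skip non-dict fences and raise only after the loop is exhausted.

```python
import json
import re
from typing import Any

# Assumed fence pattern: ``` or ```json, capturing the body non-greedily
FENCE_RE = re.compile(r"```(?:json)?\s*\n(.*?)```", re.DOTALL)

def extract_judge_dict(text: str) -> dict[str, Any]:
    """Scan fenced JSON blocks last-to-first; skip arrays instead of raising early."""
    for block in reversed(FENCE_RE.findall(text)):
        try:
            raw = json.loads(block)
        except json.JSONDecodeError:
            continue  # not valid JSON at all; try an earlier fence
        if isinstance(raw, dict):
            return raw
        # array (or scalar) fence: keep trying earlier fences
    raise ValueError("Judge response JSON must be an object")
```

With this shape, the "last fence is array, earlier fence is valid dict" case the reviewer calls out returns the earlier dict instead of raising.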
Force-pushed: 3e022b1 → e367fb2
@greptile-apps re-review
```yaml
required_evidence_sources:
  - aws_cloudwatch_metrics
  - aws_performance_insights
```
required_evidence_sources enforcement is ineffective for aws_performance_insights
run_suite.py defines _EVIDENCE_KEY_MAP which maps both aws_cloudwatch_metrics and aws_performance_insights to the same internal state key grafana_metrics. When score_result iterates required_evidence_sources, it resolves each source to its state key before checking evidence.get(state_key). Since both entries resolve to grafana_metrics, an agent that uses CloudWatch metrics but never consults Performance Insights still satisfies both checks — the aws_performance_insights requirement is silently a no-op. The PR intent to "require both metric-level and query-level evidence" is not enforced.
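The no-op described above can be demonstrated in a few lines. The map and key names below are assumed from this comment, not copied from `run_suite.py`, and the "fix" is one possible approach: give each logical source its own state key.

```python
# Sketch of the reviewer's point: two logical sources sharing one state key
# make the second requirement a silent no-op. Names are illustrative.

EVIDENCE_KEY_MAP = {
    "aws_cloudwatch_metrics": "grafana_metrics",
    "aws_performance_insights": "grafana_metrics",  # collision: same key
}

def sources_satisfied(required, evidence):
    return all(bool(evidence.get(EVIDENCE_KEY_MAP[s])) for s in required)

# One possible fix: a distinct state key per source, so PI is enforced.
FIXED_MAP = {
    "aws_cloudwatch_metrics": "grafana_metrics",
    "aws_performance_insights": "performance_insights",
}

def sources_satisfied_fixed(required, evidence):
    return all(bool(evidence.get(FIXED_MAP[s])) for s in required)
```

An agent that only gathered CloudWatch metrics satisfies both checks under the colliding map, but fails the PI requirement under the fixed one.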
@greptile-apps re-review
P1 is addressed.
The text vs evidence asymmetry can be improved separately.
```python
for source_key in answer_key.required_evidence_sources:
    if source_key == "aws_performance_insights":
        if not any(token in normalized_output for token in performance_insights_tokens):
```
Keep the state evidence check for aws_performance_insights. Keyword-only output matching can pass on hallucinated AAS/db load mentions even when Performance Insights was never queried.
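A sketch of the combined check this comment asks for: require both the state-level evidence entry and the output tokens, so a hallucinated "db load" mention cannot pass without an actual Performance Insights query. The key and token names are assumptions for illustration.

```python
# Assumed token list and state key; the real run_suite.py may differ.
PI_TOKENS = ("db load", "average active sessions", "aas")

def pi_evidence_ok(evidence: dict, normalized_output: str) -> bool:
    # Was PI actually queried? (state-level evidence entry must exist)
    queried = bool(evidence.get("performance_insights"))
    # Did the agent's output actually discuss PI-level findings?
    mentioned = any(t in normalized_output for t in PI_TOKENS)
    return queried and mentioned
```

Under this check, keyword mentions alone are insufficient, and a silent PI query with no reasoning in the output is also insufficient.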
Thanks for your feedback @muddlebee. Tested locally with both positive and negative cases.
So
⚡ LGTM → Merged. @cerencamkiran, your work is in. Every commit counts — thank you for this one.
👋 Join us on Discord - OpenSRE: hang out, contribute, or hunt for features and issues. Everyone's welcome.



Fixes #597
The changes
This PR improves root cause analysis (RCA) quality for the `001-replication-lag` synthetic scenario by enforcing correct WAL-driven replication reasoning.

Key improvements

- Improves replication lag reasoning
- Enforces mechanism-level RCA: `replay` to enforce WAL vs replica replay mechanism
- Strengthens evidence usage: (`ReplicaLag`, `WriteIOPS`, `TransactionLogsGeneration`)
- Prevents common misdiagnoses: `cpu_saturation` and `connection_exhaustion` as root causes
- Adds QA validation: `QA_VALIDATION.md` for scenario 001

Result

Scenario `001-replication-lag` now produces stronger, mechanism-level RCA output with correct attribution and evidence grounding.

```
PASS 001-replication-lag category=resource_exhaustion
Results: 1/1 passed
```