fix: resolve synthetic RDS scenario 008 scoring and planner loops #844
Greptile Summary
This PR fixes two independent bugs in the RDS synthetic test pipeline: (1) …
Confidence Score: 5/5
Safe to merge; both fixes are correct, well-tested, and limited in scope.
All remaining findings are P2 style issues (non-idiomatic set constructor, local variable naming convention). No logic errors, data integrity issues, or correctness problems were found. Both changed code paths are covered by targeted regression tests.
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[plan_actions called] --> B[select_actions\nfilters executed actions]
B --> C[plan_actions_with_llm\nLLM returns planned_actions]
C --> D[_seed_plan_actions\nfilter: keep only available_action_names]
D --> E{plan.actions empty?}
E -- No --> F[Enforce tool_budget cap]
E -- Yes --> G["Fallback: plan.actions = [available_action_names[0]]"]
G --> F
F --> H[Return plan]
subgraph score_result ["score_result (run_suite.py)"]
I[Compute normalized_required_keywords\nfrom fixture answer_key] --> J{failover_required_tokens\n⊆ normalized_required_keywords?}
J -- No --> K[Skip failover event\nsequence check]
J -- Yes --> L[Check RDS event reasoning\nin root_cause + validated_claims]
L --> M{mentions_event_reasoning?}
M -- No --> N[FAIL: RDS events not used\nas primary reasoning signal]
M -- Yes --> O{sequence_present?}
O -- No --> P[FAIL: sequence not\nexplicitly listed]
O -- Yes --> Q[PASS]
end
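The planner half of the flowchart (filter, fallback, budget cap) can be sketched as two small pure functions. This is a minimal sketch, not the code in plan_actions.py: the Plan dataclass, the helper names seed_plan_actions and finalize_plan, the way tool_budget is passed in, and the action names are all hypothetical. Only the filter-then-fallback-then-cap order is taken from the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    # Hypothetical stand-in for the planner's output object.
    actions: list[str] = field(default_factory=list)


def seed_plan_actions(planned_actions: list[str], available_action_names: list[str]) -> Plan:
    """Keep only LLM-proposed actions that are still available."""
    allowed_action_names: set[str] = set(available_action_names)
    return Plan(actions=[name for name in planned_actions if name in allowed_action_names])


def finalize_plan(plan: Plan, available_action_names: list[str], tool_budget: int) -> Plan:
    """Apply the deterministic fallback, then enforce the tool_budget cap."""
    # The guard on available_action_names is an added assumption; the
    # diagram's fallback indexes [0] unconditionally.
    if not plan.actions and available_action_names:
        plan.actions = [available_action_names[0]]
    plan.actions = plan.actions[:tool_budget]
    return plan


# The LLM repeats an exhausted (hypothetical) Grafana query, so filtering empties the plan.
available = ["query_cloudwatch_metrics", "describe_rds_events"]
plan = seed_plan_actions(["query_grafana_metrics"], available)
plan = finalize_plan(plan, available, tool_budget=2)
print(plan.actions)  # ['query_cloudwatch_metrics']
```

Because the fallback picks the first remaining available action rather than asking the LLM again, the loop always makes progress even when the model keeps re-proposing the same blocked action.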
if action_name in available_action_names
]

allowed_action_names = set[str](available_action_names)
Non-idiomatic generic-alias constructor call
set[str](iterable) invokes a types.GenericAlias as a constructor — it works at runtime (Python 3.9+ delegates to the origin type set), but it is the only occurrence of this pattern in the codebase and mypy may infer the result as set[Any] rather than set[str]. The idiomatic equivalent is a plain set() call with a type annotation.
Suggested change
- allowed_action_names = set[str](available_action_names)
+ allowed_action_names: set[str] = set(available_action_names)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
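The runtime half of the comment's claim is easy to check in isolation. A minimal sketch, assuming Python 3.9+: subscripting the builtin set produces a types.GenericAlias, calling that alias builds an ordinary set, and the annotated form from the suggestion yields an equal value. The action names are hypothetical.

```python
import types

available_action_names = ["query_grafana_metrics", "describe_rds_events"]

# Subscripting the builtin yields a generic alias, not a new class.
alias = set[str]
print(type(alias) is types.GenericAlias)  # True
print(alias.__origin__ is set)  # True

# Calling the alias delegates to the origin type and returns a plain set.
via_alias = set[str](available_action_names)
print(type(via_alias) is set)  # True

# The idiomatic form: annotate the variable, call the builtin directly.
allowed_action_names: set[str] = set(available_action_names)
print(via_alias == allowed_action_names)  # True
```

Both forms behave the same at runtime, so the change is purely about readability and making the intended element type explicit to type checkers.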
Addressing this in 13bf1b6. I changed the typed set construction to the annotated set(...) form in plan_actions.py, and also cleaned up the local failover sequence tuple naming in run_suite.py as a follow-up style fix. Focused regression tests and ruff checks were rerun after the update.
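Addressing the two Greptile nits in a follow-up commit:
- plan_actions.py: replaced the set[str](...) constructor call with the annotated set(...) form.
- run_suite.py: cleaned up the naming of the local failover sequence tuple.
Re-ran focused regression tests and ruff checks after the change.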
hey @yas789 But the planner-loop fix doesn't fully cover this case. The investigation hit …
Thanks for flagging this. I reproduced the issue locally with a focused regression test: after a deterministic …

I’m addressing this by separating successful and failed action history: …

The planner now blocks successful actions plus exhausted actions. Non-retryable failures like …

I also added tests for: …

Local verification is green: focused planner tests, lint, format check, typecheck, …
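The approach described above (keep successful and failed history apart, then block successful actions plus exhausted ones) can be sketched as a small tracker. Everything here is an assumption for illustration: the ActionHistory class, the max_failures threshold, the rule that a non-retryable failure exhausts an action immediately, and the action names are hypothetical and not taken from be04428.

```python
from dataclasses import dataclass, field


@dataclass
class ActionHistory:
    # Hypothetical tracker; the PR's real data structures may differ.
    max_failures: int = 2
    successful: set[str] = field(default_factory=set)
    failure_counts: dict[str, int] = field(default_factory=dict)
    non_retryable: set[str] = field(default_factory=set)

    def record_success(self, action: str) -> None:
        self.successful.add(action)

    def record_failure(self, action: str, retryable: bool) -> None:
        self.failure_counts[action] = self.failure_counts.get(action, 0) + 1
        if not retryable:
            # Assumed rule: a non-retryable failure exhausts the action at once.
            self.non_retryable.add(action)

    def exhausted(self) -> set[str]:
        over_limit = {a for a, n in self.failure_counts.items() if n >= self.max_failures}
        return over_limit | self.non_retryable

    def blocked(self) -> set[str]:
        # Successful actions are never re-run; exhausted ones are not retried.
        return self.successful | self.exhausted()

    def available(self, candidates: list[str]) -> list[str]:
        blocked = self.blocked()
        return [name for name in candidates if name not in blocked]


history = ActionHistory(max_failures=2)
history.record_success("describe_rds_events")
history.record_failure("query_grafana_metrics", retryable=True)
candidates = ["describe_rds_events", "query_grafana_metrics", "query_cloudwatch_metrics"]
print(history.available(candidates))  # ['query_grafana_metrics', 'query_cloudwatch_metrics']
history.record_failure("query_grafana_metrics", retryable=True)
print(history.available(candidates))  # ['query_cloudwatch_metrics']
```

Keeping failures out of the successful set is what lets a transient error be retried once, while the failure cap stops the LLM from re-selecting the same failing query indefinitely.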
Re-tested with be04428. The exhausted-action tracking solves the loop: …
Thanks @yas789 for the deeper architectural fix. 🙌
🎊 Achievement unlocked: PR Merged. @yas789 passed code review, survived CI, and shipped. Respect. 🤝 👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.


Fixes #604
Describe the changes you have made in this PR -
Screenshots of the UI changes (If any) -
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
- 008-storage-full-missing-metric was already semantically correct on current main, but the synthetic scorer was still applying failover-only RDS event wording checks to every scenario that required aws_rds_events.
- … required_keywords explicitly demand the failover timeline phrases.
- … available_action_names and falling back to the next valid action when the LLM repeats an exhausted Grafana query.
- … 008 were already correct; the remaining failures were in the synthetic scorer and the planner/controller loop.
- tests/synthetic/rds_postgres/run_suite.py::score_result now gates failover-only scoring on failover-specific required keywords.
- app/nodes/plan_actions/plan_actions.py::_seed_plan_actions now filters planner output to valid currently-available actions.
- app/nodes/plan_actions/plan_actions.py::plan_actions now falls back deterministically when the planner returns only unavailable/already-executed actions.

Checklist before requesting a review
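The scoring gate follows the score_result branch of the flowchart: failover wording is only checked when the answer key itself requires the failover tokens. A minimal sketch, not the code in run_suite.py: the token set, the sequence phrases, the "rds event" substring test, and the function name check_failover_reasoning are hypothetical placeholders.

```python
from typing import Optional

# Hypothetical placeholders; the real token set and phrases live in run_suite.py.
FAILOVER_REQUIRED_TOKENS = frozenset({"failover", "replica promoted"})
FAILOVER_SEQUENCE = ("failover started", "replica promoted", "failover completed")


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so comparisons ignore formatting.
    return " ".join(text.lower().split())


def check_failover_reasoning(
    required_keywords: list[str], root_cause: str, validated_claims: list[str]
) -> tuple[bool, Optional[str]]:
    normalized_required_keywords = {normalize(k) for k in required_keywords}
    # Gate: skip the failover checks unless the answer key demands failover tokens.
    if not FAILOVER_REQUIRED_TOKENS <= normalized_required_keywords:
        return True, None
    evidence = normalize(" ".join([root_cause, *validated_claims]))
    if "rds event" not in evidence:
        return False, "RDS events not used as primary reasoning signal"
    # Each sequence step must appear, and in the listed order.
    positions = [evidence.find(step) for step in FAILOVER_SEQUENCE]
    sequence_present = all(p >= 0 for p in positions) and positions == sorted(positions)
    if not sequence_present:
        return False, "failover sequence not explicitly listed"
    return True, None


# A storage-full scenario (hypothetical keywords) is no longer held to failover wording.
print(check_failover_reasoning(["storage full", "freestoragespace"], "Disk filled up.", []))
# (True, None)
print(check_failover_reasoning(["failover", "replica promoted"], "RDS events show a failover.", []))
# (False, 'failover sequence not explicitly listed')
```

Gating on the answer key, rather than on whether a scenario merely uses aws_rds_events, keeps non-failover scenarios such as 008 from failing on wording they were never expected to produce.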
Verification
- python -m pytest tests/nodes/plan_actions/test_reroute_and_budget.py
- python -m pytest tests/synthetic/rds_postgres/test_suite.py -k "not test_level"
- python -m tests.synthetic.rds_postgres.run_suite --scenario 008-storage-full-missing-metric --mock-grafana --json
- python -m pytest tests/synthetic/rds_postgres/test_suite.py -k "008-storage-full-missing-metric" -vv (with OPENAI_API_KEY injected for pytest)
- make lint
- make format-check
- make typecheck