fix: resolve synthetic RDS scenario 008 scoring and planner loops by yas789 · Pull Request #844 · Tracer-Cloud/opensre

yas789 · 2026-04-24T16:18:11Z

Fixes #604

Describe the changes you have made in this PR -

fixed the synthetic RDS scorer so failover-only event wording is enforced only for scenarios that explicitly require the failover timeline language
fixed planner/controller behavior so already-executed or unavailable actions are filtered out before execution, with a fallback to the next valid action when the LLM repeats a stale Grafana query
added regression tests for storage-vs-failover scoring and for planner fallback when only repeated actions are returned

Screenshots of the UI changes (If any) -

None

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The root-cause diagnosis for scenario 008-storage-full-missing-metric was already semantically correct on current main, but the synthetic scorer was still applying failover-only RDS event wording checks to every scenario that required aws_rds_events.
I narrowed that scoring rule so it only applies when the scenario's required_keywords explicitly demand the failover timeline phrases.
The scenario still had a trajectory problem because the planner could return already-executed actions even after prompt-side filtering. I fixed that at the controller layer by filtering the LLM's returned actions against the current available_action_names and falling back to the next valid action when the LLM repeats an exhausted Grafana query.
I chose this approach instead of retuning the diagnosis prompt because the evidence and RCA output for 008 were already correct; the remaining failures were in the synthetic scorer and the planner/controller loop.
Key functions changed:
- tests/synthetic/rds_postgres/run_suite.py::score_result now gates failover-only scoring on failover-specific required keywords.
- app/nodes/plan_actions/plan_actions.py::_seed_plan_actions now filters planner output to valid currently-available actions.
- app/nodes/plan_actions/plan_actions.py::plan_actions now falls back deterministically when the planner returns only unavailable/already-executed actions.

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Verification

python -m pytest tests/nodes/plan_actions/test_reroute_and_budget.py
python -m pytest tests/synthetic/rds_postgres/test_suite.py -k \"not test_level\"
python -m tests.synthetic.rds_postgres.run_suite --scenario 008-storage-full-missing-metric --mock-grafana --json
python -m pytest tests/synthetic/rds_postgres/test_suite.py -k \"008-storage-full-missing-metric\" -vv (with OPENAI_API_KEY injected for pytest)
make lint
make format-check
make typecheck

greptile-apps · 2026-04-24T16:21:37Z

Greptile Summary

This PR fixes two independent bugs in the RDS synthetic test pipeline: (1) _seed_plan_actions now filters out any LLM-suggested actions that are not in available_action_names, and a controller-level fallback forces the first valid available action when the planner returns an empty list after filtering; (2) score_result in run_suite.py now gates the failover event-sequence check on whether the scenario's own required_keywords contain all five failover timeline tokens, rather than keying off aws_rds_events presence — preventing scenario 008 (storage-full) from being misgraded. Both fixes are covered by new unit tests that would have caught the original regressions.

Confidence Score: 5/5

Safe to merge; both fixes are correct, well-tested, and limited in scope.

All remaining findings are P2 style issues (non-idiomatic set constructor, local variable naming convention). No logic errors, data integrity issues, or correctness problems were found. Both changed code paths are covered by targeted regression tests.

No files require special attention.

Important Files Changed

Filename	Overview
app/nodes/plan_actions/plan_actions.py	Adds availability filter in `_seed_plan_actions` and a fallback action when the planner returns an empty list; logic is correct, one non-idiomatic `set[str](...)` instantiation.
tests/synthetic/rds_postgres/run_suite.py	Restricts failover event-sequence scoring to scenarios whose `required_keywords` contain the exact failover timeline tokens; fixes the scenario 008 misgrading. Minor naming convention issue with `_REQUIRED_SEQUENCE_TOKENS`.
tests/nodes/plan_actions/test_reroute_and_budget.py	Adds two well-structured regression tests: filter for unavailable actions in `_seed_plan_actions`, and planner fallback when only already-executed actions are returned.
tests/synthetic/rds_postgres/test_suite.py	Adds three scorer unit tests covering: storage scenario not misgraded, failover scenario still fails without RDS event reasoning, and failover scenario passes with correct reasoning.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[plan_actions called] --> B[select_actions\nfilters executed actions]
    B --> C[plan_actions_with_llm\nLLM returns planned_actions]
    C --> D[_seed_plan_actions\nfilter: keep only available_action_names]
    D --> E{plan.actions empty?}
    E -- No --> F[Enforce tool_budget cap]
    E -- Yes --> G["Fallback: plan.actions = [available_action_names[0]]"]
    G --> F
    F --> H[Return plan]

    subgraph score_result ["score_result (run_suite.py)"]
        I[Compute normalized_required_keywords\nfrom fixture answer_key] --> J{failover_required_tokens\n⊆ normalized_required_keywords?}
        J -- No --> K[Skip failover event\nsequence check]
        J -- Yes --> L[Check RDS event reasoning\nin root_cause + validated_claims]
        L --> M{mentions_event_reasoning?}
        M -- No --> N[FAIL: RDS events not used\nas primary reasoning signal]
        M -- Yes --> O{sequence_present?}
        O -- No --> P[FAIL: sequence not\nexplicitly listed]
        O -- Yes --> Q[PASS]
    end

Comments Outside Diff (1)

tests/synthetic/rds_postgres/run_suite.py, line 324-329 (link)

Module-level constant naming used for a local variable

_REQUIRED_SEQUENCE_TOKENS uses the ALL_CAPS_WITH_UNDERSCORES convention reserved for module-level constants (PEP 8 §Constants). As a local tuple defined inside an if-block it should use a regular lowercase name.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

_{Reviews (1): Last reviewed commit: "fix(synthetic-rds): stop misgrading and ..." | Re-trigger Greptile}

greptile-apps · 2026-04-24T16:21:41Z

        if action_name in available_action_names
    ]

+    allowed_action_names = set[str](available_action_names)


Non-idiomatic generic-alias constructor call

set[str](iterable) invokes a types.GenericAlias as a constructor — it works at runtime (Python 3.9+ delegates to the origin type set), but it is the only occurrence of this pattern in the codebase and mypy may infer the result as set[Any] rather than set[str]. The idiomatic equivalent is a plain set() call with a type annotation.

Suggested change

allowed_action_names = set[str](available_action_names)

allowed_action_names: set[str] = set(available_action_names)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Addressing this in 13bf1b6. I changed the typed set construction to the annotated set(...) form in plan_actions.py, and also cleaned up the local failover sequence tuple naming in run_suite.py as a follow-up style fix. Focused regression tests and ruff checks were rerun after the update.

yas789 · 2026-04-24T16:32:54Z

Addressing the two Greptile nits in a follow-up commit:

switched the typed set construction to the annotated form
renamed the local failover sequence tuple to lowercase to match local-variable naming conventions

Re-ran focused regression tests and ruff checks after the change.

rrajan94 · 2026-04-28T09:46:55Z

hey @yas789
Tested locally with --mock-grafana. Scorer fix works, scenario 008 passes. ✅

But the planner-loop fix doesn't fully cover this case. The investigation hit max_loops: 4 because the planner kept selecting run_diagnostic_code four times in a row, even though it failed every time with the same TypeError:

[WARNING] Action failed: run_diagnostic_code:FAILED(TypeError: run_diagnostic_code() missing 1 require)

yas789 · 2026-04-28T18:55:40Z

Thanks for flagging this. I reproduced the issue locally with a focused regression test: after a deterministic run_diagnostic_code failure with a TypeError, the next planning pass still offered run_diagnostic_code again because failed actions were not recorded in the planner’s exclusion history.

I’m addressing this by separating successful and failed action history:

actions remains successful actions only, so synthetic trajectory scoring and report summaries are not polluted by failed attempts.
failed_actions records failure metadata for audit/debugging.
exhausted_actions records failed actions that should be excluded from future planning.

The planner now blocks successful actions plus exhausted actions. Non-retryable failures like TypeError, missing required arguments, unknown actions, and invalid responses are exhausted immediately. Retryable failures like timeouts, throttling, connection errors, and 5xx errors remain retryable until they hit a deterministic retry threshold.

I also added tests for:

reproducing the repeated failed-action selection
immediate exhaustion for non-retryable failures
retryable failures remaining selectable before the threshold
retryable failures becoming exhausted at the threshold
reroute logic not repeatedly triggering on exhausted actions

Local verification is green: focused planner tests, lint, format check, typecheck, make test-cov, and the original synthetic 008-storage-full-missing-metric scenario. I’ll push the implementation after one final review of the diff.

…ssue/604-storage-full-missing-metric

rrajan94 · 2026-04-29T03:03:19Z

Re-tested with be04428. The exhausted-action tracking solves the loop:

loops_used: 3 (down from 4/4)
Zero run_diagnostic_code calls in the trace
25/25 unit tests pass, including the 4 new failure-classification tests
make lint / format-check / typecheck all clean

Thanks @yas789 for the deeper architectural fix. 🙌

github-actions · 2026-04-29T03:03:42Z

🎊 Achievement unlocked: PR Merged. @yas789 passed code review, survived CI, and shipped. Respect. 🤝

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

fix(synthetic-rds): stop misgrading and requery loops in 008

bf41a29

yas789 mentioned this pull request Apr 24, 2026

[synthetic-qa] 008-storage-full-missing-metric: Validate agent infers storage exhaustion without FreeStorageSpace #604

Closed

5 tasks

greptile-apps Bot reviewed Apr 24, 2026

View reviewed changes

style: address synthetic RDS review nits

13bf1b6

yas789 added 2 commits April 28, 2026 20:06

Merge branch 'main' of https://github.com/Tracer-Cloud/opensre into i…

0e462d6

…ssue/604-storage-full-missing-metric

fix(planner): block exhausted failed actions

be04428

rrajan94 merged commit 1e57c59 into Tracer-Cloud:main Apr 29, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: resolve synthetic RDS scenario 008 scoring and planner loops#844

fix: resolve synthetic RDS scenario 008 scoring and planner loops#844
rrajan94 merged 4 commits intoTracer-Cloud:mainfrom
yas789:issue/604-storage-full-missing-metric

yas789 commented Apr 24, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 24, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 24, 2026

Uh oh!

yas789 Apr 24, 2026

Uh oh!

yas789 commented Apr 24, 2026 •

edited

Loading

Uh oh!

rrajan94 commented Apr 28, 2026

Uh oh!

yas789 commented Apr 28, 2026

Uh oh!

rrajan94 commented Apr 29, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	allowed_action_names = set[str](available_action_names)
	allowed_action_names: set[str] = set(available_action_names)

Conversation

yas789 commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the changes you have made in this PR -

Screenshots of the UI changes (If any) -

Code Understanding and AI Usage

Checklist before requesting a review

Verification

Uh oh!

greptile-apps Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

yas789 Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

yas789 commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rrajan94 commented Apr 28, 2026

Uh oh!

yas789 commented Apr 28, 2026

Uh oh!

rrajan94 commented Apr 29, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yas789 commented Apr 24, 2026 •

edited

Loading

greptile-apps Bot commented Apr 24, 2026 •

edited

Loading

yas789 commented Apr 24, 2026 •

edited

Loading