Skip to content

fix: resolve synthetic RDS scenario 008 scoring and planner loops#844

Merged
rrajan94 merged 4 commits intoTracer-Cloud:mainfrom
yas789:issue/604-storage-full-missing-metric
Apr 29, 2026
Merged

fix: resolve synthetic RDS scenario 008 scoring and planner loops#844
rrajan94 merged 4 commits intoTracer-Cloud:mainfrom
yas789:issue/604-storage-full-missing-metric

Conversation

@yas789
Copy link
Copy Markdown
Contributor

@yas789 yas789 commented Apr 24, 2026

Fixes #604

Describe the changes you have made in this PR -

  • fixed the synthetic RDS scorer so failover-only event wording is enforced only for scenarios that explicitly require the failover timeline language
  • fixed planner/controller behavior so already-executed or unavailable actions are filtered out before execution, with a fallback to the next valid action when the LLM repeats a stale Grafana query
  • added regression tests for storage-vs-failover scoring and for planner fallback when only repeated actions are returned

Screenshots of the UI changes (If any) -

  • None

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

  • The root-cause diagnosis for scenario 008-storage-full-missing-metric was already semantically correct on current main, but the synthetic scorer was still applying failover-only RDS event wording checks to every scenario that required aws_rds_events.
  • I narrowed that scoring rule so it only applies when the scenario's required_keywords explicitly demand the failover timeline phrases.
  • The scenario still had a trajectory problem because the planner could return already-executed actions even after prompt-side filtering. I fixed that at the controller layer by filtering the LLM's returned actions against the current available_action_names and falling back to the next valid action when the LLM repeats an exhausted Grafana query.
  • I chose this approach instead of retuning the diagnosis prompt because the evidence and RCA output for 008 were already correct; the remaining failures were in the synthetic scorer and the planner/controller loop.
  • Key functions changed:
    • tests/synthetic/rds_postgres/run_suite.py::score_result now gates failover-only scoring on failover-specific required keywords.
    • app/nodes/plan_actions/plan_actions.py::_seed_plan_actions now filters planner output to valid currently-available actions.
    • app/nodes/plan_actions/plan_actions.py::plan_actions now falls back deterministically when the planner returns only unavailable/already-executed actions.

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Verification

  • python -m pytest tests/nodes/plan_actions/test_reroute_and_budget.py
  • python -m pytest tests/synthetic/rds_postgres/test_suite.py -k \"not test_level\"
  • python -m tests.synthetic.rds_postgres.run_suite --scenario 008-storage-full-missing-metric --mock-grafana --json
  • python -m pytest tests/synthetic/rds_postgres/test_suite.py -k \"008-storage-full-missing-metric\" -vv (with OPENAI_API_KEY injected for pytest)
  • make lint
  • make format-check
  • make typecheck

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR fixes two independent bugs in the RDS synthetic test pipeline: (1) _seed_plan_actions now filters out any LLM-suggested actions that are not in available_action_names, and a controller-level fallback forces the first valid available action when the planner returns an empty list after filtering; (2) score_result in run_suite.py now gates the failover event-sequence check on whether the scenario's own required_keywords contain all five failover timeline tokens, rather than keying off aws_rds_events presence — preventing scenario 008 (storage-full) from being misgraded. Both fixes are covered by new unit tests that would have caught the original regressions.

Confidence Score: 5/5

Safe to merge; both fixes are correct, well-tested, and limited in scope.

All remaining findings are P2 style issues (non-idiomatic set constructor, local variable naming convention). No logic errors, data integrity issues, or correctness problems were found. Both changed code paths are covered by targeted regression tests.

No files require special attention.

Important Files Changed

Filename Overview
app/nodes/plan_actions/plan_actions.py Adds availability filter in _seed_plan_actions and a fallback action when the planner returns an empty list; logic is correct, one non-idiomatic set[str](...) instantiation.
tests/synthetic/rds_postgres/run_suite.py Restricts failover event-sequence scoring to scenarios whose required_keywords contain the exact failover timeline tokens; fixes the scenario 008 misgrading. Minor naming convention issue with _REQUIRED_SEQUENCE_TOKENS.
tests/nodes/plan_actions/test_reroute_and_budget.py Adds two well-structured regression tests: filter for unavailable actions in _seed_plan_actions, and planner fallback when only already-executed actions are returned.
tests/synthetic/rds_postgres/test_suite.py Adds three scorer unit tests covering: storage scenario not misgraded, failover scenario still fails without RDS event reasoning, and failover scenario passes with correct reasoning.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[plan_actions called] --> B[select_actions\nfilters executed actions]
    B --> C[plan_actions_with_llm\nLLM returns planned_actions]
    C --> D[_seed_plan_actions\nfilter: keep only available_action_names]
    D --> E{plan.actions empty?}
    E -- No --> F[Enforce tool_budget cap]
    E -- Yes --> G["Fallback: plan.actions = [available_action_names[0]]"]
    G --> F
    F --> H[Return plan]

    subgraph score_result ["score_result (run_suite.py)"]
        I[Compute normalized_required_keywords\nfrom fixture answer_key] --> J{failover_required_tokens\n⊆ normalized_required_keywords?}
        J -- No --> K[Skip failover event\nsequence check]
        J -- Yes --> L[Check RDS event reasoning\nin root_cause + validated_claims]
        L --> M{mentions_event_reasoning?}
        M -- No --> N[FAIL: RDS events not used\nas primary reasoning signal]
        M -- Yes --> O{sequence_present?}
        O -- No --> P[FAIL: sequence not\nexplicitly listed]
        O -- Yes --> Q[PASS]
    end
Loading

Comments Outside Diff (1)

  1. tests/synthetic/rds_postgres/run_suite.py, line 324-329 (link)

    P2 Module-level constant naming used for a local variable

    _REQUIRED_SEQUENCE_TOKENS uses the ALL_CAPS_WITH_UNDERSCORES convention reserved for module-level constants (PEP 8 §Constants). As a local tuple defined inside an if-block it should use a regular lowercase name.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "fix(synthetic-rds): stop misgrading and ..." | Re-trigger Greptile

Comment thread app/nodes/plan_actions/plan_actions.py Outdated
if action_name in available_action_names
]

allowed_action_names = set[str](available_action_names)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-idiomatic generic-alias constructor call

set[str](iterable) invokes a types.GenericAlias as a constructor — it works at runtime (Python 3.9+ delegates to the origin type set), but it is the only occurrence of this pattern in the codebase and mypy may infer the result as set[Any] rather than set[str]. The idiomatic equivalent is a plain set() call with a type annotation.

Suggested change
allowed_action_names = set[str](available_action_names)
allowed_action_names: set[str] = set(available_action_names)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing this in 13bf1b6. I changed the typed set construction to the annotated set(...) form in plan_actions.py, and also cleaned up the local failover sequence tuple naming in run_suite.py as a follow-up style fix. Focused regression tests and ruff checks were rerun after the update.

@yas789
Copy link
Copy Markdown
Contributor Author

yas789 commented Apr 24, 2026

Addressing the two Greptile nits in a follow-up commit:

  • switched the typed set construction to the annotated form
  • renamed the local failover sequence tuple to lowercase to match local-variable naming conventions

Re-ran focused regression tests and ruff checks after the change.

@rrajan94
Copy link
Copy Markdown
Collaborator

hey @yas789
Tested locally with --mock-grafana. Scorer fix works, scenario 008 passes. ✅

But the planner-loop fix doesn't fully cover this case. The investigation hit max_loops: 4 because the planner kept selecting run_diagnostic_code four times in a row, even though it failed every time with the same TypeError:

[WARNING] Action failed: run_diagnostic_code:FAILED(TypeError: run_diagnostic_code() missing 1 require)

image

@yas789
Copy link
Copy Markdown
Contributor Author

yas789 commented Apr 28, 2026

Thanks for flagging this. I reproduced the issue locally with a focused regression test: after a deterministic run_diagnostic_code failure with a TypeError, the next planning pass still offered run_diagnostic_code again because failed actions were not recorded in the planner’s exclusion history.

I’m addressing this by separating successful and failed action history:

  • actions remains successful actions only, so synthetic trajectory scoring and report summaries are not polluted by failed attempts.
  • failed_actions records failure metadata for audit/debugging.
  • exhausted_actions records failed actions that should be excluded from future planning.

The planner now blocks successful actions plus exhausted actions. Non-retryable failures like TypeError, missing required arguments, unknown actions, and invalid responses are exhausted immediately. Retryable failures like timeouts, throttling, connection errors, and 5xx errors remain retryable until they hit a deterministic retry threshold.

I also added tests for:

  • reproducing the repeated failed-action selection
  • immediate exhaustion for non-retryable failures
  • retryable failures remaining selectable before the threshold
  • retryable failures becoming exhausted at the threshold
  • reroute logic not repeatedly triggering on exhausted actions

Local verification is green: focused planner tests, lint, format check, typecheck, make test-cov, and the original synthetic 008-storage-full-missing-metric scenario. I’ll push the implementation after one final review of the diff.

@rrajan94
Copy link
Copy Markdown
Collaborator

Re-tested with be04428. The exhausted-action tracking solves the loop:

  • loops_used: 3 (down from 4/4)
  • Zero run_diagnostic_code calls in the trace
  • 25/25 unit tests pass, including the 4 new failure-classification tests
  • make lint / format-check / typecheck all clean

Thanks @yas789 for the deeper architectural fix. 🙌

@rrajan94 rrajan94 merged commit 1e57c59 into Tracer-Cloud:main Apr 29, 2026
7 checks passed
@github-actions
Copy link
Copy Markdown
Contributor

🎊 Achievement unlocked: PR Merged. @yas789 passed code review, survived CI, and shipped. Respect. 🤝


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[synthetic-qa] 008-storage-full-missing-metric: Validate agent infers storage exhaustion without FreeStorageSpace

2 participants