
fix(rca): surface bad-query Performance Insights evidence #1268

Merged
muddlebee merged 5 commits into Tracer-Cloud:main from sundaram2021:fix/cpu-saturation-bad-query
May 5, 2026

Conversation

@sundaram2021
Contributor

Fixes #600

Describe the changes you have made in this PR -

This PR improves RDS CPU saturation diagnosis for the synthetic bad-query case by making Performance Insights evidence more explicit in the RCA prompt.

Changes:

  • Renders fixture-native Performance Insights fields such as statement, db_load_avg, nested wait_events, top_wait_events, top_users, and top_hosts.
  • Labels Grafana log entries sourced from aws_performance_insights as Performance Insights.
  • Adds guidance to name the exact SQL statement and rule out connection exhaustion when DB connections are stable.
  • Adds focused prompt-builder tests for the issue evidence path.
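To make the fixture-native field names concrete, here is a hypothetical Performance Insights evidence fragment (the key names come from the list above; all values and the fallback line are invented for illustration):

```python
# Hypothetical fixture fragment; key names follow the PR description,
# values are invented for illustration only.
pi_evidence = {
    "top_sql": [
        {
            "statement": "SELECT * FROM orders WHERE status = 'open' ORDER BY created_at",
            "db_load_avg": 7.8,
            "wait_events": [{"wait_event": "CPU", "share": 0.92}],
        }
    ],
    "top_wait_events": [{"name": "CPU", "db_load": 7.2}],
    "top_users": [{"user": "app_rw", "db_load": 6.9}],
    "top_hosts": [{"host": "10.0.3.17", "db_load": 6.9}],
}

# The renderer accepts either the pre-existing or the fixture-native key,
# so both "sql" and "statement" resolve to the same evidence line.
top = pi_evidence["top_sql"][0]
statement = top.get("sql") or top.get("statement")
print(statement)
```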

Demo/Screenshot for feature changes and bug fixes -

Proof:

  • pytest tests\nodes\root_cause_diagnosis\test_prompt_builder.py -q passed
  • pytest tests\nodes\root_cause_diagnosis -q passed: 198 passed
  • ruff check app\nodes\root_cause_diagnosis\prompt_builder.py tests\nodes\root_cause_diagnosis\test_prompt_builder.py passed

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The issue requires the RCA prompt to expose the exact bad SQL query and Performance Insights signals behind RDS CPU saturation. I chose to improve evidence rendering rather than loosen scoring, because the fixture already contains the correct ground truth.

The main implementation adds support for both existing and fixture-native Performance Insights field names, including statement/sql, db_load_avg/db_load, nested wait events, top users, and top hosts. Grafana log entries sourced from Performance Insights are now labeled clearly so the model can connect the evidence to the required reasoning. Tests assert the exact SQL, db load, wait events, source label, and bad-query directive appear in the generated prompt.

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note for Maintainer/Team Members

Please run the following test after setting up your ANTHROPIC_API_KEY:

python -m tests.synthetic.rds_postgres.run_suite --scenario 004-cpu-saturation-bad-query --mock-grafana --json

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR enriches the RCA prompt builder to surface richer Performance Insights evidence for the bad-query CPU saturation scenario: it adds support for fixture-native field names (statement, db_load_avg, nested wait_events, top_wait_events, top_users, top_hosts), labels PI-sourced Grafana log entries as "Performance Insights", and adds a new LLM directive to name the exact SQL and rule out connection exhaustion. Three focused unit tests cover the new rendering paths. All changes are additive and scoped to evidence formatting, with no changes to scoring or graph logic.

Confidence Score: 4/5

Safe to merge; only P2 style concerns remain.

All findings are P2 (raw source_type passthrough and a minor test coverage gap). No logic bugs or security issues were found. The new helpers are well-typed and the core formatting changes are correct.

No files require special attention; the source_type mapping in _format_grafana_log_entry is worth revisiting if new source types are added.

Important Files Changed

  • app/nodes/root_cause_diagnosis/prompt_builder.py: Adds helper functions for formatting Grafana log entries with source labels and rendering expanded Performance Insights fields (statement, db_load_avg, nested wait_events, top_users, top_hosts). Logic is clean and safe; minor concern that non-PI source_type values are passed verbatim to the LLM prompt.
  • tests/nodes/root_cause_diagnosis/test_prompt_builder.py: New test file with three focused unit tests covering fixture-native PI fields, Grafana log source labelling, and the bad-query directive. Tests have proper type annotations and follow project conventions. Missing coverage for the grafana_error_logs rendering path.
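A focused test in the style described might look like the sketch below. The builder name `build_prompt` and its signature are stand-ins (the real tests target the project's prompt builder in tests/nodes/root_cause_diagnosis/test_prompt_builder.py):

```python
# Sketch only: stand-in builder and a focused assertion on the evidence path.
def build_prompt(evidence: dict) -> str:
    # Minimal stand-in that mimics the dual key fallback.
    top = evidence["top_sql"][0]
    sql = top.get("sql") or top["statement"]
    return f"Performance Insights\nTop SQL: {sql}"


def test_prompt_surfaces_exact_sql() -> None:
    evidence = {"top_sql": [{"statement": "SELECT pg_sleep(60)"}]}
    prompt = build_prompt(evidence)
    # The exact SQL and the PI label must appear in the generated prompt.
    assert "SELECT pg_sleep(60)" in prompt
    assert "Performance Insights" in prompt


test_prompt_surfaces_exact_sql()
```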

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Grafana log entry] --> B{isinstance dict?}
    B -- No --> C[str truncated to 300]
    B -- Yes --> D[extract message, source_type, source_identifier]
    D --> E{source_type == aws_performance_insights?}
    E -- Yes --> F[relabel as Performance Insights]
    E -- No --> G[keep raw source_type]
    F --> H[join source parts]
    G --> H
    H --> I{source non-empty?}
    I -- No --> J[return message only]
    I -- Yes --> K[return source + message]

    subgraph PI[Performance Insights Section]
        L[top_sql items] --> M[sql = sql or statement]
        M --> N[db_load = db_load or db_load_avg]
        N --> O[render wait_events inline via _format_wait_events]
        P[wait_events or top_wait_events] --> Q[Top Wait Events block]
        R[top_users] --> S[Top Users block]
        T[top_hosts] --> U[Top Hosts block]
    end


@cerencamkiran
Collaborator

Thanks @sundaram2021.

Ran this locally.

Validation:

  • pytest tests\nodes\root_cause_diagnosis -q → all passing
  • synthetic: 004-cpu-saturation-bad-query → passed

I also validated against main: the scenario already passes on main, so this is not addressing a failing case. The change is primarily improving prompt construction and evidence clarity.

Overall, I agree with the direction. Making Performance Insights evidence more explicit (statement/db_load/wait_events, etc.) and reinforcing the exact-SQL + connection-exhaustion rule-out guidance should reduce variance and make the RCA reasoning more deterministic across providers.

Greptile’s comments are valid and worth addressing as follow-ups:

  • _format_grafana_log_entry: avoid leaking raw source_type values into the prompt; a mapping or safe fallback would keep the prompt cleaner
  • add minimal coverage for the grafana_error_logs branch since it shares the same formatter.
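The first follow-up could take roughly this shape. The label map contents and the fallback string are assumptions for illustration, not the project's actual mapping:

```python
# One way to avoid leaking raw source_type values into the prompt:
# an explicit label map with a safe generic fallback. Map entries
# beyond aws_performance_insights are invented for illustration.
_SOURCE_LABELS = {
    "aws_performance_insights": "Performance Insights",
    "aws_cloudwatch": "CloudWatch",
}


def source_label(source_type: str) -> str:
    # Unknown types collapse to a generic label instead of passing
    # arbitrary internal identifiers through to the LLM prompt.
    return _SOURCE_LABELS.get(source_type, "Log source")
```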

@sundaram2021
Contributor Author

Thanks for testing this out, @cerencamkiran.
I was just waiting on the testing; I'm resolving the Greptile comments now.

@sundaram2021
Contributor Author

CI is failing because of some missing configuration in the API keys.
@muddlebee can you look into it?

@muddlebee
Collaborator

@sundaram2021 Merge/rebase latest main into your PR branch

@sundaram2021
Contributor Author

done

@sundaram2021
Contributor Author

Hey @rrajan94, can you please review this? It's all tested and everything passes successfully.
Thanks :)

@cerencamkiran
Collaborator

Tested locally:

  • test_prompt_builder.py passes
  • 004-cpu-saturation-bad-query now correctly surfaces the exact SQL, db_load / wait events, and rules out connection exhaustion
  • also checked another scenario to ensure the stronger bad-query guidance does not regress the CPU red-herring case

This looks like a solid improvement to the evidence rendering / diagnosis prompt layer.

One small robustness nit: in the top wait events section, we still read the wait name with item.get("name", "unknown"), while _format_wait_events() supports both name and wait_event. If Performance Insights fixtures can emit wait_event there as well, it may be worth using the same fallback to avoid rendering unknown.
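The suggested fallback is a one-liner; mirroring the behaviour described for `_format_wait_events()`, it might look like this (helper name is illustrative):

```python
# Accept both key spellings for the wait-event name, falling back to
# "unknown" only when neither is present. Helper name is a stand-in.
def wait_event_name(item: dict) -> str:
    return item.get("name") or item.get("wait_event") or "unknown"
```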

Good work @sundaram2021

@sundaram2021
Contributor Author


Followed the existing convention. I thought about changing it, but it would add inconsistency; if the tests had failed it would be worth changing, but I think it's okay for now.

@muddlebee muddlebee merged commit 683f374 into Tracer-Cloud:main May 5, 2026
10 checks passed
@github-actions
Contributor

github-actions Bot commented May 5, 2026

🤖 CI passed. Linter didn't scream. Reviewer typed LGTM. @sundaram2021, every machine in this pipeline just slow-clapped. 🖥️✨


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.



Development

Successfully merging this pull request may close these issues.

[synthetic-qa] 004-cpu-saturation-bad-query: Validate agent identifies the specific bad query

3 participants