
fix(rca): surface bad-query Performance Insights evidence #1268

Merged
muddlebee merged 5 commits into Tracer-Cloud:main from sundaram2021:fix/cpu-saturation-bad-query
May 5, 2026

Conversation

@sundaram2021
Contributor

Fixes #600

Describe the changes you have made in this PR -

This PR improves RDS CPU saturation diagnosis for the synthetic bad-query case by making Performance Insights evidence more explicit in the RCA prompt.

Changes:

  • Renders fixture-native Performance Insights fields such as statement, db_load_avg, nested wait_events, top_wait_events, top_users, and top_hosts.
  • Labels Grafana log entries sourced from aws_performance_insights as Performance Insights.
  • Adds guidance to name the exact SQL statement and rule out connection exhaustion when DB connections are stable.
  • Adds focused prompt-builder tests for the issue evidence path.
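To make the fixture-native field names concrete, here is a hypothetical Performance Insights evidence fragment (the key names come from the list above; all values and the fallback line are invented for illustration):

```python
# Hypothetical fixture fragment; key names follow the PR description,
# values are invented for illustration only.
pi_evidence = {
    "top_sql": [
        {
            "statement": "SELECT * FROM orders WHERE status = 'open' ORDER BY created_at",
            "db_load_avg": 7.8,
            "wait_events": [{"wait_event": "CPU", "share": 0.92}],
        }
    ],
    "top_wait_events": [{"name": "CPU", "db_load": 7.2}],
    "top_users": [{"user": "app_rw", "db_load": 6.9}],
    "top_hosts": [{"host": "10.0.3.17", "db_load": 6.9}],
}

# The renderer accepts either the pre-existing or the fixture-native key,
# so both "sql" and "statement" resolve to the same evidence line.
top = pi_evidence["top_sql"][0]
statement = top.get("sql") or top.get("statement")
print(statement)
```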

Demo/Screenshot for feature changes and bug fixes -

Proof:

  • pytest tests\nodes\root_cause_diagnosis\test_prompt_builder.py -q passed
  • pytest tests\nodes\root_cause_diagnosis -q passed: 198 passed
  • ruff check app\nodes\root_cause_diagnosis\prompt_builder.py tests\nodes\root_cause_diagnosis\test_prompt_builder.py passed

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The issue requires the RCA prompt to expose the exact bad SQL query and Performance Insights signals behind RDS CPU saturation. I chose to improve evidence rendering rather than loosen scoring, because the fixture already contains the correct ground truth.

The main implementation adds support for both existing and fixture-native Performance Insights field names, including statement/sql, db_load_avg/db_load, nested wait events, top users, and top hosts. Grafana log entries sourced from Performance Insights are now labeled clearly so the model can connect the evidence to the required reasoning. Tests assert the exact SQL, db load, wait events, source label, and bad-query directive appear in the generated prompt.

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note for Maintainer/Team Members

Please run the following test after setting up your ANTHROPIC_API_KEY:

python -m tests.synthetic.rds_postgres.run_suite --scenario 004-cpu-saturation-bad-query --mock-grafana --json

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR enriches the RCA prompt builder to surface richer Performance Insights evidence for the bad-query CPU saturation scenario: it adds support for fixture-native field names (statement, db_load_avg, nested wait_events, top_wait_events, top_users, top_hosts), labels PI-sourced Grafana log entries as "Performance Insights", and adds a new LLM directive to name the exact SQL and rule out connection exhaustion. Three focused unit tests cover the new rendering paths. All changes are additive and scoped to evidence formatting, with no changes to scoring or graph logic.

Confidence Score: 4/5

Safe to merge; only P2 style concerns remain.

All findings are P2 (raw source_type passthrough and a minor test coverage gap). No logic bugs or security issues were found. The new helpers are well-typed and the core formatting changes are correct.

No files require special attention; the source_type mapping in _format_grafana_log_entry is worth revisiting if new source types are added.

Important Files Changed

  • app/nodes/root_cause_diagnosis/prompt_builder.py: Adds helper functions for formatting Grafana log entries with source labels and rendering expanded Performance Insights fields (statement, db_load_avg, nested wait_events, top_users, top_hosts). Logic is clean and safe; minor concern that non-PI source_type values are passed verbatim to the LLM prompt.
  • tests/nodes/root_cause_diagnosis/test_prompt_builder.py: New test file with three focused unit tests covering fixture-native PI fields, Grafana log source labelling, and the bad-query directive. Tests have proper type annotations and follow project conventions. Missing coverage for the grafana_error_logs rendering path.
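A focused test in the style described might look like the sketch below. The builder name `build_prompt` and its signature are stand-ins (the real tests target the project's prompt builder in tests/nodes/root_cause_diagnosis/test_prompt_builder.py):

```python
# Sketch only: stand-in builder and a focused assertion on the evidence path.
def build_prompt(evidence: dict) -> str:
    # Minimal stand-in that mimics the dual key fallback.
    top = evidence["top_sql"][0]
    sql = top.get("sql") or top["statement"]
    return f"Performance Insights\nTop SQL: {sql}"


def test_prompt_surfaces_exact_sql() -> None:
    evidence = {"top_sql": [{"statement": "SELECT pg_sleep(60)"}]}
    prompt = build_prompt(evidence)
    # The exact SQL and the PI label must appear in the generated prompt.
    assert "SELECT pg_sleep(60)" in prompt
    assert "Performance Insights" in prompt


test_prompt_surfaces_exact_sql()
```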

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Grafana log entry] --> B{isinstance dict?}
    B -- No --> C[str truncated to 300]
    B -- Yes --> D[extract message, source_type, source_identifier]
    D --> E{source_type == aws_performance_insights?}
    E -- Yes --> F[relabel as Performance Insights]
    E -- No --> G[keep raw source_type]
    F --> H[join source parts]
    G --> H
    H --> I{source non-empty?}
    I -- No --> J[return message only]
    I -- Yes --> K[return source + message]

    subgraph PI[Performance Insights Section]
        L[top_sql items] --> M[sql = sql or statement]
        M --> N[db_load = db_load or db_load_avg]
        N --> O[render wait_events inline via _format_wait_events]
        P[wait_events or top_wait_events] --> Q[Top Wait Events block]
        R[top_users] --> S[Top Users block]
        T[top_hosts] --> U[Top Hosts block]
    end


@cerencamkiran
Collaborator

Thanks @sundaram2021.

Ran this locally.

Validation:

  • pytest tests\nodes\root_cause_diagnosis -q → all passing
  • synthetic: 004-cpu-saturation-bad-query → passed

I also validated against main: the scenario already passes on main, so this is not addressing a failing case. The change is primarily improving prompt construction and evidence clarity.

Overall, I agree with the direction. Making Performance Insights evidence more explicit (statement/db_load/wait_events, etc.) and reinforcing the exact-SQL + connection-exhaustion rule-out guidance should reduce variance and make the RCA reasoning more deterministic across providers.

Greptile’s comments are valid and worth addressing as follow-ups:

  • _format_grafana_log_entry: avoid leaking raw source_type values into the prompt; a mapping or safe fallback would keep the prompt cleaner
  • add minimal coverage for the grafana_error_logs branch since it shares the same formatter.
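The first follow-up could take roughly this shape. The label map contents and the fallback string are assumptions for illustration, not the project's actual mapping:

```python
# One way to avoid leaking raw source_type values into the prompt:
# an explicit label map with a safe generic fallback. Map entries
# beyond aws_performance_insights are invented for illustration.
_SOURCE_LABELS = {
    "aws_performance_insights": "Performance Insights",
    "aws_cloudwatch": "CloudWatch",
}


def source_label(source_type: str) -> str:
    # Unknown types collapse to a generic label instead of passing
    # arbitrary internal identifiers through to the LLM prompt.
    return _SOURCE_LABELS.get(source_type, "Log source")
```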

@sundaram2021
Contributor Author

Thanks for testing this out, @cerencamkiran.
I was just waiting on the testing; I'm resolving the Greptile comments now.

@sundaram2021
Contributor Author

CI is failing because of some missing configuration in the API keys.
@muddlebee can you look into it?

@muddlebee
Collaborator

@sundaram2021 Merge/rebase latest main into your PR branch

@sundaram2021
Contributor Author

done

@sundaram2021
Contributor Author

Hey @rrajan94, can you please review this? It's all tested and everything passes successfully.
Thanks :)

@cerencamkiran
Collaborator

Tested locally:

  • test_prompt_builder.py passes
  • 004-cpu-saturation-bad-query now correctly surfaces the exact SQL, db_load / wait events, and rules out connection exhaustion
  • also checked another scenario to ensure the stronger bad-query guidance does not regress the CPU red-herring case

This looks like a solid improvement to the evidence rendering / diagnosis prompt layer.

One small robustness nit: in the top wait events section, we still read the wait name with item.get("name", "unknown"), while _format_wait_events() supports both name and wait_event. If Performance Insights fixtures can emit wait_event there as well, it may be worth using the same fallback to avoid rendering unknown.
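The suggested fallback is a one-liner; mirroring the behaviour described for `_format_wait_events()`, it might look like this (helper name is illustrative):

```python
# Accept both key spellings for the wait-event name, falling back to
# "unknown" only when neither is present. Helper name is a stand-in.
def wait_event_name(item: dict) -> str:
    return item.get("name") or item.get("wait_event") or "unknown"
```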

Good work @sundaram2021

@sundaram2021
Contributor Author


Followed the existing convention. I thought about changing it, but it would add inconsistency; if the tests had failed it would be worth changing, but I think it's okay for now.

@muddlebee muddlebee merged commit 683f374 into Tracer-Cloud:main May 5, 2026
10 checks passed
@github-actions
Contributor

github-actions Bot commented May 5, 2026

🤖 CI passed. Linter didn't scream. Reviewer typed LGTM. @sundaram2021, every machine in this pipeline just slow-clapped. 🖥️✨


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.



Development

Successfully merging this pull request may close these issues.

[synthetic-qa] 004-cpu-saturation-bad-query: Validate agent identifies the specific bad query

3 participants