fix(rca): surface bad-query Performance Insights evidence #1268
muddlebee merged 5 commits into Tracer-Cloud:main
Greptile Summary
This PR enriches the RCA prompt builder to surface richer Performance Insights evidence for the bad-query CPU saturation scenario: it adds support for fixture-native field names (
Confidence Score: 4/5
Safe to merge; only P2 style concerns remain.
All findings are P2 (raw `source_type` passthrough and a minor test coverage gap). No logic bugs or security issues were found. The new helpers are well-typed and the core formatting changes are correct.
No files require special attention; the `source_type` mapping in `_format_grafana_log_entry` is worth revisiting if new source types are added.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Grafana log entry] --> B{isinstance dict?}
B -- No --> C[str truncated to 300]
B -- Yes --> D[extract message, source_type, source_identifier]
D --> E{source_type == aws_performance_insights?}
E -- Yes --> F[relabel as Performance Insights]
E -- No --> G[keep raw source_type]
F --> H[join source parts]
G --> H
H --> I{source non-empty?}
I -- No --> J[return message only]
I -- Yes --> K[return source + message]
subgraph PI[Performance Insights Section]
L[top_sql items] --> M[sql = sql or statement]
M --> N[db_load = db_load or db_load_avg]
N --> O[render wait_events inline via _format_wait_events]
P[wait_events or top_wait_events] --> Q[Top Wait Events block]
R[top_users] --> S[Top Users block]
T[top_hosts] --> U[Top Hosts block]
end
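The upper half of the flowchart can be sketched in Python roughly as follows. This is only an illustration of the branching shown above, not the PR's actual `_format_grafana_log_entry`: the function name `format_grafana_log_entry`, the `SOURCE_LABELS` mapping, the space used to join the source parts, and the `[source] message` layout are all assumptions made for the example.

```python
from typing import Any

# Hypothetical label table. Only the aws_performance_insights relabel is taken
# from the flowchart; every other source type passes through unchanged.
SOURCE_LABELS = {"aws_performance_insights": "Performance Insights"}
MAX_RAW_ENTRY_CHARS = 300


def format_grafana_log_entry(entry: Any) -> str:
    """Render one Grafana log entry as a single prompt line."""
    if not isinstance(entry, dict):
        # Non-dict entries are stringified and truncated to 300 characters.
        return str(entry)[:MAX_RAW_ENTRY_CHARS]

    message = str(entry.get("message", ""))
    source_type = str(entry.get("source_type") or "")
    source_identifier = str(entry.get("source_identifier") or "")

    # Relabel known source types, keep the raw value otherwise.
    label = SOURCE_LABELS.get(source_type, source_type)
    source = " ".join(part for part in (label, source_identifier) if part)

    if not source:
        return message
    # Assumed layout: the source goes in brackets before the message.
    return f"[{source}] {message}"
```

Keeping the relabel in a lookup table rather than an `if` chain is one way to address the review note about raw `source_type` passthrough, since a new source type then only needs a new table entry.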
Reviews (1): Last reviewed commit: "test(rca): cover performance insights pr..."
Thanks @sundaram2021. Ran this locally. Validation:
I also validated against main: the scenario already passes on main, so this is not addressing a failing case. The change is primarily improving prompt construction and evidence clarity.
Overall, I agree with the direction. Making Performance Insights evidence more explicit (statement/db_load/wait_events, etc.) and reinforcing the exact-SQL + connection-exhaustion rule-out guidance should reduce variance and make the RCA reasoning more deterministic across providers.
Greptile’s comments are valid and worth addressing as follow-ups:
Thanks for testing this out, @cerencamkiran.
CI is failing because of some missing API key configuration.
@sundaram2021 Please merge or rebase the latest main into your PR branch.
Done.
Hey @rrajan94, can you please review this once? It's all tested and everything passes successfully.
Tested locally:
This looks like a solid improvement to the evidence rendering / diagnosis prompt layer. One small robustness nit: in the top wait events section, we still read the wait name with
Good work @sundaram2021
I followed the existing convention. I thought of changing it, but that would add inconsistency. If the tests had not passed it would be worth changing, but I think it's okay for now.
🤖 CI passed. Linter didn't scream. Reviewer typed LGTM. @sundaram2021, every machine in this pipeline just slow-clapped. 🖥️✨


Fixes #600
Describe the changes you have made in this PR -
This PR improves RDS CPU saturation diagnosis for the synthetic bad-query case by making Performance Insights evidence more explicit in the RCA prompt.
Changes:
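- Support the fixture-native Performance Insights fields `statement`, `db_load_avg`, nested `wait_events`, `top_wait_events`, `top_users`, and `top_hosts`.
- Label Grafana log entries sourced from `aws_performance_insights` as `Performance Insights`.
Demo/Screenshot for feature changes and bug fixes -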
Proof:
- `pytest tests\nodes\root_cause_diagnosis\test_prompt_builder.py -q` passed
- `pytest tests\nodes\root_cause_diagnosis -q` passed: 198 passed
- `ruff check app\nodes\root_cause_diagnosis\prompt_builder.py tests\nodes\root_cause_diagnosis\test_prompt_builder.py` passed
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
The issue requires the RCA prompt to expose the exact bad SQL query and Performance Insights signals behind RDS CPU saturation. I chose to improve evidence rendering rather than loosen scoring, because the fixture already contains the correct ground truth.
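As a rough illustration of what rendering a top-SQL row under either field name might look like, here is a minimal sketch. It is not the code from this PR: the helper names `first_present`, `format_wait_events`, and `format_top_sql_row`, the wait-event keys `name` and `db_load_avg`, and the output layout are all hypothetical, chosen only to show the fallback idea.

```python
from typing import Any


def first_present(item: dict[str, Any], *keys: str) -> Any:
    """Return the first value among keys that is not None, else None."""
    for key in keys:
        value = item.get(key)
        if value is not None:
            return value
    return None


def format_wait_events(events: list[dict[str, Any]]) -> str:
    """Render wait events inline; the keys used here are assumed, not confirmed."""
    parts = []
    for event in events:
        name = event.get("name", "unknown")
        load = event.get("db_load_avg")
        parts.append(f"{name}={load}" if load is not None else str(name))
    return ", ".join(parts)


def format_top_sql_row(item: dict[str, Any]) -> str:
    """Render one top-SQL row, accepting existing or fixture-native field names."""
    sql = first_present(item, "sql", "statement") or "<unknown SQL>"
    db_load = first_present(item, "db_load", "db_load_avg")

    line = f"- {sql}"
    if db_load is not None:
        line += f" (db_load={db_load})"
    waits = item.get("wait_events") or []
    if waits:
        line += f" waits: {format_wait_events(waits)}"
    return line
```

The sketch checks for `None` instead of using a plain `or` so that a legitimate `db_load` of `0` is still printed; whether that matters for the real fixtures is an open question.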
The main implementation adds support for both existing and fixture-native Performance Insights field names, including `statement`/`sql`, `db_load_avg`/`db_load`, nested wait events, top users, and top hosts. Grafana log entries sourced from Performance Insights are now labeled clearly so the model can connect the evidence to the required reasoning. Tests assert the exact SQL, db load, wait events, source label, and bad-query directive appear in the generated prompt.
Checklist before requesting a review
Note for Maintainer/Team Members
Please run the following test after setting up your `ANTHROPIC_API_KEY`