refactor: improve RDS storage-full RCA evidence handling by sundaram2021 · Pull Request #1279 · Tracer-Cloud/opensre

sundaram2021 · 2026-05-04T16:51:34Z

Fixes #599

Describe the changes you have made in this PR -

This PR improves RDS storage exhaustion diagnosis for the synthetic 003-storage-full case by making Grafana-backed RDS evidence more explicit in the RCA prompt.

Changes:

Adds compact Prometheus/Mimir metric summaries for time-series evidence.
Maps Grafana-backed RDS metrics into structured CloudWatch-style evidence.
Derives RDS events and Performance Insights evidence from Grafana log streams.
Renders FreeStorageSpace and WriteIOPS summaries in the diagnosis prompt instead of truncated raw metric JSON.
Normalizes fixture-backed Grafana alert rules to match the production rules shape.
Adds focused regression tests for the storage-full Grafana evidence path.
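To illustrate the "summaries instead of truncated raw metric JSON" item above, here is a minimal sketch of how a prompt section might prefer compact summaries and fall back to truncated raw JSON when none exist. The function name `render_metric_section`, the summary keys, and the 400-character limit are assumptions for illustration only; they are not taken from the PR's `prompt_builder.py`.

```python
import json
from typing import Any, Mapping


def render_metric_section(
    summaries: Mapping[str, Mapping[str, Any]],
    raw_metrics: Any,
    max_raw_chars: int = 400,  # assumed limit, not the project's value
) -> str:
    """Render one compact line per metric, or truncated raw JSON as a fallback."""
    if summaries:
        lines = []
        for name in sorted(summaries):
            s = summaries[name]
            lines.append(
                f"- {name}: first={s.get('first')} latest={s.get('latest')} "
                f"min={s.get('min')} max={s.get('max')} trend={s.get('trend')}"
            )
        return "\n".join(lines)

    # Fallback path: no summaries available, so show raw data, bounded in size.
    raw = json.dumps(raw_metrics, default=str)
    if len(raw) > max_raw_chars:
        raw = raw[:max_raw_chars] + "... (truncated)"
    return raw


if __name__ == "__main__":
    print(
        render_metric_section(
            {"FreeStorageSpace": {"first": 100, "latest": 5, "min": 5, "max": 100, "trend": "decreasing"}},
            raw_metrics=None,
        )
    )
```

The point of the design is that a trend line such as `trend=decreasing` survives intact in the prompt, whereas a truncated JSON blob may cut off exactly the samples that show the collapse.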

Demo/Screenshot for feature changes and bug fixes -

Proof:

pytest tests\tools\test_grafana_metrics_tool.py tests\tools\test_grafana_alert_rules_tool.py tests\nodes\root_cause_diagnosis\test_rds_grafana_evidence.py -q (26 passed)
pytest tests\synthetic\rds_postgres\test_suite.py -k "score_result or scenario_metadata or scenario_evidence or load_all or inheritance" -q (12 passed)
make typecheck (passed)
make format-check (passed)

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The issue requires the RCA prompt to expose the full storage-exhaustion evidence path: FreeStorageSpace collapsing, elevated WriteIOPS, the definitive RDS storage event, and the bulk archival INSERT shown by Performance Insights. I chose to improve evidence shaping and rendering rather than changing the answer key or hard-coding the scenario.

The main implementation keeps the existing architecture intact: Grafana tools collect raw metrics/logs/rules, post-processing derives structured RDS evidence, and the diagnosis prompt renders concise evidence summaries. The new metric-summary utility preserves raw time-series data while adding first/latest/min/max/peak/trend summaries so the model can reason over the storage and write I/O signals reliably. Tests assert that the generated prompt contains the storage trend, RDS event, alert rule, and bulk INSERT evidence.
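As a rough illustration of the first/latest/min/max/trend idea described above, the sketch below summarizes a list of `(timestamp, value)` samples in a single pass over the values. The function name `summarize_series`, the 5% tolerance used to call a series "stable", and the trend labels are assumptions; the PR's `metric_summary` utility may define them differently, and "peak" is omitted because its definition is not given here.

```python
from __future__ import annotations

from typing import Sequence


def summarize_series(samples: Sequence[tuple[float, float]]) -> dict[str, float | int | str]:
    """Summarize (unix_ts, value) samples into first/latest/min/max/trend."""
    if not samples:
        return {"count": 0, "trend": "no_data"}

    # Sort by timestamp so "first" and "latest" do not depend on input order.
    ordered = sorted(samples, key=lambda s: s[0])
    values = [value for _, value in ordered]
    first, latest = values[0], values[-1]

    # Assumed heuristic: changes under 5% of the larger endpoint count as stable.
    tolerance = 0.05 * max(abs(first), abs(latest), 1e-9)
    if latest - first > tolerance:
        trend = "increasing"
    elif first - latest > tolerance:
        trend = "decreasing"
    else:
        trend = "stable"

    return {
        "count": len(values),
        "first": first,
        "latest": latest,
        "min": min(values),
        "max": max(values),
        "trend": trend,
    }


if __name__ == "__main__":
    # Hypothetical FreeStorageSpace samples, in arbitrary units.
    print(summarize_series([(0, 100.0), (60, 50.0), (120, 5.0)]))
```

Returning the statistics as one dict also lets a caller build both the structured summary and a human-readable line from the same values, rather than traversing the series twice.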

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note for Maintainer/Team/Reviewer Members

Please run the following test after setting your ANTHROPIC_API_KEY:

python -m tests.synthetic.rds_postgres.run_suite --scenario 003-storage-full --mock-grafana --json

greptile-apps · 2026-05-04T16:55:12Z

Greptile Summary

This PR improves the RDS storage-exhaustion RCA path by deriving structured aws_rds_events, aws_performance_insights, and aws_cloudwatch_metrics evidence from Grafana log/metric streams, adding a new metric_summary utility for compact Prometheus/Mimir summaries, and rendering those summaries in the diagnosis prompt in place of truncated raw JSON. A backend alert-rules normalizer is also added so fixture-backed responses match the production rules shape.

Confidence Score: 4/5

Safe to merge; the only findings are P2 quality/robustness items that do not affect correctness on the happy path.

All findings are P2: a silent timestamp-format fallback for float-typed Loki nanosecond values, and a minor DRY issue where stats are computed twice per metric series. No data loss, no crash, no security concern.

app/nodes/investigate/processing/post_process.py (_timestamp_from_loki_ns), app/tools/utils/metric_summary.py (duplicate stats traversal)
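The float-typed timestamp finding can be made concrete with a small sketch. Loki reports entry timestamps as nanoseconds since the Unix epoch, usually as a string; if a value arrives as a float (or a string such as `1.7e+18`), a plain `int(str(value))` fails. The helper below, with the hypothetical name `loki_ns_to_iso`, is one possible way to handle those cases explicitly and return `None` rather than an unreadable string. It is not the PR's `_timestamp_from_loki_ns`.

```python
from __future__ import annotations

from datetime import datetime, timezone


def loki_ns_to_iso(raw: object) -> str | None:
    """Convert a Loki nanosecond epoch (int, float, or str) to an ISO-8601 UTC string."""
    if isinstance(raw, bool):  # bool is an int subclass; treat it as invalid input
        return None
    try:
        if isinstance(raw, (int, float)):
            ns = int(raw)  # raises for NaN (ValueError) and infinity (OverflowError)
        elif isinstance(raw, str):
            text = raw.strip()
            try:
                ns = int(text)
            except ValueError:
                ns = int(float(text))  # accepts forms like "1.7e+18"
        else:
            return None
    except (ValueError, OverflowError):
        return None

    seconds = ns // 1_000_000_000  # drop sub-second precision
    try:
        return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()
    except (OverflowError, OSError, ValueError):
        return None


if __name__ == "__main__":
    print(loki_ns_to_iso("1700000000000000000"))
    print(loki_ns_to_iso(1.7e18))
    print(loki_ns_to_iso("not-a-timestamp"))
```

Returning `None` makes the failure visible to the caller, which can then decide whether to drop the entry or keep it without a timestamp.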

Important Files Changed

Filename	Overview
app/tools/utils/metric_summary.py	New utility for Prometheus/Mimir metric summaries; well-structured with good edge-case handling, but duplicates min/max traversal between _build_summary_line and summarize_prometheus_metrics.
app/nodes/investigate/processing/post_process.py	Adds RDS-event, Performance Insights, and CloudWatch-metric derivation from Grafana data; the Loki nanosecond timestamp helper can silently produce an unreadable string for float-typed values.
app/nodes/root_cause_diagnosis/prompt_builder.py	Switches Grafana Metrics section to prefer compact summaries over raw JSON; falls back gracefully to old path when summaries are absent.
app/tools/GrafanaAlertRulesTool/__init__.py	Adds _normalize_backend_alert_rules to convert fixture/backend ruler responses into the client rules shape, and exposes total_rules in the return dict.
app/nodes/plan_actions/build_prompt.py	Two small prompt additions: a hint to use query_grafana_metrics for DB resource metrics and guidance to collect metrics/logs/alert-rules together for storage-pressure scenarios.
tests/nodes/root_cause_diagnosis/test_rds_grafana_evidence.py	New integration test asserting structured Grafana evidence and key strings in the diagnosis prompt for the 003-storage-full scenario.
tests/tools/test_grafana_alert_rules_tool.py	Extends existing alert-rules test to verify the new normalized rules shape and total_rules count.
tests/tools/test_grafana_metrics_tool.py	Adds a fixture-backed test that verifies FreeStorageSpace and WriteIOPS summaries have the expected trend and content.
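For the alert-rules row above, the following sketch shows one way a ruler-style response could be flattened into a rules list with a `total_rules` count. It assumes the backend payload maps namespace names to lists of rule groups, each with a `name` and a `rules` list; the actual fixture shape and the "client rules shape" targeted by `_normalize_backend_alert_rules` are not shown in this PR, so the function name and output keys here are hypothetical.

```python
from typing import Any


def normalize_ruler_response(payload: Any) -> dict[str, Any]:
    """Flatten {namespace: [{name, rules: [...]}, ...]} into a flat rules list."""
    rules: list[dict[str, Any]] = []
    if isinstance(payload, dict):
        for namespace, groups in payload.items():
            if not isinstance(groups, list):
                continue  # skip malformed namespaces instead of raising
            for group in groups:
                if not isinstance(group, dict):
                    continue
                for rule in group.get("rules") or []:
                    if isinstance(rule, dict):
                        # Keep the rule's own fields and record where it came from.
                        rules.append(
                            {**rule, "namespace": namespace, "group": group.get("name", "")}
                        )
    return {"rules": rules, "total_rules": len(rules)}


if __name__ == "__main__":
    # Hypothetical payload; the rule name and expression are made up.
    sample = {"rds": [{"name": "storage", "rules": [{"alert": "RDSLowStorage", "expr": "x < 1"}]}]}
    print(normalize_ruler_response(sample))
```

Tolerating malformed input and always returning the same two keys keeps downstream prompt rendering simple, since it never has to branch on the response shape.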

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    GM[query_grafana_metrics] -->|raw Prometheus series| MAP_M[_map_grafana_metrics]
    GL[query_grafana_logs] -->|Loki log entries| MAP_L[_map_grafana_logs]
    GA[query_grafana_alert_rules] -->|backend ruler response| NORM[_normalize_backend_alert_rules]

    MAP_M -->|summarize_prometheus_metrics| SUMM[grafana_metric_summaries]
    SUMM -->|aws_rds_ prefix filter| CWM[aws_cloudwatch_metrics]

    MAP_L -->|source_type=db-instance filter| RDSE[aws_rds_events]
    MAP_L -->|Top SQL / Top Wait regex parse| PI[aws_performance_insights]

    NORM -->|rules list| ART[grafana_alert_rules]

    CWM --> PROMPT[build_diagnosis_prompt]
    RDSE --> PROMPT
    PI --> PROMPT
    ART --> PROMPT
    SUMM --> PROMPT

    PROMPT -->|RDS CloudWatch Metrics + Grafana Metrics + RDS Events + Performance Insights + Alert Rules| LLM[RCA LLM]

Reviews (1): Last reviewed commit: "fix(rds): improve Grafana evidence for s..."

rrajan94 · 2026-05-04T17:25:49Z

LGTM 👍
Verified end-to-end on 003-storage-full (correct RCA, 100% confidence) with no regressions on 001-replication-lag or 005-failover.

greptile-apps Bot reviewed May 4, 2026


Comment thread app/nodes/investigate/processing/post_process.py

Comment thread app/tools/utils/metric_summary.py Outdated

sundaram2021 added 3 commits May 4, 2026 22:31

feat(metrics): add Prometheus metric summaries

6ab18d6

test(rds): add Grafana evidence regression coverage

5f63076

fix(rds): improve Grafana evidence for storage RCA

a58f680

sundaram2021 force-pushed the refactor/validate-agent-storage-exhaustion branch from 8908ca7 to a58f680 on May 4, 2026 17:02

rrajan94 merged commit b514caa into Tracer-Cloud:main May 4, 2026
10 checks passed
