refactor: improve RDS storage-full RCA evidence handling#1279

Merged
rrajan94 merged 3 commits into Tracer-Cloud:main from sundaram2021:refactor/validate-agent-storage-exhaustion
May 4, 2026

Conversation

@sundaram2021
Contributor

Fixes #599

Describe the changes you have made in this PR -

This PR improves RDS storage exhaustion diagnosis for the synthetic 003-storage-full case by making Grafana-backed RDS evidence more explicit in the RCA prompt.

Changes:

  • Adds compact Prometheus/Mimir metric summaries for time-series evidence.
  • Maps Grafana-backed RDS metrics into structured CloudWatch-style evidence.
  • Derives RDS events and Performance Insights evidence from Grafana log streams.
  • Renders FreeStorageSpace and WriteIOPS summaries in the diagnosis prompt instead of truncated raw metric JSON.
  • Normalizes fixture-backed Grafana alert rules to match the production rules shape.
  • Adds focused regression tests for the storage-full Grafana evidence path.

Demo/Screenshot for feature changes and bug fixes -

Proof:

  • pytest tests\tools\test_grafana_metrics_tool.py tests\tools\test_grafana_alert_rules_tool.py tests\nodes\root_cause_diagnosis\test_rds_grafana_evidence.py -q: 26 passed
  • pytest tests\synthetic\rds_postgres\test_suite.py -k "score_result or scenario_metadata or scenario_evidence or load_all or inheritance" -q: 12 passed
  • make typecheck passed
  • make format-check passed

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The issue requires the RCA prompt to expose the full storage-exhaustion evidence path: FreeStorageSpace collapsing, elevated WriteIOPS, the definitive RDS storage event, and the bulk archival INSERT shown by Performance Insights. I chose to improve evidence shaping and rendering rather than changing the answer key or hard-coding the scenario.

The main implementation keeps the existing architecture intact: Grafana tools collect raw metrics/logs/rules, post-processing derives structured RDS evidence, and the diagnosis prompt renders concise evidence summaries. The new metric-summary utility preserves raw time-series data while adding first/latest/min/max/peak/trend summaries so the model can reason over the storage and write I/O signals reliably. Tests assert that the generated prompt contains the storage trend, RDS event, alert rule, and bulk INSERT evidence.
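The first/latest/min/max/peak/trend summary described above can be illustrated with a minimal sketch. This is not the code from `metric_summary.py`: the function name `summarize_series`, the `(timestamp, value)` sample shape, and the `flat_tolerance` threshold are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, value); assumed shape


def summarize_series(
    samples: Sequence[Sample], flat_tolerance: float = 0.05
) -> Optional[dict]:
    """Hypothetical helper: reduce one time series to a compact summary."""
    if not samples:
        return None

    # Sort by timestamp so "first" and "latest" are well defined.
    ordered = sorted(samples, key=lambda s: s[0])
    values = [value for _, value in ordered]
    first, latest = values[0], values[-1]

    # Locate the peak sample in a single pass; min is computed once as well.
    peak_ts, peak = max(ordered, key=lambda s: s[1])
    lowest = min(values)

    # Classify the trend from the relative change between first and latest.
    baseline = abs(first) if first != 0 else 1.0
    change = (latest - first) / baseline
    if change > flat_tolerance:
        trend = "rising"
    elif change < -flat_tolerance:
        trend = "falling"
    else:
        trend = "flat"

    return {
        "first": first,
        "latest": latest,
        "min": lowest,
        "max": peak,
        "peak_at": peak_ts,
        "trend": trend,
    }
```

For a FreeStorageSpace-like series that drops from 100 to 5, this sketch would report a `falling` trend, which is the kind of signal the prompt is meant to surface instead of raw JSON.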

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note for Maintainer/Team/Reviewer Members

Please run the following test after setting your ANTHROPIC_API_KEY:

python -m tests.synthetic.rds_postgres.run_suite --scenario 003-storage-full --mock-grafana --json

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR improves the RDS storage-exhaustion RCA path by deriving structured aws_rds_events, aws_performance_insights, and aws_cloudwatch_metrics evidence from Grafana log/metric streams, adding a new metric_summary utility for compact Prometheus/Mimir summaries, and rendering those summaries in the diagnosis prompt in place of truncated raw JSON. A backend alert-rules normalizer is also added so fixture-backed responses match the production rules shape.

Confidence Score: 4/5

Safe to merge; the only findings are P2 quality/robustness items that do not affect correctness on the happy path.

All findings are P2: a silent timestamp-format fallback for float-typed Loki nanosecond values, and a minor DRY issue where stats are computed twice per metric series. No data loss, no crash, no security concern.

app/nodes/investigate/processing/post_process.py (_timestamp_from_loki_ns), app/tools/utils/metric_summary.py (duplicate stats traversal)
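One way to avoid a silent fallback for float-typed nanosecond values is sketched below. This is not the PR's `_timestamp_from_loki_ns`; the name `timestamp_from_ns`, the ISO-8601 output, and the choice to return `None` on bad input are assumptions for illustration only.

```python
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation
from typing import Optional, Union


def timestamp_from_ns(raw: Union[str, int, float, None]) -> Optional[str]:
    """Hypothetical helper: nanosecond epoch -> ISO-8601 UTC string, or None."""
    if raw is None or isinstance(raw, bool):
        return None
    try:
        # Decimal parses "1700000000000000000" and "1.7e+18" alike, so
        # float-typed inputs do not fall through to an unreadable string.
        # Note: a float cannot hold full nanosecond precision at this scale.
        ns = int(Decimal(str(raw)))
    except (InvalidOperation, ValueError, OverflowError):
        return None

    seconds, remainder = divmod(ns, 1_000_000_000)
    try:
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    except (OverflowError, OSError, ValueError):
        return None

    # datetime only stores microseconds, so sub-microsecond digits are dropped.
    return moment.replace(microsecond=remainder // 1000).isoformat()
```

Returning `None` makes the failure explicit to the caller, which can then decide whether to omit the timestamp or log the malformed value.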

Important Files Changed

| Filename | Overview |
| --- | --- |
| app/tools/utils/metric_summary.py | New utility for Prometheus/Mimir metric summaries; well-structured with good edge-case handling, but duplicates min/max traversal between _build_summary_line and summarize_prometheus_metrics. |
| app/nodes/investigate/processing/post_process.py | Adds RDS-event, Performance Insights, and CloudWatch-metric derivation from Grafana data; the Loki nanosecond timestamp helper can silently produce an unreadable string for float-typed values. |
| app/nodes/root_cause_diagnosis/prompt_builder.py | Switches Grafana Metrics section to prefer compact summaries over raw JSON; falls back gracefully to old path when summaries are absent. |
| app/tools/GrafanaAlertRulesTool/__init__.py | Adds _normalize_backend_alert_rules to convert fixture/backend ruler responses into the client rules shape, and exposes total_rules in the return dict. |
| app/nodes/plan_actions/build_prompt.py | Two small prompt additions: a hint to use query_grafana_metrics for DB resource metrics and guidance to collect metrics/logs/alert-rules together for storage-pressure scenarios. |
| tests/nodes/root_cause_diagnosis/test_rds_grafana_evidence.py | New integration test asserting structured Grafana evidence and key strings in the diagnosis prompt for the 003-storage-full scenario. |
| tests/tools/test_grafana_alert_rules_tool.py | Extends existing alert-rules test to verify the new normalized rules shape and total_rules count. |
| tests/tools/test_grafana_metrics_tool.py | Adds a fixture-backed test that verifies FreeStorageSpace and WriteIOPS summaries have the expected trend and content. |
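The alert-rules normalization mentioned for `_normalize_backend_alert_rules` might look roughly like the sketch below. The input is assumed to be a Prometheus-style ruler payload (namespaces mapping to rule groups); the output field names other than `total_rules`, and the function name `flatten_ruler_response`, are hypothetical and not taken from the PR.

```python
from typing import Any, Dict, List


def flatten_ruler_response(
    payload: Dict[str, List[Dict[str, Any]]]
) -> Dict[str, Any]:
    """Hypothetical helper: flatten namespace -> groups -> rules into one list."""
    rules: List[Dict[str, Any]] = []
    for namespace, groups in payload.items():
        for group in groups or []:
            for rule in group.get("rules", []):
                rules.append(
                    {
                        "namespace": namespace,
                        "group": group.get("name", ""),
                        # Alerting rules carry "alert"; recording rules "record".
                        "name": rule.get("alert") or rule.get("record") or "",
                        "expr": rule.get("expr", ""),
                        "labels": rule.get("labels", {}),
                    }
                )
    # Expose the count alongside the list, as the overview describes.
    return {"rules": rules, "total_rules": len(rules)}
```

A flat list with an explicit count keeps the fixture-backed and production paths interchangeable for whatever consumes the rules downstream.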

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    GM[query_grafana_metrics] -->|raw Prometheus series| MAP_M[_map_grafana_metrics]
    GL[query_grafana_logs] -->|Loki log entries| MAP_L[_map_grafana_logs]
    GA[query_grafana_alert_rules] -->|backend ruler response| NORM[_normalize_backend_alert_rules]

    MAP_M -->|summarize_prometheus_metrics| SUMM[grafana_metric_summaries]
    SUMM -->|aws_rds_ prefix filter| CWM[aws_cloudwatch_metrics]

    MAP_L -->|source_type=db-instance filter| RDSE[aws_rds_events]
    MAP_L -->|Top SQL / Top Wait regex parse| PI[aws_performance_insights]

    NORM -->|rules list| ART[grafana_alert_rules]

    CWM --> PROMPT[build_diagnosis_prompt]
    RDSE --> PROMPT
    PI --> PROMPT
    ART --> PROMPT
    SUMM --> PROMPT

    PROMPT -->|RDS CloudWatch Metrics + Grafana Metrics + RDS Events + Performance Insights + Alert Rules| LLM[RCA LLM]
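The log branch of the flowchart (a `source_type=db-instance` filter for RDS events and a Top SQL / Top Wait regex parse for Performance Insights) can be sketched as follows. The entry shape (`labels`, `message`, `timestamp` keys), the `Top SQL: ...` line format, and the name `split_log_evidence` are assumptions; the PR's `_map_grafana_logs` may differ.

```python
import re
from typing import Any, Dict, Iterable, List, Tuple

# Assumed line format, e.g. "Top SQL: INSERT INTO archive ..." or "Top Wait: IO:XactSync".
TOP_LINE = re.compile(r"^\s*Top (SQL|Wait)\s*:\s*(.+)$", re.IGNORECASE)


def split_log_evidence(
    entries: Iterable[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Hypothetical helper: split log entries into RDS events and PI findings."""
    events: List[Dict[str, Any]] = []
    insights: List[Dict[str, Any]] = []
    for entry in entries:
        message = entry.get("message", "")
        labels = entry.get("labels", {})
        match = TOP_LINE.match(message)
        if match:
            # Performance Insights style line: record its kind and detail.
            insights.append(
                {"kind": match.group(1).lower(), "detail": match.group(2).strip()}
            )
        elif labels.get("source_type") == "db-instance":
            # Instance-level event, such as a storage-full notification.
            events.append({"message": message, "timestamp": entry.get("timestamp")})
    return events, insights
```

Entries that match neither rule are dropped in this sketch, so only evidence relevant to the two structured sections is carried forward.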

Reviews (1): Last reviewed commit: "fix(rds): improve Grafana evidence for s..."

Comment thread app/nodes/investigate/processing/post_process.py
Comment thread app/tools/utils/metric_summary.py Outdated
@sundaram2021 force-pushed the refactor/validate-agent-storage-exhaustion branch from 8908ca7 to a58f680 on May 4, 2026 17:02
@rrajan94
Collaborator

rrajan94 commented May 4, 2026

LGTM 👍
Verified end-to-end on 003-storage-full (correct RCA, 100% confidence) with no regressions on 001-replication-lag or 005-failover.

@rrajan94 merged commit b514caa into Tracer-Cloud:main on May 4, 2026
10 checks passed

Development

Successfully merging this pull request may close these issues.

[synthetic-qa] 003-storage-full: Validate agent identifies storage exhaustion

2 participants