refactor: improve RDS storage-full RCA evidence handling #1279
Greptile Summary
This PR improves the RDS storage-exhaustion RCA path by deriving structured
Confidence Score: 4/5
Safe to merge; the only findings are P2 quality/robustness items that do not affect correctness on the happy path.
All findings are P2: a silent timestamp-format fallback for float-typed Loki nanosecond values, and a minor DRY issue where stats are computed twice per metric series. No data loss, no crash, no security concern.
app/nodes/investigate/processing/post_process.py (_timestamp_from_loki_ns), app/tools/utils/metric_summary.py (duplicate stats traversal)
Important Files Changed
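To make the first P2 finding in the summary above concrete, here is a minimal sketch of a nanosecond-timestamp conversion that accepts integer, string, and float inputs instead of silently falling back to another format. It is not the PR's `_timestamp_from_loki_ns`; the function name `timestamp_from_loki_ns`, its signature, and the choice to return `None` for unparseable input are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional, Union

NS_PER_SECOND = 1_000_000_000


def timestamp_from_loki_ns(value: Union[int, float, str, None]) -> Optional[str]:
    """Convert a nanosecond epoch value to an ISO-8601 UTC string.

    Hypothetical helper: returns None when the value cannot be parsed,
    rather than guessing at another timestamp format.
    """
    if value is None or isinstance(value, bool):
        return None
    try:
        if isinstance(value, str):
            text = value.strip()
            # Digit-only strings keep full integer precision; other
            # numeric strings (e.g. "1.7e18") go through float first.
            ns = int(text) if text.lstrip("-").isdigit() else int(float(text))
        else:
            # Floats lose sub-microsecond precision at this magnitude,
            # but the second-level value is still usable.
            ns = int(value)
    except (ValueError, OverflowError):
        return None

    seconds, remainder_ns = divmod(ns, NS_PER_SECOND)
    try:
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    except (OverflowError, OSError, ValueError):
        return None
    return moment.replace(microsecond=remainder_ns // 1000).isoformat()
```

Handling the float case explicitly means a float-typed value takes the same code path as the usual digit-only string, so the output format stays consistent across input types.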
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
GM[query_grafana_metrics] -->|raw Prometheus series| MAP_M[_map_grafana_metrics]
GL[query_grafana_logs] -->|Loki log entries| MAP_L[_map_grafana_logs]
GA[query_grafana_alert_rules] -->|backend ruler response| NORM[_normalize_backend_alert_rules]
MAP_M -->|summarize_prometheus_metrics| SUMM[grafana_metric_summaries]
SUMM -->|aws_rds_ prefix filter| CWM[aws_cloudwatch_metrics]
MAP_L -->|source_type=db-instance filter| RDSE[aws_rds_events]
MAP_L -->|Top SQL / Top Wait regex parse| PI[aws_performance_insights]
NORM -->|rules list| ART[grafana_alert_rules]
CWM --> PROMPT[build_diagnosis_prompt]
RDSE --> PROMPT
PI --> PROMPT
ART --> PROMPT
SUMM --> PROMPT
PROMPT -->|RDS CloudWatch Metrics + Grafana Metrics + RDS Events + Performance Insights + Alert Rules| LLM[RCA LLM]
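The two filter edges in the flowchart above (the `aws_rds_` prefix filter and the `source_type=db-instance` filter) can be sketched as follows. This is only an illustration under assumed data shapes: the function names and the `metric_name` and `labels` keys are hypothetical, and the PR's actual structures may differ.

```python
from typing import Any


def select_rds_metrics(summaries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep metric summaries whose name starts with the aws_rds_ prefix.

    Assumes each summary carries its name under a "metric_name" key.
    """
    return [
        summary
        for summary in summaries
        if str(summary.get("metric_name", "")).startswith("aws_rds_")
    ]


def select_rds_events(log_entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep log entries labelled source_type=db-instance.

    Assumes each entry carries its stream labels under a "labels" key.
    """
    return [
        entry
        for entry in log_entries
        if (entry.get("labels") or {}).get("source_type") == "db-instance"
    ]


# Example input, invented for illustration only.
summaries = [
    {"metric_name": "aws_rds_free_storage_space_average"},
    {"metric_name": "node_cpu_seconds_total"},
]
logs = [
    {"labels": {"source_type": "db-instance"}, "line": "storage event"},
    {"labels": {"source_type": "application"}, "line": "request served"},
]
rds_metrics = select_rds_metrics(summaries)
rds_events = select_rds_events(logs)
```

Keeping the filters as small pure functions makes each derived evidence bucket easy to test without the Grafana tools.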
Reviews (1): Last reviewed commit: "fix(rds): improve Grafana evidence for s..."
8908ca7 to a58f680
LGTM 👍
Fixes #599
Describe the changes you have made in this PR -
This PR improves RDS storage exhaustion diagnosis for the synthetic 003-storage-full case by making Grafana-backed RDS evidence more explicit in the RCA prompt.
Changes:
FreeStorageSpace and WriteIOPS summaries in the diagnosis prompt instead of truncated raw metric JSON.
rules shape.
Demo/Screenshot for feature changes and bug fixes -
Proof:
pytest tests\tools\test_grafana_metrics_tool.py tests\tools\test_grafana_alert_rules_tool.py tests\nodes\root_cause_diagnosis\test_rds_grafana_evidence.py -q passed: 26 passed
pytest tests\synthetic\rds_postgres\test_suite.py -k "score_result or scenario_metadata or scenario_evidence or load_all or inheritance" -q passed: 12 passed
make typecheck passed
make format-check passed
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
The issue requires the RCA prompt to expose the full storage-exhaustion evidence path: FreeStorageSpace collapsing, elevated WriteIOPS, the definitive RDS storage event, and the bulk archival INSERT shown by Performance Insights. I chose to improve evidence shaping and rendering rather than changing the answer key or hard-coding the scenario.
The main implementation keeps the existing architecture intact: Grafana tools collect raw metrics/logs/rules, post-processing derives structured RDS evidence, and the diagnosis prompt renders concise evidence summaries. The new metric-summary utility preserves raw time-series data while adding first/latest/min/max/peak/trend summaries so the model can reason over the storage and write I/O signals reliably. Tests assert that the generated prompt contains the storage trend, RDS event, alert rule, and bulk INSERT evidence.
Checklist before requesting a review
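As an illustration of the first/latest/min/max/peak/trend idea described in the implementation approach, here is a minimal sketch that summarizes one Prometheus range-query series (a list of `[timestamp, "value"]` pairs). It is not the PR's `metric_summary.py`. The function name `summarize_series`, the `flat_tolerance` parameter, the output keys, and the reading of "peak" as the timestamp of the maximum are all assumptions.

```python
from typing import Any, Optional


def summarize_series(
    values: list[list[Any]], flat_tolerance: float = 0.01
) -> Optional[dict[str, Any]]:
    """Summarize [timestamp, "value"] samples from one metric series.

    Hypothetical helper: returns None when no sample can be parsed.
    """
    points: list[tuple[float, float]] = []
    for sample in values:
        try:
            timestamp, number = float(sample[0]), float(sample[1])
        except (TypeError, ValueError, IndexError):
            continue  # skip malformed samples
        if number != number:  # NaN is the only value unequal to itself
            continue
        points.append((timestamp, number))
    if not points:
        return None

    points.sort(key=lambda point: point[0])  # order by time
    first, latest = points[0][1], points[-1][1]
    peak_timestamp, maximum = max(points, key=lambda point: point[1])
    minimum = min(number for _, number in points)

    # Treat changes within flat_tolerance of the larger endpoint as flat.
    change = latest - first
    threshold = flat_tolerance * max(abs(first), abs(latest))
    if abs(change) <= threshold:
        trend = "flat"
    else:
        trend = "increasing" if change > 0 else "decreasing"

    return {
        "first": first,
        "latest": latest,
        "min": minimum,
        "max": maximum,
        "peak_timestamp": peak_timestamp,
        "trend": trend,
        "sample_count": len(points),
    }


# Invented FreeStorageSpace-like samples, for illustration only.
summary = summarize_series([[1, "100"], [2, "60"], [3, "5"]])
```

Parsing the samples once and deriving every statistic from that parsed list keeps the summary consistent and avoids walking the series twice.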
Note for Maintainer/Team/Reviewer Members
Please run the following test after setting up your ANTHROPIC_API_KEY