fix: metrics of compactor pending jobs #8854
Greptile Overview
Summary
Fixed a critical bug in compactor pending jobs metrics where counts were being overwritten instead of accumulated when multiple streams share the same organization and stream type.
Key Changes:
- Fixed `get_pending_jobs_count()` across all three database implementations (MySQL, Postgres, SQLite) by changing `*e = counts` to `*e += counts` when aggregating job counts by stream type
- Optimized compactor performance by reducing default interval from 60s to 10s
- Increased default batch size for better throughput: 4x CPU cores in production mode, 100 in local mode
Technical Details:
The bug occurred because the SQL query returns one row per stream (format: org/stream_type/stream_name), and when aggregating by (org, stream_type) pairs, the code was overwriting previous counts instead of summing them. For example, if org1/logs/stream1 had 10 jobs and org1/logs/stream2 had 5 jobs, the metric would incorrectly report 5 instead of 15.
The fix ensures accurate reporting of pending compaction jobs across all streams.
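The aggregation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `aggregate_pending_jobs` and the `(stream, count)` row shape are assumptions made for the example; only the `org/stream_type/stream_name` key format and the `+=` accumulation come from the PR.

```rust
use std::collections::HashMap;

/// Aggregates per-stream pending-job counts into org -> stream_type -> total.
/// Each row is (stream key "org/stream_type/stream_name", pending count).
/// Hypothetical helper for illustration only.
pub fn aggregate_pending_jobs(rows: &[(&str, i64)]) -> HashMap<String, HashMap<String, i64>> {
    let mut totals: HashMap<String, HashMap<String, i64>> = HashMap::new();
    for (stream, count) in rows {
        // splitn(3, ..) keeps any '/' inside the stream name intact.
        let mut parts = stream.splitn(3, '/');
        let (Some(org), Some(stream_type), Some(_name)) =
            (parts.next(), parts.next(), parts.next())
        else {
            continue; // skip keys that do not have three segments
        };
        let e = totals
            .entry(org.to_string())
            .or_default()
            .entry(stream_type.to_string())
            .or_insert(0);
        // Accumulate. Plain assignment here (`*e = *count`) would keep only
        // the last stream's count for each (org, stream_type) pair.
        *e += *count;
    }
    totals
}
```

With the rows `("org1/logs/stream1", 10)` and `("org1/logs/stream2", 5)`, the entry for `org1` / `logs` is 15, matching the example in the description.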
Confidence Score: 5/5
- This PR is safe to merge with minimal risk - it fixes a critical metrics bug with a simple, correct logic change
- The changes are straightforward and fix an obvious bug (overwriting vs accumulating counts). The fix is applied consistently across all three database implementations. The config changes improve performance without introducing risks. SQL queries already use proper parameter binding for safety.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/infra/src/file_list/mysql.rs | 5/5 | Fixed critical bug: changed assignment to accumulation when counting pending jobs by stream type |
| src/infra/src/file_list/postgres.rs | 5/5 | Fixed critical bug: changed assignment to accumulation when counting pending jobs by stream type |
| src/infra/src/file_list/sqlite.rs | 5/5 | Fixed critical bug: changed assignment to accumulation when counting pending jobs by stream type |
| src/config/src/config.rs | 5/5 | Adjusted compactor config defaults: reduced interval from 60s to 10s and increased batch_size (4x CPU cores in production, 100 in local mode) |
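The new config defaults can be read as the rule below. This is a sketch under assumptions: the constant and function names are hypothetical, and whether the real code applies these values only when the setting is unset is not shown in this summary.

```rust
/// Default compactor interval in seconds after this PR (previously 60).
/// Hypothetical constant name, for illustration only.
pub const DEFAULT_COMPACT_INTERVAL_SECS: u64 = 10;

/// Default compaction batch size by mode, as described in the PR summary:
/// 100 in local mode, otherwise 4x the number of CPU cores.
/// Hypothetical function name and signature.
pub fn default_compact_batch_size(local_mode: bool, cpu_num: i64) -> i64 {
    if local_mode {
        100
    } else {
        cpu_num * 4
    }
}
```

For example, a non-local node with 8 cores would get a default batch size of 32 under this rule.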
Sequence Diagram
sequenceDiagram
participant Compactor as Compactor Job
participant FileList as File List Service
participant DB as Database (MySQL/Postgres/SQLite)
participant Metrics as Prometheus Metrics
Note over Compactor: Every 300s (pending_jobs_metric_interval)
Compactor->>FileList: get_pending_jobs_count()
FileList->>DB: SELECT stream, status, count(*)<br/>WHERE status = Pending<br/>GROUP BY stream, status
DB-->>FileList: Results (one row per stream)
Note over FileList: Parse stream format:<br/>org/stream_type/stream_name
loop For each result row
FileList->>FileList: Extract org and stream_type
Note over FileList: OLD BUG: *e = counts (overwrites)<br/>NEW FIX: *e += counts (accumulates)
FileList->>FileList: Accumulate counts by<br/>(org, stream_type)
end
FileList-->>Compactor: HashMap<org, HashMap<stream_type, count>>
loop Reset all org metrics
Compactor->>Metrics: Set COMPACT_PENDING_JOBS = 0
end
loop Set new metrics
Compactor->>Metrics: Set COMPACT_PENDING_JOBS<br/>(org, stream_type) = count
end
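The reset-then-set step at the end of the diagram can be sketched as below. A plain `HashMap` stands in for the `COMPACT_PENDING_JOBS` gauge, and `publish_pending_jobs` is a hypothetical name; the real code uses Prometheus metrics, whose API is not reproduced here.

```rust
use std::collections::HashMap;

/// Stand-in for a labelled gauge: (org, stream_type) -> value.
/// Assumption for illustration; not the Prometheus gauge type.
pub type Gauge = HashMap<(String, String), i64>;

/// Zeroes every known label pair, then writes the fresh counts, so a pair
/// with no pending jobs left reports 0 instead of a stale value.
pub fn publish_pending_jobs(gauge: &mut Gauge, counts: &HashMap<String, HashMap<String, i64>>) {
    // Reset pass: clear values for all previously reported label pairs.
    for value in gauge.values_mut() {
        *value = 0;
    }
    // Set pass: write the aggregated count for each (org, stream_type).
    for (org, by_type) in counts {
        for (stream_type, count) in by_type {
            gauge.insert((org.clone(), stream_type.clone()), *count);
        }
    }
}
```

The reset pass is what makes correct aggregation matter: once the gauge is zeroed, the value written in the set pass is the only source of truth until the next interval.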
4 files reviewed, no comments
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 364 | 342 | 0 | 19 | 3 | 94% | 4m 39s |
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 364 | 344 | 0 | 19 | 1 | 95% | 4m 39s |
PR Type
Bug fix, Enhancement
Description
Fix compactor pending jobs metric aggregation
Adjust compaction interval default to 10 seconds
Tune default compaction batch size by mode
File Walkthrough
- config.rs (src/config/src/config.rs): Tune compaction defaults: interval and batch size
  - `compact.interval` from 60 to 10.
  - `compact.batch_size` to 100 in local mode.
  - `compact.batch_size` to `cpu_num * 4` otherwise.
- mysql.rs (src/infra/src/file_list/mysql.rs): Fix MySQL pending job count aggregation
  - Accumulation (`+=`) per org/type.
- postgres.rs (src/infra/src/file_list/postgres.rs): Fix Postgres pending job count aggregation
  - Accumulation (`+=`) per org/type.
- sqlite.rs (src/infra/src/file_list/sqlite.rs): Fix SQLite pending job count aggregation
  - Accumulation (`+=`) per org/type.