fix: metrics of compactor pending jobs #8854

hengfeiyang · 2025-10-21T03:19:50Z

PR Type

Bug fix, Enhancement

Description

Fix compactor pending jobs metric aggregation
Adjust compaction interval default to 10 seconds
Tune default compaction batch size by mode

Diagram Walkthrough

flowchart LR
  CFG["Config defaults"]
  METRIC["Pending jobs aggregation"]
  STORAGE_MY["MySQL impl"]
  STORAGE_PG["Postgres impl"]
  STORAGE_SQL["SQLite impl"]

  CFG -- "interval=10, batch size tuning" --> METRIC
  METRIC -- "sum counts per org/type" --> STORAGE_MY
  METRIC -- "sum counts per org/type" --> STORAGE_PG
  METRIC -- "sum counts per org/type" --> STORAGE_SQL

File Walkthrough

Relevant files

Enhancement

config.rs `Tune compaction defaults: interval and batch size` src/config/src/config.rs Lower default `compact.interval` from 60 to 10. Set `compact.batch_size` to 100 in local mode. Set `compact.batch_size` to `cpu_num * 4` otherwise.	+6/-2

Bug fix

mysql.rs `Fix MySQL pending job count aggregation` src/infra/src/file_list/mysql.rs Aggregate pending job counts by summing (`+=`) per org/type.	+1/-1
postgres.rs `Fix Postgres pending job count aggregation` src/infra/src/file_list/postgres.rs Aggregate pending job counts by summing (`+=`) per org/type.	+1/-1
sqlite.rs `Fix SQLite pending job count aggregation` src/infra/src/file_list/sqlite.rs Aggregate pending job counts by summing (`+=`) per org/type.	+1/-1

github-actions · 2025-10-21T03:20:46Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Behavior Change
The default compact interval is reduced from 60 to 10 seconds, which increases scheduling frequency and system load. Validate the impact on resource usage and whether documentation/flags reflect this new default.

    if cfg.compact.interval < 1 {
        cfg.compact.interval = 10;
    }

Batch Size Logic
Batch size now depends on mode: 100 for local and `cpu_num * 4` otherwise. Confirm `cpu_num` is always >= 1 and that these defaults are appropriate across environments; consider upper bounds to avoid oversized batches on high-core machines.

    if cfg.compact.batch_size < 1 {
        if cfg.common.local_mode {
            cfg.compact.batch_size = 100;
        } else {
            cfg.compact.batch_size = cfg.limit.cpu_num as i64 * 4;
        }
    }

Aggregation Semantics
Pending jobs are now summed across entries for the same org/type. Ensure the upstream query does not already aggregate, to avoid double counting; verify consistency across the MySQL/Postgres/SQLite implementations and metrics consumers.

    job_status
        .entry(org)
        .or_default()
        .entry(stream_type)
        .and_modify(|e| {
            *e += counts;
        })
        .or_insert(counts);
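
To make the effect of the new defaults concrete, here is a minimal sketch of the fallback logic quoted in the reviewer guide. The `CompactSettings` struct and the `apply_defaults` function are hypothetical stand-ins for illustration only; the real configuration type in `src/config/src/config.rs` has more fields, and the `i64` type for `interval` is an assumption.

```rust
/// Hypothetical, trimmed-down stand-in for the compaction settings.
/// Field types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct CompactSettings {
    interval: i64,
    batch_size: i64,
}

/// Fills in defaults only when a value is unset (less than 1),
/// mirroring the checks quoted in the reviewer guide.
fn apply_defaults(mut s: CompactSettings, local_mode: bool, cpu_num: usize) -> CompactSettings {
    if s.interval < 1 {
        // New default: 10 seconds (previously 60).
        s.interval = 10;
    }
    if s.batch_size < 1 {
        // 100 in local mode, otherwise four jobs per CPU core.
        s.batch_size = if local_mode { 100 } else { cpu_num as i64 * 4 };
    }
    s
}
```

Under these assumptions, an unset configuration on an 8-core non-local node resolves to an interval of 10 and a batch size of 32, while explicitly configured values are left untouched.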

github-actions · 2025-10-21T03:21:01Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue | Impact: Medium
Suggestion: Prevent integer overflow on sum

Adding `counts` repeatedly can overflow if values are large or untrusted. Use checked addition with a saturation or error path to prevent wraparound and corrupted metrics.

src/infra/src/file_list/mysql.rs [1639-1641]

     .entry(stream_type)
    -.and_modify(|e| *e += counts)
    +.and_modify(|e| {
    +    if let Some(sum) = e.checked_add(counts) {
    +        *e = sum;
    +    } else {
    +        // Saturate on overflow to avoid wraparound
    +        *e = i64::MAX;
    +    }
    +})
     .or_insert(counts);

Suggestion importance[1-10]: 8
Why: Correctly identifies potential overflow when accumulating `counts` and proposes a safe, localized fix using checked addition with saturation, enhancing correctness under edge conditions.

Category: General | Impact: Medium
Suggestion: Clamp batch size to safe range

Multiplying CPU by 4 for batch size can oversubscribe resources on large machines and cause memory pressure. Cap the value to a sane maximum and ensure a minimum of 1 to avoid zero or negative edge cases.

src/config/src/config.rs [2806-2812]

     if cfg.compact.batch_size < 1 {
    -    if cfg.common.local_mode {
    -        cfg.compact.batch_size = 100;
    -    } else {
    -        cfg.compact.batch_size = cfg.limit.cpu_num as i64 * 4;
    -    }
    +    let base = if cfg.common.local_mode { 100 } else { (cfg.limit.cpu_num as i64).saturating_mul(4) };
    +    // Clamp to a reasonable range to avoid overloads on large hosts
    +    cfg.compact.batch_size = base.clamp(1, 10_000);
     }

Suggestion importance[1-10]: 7
Why: Adds protective clamping to prevent runaway batch sizes on large hosts, improving robustness without changing semantics for typical cases. It's a reasonable safeguard though not strictly required by the diff.

Category: General | Impact: Low
Suggestion: Restore safer default interval

Lowering the default interval from 60 to 10 may cause excessive compaction frequency and load spikes. Consider keeping a safer minimum by clamping to a reasonable floor or making the value proportional to environment (e.g., local vs. prod).

src/config/src/config.rs [2777-2779]

     if cfg.compact.interval < 1 {
    -    cfg.compact.interval = 10;
    +    cfg.compact.interval = if cfg.common.local_mode { 10 } else { 60 };
     }

Suggestion importance[1-10]: 6
Why: Sensible proposal to avoid overly aggressive compaction by keeping a higher default in non-local mode; aligns with prior behavior. However, it's a policy choice rather than a correctness bug, so impact is moderate.
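
The first two suggestions can be sketched as standalone functions, which may make their edge cases easier to check. This is an illustration, not code from the PR: `add_pending` and `default_batch_size` are hypothetical names, and the sketch uses `i64::saturating_add` as a shorter alternative to the suggested `checked_add` branch. The two agree for non-negative counts, which is assumed here since the counts come from `count(*)`.

```rust
use std::collections::HashMap;

/// Adds `counts` to the running total for `stream_type`, saturating at
/// `i64::MAX` on overflow instead of wrapping around.
/// Assumes `counts` is non-negative.
fn add_pending(totals: &mut HashMap<String, i64>, stream_type: &str, counts: i64) {
    totals
        .entry(stream_type.to_string())
        .and_modify(|e| *e = e.saturating_add(counts))
        .or_insert(counts);
}

/// Default batch size with the clamp proposed in the suggestion.
/// The bounds 1 and 10_000 are the ones quoted in the suggested diff.
fn default_batch_size(local_mode: bool, cpu_num: usize) -> i64 {
    let base = if local_mode {
        100
    } else {
        (cpu_num as i64).saturating_mul(4)
    };
    base.clamp(1, 10_000)
}
```

With the clamp, a reported CPU count of 0 yields a batch size of 1 rather than 0, and very large hosts are capped at 10,000.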

greptile-apps

Greptile Overview

Summary

Fixed a critical bug in compactor pending jobs metrics where counts were being overwritten instead of accumulated when multiple streams share the same organization and stream type.

Key Changes:

Fixed get_pending_jobs_count() across all three database implementations (MySQL, Postgres, SQLite) by changing *e = counts to *e += counts when aggregating job counts by stream type
Optimized compactor performance by reducing default interval from 60s to 10s
Increased default batch size for better throughput: 4x CPU cores in production mode, 100 in local mode

Technical Details:
The bug occurred because the SQL query returns one row per stream (format: org/stream_type/stream_name), and when aggregating by (org, stream_type) pairs, the code was overwriting previous counts instead of summing them. For example, if org1/logs/stream1 had 10 jobs and org1/logs/stream2 had 5 jobs, the metric would incorrectly report 5 instead of 15.

The fix ensures accurate reporting of pending compaction jobs across all streams.
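
The overwrite-versus-accumulate difference described above can be reproduced with a small self-contained sketch. The function names (`split_key`, `aggregate_overwrite`, `aggregate_sum`) and the row type are hypothetical; the actual `get_pending_jobs_count()` implementations read rows from the database, and skipping malformed keys is an assumption made only for this example.

```rust
use std::collections::HashMap;

/// Splits a stream key of the form `org/stream_type/stream_name`.
/// Returns `None` if any of the three parts is missing.
fn split_key(stream: &str) -> Option<(&str, &str)> {
    let mut parts = stream.splitn(3, '/');
    let org = parts.next()?;
    let stream_type = parts.next()?;
    parts.next()?; // the stream name must be present
    Some((org, stream_type))
}

/// Old behavior: each row replaces the previous value for (org, stream_type).
fn aggregate_overwrite(rows: &[(String, i64)]) -> HashMap<String, HashMap<String, i64>> {
    let mut out: HashMap<String, HashMap<String, i64>> = HashMap::new();
    for (stream, counts) in rows {
        let Some((org, ty)) = split_key(stream) else { continue };
        out.entry(org.to_string())
            .or_default()
            .insert(ty.to_string(), *counts);
    }
    out
}

/// Fixed behavior: rows for the same (org, stream_type) are summed.
fn aggregate_sum(rows: &[(String, i64)]) -> HashMap<String, HashMap<String, i64>> {
    let mut out: HashMap<String, HashMap<String, i64>> = HashMap::new();
    for (stream, counts) in rows {
        let Some((org, ty)) = split_key(stream) else { continue };
        *out.entry(org.to_string())
            .or_default()
            .entry(ty.to_string())
            .or_insert(0) += *counts;
    }
    out
}
```

Feeding both functions the rows `org1/logs/stream1 = 10` and `org1/logs/stream2 = 5` gives 5 for `org1`/`logs` with the overwriting variant (the last row wins) and 15 with the summing variant, matching the example in the text.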

Confidence Score: 5/5

This PR is safe to merge with minimal risk - it fixes a critical metrics bug with a simple, correct logic change
The changes are straightforward and fix an obvious bug (overwriting vs accumulating counts). The fix is applied consistently across all three database implementations. The config changes improve performance without introducing risks. SQL queries already use proper parameter binding for safety.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
src/infra/src/file_list/mysql.rs	5/5	Fixed critical bug: changed assignment to accumulation when counting pending jobs by stream type
src/infra/src/file_list/postgres.rs	5/5	Fixed critical bug: changed assignment to accumulation when counting pending jobs by stream type
src/infra/src/file_list/sqlite.rs	5/5	Fixed critical bug: changed assignment to accumulation when counting pending jobs by stream type
src/config/src/config.rs	5/5	Adjusted compactor config defaults: reduced interval from 60s to 10s and increased batch_size (4x CPU cores in production, 100 in local mode)

Sequence Diagram

sequenceDiagram
    participant Compactor as Compactor Job
    participant FileList as File List Service
    participant DB as Database (MySQL/Postgres/SQLite)
    participant Metrics as Prometheus Metrics
    
    Note over Compactor: Every 300s (pending_jobs_metric_interval)
    Compactor->>FileList: get_pending_jobs_count()
    FileList->>DB: SELECT stream, status, count(*)<br/>WHERE status = Pending<br/>GROUP BY stream, status
    DB-->>FileList: Results (one row per stream)
    
    Note over FileList: Parse stream format:<br/>org/stream_type/stream_name
    
    loop For each result row
        FileList->>FileList: Extract org and stream_type
        Note over FileList: OLD BUG: *e = counts (overwrites)<br/>NEW FIX: *e += counts (accumulates)
        FileList->>FileList: Accumulate counts by<br/>(org, stream_type)
    end
    
    FileList-->>Compactor: HashMap<org, HashMap<stream_type, count>>
    
    loop Reset all org metrics
        Compactor->>Metrics: Set COMPACT_PENDING_JOBS = 0
    end
    
    loop Set new metrics
        Compactor->>Metrics: Set COMPACT_PENDING_JOBS<br/>(org, stream_type) = count
    end
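
The last two loops of the diagram (reset, then set) can be sketched as follows. This is only an illustration under stated assumptions: the `Gauge` map is a hypothetical stand-in for the labelled `COMPACT_PENDING_JOBS` metric, `publish` is an invented name, and the stated reason for the reset (avoiding stale values for series that no longer have pending jobs) is an inference from the diagram rather than something the PR states.

```rust
use std::collections::HashMap;

/// Stand-in for a labelled gauge keyed by (org, stream_type).
/// The real metric is a Prometheus gauge; this map is illustrative only.
type Gauge = HashMap<(String, String), i64>;

/// Publishes freshly aggregated counts using the reset-then-set pattern.
fn publish(gauge: &mut Gauge, counts: &HashMap<String, HashMap<String, i64>>) {
    // Step 1: zero every series published earlier, so a series that is
    // absent from `counts` does not keep its previous value.
    for value in gauge.values_mut() {
        *value = 0;
    }
    // Step 2: write the new counts per (org, stream_type).
    for (org, by_type) in counts {
        for (stream_type, count) in by_type {
            gauge.insert((org.clone(), stream_type.clone()), *count);
        }
    }
}
```

In this sketch, a series that had 2 pending jobs in the previous cycle and is missing from the new result reads 0 after `publish`, while series present in the result take their new totals.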

4 files reviewed, no comments


testdino-playwright-reporter · 2025-10-21T03:45:10Z

⚠️ Test Run Unstable

Author: `hengfeiyang` | Branch: `fix/compactor-metrics` | Commit: `1859ffc`

Testdino Test Results

Status	Total	Passed	Failed	Skipped	Flaky	Pass Rate	Duration
All tests passed	364	342	0	19	3	94%	4m 39s


testdino-playwright-reporter · 2025-10-21T03:52:54Z

⚠️ Test Run Unstable

Author: `hengfeiyang` | Branch: `fix/compactor-metrics` | Commit: `1859ffc`

Testdino Test Results

Status	Total	Passed	Failed	Skipped	Flaky	Pass Rate	Duration
All tests passed	364	344	0	19	1	95%	4m 39s


hengfeiyang added 2 commits October 21, 2025 11:16

fix: metrics of compactor pending jobs

923d6ce

fix: rollback a comment

f3a0b30

github-actions bot added the "☢️ Bug" (Something isn't working) and "Review effort 2/5" labels Oct 21, 2025

greptile-apps bot reviewed Oct 21, 2025


Merge branch 'main' into fix/compactor-metrics

1859ffc

haohuaijin approved these changes Oct 21, 2025


hengfeiyang merged commit cffdf90 into main Oct 21, 2025
32 checks passed

hengfeiyang deleted the fix/compactor-metrics branch October 21, 2025 04:50

uddhavdave pushed a commit that referenced this pull request Oct 27, 2025

fix: metrics of compactor pending jobs (#8854)

bde5e8b
