tsdb: Early compaction of stale series #16929

Merged
codesome merged 5 commits into main from codesome/stale-series-compaction
Jan 24, 2026
Conversation

@codesome
Member

@codesome codesome commented Jul 25, 2025

Closes #13616

Based on prometheus/proposals#55

Stale series tracking was added in #16925. This PR compacts the stale series into its own block before the normal compaction hits. Here is how the config works:

  • stale_series_compaction_threshold: As soon as the ratio of stale series in the head block crosses StaleSeriesImmediateCompactionThreshold, TSDB performs a stale series compaction: it writes all the stale series into a block and removes them from the head, but it does not remove them from the WAL. (Technically, this condition is checked every minute, so the compaction is not exactly immediate.)
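
As a rough illustration of the threshold check described above, here is a minimal Go sketch. The type `headStats` and the functions `staleRatio` and `shouldCompactStale` are hypothetical names invented for this example and are not taken from the Prometheus code base. Treating a non-positive threshold as "disabled" and using a strict comparison are also assumptions. In a real setup the check would be driven by a periodic timer (once a minute, per the description).

```go
package main

import "fmt"

// headStats is a hypothetical snapshot of head block counters.
type headStats struct {
	totalSeries uint64
	staleSeries uint64
}

// staleRatio returns the fraction of head series that are stale.
// An empty head yields 0 instead of dividing by zero.
func staleRatio(s headStats) float64 {
	if s.totalSeries == 0 {
		return 0
	}
	return float64(s.staleSeries) / float64(s.totalSeries)
}

// shouldCompactStale reports whether the stale ratio has crossed the
// threshold. Assumptions: a threshold <= 0 means the feature is off,
// and "crosses" is modelled as a strict greater-than comparison.
func shouldCompactStale(s headStats, threshold float64) bool {
	if threshold <= 0 {
		return false
	}
	return staleRatio(s) > threshold
}

func main() {
	const threshold = 0.5 // illustrative value only
	samples := []headStats{
		{totalSeries: 0, staleSeries: 0},
		{totalSeries: 1000, staleSeries: 200},
		{totalSeries: 1000, staleSeries: 600},
	}
	for _, s := range samples {
		fmt.Printf("total=%d stale=%d ratio=%.2f compact=%v\n",
			s.totalSeries, s.staleSeries, staleRatio(s), shouldCompactStale(s, threshold))
	}
}
```

The empty-head guard matters because a ratio computed over zero series would otherwise divide by zero.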

Additional details

  • WAL replay: after a stale series compaction, tombstones covering (MinInt64, MaxInt64) are added for all these stale series. During WAL replay there is a special case: when we find such a tombstone, we immediately remove the series from memory instead of storing the tombstone. This is required so that memory does not spike during WAL replay and the compacted stale series are not kept in memory.
  • Head block truncation ignores this block via the added metadata, similar to out-of-order blocks.
[ENHANCEMENT] tsdb: Experimental support for early compaction of stale series in the memory with configurable threshold.
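
The WAL replay special case described above can be sketched as follows. This is a toy model under stated assumptions: `replayState`, `tombstone`, `interval`, and `applyTombstone` are hypothetical names for this example only and do not mirror the actual TSDB types. The only detail taken from the description is that a tombstone spanning (MinInt64, MaxInt64) marks a series that was already compacted out of the head.

```go
package main

import (
	"fmt"
	"math"
)

// interval is a simplified time range; seriesRef identifies a series.
type interval struct{ mint, maxt int64 }
type seriesRef uint64

// tombstone marks a deleted time range for one series.
type tombstone struct {
	ref seriesRef
	iv  interval
}

// fullRange is the marker interval written for compacted stale series.
var fullRange = interval{mint: math.MinInt64, maxt: math.MaxInt64}

// replayState is a toy stand-in for the in-memory head during WAL replay.
type replayState struct {
	series     map[seriesRef]string     // ref -> label string
	tombstones map[seriesRef][]interval // stored partial deletions
}

func newReplayState() *replayState {
	return &replayState{
		series:     map[seriesRef]string{},
		tombstones: map[seriesRef][]interval{},
	}
}

// applyTombstone handles one tombstone record found in the WAL.
// A full-range tombstone drops the series from memory right away;
// any other tombstone is stored as usual.
func (r *replayState) applyTombstone(t tombstone) {
	if t.iv == fullRange {
		delete(r.series, t.ref)
		delete(r.tombstones, t.ref)
		return
	}
	r.tombstones[t.ref] = append(r.tombstones[t.ref], t.iv)
}

func main() {
	r := newReplayState()
	r.series[1] = `up{job="a"}`
	r.series[2] = `up{job="b"}`

	r.applyTombstone(tombstone{ref: 1, iv: fullRange})                      // compacted stale series
	r.applyTombstone(tombstone{ref: 2, iv: interval{mint: 100, maxt: 200}}) // ordinary deletion

	fmt.Println("series in memory:", len(r.series))
	fmt.Println("stored tombstones:", len(r.tombstones))
}
```

Dropping the series immediately means the replayed head never has to hold both the series and a tombstone for it.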

@codesome codesome force-pushed the codesome/stale-series-compaction branch 2 times, most recently from 7f92b48 to f6d7ac4 Compare July 25, 2025 23:41
@codesome
Member Author

/prombench main

@prombot
Contributor

prombot commented Jul 25, 2025

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-16929 and main

After the successful deployment (check status here), the benchmarking results can be viewed at:

Available Commands:

  • To restart benchmark: /prombench restart main
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

@codesome
Member Author

/prombench cancel

@prombot
Contributor

prombot commented Jul 26, 2025

Benchmark cancel is in progress.

@codesome
Member Author

Looks like stale series tracking is not working. Stale samples are not being put for the series I guess.

@codesome codesome force-pushed the codesome/stale-series-compaction branch 2 times, most recently from 4486bdb to e693e22 Compare July 29, 2025 00:50
@codesome
Member Author

/prombench main

@prombot
Contributor

prombot commented Jul 29, 2025

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-16929 and main

After the successful deployment (check status here), the benchmarking results can be viewed at:

Available Commands:

  • To restart benchmark: /prombench restart main
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

@codesome
Member Author

/prombench stop

@prombot
Contributor

prombot commented Jul 29, 2025

Incorrect /prombench syntax; command requires one argument that matches (master|main|v[0-9]+\.[0-9]+\.[0-9]+\S*) regex.

Available Commands:

  • To start benchmark: /prombench <branch or git tag to compare with>
  • To restart benchmark: /prombench <branch or git tag to compare with>
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

Advanced Flags for start and restart Commands:

  • --bench.directory=<sub-directory of github.com/prometheus/test-infra/prombench>
    • See the details here, defaults to manifests/prombench.
  • --bench.version=<branch | @commit>
    • See the details here, defaults to master.

Examples:

  • /prombench v3.0.0
  • /prombench v3.0.0 --bench.version=@aca1803ccf5d795eee4b0848707eab26d05965cc --bench.directory=manifests/prombench

@codesome
Member Author

/prombench main

@prombot
Contributor

prombot commented Jul 29, 2025

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-16929 and main

After the successful deployment (check status here), the benchmarking results can be viewed at:

Available Commands:

  • To restart benchmark: /prombench restart main
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

@codesome
Member Author

/prombench cancel

@prombot
Contributor

prombot commented Jul 29, 2025

Benchmark cancel is in progress.

@codesome
Member Author

Here are the results from the PoC, and they are in line with expectations.

The good: memory and the number of time series stayed down. For the prombench data pattern, memory usage was consistently 30-60% lower.

The bad: CPU usage is higher and instant queries took a hit, mostly because instant queries now require merging results from the stale block and the head block. The increase is 20-30% for this benchmark, with occasional peaks of up to 50%.

I will fix some of the edge cases in the code and run it in our internal cluster to get some real world numbers.

[5 screenshots, taken 2025-07-28]
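
To picture the extra merge step mentioned in this comment, here is a minimal Go sketch of combining two sorted result sets, one from the head and one from the stale series block. It is only an illustration: `mergeSeries` is a hypothetical helper, series are reduced to plain label strings, and this is not Prometheus's actual merge logic.

```go
package main

import "fmt"

// mergeSeries merges two sorted slices of series identifiers into one
// sorted slice. An identifier present in both inputs is emitted once;
// treating that as a possible case is an assumption of this sketch.
func mergeSeries(head, staleBlock []string) []string {
	out := make([]string, 0, len(head)+len(staleBlock))
	i, j := 0, 0
	for i < len(head) && j < len(staleBlock) {
		switch {
		case head[i] < staleBlock[j]:
			out = append(out, head[i])
			i++
		case head[i] > staleBlock[j]:
			out = append(out, staleBlock[j])
			j++
		default: // same series in both sources
			out = append(out, head[i])
			i++
			j++
		}
	}
	// Append whatever remains in either input.
	out = append(out, head[i:]...)
	out = append(out, staleBlock[j:]...)
	return out
}

func main() {
	head := []string{`up{job="a"}`, `up{job="c"}`}
	stale := []string{`up{job="b"}`, `up{job="c"}`, `up{job="d"}`}
	for _, s := range mergeSeries(head, stale) {
		fmt.Println(s)
	}
}
```

Before this change a recent instant query could be answered from the head alone, so every comparison in this loop is work that did not exist before.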

@codesome
Member Author

/prombench cancel

@prombot
Contributor

prombot commented Jul 29, 2025

Benchmark cancel is in progress.

@codesome
Member Author

Prombench decided to keep running even after stopping. So we have more numbers to look at.

Memory savings are great, but query/CPU is bonkers. I am guessing the queries in prombench are trying to touch a whole lot of series continuously.

[4 screenshots, taken 2025-07-29]

@codesome
Member Author

Took a quick look at the profiles and confirmed that instant queries are taking the extra CPU. In the picture below, the red box is the additional CPU that stale series compaction introduces, since it now has to look at the block on disk for all instant queries. There is no way around it.

[Screenshot of the CPU profile with the additional cost boxed in red, taken 2025-07-29]

Here are the profiles that I downloaded from prombench:
main branch profile.pb.gz
stale series profile.pb.gz

@SuperQ
Member

SuperQ commented Aug 5, 2025

The memory results look really good. For sure something we will want behind a feature flag for now. If we can improve on the CPU overhead, this may be something to enable by default in the future.

@bwplotka
Member

bwplotka commented Aug 6, 2025

I'm surprised instant queries (or any queries) against TSDB blocks are so CPU intensive on Prometheus. OR... prombench results are not realistic -- like it spams queries way more often than users realistically would. One important case is ofc alerting/recording rules - if they hit a TSDB block, that block should be partially cached then (see below)

It would be useful to understand the CPU overhead of a single common query for in-mem and TSDB block... also we could cache the index a bit, at least for stale near-real-time blocks, to mitigate some CPU with a bit more memory (hopefully this won't diminish the memory results - this cache should only be short-lived, for similar queries or heavy instant query load 🤔). Maybe we learn about some need for optimizing the TSDB read path with this work (:

I still think this would be an interesting mode e.g. for us (Google), where we keep local query capability for debugging in some cases but we use cloud as a first order.

Thanks for extensive research!

@codesome codesome force-pushed the codesome/stale-series-compaction branch 3 times, most recently from 2cc719a to 56f761d Compare August 7, 2025 03:04
@bboreham
Member

> prombench results are not realistic -- like it spams queries way more often than users realistically would

It queries frequently, which might be taken to simulate a large user population or a lot of recording rules, but perhaps more importantly it never queries more than 1 hour back. That is what PRs like prometheus/test-infra#782 are seeking to change.

So if you make every query hit every block, that will make quite a difference.
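
The point about every query hitting a block can be illustrated with a small time-range overlap check. This is a sketch with made-up numbers: the `block` type, the `overlapping` function, the block names, and the time ranges are all hypothetical, and both range bounds are treated as inclusive for simplicity.

```go
package main

import "fmt"

// block is a simplified view of a block's time range, in minutes since
// an arbitrary origin. Both bounds are inclusive in this sketch.
type block struct {
	name       string
	mint, maxt int64
}

// overlapping returns the names of the blocks whose time range
// intersects the query range [qmin, qmax].
func overlapping(blocks []block, qmin, qmax int64) []string {
	var names []string
	for _, b := range blocks {
		if b.mint <= qmax && qmin <= b.maxt {
			names = append(names, b.name)
		}
	}
	return names
}

func main() {
	persisted := []block{
		{name: "block-A", mint: 0, maxt: 240},
		{name: "block-B", mint: 240, maxt: 480},
	}
	head := block{name: "head", mint: 480, maxt: 600}
	// A block produced by an early stale series compaction covers
	// recent data, so its range sits inside the head's range.
	stale := block{name: "stale-series-block", mint: 480, maxt: 598}

	// A query that only looks at the last 5 minutes.
	qmin, qmax := int64(595), int64(600)

	without := append(append([]block{}, persisted...), head)
	with := append(append([]block{}, persisted...), head, stale)

	fmt.Println("without stale block:", overlapping(without, qmin, qmax))
	fmt.Println("with stale block:   ", overlapping(with, qmin, qmax))
}
```

With only regular blocks, a short-range query touches the head alone. Once a stale series block covers recent time, the same query also has to open that block.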

@codesome codesome force-pushed the codesome/stale-series-compaction branch 3 times, most recently from 2bc6610 to 5b1e6fe Compare August 26, 2025 01:40
@bboreham
Member

Hello from the bug-scrub!

@jesusvazquez I see you were assigned - do you think you will get a chance to look at it?

@codesome codesome force-pushed the codesome/stale-series-compaction branch from 393bd08 to b209783 Compare November 26, 2025 02:56
@jesusvazquez
Member

I'll have a look at this next week, starting my PTO today for a few days 🙏

@codesome codesome force-pushed the codesome/stale-series-compaction branch from 5265f19 to 3f51be0 Compare January 9, 2026 01:20
@codesome
Member Author

codesome commented Jan 9, 2026

@jesusvazquez I have synced this PR with the main branch and fixed the lint errors; it is ready for review.

Member

@jesusvazquez jesusvazquez left a comment

Left a few comments, overall in good shape.

@codesome codesome force-pushed the codesome/stale-series-compaction branch from 519a2d2 to 72590c4 Compare January 24, 2026 01:16
@codesome codesome force-pushed the codesome/stale-series-compaction branch from 72590c4 to 3e4a094 Compare January 24, 2026 02:18
Member

@jesusvazquez jesusvazquez left a comment

LGTM!

Member

@SuperQ SuperQ left a comment

Nice. Based on our production testing, I think we found that ~50% was a good threshold.

Should we document any recommendations or wait for more user feedback?

@codesome
Member Author

> Should we document any recommendations or wait for more user feedback?

We should wait for some user feedback. IMO it's more to do with the pattern in which stale series ratio goes up and down and the memory headroom, and less about the actual value of the ratio. As part of my upcoming talk, I plan to do some more testing for config options.

@codesome codesome merged commit 9eb7873 into main Jan 24, 2026
87 of 90 checks passed
@codesome codesome deleted the codesome/stale-series-compaction branch January 24, 2026 23:18
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Mar 12, 2026
##### [\`v3.10.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.10.0)

Prometheus now offers a distroless Docker image variant alongside the default
busybox image. The distroless variant provides enhanced security with a minimal
base image, uses UID/GID 65532 (nonroot) instead of nobody, and removes the
VOLUME declaration. Both variants are available with `-busybox` and `-distroless`
tag suffixes (e.g., `prom/prometheus:latest-busybox`, `prom/prometheus:latest-distroless`).
The busybox image remains the default with no suffix for backwards compatibility
(e.g., `prom/prometheus:latest` points to the busybox variant).

For users migrating existing **named** volumes from the busybox image to the distroless variant, the ownership can be adjusted with:

```
docker run --rm -v prometheus-data:/prometheus alpine chown -R 65532:65532 /prometheus
```

Then, the container can be started with the old volume with:

```
docker run -v prometheus-data:/prometheus prom/prometheus:latest-distroless
```

Users migrating from bind mounts might need to adjust permissions too, depending on their setup.

- \[CHANGE] Alerting: Add `alertmanager` dimension to following metrics: `prometheus_notifications_dropped_total`, `prometheus_notifications_queue_capacity`, `prometheus_notifications_queue_length`. [#16355](prometheus/prometheus#16355)
- \[CHANGE] UI: Hide expanded alert annotations by default, enabling more information density on the `/alerts` page. [#17611](prometheus/prometheus#17611)
- \[FEATURE] AWS SD: Add MSK Role. [#17600](prometheus/prometheus#17600)
- \[FEATURE] PromQL: Add `fill()` / `fill_left()` / `fill_right()` binop modifiers for specifying default values for missing series. [#17644](prometheus/prometheus#17644)
- \[FEATURE] Web: Add OpenAPI 3.2 specification for the HTTP API at `/api/v1/openapi.yaml`. [#17825](prometheus/prometheus#17825)
- \[FEATURE] Dockerfile: Add distroless image variant using UID/GID 65532 and no VOLUME declaration. Busybox image remains default. [#17876](prometheus/prometheus#17876)
- \[FEATURE] Web: Add on-demand wall time profiling under `<URL>/debug/pprof/fgprof`. [#18027](prometheus/prometheus#18027)
- \[ENHANCEMENT] PromQL: Add more detail to histogram quantile monotonicity info annotations. [#15578](prometheus/prometheus#15578)
- \[ENHANCEMENT] Alerting: Independent alertmanager sendloops. [#16355](prometheus/prometheus#16355)
- \[ENHANCEMENT] TSDB: Experimental support for early compaction of stale series in the memory with configurable threshold `stale_series_compaction_threshold` in the config file. [#16929](prometheus/prometheus#16929)
- \[ENHANCEMENT] Service Discovery: Service discoveries are now removable from the Prometheus binary through the Go build tag `remove_all_sd` and individual service discoveries can be re-added with the build tags `enable_<sd name>_sd`. Users can build a custom Prometheus with only the necessary SDs for a smaller binary size. [#17736](prometheus/prometheus#17736)
- \[ENHANCEMENT] Promtool: Support promql syntax features `promql-duration-expr` and `promql-extended-range-selectors`. [#17926](prometheus/prometheus#17926)
- \[PERF] PromQL: Avoid unnecessary label extraction in PromQL functions. [#17676](prometheus/prometheus#17676)
- \[PERF] PromQL: Improve performance of regex matchers like `.*-.*-.*`. [#17707](prometheus/prometheus#17707)
- \[PERF] OTLP: Add label caching for OTLP-to-Prometheus conversion to reduce allocations and improve latency. [#17860](prometheus/prometheus#17860)
- \[PERF] API: Compute `/api/v1/targets/relabel_steps` in a single pass instead of re-running relabeling for each prefix. [#17969](prometheus/prometheus#17969)
- \[PERF] tsdb: Optimize LabelValues intersection performance for matchers. [#18069](prometheus/prometheus#18069)
- \[BUGFIX] PromQL: Prevent query strings containing only UTF-8 continuation bytes from crashing Prometheus. [#17735](prometheus/prometheus#17735)
- \[BUGFIX] Web: Fix missing `X-Prometheus-Stopping` header for `/-/ready` endpoint in `NotReady` state. [#17795](prometheus/prometheus#17795)
- \[BUGFIX] PromQL: Fix PromQL `info()` function returning empty results when filtering by a label that exists on both the input metric and `target_info`. [#17817](prometheus/prometheus#17817)
- \[BUGFIX] TSDB: Fix a bug during exemplar buffer grow/shrink that could cause exemplars to be incorrectly discarded. [#17863](prometheus/prometheus#17863)
- \[BUGFIX] UI: Fix broken graph display after page reload, due to broken Y axis min encoding/decoding. [#17869](prometheus/prometheus#17869)
- \[BUGFIX] TSDB: Fix memory leaks in buffer pools by clearing reference fields (Labels, Histogram pointers, metadata strings) before returning buffers to pools. [#17879](prometheus/prometheus#17879)
- \[BUGFIX] PromQL: info function: fix series without identifying labels not being returned. [#17898](prometheus/prometheus#17898)
- \[BUGFIX] OTLP: Filter `__name__` from OTLP attributes to prevent duplicate labels. [#17917](prometheus/prometheus#17917)
- \[BUGFIX] TSDB: Fix division by zero when computing stale series ratio with empty head. [#17952](prometheus/prometheus#17952)
- \[BUGFIX] OTLP: Fix potential silent data loss for sum metrics. [#17954](prometheus/prometheus#17954)
- \[BUGFIX] PromQL: Fix smoothed interpolation across counter resets. [#17988](prometheus/prometheus#17988)
- \[BUGFIX] PromQL: Fix panic with `@` modifier on empty ranges. [#18020](prometheus/prometheus#18020)
- \[BUGFIX] PromQL: Fix `avg_over_time` for a single native histogram. [#18058](prometheus/prometheus#18058)
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Mar 13, 2026
Development

Successfully merging this pull request may close these issues.

Eager compaction of stale series

7 participants