tsdb: Early compaction of stale series#16929
Conversation
7f92b48 to
f6d7ac4
Compare
|
/prombench main |
|
⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️ Compared versions: After the successful deployment (check status here), the benchmarking results can be viewed at: Available Commands:
|
|
/prombench cancel |
|
Benchmark cancel is in progress. |
|
Looks like stale series tracking is not working. Stale samples are not being put for the series I guess. |
4486bdb to
e693e22
Compare
|
/prombench main |
|
⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️ Compared versions: After the successful deployment (check status here), the benchmarking results can be viewed at: Available Commands:
|
|
/prombench stop |
|
Incorrect Available Commands:
Advanced Flags for
Examples:
|
|
/prombench main |
|
⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️ Compared versions: After the successful deployment (check status here), the benchmarking results can be viewed at: Available Commands:
|
|
/prombench cancel |
|
Benchmark cancel is in progress. |
|
/prombench cancel |
|
Benchmark cancel is in progress. |
|
Took a quick look at the profiles and confirmed that instant queries is taking the extra CPU. In the below pic, the red box is an additional CPU that stale series compaction introduces, since now it has to look at the block on disk for all instant queries. There is no way around it.
Here are the profiles that I downloaded from prombench: |
|
The memory results look really good. For sure something we will want behind a feature flag for now. If we can improve on the CPU overhead, this may be something to enable by default in the future. |
|
I'm surprised instant queries (or any queries) against TSDB blocks are so CPU intensive on Prometheus. OR... prombench results are not realistic -- like it spams queries way to often then realistically users would do. One important case is ofc alerting/recording rules - if they hit TSDB block, that block should be partially cached then (see below) It would be useful to understand a single common query CPU overhead for in-mem and TSDB block... also we could cache index a bit at least for stale near-real time blocks to mitigate some CPU with a bit more memory (hopefully this will diminishes the memory results - this cache should only be short-living for similar queries or heavy instant query load 🤔). Maybe we learn about some need for optimizing the TSDB read path with this work (: I still think this would be an interesting mode e.g. for us (Google), where we keep local query capability for debugging in some cases but we use cloud as a first order. Thanks for extensive research! |
2cc719a to
56f761d
Compare
It queries frequently, which might be taken to simulate a large user population or a lot of recording rules, but perhaps more importantly it never queries more than 1 hour back. That is what PRs like prometheus/test-infra#782 are seeking to change. So if you make every query hit every block, that will make quite a difference. |
2bc6610 to
5b1e6fe
Compare
|
Hello from the bug-scrub! @jesusvazquez I see you were assigned - do you think you will get a chance to look at it? |
393bd08 to
b209783
Compare
|
I'll have a look at this next week, starting my PTO today for a few days 🙏 |
5265f19 to
3f51be0
Compare
|
@jesusvazquez I have synced this PR with main branch and fixed the lint and is ready for review |
jesusvazquez
left a comment
There was a problem hiding this comment.
Left a few comments, overall in good shape.
519a2d2 to
72590c4
Compare
Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
72590c4 to
3e4a094
Compare
SuperQ
left a comment
There was a problem hiding this comment.
Nice. Based on our production testing, I the we found that ~50% was a good threshold.
Should we document any recommendations or wait for more user feedback?
We should wait for some user feedback. IMO it's more to do with the pattern in which stale series ratio goes up and down and the memory headroom, and less about the actual value of the ratio. As part of my upcoming talk, I plan to do some more testing for config options. |
##### [\`v3.10.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.10.0) Prometheus now offers a distroless Docker image variant alongside the default busybox image. The distroless variant provides enhanced security with a minimal base image, uses UID/GID 65532 (nonroot) instead of nobody, and removes the VOLUME declaration. Both variants are available with `-busybox` and `-distroless` tag suffixes (e.g., `prom/prometheus:latest-busybox`, `prom/prometheus:latest-distroless`). The busybox image remains the default with no suffix for backwards compatibility (e.g., `prom/prometheus:latest` points to the busybox variant). For users migrating existing **named** volumes from the busybox image to the distroless variant, the ownership can be adjusted with: ``` docker run --rm -v prometheus-data:/prometheus alpine chown -R 65532:65532 /prometheus ``` Then, the container can be started with the old volume with: ``` docker run -v prometheus-data:/prometheus prom/prometheus:latest-distroless ``` User migrating from bind mounts might need to ajust permissions too, depending on their setup. - \[CHANGE] Alerting: Add `alertmanager` dimension to following metrics: `prometheus_notifications_dropped_total`, `prometheus_notifications_queue_capacity`, `prometheus_notifications_queue_length`. [#16355](prometheus/prometheus#16355) - \[CHANGE] UI: Hide expanded alert annotations by default, enabling more information density on the `/alerts` page. [#17611](prometheus/prometheus#17611) - \[FEATURE] AWS SD: Add MSK Role. [#17600](prometheus/prometheus#17600) - \[FEATURE] PromQL: Add `fill()` / `fill_left()` / `fill_right()` binop modifiers for specifying default values for missing series. [#17644](prometheus/prometheus#17644) - \[FEATURE] Web: Add OpenAPI 3.2 specification for the HTTP API at `/api/v1/openapi.yaml`. [#17825](prometheus/prometheus#17825) - \[FEATURE] Dockerfile: Add distroless image variant using UID/GID 65532 and no VOLUME declaration. Busybox image remains default. [#17876](prometheus/prometheus#17876) - \[FEATURE] Web: Add on-demand wall time profiling under `<URL>/debug/pprof/fgprof`. [#18027](prometheus/prometheus#18027) - \[ENHANCEMENT] PromQL: Add more detail to histogram quantile monotonicity info annotations. [#15578](prometheus/prometheus#15578) - \[ENHANCEMENT] Alerting: Independent alertmanager sendloops. [#16355](prometheus/prometheus#16355) - \[ENHANCEMENT] TSDB: Experimental support for early compaction of stale series in the memory with configurable threshold `stale_series_compaction_threshold` in the config file. [#16929](prometheus/prometheus#16929) - \[ENHANCEMENT] Service Discovery: Service discoveries are now removable from the Prometheus binary through the Go build tag `remove_all_sd` and individual service discoveries can be re-added with the build tags `enable_<sd name>_sd`. Users can build a custom Prometheus with only the necessary SDs for a smaller binary size. [#17736](prometheus/prometheus#17736) - \[ENHANCEMENT] Promtool: Support promql syntax features `promql-duration-expr` and `promql-extended-range-selectors`. [#17926](prometheus/prometheus#17926) - \[PERF] PromQL: Avoid unnecessary label extraction in PromQL functions. [#17676](prometheus/prometheus#17676) - \[PERF] PromQL: Improve performance of regex matchers like `.*-.*-.*`. [#17707](prometheus/prometheus#17707) - \[PERF] OTLP: Add label caching for OTLP-to-Prometheus conversion to reduce allocations and improve latency. [#17860](prometheus/prometheus#17860) - \[PERF] API: Compute `/api/v1/targets/relabel_steps` in a single pass instead of re-running relabeling for each prefix. [#17969](prometheus/prometheus#17969) - \[PERF] tsdb: Optimize LabelValues intersection performance for matchers. [#18069](prometheus/prometheus#18069) - \[BUGFIX] PromQL: Prevent query strings containing only UTF-8 continuation bytes from crashing Prometheus. [#17735](prometheus/prometheus#17735) - \[BUGFIX] Web: Fix missing `X-Prometheus-Stopping` header for `/-/ready` endpoint in `NotReady` state. [#17795](prometheus/prometheus#17795) - \[BUGFIX] PromQL: Fix PromQL `info()` function returning empty results when filtering by a label that exists on both the input metric and `target_info`. [#17817](prometheus/prometheus#17817) - \[BUGFIX] TSDB: Fix a bug during exemplar buffer grow/shrink that could cause exemplars to be incorrectly discarded. [#17863](prometheus/prometheus#17863) - \[BUGFIX] UI: Fix broken graph display after page reload, due to broken Y axis min encoding/decoding. [#17869](prometheus/prometheus#17869) - \[BUGFIX] TSDB: Fix memory leaks in buffer pools by clearing reference fields (Labels, Histogram pointers, metadata strings) before returning buffers to pools. [#17879](prometheus/prometheus#17879) - \[BUGFIX] PromQL: info function: fix series without identifying labels not being returned. [#17898](prometheus/prometheus#17898) - \[BUGFIX] OTLP: Filter `__name__` from OTLP attributes to prevent duplicate labels. [#17917](prometheus/prometheus#17917) - \[BUGFIX] TSDB: Fix division by zero when computing stale series ratio with empty head. [#17952](prometheus/prometheus#17952) - \[BUGFIX] OTLP: Fix potential silent data loss for sum metrics. [#17954](prometheus/prometheus#17954) - \[BUGFIX] PromQL: Fix smoothed interpolation across counter resets. [#17988](prometheus/prometheus#17988) - \[BUGFIX] PromQL: Fix panic with `@` modifier on empty ranges. [#18020](prometheus/prometheus#18020) - \[BUGFIX] PromQL: Fix `avg_over_time` for a single native histogram. [#18058](prometheus/prometheus#18058)
##### [\`v3.10.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.10.0) Prometheus now offers a distroless Docker image variant alongside the default busybox image. The distroless variant provides enhanced security with a minimal base image, uses UID/GID 65532 (nonroot) instead of nobody, and removes the VOLUME declaration. Both variants are available with `-busybox` and `-distroless` tag suffixes (e.g., `prom/prometheus:latest-busybox`, `prom/prometheus:latest-distroless`). The busybox image remains the default with no suffix for backwards compatibility (e.g., `prom/prometheus:latest` points to the busybox variant). For users migrating existing **named** volumes from the busybox image to the distroless variant, the ownership can be adjusted with: ``` docker run --rm -v prometheus-data:/prometheus alpine chown -R 65532:65532 /prometheus ``` Then, the container can be started with the old volume with: ``` docker run -v prometheus-data:/prometheus prom/prometheus:latest-distroless ``` User migrating from bind mounts might need to ajust permissions too, depending on their setup. - \[CHANGE] Alerting: Add `alertmanager` dimension to following metrics: `prometheus_notifications_dropped_total`, `prometheus_notifications_queue_capacity`, `prometheus_notifications_queue_length`. [#16355](prometheus/prometheus#16355) - \[CHANGE] UI: Hide expanded alert annotations by default, enabling more information density on the `/alerts` page. [#17611](prometheus/prometheus#17611) - \[FEATURE] AWS SD: Add MSK Role. [#17600](prometheus/prometheus#17600) - \[FEATURE] PromQL: Add `fill()` / `fill_left()` / `fill_right()` binop modifiers for specifying default values for missing series. [#17644](prometheus/prometheus#17644) - \[FEATURE] Web: Add OpenAPI 3.2 specification for the HTTP API at `/api/v1/openapi.yaml`. [#17825](prometheus/prometheus#17825) - \[FEATURE] Dockerfile: Add distroless image variant using UID/GID 65532 and no VOLUME declaration. Busybox image remains default. [#17876](prometheus/prometheus#17876) - \[FEATURE] Web: Add on-demand wall time profiling under `<URL>/debug/pprof/fgprof`. [#18027](prometheus/prometheus#18027) - \[ENHANCEMENT] PromQL: Add more detail to histogram quantile monotonicity info annotations. [#15578](prometheus/prometheus#15578) - \[ENHANCEMENT] Alerting: Independent alertmanager sendloops. [#16355](prometheus/prometheus#16355) - \[ENHANCEMENT] TSDB: Experimental support for early compaction of stale series in the memory with configurable threshold `stale_series_compaction_threshold` in the config file. [#16929](prometheus/prometheus#16929) - \[ENHANCEMENT] Service Discovery: Service discoveries are now removable from the Prometheus binary through the Go build tag `remove_all_sd` and individual service discoveries can be re-added with the build tags `enable_<sd name>_sd`. Users can build a custom Prometheus with only the necessary SDs for a smaller binary size. [#17736](prometheus/prometheus#17736) - \[ENHANCEMENT] Promtool: Support promql syntax features `promql-duration-expr` and `promql-extended-range-selectors`. [#17926](prometheus/prometheus#17926) - \[PERF] PromQL: Avoid unnecessary label extraction in PromQL functions. [#17676](prometheus/prometheus#17676) - \[PERF] PromQL: Improve performance of regex matchers like `.*-.*-.*`. [#17707](prometheus/prometheus#17707) - \[PERF] OTLP: Add label caching for OTLP-to-Prometheus conversion to reduce allocations and improve latency. [#17860](prometheus/prometheus#17860) - \[PERF] API: Compute `/api/v1/targets/relabel_steps` in a single pass instead of re-running relabeling for each prefix. [#17969](prometheus/prometheus#17969) - \[PERF] tsdb: Optimize LabelValues intersection performance for matchers. [#18069](prometheus/prometheus#18069) - \[BUGFIX] PromQL: Prevent query strings containing only UTF-8 continuation bytes from crashing Prometheus. [#17735](prometheus/prometheus#17735) - \[BUGFIX] Web: Fix missing `X-Prometheus-Stopping` header for `/-/ready` endpoint in `NotReady` state. [#17795](prometheus/prometheus#17795) - \[BUGFIX] PromQL: Fix PromQL `info()` function returning empty results when filtering by a label that exists on both the input metric and `target_info`. [#17817](prometheus/prometheus#17817) - \[BUGFIX] TSDB: Fix a bug during exemplar buffer grow/shrink that could cause exemplars to be incorrectly discarded. [#17863](prometheus/prometheus#17863) - \[BUGFIX] UI: Fix broken graph display after page reload, due to broken Y axis min encoding/decoding. [#17869](prometheus/prometheus#17869) - \[BUGFIX] TSDB: Fix memory leaks in buffer pools by clearing reference fields (Labels, Histogram pointers, metadata strings) before returning buffers to pools. [#17879](prometheus/prometheus#17879) - \[BUGFIX] PromQL: info function: fix series without identifying labels not being returned. [#17898](prometheus/prometheus#17898) - \[BUGFIX] OTLP: Filter `__name__` from OTLP attributes to prevent duplicate labels. [#17917](prometheus/prometheus#17917) - \[BUGFIX] TSDB: Fix division by zero when computing stale series ratio with empty head. [#17952](prometheus/prometheus#17952) - \[BUGFIX] OTLP: Fix potential silent data loss for sum metrics. [#17954](prometheus/prometheus#17954) - \[BUGFIX] PromQL: Fix smoothed interpolation across counter resets. [#17988](prometheus/prometheus#17988) - \[BUGFIX] PromQL: Fix panic with `@` modifier on empty ranges. [#18020](prometheus/prometheus#18020) - \[BUGFIX] PromQL: Fix `avg_over_time` for a single native histogram. [#18058](prometheus/prometheus#18058)










Closes #13616
Based on prometheus/proposals#55
Stale series tracking was added in #16925. This PR compacts the stale series into its own block before the normal compaction hits. Here is how the config works:
stale_series_compaction_threshold: As soon as the ratio of stale series in the head block crossesStaleSeriesImmediateCompactionThreshold, TSDB performs a stale series compaction and puts all the stale series into a block and removed it from the head, but it does not remove it from the WAL. (technically this condition is checked every minute and not exactly immediate)Additional details