
[release-3.7] fix: Remote-write: revert changes in the queue resharding logic #17412

Merged
bwplotka merged 4 commits into prometheus:release-3.7 from machine424:parrem
Oct 29, 2025

Conversation

@machine424 (Member) commented Oct 28, 2025

Which issue(s) does the PR fix:

Fixes #17384

Does this PR introduce a user-facing change?

[BUGFIX] Remote-write: fix a deadlock in the queue resharding logic that can lead to suboptimal queue behavior.

I think we can keep the metrics deprecations, as they still stand (we don't guarantee any stability on metrics anyway...).
I'll revisit this later, as I still think we can move the logic closer to the queues and, instead of "all - dropped", count what "needed to be enqueued/appended".

I just reverted to the 3.6 logic, so I didn't try to fix and run the test in https://github.com/prometheus/prometheus/pull/17407/files#diff-e55ce92fccc71019472af9eab3dd7c85fce2d2c6d520052bcc23b29ce7b962fbR2654; I'll revisit that later as well.
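For illustration only, here is a minimal sketch of the failure mode behind a reshard deadlock of this kind; this is not Prometheus's actual remote-write code, and the names (`enqueueBlocking`, `enqueueNonBlocking`) are invented for the example. While a reshard is in progress the shard goroutines stop draining their queues, so a plain blocking send into a full queue can never return:

```go
package main

import "fmt"

// enqueueBlocking sketches the problematic pattern: if nothing is
// draining the queue (e.g. shard consumers are stopped for a reshard),
// this send blocks forever and the pipeline deadlocks.
func enqueueBlocking(queue chan int, v int) {
	queue <- v
}

// enqueueNonBlocking sketches the safe pattern: try the send, and if the
// queue is full report failure so the caller can back off and retry once
// resharding has completed.
func enqueueNonBlocking(queue chan int, v int) bool {
	select {
	case queue <- v:
		return true
	default:
		return false // queue full; no deadlock, caller retries later
	}
}

func main() {
	q := make(chan int, 1)
	fmt.Println(enqueueNonBlocking(q, 1)) // buffer has room
	fmt.Println(enqueueNonBlocking(q, 2)) // buffer full: fails instead of blocking
}
```

The pre-3.7 logic being restored here roughly corresponds to the non-blocking shape: a reshard that pauses the shard consumers cannot wedge the enqueue side indefinitely.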

@machine424 (Member, Author)

I'll need to fix that DCO

…,histograms}_in_total and prometheus_remote_storage_highest_timestamp_in_seconds"

This reverts commit ba14bc4.

Signed-off-by: machine424 <[email protected]>
…anager level"

This reverts commit 184c7eb.

Signed-off-by: machine424 <[email protected]>
@bwplotka (Member)

I'm really sorry to be a pain, but how do we know we reverted properly, OR that we reverted the correct thing (OR that somebody will not re-revert things back), if we have no regression test? Is it hard to write one? Can we write a skipped one for now if the risk of flakiness is high?

@machine424 (Member, Author)

I thought the existing tests + manual testing + reviews would be enough to validate the revert.
If the fix isn't urgent and manual validation isn't enough, I can adjust and include the test in the coming days.

As mentioned, I'll try to pick this later on main.

@machine424 (Member, Author)

I think I can put together a higher level test.

…uce the sharding scale up deadlock

Signed-off-by: machine424 <[email protected]>
@machine424 (Member, Author)

Test added and run locally.

On main (c36e966), the test fails:

=== RUN   TestRemoteWrite_ReshardingWithoutDeadlock
=== PAUSE TestRemoteWrite_ReshardingWithoutDeadlock
=== CONT  TestRemoteWrite_ReshardingWithoutDeadlock
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.835+01:00 level=INFO source=main.go:1557 msg="updated GOGC" old=100 new=75
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.835+01:00 level=INFO source=main.go:688 msg="Leaving GOMAXPROCS=12: CPU quota undefined" component=automaxprocs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.836+01:00 level=INFO source=memlimit.go:198 msg="GOMEMLIMIT is updated" component=automemlimit package=github.com/KimMachineGun/automemlimit/memlimit GOMEMLIMIT=30923764531 previous=9223372036854775807
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.836+01:00 level=INFO source=main.go:730 msg="No time or size retention was set so using the default time retention" duration=15d
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.836+01:00 level=INFO source=main.go:781 msg="Starting Prometheus Server" mode=server version="(version=, branch=, revision=unknown)"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.836+01:00 level=INFO source=main.go:786 msg="operational information" build_context="(go=go1.24.0, platform=darwin/arm64, user=, date=, tags=unknown)" host_details=(darwin) fd_limits="(soft=122880, hard=unlimited)" vm_limits="(soft=unlimited, hard=unlimited)"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.841+01:00 level=INFO source=web.go:661 msg="Start listening for connections" component=web address=0.0.0.0:52514
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.841+01:00 level=INFO source=main.go:1301 msg="Starting TSDB ..."
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.846+01:00 level=INFO source=tls_config.go:346 msg="Listening on" component=web address=[::]:52514
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.846+01:00 level=INFO source=tls_config.go:349 msg="TLS is disabled." component=web http2=false address=[::]:52514
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.848+01:00 level=INFO source=head.go:666 msg="Replaying on-disk memory mappable chunks if any" component=tsdb
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.848+01:00 level=INFO source=head.go:752 msg="On-disk memory mappable chunks replay completed" component=tsdb duration=32.917µs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.848+01:00 level=INFO source=head.go:760 msg="Replaying WAL, this may take a while" component=tsdb
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.848+01:00 level=INFO source=head.go:833 msg="WAL segment loaded" component=tsdb segment=0 maxSegment=0 duration=578µs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.849+01:00 level=INFO source=head.go:870 msg="WAL replay completed" component=tsdb checkpoint_replay_duration=167.459µs wal_replay_duration=615.416µs wbl_replay_duration=84ns chunk_snapshot_load_duration=0s mmap_chunk_replay_duration=32.917µs total_replay_duration=898.375µs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.858+01:00 level=INFO source=main.go:1322 msg="filesystem information" fs_type=1a
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.858+01:00 level=INFO source=main.go:1325 msg="TSDB started"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.858+01:00 level=INFO source=main.go:1510 msg="Loading configuration file" filename=/var/folders/pl/4zxvwhbd0q5bkw9vxnk08swm0000gn/T/TestRemoteWrite_ReshardingWithoutDeadlock3910989078/001/prometheus.yml
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.860+01:00 level=INFO source=watcher.go:240 msg="Starting WAL watcher" component=remote remote_name=4e57a9 url=http://127.0.0.1:52515 queue=4e57a9
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.860+01:00 level=INFO source=metadata_watcher.go:90 msg="Starting scraped metadata watcher" component=remote remote_name=4e57a9 url=http://127.0.0.1:52515
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.860+01:00 level=INFO source=watcher.go:292 msg="Replaying WAL" component=remote remote_name=4e57a9 url=http://127.0.0.1:52515 queue=4e57a9
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.868+01:00 level=INFO source=main.go:1550 msg="Completed loading of configuration file" db_storage=3.291µs remote_storage=2.010584ms web_handler=3.833µs query_engine=6.292µs scrape=7.151792ms scrape_sd=59.542µs notify=2.541µs notify_sd=1.958µs rules=3.125µs tracing=23.833µs filename=/var/folders/pl/4zxvwhbd0q5bkw9vxnk08swm0000gn/T/TestRemoteWrite_ReshardingWithoutDeadlock3910989078/001/prometheus.yml totalDuration=10.058208ms
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.868+01:00 level=INFO source=main.go:1286 msg="Server is ready to receive web requests."
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:39.868+01:00 level=INFO source=manager.go:190 msg="Starting rule manager..." component="rule manager"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:02:44.995+01:00 level=INFO source=watcher.go:538 msg="Done replaying WAL" component=remote remote_name=4e57a9 url=http://127.0.0.1:52515 duration=5.135316292s
    /path_to/prometheus/main_test.go:1009:
                Error Trace:    /path_to/prometheus/main_test.go:1009
                Error:          Condition never satisfied
                Test:           TestRemoteWrite_ReshardingWithoutDeadlock
--- FAIL: TestRemoteWrite_ReshardingWithoutDeadlock (30.99s)
FAIL
FAIL    github.com/prometheus/prometheus/cmd/prometheus 32.810s

On this branch, the test passes:

=== RUN   TestRemoteWrite_ReshardingWithoutDeadlock
=== PAUSE TestRemoteWrite_ReshardingWithoutDeadlock
=== CONT  TestRemoteWrite_ReshardingWithoutDeadlock
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.985+01:00 level=INFO source=main.go:1549 msg="updated GOGC" old=100 new=75
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.986+01:00 level=INFO source=main.go:680 msg="Leaving GOMAXPROCS=12: CPU quota undefined" component=automaxprocs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.986+01:00 level=INFO source=memlimit.go:198 msg="GOMEMLIMIT is updated" component=automemlimit package=github.com/KimMachineGun/automemlimit/memlimit GOMEMLIMIT=30923764531 previous=9223372036854775807
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.986+01:00 level=INFO source=main.go:722 msg="No time or size retention was set so using the default time retention" duration=15d
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.986+01:00 level=INFO source=main.go:773 msg="Starting Prometheus Server" mode=server version="(version=, branch=, revision=unknown)"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.986+01:00 level=INFO source=main.go:778 msg="operational information" build_context="(go=go1.24.0, platform=darwin/arm64, user=, date=, tags=unknown)" host_details=(darwin) fd_limits="(soft=122880, hard=unlimited)" vm_limits="(soft=unlimited, hard=unlimited)"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.991+01:00 level=INFO source=web.go:660 msg="Start listening for connections" component=web address=0.0.0.0:52825
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.992+01:00 level=INFO source=main.go:1293 msg="Starting TSDB ..."
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.996+01:00 level=INFO source=tls_config.go:346 msg="Listening on" component=web address=[::]:52825
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.996+01:00 level=INFO source=tls_config.go:349 msg="TLS is disabled." component=web http2=false address=[::]:52825
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.998+01:00 level=INFO source=head.go:669 msg="Replaying on-disk memory mappable chunks if any" component=tsdb
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.998+01:00 level=INFO source=head.go:755 msg="On-disk memory mappable chunks replay completed" component=tsdb duration=28.875µs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.998+01:00 level=INFO source=head.go:763 msg="Replaying WAL, this may take a while" component=tsdb
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.998+01:00 level=INFO source=head.go:836 msg="WAL segment loaded" component=tsdb segment=0 maxSegment=0 duration=624.666µs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:22.998+01:00 level=INFO source=head.go:873 msg="WAL replay completed" component=tsdb checkpoint_replay_duration=120.833µs wal_replay_duration=663.375µs wbl_replay_duration=125ns chunk_snapshot_load_duration=0s mmap_chunk_replay_duration=28.875µs total_replay_duration=860.417µs
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.008+01:00 level=INFO source=main.go:1314 msg="filesystem information" fs_type=1a
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.008+01:00 level=INFO source=main.go:1317 msg="TSDB started"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.008+01:00 level=INFO source=main.go:1502 msg="Loading configuration file" filename=/var/folders/pl/4zxvwhbd0q5bkw9vxnk08swm0000gn/T/TestRemoteWrite_ReshardingWithoutDeadlock3772923741/001/prometheus.yml
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.010+01:00 level=INFO source=watcher.go:240 msg="Starting WAL watcher" component=remote remote_name=b23fc0 url=http://127.0.0.1:52826 queue=b23fc0
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.010+01:00 level=INFO source=metadata_watcher.go:90 msg="Starting scraped metadata watcher" component=remote remote_name=b23fc0 url=http://127.0.0.1:52826
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.011+01:00 level=INFO source=watcher.go:292 msg="Replaying WAL" component=remote remote_name=b23fc0 url=http://127.0.0.1:52826 queue=b23fc0
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.017+01:00 level=INFO source=main.go:1542 msg="Completed loading of configuration file" db_storage=2.666µs remote_storage=1.9665ms web_handler=12.042µs query_engine=4.834µs scrape=5.49725ms scrape_sd=57.625µs notify=5.958µs notify_sd=2.459µs rules=2.75µs tracing=24.375µs filename=/var/folders/pl/4zxvwhbd0q5bkw9vxnk08swm0000gn/T/TestRemoteWrite_ReshardingWithoutDeadlock3772923741/001/prometheus.yml totalDuration=8.285042ms
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.017+01:00 level=INFO source=main.go:1278 msg="Server is ready to receive web requests."
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:23.017+01:00 level=INFO source=manager.go:190 msg="Starting rule manager..." component="rule manager"
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:28.182+01:00 level=INFO source=watcher.go:538 msg="Done replaying WAL" component=remote remote_name=b23fc0 url=http://127.0.0.1:52826 duration=5.171649125s
    /path_to/prometheus/reload_test.go:194: time=2025-10-28T16:06:33.012+01:00 level=INFO source=queue_manager.go:1096 msg="Remote storage resharding" component=remote remote_name=b23fc0 url=http://127.0.0.1:52826 from=1 to=2
--- PASS: TestRemoteWrite_ReshardingWithoutDeadlock (11.85s)
PASS
ok      github.com/prometheus/prometheus/cmd/prometheus 14.636s

@cstyan (Member) commented Oct 28, 2025

LGTM, but I think we also want to revert the mixin alerts/dashboard changes from #17065 to use the original prometheus_remote_storage_highest_timestamp_in_seconds metric.

@machine424 (Member, Author)

LGTM but I think we also want to revert the mixin alerts/dashboard changes from #17065 as well to use the original prometheus_remote_storage_highest_timestamp_in_seconds metric.

Now that the resharding has no deadlock, the new prometheus_remote_storage_queue_highest_timestamp_seconds will be updated as expected and will reflect what gets into the queues. I think we still need it to take relabeling into account.

@bwplotka (Member)

Thanks, taking a look

}

// TestRemoteWrite_ReshardingWithoutDeadlock ensures that resharding (scaling up) doesn't block when the shards are full.
// See: https://github.com/prometheus/prometheus/issues/17384.
Member

Suggested change
// See: https://github.com/prometheus/prometheus/issues/17384.
// Regression test for https://github.com/prometheus/prometheus/issues/17384.

@bwplotka (Member) left a comment

Cool, I think it makes sense. Keeping the new metrics and new alerts/dashboards makes sense: you don't revert them, you just stop relying on those metrics for the desired-sharding calculation 👍🏽

Can you fix the commit/title message to be clearer? "reverting changes in queue sharding" is a bit vague. We are reverting things partially, so let's be precise; I think what we do here is "fix resharding by reverting the highestTimestamp move to queue_manager" or something like that?

Thanks for the quick help on this! LGTM otherwise!

}

checkInitialDesiredShardsOnce.Do(func() {
s, err := getMetricValue(t, bytes.NewReader(metrics), model.MetricTypeGauge, "prometheus_remote_storage_shards_desired")
Member

This can introduce flakiness (resharding happening faster than this check). I guess it's fine for now; there's no trivial way to improve it quickly. One way would be to start with a default config that is fine with 1 shard, then scale capacity down to 1.

Would it be useful to add a quick // TODO about this idea?

Member Author

If you could share a run showing the flakiness I can think about making the test more reliable on main.

@machine424 (Member, Author)

Thanks for taking a look.
I'm afraid we're being too nitpicky for a revert; I believe the PR is ready to be merged.

The revert is not only about highestTimestamp, hence the generic but accurate PR title. (Details can be found in the description or by checking the changes directly.)

If you think the PR isn’t ready yet, don’t hesitate to build on it.

@bwplotka (Member)

I’m afraid we’re being too nitty for a revert, I believe the PR is ready to be merged.

We're trying to ensure this will not cause more harm than good. If we had simply reverted the whole commit, we wouldn't need to be picky.

I think it's ready, you addressed my comments!

@bwplotka bwplotka merged commit 6806b68 into prometheus:release-3.7 Oct 29, 2025
32 checks passed
@bwplotka (Member)

I squashed when merging; sorry if preserving the history was important.

@machine424 (Member, Author)

If we would simply revert whole commit, we wouldn't need to be picky.

Looking at the commits, one can see that I did perform a complete revert in the first two commits, then brought back the metrics-only changes in a third commit, and added the test in the fourth one.

@bwplotka (Member)

Sure, but the effect is the same: eventually a partial revert was done in this set of commits, no?

As a result, I had to deeply understand the change and challenge some details instead of just trusting that the old code was working (which is proven by 3.6 working fine). What am I missing? (:

You are welcome to push back on my comments (which you did); my review is not perfect, and I understand your urge to "just merge quickly". Apologies if you feel it's too detailed and time-consuming, but in my opinion either you do a proper, high-quality, detailed review (and as a result get a good change) or you don't review at all. Otherwise just more work, bugs, and frustration happen.

Let me know what you think I should improve in my review style, I am keen to learn.

Anyway, you did help amazingly to fix this AND addressed my comments (especially thanks for the test, it's game changing), so I'm happy.

@bwplotka (Member)

I could definitely have noticed the multiple commits and reviewed each one separately; I guess that could be useful (GH does not make it easy though), and I could have avoided squash merging 🙃
