Remote write fails to scale after 3.7 upgrade in some clusters #17384
Description
What did you do?
I upgraded a few Prometheus instances from 3.6.0 to 3.7.1.
These instances are spread across the world, and send data to a central instance via remote write.
What did you expect to see?
I didn't expect to see any issues with remote write. The instances have been running happily for months.
What did you see instead? Under which circumstances?
Some of the instances, especially those with poorer network connections, started to lag behind by more than an hour, and the number of shards did not increase.
As an example, one of these instances has the following remote write config:
remote_write:
  - queue_config:
      capacity: 3000
      max_samples_per_send: 1000
      max_shards: 75
    remote_timeout: 15s
The number of shards stayed between 3 and 4 even with an hour of delay, although there was enough CPU and memory available for it to scale:
CPU: 2 cores requested, <20% used
Memory: 16GB requested, 25% used
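For a rough sense of why 3 to 4 shards may not keep up on a slow link, here is a back-of-the-envelope sketch. It assumes each shard sends one full batch at a time and waits for the response before sending the next. The 0.5 s round trip is a hypothetical figure, not a measurement from these instances, and `max_throughput` is an illustrative helper, not anything from Prometheus itself.

```python
def max_throughput(shards: int, max_samples_per_send: int, round_trip_seconds: float) -> float:
    """Upper bound on samples/s if every shard sends full batches back to back."""
    return shards * max_samples_per_send / round_trip_seconds


# Values taken from the queue_config above.
MAX_SAMPLES_PER_SEND = 1000
MAX_SHARDS = 75

# Hypothetical round-trip time per remote write request on a poor link.
ROUND_TRIP_SECONDS = 0.5

# Throughput ceiling at the shard count actually observed (4) versus max_shards.
observed = max_throughput(4, MAX_SAMPLES_PER_SEND, ROUND_TRIP_SECONDS)
ceiling = max_throughput(MAX_SHARDS, MAX_SAMPLES_PER_SEND, ROUND_TRIP_SECONDS)

print(f"4 shards:  {observed:.0f} samples/s")
print(f"75 shards: {ceiling:.0f} samples/s")
```

Under these assumptions the configured max_shards leaves a lot of headroom that the queue never used, which is why the missing scale-up stands out.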
After rolling it back to 3.6, everything went back to normal.
I tried the upgrade again after a day, and I hit the same issues. Rolling it back solved it again.
System information
Linux 6.12.10-76061203-generic x86_64
Prometheus version
prometheus, version 3.7.1 (branch: HEAD, revision: 0aeb4fddc93b64e4e95104d5e8ea8b55ad36fb61)
build user: root@54bf11233185
build date: 20251017-06:31:55
go version: go1.25.3
platform: linux/amd64
tags: netgo,builtinassets