Remote write fails to scale after 3.7 upgrade in some clusters #17384

@psalaberria002

What did you do?

I upgraded a few Prometheus instances from 3.6.0 to 3.7.1.

These instances are spread across the world, and send data to a central instance via remote write.

What did you expect to see?

I didn't expect to see any issues with remote write. The instances have been running happily for months.

What did you see instead? Under which circumstances?

Some of the instances, especially those with poorer network connections, started to lag behind by more than an hour, and the number of shards did not increase.

As an example, one of these instances has the following remote write config:

    remote_write:
    - queue_config:
        capacity: 3000
        max_samples_per_send: 1000
        max_shards: 75
      remote_timeout: 15s

The number of shards stayed between 3 and 4, even with an hour of delay, although there was enough CPU and memory available for it to scale.

CPU: 2 cores requested, <20% used
Memory: 16GB requested, 25% used
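
To illustrate why a stuck shard count matters on a slow link, here is a rough back-of-the-envelope sketch. It assumes each shard sends one batch at a time, so throughput is capped at roughly shards × max_samples_per_send / request latency. The latency and ingest-rate figures are hypothetical and were not measured on the affected instances; the function names are made up for this example.

```python
import math


def max_throughput(shards: int, max_samples_per_send: int, request_latency_s: float) -> float:
    """Approximate upper bound on samples/s, assuming every shard sends
    full batches back to back, one request at a time."""
    if request_latency_s <= 0:
        raise ValueError("request_latency_s must be positive")
    return shards * max_samples_per_send / request_latency_s


def shards_needed(ingest_rate: float, max_samples_per_send: int, request_latency_s: float) -> int:
    """Smallest shard count whose approximate ceiling covers the ingest rate."""
    if request_latency_s <= 0:
        raise ValueError("request_latency_s must be positive")
    return math.ceil(ingest_rate * request_latency_s / max_samples_per_send)


if __name__ == "__main__":
    # max_samples_per_send comes from the config above; the 0.5 s latency
    # and 20000 samples/s ingest rate are hypothetical values.
    print(max_throughput(shards=4, max_samples_per_send=1000, request_latency_s=0.5))
    print(shards_needed(ingest_rate=20000, max_samples_per_send=1000, request_latency_s=0.5))
```

Under these assumed numbers, 4 shards would top out well below the ingest rate, so the queue would keep falling behind unless the shard count grew toward max_shards.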

After rolling it back to 3.6, everything went back to normal.

I tried the upgrade again after a day, and I hit the same issues. Rolling it back solved it again.

System information

Linux 6.12.10-76061203-generic x86_64

Prometheus version

prometheus, version 3.7.1 (branch: HEAD, revision: 0aeb4fddc93b64e4e95104d5e8ea8b55ad36fb61)
  build user:       root@54bf11233185
  build date:       20251017-06:31:55
  go version:       go1.25.3
  platform:         linux/amd64
  tags:             netgo,builtinassets
