Remote write fails to scale after 3.7 upgrade in some clusters

### What did you do?

I upgraded a few Prometheus instances from 3.6.0 to 3.7.1.

These instances are spread across the world, and send data to a central instance via remote write.

### What did you expect to see?

I didn't expect to see any issues with remote write. The instances have been running happily for months.


### What did you see instead? Under which circumstances?

Some of the instances, especially those with worse network connections, started to lag behind (by more than an hour), and the number of shards did not increase.

As an example, one of these instances has the following remote write config:
```yaml
remote_write:
- queue_config:
    capacity: 3000
    max_samples_per_send: 1000
    max_shards: 75
  remote_timeout: 15s
```

The number of shards stayed between 3 and 4, even with an hour of delay, although there was enough CPU and memory available for it to scale:

CPU: 2 cores requested, <20% used
Memory: 16GB requested, 25% used
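
To put the shard counts in perspective, here is a rough back-of-the-envelope sketch using the `queue_config` values above. It assumes each shard sends one full batch at a time, and the 0.5 s send time is a hypothetical figure for illustration, not something I measured.

```python
# Values copied from the queue_config above.
CAPACITY = 3000              # samples buffered per shard
MAX_SAMPLES_PER_SEND = 1000  # samples per remote write request
MAX_SHARDS = 75


def buffered_samples(shards: int, capacity: int = CAPACITY) -> int:
    """Upper bound on samples held in the shard queues."""
    return shards * capacity


def max_throughput(shards: int, send_seconds: float,
                   batch: int = MAX_SAMPLES_PER_SEND) -> float:
    """Rough ceiling on samples/s, assuming one full batch in flight per shard."""
    return shards * batch / send_seconds


if __name__ == "__main__":
    # 0.5 s per request is a hypothetical round-trip time, not a measurement.
    for shards in (3, 4, MAX_SHARDS):
        print(shards, buffered_samples(shards), max_throughput(shards, 0.5))
```

Under these assumptions, the gap between 3 to 4 shards and the configured maximum of 75 is large enough that scaling up should have been able to clear the backlog.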

After rolling it back to 3.6.0, everything went back to normal.

I tried the upgrade again a day later and hit the same issue. Rolling back fixed it again.

### System information

Linux 6.12.10-76061203-generic x86_64

### Prometheus version

```text
prometheus, version 3.7.1 (branch: HEAD, revision: 0aeb4fddc93b64e4e95104d5e8ea8b55ad36fb61)
  build user:       root@54bf11233185
  build date:       20251017-06:31:55
  go version:       go1.25.3
  platform:         linux/amd64
  tags:             netgo,builtinassets
```



Remote write fails to scale after 3.7 upgrade in some clusters #17384
