WAL watcher unable to catch up when backlogged #13471
Description
What did you do?
I reported this in Slack, and while I don't have a simple reproduction, @cstyan confirmed that this looks like a bug: https://cloud-native.slack.com/archives/C167KFM6C/p1706219700271849
There's a scenario where, if the WAL watcher falls behind after startup, it becomes unable to catch up.
Specifically, the lastSegment variable in Run in watcher.go is never updated after the watcher starts. So if the watcher has ever been caught up to the latest WAL segment during a Prometheus process's lifetime, the tail parameter will be set to true on every subsequent call, even if the watcher has since fallen behind again:
prometheus/tsdb/wlog/watcher.go
Lines 301 to 303 in 998fafe
```go
if err := w.watch(currentSegment, currentSegment >= lastSegment); err != nil && !errors.Is(err, ErrIgnorable) {
	return err
}
```
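The effect of the stale lastSegment can be illustrated with a toy model (the function name here is hypothetical; the real comparison is the `currentSegment >= lastSegment` expression in the Run loop). Because lastSegment is captured once at startup, every segment at or beyond that snapshot is treated as a tail read, even when the writer has since moved far ahead:

```go
package main

import "fmt"

// tailDecisions is a hypothetical model of the watcher's per-segment tail
// decision: lastSegment is the value snapshotted at startup, and it is never
// refreshed against the WAL writer's actual head.
func tailDecisions(lastSegment int, segments []int) []bool {
	out := make([]bool, len(segments))
	for i, cur := range segments {
		out[i] = cur >= lastSegment // never updated after startup
	}
	return out
}

func main() {
	// lastSegment was 5 at startup. Even if the writer is now on segment 12,
	// segments 5 through 8 are all handled as slow "tail" reads.
	fmt.Println(tailDecisions(5, []int{4, 5, 6, 7, 8}))
}
```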
If tail is true, we'll only actually perform a read from the current WAL segment when we get a Notify event for a write, or when the every-15-second read timeout fires:
prometheus/tsdb/wlog/watcher.go
Lines 478 to 487 in 998fafe
```go
case <-readTicker.C:
	level.Debug(w.logger).Log("msg", "Watcher is reading the WAL due to timeout, haven't received any write notifications recently", "timeout", readTimeout)
	err := w.readAndHandleError(reader, segmentNum, tail, size)
	if err != nil {
		return err
	}
	// still want to reset the ticker so we don't read too often
	readTicker.Reset(readTimeout)
case <-w.readNotify:
```
If Notify events are infrequent, e.g. if they occur approximately once per new WAL segment, and the rate of new samples and new segments is roughly constant, then our rate of reading old records is effectively capped at the rate at which new WAL segments are written, so we gradually keep falling behind. We appear to have seen this behavior in production repeatedly: specifically, the symptom where, once the WAL watcher falls behind, it gets stuck well below the throughput it is able to achieve at startup.
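A back-of-envelope model of that capped read rate (the numbers and function are illustrative assumptions, not measurements): if each interval the writer cuts some number of new segments and the watcher performs some number of tail-mode reads, each draining at most one segment's worth of backlog, then the backlog only shrinks when reads outpace new segments.

```go
package main

import "fmt"

// backlogAfter is an illustrative model: starting from `start` segments of
// backlog, each step adds segmentsPerInterval new segments and drains at most
// readsPerInterval of them (one segment per tail-mode read, an assumption).
func backlogAfter(start, steps, readsPerInterval, segmentsPerInterval int) int {
	b := start
	for i := 0; i < steps; i++ {
		b += segmentsPerInterval - readsPerInterval
		if b < 0 {
			b = 0
		}
	}
	return b
}

func main() {
	// Two reads per interval (one Notify-driven, one timeout-driven) against
	// one new segment per interval: the backlog drains, but only slowly.
	fmt.Println(backlogAfter(100, 10, 2, 1)) // 90
	// One read per interval against one new segment: the backlog never shrinks.
	fmt.Println(backlogAfter(100, 10, 1, 1)) // 100
}
```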
This is similar to #13111 but isn't limited to the use of agent mode (although it does require a more specific load pattern).
I suspect a fix might be as simple as setting lastSegment = max(lastSegment, currentSegment) here:
prometheus/tsdb/wlog/watcher.go
Lines 308 to 309 in 998fafe
```go
}
```
... but that might have wider implications of which I'm not aware.
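For illustration, the suggested one-liner can be modeled in isolation (this is a standalone sketch of the idea, not a tested patch against watcher.go; Go 1.21's max builtin is assumed, matching the go1.21.4 build below):

```go
package main

import "fmt"

// advance models the suggested fix: as the watcher moves to the next segment,
// carry lastSegment forward so it can no longer lag arbitrarily far behind
// the segments the watcher has already processed. Names mirror watcher.go,
// but the placement and wider implications are unverified.
func advance(lastSegment, currentSegment int) (newLast, newCurrent int) {
	lastSegment = max(lastSegment, currentSegment) // the proposed change
	return lastSegment, currentSegment + 1
}

func main() {
	// Startup snapshot said lastSegment = 5, but the watcher is now on
	// segment 8; the fix advances lastSegment along with it.
	last, cur := advance(5, 8)
	fmt.Println(last, cur) // 8 9
}
```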
What did you expect to see?
I expected that the WAL watcher used by remote write would be able to catch up from a backlog that occurs between restarts (e.g. due to a blip on the receiver side) as quickly as it is able to catch up from a backlog at startup.
What did you see instead? Under which circumstances?
The WAL watcher throughput becomes a bottleneck on remote write and the backlog grows indefinitely.
System information
Linux 5.15.109+ x86_64
Prometheus version
prometheus, version 2.48.0 (branch: HEAD, revision: 6d80b30990bc297d95b5c844e118c4011fad8054)
build user: root@26117804242c
build date: 20231116-04:35:21
go version: go1.21.4
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
No response