
WAL watcher unable to catch up when backlogged #13471

@djudd

Description

What did you do?

I reported this in Slack, and while I don't have a simple reproduction, @cstyan confirmed that this looks like a bug: https://cloud-native.slack.com/archives/C167KFM6C/p1706219700271849

There's a scenario where, if the WAL watcher falls behind after startup, it becomes unable to catch up.

Specifically, the lastSegment variable in Run in watcher.go is never updated after the watcher starts. So once we have been caught up to the latest WAL segment at any point in a Prometheus process's lifetime, the tail parameter will be set to true, even if we subsequently fall behind again:

if err := w.watch(currentSegment, currentSegment >= lastSegment); err != nil && !errors.Is(err, ErrIgnorable) {
    return err
}

If tail is true, we'll only actually perform a read from the current WAL segment when we get a Notify event for a write, or when the every-15-second read timeout fires:
case <-readTicker.C:
    level.Debug(w.logger).Log("msg", "Watcher is reading the WAL due to timeout, haven't received any write notifications recently", "timeout", readTimeout)
    err := w.readAndHandleError(reader, segmentNum, tail, size)
    if err != nil {
        return err
    }
    // still want to reset the ticker so we don't read too often
    readTicker.Reset(readTimeout)
case <-w.readNotify:

If Notify events are infrequent, e.g. if they occur approximately once per new WAL segment, and the rate of new samples and new segments is roughly constant, then our rate of reading old records is roughly capped at the rate at which new WAL segments are written, so we gradually keep falling behind. We appear to have seen this behavior in production repeatedly; specifically, the symptom where, once the WAL watcher falls behind, it gets stuck well below the throughput it is able to achieve at startup.

This is similar to #13111 but isn't limited to the use of agent mode (although it does require a more specific load pattern).

I suspect a fix might be as simple as setting lastSegment = max(lastSegment, currentSegment) here:


... but that might have wider implications of which I'm not aware.

What did you expect to see?

I expected that the WAL watcher used by remote write would be able to catch up from a backlog that occurs between restarts (e.g. due to a blip on the receiver side) as quickly as it is able to catch up from a backlog at startup.

What did you see instead? Under which circumstances?

The WAL watcher throughput becomes a bottleneck on remote write and the backlog grows indefinitely.

System information

Linux 5.15.109+ x86_64

Prometheus version

prometheus, version 2.48.0 (branch: HEAD, revision: 6d80b30990bc297d95b5c844e118c4011fad8054)
  build user:       root@26117804242c
  build date:       20231116-04:35:21
  go version:       go1.21.4
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response
