Skip to content

WAL watcher replay in agent mode takes significantly more time since 2.45 #13111

@jlevesy

Description

@jlevesy

What did you do?

Hello 👋

I'm not sure this is a use case you're willing to support, so feel free to reclassify // close this 😄. But I'm wondering if this issue does not impact all agents with existing data on disk, so here I go anyway 😁.

Let met give you a bit of context:

  • We're using a pair of Prometheus instances in agent mode in our Kubernetes clusters, and run a leader election between them using a sidecar I wrote called prometheus-elector so that we only have one replica sending samples.
  • Both the replicas have persistent volumes attached, which means that they always retain some (likely stale) data on disk.
  • This sidecar generates a Prometheus configuration based on the leader status of the pod
    • If the pod is follower, it generates a minimal configuration so that Prometheus actually starts and scraps itself
    • If the pod is leader, it generates a full configuration with all our scrape_configs and a remote_write target so that the Prometheus container starts collecting and sending samples to our remote storage.
  • When the pod is promoted leader, prometheus-elector regenerates an actual Prometheus configuration, writes it on disk then tells the prometheus container to reload its configuration with the API.
  • This makes Prometheus create a WAL watcher which replays all the WAL from disk and starts sending samples . For any versions <= 2.44.0, the WAL watcher replay took ~10s. This gave us decent fail-over time when the leading replica went down.
  • But when we tried to upgrade to a version >= 2.45.0, we started to notice that WAL watcher is now taking ~5m to catch up when a replica is elected leader(for a similar amount of segments), which becomes problematic because it makes the agent stop sending samples during the replay then then creates a surge of samples sent after the WAL is fully replayed, causing the remote storage to reject some of the sent samples 😭.

I believe this change of behavior was introduced #11949.

  • In 2.44, the watch method was calling readSegment every readPeriod (10ms)... it was actively polling and eating CPU I suppose :D.
  • Now in 2.45+ this behavior was changed to actually rely on an external actor calling Notify when a new segment needs to be read. And relies on a 15s timeout to call readSegment.
  • But as far as I understand this PR (I'm not 100% sure of this 😅), in agent mode, nobody actually calls Notify anymore, so we end up reading one segment every 15s, causing the WAL to take a much longer time to be replayed in our case. Running Prometheus in debug mode higlighted that by printing this debug line every 15s (see the attached logs extract)
  • EDIT: rolling out a v2.47.2 with this stupid patch restores the previous behavior.

I would be happy to know if you consider this as a bug and if you see any option for fixing this. With a bit of guidance I would be happy to contribute it 😁.

Thanks anyway for your work (and time reading this 😅)!

What did you expect to see?

WAL replay in agent mode happening within a """decent""" time, in the observed cases roughly 10s.

What did you see instead? Under which circumstances?

WAL replay taking ~5minutes, and a surge of samples being sent after the replay for a similar amount of segments to replay.

(under the circumstances explained above)

System information

No response

Prometheus version

>= v2.45.0

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

With 2.44 WAL replay takes ~10s

ts=2023-11-08T13:40:30.295Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Replaying WAL" queue=mimir_remote_write
ts=2023-11-08T13:40:30.295Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Reading checkpoint" dir=data-agent/wal/checkpoint.00016830
....
ts=2023-11-08T13:40:33.635Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Tailing WAL" lastCheckpoint=data-agent/wal/checkpoint.00016830 checkpointIndex=16830 currentSegment=16831 lastSegment=16845 
ts=2023-11-08T13:40:33.636Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16831
ts=2023-11-08T13:40:34.138Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16832
ts=2023-11-08T13:40:34.667Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16833
...
ts=2023-11-08T13:40:41.191Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Done replaying WAL" duration=10.896565194s

With 2.45, WAL replay takes ~5m, please note the ~15s interval between two "Processing segment" log.

ts=2023-11-08T15:05:15.553Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Replaying WAL" queue=mimir_remote_write
ts=2023-11-08T15:05:15.553Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Reading checkpoint" dir=data-agent/wal/checkpoint.00016830
...
ts=2023-11-08T15:05:18.849Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Tailing WAL" lastCheckpoint=data-agent/wal/checkpoint.00016830 checkpointIndex=16830 currentSegment=16831 lastSegment=16852
ts=2023-11-08T15:05:18.849Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16831
ts=2023-11-08T15:05:33.850Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Watcher is reading the WAL due to timeout, haven't received any write notifications recently" timeout=15s
ts=2023-11-08T15:05:34.352Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16832
ts=2023-11-08T15:05:49.838Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16833
ts=2023-11-08T15:06:05.358Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16834
ts=2023-11-08T15:06:20.851Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16835
ts=2023-11-08T15:06:35.852Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Watcher is reading the WAL due to timeout, haven't received any write notifications recently" timeout=15s
ts=2023-11-08T15:06:36.348Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16836
...
ts=2023-11-08T15:10:58.599Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Done replaying WAL" duration=5m43.046537509s

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions