-
Notifications
You must be signed in to change notification settings - Fork 10.3k
WAL watcher replay in agent mode takes significantly more time since 2.45 #13111
Description
What did you do?
Hello 👋
I'm not sure this is a use case you're willing to support, so feel free to reclassify // close this 😄. But I'm wondering if this issue does not impact all agents with existing data on disk, so here I go anyway 😁.
Let met give you a bit of context:
- We're using a pair of Prometheus instances in agent mode in our Kubernetes clusters, and run a leader election between them using a sidecar I wrote called prometheus-elector so that we only have one replica sending samples.
- Both the replicas have persistent volumes attached, which means that they always retain some (likely stale) data on disk.
- This sidecar generates a Prometheus configuration based on the leader status of the pod
- If the pod is follower, it generates a minimal configuration so that Prometheus actually starts and scraps itself
- If the pod is leader, it generates a full configuration with all our
scrape_configsand aremote_writetarget so that the Prometheus container starts collecting and sending samples to our remote storage.
- When the pod is promoted leader,
prometheus-electorregenerates an actual Prometheus configuration, writes it on disk then tells theprometheuscontainer to reload its configuration with the API. - This makes Prometheus create a WAL watcher which replays all the WAL from disk and starts sending samples . For any versions
<= 2.44.0, the WAL watcher replay took~10s. This gave us decent fail-over time when the leading replica went down. - But when we tried to upgrade to a version
>= 2.45.0, we started to notice that WAL watcher is now taking~5mto catch up when a replica is elected leader(for a similar amount of segments), which becomes problematic because it makes the agent stop sending samples during the replay then then creates a surge of samples sent after the WAL is fully replayed, causing the remote storage to reject some of the sent samples 😭.
I believe this change of behavior was introduced #11949.
- In 2.44, the watch method was calling readSegment every readPeriod (10ms)... it was actively polling and eating CPU I suppose :D.
- Now in 2.45+ this behavior was changed to actually rely on an external actor calling Notify when a new segment needs to be read. And relies on a 15s timeout to call
readSegment. - But as far as I understand this PR (I'm not 100% sure of this 😅), in agent mode, nobody actually calls
Notifyanymore, so we end up reading one segment every 15s, causing the WAL to take a much longer time to be replayed in our case. Running Prometheus in debug mode higlighted that by printing this debug line every 15s (see the attached logs extract) - EDIT: rolling out a v2.47.2 with this stupid patch restores the previous behavior.
I would be happy to know if you consider this as a bug and if you see any option for fixing this. With a bit of guidance I would be happy to contribute it 😁.
Thanks anyway for your work (and time reading this 😅)!
What did you expect to see?
WAL replay in agent mode happening within a """decent""" time, in the observed cases roughly 10s.
What did you see instead? Under which circumstances?
WAL replay taking ~5minutes, and a surge of samples being sent after the replay for a similar amount of segments to replay.
(under the circumstances explained above)
System information
No response
Prometheus version
>= v2.45.0
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
With 2.44 WAL replay takes ~10s
ts=2023-11-08T13:40:30.295Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Replaying WAL" queue=mimir_remote_write
ts=2023-11-08T13:40:30.295Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Reading checkpoint" dir=data-agent/wal/checkpoint.00016830
....
ts=2023-11-08T13:40:33.635Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Tailing WAL" lastCheckpoint=data-agent/wal/checkpoint.00016830 checkpointIndex=16830 currentSegment=16831 lastSegment=16845
ts=2023-11-08T13:40:33.636Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16831
ts=2023-11-08T13:40:34.138Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16832
ts=2023-11-08T13:40:34.667Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16833
...
ts=2023-11-08T13:40:41.191Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Done replaying WAL" duration=10.896565194s
With 2.45, WAL replay takes ~5m, please note the ~15s interval between two "Processing segment" log.
ts=2023-11-08T15:05:15.553Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Replaying WAL" queue=mimir_remote_write
ts=2023-11-08T15:05:15.553Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Reading checkpoint" dir=data-agent/wal/checkpoint.00016830
...
ts=2023-11-08T15:05:18.849Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Tailing WAL" lastCheckpoint=data-agent/wal/checkpoint.00016830 checkpointIndex=16830 currentSegment=16831 lastSegment=16852
ts=2023-11-08T15:05:18.849Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16831
ts=2023-11-08T15:05:33.850Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Watcher is reading the WAL due to timeout, haven't received any write notifications recently" timeout=15s
ts=2023-11-08T15:05:34.352Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16832
ts=2023-11-08T15:05:49.838Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16833
ts=2023-11-08T15:06:05.358Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16834
ts=2023-11-08T15:06:20.851Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16835
ts=2023-11-08T15:06:35.852Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Watcher is reading the WAL due to timeout, haven't received any write notifications recently" timeout=15s
ts=2023-11-08T15:06:36.348Z caller=dedupe.go:112 component=remote level=debug remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Processing segment" currentSegment=16836
...
ts=2023-11-08T15:10:58.599Z caller=dedupe.go:112 component=remote level=info remote_name=mimir_remote_write url=http://write.mimir.local/api/v1/push msg="Done replaying WAL" duration=5m43.046537509s