[Remote Write] Improve multi-queue WAL replay #6733
Open
Description
Remote write WAL replay can take 2-3x longer than TSDB's head init. This was observed on an instance with a ~17 GB WAL and 3 remote write queues. I believe this is likely because each queue replays the WAL separately, which could cause disk I/O contention.
Excessive WAL replay times can cause further issues: remote write can fall behind in WAL segments and potentially never catch up because of the max_shards limit.
We should investigate ways to improve the replay time, such as sharing a single replay between queues that start at the same time, or replaying multiple segments in parallel (while ensuring that series records with the same ref_id are not overwritten out of segment order).
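To make the first idea concrete, here is a rough sketch of a shared replay, not the actual Prometheus code. `Record`, `Queue`, and `ReplayShared` are hypothetical names invented for illustration, and segments are modeled as in-memory slices rather than files on disk. Each segment is read once and every record is fanned out to all queues, so the number of segment reads no longer grows with the number of queues.

```go
package main

import "sync"

// Record is a hypothetical stand-in for a decoded WAL series record.
type Record struct {
	Segment int    // index of the segment the record came from
	Ref     uint64 // series reference (ref_id)
	Labels  string // simplified label set
}

// Queue is a hypothetical stand-in for one remote write queue's series cache.
type Queue struct {
	series map[uint64]string
}

func NewQueue() *Queue { return &Queue{series: make(map[uint64]string)} }

// Apply stores the record; a later record with the same ref replaces an earlier one.
func (q *Queue) Apply(r Record) { q.series[r.Ref] = r.Labels }

// ReplayShared walks the segments once, in order, and sends every record to
// every queue. It returns how many segment reads were performed.
func ReplayShared(segments [][]Record, queues []*Queue) int {
	chans := make([]chan Record, len(queues))
	var wg sync.WaitGroup
	for i, q := range queues {
		chans[i] = make(chan Record, 64)
		wg.Add(1)
		// One consumer goroutine per queue; it alone touches q.series.
		go func(q *Queue, in <-chan Record) {
			defer wg.Done()
			for r := range in {
				q.Apply(r)
			}
		}(q, chans[i])
	}

	reads := 0
	for _, seg := range segments {
		reads++ // one read per segment, regardless of queue count
		for _, r := range seg {
			for _, c := range chans {
				c <- r
			}
		}
	}
	for _, c := range chans {
		close(c)
	}
	wg.Wait()
	return reads
}
```

Because a single producer feeds each queue through one channel, every queue sees records in segment order, so the ordering concern for ref_id only arises with the parallel-segment variant. One tradeoff in this sketch is that the slowest queue throttles the shared reader once its channel buffer fills.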