[Remote Write] Improve multi-queue WAL replay #6733

Remote write WAL replay can take 2-3x longer than TSDB's head init. I observed this on an instance with a ~17 GB WAL and 3 remote write queues. I believe this is likely because each queue replays the WAL separately, which could cause disk I/O contention.

Excessive WAL replay times can cause further issues: remote write falls behind in WAL segments and potentially never catches up because of the max_shards limit.

We should investigate ways to improve the replay time, such as sharing a single replay between queues that start at the same time, or replaying multiple segments in parallel (while ensuring that a series seen in an older segment does not overwrite a series with the same ref_id from a newer segment).
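To make the first idea concrete, here is a minimal sketch of a shared replay: the WAL records are read once and each record is fanned out to every queue over a channel. All names here (`seriesRecord`, `queue`, `newQueue`, `replayShared`) are hypothetical and are not the existing remote write or WAL watcher API; the in-memory slice stands in for reading segments from disk.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesRecord is a hypothetical stand-in for a decoded WAL series record.
type seriesRecord struct {
	Segment int    // WAL segment the record was read from
	Ref     uint64 // series reference ID (ref_id)
	Labels  string // simplified label set
}

// queue is a hypothetical stand-in for one remote write queue's series cache.
type queue struct {
	name   string
	series map[uint64]string
}

func newQueue(name string) *queue {
	return &queue{name: name, series: make(map[uint64]string)}
}

// replayShared walks the records once and sends each record to every queue.
// Records must be supplied in segment order, so that for a reused ref_id the
// definition from the newest segment is the one each queue ends up with.
func replayShared(records []seriesRecord, queues []*queue) {
	chans := make([]chan seriesRecord, len(queues))
	var wg sync.WaitGroup
	for i, q := range queues {
		chans[i] = make(chan seriesRecord, 64)
		wg.Add(1)
		// Each queue's map is written only by its own goroutine.
		go func(q *queue, in <-chan seriesRecord) {
			defer wg.Done()
			for r := range in {
				q.series[r.Ref] = r.Labels
			}
		}(q, chans[i])
	}

	// Single pass over the WAL, instead of one pass per queue.
	for _, r := range records {
		for _, c := range chans {
			c <- r
		}
	}
	for _, c := range chans {
		close(c)
	}
	wg.Wait()
}

func main() {
	records := []seriesRecord{
		{Segment: 0, Ref: 1, Labels: `{job="a"}`},
		{Segment: 0, Ref: 2, Labels: `{job="b"}`},
		{Segment: 1, Ref: 1, Labels: `{job="c"}`}, // ref 1 reused in a newer segment
	}
	queues := []*queue{newQueue("q1"), newQueue("q2"), newQueue("q3")}
	replayShared(records, queues)
	for _, q := range queues {
		fmt.Println(q.name, len(q.series), q.series[1])
	}
}
```

In this sketch the disk is read once regardless of the number of queues, and a slow queue applies backpressure to the reader once its channel buffer fills. Whether that coupling between queues is acceptable is a design question this sketch does not settle.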
