[Remote Write] Improve multi-queue WAL replay #6733

@cstyan


Remote write WAL replay can take 2-3x longer than TSDB's head init. This was observed on an instance with a ~17 GB WAL and 3 remote write queues. I believe this is because each queue replays the WAL separately, so the same segments are read from disk once per queue, which could cause disk IO contention.

Excessive WAL replay times can cause further issues: remote write falls behind in WAL segments and may never catch up, because max_shards caps how many shards a queue can use.

We should investigate ways to improve the replay time, such as sharing a single replay between queues that start at the same time, or replaying multiple segments in parallel (while ensuring we don't overwrite series that share a ref_id but were seen in older segments).
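
To make the two ideas concrete, here is a rough sketch in Go. It is not Prometheus code: `seriesRecord`, `queue`, `storeSeries`, and `replayShared` are hypothetical names, and the WAL is modelled as an in-memory slice. It assumes the ordering concern means that, for a given ref ID, the definition from the newest segment should be the one that is kept. Each record is read once and fanned out to every queue, and each queue ignores a record if it already holds the same ref from a newer segment.

```go
package main

import "fmt"

// seriesRecord is a hypothetical, simplified stand-in for a WAL series record.
type seriesRecord struct {
	segment int    // WAL segment the record was read from
	ref     uint64 // series reference ID
	labels  string // label set, flattened to a string for brevity
}

// queue is a hypothetical stand-in for one remote write queue's series cache.
type queue struct {
	name   string
	series map[uint64]seriesRecord
}

func newQueue(name string) *queue {
	return &queue{name: name, series: make(map[uint64]seriesRecord)}
}

// storeSeries keeps, for each ref, the definition from the newest segment,
// so records that arrive out of segment order cannot replace newer ones.
func (q *queue) storeSeries(r seriesRecord) {
	if cur, ok := q.series[r.ref]; ok && cur.segment > r.segment {
		return
	}
	q.series[r.ref] = r
}

// replayShared reads every record once and fans it out to all queues.
// It returns the number of record reads performed.
func replayShared(wal []seriesRecord, queues []*queue) int {
	reads := 0
	for _, r := range wal {
		reads++
		for _, q := range queues {
			q.storeSeries(r)
		}
	}
	return reads
}

func main() {
	// Records deliberately out of segment order, as parallel replay might deliver them.
	wal := []seriesRecord{
		{segment: 2, ref: 1, labels: `{job="b"}`}, // newer segment, seen first
		{segment: 1, ref: 1, labels: `{job="a"}`}, // older segment, must not win
		{segment: 1, ref: 2, labels: `{job="c"}`},
	}
	queues := []*queue{newQueue("q1"), newQueue("q2"), newQueue("q3")}

	reads := replayShared(wal, queues)
	fmt.Println("record reads:", reads)
	fmt.Println("ref 1 in q1:", queues[0].series[1].labels)
}
```

With three queues, separate replays would read each record three times, while the shared pass reads it once. A real implementation would also have to handle queues that start at different times and the sample records themselves, which this sketch ignores.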
