Description
(Spinoff from tricky discussions on #13190).
Lucene's ConcurrentMergeScheduler allows users to set an mbPerSec rate limit on bytes written for each merge. It's a complex feature to implement, and the new intra-merge concurrency coming shortly makes it even trickier (see the above PR). It is also best effort, since the separate threads writing bytes during merging only "check in" periodically, after enough bytes have been written on their private thread.
However, there is one known problem that I would call an actual bug (and not a "best effort" limitation): the limiter uses a naive instantaneous measure of the IO rate. It simply looks at the time of the last "check-in" from a writing thread versus the current time and the number of bytes written since then, and decides whether to pause. This is somewhat brutal, since IO writes can be bursty: merging things like postings sometimes means quite a bit of CPU effort in between writes, and other times very little (e.g. merging a large postings list). The instantaneous approach we take today gives no credit for a longish period of time when no/few bytes were written.
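To make the failure mode concrete, here is a minimal Java sketch of the instantaneous approach described above; the class and method names (`InstantRateLimiter`, `maybePause`) are hypothetical, and this is a simplification, not Lucene's actual RateLimiter code:

```java
// Hypothetical sketch (not Lucene's actual code): the instantaneous
// approach compares the rate since the *last* check-in against the
// target and pauses, banking no credit from earlier lulls.
class InstantRateLimiter {
  private final double bytesPerSec;
  private long lastCheckNanos = System.nanoTime();

  InstantRateLimiter(double mbPerSec) {
    this.bytesPerSec = mbPerSec * 1024 * 1024;
  }

  /** Called periodically by a writing thread after writing bytesSinceLast bytes. */
  synchronized void maybePause(long bytesSinceLast) throws InterruptedException {
    long now = System.nanoTime();
    double actualSec = (now - lastCheckNanos) / 1e9;
    // How long this chunk *should* have taken at the target rate:
    double targetSec = bytesSinceLast / bytesPerSec;
    if (targetSec > actualSec) {
      // Writing too fast right now: sleep off the difference. Any surplus
      // time from an earlier slow interval was discarded when
      // lastCheckNanos was reset, so there is no "credit" to draw on.
      Thread.sleep((long) ((targetSec - actualSec) * 1000));
    }
    lastCheckNanos = System.nanoTime();
  }
}
```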
It's sort of like telling a runner in a race that they get no credit for running slowly for a while to get their wind back, and are then not allowed to sprint.
This means that if you were to divide the total bytes written by the total time taken, the net mbPerSec would likely be far below the specified limit.
A better model would be something like how AWS (and likely other cloud providers) handles IOPS limits on an EC2 instance, using a "burst bucket" ("Burst IOPS is a feature of Amazon Web Services (AWS) EBS volume types that allows applications to store unused IOPS in a burst bucket and then drain them when needed" -- thank you Gemini for the summary). It would carry some state, allowing a burst after a period of not much IO, and would throttle more closely to the overall target rate while still allowing some bursting to catch up after a lull.
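Here is a rough sketch of what such a burst bucket (token bucket) could look like; the class name and the `maxBurstMB` parameter are made up for illustration, not an existing Lucene API:

```java
// Hypothetical "burst bucket" (token bucket) sketch: write budget
// accrues at the target rate during lulls, up to a cap, and a burst
// can then drain the banked budget before any pause kicks in.
class BurstBucketRateLimiter {
  private final double bytesPerSec;
  private final double maxBucketBytes; // cap on banked write budget
  private double bucketBytes;          // currently banked budget
  private long lastNanos = System.nanoTime();

  BurstBucketRateLimiter(double mbPerSec, double maxBurstMB) {
    this.bytesPerSec = mbPerSec * 1024 * 1024;
    this.maxBucketBytes = maxBurstMB * 1024 * 1024;
    this.bucketBytes = maxBucketBytes; // start with a full bucket
  }

  /** Called periodically by a writing thread after writing `bytes` bytes. */
  synchronized void maybePause(long bytes) throws InterruptedException {
    long now = System.nanoTime();
    // Bank the budget that accrued since the last check-in, up to the cap:
    bucketBytes = Math.min(maxBucketBytes,
        bucketBytes + (now - lastNanos) / 1e9 * bytesPerSec);
    bucketBytes -= bytes;
    lastNanos = now;
    if (bucketBytes < 0) {
      // Overdrawn: sleep until the deficit is repaid at the target rate.
      Thread.sleep((long) (-bucketBytes / bytesPerSec * 1000));
      lastNanos = System.nanoTime();
      bucketBytes = 0;
    }
  }
}
```

With this shape, the long-run rate converges to mbPerSec (a deficit is always slept off), while `maxBurstMB` controls how much a thread may sprint after a CPU-heavy lull; setting it to zero degenerates to roughly the per-chunk pacing we have today.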
We could maybe run some benchmarks to see whether, in practice, Lucene's IO during merging is smooth enough that this lack of burstiness isn't really hurting things much ...
Alternatively, we could remove this complex, tricky-to-implement, best-effort-with-known-bugs feature of Lucene? IO devices have only become more performant (many use cases store the Lucene index on SSDs now), machines have more RAM for the OS to cache writes and gradually spool them to disk, etc. This feature is also likely to be misused, making users think Lucene cannot keep up with merging ...
Version and environment details
No response