-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Description
Description
[Spinoff from this comment about the cool approach Tantivy's FST implementation uses to limit memory during construction. See also this awesome detailed blog post.]
The most RAM/CPU costly part of constructing an FST is recording all suffixes seen so far so that when you see a new suffix, if it was already seen before, you can share the previous suffix. To guarantee a minimal (smallest number of states) FST, you must record every such suffix. But this is crazy costly when compiling many keys, and in practice you can accept loss of minimality in exchange for trimming how many / which suffixes you store.
Lucene's FSTCompiler.Builder has three hairy integer options for this (minSuffixCount1, minSuffixCount2, and sharedMaxTailLength), but 1) nobody really knows exactly these do, 2) they are horribly indirect and heavily quantized ways to tune RAM/CPU. You don't know how much RAM/CPU you really are saving.
Whereas the approach in Tantivy's FST implementation (which was forked original from this implementation by Andrew Gallant) is a simple bounded HashMap keeping only the commonly reused suffixes. Then one could tune how large this is (Andrew suggests ~10K is large enough in practice) against how minimal you really need your FST to be.
I think we should replace the three confusing integers with this approach, which requires only a single integer. We could even make the single integer a bound on RAM required to make it even more clearly meaningful to users. We should experiment with some "typical" corpora in how Lucene uses FSTs (encode synonyms, terms, etc.) to find a good default. Today Lucene defaults to "save everything so you get the truly minimal FST".