Skip to content

Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality #12542

@mikemccand

Description

@mikemccand

Description

[Spinoff from this comment about the cool approach Tantivy's FST implementation uses to limit memory during construction. See also this awesome detailed blog post.]

The most RAM/CPU costly part of constructing an FST is recording all suffixes seen so far so that when you see a new suffix, if it was already seen before, you can share the previous suffix. To guarantee a minimal (smallest number of states) FST, you must record every such suffix. But this is crazy costly when compiling many keys, and in practice you can accept loss of minimality in exchange for trimming how many / which suffixes you store.

Lucene's FSTCompiler.Builder has three hairy integer options for this (minSuffixCount1, minSuffixCount2, and sharedMaxTailLength), but 1) nobody really knows exactly these do, 2) they are horribly indirect and heavily quantized ways to tune RAM/CPU. You don't know how much RAM/CPU you really are saving.

Whereas the approach in Tantivy's FST implementation (which was forked original from this implementation by Andrew Gallant) is a simple bounded HashMap keeping only the commonly reused suffixes. Then one could tune how large this is (Andrew suggests ~10K is large enough in practice) against how minimal you really need your FST to be.

I think we should replace the three confusing integers with this approach, which requires only a single integer. We could even make the single integer a bound on RAM required to make it even more clearly meaningful to users. We should experiment with some "typical" corpora in how Lucene uses FSTs (encode synonyms, terms, etc.) to find a good default. Today Lucene defaults to "save everything so you get the truly minimal FST".

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions