Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

### Description

[Spinoff from [this comment](https://github.com/apache/lucene/issues/12513#issuecomment-1710089496) about the cool approach Tantivy's FST implementation uses to limit memory during construction.  See also this [awesome detailed blog post](https://blog.burntsushi.net/transducers).]

The most RAM/CPU costly part of constructing an FST is recording all suffixes seen so far so that when you see a new suffix, if it was already seen before, you can share the previous suffix.  To guarantee a minimal (smallest number of states) FST, you must record every such suffix.  But this is crazy costly when compiling many keys, and in practice you can accept loss of minimality in exchange for trimming how many / which suffixes you store.

Lucene's `FSTCompiler.Builder` has three hairy integer options for this (`minSuffixCount1`, `minSuffixCount2`, and `sharedMaxTailLength`), but 1) nobody really knows exactly these do, 2) they are horribly indirect and heavily quantized ways to tune RAM/CPU.  You don't know how much RAM/CPU you really are saving.

Whereas the approach in Tantivy's FST implementation (which was forked original from [this implementation](https://github.com/BurntSushi/fst) by [Andrew Gallant](https://github.com/BurntSushi)) is a simple bounded `HashMap` keeping only the commonly reused suffixes.  Then one could tune how large this is (Andrew suggests ~10K is large enough in practice) against how minimal you really need your FST to be.

I think we should replace the three confusing integers with this approach, which requires only a single integer.  We could even make the single integer a bound on RAM required to make it even more clearly meaningful to users.  We should experiment with some "typical" corpora in how Lucene uses FSTs (encode synonyms, terms, etc.) to find a good default.  Today Lucene defaults to "save everything so you get the truly minimal FST".

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality #12542

Description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality #12542

Description

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions