Lucene Change Log
For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions
- Bug Fixes (6)
- GITHUB#13947: Fix ord-to-doc mapping when searching Lucene 9.0.0 hnsw indices
(Michael Sokolov, Ben Trent)
- GITHUB#13867: Fix backwards compatibility bug that caused 9.12.0 to
incorrectly throw IllegalStateException when trying to open an
IndexReader on an index created with quantized (int4, int7, int8)
KNN vectors using Lucene99HnswScalarQuantizedVectorsFormat. This
was an accidental backwards compatibility break: such indices should
be readable and writable by any future 9.x and 10.x Lucene releases.
Note, however, that int8 compression was buggy: it can no longer be
written, but it can still be read from an existing index.
(Ionut Anghelcovici, Michael McCandless)
- GITHUB#13841: Improve Tessellator logic when two holes share the same vertex with the polygon, which was failing
on valid polygons.
(Ignacio Vera)
- GITHUB#13986: Allow easier configuration of the Panama Vectorization provider with
newer Java versions. Set the `org.apache.lucene.vectorization.upperJavaFeatureVersion`
system property to increase the set of Java versions that Panama Vectorization will
provide optimized implementations for.
(Chris Hegarty)
- GITHUB#14008: Counts provided by taxonomy facets in addition to another aggregation are now returned together with
their corresponding ordinals.
(Paul King)
- GITHUB#14027: Make SegmentInfos#readCommit(Directory, String, int) public
(Luca Cavanna)
- API Changes (1)
- GITHUB#13845: Add missing with-discountOverlaps Similarity constructor variants.
(Pierre Salagnac, Christine Poerschke, Robert Muir)
- Security Fixes (1)
- Deserialization of Untrusted Data vulnerability in Apache Lucene Replicator - CVE-2024-45772
(Summ3r from Vidar-Team, Robert Muir, Paul Irwin)
- API Changes (9)
- GITHUB#13806: Add TermInSetQuery#getBytesRefIterator to be able to iterate over query terms.
(Christoph Büscher)
- GITHUB#13469: Expose FlatVectorsFormat as a first-class format; can be configured using a custom Codec.
(Michael Sokolov)
- GITHUB#13612: Hunspell: add Suggester#proceedPastRep to avoid losing relevant suggestions.
(Peter Gromov)
- GITHUB#13603: Introduced `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)` protected method to
facilitate customizing per-leaf behavior of search without having to override
`search(LeafReaderContext[], Weight, Collector)`, which requires overriding the entire loop across the leaves.
(Luca Cavanna)
- GITHUB#13559: Add BitSet#nextSetBit(int, int) to get the index of the first set bit in range.
(Egor Potemkin)
- GITHUB#13568: Add DoubleValuesSource#toSortableLongDoubleValuesSource and
MultiDoubleValuesSource#toSortableMultiLongValuesSource methods.
(Shradha Shankar)
- GITHUB#13568, GITHUB#13750: Add DrillSideways#search method that supports any CollectorManagers for drill-sideways dimensions
or drill-down.
(Egor Potemkin)
- GITHUB#13737: Deprecate the FacetsCollector#search utility methods and add new corresponding methods to
FacetsCollectorManager that accept a FacetsCollectorManager as last argument in place of a Collector.
(Luca Cavanna)
- GITHUB#13794: Deprecate BulkScorer#score(LeafCollector collector, Bits acceptDocs) in favour of
BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max). The method will be removed in the next
major version. Replace usages with the latter, providing 0 as min and DocIdSetIterator.NO_MORE_DOCS as max in case
the entire segment should be scored. Subclasses that override the method should instead override its replacement.
(Luca Cavanna)
- New Features (5)
- GITHUB#13430: Allow configuring the search concurrency via
TieredMergePolicy#setTargetSearchConcurrency. This in turn instructs the
merge policy to try to have at least this number of segments on the highest
tier.
(Adrien Grand, Carlos Delgado)
- GITHUB#13517: Allow configuring the search concurrency on LogDocMergePolicy
and LogByteSizeMergePolicy via a new #setTargetConcurrency setter.
(Adrien Grand)
- GITHUB#13568: Add sandbox facets module to compute facets while collecting.
(Egor Potemkin, Shradha Shankar)
- GITHUB#13678: Add support for JDK 23 to the Panama Vectorization Provider.
(Chris Hegarty)
- GITHUB#13689: Add a new faceting feature, dynamic range facets, which automatically picks a balanced set of numeric
ranges based on the distribution of values that occur across all hits. For use cases that have a highly variable
numeric doc values field, such as "price" in an e-commerce application, this facet method is powerful as it allows the
presented ranges to adapt depending on what hits the query actually matches. This is in contrast to existing range
faceting that requires the application to provide the specific fixed ranges up front.
(Yuting Gan, Greg Miller, Stefan Vodita)
- Improvements (10)
- GITHUB#13475: Re-enable intra-merge parallelism except for terms, norms, and doc values.
Related to GITHUB#13478.
(Ben Trent)
- GITHUB#13548: Refactor and javadoc update for KNN vector writer classes.
(Patrick Zhai)
- GITHUB#13562: Add Intervals.regexp and Intervals.range methods to produce IntervalsSource
for regexp and range queries.
(Mayya Sharipova)
- GITHUB#13625: Remove BitSet#nextSetBit code duplication.
(Greg Miller)
- GITHUB#13285: Early terminate graph searches of AbstractVectorSimilarityQuery to follow timeout set from
IndexSearcher#setTimeout(QueryTimeout).
(Kaival Parikh)
- GITHUB#13633: Add ability to read/write knn vector values to a MemoryIndex.
(Ben Trent)
- GITHUB#12627: Patch HNSW graphs to improve reachability of all nodes from entry points
- GITHUB#13201: Better cost estimation on MultiTermQuery over few terms.
(Michael Froh)
- GITHUB#13735: Migrate monitor package usage of deprecated IndexSearcher#search(Query, Collector)
to IndexSearcher#search(Query, CollectorManager).
(Greg Miller)
- GITHUB#13746: Introduce ProfilerCollectorManager to parallelize search when using ProfilerCollector.
(Luca Cavanna)
- Optimizations (18)
- GITHUB#13439: Avoid unnecessary memory allocation in PackedLongValues#Iterator.
(Zhang Chao)
- GITHUB#13425: Rewrite SortedNumericDocValuesRangeQuery to MatchNoDocsQuery when the upper bound is smaller than the
lower bound.
(Ioana Tagirta)
- GITHUB#13322: Implement Weight#count for vector values in the FieldExistsQuery.
(Pan Guixin)
- GITHUB#13454: MultiTermQuery returns null ScoreSupplier in cases where
no query terms are present in the index segment
(Mayya Sharipova)
- GITHUB#13431: Replace TreeMap and use compiled Patterns in Japanese UserDictionary.
(Bruno Roustant)
- GITHUB#12941: Don't preserve auxiliary buffer contents in LSBRadixSorter if it grows.
(Stefan Vodita)
- GITHUB#13175: Stop double-checking priority queue inserts in some FacetCount classes.
(Jakub Slowinski)
- GITHUB#13538: Slightly reduce heap usage for HNSW and scalar quantized vector writers.
(Ben Trent)
- GITHUB#12100: WordBreakSpellChecker.suggestWordBreaks now does a breadth first search, allowing it to return
better matches with fewer evaluations
(hossman)
- GITHUB#13582: Stop requiring MaxScoreBulkScorer's outer window to have at
least INNER_WINDOW_SIZE docs.
(Adrien Grand)
- GITHUB#13570, GITHUB#13574, GITHUB#13535: Avoid performance degradation with closing shared Arenas.
Closing many individual index files can potentially lead to a degradation in execution performance.
Index files are mmapped one-to-one with the JDK's foreign shared Arena. The JVM deoptimizes the top
few frames of all threads when closing a shared Arena (see JDK-8335480). We mitigate this situation
when running with JDK 21 and greater, by 1) using a confined Arena where appropriate, and 2) grouping
files from the same segment to a single shared Arena.
A system property has been added that controls the total maximum number of mmapped files
that may be associated with a single shared Arena. For example, to set the max number of permits to
256, pass the following on the command line:
-Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=256. Setting a value of 1 associates
a single file to a single shared arena.
(Chris Hegarty, Michael Gibney, Uwe Schindler)
- GITHUB#13585: Lucene912PostingsFormat, the new default postings format, now
only has 2 levels of skip data, which are inlined into postings instead of
being stored at the end of postings lists. This translates into better
performance for queries that need skipping such as conjunctions.
(Adrien Grand)
- GITHUB#13581: OnHeapHnswGraph no longer allocates a lock for every graph node
(Mike Sokolov)
- GITHUB#13636, GITHUB#13658: Optimizations to the decoding logic of blocks of
postings.
(Adrien Grand, Uwe Schindler, Greg Miller)
- GITHUB#13644: Improve NumericComparator competitive iterator logic by comparing the missing value with the top
value even after the hit queue is full
(Pan Guixin)
- GITHUB#13587: Use Max WAND optimizations with ToParentBlockJoinQuery when using ScoreMode.Max
(Mike Pellegrini)
- GITHUB#13742: Reorder checks in LRUQueryCache#count
(Shubham Chaudhary)
- GITHUB#13697: Add a bulk scorer to ToParentBlockJoinQuery, which delegates to the bulk scorer of the child query.
This should speed up query evaluation when the child query has a specialized bulk scorer, such as disjunctive queries.
(Mike Pellegrini)
- Changes in runtime behavior (1)
- GITHUB#13472: When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the
thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1
to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor
that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no
longer required.
(Armin Braun)
- Bug Fixes (12)
- GITHUB#13498: Avoid performance regression by constructing lazily the PointTree in NumericComparator.
(Ignacio Vera)
- GITHUB#13384: Fix highlighter to use longer passages instead of shorter individual terms.
(Zack Kendall)
- GITHUB#13463: Address bug in MultiLeafKnnCollector causing #minCompetitiveSimilarity to stay artificially low in
some corner cases.
(Greg Miller)
- GITHUB#13553: Correct RamUsageEstimate for scalar quantized knn vector formats so that raw vectors are correctly
accounted for.
(Ben Trent)
- GITHUB#13615: Correct scalar quantization when used in conjunction with COSINE similarity. Vectors are normalized
before quantization to ensure the cosine similarity is correctly calculated.
(Ben Trent)
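The entry above relies on the fact that the cosine similarity of two vectors equals the dot product of their unit-length versions. The following is a minimal sketch of that normalization step only; the class and method names (`CosineNormalizeSketch`, `l2Normalize`, `dot`) are hypothetical, and this is not Lucene's quantizer code.

```java
class CosineNormalizeSketch {
  /** Returns a unit-length copy of v; rejects zero-length or non-finite input. */
  static float[] l2Normalize(float[] v) {
    double sumSquares = 0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    double norm = Math.sqrt(sumSquares);
    if (norm == 0 || !Double.isFinite(norm)) {
      throw new IllegalArgumentException("cannot normalize a zero-length or non-finite vector");
    }
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }

  /** Plain dot product, accumulated in double to limit rounding error. */
  static double dot(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (double) a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] a = l2Normalize(new float[] {1f, 0f});
    float[] b = l2Normalize(new float[] {2f, 2f});
    // The dot product of the unit vectors is the cosine of the angle between the originals.
    System.out.println(dot(a, b));
  }
}
```

Once every vector has unit length, all components fall in [-1, 1], which gives a quantizer a bounded range to work with.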
- GITHUB#13627: Fix race condition on flush for DWPT seqNo generation.
(Ben Trent, Ao Li)
- GITHUB#13646: Fix rare test bug in TestLongValueFacetCounts that was introduced in 9.6.
(Greg Miller)
- GITHUB#13691: Fix incorrect exponent value in explain of SigmoidFunction.
(Owais Kazi)
- GITHUB#13703: Fix bug in LatLonPoint queries where narrow polygons close to latitude 90 don't
match any points due to an Integer overflow.
(Ignacio Vera)
- GITHUB#13641: Unify how KnnFormats handle missing fields and correctly handle missing vector fields when
merging segments.
(Ben Trent)
- GITHUB#13519: 8 bit scalar vector quantization is no longer
supported: it was buggy starting in 9.11 (GITHUB#13197). 4 and 7
bit quantization are still supported. Existing (9.x) Lucene indices
that previously used 8 bit quantization can still be read/searched
but the results from `KNN*VectorQuery` are silently buggy. Further
8 bit quantized vector indexing into such (9.11) indices is not
permitted, so your path forward if you wish to continue using the
same 9.11 index is to index additional vectors into the same field
with either 4 or 7 bit quantization (or no quantization), and ensure
all older (9.11 written) segments are rewritten either via
`IndexWriter.forceMerge` or
`IndexWriter.addIndexes(CodecReader...)`, or reindexing entirely.
- GITHUB#13799: Disable intra-merge parallelism for all structures but kNN vectors.
(Ben Trent)
- Build (1)
- GITHUB#13695, GITHUB#13696: Fix Gradle build sometimes giving spurious "unreferenced license file" warnings.
(Uwe Schindler)
- Other (2)
- GITHUB#13720: Add float comparison based on unit of least precision and use it to stop test failures caused by float
summation not being associative in IEEE 754.
(Alex Herbert, Stefan Vodita)
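To illustrate the idea behind a comparison based on units of least precision (ULP), here is a minimal sketch. The class `UlpCompareSketch` and the method `equalWithinUlps` are hypothetical names chosen for this example; they are not the utility added to Lucene's tests, whose exact tolerance rules may differ.

```java
class UlpCompareSketch {
  /** Returns true if a and b differ by at most maxUlps units of least precision. */
  static boolean equalWithinUlps(float a, float b, int maxUlps) {
    if (Float.isNaN(a) || Float.isNaN(b)) {
      return false; // NaN never compares equal
    }
    if (a == b) {
      return true; // covers +0.0f vs -0.0f and identical infinities
    }
    if (Float.isInfinite(a) || Float.isInfinite(b)) {
      return false; // an infinity only equals itself
    }
    // Scale the tolerance by the ULP of the operand with the larger magnitude.
    float ulp = Math.ulp(Math.max(Math.abs(a), Math.abs(b)));
    return Math.abs(a - b) <= maxUlps * ulp;
  }

  public static void main(String[] args) {
    float[] values = {1e-3f, 0.25f, 3.5f, 1e3f};
    float forward = 0f;
    for (int i = 0; i < values.length; i++) {
      forward += values[i];
    }
    float backward = 0f;
    for (int i = values.length - 1; i >= 0; i--) {
      backward += values[i];
    }
    // Summation order may change the rounded result; a ULP tolerance absorbs small differences.
    System.out.println(forward == backward);
    System.out.println(equalWithinUlps(forward, backward, 4));
  }
}
```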
- Remove code triggering forbidden-apis regarding Java serialization.
(Uwe Schindler, Robert Muir)
- Bug Fixes (5)
- GITHUB#13498: Avoid performance regression by constructing lazily the PointTree in NumericComparator.
(Ignacio Vera)
- GITHUB#13501, GITHUB#13478: Remove intra-merge parallelism for everything except HNSW graph merges.
(Ben Trent)
- GITHUB#13498, GITHUB#13340: Allow adding a parent field to an index with no fields
(Michael Sokolov)
- GITHUB#12431: Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter
by unordered matches.
(Stephane Campinas)
- GITHUB#13493: StringValueFacetCounts stops throwing NPE when faceting over an empty match-set.
(Grebennikov Roman, Stefan Vodita)
- API Changes (2)
- GITHUB#13145: Deprecate ByteBufferIndexInput as it will be removed in Lucene 10.0.
(Uwe Schindler)
- GITHUB#13422: an explicit dependency on the HPPC library is removed in favor of an internal repackaged copy in
oal.internal.hppc. If you relied on HPPC as a transitive dependency, you'll have to add it to your project explicitly.
The HPPC classes now bundled in Lucene core are internal and will have restricted access in future releases, please do
not use them.
(Bruno Roustant, Dawid Weiss, Uwe Schindler, Chris Hegarty)
- New Features (9)
- GITHUB#13125: Recursive graph bisection is now supported on indexes that have blocks, as long as
they configure a parent field via `IndexWriterConfig#setParentField`.
(Adrien Grand)
- GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter
and JapaneseKatakanaUppercaseFilter.
(Dai Sugimori)
- GITHUB#13196, GITHUB#13222: Add support for posix_madvise to MMapDirectory: If running on
Linux/macOS and Java 21 or later, MMapDirectory uses IOContext to pass suitable MADV flags to
kernel of operating system. In particular, merging now passes POSIX_MADV_SEQUENTIAL to the readers
that are being merged, and searching passes POSIX_MADV_RANDOM to vector data files - including
quantized vector data files, HNSW graphs, stored fields data files and term vectors data files.
This may improve paging logic especially when working with large indexes under memory pressure.
(Uwe Schindler, Chris Hegarty, Robert Muir, Adrien Grand)
- GITHUB#13197: Expand support for new scalar bit levels for HNSW vectors. This includes 4-bit vectors and an option
to compress them to gain a 50% reduction in memory usage.
(Ben Trent)
- GITHUB#13268: Add ability for UnifiedHighlighter to highlight a field based on combined matches from multiple fields.
(Mayya Sharipova, Jim Ferenczi)
- GITHUB#13288: Make HNSW and Flat storage vector formats easier to extend with new FlatVectorScorer interface. Add
new Hnsw format for binary quantized vectors.
(Ben Trent)
- GITHUB#13181: Add new VectorScorer interface to vector value iterators. This allows for vector codecs to supply
simpler and more optimized vector scoring when iterating vector values directly.
(Ben Trent)
- GITHUB#13414: Counts are always available in the result when using taxonomy facets.
(Stefan Vodita)
- GITHUB#13445: Add new option when calculating scalar quantiles. The new option of setting `confidenceInterval` to
`0` will now dynamically determine the quantiles through a grid search over multiple quantiles calculated
by multiple intervals.
(Ben Trent)
- Improvements (14)
- GITHUB#13092: `static final Map` constants have been made immutable
(Dmitry Cherniachenko)
- GITHUB#13041: TokenizedPhraseQueryNode code cleanup
(Dmitry Cherniachenko)
- GITHUB#13087: Changed `static final Set` constants to be immutable. Among others it affected
ScandinavianNormalizer.ALL_FOLDINGS set with public access.
(Dmitry Cherniachenko)
- GITHUB#13155: Hunspell: allow ignoring exceptions on duplicate ICONV/OCONV mappings
(Peter Gromov)
- GITHUB#13156: Hunspell: don't proceed with other suggestions if we found good REP ones
(Peter Gromov)
- GITHUB#13066: Support getMaxScore of DisjunctionSumScorer for non top level scoring clause
(Shintaro Murakami)
- GITHUB#13124: MergeScheduler can now provide an executor for intra-merge parallelism. The first
implementation is the ConcurrentMergeScheduler and the Lucene99HnswVectorsFormat will use it if no other
executor is provided.
(Ben Trent)
- GITHUB#13239: Upgrade icu4j to version 74.2.
(Robert Muir)
- GITHUB#13202: Early terminate graph and exact searches of AbstractKnnVectorQuery to follow timeout set from
IndexSearcher#setTimeout(QueryTimeout).
(Kaival Parikh)
- GITHUB#12966: Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself.
This reduces code duplication and enables future development.
(Stefan Vodita)
- GITHUB#13362: Add sub query explanations to DisjunctionMaxQuery, if the overall query didn't match.
(Tim Grein)
- GITHUB#13385: Add Intervals.noIntervals() method to produce an empty IntervalsSource.
(Aniketh Jain, Uwe Schindler, Alan Woodward)
- GITHUB#13276: UnifiedHighlighter: new 'passageSortComparator' option to allow sorting other than offset order.
(Seunghan Jung)
- GITHUB#13429: Hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore
(Peter Gromov)
- Optimizations (24)
- GITHUB#13306: Use RWLock to access LRUQueryCache to reduce contention.
(Boice Huang)
- GITHUB#13252: Replace handwritten comparison loops with Arrays.compareUnsigned in SegmentTermsEnum.
(zhouhui)
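As a rough illustration of this kind of replacement, the sketch below puts a handwritten unsigned byte comparison next to the equivalent `java.util.Arrays.compareUnsigned` call (available since Java 9). The class and method names are hypothetical, and the loop is not copied from SegmentTermsEnum.

```java
import java.util.Arrays;

class UnsignedCompareSketch {
  /** Handwritten lexicographic comparison that treats bytes as unsigned values. */
  static int compareLoop(byte[] a, int aFrom, int aTo, byte[] b, int bFrom, int bTo) {
    int len = Math.min(aTo - aFrom, bTo - bFrom);
    for (int i = 0; i < len; i++) {
      int diff = (a[aFrom + i] & 0xFF) - (b[bFrom + i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    // All shared bytes are equal: the shorter range sorts first.
    return (aTo - aFrom) - (bTo - bFrom);
  }

  /** The same ordering, delegated to the JDK method. */
  static int compareBuiltin(byte[] a, int aFrom, int aTo, byte[] b, int bFrom, int bTo) {
    return Arrays.compareUnsigned(a, aFrom, aTo, b, bFrom, bTo);
  }

  public static void main(String[] args) {
    byte[] x = {0x01, (byte) 0x80};
    byte[] y = {0x01, 0x7F};
    // Only the sign of the result is meaningful, so compare signs rather than raw values.
    System.out.println(Integer.signum(compareLoop(x, 0, 2, y, 0, 2)));
    System.out.println(Integer.signum(compareBuiltin(x, 0, 2, y, 0, 2)));
  }
}
```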
- GITHUB#12996: Reduce ArrayUtil#grow in decompress.
(Zhang Chao)
- GITHUB#13115: Short circuit queued flush check when flush on update is disabled
(Prabhat Sharma)
- GITHUB#13085: Remove unnecessary toString() / substring() calls to save some String allocations
(Dmitry Cherniachenko)
- GITHUB#13121: Speedup multi-segment HNSW graph search for diversifying child kNN queries. Builds on GITHUB#12962.
(Ben Trent)
- GITHUB#13184: Make the HitQueue size more appropriate for KNN exact search
(Pan Guixin)
- GITHUB#13199: Speed up dynamic pruning by breaking point estimation when the threshold is exceeded.
(Guo Feng)
- GITHUB#13203: Speed up writeGroupVInts
(Zhang Chao)
- GITHUB#13224: Use singleton for all-zeros DirectMonotonicReader.Meta
(Armin Braun)
- GITHUB#13232: Introduce singleton for PackedInts.NullReader of size 256
(Armin Braun)
- GITHUB#11888: Binary search the BlockTree terms dictionary entries when all suffixes have the same length
in a leaf block, speeding up cases like primary key lookup on an id field when all ids are the same length.
(zhouhui)
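The reason equal-length suffixes help is that entry i then starts at a known offset (i times the suffix length), so a block can be binary searched without scanning. The sketch below shows that idea on a plain byte array; the flat layout, the class name `FixedLengthSuffixSearch` and the method `find` are assumptions made for illustration and do not reflect the BlockTree file format.

```java
import java.util.Arrays;

class FixedLengthSuffixSearch {
  /**
   * Finds target among sorted entries of suffixLength bytes each, stored back to back in block.
   * Returns the entry index, or -1 if the target is absent.
   */
  static int find(byte[] block, int suffixLength, byte[] target) {
    if (suffixLength <= 0) {
      throw new IllegalArgumentException("suffixLength must be positive");
    }
    if (target.length != suffixLength) {
      return -1; // every entry has the same length, so other lengths cannot match
    }
    int lo = 0;
    int hi = block.length / suffixLength - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int start = mid * suffixLength; // fixed length makes the offset directly computable
      int cmp = Arrays.compareUnsigned(block, start, start + suffixLength, target, 0, suffixLength);
      if (cmp == 0) {
        return mid;
      } else if (cmp < 0) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] block = "aabbccdd".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    System.out.println(find(block, 2, "cc".getBytes(java.nio.charset.StandardCharsets.US_ASCII)));
  }
}
```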
- GITHUB#13149: Made PointRangeQuery faster, for some segment sizes, by reducing the number of virtual calls to
IntersectVisitor::visit(int).
(Anton Hägerstrand)
- GITHUB#12966: FloatTaxonomyFacets can now collect values into a sparse structure, like IntTaxonomyFacets already
could.
(Stefan Vodita)
- GITHUB#13284: Per-field doc values and knn vectors readers now use a HashMap internally instead of
a TreeMap.
(Adrien Grand)
- GITHUB#13321: Improve compressed int4 quantized vector search by utilizing SIMD inline with the decompression
process.
(Ben Trent)
- GITHUB#12408: Lazy initialization improvements for Facets implementations when there are segments with no hits
to count.
(Greg Miller)
- GITHUB#13327: Reduce memory usage of field maps in FieldInfos and BlockTree TermsReader.
(Bruno Roustant, David Smiley)
- GITHUB#13339: Add a MemorySegment Vector scorer - for scoring without copying on-heap
(Chris Hegarty)
- GITHUB#13368: Replace Map<Integer, Object> by primitive IntObjectHashMap.
(Bruno Roustant)
- GITHUB#13392: Replace Map<Long, Object> by primitive LongObjectHashMap.
(Bruno Roustant)
- GITHUB#13400: Replace Set<Integer> by IntHashSet and Set<Long> by LongHashSet.
(Bruno Roustant)
- GITHUB#13406: Replace List<Integer> by IntArrayList and List<Long> by LongArrayList.
(Bruno Roustant)
- GITHUB#13420: Replace Map<Character> by CharObjectHashMap and Set<Character> by CharHashSet.
(Bruno Roustant)
- Bug Fixes (16)
- GITHUB#13105: Fix ByteKnnVectorFieldSource & FloatKnnVectorFieldSource to work correctly when a segment does not contain
any docs with vectors
(hossman)
- GITHUB#13017: Fix bug where DV update files referenced by a merge could be deleted by a concurrent flush.
(Jialiang Guo)
- GITHUB#13145: Detect MemorySegmentIndexInput correctly in NRTSuggester.
(Uwe Schindler)
- GITHUB#13154: Hunspell GeneratingSuggester: ensure there are never more than 100 roots to process
(Peter Gromov)
- GITHUB#13162: Fix NPE when LeafReader returns null VectorValues
(Pan Guixin)
- GITHUB#13169: Fix potential race condition in DocumentsWriter & DocumentsWriterDeleteQueue
(Ben Trent)
- GITHUB#13204: Fix equals/hashCode of IOContext.
(Uwe Schindler, Robert Muir)
- GITHUB#13206: Subtract deleted file size from the cache size of NRTCachingDirectory.
(Jean-François Boeuf)
- GITHUB#12966: Aggregation facets no longer assume that aggregation values are positive.
(Stefan Vodita)
- GITHUB#13356: Ensure negative scores are not returned from scalar quantization scorer.
(Ben Trent)
- GITHUB#13366: Disallow NaN and Inf values in scalar quantization and better handle extreme cases.
(Ben Trent)
- GITHUB#13369: Fix NRT opening failure when soft deletes are enabled and the document fails to index before a point
field is written
(Ben Trent)
- GITHUB#13378: Fix points writing with no values
(Chris Hegarty)
- GITHUB#13374: Fix bug in SQ when just a single vector is present in a segment
(Chris Hegarty)
- GITHUB#13376: Fix integer overflow exception in postings encoding as group-varint.
(Zhang Chao, Guo Feng)
- GITHUB#13421: Fixes TestOrdinalMap.testRamBytesUsed for multiple default PackedInts.NullReader instances.
(Amir Raza)
- Build (1)
- Upgrade forbiddenapis to version 3.7 and ASM for APIJAR extraction to 9.7.
(Uwe Schindler)
- Other (3)
- GITHUB#13068: Replace numerous `brToString(BytesRef)` copies with a `ToStringUtils` method
(Dmitry Cherniachenko)
- GITHUB#13077: Add public getter for SynonymQuery#field
(Andrey Bozhko)
- GITHUB#13393: Add support for reloading the SPI for KnnVectorsFormat class
(Navneet Verma)
- API Changes (4)
- GITHUB#12243: Mark TermInSetQuery ctors with varargs terms as @Deprecated. SortedSetDocValuesField#newSlowSetQuery,
SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection of terms as a param.
(Jakub Slowinski)
- GITHUB#11041: Deprecate IndexSearcher#search(Query, Collector) in favor of
IndexSearcher#search(Query, CollectorManager) for TopFieldCollectorManager
and TopScoreDocCollectorManager.
(Zach Chen, Adrien Grand, Michael McCandless, Greg Miller, Luca Cavanna)
- GITHUB#12854: Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated.
(Greg Miller)
- GITHUB#12624, GITHUB#12831: Allow FSTCompiler to stream to any DataOutput while building, and
make compile() only return the FSTMetadata. For on-heap (default) use case, please use
FST.fromFSTReader(fstMetadata, fstCompiler.getFSTReader()) to create the FST.
(Anh Dung Bui)
- New Features (4)
- GITHUB#12679: Add support for similarity-based vector searches using [Byte|Float]VectorSimilarityQuery. Uses a new
VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till
better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest
level.
(Aditya Prakash, Kaival Parikh)
- GITHUB#12829: For indices newly created as of 9.10.0 onwards, IndexWriter preserves document blocks indexed via
IndexWriter#addDocuments or IndexWriter#updateDocuments also when index sorting is configured. Document blocks are
maintained alongside their parent documents during sort and merge. IndexWriterConfig accepts a parent field that is used
to maintain block orders if index sorting is used. Note, this is fully optional in Lucene 9.x but will be mandatory for
indices that use document blocks together with index sorting as of 10.0.0.
(Simon Willnauer)
- GITHUB#12336: Index additional data per facet label in the taxonomy.
(Shai Erera, Egor Potemkin, Mike McCandless, Stefan Vodita)
- GITHUB#12706: Add support for the final release of Java foreign memory API in Java 22 (and later).
Lucene's MMapDirectory will now mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) starting
from Java 19. Indexes closed while queries are running can no longer crash the JVM.
Support for vectorized implementations of VectorUtil based on jdk.incubator.vector APIs was added
for exactly Java 22. Therefore, applications started with command line parameter
"java --add-modules jdk.incubator.vector" will automatically use the new vectorized implementations
if running on a supported platform (Java 20/21/22 on x86 CPUs with AVX2 or later or ARM NEON CPUs).
This is an opt-in feature and requires an explicit Java command line flag! When enabled, Lucene logs
a notice using java.util.logging. Please test thoroughly and report bugs/slowness to Lucene's mailing
list.
(Uwe Schindler, Chris Hegarty)
- Improvements (7)
- GITHUB#12870: Tighten synchronized loop in DirectoryTaxonomyReader#getOrdinal.
(Stefan Vodita)
- GITHUB#12812: Avoid overflows and false negatives in int slice buffer filled-with-zeros assertion.
(Stefan Vodita)
- GITHUB#12910: Refactor around NeighborArray to make it more self-contained.
(Patrick Zhai)
- GITHUB#12999: Use Automaton for SurroundQuery prefix/pattern matching
(Michael Gibney)
- GITHUB#13043: Support getMaxScore of ConjunctionScorer for non top level scoring clause.
(Shintaro Murakami)
- GITHUB#13055: Make DEFAULT_STOP_TAGS in KoreanPartOfSpeechStopFilter immutable
(Dmitry Cherniachenko)
- GITHUB#888: Use native byte order varhandles to spare CPU's byte swapping.
Tests are running with random byte order to ensure that the order does not affect correctness
of code. Native order was enabled for LZ4 compression.
(Uwe Schindler)
- Optimizations (11)
- LUCENE-10366: Override readVInt() and readVLong() in ByteBufferDataInput to help Hotspot inline method.
(Guo Feng)
- GITHUB#12839: Introduce method to grow arrays up to a given upper limit and use it to reduce overallocation for
DirectoryTaxonomyReader#getBulkOrdinals.
(Stefan Vodita)
- GITHUB#12841: Move group-varint encoding/decoding logic to DataOutput/DataInput.
(Adrien Grand, Zhang Chao, Uwe Schindler)
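For readers unfamiliar with group-varint, the general technique packs four integers behind a single flag byte that records how many bytes (1 to 4) each value occupies. The sketch below illustrates that general idea only: the flag bit order and the little-endian payload are assumptions of this example, the class `GroupVarIntSketch` is hypothetical, and no claim is made that the bytes match what Lucene's DataOutput/DataInput produce.

```java
class GroupVarIntSketch {
  /** Encodes values[off..off+4) into out starting at pos; returns the position after the group. */
  static int encodeGroup(int[] values, int off, byte[] out, int pos) {
    int flagPos = pos++;
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[off + i];
      // Bytes needed for v when treated as unsigned; zero still takes one byte.
      int numBytes = v == 0 ? 1 : (32 - Integer.numberOfLeadingZeros(v) + 7) / 8;
      flag |= (numBytes - 1) << (6 - 2 * i);
      for (int b = 0; b < numBytes; b++) {
        out[pos++] = (byte) (v >>> (8 * b));
      }
    }
    out[flagPos] = (byte) flag;
    return pos;
  }

  /** Decodes one group from in at pos into dst[off..off+4); returns the position after the group. */
  static int decodeGroup(byte[] in, int pos, int[] dst, int off) {
    int flag = in[pos++] & 0xFF;
    for (int i = 0; i < 4; i++) {
      int numBytes = ((flag >>> (6 - 2 * i)) & 3) + 1;
      int v = 0;
      for (int b = 0; b < numBytes; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b);
      }
      dst[off + i] = v;
    }
    return pos;
  }

  public static void main(String[] args) {
    int[] src = {0, 255, 256, 70000};
    byte[] buf = new byte[17]; // worst case: 1 flag byte + 4 values * 4 bytes
    int end = encodeGroup(src, 0, buf, 0);
    int[] dst = new int[4];
    decodeGroup(buf, 0, dst, 0);
    System.out.println(end + " bytes, round trip ok: " + java.util.Arrays.equals(src, dst));
  }
}
```

Compared with a classic varint, the lengths are known up front from the flag byte, so decoding avoids a branch per byte.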
- GITHUB#12997: Avoid resetting BlockDocsEnum#freqBuffer when indexHasFreq is false.
(Zhang Chao, Adrien Grand)
- GITHUB#12989: Split taxonomy facet arrays across reusable chunks of elements to reduce allocations.
(Michael Froh, Stefan Vodita)
- GITHUB#13033: PointRangeQuery now exits earlier on segments whose values
don't intersect with the query range. When a PointRangeQuery is a required
clause of a boolean query, this helps save work on other required clauses of
the same boolean query.
(Adrien Grand)
- GITHUB#13026: ReqOptSumScorer will now propagate minimum competitive scores
to the optional clause if the required clause doesn't score. In practice,
this will help boolean queries that consist of a mix of FILTER clauses and
SHOULD clauses.
(Adrien Grand)
- GITHUB#13052: Avoid set.removeAll(list) O(n^2) performance trap in the UpgradeIndexMergePolicy
(Dmitry Cherniachenko)
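The trap mentioned above comes from `AbstractSet#removeAll` in OpenJDK: when the set is not larger than the argument, it iterates the set and calls `contains` on the argument, which is a linear scan for a `List`. A minimal sketch of one way to sidestep it follows; the class `RemoveAllSketch` and the method `removeEach` are hypothetical, and this is not necessarily the exact change made in UpgradeIndexMergePolicy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RemoveAllSketch {
  /** Removes every element of toRemove from set using one hash lookup per list element. */
  static <T> void removeEach(Set<T> set, List<T> toRemove) {
    for (T item : toRemove) {
      set.remove(item);
    }
  }

  public static void main(String[] args) {
    Set<Integer> set = new HashSet<>();
    List<Integer> list = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      set.add(i);
      list.add(i * 2); // even numbers, some of them outside the set
    }
    removeEach(set, list);
    System.out.println(set);
  }
}
```

The cost is proportional to the list size regardless of how the two sizes compare, whereas `set.removeAll(list)` can degrade to set size times list size.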
- GITHUB#13036: Optimize counts on two clause term disjunctions.
(Adrien Grand, Johannes Fredén)
- GITHUB#12962: Speedup concurrent multi-segment HNSW graph search
(Mayya Sharipova, Tom Veasey)
- GITHUB#13090: Prevent humongous allocations in ScalarQuantizer when building quantiles.
(Ben Trent)
- Bug Fixes (7)
- GITHUB#12866: Prevent extra similarity computation for single-level HNSW graphs.
(Kaival Parikh)
- GITHUB#12558: Ensure #finish is called on all drill-sideways FacetsCollectors even when no hits are scored.
(Greg Miller)
- GITHUB#12920: Address bug in TestDrillSideways#testCollectionTerminated that could occasionally cause the test to
fail with certain random seeds.
(Greg Miller)
- GITHUB#12885: Fixed a bug where JapaneseReadingFormFilter could not convert some hiragana to romaji.
(Takuma Kuramitsu)
- GITHUB#12287: Fix a bug in ShapeTestUtil.
(Heemin Kim)
- GITHUB#13031: ScorerSupplier created by QueryProfilerWeight will propagate topLevelScoringClause to the sub ScorerSupplier.
(Shintaro Murakami)
- GITHUB#13059: Fixed missing IndicNormalization and DecimalDigit filters in TeluguAnalyzer normalization
(Dmitry Cherniachenko)
- Build (1)
- GITHUB#12931, GITHUB#12936, GITHUB#12937: Improve source file validation to detect incorrect
UTF-8 sequences and forbid U+200B; enable errorprone DisableUnicodeInCode check.
(Robert Muir, Uwe Schindler)
- Other (5)
- GITHUB#11023: Removing some dead code in CheckIndex.
(Jakub Slowinski)
- GITHUB#11023: Removing @lucene.experimental tags in testXXX methods in CheckIndex.
(Jakub Slowinski)
- GITHUB#12934: Cleaning up old references to Lucene/Solr.
(Jakub Slowinski)
- GITHUB#12967, GITHUB#13038, GITHUB#13040, GITHUB#13042, GITHUB#13047, GITHUB#13048, GITHUB#13049, GITHUB#13050, GITHUB#13051, GITHUB#13039:
Code cleanups and optimizations.
(Dmitry Cherniachenko)
- GITHUB#13053: Minor AnyQueryNode code cleanup
(Dmitry Cherniachenko)
- Bug Fixes (2)
- GITHUB#13027: Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat
(Ben Trent)
- GITHUB#13014: Rollback the tmp storage of BytesRefHash to -1 after sort
(Guo Feng)
- Bug Fixes (2)
- GITHUB#12898: JVM SIGSEGV crash when compiling computeCommonPrefixLengthAndBuildHistogram
(Chris Hegarty)
- GITHUB#12900: Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped
(Guo Feng, Mike McCandless)
- API Changes (13)
- GITHUB#12578: Deprecate IndexSearcher#getExecutor in favour of executing concurrent tasks using
the TaskExecutor that the searcher holds, retrieved via IndexSearcher#getTaskExecutor
(Luca Cavanna)
- GITHUB#12556: StoredFieldVisitor has a new expert method StoredFieldVisitor#binaryField(FieldInfo, DataInput, int)
that allows implementors to read binary values directly from the DataInput without having to allocate a byte[].
The default implementation allocates a new byte array and calls StoredFieldVisitor#binaryField(FieldInfo, byte[]).
(Ignacio Vera)
- GITHUB#12592: Add RandomAccessInput#length method to the RandomAccessInput interface. In addition deprecate
ByteBuffersDataInput#size in favour of this new method.
(Ignacio Vera)
- GITHUB#12718: Make IndexSearcher#getSlices final as it is not expected to be overridden
(Luca Cavanna)
- GITHUB#12427: Automata#makeStringUnion and #makeBinaryStringUnion now accept Iterable<BytesRef> instead of
Collection<BytesRef>. They also now explicitly throw IllegalArgumentException if input data is not properly sorted
instead of relying on assert.
(Shubham Chaudhary)
- GITHUB#12180: Add TaxonomyReader#getBulkOrdinals method to more efficiently retrieve facet ordinals for multiple
FacetLabel at once.
(Egor Potemkin)
- GITHUB#12816: Add HumanReadableQuery which takes a description parameter for debugging purposes.
(Jakub Slowinski)
- GITHUB#12646, GITHUB#12690: Move FST#addNode to FSTCompiler to avoid a circular dependency
between FST and FSTCompiler
(Anh Dung Bui)
- GITHUB#12709: Consolidate FSTStore and BytesStore in FST. Created FSTReader which contains the common methods
of the two
(Anh Dung Bui)
- GITHUB#12735: Remove FSTCompiler#getTermCount() and FSTCompiler.UnCompiledNode#inputCount
(Anh Dung Bui)
- GITHUB#12695: Remove public constructor of FSTCompiler. Please use FSTCompiler.Builder
instead.
(Juan M. Caicedo)
- GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent
HNSW graph build.
(Shubham Chaudhary)
- GITHUB#12758, GITHUB#12803: Remove FST constructor with DataInput for metadata. Please
use the constructor with FSTMetadata instead.
(Anh Dung Bui)
- New Features (5)
- GITHUB#12548: Added similarityToQueryVector API to compute vector similarity scores
with DoubleValuesSource.
(Shubham Chaudhary)
- GITHUB#12685: Lucene now records if documents have been indexed as blocks in SegmentInfo. This is recorded on a per
segment basis and maintained across merges. The property is exposed via LeafReaderMetadata.
(Simon Willnauer)
- GITHUB#12582: Add int8 scalar quantization to the HNSW vector format. This optionally allows for more compact lossy
storage for the vectors, requiring about 75% less memory for fast HNSW search.
(Ben Trent)
- GITHUB#12660: HNSW graphs can now be merged using multiple threads. Configurable in Lucene99HnswVectorsFormat.
(Patrick Zhai)
- GITHUB#12729: Add new Lucene99FlatVectorsFormat for writing vectors in a flat format and refactor
Lucene99HnswVectorsFormat to reuse the flat format. Added new Lucene99HnswQuantizedVectorsFormat that uses
quantized vectors for its flat storage.
(Ben Trent)
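As a rough illustration of the int8 scalar quantization idea from GITHUB#12582 above: each 4-byte float component is mapped linearly onto a single byte, which is where the memory saving comes from. The class, method names, and the fixed [min, max] range below are hypothetical and chosen for illustration only; this is a minimal sketch of the general technique, not the code used by the Lucene vector formats.

```java
/** Hypothetical helper (not a Lucene class): linear int8 quantization of a float vector. */
final class ScalarQuantizationSketch {

  /** Maps each component, clamped to [min, max], onto an integer in [0, 127] stored in one byte. */
  static byte[] quantize(float[] vector, float min, float max) {
    float scale = 127f / (max - min);
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      float clamped = Math.max(min, Math.min(max, vector[i]));
      out[i] = (byte) Math.round((clamped - min) * scale);
    }
    return out;
  }

  /** Lossy inverse of quantize: recovers an approximation of the original components. */
  static float[] dequantize(byte[] quantized, float min, float max) {
    float step = (max - min) / 127f;
    float[] out = new float[quantized.length];
    for (int i = 0; i < quantized.length; i++) {
      out[i] = min + quantized[i] * step;
    }
    return out;
  }
}
```

The round trip is lossy: each recovered component can differ from the original by up to half a quantization step.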
- Improvements (16)
- GITHUB#12523: TaskExecutor waits for all tasks to complete before returning when Exceptions
are thrown during concurrent operations
(Michael Peterson)
- GITHUB#12574: Make TaskExecutor public so that it can be retrieved from the searcher and used
outside of the o.a.l.search package
(Luca Cavanna)
- GITHUB#12603: Simplify the TaskExecutor API by exposing a single invokeAll method that takes a
collection of callables, executes them and returns their results
(Luca Cavanna)
- GITHUB#12606: Create a TaskExecutor when an executor is not provided to the IndexSearcher, in
order to simplify consumer's code
(Luca Cavanna)
- GITHUB#12676: Improve logging of vector support if vector module was enabled but Java version
is too old. It also logs partial vectorization support if the CPU is old or AVX is disabled.
(Uwe Schindler, Robert Muir)
- GITHUB#12677: Better detect vector module in non-default setups (e.g., custom module layers).
(Uwe Schindler)
- GITHUB#12634, GITHUB#12632, GITHUB#12680, GITHUB#12681, GITHUB#12731, GITHUB#12737: Speed up
Panama vector support and test improvements.
(Uwe Schindler, Robert Muir)
- GITHUB#12586: Remove over-counting of deleted terms.
(Guo Feng)
- GITHUB#12705, GITHUB#12705, GITHUB#12785: Improve handling of NullPointerException and
IllegalStateException in MMapDirectory's IndexInputs. Also makes sure to close master
MemorySegmentIndexInput while not throwing IllegalStateException (retry in spin loop).
Also improve TestMmapDirectory.testAceWithThreads to run faster and use less resources.
(Uwe Schindler, Maurizio Cimadamore, Michael Sokolov)
- GITHUB#12689: TaskExecutor to cancel all tasks on exception to avoid needless computation.
(Luca Cavanna)
- GITHUB#12754: Refactor lookup of Hotspot VM options and do not initialize constants with NULL
if SecurityManager prevents access.
(Uwe Schindler)
- GITHUB#12801: Remove possible contention on a ReentrantReadWriteLock in
Monitor which could result in searches waiting for commits.
(Davis Cook)
- GITHUB#11277, LUCENE-10241: Upgrade OpenNLP to 1.9.4.
(Jeff Zemerick)
- GITHUB#12542: FSTCompiler can now approximately limit how much RAM it uses to share
suffixes during FST construction using the suffixRAMLimitMB method. Larger values
result in a more minimal FST (more common suffixes are shared). Pass
Double.POSITIVE_INFINITY to use as much RAM as is needed to create a purely
minimal FST. Inspired by this Rust FST implementation:
https://blog.burntsushi.net/transducers
(Mike McCandless)
- GITHUB#12738: NodeHash now stores the FST nodes data instead of just node addresses
(Anh Dung Bui)
- GITHUB#12847: Test2BFST now reports the time it took to build the FST and the real FST size
(Anh Dung Bui)
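The TaskExecutor entries above (GITHUB#12523, GITHUB#12603, GITHUB#12689) describe a single invokeAll that runs a collection of callables, waits for all of them, and cancels outstanding work on failure. The following is a minimal sketch of that behaviour built only on java.util.concurrent. The class name and the exact semantics are assumptions made for illustration; it is not Lucene's TaskExecutor.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;

/** Hypothetical helper (not Lucene's TaskExecutor). */
final class InvokeAllSketch {

  /** Runs all callables on the executor, waits for every task, and rethrows the first failure. */
  static <T> List<T> invokeAll(Executor executor, Collection<Callable<T>> callables)
      throws Exception {
    List<FutureTask<T>> tasks = new ArrayList<>();
    for (Callable<T> callable : callables) {
      tasks.add(new FutureTask<>(callable));
    }
    for (FutureTask<T> task : tasks) {
      executor.execute(task);
    }
    List<T> results = new ArrayList<>();
    Throwable failure = null;
    for (FutureTask<T> task : tasks) {
      try {
        results.add(task.get());
      } catch (ExecutionException e) {
        if (failure == null) {
          failure = e.getCause(); // unwrap so callers see the original exception
          for (FutureTask<T> other : tasks) {
            other.cancel(false); // skip tasks that have not started yet
          }
        }
      } catch (CancellationException e) {
        // task was cancelled after an earlier failure; nothing to collect
      }
    }
    if (failure instanceof Exception) {
      throw (Exception) failure;
    }
    if (failure instanceof Error) {
      throw (Error) failure;
    }
    return results;
  }
}
```

Passing `Runnable::run` as the executor runs every task on the calling thread, which is convenient for trying the sketch out.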
- Optimizations (26)
- GITHUB#12183: Make TermStates#build concurrent.
(Shubham Chaudhary)
- GITHUB#12573: Use radix sort to speed up the sorting of deleted terms.
(Guo Feng)
- GITHUB#12382: Faster top-level conjunctions on term queries when sorting by
descending score.
(Adrien Grand)
- GITHUB#12591: Use stable radix sort to speed up the sorting of update terms.
(Guo Feng)
- GITHUB#12587: Use radix sort to speed up the sorting of terms in TermInSetQuery.
(Guo Feng)
- GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter
to reduce GC load during indexing.
(Guo Feng)
- GITHUB#12623: Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter.
(Guo Feng)
- GITHUB#12631: Write MSB VLong for better outputs sharing in block tree index, decreasing ~14% size
of .tip file.
(Guo Feng)
- GITHUB#12668: ImpactsEnums now decode frequencies lazily like PostingsEnums.
(Adrien Grand)
- GITHUB#12651: Use 2d array for OnHeapHnswGraph representation.
(Patrick Zhai)
- GITHUB#12653: Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip.
(Shubham Chaudhary)
- GITHUB#12589: Disjunctions now sometimes run as conjunctions when the minimum
competitive score requires multiple clauses to match.
(Adrien Grand)
- GITHUB#12710: Use Arrays#mismatch for Outputs#common operations.
(Guo Feng)
- GITHUB#12712: Speed up sorting postings file with an offline radix sorter in BPIndexReader.
(Guo Feng)
- GITHUB#12702: Disable suffix sharing for block tree index, making writing the terms dictionary index faster
and less RAM hungry, while making the index a bit larger (~1.X% for the terms index file on wikipedia).
(Guo Feng, Mike McCandless)
- GITHUB#12726: Return the same input vector if it is a unit vector in VectorUtil#l2normalize.
(Shubham Chaudhary)
- GITHUB#12719: Top-level conjunctions that are not sorted by score now have a
specialized bulk scorer.
(Adrien Grand)
- GITHUB#12696: Change Postings back to using FOR in Lucene99PostingsFormat. Freqs, positions and offsets keep using PFOR.
(Jakub Slowinski)
- GITHUB#1052: Faster merging of terms enums.
(Adrien Grand)
- GITHUB#11903: Faster sort on high-cardinality string fields.
(Adrien Grand)
- GITHUB#12381: Skip docs with DocValues in NumericLeafComparator.
(Lu Xugang, Adrien Grand)
- GITHUB#12784: Cache buckets to speed up BytesRefHash#sort.
(Guo Feng)
- GITHUB#12806: Utilize exact kNN search when gathering k >= numVectors in a segment
(Ben Trent)
- GITHUB#12782: Use group-varint encoding for the tail of postings.
(Adrien Grand, Zhang Chao)
- GITHUB#12748: Specialize arc store for continuous label in FST.
(Guo Feng, Chao Zhang)
- GITHUB#12825, GITHUB#12834: Hunspell: improved dictionary loading performance, allowed in-memory entry sorting.
(Peter Gromov)
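For readers unfamiliar with the group-varint encoding mentioned in GITHUB#12782 above, here is a minimal sketch of the general technique: four integers share one header byte that records how many bytes each value occupies, so decoding avoids a per-byte continuation check. The class name, header bit layout, and byte order are assumptions made for illustration and are not claimed to match Lucene's on-disk format.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper (not a Lucene class): group-varint for blocks of four ints. */
final class GroupVarintSketch {

  /** Encodes exactly four ints as one header byte followed by 1 to 4 bytes per value. */
  static byte[] encode(int[] values) {
    if (values.length != 4) {
      throw new IllegalArgumentException("expected exactly 4 values");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int header = 0;
    for (int i = 0; i < 4; i++) {
      // two bits per value: (byte length - 1), first value in the highest bits
      header |= (numBytes(values[i]) - 1) << (6 - 2 * i);
    }
    out.write(header);
    for (int v : values) {
      int n = numBytes(v);
      for (int b = 0; b < n; b++) {
        out.write(v >>> (8 * b)); // write(int) keeps only the low 8 bits
      }
    }
    return out.toByteArray();
  }

  /** Decodes one group of four ints produced by encode. */
  static int[] decode(byte[] bytes) {
    int header = bytes[0] & 0xFF;
    int pos = 1;
    int[] values = new int[4];
    for (int i = 0; i < 4; i++) {
      int n = ((header >>> (6 - 2 * i)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < n; b++) {
        v |= (bytes[pos++] & 0xFF) << (8 * b);
      }
      values[i] = v;
    }
    return values;
  }

  /** Number of bytes needed to hold v; zero still takes one byte. */
  static int numBytes(int v) {
    int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(v));
    return (bits + 7) / 8;
  }
}
```

Small values dominate the tail of a postings list, so most groups need far fewer than the 16 bytes that four fixed-width ints would take.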
- Changes in runtime behavior (3)
- GITHUB#12569: Prevent concurrent tasks from parallelizing execution further which could cause deadlock
(Luca Cavanna)
- GITHUB#12765: Disable vectorization on VMs that are not Hotspot-based.
(Uwe Schindler, Robert Muir)
- GITHUB#12552: Make FSTPostingsFormat load FSTs off-heap.
(Tony X)
- Bug Fixes (11)
- GITHUB#12654: TestIndexWriterOnVMError.testUnknownError times out (fixes potential IndexWriter
deadlock with tragic exceptions).
(Benjamin Trent, Dawid Weiss, Simon Willnauer)
- GITHUB#12614: Make LRUQueryCache respect Accountable queries on eviction and consistency check
(Grigoriy Troitskiy)
- GITHUB#11556: HTMLStripCharFilter fails on '>' or '<' characters in attribute values.
(Elliot Lin)
- GITHUB#12698: Fix IndexOutOfBoundsException when saving FSTStore-backed FST with different DataOutput for metadata
(Anh Dung Bui)
- GITHUB#12642: Ensure #finish only gets called once on the base collector during drill-sideways
(Greg Miller)
- GITHUB#12682: Scorer should sum up scores into a double.
(Shubham Chaudhary)
- GITHUB#12727: Ensure negative scores are not returned by vector similarity functions
(Ben Trent)
- GITHUB#12736: Fix NullPointerException when Monitor.getQuery cannot find the requested queryId
(Davis Cook)
- GITHUB#12770: Stop exploring HNSW graph if scores are not getting better.
(Ben Trent)
- GITHUB#12640: Ensure #finish is called on all drill-sideways collectors even if one throws a
CollectionTerminatedException
(Greg Miller)
- GITHUB#12626: Fix segmentInfos replace to set userData
(Shibi Balamurugan, Uwe Schindler, Marcus Eagan, Michael Froh)
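The motivation behind GITHUB#12682 above (summing scores into a double) can be seen with plain float arithmetic: once a float accumulator is large, adding a small score may be rounded away entirely. The sketch below uses a hypothetical class and made-up score values purely to demonstrate the rounding effect; it is not code from any Lucene Scorer.

```java
/** Hypothetical demo (not a Lucene class): float versus double accumulation. */
final class ScoreSumSketch {

  /** Accumulates in a float, so small addends can be lost once the sum is large. */
  static float sumAsFloat(float[] scores) {
    float sum = 0f;
    for (float score : scores) {
      sum += score;
    }
    return sum;
  }

  /** Accumulates in a double and narrows to float only once, at the end. */
  static float sumAsDouble(float[] scores) {
    double sum = 0d;
    for (float score : scores) {
      sum += score;
    }
    return (float) sum;
  }
}
```

With the scores `{16777216f, 1f, 1f}` the float accumulator never moves past 16777216, while the double accumulator reaches 16777218, which is exactly representable as a float.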
- Build (5)
- GITHUB#12752: tests.multiplier could be omitted in test failure reproduce lines (esp. in
nightly mode).
(Dawid Weiss)
- GITHUB#12742: JavaCompile tasks may be in up-to-date state when modular dependencies have changed
leading to odd runtime errors
(Chris Hostetter, Dawid Weiss)
- GITHUB#12612: Upgrade forbiddenapis to version 3.6 and ASM for APIJAR extraction to 9.6.
(Uwe Schindler)
- GITHUB#12655: Upgrade to Gradle 8.4
(Kevin Risden)
- GITHUB#12845: Only enable support for tests.profile if jdk.jfr module is available
in Gradle runtime.
(Uwe Schindler)
- Other (5)
- GITHUB#12817: Add demo for faceting with StringValueFacetCounts over KeywordField and SortedDocValuesField.
(Stefan Vodita)
- GITHUB#12657: Internal refactor of HNSW graph merging
(Ben Trent).
- GITHUB#12625: Refactor ByteBlockPool so it is just a "shift/mask big array".
(Ignacio Vera)
- GITHUB#6675: Various improvements related to ByteBlockPool. Slice functionality on top of ByteBlockPool moved to its
own class, ByteSlicePool. ByteBlockPool's array of buffers is made private. There are new exceptions for buffer index
overflows and slices that are too large. Some bits of code are simplified. Documentation is updated and expanded.
(Stefan Vodita)
- GITHUB#12762: Refactor BKD HeapPointWriter to hide the internal data structure.
(Ignacio Vera)
- API Changes (3)
- GITHUB#12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException
(Gokul Manoj)
- GITHUB#11248: IntBlockPool's SliceReader, SliceWriter, and all int slice functionality are moved out to MemoryIndex.
(Stefan Vodita)
- GITHUB#12436: Move max vector dims limit to Codec
(Mayya Sharipova)
- New Features (6)
- GITHUB#12380: Introduced LeafCollector#finish, a hook that runs after
collection has finished running on a leaf.
(Adrien Grand)
- LUCENE-8183, GITHUB#9231: Added the ability to get noSubMatches and noOverlappingMatches in
HyphenationCompoundWordFilter
(Martin Demberger, original from Rupert Westenthaler)
- GITHUB#12434: Add `KnnCollector` to `LeafReader` and `KnnVectorReader` so that custom collection of vector
search results can be provided. The first custom collector provides `ToParentBlockJoin[Float|Byte]KnnVectorQuery`
joining child vector documents with their parent documents.
(Ben Trent)
- GITHUB#12479: Add new Maximum Inner Product vector similarity function for non-normalized dot-product
vector search.
(Jack Mazanec, Ben Trent)
- GITHUB#12525: `WordDelimiterGraphFilterFactory` now supports the `ignoreKeywords` flag
(Thomas De Craemer)
- GITHUB#12489: Add support for recursive graph bisection, also called
bipartite graph partitioning, and often abbreviated BP, an algorithm for
reordering doc IDs that results in more compact postings and faster queries,
especially conjunctions.
(Adrien Grand)
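The Maximum Inner Product similarity from GITHUB#12479 above works with non-normalized vectors, so the raw dot product can be negative or arbitrarily large, while scores are expected to be non-negative. One way to handle this is a monotonic mapping of the dot product onto positive values. The sketch below shows such a mapping; the class and method names are hypothetical, and the formula is an assumption for illustration rather than a statement of what Lucene does internally.

```java
/** Hypothetical helper (not a Lucene class): maps a raw dot product to a positive score. */
final class InnerProductScoreSketch {

  /**
   * Negative dot products are squeezed into (0, 1); non-negative ones are shifted to [1, inf).
   * The mapping is monotonic, so ranking by score equals ranking by dot product.
   */
  static float scale(float dotProduct) {
    if (dotProduct < 0) {
      return 1f / (1f - dotProduct);
    }
    return dotProduct + 1f;
  }
}
```

For example, dot products of -1, 0 and 3 map to 0.5, 1 and 4 respectively.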
- Improvements (5)
- GITHUB#12374: Add CachingLeafSlicesSupplier to compute the LeafSlices for concurrent segment search
(Sorabh Hamirwasia)
- GITHUB#12499: Simplify task executor for concurrent operations by offloading concurrent operations to the
provided executor unconditionally.
(Luca Cavanna)
- GITHUB#12464: Hunspell: allow customizing the hash table load factor
(Peter Gromov)
- GITHUB#12468: Hunspell: check for aff file wellformedness more strictly
(Peter Gromov)
- GITHUB#12491: Hunspell: speed up the dictionary enumeration on suggestion
(Peter Gromov)
- Optimizations (13)
- GITHUB#12377: Avoid a redundant loop when computing the min value in DirectMonotonicWriter.
(Zhang Chao)
- GITHUB#12361: Faster top-level disjunctions sorted by descending score.
(Adrien Grand)
- GITHUB#12444: Faster top-level disjunctions sorted by descending score in
case of many terms or queries that expose suboptimal score upper bounds.
(Adrien Grand)
- GITHUB#12383: Assign a dummy simScorer in TermsWeight if score is not needed.
(Sagar Upadhyaya)
- GITHUB#12372: Reduce allocation during HNSW construction
(Jonathan Ellis)
- GITHUB#12385: Restore parallel knn query rewrite across segments rather than slices
(Luca Cavanna)
- GITHUB#12381: Speed up NumericDocValuesWriter with index sorting.
(Zhang Chao)
- GITHUB#12453: Faster bulk numeric reads from BufferedIndexInput
(Armin Braun)
- GITHUB#12415: Optimized counts on disjunctive queries.
(Adrien Grand)
- GITHUB#12518: Use panama vector API to speed up l2norm calculations
(Ben Trent)
- GITHUB#12480: Lazy computation of similarity score during initializeFromGraph
(Jack Wang)
- GITHUB#12490: Faster computation of top-k hits on boolean queries.
(Adrien Grand)
- GITHUB#12560: ExpressionValueSource defers #advanceExact on dependencies until their values are needed, avoiding
unnecessary advancing on values that are never evaluated (e.g., because of ternary expressions).
(Greg Miller)
- Changes in runtime behavior (3)
- GITHUB#12516: Unwrap execution exceptions and throw their cause when performing concurrent search
(Luca Cavanna)
- GITHUB#12498: Offload concurrent search execution to the executor that's optionally provided to the IndexSearcher.
Tasks are no longer executed on the caller thread when rejected or if the executor queue goes above a predefined
threshold. Adaptive behaviour as well as a saturation policy can be incorporated in the provided executor instead.
(Luca Cavanna)
- GITHUB#12515: Offload sequential search execution to the executor that's optionally provided to the IndexSearcher
(Luca Cavanna)
- Bug Fixes (10)
- GITHUB#9660: Throw an ArithmeticException when the offset overflows in a ByteBlockPool.
(Stefan Vodita)
- GITHUB#11537: Fix stack overflow in RegExp for long strings by reducing recursion.
(Jakub Slowinski)
- GITHUB#12388: JoinUtil queries were ignoring boosts.
(Alan Woodward)
- GITHUB#12413: Fix HNSW graph search bug that potentially leaked unapproved docs
(Ben Trent).
- GITHUB#12423: Respect timeouts in ExitableDirectoryReader when searching with byte[] vectors
(Ben Trent).
- GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458
(Greg Miller).
- GITHUB#12472: UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any
impact on the user, unless you explicitly invoke the convert function of UTF32ToUTF8, and in the extremely rare
scenario of searching a non-UTF-8 inverted field with Unicode search terms
(Tang Donghai).
- LUCENE-12521: Sort After returning incorrect results when missing values are competitive.
(Chaitanya Gohel)
- GITHUB#12555: Fix bug in TermsEnum#seekCeil on doc values terms enums
that causes IndexOutOfBoundsException.
(Egor Potemkin)
- GITHUB#12571: Fix HNSW graph read bug when built with excessive connections.
(Ben Trent).
- Other (4)
- GITHUB#12404: Remove usage of some legacy java.util classes (Stack, Hashtable, Vector) and add them to forbiddenapis.
(Uwe Schindler)
- GITHUB#12410: Refactor vectorization support (split provider from implementation classes).
(Uwe Schindler, Chris Hegarty)
- GITHUB#12428: Replace consecutive close() calls and close() calls with null checks with IOUtils.close().
(Shubham Chaudhary)
- GITHUB#12512: Remove unused variable in BKDWriter.
(Zhang Chao)
- API Changes (4)
- GITHUB#11840, GITHUB#12304: Query rewrite now takes an IndexSearcher instead of
IndexReader to enable concurrent rewriting. Please note: This is implemented in
a backwards compatible way. A query overriding either of the two rewrite methods is
supported. To implement this backwards layer in Lucene 9.x the
RuntimePermission "accessDeclaredMembers" is needed in applications using
SecurityManager.
(Patrick Zhai, Ben Trent, Uwe Schindler)
- GITHUB#12321: DaciukMihovAutomatonBuilder has been marked deprecated in preparation of reducing its visibility in
a future release.
(Greg Miller)
- GITHUB#12268: Add BitSet.clear() without parameters for clearing the entire set
(Jonathan Ellis)
- GITHUB#12346: add new IndexWriter#updateDocuments(Query, Iterable<Document>) API
to update documents atomically, with respect to refresh and commit using a query.
(Patrick Zhai)
- New Features (4)
- GITHUB#12257: Create OnHeapHnswGraphSearcher to let OnHeapHnswGraph be searched in a thread-safe manner.
(Patrick Zhai)
- GITHUB#12302, GITHUB#12311, GITHUB#12363: Add vectorized implementations of VectorUtil.dotProduct(),
squareDistance(), cosine() with Java 20 or 21 jdk.incubator.vector APIs. Applications started
with command line parameter "java --add-modules jdk.incubator.vector" on exactly Java 20 or 21
will automatically use the new vectorized implementations if running on a supported platform
(x86 AVX2 or later, ARM NEON). This is an opt-in feature and requires explicit Java
command line flag! When enabled, Lucene logs a notice using java.util.logging. Please test
thoroughly and report bugs/slowness to Lucene's mailing list.
(Chris Hegarty, Robert Muir, Uwe Schindler)
- GITHUB#12294: Add support for Java 21 foreign memory API. If Java 19 up to 21 is used,
MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) and indexes
closed while queries are running can no longer crash the JVM. To disable this feature,
pass the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
(Uwe Schindler)
- GITHUB#12252: Add function queries for computing similarity scores between knn vectors.
(Elia Porciani, Alessandro Benedetti)
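For reference, the three operations named in the vectorization entry above (dotProduct, squareDistance, cosine) can be written as plain scalar loops. The sketch below is a baseline of the kind a vectorized implementation would be compared against; the class name is hypothetical and the code is illustrative, not the source of Lucene's VectorUtil.

```java
/** Hypothetical scalar baseline (not Lucene's VectorUtil). */
final class ScalarVectorOpsSketch {

  /** Sum of element-wise products. */
  static float dotProduct(float[] a, float[] b) {
    checkSameLength(a, b);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Squared Euclidean distance, i.e. without the final square root. */
  static float squareDistance(float[] a, float[] b) {
    checkSameLength(a, b);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Cosine of the angle between a and b; both must be non-zero vectors. */
  static float cosine(float[] a, float[] b) {
    checkSameLength(a, b);
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (float) (dot / Math.sqrt(normA * normB));
  }

  private static void checkSameLength(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
    }
  }
}
```

Each loop processes one component per iteration; a SIMD implementation would handle several lanes per instruction on supporting hardware.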
- Improvements (7)
- GITHUB#12245: Add support for Score Mode to `ToParentBlockJoinQuery` explain.
(Marcus Eagan via Mikhail Khludnev)
- GITHUB#12305: Minor cleanup and improvements to DaciukMihovAutomatonBuilder.
(Greg Miller)
- GITHUB#12325: Parallelize AbstractKnnVectorQuery rewrite across slices rather than segments.
(Luca Cavanna)
- GITHUB#12333: NumericLeafComparator#competitiveIterator makes better use of a "search after" value when paginating.
(Chaitanya Gohel)
- GITHUB#12290: Make memory fence in ByteBufferGuard explicit using `VarHandle.fullFence()`
- GITHUB#12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit.
(Greg Miller)
- GITHUB#12281: Require indexed KNN float vectors and query vectors to be finite.
(Jonathan Ellis, Uwe Schindler)
- Optimizations (9)
- GITHUB#12324: Speed up sparse block advanceExact with tiny step in IndexedDISI.
(Guo Feng)
- GITHUB#12270: Don't generate stacktrace in CollectionTerminatedException.
(Armin Braun)
- GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery.
(Kaival Parikh)
- GITHUB#12286: Toposort now uses an iterator to avoid stack overflow.
(Tang Donghai)
- GITHUB#12235: Optimize HNSW diversity calculation.
(Patrick Zhai)
- GITHUB#12328: Optimize ConjunctionDISI.createConjunction
(Armin Braun)
- GITHUB#12357: Better paging when doing backwards random reads. This speeds up
queries relying on terms in NIOFSDirectory and SimpleFSDirectory.
(Alan Woodward)
- GITHUB#12339: Avoid part of the duplicated numDeletesToMerge calculation in the merge phase
(fudongying)
- GITHUB#12334: Honor after value for skipping documents even if queue is not full for PagingFieldCollector
(Chaitanya Gohel)
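GITHUB#12270 above relies on a general JDK facility: an exception used purely for control flow can skip stack trace capture, which is typically the expensive part of constructing it. The sketch below shows that facility using the protected four-argument RuntimeException constructor. The class names are hypothetical; this is not the source of CollectionTerminatedException.

```java
/** Hypothetical demo (not a Lucene class): a control-flow exception without a stack trace. */
final class StacklessExceptionSketch {

  /** Thrown to stop a loop early; carries no message, cause, or stack trace. */
  static final class StopCollecting extends RuntimeException {
    StopCollecting() {
      // enableSuppression = false, writableStackTrace = false
      super(null, null, false, false);
    }
  }

  /** Counts docs until the limit is hit, using the exception to terminate early. */
  static int collectUpTo(int[] docs, int limit) {
    int collected = 0;
    try {
      for (int i = 0; i < docs.length; i++) {
        if (collected == limit) {
          throw new StopCollecting();
        }
        collected++;
      }
    } catch (StopCollecting e) {
      // expected: early termination, nothing to report
    }
    return collected;
  }
}
```

Because the stack trace is not writable, `getStackTrace()` on such an exception returns an empty array.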
- Bug Fixes (4)
- GITHUB#12291: Skip blank lines from stopwords list.
(Jerry Chin)
- GITHUB#11350: Handle possible differences in FieldInfo when merging indices created with Lucene 8.x
(Tomás Fernández Löbbe)
- GITHUB#12352: [Tessellator] Improve the checks that validate the diagonal between two polygon nodes so
the resulting polygons are valid counter clockwise polygons.
(Ignacio Vera)
- LUCENE-10181: Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth.
(Chris Fournier)
- Other (1)
- (No changes)
- API Changes (4)
- GITHUB#12116: Introduce IndexableField#storedValue() to expose the value that
should be stored to IndexingChain without needing to guess the field's type.
(Adrien Grand, Robert Muir)
- GITHUB#12129: Move DocValuesTermsQuery from sandbox to SortedDocValuesField#newSlowSetQuery
and SortedSetDocValuesField#newSlowSetQuery.
(Robert Muir)
- GITHUB#12173: TermInSetQuery#getTermData has been deprecated. This exposes internal implementation details that we
may want to change in the future, and users shouldn't rely on the encoding directly.
(Greg Miller)
- GITHUB#11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount.
(Greg Miller)
- New Features (3)
- GITHUB#12054: Introduce a new KeywordField for simple and efficient
filtering, sorting and faceting.
(Adrien Grand)
- GITHUB#12188: Add support for Java 20 foreign memory API. If exactly Java 19
or 20 is used, MMapDirectory will mmap Lucene indexes in chunks of 16 GiB
(instead of 1 GiB) and indexes closed while queries are running can no longer
crash the JVM. To disable this feature, pass the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
(Uwe Schindler)
- GITHUB#12169: Introduce a new token filter to expand synonyms based on Word2Vec DL4j models.
(Daniele Antuzi, Ilaria Petreti, Alessandro Benedetti)
- Improvements (5)
- GITHUB#12055: MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE rewrite method introduced and used as the new default
for multi-term queries with a FILTER rewrite (PrefixQuery, WildcardQuery, TermRangeQuery). This introduces better
skipping support for common use-cases.
(Adrien Grand, Greg Miller)
- GITHUB#12156: TermInSetQuery now extends MultiTermQuery instead of providing its own custom implementation (which
was essentially a clone of MultiTermQuery#CONSTANT_SCORE_REWRITE). It uses the new CONSTANT_SCORE_BLENDED_REWRITE
by default, but can be overridden through the constructor.
(Greg Miller)
- GITHUB#12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod.
(Greg Miller)
- GITHUB#12166: Remove the now unused class pointInPolygon.
(Marcus Eagan via Christine Poerschke and Nick Knize)
- GITHUB#12126: Refactor part of IndexFileDeleter and ReplicaFileDeleter into a public common utility class
FileDeleter.
(Patrick Zhai)
- Optimizations (9)
- GITHUB#11900: BloomFilteringPostingsFormat now uses multiple hash functions
in order to achieve the same false positive probability with less memory.
(Jean-François Boeuf)
- GITHUB#12118: Optimize FeatureQuery to TermQuery & weight when scoring is not required.
(Ben Trent, Robert Muir)
- GITHUB#12128, GITHUB#12133: Speed up docvalues set query by making use of sortedness.
(Robert Muir, Uwe Schindler)
- GITHUB#12050: Reuse HNSW graph for initialization during merge
(Jack Mazanec)
- GITHUB#12155: Speed up DocValuesRewriteMethod by making use of sortedness.
(Greg Miller)
- GITHUB#12139: Faster indexing of string fields.
(Adrien Grand)
- GITHUB#12179: Better PostingsEnum reuse in MultiTermQueryConstantScoreBlendedWrapper.
(Greg Miller)
- GITHUB#12198, GITHUB#12199: Reduced contention when indexing with many threads.
(Adrien Grand)
- GITHUB#12241: Add ordering of files in compound files.
(Christoph Büscher)
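To illustrate the idea behind GITHUB#11900 above, here is a minimal Bloom filter that sets several bit positions per term instead of one. Deriving the k positions from two base hashes (double hashing) is an assumption made to keep the sketch short, and the class name and hash mixing are hypothetical; none of this is claimed to match BloomFilteringPostingsFormat.

```java
import java.util.BitSet;

/** Hypothetical helper (not a Lucene class): Bloom filter with k derived hash functions. */
final class BloomFilterSketch {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  BloomFilterSketch(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  /** Sets one bit per hash function. */
  void add(String term) {
    int h1 = term.hashCode();
    int h2 = mix(h1);
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(h1, h2, i));
    }
  }

  /** False means definitely absent; true means possibly present. */
  boolean mightContain(String term) {
    int h1 = term.hashCode();
    int h2 = mix(h1);
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(h1, h2, i))) {
        return false;
      }
    }
    return true;
  }

  /** i-th derived hash: h1 + i * h2, reduced to a valid bit index. */
  private int index(int h1, int h2, int i) {
    return Math.floorMod(h1 + i * h2, numBits);
  }

  /** Integer mixer producing an odd second hash so the derived positions spread out. */
  private static int mix(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h | 1;
  }
}
```

A term that was added always reports `true`; a term that was never added may still report `true`, and the number of bits and hash functions together control how often that happens.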
- Bug Fixes (8)
- GITHUB#12158: KeywordField#newSetQuery should clone input BytesRef[] to avoid modifying provided array.
(Greg Miller)
- GITHUB#12196: Fix MultiFieldQueryParser to handle both query boost and phrase slop at the same time.
(Jasir KT)
- GITHUB#12202: Fix MultiFieldQueryParser to apply boosts to regexp, wildcard, prefix, range, fuzzy queries.
(Jasir KT)
- GITHUB#12178: Add explanations for TermAutomatonQuery
(Marcus Eagan via Patrick Zhai, Mike McCandless, Robert Muir, Mikhail Khludnev)
- GITHUB#12214: Fix ordered intervals query to avoid skipping some of the results over interleaved terms.
(Hongyu Yan)
- GITHUB#12212: Bug fix for a DrillSideways issue where matching hits could occasionally be missed.
(Frederic Thevenet)
- GITHUB#12220: Hunspell: disallow hidden title-case entries from compound middle/end
(Peter Gromov)
- GITHUB#12260: Fix SynonymQuery equals implementation to take the targeted field name into account
(Luca Cavanna)
- Build (3)
- GITHUB#12131: Generate gradle.properties from gradlew, if absent
(Colvin Cowie, Uwe Schindler)
- GITHUB#12188: Building the lucene-core MR-JAR file is now possible without installing
additionally required Java versions (Java 19, Java 20,...). For compilation, a special
JAR file with Panama-foreign API signatures of each supported Java version was added to
source tree. Those can be regenerated on demand with "gradlew :lucene:core:regenerate".
(Uwe Schindler)
- GITHUB#12215: Upgrade forbiddenapis to version 3.5. This tones down some verbose warnings
printed while checking Java 19 and Java 20 sourcesets for the MR-JAR.
(Uwe Schindler)
- Documentation (1)
- GITHUB#10633: Update javadocs in TestBackwardsCompatibility to use gradle and not ant.
(Usman Shaikh)
- Other (2)
- GITHUB#11868: Add a FilterIndexInput and FilterIndexOutput class to more easily and safely create delegate
IndexInput and IndexOutput classes
(Marc D'Mello)
- GITHUB#12239: Hunspell: reduced suggestion set dependency on the hash table order
(Peter Gromov)
- API Changes (20)
- GITHUB#12093: Deprecate support for UTF8TaxonomyWriterCache and changed default to LruTaxonomyWriterCache.
Please use LruTaxonomyWriterCache instead.
(Vigya Sharma)
- GITHUB#11998: Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.
(Adrien Grand, Robert Muir)
- GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now properly returns "top" children instead
of all children.
(Greg Miller)
- GITHUB#11772: Removed native subproject and WindowsDirectory implementation from lucene.misc. Recommendation:
use MMapDirectory implementation on Windows.
(Robert Muir, Uwe Schindler, Dawid Weiss)
- GITHUB#11804: FacetsCollector#collect is no longer final, allowing extension.
(Greg Miller)
- GITHUB#11761: TieredMergePolicy now allows a maximum allowable deletes percentage as low as 5%, and the default
maximum allowable deletes percentage is changed from 33% to 20%.
(Marc D'Mello)
- GITHUB#11822: Configure replicator PrimaryNode replica shutdown timeout.
(Steven Schlansker)
- GITHUB#11930: Added IOContext#LOAD for files that are a small fraction of the
total index size and heavily accessed with a random access pattern. Some
Directory implementations may choose to load files that use this IOContext in
memory to provide stronger guarantees on query latency.
(Adrien Grand, Uwe Schindler)
- GITHUB#11941: QueryBuilder#add and #newSynonymQuery methods now take a `field` parameter,
to avoid possible exceptions when building queries from an empty term list. The helper
TermAndBoost class now holds a BytesRef rather than a Term.
(Alan Woodward)
- GITHUB#11961: VectorValues#EMPTY was removed as this instance was not
necessary and also illegal as it reported a number of dimensions equal to
zero.
(Adrien Grand)
- GITHUB#11962: VectorValues#cost() now delegates to VectorValues#size().
(Adrien Grand)
- GITHUB#11984: Improved TimeLimitBulkScorer to check the timeout at an exponential rate.
(Costin Leau)
- GITHUB#12004: Add new KnnByteVectorQuery for querying vector fields that are encoded as BYTE. Removes the ability to
use KnnVectorQuery against fields encoded as BYTE
(Ben Trent)
- GITHUB#11997: Introduce IntField, LongField, FloatField and DoubleField.
These new fields index both 1D points and sorted numeric doc values and
provide best performance for filtering and sorting.
(Francisco Fernández Castaño, Adrien Grand)
- GITHUB#12066: Retire/deprecate instance method MMapDirectory#setUseUnmap().
Like the new setting for MemorySegments, this feature is enabled by default and
can only be disabled globally by passing the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableUnmapHack=false"
(Uwe Schindler)
- GITHUB#12038: Deprecate non-NRT replication support.
Please migrate to org.apache.lucene.replicator.nrt instead.
(Robert Muir)
- GITHUB#12087: Move DocValuesNumbersQuery from sandbox to NumericDocValuesField#newSlowSetQuery
and SortedNumericDocValuesField#newSlowSetQuery. IntField, LongField, FloatField, and DoubleField
implement newSetQuery with best-practice use of IndexOrDocValuesQuery.
(Robert Muir)
- GITHUB#12064: Create new KnnByteVectorField, ByteVectorValues and KnnVectorsReader#getByteVectorValues(String)
that are specialized for byte-sized vectors, and clarify the public API by making a clear distinction
between classes that produce and read float vectors and those that produce and read byte vectors.
(Ben Trent)
- GITHUB#12101: Remove VectorValues#binaryValue(). Vectors should only be
accessed through their high-level representation, via
VectorValues#vectorValue().
(Adrien Grand)
- GITHUB#12105: Deprecate KnnVectorField in favour of KnnFloatVectorField,
KnnVectorQuery in favour of KnnFloatVectorQuery, and LeafReader#getVectorValues
in favour of LeafReader#getFloatVectorValues.
(Luca Cavanna)
- New Features (7)
- GITHUB#11795: Add ByteWritesTrackingDirectoryWrapper to expose metrics for bytes merged, flushed, and overall
write amplification factor.
(Marc D'Mello)
- GITHUB#11929: MMapDirectory gives more granular control on which files to
preload.
(Adrien Grand, Uwe Schindler)
- GITHUB#11999: MemoryIndex now supports stored fields.
(Alan Woodward)
- GITHUB#11997: Add IntField, LongField, FloatField and DoubleField: easy to
use numeric fields that perform well both for filtering and sorting.
(Francisco Fernández Castaño)
- GITHUB#12033: Java 19 foreign memory support is now enabled by default,
no need to pass "--enable-preview" on the command line. If exactly Java 19 is used,
MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) and
indexes closed while queries are running can no longer crash the JVM.
To disable this feature, pass the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
(Uwe Schindler)
- GITHUB#11869: RangeOnRangeFacetCounts added, supporting numeric range "relationship" faceting over docvalue-stored
ranges.
(Marc D'Mello)
- LUCENE-10626: Hunspell: add tools to aid dictionary editing:
analysis introspection, stem expansion and stem/flag suggestion
(Peter Gromov)
- Improvements (9)
- GITHUB#11785: Improve Tessellator performance by delaying calls to the method
#isIntersectingPolygon
(Ignacio Vera)
- GITHUB#687: speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocIdSetIterator
construction using bkd binary search.
(Jianping Weng)
- GITHUB#11985: ExitableTerms to override Terms#getMin and Terms#getMax in order to avoid
iterating through the terms when the wrapped implementation caches such values.
(Luca Cavanna)
- GITHUB#11860: Improve storage efficiency of connections in the HNSW graph that Lucene uses for
vector search.
(Ben Trent)
- GITHUB#12008: Clean up LongRange#verifyAndEncode logic to remove unnecessary NaN checks.
(Greg Miller)
- GITHUB#12003: Minor cleanup/improvements to IndexSortSortedNumericDocValuesRangeQuery.
(Greg Miller)
- GITHUB#12016: Upgrade lucene/expressions to use antlr 4.11.1
(Andriy Redko)
- GITHUB#12034: Remove null check in IndexReaderContext#leaves() usages
(Erik Pellizzon)
- GITHUB#12070: Compound file creation is no longer subject to merge throttling.
(Adrien Grand)
- Bug Fixes (15)
- GITHUB#11726: Indexing term vectors on large documents could fail due to
trying to apply a dictionary whose size is greater than the maximum supported
window size for LZ4.
(Adrien Grand)
- GITHUB#11768: Taxonomy and SSDV faceting now correctly breaks ties by preferring smaller ordinal
values.
(Greg Miller)
- GITHUB#11907: Fix latent casting bugs in BKDWriter.
(Ben Trent)
- GITHUB#11954: Remove QueryTimeout#isTimeoutEnabled method and move check to caller.
(Shubham Chaudhary)
- GITHUB#11950: Fix NPE in BinaryRangeFieldRangeQuery variants when the queried field doesn't exist
in a segment or is of the wrong type.
(Greg Miller)
- GITHUB#11990: PassageSelector now has a larger minimum size for its priority queue,
so that subsequent passage merges don't mean that we return too few passages in
total.
(Alan Woodward, Dawid Weiss)
- GITHUB#11986: Fix algorithm that chooses the bridge between a polygon and a hole when there is
a common vertex.
(Ignacio Vera)
- GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain intersecting geometries.
(Craig Taverner)
- GITHUB#12058: Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9.
(Uwe Schindler)
- GITHUB#12046: Fix out-of-bounds error in CombinedFieldQuery#addTerm.
(Lu Xugang)
- GITHUB#12072: Fix exponential runtime for nested BooleanQuery#rewrite when a
BooleanClause is non-scoring.
(Ben Trent)
- GITHUB#11807: Don't rewrite queries in unified highlighter.
(Alan Woodward)
- GITHUB#12088: WeightedSpanTermExtractor should not throw UnsupportedOperationException
when it encounters a FieldExistsQuery.
(Alan Woodward)
- GITHUB#12084: Same bound with fallbackQuery.
(Lu Xugang)
- GITHUB#12077: WordBreakSpellChecker now correctly respects maxEvaluations
(hossman)
- Optimizations (18)
- GITHUB#11738: Optimize MultiTermQueryConstantScoreWrapper when a term is present that matches all
docs in a segment.
(Greg Miller)
- GITHUB#11735: KeywordRepeatFilter + OpenNLPLemmatizer always drops last token of a stream.
(Luke Kot-Zaniewski)
- GITHUB#11771: KeywordRepeatFilter + OpenNLPLemmatizer sometimes arbitrarily exits token stream.
(Luke Kot-Zaniewski)
- GITHUB#11803: DrillSidewaysScorer has been improved to leverage "advance" instead of "next" where
possible, and splits out first and second phase checks to delay match confirmation.
(Greg Miller)
- GITHUB#11828: Tweak TermInSetQuery "dense" optimization to only require all terms present in a
given field to match a term (rather than all docs in a segment). This is consistent with
MultiTermQueryConstantScoreWrapper.
(Greg Miller)
- GITHUB#11876: Use ByteArrayComparator to speed up PointInSetQuery in single dimension case.
(Guo Feng)
- GITHUB#11880: Use ByteArrayComparator to speed up BinaryRangeFieldRangeQuery, RangeFieldQuery,
LatLonPointDistanceFeatureQuery and CheckIndex.
(Guo Feng)
- GITHUB#11881: Further optimize drill-sideways scoring by specializing the single dimension case
and borrowing some concepts from "min should match" scoring.
(Greg Miller)
- GITHUB#11884: Simplify the logic of matchAll() in IndexSortSortedNumericDocValuesRangeQuery.
(Lu Xugang)
- GITHUB#11895: count() in BooleanQuery can now terminate early.
(Lu Xugang)
- GITHUB#11972: `IndexSortSortedNumericDocValuesRangeQuery` can now also
optimize query execution with points for descending sorts.
(Adrien Grand)
- GITHUB#12006: Compare ints instead of calling ArrayUtil#compareUnsigned4 in LatlonPointQueries.
(Guo Feng)
- GITHUB#12011: Minor speedup to flushing long postings lists when an index
sort is configured.
(Adrien Grand)
- GITHUB#12017: Aggressive count in BooleanWeight.
(Lu Xugang)
- GITHUB#12079: Faster merging of 1D points.
(Adrien Grand)
- GITHUB#12081: Small merging speedup on sorted indexes.
(Adrien Grand)
- GITHUB#12078: Enhance XXXField#newRangeQuery.
(Lu Xugang)
- GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved suggestion performance
(Peter Gromov)
- Other (9)
- GITHUB#11856: Fix nanos to millis conversion for tests
(Marios Trivyzas)
- LUCENE-10423: Remove usages of System.currentTimeMillis() from tests.
(Marios Trivyzas)
- GITHUB#11811: Upgrade google java format to 1.15.0
(Dawid Weiss)
- GITHUB#11834: Upgrade forbiddenapis to version 3.4.
(Uwe Schindler)
- LUCENE-10635: Ensure test coverage for WANDScorer by using a test query.
(Zach Chen, Adrien Grand)
- GITHUB#11752: Added interface to relate a LatLonShape with another shape represented as Component2D.
(Navneet Verma)
- GITHUB#11983: Make constructors for OffsetFromPositions and OffsetsFromMatchIterator
public.
(Alan Woodward)
- LUCENE-10546: Update Faceting user guide.
(Egor Potemkin)
- GITHUB#12099: Introduce support in KnnVectorQuery for getters.
(Alessandro Benedetti)
- Build (1)
- GITHUB#11886: Upgrade to gradle 7.5.1
(Dawid Weiss)
- Bug Fixes (2)
- GITHUB#11905: Fix integer overflow when seeking the vector index for connections in a single segment.
This addresses a bug that was introduced in 9.2.0 where having many vectors is not handled well
in the vector connections reader.
- GITHUB#11939: Fix incorrect cost calculation in DocIdSetBuilder after upgradeToBitSet when doc list is growing.
This addresses a bug where the cost of TermRangeQuery/TermInSetQuery and some other queries will be highly underestimated.
- Improvements (2)
- GITHUB#11912, GITHUB#11918: Port generic exception handling from MemorySegmentIndexInput
to ByteBufferIndexInput. This also adds the invalid position while seeking or reading
to the exception message. Allows better debugging and analysis of bugs like GITHUB#11905.
(Uwe Schindler, Robert Muir)
- GITHUB#11916: Improve CheckIndex to be more thorough for vectors.
(Ben Trent)
- Bug Fixes (1)
- GITHUB#11858: Fix kNN vectors format validation on large segments. This
addresses a regression in 9.4.0 where validation could fail, preventing
further writes or searches on the index.
(Julie Tibshirani)
- API Changes (1)
- LUCENE-10577: Add VectorEncoding to enable byte-encoded HNSW vectors
(Michael Sokolov, Julie Tibshirani)
- New Features (4)
- LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape.
(Nick Knize)
- LUCENE-10629: Support match set filtering with a query in MatchingFacetSetCounts.
(Stefan Vodita, Shai Erera)
- LUCENE-10633: SortField#setOptimizeSortWithIndexedData and
SortField#getOptimizeSortWithIndexedData were introduced to provide
an option to disable sort optimization for various sort fields.
(Mayya Sharipova)
- GITHUB#912: Support for the Java 19 foreign memory API was added. Applications started
with command line parameter "java --enable-preview" will automatically use the new
foreign memory API of Java 19 to access indexes on disk with MMapDirectory. This is
an opt-in feature and requires an explicit Java command line flag! When enabled, Lucene logs
a notice using java.util.logging. Please test thoroughly and report bugs/slowness to Lucene's
mailing list. When the new API is used, MMapDirectory will mmap Lucene indexes in chunks of
16 GiB (instead of 1 GiB) and indexes closed while queries are running can no longer crash
the JVM.
(Uwe Schindler)
- Improvements (4)
- LUCENE-10592: Build HNSW Graph on indexing.
(Mayya Sharipova, Adrien Grand, Julie Tibshirani)
- LUCENE-10207: TermInSetQuery can now provide a ScoreSupplier with cost estimation, making it
usable in IndexOrDocValuesQuery.
(Greg Miller)
- LUCENE-10216: Use MergePolicy to define and MergeScheduler to trigger the reader merges
required by addIndexes(CodecReader[]) API.
(Vigya Sharma, Michael McCandless)
- GITHUB#11715: Add Integer awareness to RamUsageEstimator.sizeOf
(Mike Drob)
- Optimizations (5)
- LUCENE-10661: Reduce memory copy in BytesStore.
(luyuncheng)
- GITHUB#1020: Support #scoreSupplier and small optimizations to DocValuesRewriteMethod.
(Greg Miller)
- LUCENE-10633: Added support for dynamic pruning to queries sorted by a string
field that is indexed with terms and SORTED or SORTED_SET doc values.
(Adrien Grand)
- LUCENE-10627: Use ByteBuffersDataInput to reduce memory copies when compressing data.
(luyuncheng)
- GITHUB#1062: Optimize TermInSetQuery when a term is present that matches all docs in a segment.
(Greg Miller)
- Bug Fixes (7)
- LUCENE-10663: Fix KnnVectorQuery explain with multiple segments.
(Shiming Li)
- LUCENE-10673: Improve check of equality for latitudes for spatial3d GeoBoundingBox
(Ignacio Vera)
- LUCENE-10678: Fix potential overflow when building a BKD tree with more than 4 billion points. The overflow
occurs when computing the partition point.
(Ignacio Vera)
- LUCENE-10644: Facets#getAllChildren testing should ignore child order.
(Yuting Gan)
- LUCENE-10665, GITHUB#11701: Fix classloading deadlock in analysis factories / AnalysisSPILoader
initialization.
(Uwe Schindler)
- LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts.
(Jack Mazanec)
- GITHUB#11794: Guard FieldExistsQuery against null pointers
(Luca Cavanna)
- Build (2)
- GITHUB#11720: Upgrade randomizedtesting to 2.8.1 (potential fix for odd wall-clock-related
timeout failures).
(Dawid Weiss)
- LUCENE-10669: The build should be more helpful when generated resources are touched
(Dawid Weiss)
- Other (1)
- LUCENE-10559: Add Prefilter Option to KnnGraphTester
(Kaival Parikh)
- API Changes (2)
- LUCENE-10603: SortedSetDocValues#NO_MORE_ORDS marked @deprecated in favor of iterating with
SortedSetDocValues#docValueCount().
(Greg Miller)
- GITHUB#978: Deprecate (remove in Lucene 10) obsolete constants in oal.util.Constants; remove
code which is no longer executed after Java 9.
(Uwe Schindler)
- New Features (4)
- LUCENE-10550: Add getAllChildren functionality to facets
(Yuting Gan)
- LUCENE-10274: Added facetsets module for high dimensional (hyper-rectangle) faceting
(Shai Erera, Marc D'Mello, Greg Miller)
- LUCENE-10151: Enable timeout support in IndexSearcher.
(Deepika Sharma)
- Improvements (5)
- LUCENE-10078: Merge on full flush is now enabled by default with a timeout of
500ms.
(Adrien Grand)
- LUCENE-10585: Facet module code cleanup (copy/paste scrubbing, simplification and some very minor
optimization tweaks).
(Greg Miller)
- LUCENE-10603: Update SortedSetDocValues iteration to use SortedSetDocValues#docValueCount().
(Greg Miller, Stefan Vodita)
- LUCENE-10619: Optimize the writeBytes in TermsHashPerField.
(Tang Donghai)
- GITHUB#983: AbstractSortedSetDocValueFacetCounts internal code cleanup/refactoring.
(Greg Miller)
- Optimizations (11)
- LUCENE-8519: MultiDocValues.getNormValues should not call getMergedFieldInfos
(Rushabh Shah)
- GITHUB#961: BooleanQuery can return quick counts for simple boolean queries.
(Adrien Grand)
- LUCENE-10618: Implement BooleanQuery rewrite rules based on minimumShouldMatch.
(Fang Hou)
- LUCENE-10480: Implement Block-Max-Maxscore scorer for 2 clauses disjunction.
(Zach Chen, Adrien Grand)
- LUCENE-10606: For KnnVectorQuery, optimize case where filter is backed by BitSetIterator
(Kaival Parikh)
- LUCENE-10593: Vector similarity function and NeighborQueue reverse removal.
(Alessandro Benedetti)
- GITHUB#984: Use primitive type data structures in FloatTaxonomyFacets and IntTaxonomyFacets
#getAllChildren() internal implementation to avoid some garbage creation.
(Greg Miller)
- GITHUB#1010: Specialize ordinal encoding for common case in SortedSetDocValues.
(Greg Miller)
- LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput.
(luyuncheng)
- GITHUB#1007: Optimize IntersectVisitor#visit implementations for certain bulk-add cases.
(Greg Miller)
- LUCENE-10653: BlockMaxMaxscoreScorer uses heapify instead of individual adds.
(Greg Miller)
- Changes in runtime behavior (1)
- GITHUB#978: IndexWriter diagnostics written to index only contain java's runtime version
and vendor.
(Uwe Schindler)
- Bug Fixes (13)
- LUCENE-10574: Prevent pathological O(N^2) merging.
(Adrien Grand)
- LUCENE-10584: Properly support #getSpecificValue for hierarchical dims in SSDV faceting.
(Greg Miller)
- LUCENE-10582: Fix merging of overridden CollectionStatistics in CombinedFieldQuery
(Yannick Welsch)
- LUCENE-10563: Fix failure to tessellate complex polygon
(Craig Taverner)
- LUCENE-10605: Fix error in 32bit jvm object alignment gap calculation
(Sun Wuqiang)
- GITHUB#956: Make sure KnnVectorQuery applies search boost.
(Julie Tibshirani)
- LUCENE-10598: SortedSetDocValues#docValueCount() should be always greater than zero.
(Lu Xugang)
- LUCENE-10600: SortedSetDocValues#docValueCount should be an int, not long
(Lu Xugang)
- LUCENE-10611: Fix failure when KnnVectorQuery has very selective filter
(Kaival Parikh)
- LUCENE-10607: Fix potential integer overflow in maxArcs computations
(Tang Donghai)
- GITHUB#986: Fix FieldExistsQuery rewrite when all docs have vectors.
(Julie Tibshirani)
- LUCENE-10623: Fix erroneous implementation of docValueCount for SortingSortedSetDocValues
(Lu Xugang)
- GITHUB#1028: Fix error in TieredMergePolicy
(Lin Jian)
- Other (4)
- GITHUB#991: Update randomizedtesting to 2.8.0, hppc to 0.9.1, morfologik to 2.1.9.
(Dawid Weiss)
- LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests.
(Dawid Weiss)
- LUCENE-10604: Improve ability to test and debug triangulation algorithm in Tessellator.
(Craig Taverner)
- GITHUB#922: Remove unused and confusing FacetField indexing options
(Gautam Worah)
- Build (1)
- GITHUB#976: Exclude Lucene's own JAR files from classpath entries in Eclipse config.
(Uwe Schindler)
- API Changes (3)
- LUCENE-10325: Facets API extended to support getTopFacets.
(Yuting Gan)
- LUCENE-10482: Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the
taxoEpoch decide. Add a test case that demonstrates the inconsistencies caused when you reuse taxoArrays on older
checkpoints.
(Gautam Worah)
- LUCENE-10558: Add new constructors to Kuromoji and Nori dictionary classes to support classpath /
module system usage. It is now possible to use JDK's Class/ClassLoader/Module#getResource(...) apis
and pass their returned URL to dictionary constructors to load resources from Classpath or Module
resources.
(Uwe Schindler, Tomoko Uchida, Mike Sokolov)
- New Features (6)
- LUCENE-10312: Add PersianStemmer based on the Arabic stemmer.
(Ramin Alirezaee)
- LUCENE-10539: Return a stream of completions from FSTCompletion.
(Dawid Weiss)
- LUCENE-10385: Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery
to speed up computing the number of hits when possible.
(Lu Xugang, Luca Cavanna, Adrien Grand)
- LUCENE-10422: Monitor Improvements: `Monitor` can use a custom `Directory`
implementation. `Monitor` can be created with a readonly `QueryIndex` in order to
have readonly `Monitor` instances.
(Niko Usai)
- LUCENE-10456: Implement rewrite and Weight#count for MultiRangeQuery
by merging overlapping ranges.
(Jianping Weng)
- LUCENE-10444: Support alternate aggregation functions in association facets.
(Greg Miller)
- Improvements (6)
- LUCENE-10229: return -1 for unknown offsets in ExtendedIntervalsSource. Modify highlighting to
work properly with or without offsets.
(Dawid Weiss)
- LUCENE-10494: Implement method to bulk add all collection elements to a PriorityQueue.
(Bauyrzhan Sakhariyev)
- LUCENE-10484: Add support for concurrent random sampling by calling
RandomSamplingFacetsCollector#createManager.
(Luca Cavanna)
- LUCENE-10467: Throws IllegalArgumentException for Facets#getAllDims and Facets#getTopChildren
if topN <= 0.
(Yuting Gan)
- LUCENE-9848: Correctly sort HNSW graph neighbors when applying diversity criterion
(Mayya Sharipova, Michael Sokolov)
- LUCENE-10527: Use 2*maxConn for the last layer in HNSW
(Mayya Sharipova)
- Optimizations (16)
- LUCENE-10555: avoid NumericLeafComparator#iteratorCost repeated initialization
when NumericLeafComparator#setScorer is called.
(Jianping Weng)
- LUCENE-10452: Hunspell: call checkCanceled less frequently to reduce the overhead
(Peter Gromov)
- LUCENE-10451: Hunspell: don't perform potentially expensive spellchecking after timeout
(Peter Gromov)
- LUCENE-10418: More `Query#rewrite` optimizations for the non-scoring case.
(Adrien Grand)
- LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery
in favor of FieldExistsQuery.
(Zach Chen, Michael McCandless, Adrien Grand)
- LUCENE-10481: FacetsCollector will not request scores if it does not use them.
(Mike Drob)
- LUCENE-10503: Potential speedup for pure disjunctions whose clauses produce
scores that are very close to each other.
(Adrien Grand)
- LUCENE-10315: Use SIMD instructions to decode BKD doc IDs.
(Guo Feng, Adrien Grand, Ignacio Vera)
- LUCENE-8836: Speed up calls to TermsEnum#lookupOrd on doc values terms enums
and sequences of increasing ords.
(Bruno Roustant, Adrien Grand)
- LUCENE-10536: Doc values terms dictionaries now use the first (uncompressed)
term of each block as a dictionary when compressing suffixes of the other 63
terms of the block.
(Adrien Grand)
- LUCENE-10411: Add nearest neighbors vectors support to ExitableDirectoryReader.
(Zach Chen, Adrien Grand, Julie Tibshirani, Tomoko Uchida)
- LUCENE-10542: FieldSource exists implementations can avoid value retrieval
(Kevin Risden)
- LUCENE-10534: MinFloatFunction / MaxFloatFunction exists check can be slow
(Kevin Risden)
- LUCENE-10496: Queries sorted by field now better handle the degenerate case
when the search order and the index order are in opposite directions.
(Jianping Weng)
- LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle
ordToDoc in HNSW vectors
(Lu Xugang)
- LUCENE-10488: Facets#getTopDims optimized for taxonomy faceting and
ConcurrentSortedSetDocValuesFacetCounts.
(Yuting Gan)
- Bug Fixes (13)
- LUCENE-10477: Highlighter: WeightedSpanTermExtractor.extractWeightedSpanTerms to Query#rewrite
multiple times if necessary.
(Christine Poerschke, Adrien Grand)
- LUCENE-10491: A correctness bug in the way scores are provided within TaxonomyFacetSumValueSource
was fixed.
(Michael McCandless, Greg Miller)
- LUCENE-10466: Ensure IndexSortSortedNumericDocValuesRangeQuery handles sort field
types besides LONG
(Andriy Redko)
- LUCENE-10292: Suggest: Fix AnalyzingInfixSuggester / BlendedInfixSuggester to correctly return
existing lookup() results during concurrent build(). Fix other FST based suggesters so that
getCount() returned results consistent with lookup() during concurrent build().
(hossman)
- LUCENE-10508: Fixes some edge cases where GeoAreas were built in a way that vertical planes
could not evaluate their sign, either because the planes were the same or the center between those
planes was lying in one of the planes.
(Ignacio Vera)
- LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets.
(Yuting Gan)
- LUCENE-10533: SpellChecker.formGrams is missing bounds check
(Kevin Risden)
- LUCENE-10529: Properly handle when TestTaxonomyFacetAssociations test case randomly indexes
no documents instead of throwing an NPE.
(Greg Miller)
- LUCENE-10470: Check if polygon has been successfully tessellated before we fail (we are failing some valid
tessellations) and allow filtering edges that fold on top of the previous one.
(Ignacio Vera)
- LUCENE-10530: Avoid floating point precision test case bug in TestTaxonomyFacetAssociations.
(Greg Miller)
- LUCENE-10552: KnnVectorQuery has incorrect equals/hashCode.
(Lu Xugang)
- LUCENE-10558: Restore behaviour of deprecated Kuromoji and Nori dictionary constructors for
custom dictionary support. Please also use new URL-based constructors for classpath/module
system resources.
(Uwe Schindler, Tomoko Uchida, Mike Sokolov)
- LUCENE-10564: Make sure SparseFixedBitSet#or updates ramBytesUsed.
(Julie Tibshirani)
- Build (3)
- GITHUB#768: Upgrade forbiddenapis to version 3.3.
(Uwe Schindler)
- GITHUB#890: Detect CI builds on Github or Jenkins and enable errorprone.
(Uwe Schindler, Dawid Weiss)
- LUCENE-10532: Remove LuceneTestCase.Slow annotation. All tests can be fast.
(Robert Muir)
- Other (4)
- LUCENE-10526: Test-framework: Add FilterFileSystemProvider.wrapPath(Path) method for mock filesystems
to override if they need to extend the Path implementation.
(Gautam Worah, Robert Muir)
- LUCENE-10525: Test-framework: Add detection of illegal Windows filenames to WindowsFS.
(Gautam Worah)
- LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255.
(Robert Muir, Uwe Schindler, Tomoko Uchida, Dawid Weiss)
- GITHUB#854: Allow to link to GitHub pull request from CHANGES.
(Tomoko Uchida, Jan Høydahl)
- API Changes (16)
- LUCENE-10244: MultiCollector::getCollectors is now public, allowing users to access the wrapped
collectors.
(Andriy Redko)
- LUCENE-10197: UnifiedHighlighter now has a Builder to construct it. The UH's setters are now
deprecated.
(Animesh Pandey, David Smiley)
- LUCENE-10301: the test framework is now a module. All the classes have been moved from
org.apache.lucene.* to org.apache.lucene.tests.* to avoid package name conflicts with the
core module.
(Dawid Weiss)
- LUCENE-10183: KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues.
(Zach Chen, Michael Sokolov, Julie Tibshirani, Adrien Grand)
- LUCENE-10335: Deprecate helper methods for resource loading in IOUtils and StopwordAnalyzerBase
that are not compatible with module system (Class#getResourceAsStream() and Class#getResource()
are caller sensitive in Java 11). Instead add utility method IOUtils#requireResourceNonNull(T)
to test existence of resource based on null return value.
(Uwe Schindler, Dawid Weiss)
- LUCENE-10349: WordListLoader methods now return unmodifiable CharArraySets.
(Uwe Schindler)
- LUCENE-10377: SortField.getComparator() has changed signature. The second parameter is now
a boolean indicating whether or not skipping should be enabled on the comparator.
(Alan Woodward)
- LUCENE-10381: Require users to provide FacetsConfig for SSDV faceting.
(Greg Miller)
- LUCENE-10368: IntTaxonomyFacets has been deprecated and is no longer a supported extension point
for user-created faceting implementations.
(Greg Miller)
- LUCENE-10400: Add constructors that take external resource Paths to dictionary classes in Kuromoji and Nori:
ConnectionCosts, TokenInfoDictionary, and UnknownDictionary. Old constructors that take resource scheme and
resource path in those classes are deprecated; these are replaced with the new constructors and are planned to be
removed in a future release.
(Tomoko Uchida, Uwe Schindler, Mike Sokolov)
- LUCENE-10050: Deprecate DrillSideways#search(Query, Collector) in favor of
DrillSideways#search(Query, CollectorManager). This reflects the change (LUCENE-10002) being made in
IndexSearcher#search that trends towards using CollectorManagers over Collectors.
(Gautam Worah)
- LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces.
(David Smiley, Uwe Schindler, Dawid Weiss, Tomoko Uchida)
- LUCENE-10398: Add static method for getting Terms from LeafReader.
(Spike Liu)
- LUCENE-10440: TaxonomyFacets and FloatTaxonomyFacets have been deprecated and are no longer
supported extension points for user-created faceting implementations.
(Greg Miller)
- LUCENE-10431: MultiTermQuery.setRewriteMethod() has been deprecated, and constructor
parameters for the various implementations added.
(Alan Woodward)
- LUCENE-10171: OpenNLPOpsFactory.getLemmatizerDictionary(String, ResourceLoader) now returns a
DictionaryLemmatizer object instead of a raw String serialization of the dictionary.
(Spyros Kapnissis via Michael Gibney, Alessandro Benedetti)
- New Features (19)
- LUCENE-10255: Lucene JARs are now proper modules, with module descriptors and dependency information.
(Chris Hegarty, Uwe Schindler, Tomoko Uchida, Dawid Weiss)
- LUCENE-10342: Lucene Core now depends on java.logging (JUL) module and reports
if MMapDirectory cannot unmap mapped ByteBuffers or RamUsageEstimator's object size
calculations may be off. This was added especially for users running Lucene with the
Java Module System where some optional features are not available by default or supported.
For all apps using Lucene it is strongly recommended to explicitly require non-standard
JDK modules: jdk.unsupported (unmapping) and jdk.management (OOP size for RAM usage calculations).
It is also recommended to install JUL logging adapters to feed the log events into your app's
logging system.
(Uwe Schindler, Dawid Weiss, Tomoko Uchida, Robert Muir)
- LUCENE-10330: Make MMapDirectory tests fail by default, if unmapping does not work.
(Uwe Schindler, Dawid Weiss)
- LUCENE-10223: Add interval function support to StandardQueryParser. Add min-should-match operator
support to StandardQueryParser. Update and clean up package documentation in flexible query parser
module.
(Dawid Weiss, Alan Woodward)
- LUCENE-10220: Add a utility method to get IntervalSource from analyzed text (or token stream).
(Uwe Schindler, Dawid Weiss, Alan Woodward)
- LUCENE-10085: Added Weight#count on DocValuesFieldExistsQuery to speed up the query if terms or
points are indexed.
(Quentin Pradet, Adrien Grand)
- LUCENE-10263: Added Weight#count to NormsFieldExistsQuery to speed up the query if all
documents have the field.
(Alan Woodward)
- LUCENE-10248: Add SpanishPluralStemFilter, for precise stemming of Spanish plurals.
For more information, see https://s.apache.org/spanishplural
(Xavier Sanchez Loro)
- LUCENE-10243: StandardTokenizer, UAX29URLEmailTokenizer, and HTMLStripCharFilter have
been upgraded to Unicode 12.1
(Robert Muir)
- LUCENE-10335: Add ModuleResourceLoader as complement to ClasspathResourceLoader.
(Uwe Schindler)
- LUCENE-10245: MultiDoubleValues(Source) and MultiLongValues(Source) were added as multi-valued
versions of DoubleValues(Source) and LongValues(Source) to the facets module. LongValueFacetCounts,
LongRangeFacetCounts and DoubleRangeFacetCounts were augmented to support these new multi-valued
abstractions. DoubleRange and LongRange also support creating queries from these multi-valued
sources.
(Greg Miller)
- LUCENE-10250: Add support for arbitrary length hierarchical SSDV facets.
(Marc D'Mello)
- LUCENE-10395: Add support for TotalHitCountCollectorManager, a collector manager
based on TotalHitCountCollector that allows users to parallelize counting the
number of hits.
(Luca Cavanna, Adrien Grand)
- LUCENE-10403: Add ArrayUtil#grow(T[]).
(Greg Miller)
- LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser
(Dawid Weiss, Alan Woodward)
- LUCENE-10378: Implement Weight#count for PointRangeQuery to provide a faster way to calculate
the number of matching range docs when each doc has at-most one point and the points are 1-dimensional.
(Gautam Worah, Ignacio Vera, Adrien Grand)
- LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count.
(Ignacio Vera)
- LUCENE-10382: Add support for filtering in KnnVectorQuery. This allows for finding the
nearest k documents that also match a query.
(Julie Tibshirani, Joel Bernstein)
- LUCENE-10237: Add MergeOnFlushMergePolicy to sandbox.
(Michael Froh, Anand Kotriwal)
- Improvements (9)
- LUCENE-10313: Use java.util.logging in Luke. Add dynamic log filtering. Drop
the persistent log previously written to ~/.luke.d/luke.log. Configure Java's default
logging handlers to persist Luke logs according to your needs.
(Tomoko Uchida, Dawid Weiss)
- LUCENE-10238: Upgrade icu4j dependency to 70.1.
(Dawid Weiss)
- LUCENE-9820: Extract BKD tree interface and move intersecting logic to the
PointValues abstract class.
(Ignacio Vera, Adrien Grand)
- LUCENE-10262: Lift up restrictions for navigating PointValues#PointTree
added in LUCENE-9820
(Ignacio Vera)
- LUCENE-9538: Detect polygon self-intersections in the Tessellator.
(Ignacio Vera)
- LUCENE-10275: Speed up MultiRangeQuery by using an interval tree.
(Ignacio Vera)
- LUCENE-10229: Unify behaviour of match offsets for interval queries on fields
with or without offsets enabled.
(Patrick Zhai)
- LUCENE-10054: Make HnswGraph hierarchical
(Mayya Sharipova, Julie Tibshirani, Mike Sokolov, Adrien Grand)
- LUCENE-10371: Make IndexRearranger able to arrange segment in a determined order.
(Patrick Zhai)
- Optimizations (20)
- LUCENE-10329: Use computed block mask for DirectMonotonicReader#get.
(Guo Feng)
- LUCENE-10280: Optimize BKD leaves' doc IDs codec when they are continuous.
(Guo Feng)
- LUCENE-10233: Store BKD leaves' doc IDs as bitset in some cases (typically for low cardinality fields
or sorted indices) to speed up addAll.
(Guo Feng, Adrien Grand)
- LUCENE-10225: Improve IntroSelector with 3-ways partitioning.
(Bruno Roustant, Adrien Grand)
- LUCENE-10321: Tweak MultiRangeQuery interval tree creation to skip "pulling up" mins.
(Greg Miller)
- LUCENE-10252: ValueSource.asDoubleValues and asLongValues should not compute the score unless
asked to -- typically never. This fixes a performance regression since 7.3 LUCENE-8099 when some
older boosting queries were replaced with this.
(David Smiley)
- LUCENE-10346: Optimize facet counting for single-valued TaxonomyFacetCounts.
(Guo Feng)
- LUCENE-10356: Further optimize facet counting for single-valued TaxonomyFacetCounts.
(Greg Miller)
- LUCENE-10379: Count directly into the dense values array in FastTaxonomyFacetCounts#countAll.
(Guo Feng, Greg Miller)
- LUCENE-10375: Speed up HNSW vectors merge by first writing combined vector
data to a file.
(Julie Tibshirani, Adrien Grand)
- LUCENE-10388: Remove MultiLevelSkipListReader#SkipBuffer to make JVM less confused.
(Guo Feng)
- LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of
matching clauses is a constant.
(LuYunCheng via Adrien Grand)
- LUCENE-10412: More `Query#rewrite` optimizations for MatchNoDocsQuery.
(Adrien Grand)
- LUCENE-10408: Better encoding of doc Ids in vectors.
(Mayya Sharipova, Julie Tibshirani, Adrien Grand)
- LUCENE-10424, LUCENE-10439: Optimize the "everything matches" case for count query in PointRangeQuery.
(Ignacio Vera, Lu Xugang)
- LUCENE-10084, LUCENE-10435: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery whenever
terms or points have a docCount that is equal to maxDoc.
(Vigya Sharma, Lu Xugang)
- LUCENE-10442: When indexQuery and/or dvQuery is a MatchAllDocsQuery,
IndexOrDocValuesQuery is now rewritten to MatchAllDocsQuery.
(Lu Xugang)
- LUCENE-10450: IndexSortSortedNumericDocValuesRangeQuery can now be rewritten to MatchAllDocsQuery.
(Lu Xugang)
- LUCENE-10453: Indexing and search speedup with KNN vectors when using
euclidean distance.
(Adrien Grand)
- LUCENE-10455: IndexSortSortedNumericDocValuesRangeQuery now implements the scorerSupplier API.
(Lu Xugang)
- Changes in runtime behavior (2)
- LUCENE-10291: Lucene now only writes files for terms and postings if at least
one field is indexed with postings.
(Yannick Welsch)
- LUCENE-10311: FixedBitSet#approximateCardinality now trades accuracy for
speed instead of delegating to FixedBitSet#cardinality.
(Robert Muir, Adrien Grand)
- Bug Fixes (16)
- LUCENE-10316: Fix TestLRUQueryCache.testCachingAccountableQuery failure.
(Patrick Zhai)
- LUCENE-10279: Fix equals in MultiRangeQuery.
(Ignacio Vera)
- LUCENE-10349: Fix all analyzers to behave according to their documentation:
getDefaultStopSet() methods now return unmodifiable CharArraySets.
(Uwe Schindler)
- LUCENE-10352: Add missing service provider entries: KoreanNumberFilterFactory,
DaitchMokotoffSoundexFilterFactory
(Uwe Schindler, Robert Muir)
- LUCENE-10352: Fixed ctor argument checks: JapaneseKatakanaStemFilter,
DoubleMetaphoneFilter
(Uwe Schindler, Robert Muir)
- LUCENE-10236: Stop duplicating norms when scoring in CombinedFieldQuery.
(Zach Chen, Jim Ferenczi, Julie Tibshirani)
- LUCENE-10353: Add random null injection to TestRandomChains.
(Robert Muir, Uwe Schindler)
- LUCENE-10377: CheckIndex could incorrectly throw an error when checking index sorts
defined on older indexes.
(Alan Woodward)
- LUCENE-9952: Address inaccurate dim counts for SSDV faceting in cases where a dim is configured
as multi-valued.
(Greg Miller)
- LUCENE-10401: Fix lookups on empty doc-value terms dictionaries to no longer
throw an ArrayIndexOutOfBoundsException.
(Adrien Grand)
- LUCENE-10402: Prefix intervals should declare their automaton as binary, otherwise prefixes
containing multibyte characters will not correctly match.
(Alan Woodward)
- LUCENE-10407: Containing intervals could sometimes yield incorrect matches when wrapped
in a disjunction.
(Alan Woodward, Dawid Weiss)
- LUCENE-10405: When using the MemoryIndex, binary and Sorted doc values are stored
as BytesRef instead of BytesRefHash so they don't have a limit on size.
(Ignacio Vera)
- LUCENE-10428: Queries with a misbehaving score function may no longer cause
infinite loops in their parent BooleanQuery.
(Ankit Jain, Daniel Doubrovkine, Adrien Grand)
- LUCENE-10431: MultiTermQuery no longer includes its rewrite method in its hashcode
calculation, as this could cause problems with wrapper queries like BooleanQuery which
expect their child queries' hashcodes to be stable.
(Alan Woodward)
- LUCENE-10469: Fix ScoreMode propagation by ConstantScoreQuery.
(Adrien Grand)
- Other (7)
- LUCENE-10273: Deprecate SpanishMinimalStemFilter in favor of SpanishPluralStemFilter.
(Robert Muir)
- LUCENE-10284: Upgrade morfologik-stemming to 2.1.8.
(Dawid Weiss)
- LUCENE-10310: TestXYDocValuesQueries#doRandomDistanceTest no longer produces random circles with
a radius of 0.
- LUCENE-10352: Removed duplicate instances of StringMockResourceLoader and migrated class to
test-framework.
(Uwe Schindler, Robert Muir)
- LUCENE-10352: Convert TestAllAnalyzersHaveFactories and TestRandomChains to a global integration test
and discover classes to check from module system. The test now checks all analyzer modules,
so it may discover new bugs outside of analysis:common module.
(Uwe Schindler, Robert Muir)
- LUCENE-10413: Make Ukrainian default stop words list available as a public getter.
(Alan Woodward)
- LUCENE-10437: Polygon tessellator throws a more informative error message when the provided polygon
does not contain enough non-collinear points.
(Ignacio Vera)
- New Features (8)
- LUCENE-9322, LUCENE-9855: Vector-valued fields, Lucene90 Codec
(Mike Sokolov, Julie Tibshirani, Tomoko Uchida)
- LUCENE-9004, LUCENE-10040: Approximate nearest vector search via NSW graphs
(Mike Sokolov, Tomoko Uchida et al.)
- LUCENE-9659: SpanPayloadCheckQuery now supports inequalities.
(Kevin Watters, Gus Heck)
- LUCENE-9589: Swedish Minimal Stemmer
(janhoy)
- LUCENE-9313: Add SerbianAnalyzer based on the snowball stemmer.
(Dragan Ivanovic)
- LUCENE-10095: Add NepaliAnalyzer based on the snowball stemmer.
(Robert Muir)
- LUCENE-10096: Add TamilAnalyzer based on the snowball stemmer.
(Robert Muir)
- LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion
(Tomoko Uchida, Robert Muir, Jun Ohtani)
- System Requirements (1)
- LUCENE-8738: Move to Java 11 as minimum Java version.
(Adrien Grand, Uwe Schindler)
- API Changes (44)
- LUCENE-8638: Remove many deprecated methods and classes including FST.lookupByOutput(),
LegacyBM25Similarity and Jaspell suggester.
- LUCENE-8982: Separate out native code to another module to allow cpp
build with gradle. This also changes the name of the native "posix-support"
library to LuceneNativeIO.
(Zachary Chen, Dawid Weiss)
- LUCENE-9562: All binary analysis packages (and corresponding
Maven artifacts) with names containing '-analyzers-' have been renamed
to '-analysis-'.
(Dawid Weiss)
- LUCENE-8474: RAMDirectory and associated deprecated classes have been
removed.
(Dawid Weiss)
- LUCENE-3041: The deprecated Weight#extractTerms() method has been
removed
(Alan Woodward, Simon Willnauer, David Smiley, Luca Cavanna)
- LUCENE-8805: StoredFieldVisitor#stringField now takes a String rather than a
byte[] that stores the UTF-8 bytes of the stored string.
(Namgyu Kim via Adrien Grand)
- LUCENE-8811: BooleanQuery#setMaxClauseCount() and #getMaxClauseCount() have
moved to IndexSearcher. The checks are now implemented using a QueryVisitor
and apply to all queries, rather than only booleans.
(Atri Sharma, Adrien Grand, Alan Woodward)
- LUCENE-8909: The deprecated IndexWriter#getFieldNames() method has been removed.
(Adrien Grand, Munendra S N)
- LUCENE-8948: Change "name" argument in ICU factories to "form". Here, "form" is
named after "Unicode Normalization Form".
(Tomoko Uchida)
- LUCENE-8933: Validate JapaneseTokenizer user dictionary entry.
(Tomoko Uchida)
- LUCENE-8905: Better defence against malformed arguments in TopDocsCollector
(Atri Sharma)
- LUCENE-9089: FST Builder renamed FSTCompiler with fluent-style Builder.
(Bruno Roustant)
- LUCENE-9212: Deprecated Intervals.multiterm() methods that take a bare Automaton
have been removed
(Alan Woodward)
- LUCENE-9264: SimpleFSDirectory has been removed in favor of NIOFSDirectory.
(Yannick Welsch)
- LUCENE-9281: Use java.util.ServiceLoader to load codec components and analysis
factories to be compatible with the Java Module System. This allows loading factories
without META-INF/services from a Java module exposing the factory in the module
descriptor. This breaks backwards compatibility as custom analysis factories
must now also implement the default constructor (see MIGRATE.md).
(Uwe Schindler, Dawid Weiss)
- LUCENE-9307: BufferedIndexInput#setBufferSize has been removed.
(Adrien Grand)
- LUCENE-9340: SimpleBindings#add(SortField) has been removed.
(Alan Woodward)
- LUCENE-9462: Fields without positions should still return MatchIterator.
(Alan Woodward, Dawid Weiss)
- LUCENE-9516: Removed the ability to replace the IndexingChain / DocConsumer
in Lucene's IndexWriter. The interface is not sufficient to efficiently
replace the functionality with reasonable effort.
(Simon Willnauer)
- LUCENE-9317, LUCENE-9318, LUCENE-9319, LUCENE-9558, LUCENE-9600: Clean up package name conflicts
between modules. See MIGRATE.md for details.
(David Ryan, Tomoko Uchida, Uwe Schindler, Dawid Weiss)
- LUCENE-9646: Set BM25Similarity discountOverlaps via the constructor
(Patrick Marty via Bruno Roustant)
- LUCENE-9480: Make DataInput's skipBytes(long) abstract as the implementation was not performant.
IndexInput's api is unaffected: skipBytes() is implemented via seek().
(Greg Miller)
- LUCENE-9796: SortedDocValues no longer extends BinaryDocValues, as binaryValue() was not performant.
See MIGRATE.md for details.
(Robert Muir)
- LUCENE-9853: JapaneseAnalyzer should use CJKWidthCharFilter for full-width and half-width character normalization.
(Tomoko Uchida)
- LUCENE-9387: Removed CodecReader#ramBytesUsed.
(Adrien Grand)
- LUCENE-9334: Require consistency between data-structures on a per-field basis.
A field across all documents within an index must be indexed with the same index
options and data-structures. As a consequence of this, doc values updates are
only applicable for fields that are indexed with doc values only.
(Mayya Sharipova, Adrien Grand, Simon Willnauer)
- LUCENE-9047: Directory API is now little endian.
(Ignacio Vera, Adrien Grand)
- LUCENE-9948: No longer require the user to specify whether-or-not a field is multi-valued in
LongValueFacetCounts (detect automatically based on what is indexed).
(Greg Miller)
- LUCENE-9843: Remove compression option on default codec's docvalues.
(Jack Conradson)
- LUCENE-9204: SpanQuery and its subclasses have been moved from core/ into the
queries/ module.
(Alan Woodward)
- LUCENE-9454: Analyzer no longer has a mutable version field.
(Alan Woodward)
- LUCENE-9956: Expose the getBaseQuery, getDrillDownQueries APIs from DrillDownQuery
(Gautam Worah)
- LUCENE-8143: SpanBoostQuery has been removed.
(Alan Woodward)
- LUCENE-9998: Remove unused parameter fis in StoredFieldsWriter.finish() and TermVectorsWriter.finish(),
including their subclasses.
(kkewwei)
- LUCENE-7020: TieredMergePolicy#setMaxMergeAtOnceExplicit has been removed.
TieredMergePolicy no longer sets a limit on the maximum number of segments
that can be merged at once via a forced merge.
(Adrien Grand, Shawn Heisey)
- LUCENE-10027: Directory reader open API from indexCommit and leafSorter has been modified
to add an extra parameter - minSupportedMajorVersion.
(Mayya Sharipova)
- LUCENE-9620: Added a (sometimes) faster implementation for IndexSearcher#count that relies on the new Weight#count API.
The Weight#count API represents a cleaner way for Query classes to optimize their counting method.
(Gautam Worah, Adrien Grand)
- LUCENE-10089: Add a method to SortField that allows enabling or disabling the numeric sort
optimization that uses the points index to skip over non-competitive documents.
The optimization is enabled by default from 9.0.
(Mayya Sharipova, Adrien Grand)
- LUCENE-10115: Add an extension point, BaseQueryParser#getFuzzyDistance, to allow custom
query parsers to determine the similarity distance for fuzzy queries.
(Chris Hegarty)
- LUCENE-10132: Support addition of diagnostics by custom merge policies
(Chris Hegarty)
- LUCENE-9325: Sort is now final, and the `setSort()` method has been removed
(Alan Woodward)
- LUCENE-9431: The UnifiedHighlighter's WEIGHT_MATCHES flag is now set by default, provided its
requirements are met. It can be disabled by overriding getFlags.
(Animesh Pandey, David Smiley)
- LUCENE-10158: Add a new interface Unwrappable to the utils package to allow code to
unwrap wrappers/delegators that are added by Lucene's testing framework. This will allow
testing new MMapDirectory implementation based on JDK Project Panama.
(Uwe Schindler)
- LUCENE-10260: LucenePackage class has been removed. The implementation string can be
retrieved from Version.getPackageImplementationVersion().
(Uwe Schindler, Dawid Weiss)
- Improvements (48)
- LUCENE-10234: Added Automatic-Module-Name to all JARs. This is the first step to enable full Java
module system (JMS) support in later Lucene versions. At the moment, the automatic names should
not be considered stable.
(Dawid Weiss, Uwe Schindler)
- LUCENE-10182: TestRamUsageEstimator used RamUsageTester.sizeOf throughout, making some of the
tests trivial. Now, it compares results from RamUsageEstimator with those from RamUsageTester.
To prevent this error in the future, RamUsageTester.sizeOf was renamed to ramUsed.
(Uwe Schindler, Dawid Weiss, Stefan Vodita)
- LUCENE-10129: RamUsageEstimator overloads the shallowSizeOf method for primitive arrays
to avoid falling back on shallowSizeOf(Object), which could lead to performance traps.
(Robert Muir, Uwe Schindler, Stefan Vodita)
- LUCENE-10139: ExternalRefSorter now returns a covariant subtype of BytesRefIterator
that is Closeable.
(Dawid Weiss)
- LUCENE-10135: Correct passage selector behavior for long matching snippets
(Dawid Weiss)
- LUCENE-9960: Avoid unnecessary top element replacement for equal elements in PriorityQueue.
(Dawid Weiss)
- LUCENE-9633: Improve match highlighter behavior for degenerate intervals (on non-existing positions).
(Dawid Weiss)
- LUCENE-9618: Do not call IntervalIterator.nextInterval after NO_MORE_DOCS is returned.
(Patrick Zhai)
- LUCENE-9576: Improve ConcurrentMergeScheduler settings by default, assuming modern I/O.
Previously Lucene was too conservative, jumping through hoops to detect if disks were SSD-backed.
In many common modern cases (VMs, RAID arrays, containers, encrypted mounts, non-Linux OS),
the pessimistic heuristics were wrong, resulting in slower indexing performance. Heuristics were
also complex and would trigger JDK issues even on unrelated mount points. Merge scheduler defaults
are now modernized and the heuristics removed. Users with spinning disks that want to maximize I/O
performance should tweak ConcurrentMergeScheduler.
(Robert Muir)
- LUCENE-9463: Query match region retrieval component, passage scoring and formatting
for building custom highlighters.
(Alan Woodward, Dawid Weiss)
- LUCENE-9370: RegExp query is no longer lenient about inappropriate backslashes and
follows the Java Pattern policy for rejecting illegal syntax.
(Mark Harwood)
- LUCENE-9336: RegExp query now supports \w \W \d \D \s \S expressions.
This is a break with previous behaviour where these were (mis)interpreted
as literally the characters w W d etc.
(Mark Harwood)
- LUCENE-8757: When provided with an ExecutorService to run queries across
multiple threads, IndexSearcher now groups small segments together, up to
250k docs per slice.
(Atri Sharma via Adrien Grand)
- LUCENE-8857: Introduce Custom Tiebreakers in TopDocs.merge for tie breaking on
docs on equal scores. Also, remove the ability of TopDocs.merge to set shard
indices
(Atri Sharma, Adrien Grand, Simon Willnauer)
- LUCENE-8958: Shared count early termination for relevance sorted indices
(Atri Sharma)
- LUCENE-8937: Avoid aggressive stemming on numbers in the FrenchMinimalStemmer.
(Adrien Gallou via Tomoko Uchida)
- LUCENE-8596: Kuromoji user dictionary now accepts entries containing hash mark (#) that were
previously treated as beginning a line-ending comment
(Satoshi Kato and Masaru Hasegawa via Michael Sokolov)
- LUCENE-9109: Use StackWalker to implement TestSecurityManager's detection
of JVM exit
(Uwe Schindler)
- LUCENE-9110: Refactor stack analysis in tests to use generalized LuceneTestCase
methods that use StackWalker
(Uwe Schindler)
- LUCENE-9206: IndexMergeTool gets additional options to control the merging.
This tool no longer forceMerge(1)s to a single segment by default. If you
rely upon this behavior, pass -max-segments 1 instead.
(Robert Muir)
- LUCENE-9220: Upgrade snowball to 2.0. New snowball stemmers: Hindi, Indonesian,
Nepali, Serbian, and Tamil. New stoplist: Indonesian. Adds gradle 'snowball'
task to regenerate and ease future upgrades.
(Robert Muir, Dawid Weiss)
- LUCENE-9354: Improvements to snowball french stopwords list, so that it is less
aggressive.
(Philippe Ouellet)
- LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation
(Atri Sharma, David Smiley)
- LUCENE-9074: Introduce Slice Executor For Dynamic Runtime Execution Of Slices
(Atri Sharma)
- LUCENE-9280: Add an ability for field comparators to skip non-competitive documents.
Creating a TopFieldCollector with totalHitsThreshold less than Integer.MAX_VALUE
instructs Lucene to skip non-competitive documents whenever possible. For numeric
sort fields the skipping functionality works when the same field is indexed both
with doc values and points. In this case, there is an assumption that the same data is
stored in these points and doc values
(Mayya Sharipova, Jim Ferenczi, Adrien Grand)
- LUCENE-9449: Enhance DocComparator to provide an iterator over competitive
documents when searching with "after". This iterator can quickly position
on the desired "after" document skipping all documents and segments before
"after". Also redesign numeric comparators to provide skipping functionality
by default.
(Mayya Sharipova, Jim Ferenczi)
- LUCENE-9527: Upgrade javacc to 7.0.4, regenerate query parsers.
(Dawid Weiss)
- LUCENE-9531: Consolidated CharStream and FastCharStream classes: these have been moved
from each query parser package to org.apache.lucene.queryparser.charstream.
(Dawid Weiss)
- LUCENE-9450: Use BinaryDocValues for the taxonomy index instead of StoredFields.
Add backwards compatibility tests for the taxonomy index.
(Gautam Worah, Michael McCandless)
- LUCENE-9605: Update snowball to d8cf01ddf37a, adds Yiddish stemmer.
(Robert Muir)
- LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag,
and rename to DirectIODirectory.
(Zach Chen, Uwe Schindler, Mike McCandless, Dawid Weiss)
- LUCENE-9674: Implement faster advance on VectorValues using binary search.
(Anand Kotriwal, Mike Sokolov)
- LUCENE-9794: Speed up implementations of DataInput.skipBytes().
(Greg Miller)
- LUCENE-9898: Removes no longer used scorePayload method from BM25Similarity
(Pieter van Boxtel)
- LUCENE-9850: Switch to PFOR encoding for doc IDs (instead of FOR).
(Greg Miller)
- LUCENE-9929: Add NorwegianNormalizationFilter, which does the same as ScandinavianNormalizationFilter except
it does not fold oo->ø and ao->å.
(janhoy, Robert Muir, Adrien Grand)
- LUCENE-9535: Improve DocumentsWriterPerThreadPool to prefer larger instances.
(Adrien Grand)
- LUCENE-10000: MultiCollectorManager now has parity with MultiCollector with respect to how it
handles CollectionTerminationException and setMinCompetitiveScore calls.
(Greg Miller)
- LUCENE-10019: Align file starts in CFS files to have proper alignment (8 bytes)
(Uwe Schindler)
- LUCENE-9662: Make CheckIndex concurrent by parallelizing index check across segments.
(Zach Chen, Mike McCandless, Dawid Weiss, Robert Muir)
- LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more efficiently retrieve FacetLabels for multiple
facet ordinals at once. This API is 2-4% faster than iteratively calling getPath.
The getPath API now throws an IAE instead of returning null if the ordinal is out of bounds.
(Gautam Worah, Mike McCandless)
- LUCENE-10113: Use VarHandles to access int/long/short primitive types in byte arrays.
This improves readability and performance of encoding/decoding of primitives to index
file format in input/output classes like DataInput / DataOutput and codecs.
(Uwe Schindler, Robert Muir)
- LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes.
(Tim Brooks, Uwe Schindler, Robert Muir, Adrien Grand)
- LUCENE-10125: Optimize primitive writes in OutputStreamIndexOutput.
(Uwe Schindler, Robert Muir, Adrien Grand)
- LUCENE-10143: Delegate primitive writes in RateLimitedIndexOutput.
(Uwe Schindler, Robert Muir, Adrien Grand)
- LUCENE-10145, LUCENE-10153: Faster flushes and merges of points by leveraging
VarHandles.
(Adrien Grand)
- LUCENE-10201: Spatial-Extras: Upgrade Spatial4j to 0.8, improving a variety of minor things.
See the release notes: https://github.com/locationtech/spatial4j/releases/tag/spatial4j-0.8
(David Smiley)
- LUCENE-10062: Switch taxonomy faceting to use numeric doc values for storing ordinals instead of binary doc values
with its own custom encoding.
(Greg Miller)
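The VarHandle-based primitive access mentioned in LUCENE-10113 can be sketched with the JDK's byte-array view handles. This is a minimal illustration, not Lucene's code: the class and method names are hypothetical, and little-endian order is assumed here to match the Directory API change noted in LUCENE-9047.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

/** Hypothetical sketch: read and write an int in a byte[] through a VarHandle. */
final class VarHandleBytesSketch {

  // A view of byte[] as little-endian ints; plain get/set work at any byte offset.
  private static final VarHandle INT_LE =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

  static void writeInt(byte[] buf, int offset, int value) {
    INT_LE.set(buf, offset, value);
  }

  static int readInt(byte[] buf, int offset) {
    return (int) INT_LE.get(buf, offset);
  }

  /** The equivalent hand-written decoding, shown for comparison. */
  static int readIntManually(byte[] buf, int offset) {
    return (buf[offset] & 0xFF)
        | ((buf[offset + 1] & 0xFF) << 8)
        | ((buf[offset + 2] & 0xFF) << 16)
        | ((buf[offset + 3] & 0xFF) << 24);
  }
}
```

Both read methods return the same value for the same bytes; the VarHandle version replaces four masked shifts with a single call.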
- Bug Fixes (15)
- LUCENE-9686: Fix read past EOF handling in DirectIODirectory.
(Zach Chen, Julie Tibshirani)
- LUCENE-8663: NRTCachingDirectory.slowFileExists may open a file while
it's inaccessible.
(Dawid Weiss)
- LUCENE-9117: RamUsageEstimator hangs with AOT compilation. Removed any attempt to
estimate Long.valueOf cache size.
(Cleber Muramoto, Dawid Weiss)
- LUCENE-9290: Don't assume that different XYPoint have different hash code
(Ignacio Vera via Mike Drob)
- LUCENE-9372: Fix paths for cygwin/msys before gradle wrapper jar lookup.
(Peter Barna)
- LUCENE-9365: FuzzyQuery was missing matches when prefix length was equal to the term length
(Mark Harwood, Mike Drob)
- LUCENE-9580: Fix bug in the polygon tessellator when introducing collinear edges during polygon
splitting.
(Ignacio Vera)
- LUCENE-9930: The Ukrainian analyzer was reloading its dictionary for every new
TokenStreamComponents, which could lead to memory leaks.
(Alan Woodward)
- LUCENE-9940: The order of disjuncts in DisjunctionMaxQuery does not matter
for equality checks
(Alan Woodward)
- LUCENE-9971: Requesting facet counts for unseen dimensions in SortedSetDocValueFacetCounts and
ConcurrentSortedSetDocValueFacetCounts now returns null / -1 instead of throwing
IllegalArgumentException as per Javadoc spec in Facets.
(Alexander Lukyanchikov)
- LUCENE-9823: Prevent unsafe rewrites for SynonymQuery and CombinedFieldQuery. Before, rewriting
could slightly change the scoring when weights were specified.
(Naoto Minami via Julie Tibshirani)
- LUCENE-10047: Fix a value de-duping bug in LongValueFacetCounts and RangeFacetCounts
(Greg Miller)
- LUCENE-10101, LUCENE-9281: Use getField() instead of getDeclaredField() to
minimize security impact by analysis SPI discovery.
(Uwe Schindler)
- LUCENE-10114: Remove unused byte order mark in Lucene90PostingsWriter. This
was initially introduced by accident in Lucene 8.4.
(Uwe Schindler)
- LUCENE-10140: Fix cases where minimizing interval iterators could return
incorrect matches
(Nikolay Khitrin, Alan Woodward)
- Changes in Backwards Compatibility Policy (3)
- LUCENE-9904: regenerated UAX29URLEmailTokenizer and the corresponding analyzer with up-to-date top
level domains. This may change the token sequence compared to previous Lucene versions.
(Dawid Weiss)
- LUCENE-9669: DirectoryReader#open now accepts an argument to open indices created with versions
older than N-1. Lucene now can open indices created with a major version of N-2 in read-only mode.
Opening an index created with a major version of N-2 with an IndexWriter is not supported.
Furthermore, Lucene only supports file-format compatibility, which enables reading of old indices, while
semantic changes like analysis or certain encodings on top of the file format are only supported on
a best-effort basis.
(Simon Willnauer)
- LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match.
(Greg Miller)
- Build (6)
- LUCENE-9077 LUCENE-9433: Support Gradle build, remove Ant support from trunk
(Dawid Weiss, Erick Erickson, Uwe Schindler et al.)
- LUCENE-8768: Fix Javadocs build in Java 11.
(Namgyu Kim)
- LUCENE-9544: add regenerate gradle script for nori dictionary
(Namgyu Kim)
- LUCENE-10195: Add gradle cache option and make some tasks cacheable.
(Jerome Prinet, Dawid Weiss)
- LUCENE-10198: Allow external JAVA_OPTS in gradlew scripts; use sane defaults
([email protected], Dawid Weiss)
- LUCENE-10163: Move LICENSE and NOTICE files to top level to satisfy src artifact requirements
(janhoy)
- Other (20)
- LUCENE-10122: Use NumericDocValues to store taxonomy parent array
(Patrick Zhai)
- LUCENE-10136: allow 'var' declarations in source code
(Dawid Weiss)
- LUCENE-9570, LUCENE-9564: Apply google java format and enforce it on source Java files.
Review diffs and correct automatic formatting oddities.
(Erick Erickson, Bruno Roustant, Dawid Weiss)
- LUCENE-9631: Properly override slice() on subclasses of OffsetRange.
(Dawid Weiss)
- LUCENE-9391: Upgrade HPPC to 0.8.2.
(Patrick Zhai)
- LUCENE-10021: Upgrade HPPC to 0.9.0. Replace usage of ...ScatterMap to ...HashMap.
(Patrick Zhai)
- LUCENE-9092: upgrade randomizedtesting to 2.7.5
(Dawid Weiss)
- LUCENE-8656: Deprecations in FuzzyQuery and get compiler warnings out of
queryparser code
(Alan Woodward, Erick Erickson)
- LUCENE-9344: Convert .txt files to properly formatted .md files.
(Tomoko Uchida, Uwe Schindler)
- LUCENE-9267: Update MatchingQueries documentation to correct
time unit.
(Pierre-Luc Perron via Mike Drob)
- LUCENE-9411: Fail compilation on warnings, 9x gradle-only (Erick Erickson, Dawid Weiss)
Deserves mention here as well as Lucene CHANGES.txt since it affects both.
- LUCENE-9215: Replace checkJavaDocs.py with doclet
(Robert Muir, Dawid Weiss, Uwe Schindler)
- LUCENE-9497: Integrate Error Prone, a static analysis tool during compilation
(Dawid Weiss, Varun Thacker)
- LUCENE-9627: Remove unused Lucene50FieldInfosFormat codec and slightly refactor some codecs
to separate reading header/footer from reading content of the file.
(Ignacio Vera)
- LUCENE-9773: Upgrade icu to 68.2
(Robert Muir)
- LUCENE-9822: Add assertion to PFOR exception encoding, documenting the BLOCK_SIZE assumption.
(Greg Miller)
- LUCENE-9883: Turn on ecj missingEnumCaseDespiteDefault setting.
(Zach Chen)
- LUCENE-9705: Make new versions of all index formats for the Lucene90 codec and move
the existing ones to the backwards codecs.
(Julie Tibshirani, Ignacio Vera)
- LUCENE-9907: Remove dependency on PackedInts#getReader() from the current codecs and move the
method to backwards codec.
(Ignacio Vera)
- LUCENE-10024: Catch NoSuchFileException when opening index directory with Luke.
(Michael Wechner, Tomoko Uchida)
- Bug Fixes (7)
- LUCENE-9580: Fix bug in the polygon tessellator when introducing collinear edges during polygon
splitting.
(Ignacio Vera)
- LUCENE-10470: Check if polygon has been successfully tessellated before we fail (we are failing some valid
tessellations) and allow filtering edges that fold on top of the previous one.
(Ignacio Vera)
- LUCENE-10563: Fix failure to tessellate complex polygon
(Craig Taverner)
- LUCENE-10678: Fix potential overflow when building a BKD tree with more than 4 billion points. The overflow
occurs when computing the partition point.
(Ignacio Vera)
- GITHUB#11986: Fix algorithm that chooses the bridge between a polygon and a hole when there is
common vertex.
(Ignacio Vera)
- GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain intersecting geometries.
(Craig Taverner)
- GITHUB#12352: [Tessellator] Improve the checks that validate the diagonal between two polygon nodes so
the resulting polygons are valid counter clockwise polygons.
(Ignacio Vera)
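The kind of overflow described in LUCENE-10678 can be illustrated with a generic example. The entry does not show the expression that was fixed, so the names and the formula below are assumptions: the sketch only demonstrates how a product of two int values wraps around before it is widened to long.

```java
/** Hypothetical sketch: int overflow when computing a partition point. */
final class PartitionOverflowSketch {

  /** Buggy variant: the multiplication happens in 32-bit arithmetic, then widens. */
  static long partitionPointOverflowing(int leftLeaves, int pointsPerLeaf) {
    return leftLeaves * pointsPerLeaf;
  }

  /** Fixed variant: widen one operand first so the product is computed as a long. */
  static long partitionPoint(int leftLeaves, int pointsPerLeaf) {
    return (long) leftLeaves * pointsPerLeaf;
  }
}
```

With 4,194,304 left leaves of 1,024 points each, the true product is 4,294,967,296, which does not fit in an int, so the first variant silently returns a wrong value while the second returns the correct count.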
- Optimizations (1)
- GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter
to reduce GC load during indexing.
(Guo Feng)
- Bug Fixes (2)
- LUCENE-10564: Make sure SparseFixedBitSet#or updates ramBytesUsed.
(Julie Tibshirani)
- LUCENE-10477: Highlighter: WeightedSpanTermExtractor.extractWeightedSpanTerms now calls Query#rewrite
multiple times if necessary.
(Christine Poerschke, Adrien Grand)
- Optimizations (1)
- LUCENE-10481: FacetsCollector will not request scores if it does not use them.
(Mike Drob)
- API Changes (1)
- (No changes)
- New Features (1)
- (No changes)
- Improvements (2)
- LUCENE-9662: Make CheckIndex concurrent by parallelizing index check across segments.
(Zach Chen, Mike McCandless, Dawid Weiss, Robert Muir)
- LUCENE-10103: Make QueryCache respect Accountable queries.
(Patrick Zhai)
- Optimizations (2)
- LUCENE-9673: Substantially improve RAM efficiency of how MemoryIndex stores
postings in memory, and reduced a bit of RAM overhead in
IndexWriter's internal postings book-keeping
(mashudong)
- LUCENE-10196: Improve IntroSorter with 3-ways partitioning.
(Bruno Roustant)
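For readers unfamiliar with the 3-way partitioning named in LUCENE-10196, the following is a minimal sketch of the general technique (Dijkstra's three-way partition) applied to a plain quicksort. It is not the IntroSorter code: the class name is hypothetical, the pivot choice is an assumption, and the introsort fallback to another algorithm on deep recursion is omitted.

```java
/** Hypothetical sketch: quicksort using 3-way (less / equal / greater) partitioning. */
final class ThreeWayPartitionSketch {

  static void sort(int[] a) {
    sort(a, 0, a.length - 1);
  }

  private static void sort(int[] a, int lo, int hi) {
    if (lo >= hi) {
      return;
    }
    int pivot = a[lo + (hi - lo) / 2];
    // Invariant: a[lo..lt-1] < pivot, a[lt..i-1] == pivot, a[gt+1..hi] > pivot.
    int lt = lo;
    int i = lo;
    int gt = hi;
    while (i <= gt) {
      if (a[i] < pivot) {
        swap(a, lt++, i++);
      } else if (a[i] > pivot) {
        swap(a, i, gt--);
      } else {
        i++;
      }
    }
    // Elements equal to the pivot are already in place and are not revisited.
    sort(a, lo, lt - 1);
    sort(a, gt + 1, hi);
  }

  private static void swap(int[] a, int x, int y) {
    int tmp = a[x];
    a[x] = a[y];
    a[y] = tmp;
  }
}
```

The benefit shows up on inputs with many duplicate keys, because the whole run of pivot-equal elements is excluded from both recursive calls.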
- Bug Fixes (6)
- LUCENE-10111: Fix missing accounting of the bytes used by DocsWithFieldSet in NormValuesWriter.
(Lu Xugang)
- LUCENE-10116: Fix missing accounting of the bytes used by DocsWithFieldSet and currentValues in SortedSetDocValuesWriter.
(Lu Xugang)
- LUCENE-10070: Skip deleted docs when accumulating facet counts for all docs.
(Ankur Goel, Greg Miller)
- LUCENE-10134: ConcurrentSortedSetDocValuesFacetCounts shouldn't share liveDocs Bits across threads.
(Ankur Goel)
- LUCENE-10154: NumericLeafComparator to define getPointValues.
(Mayya Sharipova, Adrien Grand)
- LUCENE-10208: Ensure that the minimum competitive score does not decrease in concurrent search.
(Jim Ferenczi, Adrien Grand)
- Build (1)
- LUCENE-10104, SOLR-15631: Upgrade forbiddenapis to version 3.2.
(Uwe Schindler)
- Other (1)
- LUCENE-10098: Add docs/links to GermanAnalyzer describing how to decompound nouns.
(Robert Muir)
- Bug Fixes (3)
- LUCENE-10110: MultiCollector now handles single leaf collector that wants to skip low-scoring hits
but the combined score mode doesn't allow it.
(Jim Ferenczi)
- LUCENE-10119: Sort optimization with search_after can wrongly skip documents
whose values are equal to the last value of the previous page
(Nhat Nguyen)
- LUCENE-10126: Sort optimization with a chunked bulk scorer
can wrongly skip documents
(Nhat Nguyen, Mayya Sharipova)
- API Changes (5)
- LUCENE-9962: DrillSideways allows sub-classes to provide "drill down" FacetsCollectors. They
may provide a null collector if they choose to bypass "drill down" facet collection.
(Greg Miller)
- LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be protected instead of private.
Users can now access the count of an ordinal directly without constructing an extra FacetLabel.
Also use variable length arguments for the getOrdinal call in TaxonomyReader.
(Gautam Worah)
- LUCENE-10036: Replaced the ScoreCachingWrappingScorer ctor with a static factory method that
ensures unnecessary wrapping doesn't occur.
(Greg Miller)
- LUCENE-10027: Add a new Directory reader open API from indexCommit and
a custom comparator for sorting leaf readers.
(Mayya Sharipova)
- LUCENE-7020: TieredMergePolicy#setMaxMergeAtOnceExplicit is deprecated
and the number of segments that get merged via explicit merges is unlimited
by default.
(Adrien Grand, Shawn Heisey)
- New Features (2)
- LUCENE-10083: Analyzer and stemmer for Telugu language
(Vinod Singh)
- LUCENE-10035: The SimpleText codec now writes skip lists.
(wuda via Adrien Grand)
- Improvements (12)
- LUCENE-9944: Allow DrillSideways users to provide their own CollectorManager without also requiring
them to provide an ExecutorService.
(Greg Miller)
- LUCENE-9946: Support for multi-value fields in LongRangeFacetCounts and
DoubleRangeFacetCounts.
(Greg Miller)
- LUCENE-9965: Added QueryProfilerIndexSearcher and ProfilerCollector to support debugging
query execution strategy and timing.
(Jack Conradson, Julie Tibshirani)
- LUCENE-9981: Operations.getCommonSuffix/Prefix(Automaton) is now much more
efficient, from a worst case exponential down to quadratic cost in the
number of states + transitions in the Automaton. These methods no longer
use the costly determinize method, removing the risk of
TooComplexToDeterminizeException
(Robert Muir, Mike McCandless)
- LUCENE-9981: Operations.determinize now throws TooComplexToDeterminizeException
based on too much "effort" spent determinizing rather than a precise state
count on the resulting returned automaton, to better handle adversarial
cases like det(rev(regexp("(.*a){2000}"))) that spend lots of effort but
result in smallish eventual returned automata.
(Robert Muir, Mike McCandless)
- LUCENE-9983: Stop sorting determinize powersets unnecessarily.
(Patrick Zhai)
- LUCENE-9177: ICUNormalizer2CharFilter no longer requires normalization-inert
characters as boundaries for incremental processing, vastly improving worst-case
performance.
(Michael Gibney)
- LUCENE-10030: Lazily evaluate score in DrillSidewaysScorer.doQueryFirstScoring
(Grigoriy Troitskiy)
- LUCENE-9945: Extend DrillSideways to support exposing FacetCollectors directly.
(Greg Miller, Sejal Pawar)
- LUCENE-10043: Decrease default for LRUQueryCache's skipCacheFactor to 10.
This prevents caching a query clause when it is much more expensive than
running the top-level query.
(Julie Tibshirani)
- LUCENE-5309: Optimize facet counting for single-valued SSDV / StringValueFacetCounts.
(Greg Miller)
- LUCENE-9917: The BEST_SPEED compression mode now trades more compression ratio
in exchange for faster reads.
(Adrien Grand)
- Optimizations (4)
- LUCENE-9996: Improved memory efficiency of IndexWriter's RAM buffer, in
particular in the case of many fields and many indexing threads.
(Adrien Grand)
- LUCENE-10022: Rewrite empty DisjunctionMaxQuery to MatchNoDocsQuery.
(David Harsha via Julie Tibshirani)
- LUCENE-10031: Slightly faster segment merging for sorted indices.
(Adrien Grand)
- LUCENE-10014: Lucene90DocValuesFormat was using too many bits per
value when compressing via gcd, unnecessarily wasting index storage.
(weizijun)
- Bug Fixes (12)
- LUCENE-9988: Fix DrillSideways correctness bug introduced in LUCENE-9944
(Greg Miller)
- LUCENE-9964: Duplicate long values in a document field should only be counted once when using SortedNumericDocValuesFields
(Gautam Worah)
- LUCENE-9999: CombinedFieldQuery can fail with an exception when document
is missing some fields.
(Jim Ferenczi, Julie Tibshirani)
- LUCENE-10020: DocComparator should not skip docs with the same docID on
multiple sorts with search after
(Mayya Sharipova, Julie Tibshirani)
- LUCENE-10026: Fix CombinedFieldQuery equals and hashCode, which ensures
query rewrites don't drop CombinedFieldQuery clauses.
(Julie Tibshirani)
- LUCENE-10039: Correct CombinedFieldQuery scoring when there is a single
field.
(Julie Tibshirani)
- LUCENE-10046: Counting bug fixed in StringValueFacetCounts.
(Greg Miller)
- LUCENE-9963: FlattenGraphFilter is now more robust when handling
incoming holes in the input token graph
(Geoff Lawson)
- LUCENE-10008: Respect ignoreCase in CommonGramsFilterFactory
(Vigya Sharma)
- LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached.
(Greg Miller, Zachary Chen)
- LUCENE-10081: KoreanTokenizer should check the max backtrace gap on whitespaces.
(Jim Ferenczi)
- LUCENE-10106: Sort optimization can wrongly skip the first document of
each segment
(Nhat Nguyen)
- Other (1)
- (No changes)
- API Changes (1)
- LUCENE-9680: IndexWriter#getFieldNames() method added to get fields present in index.
This method was removed in LUCENE-8909.
(Oren Ovadia)
- New Features (8)
- LUCENE-9507: Custom order for leaves in IndexReader and IndexWriter
(Mayya Sharipova, Mike McCandless, Jim Ferenczi)
- LUCENE-9575: PatternTypingFilter has been added to allow setting a type attribute on tokens based on
a configured set of regular expressions
(Gus Heck).
- LUCENE-9572: TypeAsSynonymFilter has been enhanced to support ignoring some types, and to allow
the generated synonyms to copy some or all flags from the original token
(Gus Heck)
- LUCENE-9574: A token filter to drop tokens that match all specified flags.
(Gus Heck, Uwe Schindler)
- LUCENE-9537: Added smoothingScore method and default implementation to
Scorable abstract class. The smoothing score allows scorers to calculate a
score for a document where the search term or subquery is not present. The
smoothing score acts like an idf so that documents that do not have terms or
subqueries that are more frequent in the index are not penalized as much as
documents that do not have less frequent terms or subqueries and prevents
scores which are the product of terms or subqueries from going to zero. Added
the implementation of the Indri AND and the IndriDirichletSimilarity from the
academic Indri search engine: http://www.lemurproject.org/indri.php.
(Cameron VandenBerg)
- LUCENE-9694: New tool for creating a deterministic index to enable benchmarking changes
on a consistent multi-segment index even when they require re-indexing.
(Patrick Zhai)
- LUCENE-9385: Add FacetsConfig option to control which drill-down
terms are indexed for a FacetLabel
(Zachary Chen)
- LUCENE-9950: New facet counting implementation for general string doc value fields
(SortedSetDocValues / SortedDocValues) not created through FacetsConfig
(Greg Miller)
- Improvements (5)
- LUCENE-9725: BM25FQuery was extended to handle similarities beyond BM25Similarity. It
was renamed to CombinedFieldQuery to reflect its more general scope.
(Julie Tibshirani)
- LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues.
(Jaison Bi via Bruno Roustant)
- LUCENE-9687: Hunspell support improvements: add API for spell-checking and suggestions, support compound words,
fix various behavior differences between Java and C++ implementations, improve performance
(Peter Gromov, Dawid Weiss)
- LUCENE-9877: Reduce index size by increasing allowable exceptions in PForUtil from 3 to 7.
(Greg Miller)
- LUCENE-9935: Enable bulk merge for stored fields with index sort.
(Robert Muir, Adrien Grand, Nhat Nguyen)
- Optimizations (2)
- LUCENE-9932: Performance improvement for BKD index building
(neoremind)
- LUCENE-9827: Speed up merging of stored fields and term vectors for smaller segments.
(Daniel Mitterdorfer, Dimitrios Liapis, Adrien Grand, Robert Muir)
- Bug Fixes (6)
- LUCENE-9791: BytesRefHash.equals/find is now thread safe, fixing a
Luwak/Monitor bug causing registered queries to sometimes fail to
match.
(Paweł Bugalski)
- LUCENE-9887: Fixed parameter use in RadixSelector.
(liupanfeng via Adrien Grand)
- LUCENE-9958: Fixed performance regression for boolean queries that configure a
minimum number of matching clauses.
(Adrien Grand, Matt Weber)
- LUCENE-9953: LongValueFacetCounts should count each document at most once when determining
the total count for a dimension. Prior to this fix, multi-value docs could contribute a > 1
count to the dimension count.
(Greg Miller)
- LUCENE-9967: Do not throw NullPointerException while trying to handle another exception in
ReplicaNode.start
(Steven Schlansker)
- LUCENE-9991: Fix edge case failure in TestStringValueFacetCounts
(Greg Miller)
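The LUCENE-9953 entry above distinguishes between counting values and counting documents. A minimal standalone sketch of that difference follows; it does not use the LongValueFacetCounts API, and the class and method names are hypothetical.

```java
final class DimensionCountSketch {
    /** Pre-fix style total: every value of every document adds one. */
    static int countPerValue(long[][] docs) {
        int total = 0;
        for (long[] docValues : docs) {
            total += docValues.length;
        }
        return total;
    }

    /** Post-fix style total: a document with at least one value adds exactly one. */
    static int countPerDocument(long[][] docs) {
        int total = 0;
        for (long[] docValues : docs) {
            if (docValues.length > 0) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Three documents: one multi-valued, one single-valued, one without values.
        long[][] docs = {{10L, 20L}, {20L}, {}};
        System.out.println("per value:    " + countPerValue(docs));
        System.out.println("per document: " + countPerDocument(docs));
    }
}
```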
- Other (4)
- LUCENE-9836: Removed the pure Maven build. It is no longer possible to build
artifacts using Maven (this feature was no longer working correctly). Due to
migration to Gradle for Lucene/Solr 9.0, the maintenance of the Maven build
was no longer reasonable. POM files are generated for deployment to Maven
Central only. Please use "ant generate-maven-artifacts" to produce and deploy
artifacts to any repository.
(Uwe Schindler, Dawid Weiss)
- LUCENE-9836: Migrate Maven tasks to use "maven-resolver-ant-tasks"
instead of the no longer maintained "maven-ant-tasks".
(Uwe Schindler)
- LUCENE-9985: Upgrade jetty to 9.4.41
(janhoy)
- LUCENE-9976: Fix WANDScorer assertion error.
(Zach Chen, Adrien Grand, Dawid Weiss)
- Bug Fixes (3)
- LUCENE-9870: Fix Circle2D intersectsLine t-value (distance) range clamp
(Jørgen Nystad)
- LUCENE-9744: NPE on a degenerate query in MinimumShouldMatchIntervalsSource$MinimumMatchesIterator.getSubMatches().
(Alan Woodward)
- LUCENE-9762: DoubleValuesSource.fromQuery (also used by FunctionScoreQuery.boostByQuery) could
throw an exception when the query implements TwoPhaseIterator and when the score is requested
repeatedly.
(David Smiley, hossman)
- New Features (5)
- LUCENE-9552: New LatLonPoint query that accepts an array of LatLonGeometries.
(Ignacio Vera)
- LUCENE-9641: LatLonPoint query support for spatial relationships.
(Ignacio Vera)
- LUCENE-9553: New XYPoint query that accepts an array of XYGeometries.
(Ignacio Vera)
- LUCENE-9378: Doc values now allow configuring how to trade compression for
retrieval speed.
(Adrien Grand)
- LUCENE-9413: Add CJKWidthCharFilter and its factory
(Tomoko Uchida)
- Improvements (3)
- LUCENE-9455: ExitableTermsEnum should sample timeout and interruption
check before calling next().
(Zach Chen via Bruno Roustant)
- LUCENE-9023: GlobalOrdinalsWithScore should not compute occurrences when the
provided min is 1.
(Jim Ferenczi)
- LUCENE-9675: Binary doc values fields now expose their configured compression mode
in the attributes of the field info.
(Jim Ferenczi)
- Optimizations (4)
- LUCENE-9536: Reduced memory usage for OrdinalMap when a segment has all
values.
(Julie Tibshirani via Adrien Grand)
- LUCENE-9021: QueryParser: re-use the LookaheadSuccess exception.
(Przemek Bruski via Mikhail Khludnev)
- LUCENE-9636: Faster decoding of postings for some numbers of bits per value.
(Guo Feng via Adrien Grand)
- LUCENE-9346: WANDScorer now supports queries that have a
`minimumNumberShouldMatch` configured.
(Xi Zachary Chen via Adrien Grand)
- Bug Fixes (8)
- LUCENE-9508: DocumentsWriter was only stalling threads for 1 second allowing
documents to be indexed even if the DocumentsWriter wasn't able to keep up flushing.
Unless IW can't make progress due to an ill-behaving DWPT this issue was barely
noticeable.
(Simon Willnauer)
- LUCENE-9581: Japanese tokenizer should discard the compound token instead of disabling the decomposition
of long tokens when discardCompoundToken is activated.
(Jim Ferenczi)
- LUCENE-9595: Make Component2D#withinPoint implementations consistent with ShapeQuery logic.
(Ignacio Vera)
- LUCENE-9606: Wrap boolean queries generated by shape fields with a Constant score query.
(Ignacio Vera)
- LUCENE-9635: BM25FQuery - Mask encoded norm long value in array lookup.
(Yilun Cui)
- LUCENE-9617: Fix per-field memory leak in IndexWriter.deleteAll(). Reset next available internal
field number to 0 on FieldInfos.clear(), to avoid wasting FieldInfo references.
(Michael Froh)
- LUCENE-9642: When encoding triangles in ShapeField, make sure generated triangles are CCW by rotating
triangle points before checking triangle orientation.
(Ignacio Vera)
- LUCENE-9661: Fix deadlock in TermsEnum.EMPTY that occurs when trying to initialize TermsEnum and BaseTermsEnum
at the same time
(Namgyu Kim)
- Other (2)
- SOLR-14995: Update Jetty to 9.4.34
(Mike Drob)
- LUCENE-9637: Removes some unused code and replaces the Point implementation on ShapeField/ShapeQuery
random tests.
(Ignacio Vera)
- API Changes (2)
- LUCENE-9437: Lucene's facet module's DocValuesOrdinalsReader.decode method
is now public, making it easier for applications to decode facet
ordinals into their corresponding labels
(Ankur Goel)
- LUCENE-9515: IndexingChain now accepts individual primitives rather than a
DocumentsWriterPerThread instance in order to create a new DocConsumer.
(Simon Willnauer)
- New Features (4)
- LUCENE-9386: RegExpQuery added case insensitive matching option.
(Mark Harwood)
- LUCENE-8962: Add IndexWriter merge-on-refresh feature to selectively merge
small segments on getReader, subject to a configurable timeout, to improve
search performance by reducing the number of small segments for searching.
(Simon Willnauer)
- LUCENE-9484: Allow sorting an index after it was created. With SortingCodecReader, existing
unsorted segments can be wrapped and merged into a fresh index using IndexWriter#addIndices
API.
(Simon Willnauer, Adrien Grand)
- LUCENE-9444: Add utility class to retrieve facet labels from the
taxonomy index for a facet field so such fields do not also have to
be redundantly stored
(Ankur Goel)
- Improvements (10)
- LUCENE-8574: Add a new ExpressionValueSource which will enforce only one value per name
per hit in dependencies, ExpressionFunctionValues will no longer
recompute already computed values
(Patrick Zhai)
- LUCENE-9416: Fix CheckIndex to print an invalid non-zero norm as
unsigned long when detecting corruption.
- LUCENE-9440: FieldInfo#checkConsistency called twice from Lucene50(60)FieldInfosFormat#read;
Removed the (redundant?) assert and do these checks for real.
(Yauheni Putsykovich)
- LUCENE-9446: In BooleanQuery rewrite, always remove MatchAllDocsQuery filter clauses
when possible.
(Julie Tibshirani)
- LUCENE-9501: Improve coverage for Asserting* test classes: make sure to handle singleton doc
values, and sometimes exercise Weight#scorer instead of Weight#bulkScorer for top-level
queries.
(Julie Tibshirani)
- LUCENE-9511: Include StoredFieldsWriter in DWPT accounting to ensure that its
heap consumption is taken into account when IndexWriter stalls or should flush
DWPTs.
(Simon Willnauer)
- LUCENE-9514: Include TermVectorsWriter in DWPT accounting to ensure that its
heap consumption is taken into account when IndexWriter stalls or should flush
DWPTs.
(Simon Willnauer)
- LUCENE-9523: In query shapes over shape fields, skip points while traversing the
BKD tree when the relationship with the document is already known.
(Ignacio Vera)
- LUCENE-9539: Use more compact datastructures to represent sorted doc-values in memory when
sorting a segment before flush and in SortingCodecReader.
(Simon Willnauer)
- LUCENE-9458: WordDelimiterGraphFilter should order tokens at the same position by endOffset to
emit longer tokens first. The same graph is produced.
(David Smiley)
- Optimizations (4)
- LUCENE-9395: ConstantValuesSource now shares a single DoubleValues
instance across all segments
(Tony Xu)
- LUCENE-9447, LUCENE-9486: Stored fields now get higher compression ratios on
highly compressible data.
(Adrien Grand)
- LUCENE-9373: FunctionMatchQuery now accepts a "matchCost" optimization hint.
(Maxim Glazkov, David Smiley)
- LUCENE-9510: Indexing with an index sort is now faster by not compressing
temporary representations of the data.
(Adrien Grand)
- Bug Fixes (6)
- LUCENE-9427: Fix a regression where the unified highlighter didn't produce
highlights on fuzzy queries that correspond to exact matches.
(Julie Tibshirani)
- LUCENE-9467: Fix NRTCachingDirectory to use Directory#fileLength to check if a file
already exists instead of opening an IndexInput on the file which might throw an AccessDeniedException
in some Directory implementations.
(Simon Willnauer)
- LUCENE-9501: Fix a bug in IndexSortSortedNumericDocValuesRangeQuery where it could violate the
DocIdSetIterator contract.
(Julie Tibshirani)
- LUCENE-9401: Include field in ComplexPhraseQuery's toString()
(Thomas Hecker via Munendra S N)
- LUCENE-9578: Fix TermRangeQuery when there is no upper bound and the lower
bound is an exclusive empty string. This would previously match no strings at
all while it should match all non-empty strings.
(Christoph Buescher via Adrien Grand)
- LUCENE-9524: Fix NPE in SpanWeight#explain when no scoring is required and
SpanWeight has null Similarity.SimScorer.
(Zach Chen)
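The range semantics described in the LUCENE-9578 entry above can be sketched as a plain predicate. This is an illustrative simplification rather than TermRangeQuery itself: the class name is hypothetical, `null` is assumed to mean "no bound", and terms are compared as Java strings instead of as bytes.

```java
final class TermRangeSketch {
    /** Returns true if term lies within the range; a null bound means unbounded. */
    static boolean matches(String term, String lower, boolean includeLower,
                           String upper, boolean includeUpper) {
        if (lower != null) {
            int cmp = term.compareTo(lower);
            if (cmp < 0 || (cmp == 0 && !includeLower)) {
                return false;
            }
        }
        if (upper != null) {
            int cmp = term.compareTo(upper);
            if (cmp > 0 || (cmp == 0 && !includeUpper)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Exclusive empty lower bound, no upper bound: every non-empty term matches.
        System.out.println(matches("lucene", "", false, null, false));
        System.out.println(matches("", "", false, null, false));
    }
}
```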
- Documentation (1)
- LUCENE-9424: Add a performance warning to AttributeSource.captureState javadocs
(Patrick Zhai)
- Changes in Runtime Behavior (1)
- LUCENE-9539: SortingCodecReader now doesn't cache doc values fields anymore. Previously, SortingCodecReader
used to cache all doc values fields after they were loaded into memory. This reader should only be used
to sort segments after the fact using IndexWriter#addIndices.
(Simon Willnauer)
- Other (3)
- LUCENE-9292: Refactor BKD point configuration into its own class.
(Ignacio Vera)
- LUCENE-9470: Make TestXYMultiPolygonShapeQueries more resilient for CONTAINS queries.
(Ignacio Vera)
- LUCENE-9512: Move LockFactory stress test to be a unit/integration
test.
(Uwe Schindler, Dawid Weiss, Robert Muir)
- Build (1)
- Upgrade forbiddenapis to version 3.1.
(Uwe Schindler)
- Bug Fixes (1)
- LUCENE-9478: Prevent DWPTDeleteQueue from referencing itself and leaking memory. The queue
passed an implicit this reference to the next queue instance on flush which leaked about 500 bytes
of memory on each full flush, commit or getReader call.
(Simon Willnauer)
- Bug Fixes (1)
- LUCENE-9443: The UnifiedHighlighter was closing the underlying reader when there were multiple term-vector fields.
This was a regression in 8.6.0.
(David Smiley, Chris Beer)
- API Changes (9)
- LUCENE-9265: SimpleFSDirectory is deprecated in favor of NIOFSDirectory.
(Yannick Welsch)
- LUCENE-9304: Removed ability to set DocumentsWriterPerThreadPool on IndexWriterConfig.
The DocumentsWriterPerThreadPool is a package-private final class which made it impossible
to customize.
(Simon Willnauer)
- LUCENE-9339: MergeScheduler#merge no longer accepts a parameter indicating whether a new merge was found.
(Simon Willnauer)
- LUCENE-9330: SortFields are now responsible for writing themselves into index headers if they
are used as index sorts.
(Alan Woodward, Uwe Schindler, Adrien Grand)
- LUCENE-9340: Deprecate SimpleBindings#add(SortField).
(Alan Woodward)
- LUCENE-9345: MergeScheduler is now decoupled from IndexWriter. Instead it accepts a MergeSource
interface that offers the basic methods to acquire pending merges, run the merge and do accounting
around it.
(Simon Willnauer)
- LUCENE-9349: QueryVisitor.consumeTermsMatching() now takes a
Supplier<ByteRunAutomaton> to enable queries that build large automata to
provide them lazily. TermInSetQuery switches to using this method
to report matching terms.
(Alan Woodward)
- LUCENE-9366: DocValues.emptySortedNumeric() no longer takes a maxDoc parameter
(Alan Woodward)
- LUCENE-7822: CodecUtil#checkFooter(IndexInput, Throwable) now throws a
CorruptIndexException if checksums mismatch or if checksums can't be verified.
(Martin Amirault, Adrien Grand)
- New Features (2)
- LUCENE-7889: Grouping by range based on values from DoubleValuesSource and LongValuesSource
(Alan Woodward)
- LUCENE-8962: Add IndexWriter merge-on-commit feature to selectively merge small segments on commit,
subject to a configurable timeout, to improve search performance by reducing the number of small
segments for searching
(Michael Froh, Mike Sokolov, Mike Mccandless, Simon Willnauer)
- Improvements (13)
- LUCENE-9276: Use same code-path for updateDocuments and updateDocument in IndexWriter and
DocumentsWriter.
(Simon Willnauer)
- LUCENE-9279: Update dictionary version for Ukrainian analyzer to 4.9.1
(Andriy Rysin via Dawid Weiss)
- LUCENE-8050: PerFieldDocValuesFormat should not get the DocValuesFormat on a field that has no doc values.
(David Smiley, Juan Rodriguez)
- LUCENE-9304: Removed ThreadState abstraction from DocumentsWriter which allows pooling of DWPT directly and
improves the approachability of the IndexWriter code.
(Simon Willnauer)
- LUCENE-9324: Add an ID to SegmentCommitInfo in order to compare commits for equality and make
snapshots incremental on generational files.
(Simon Willnauer, Mike Mccandless, Adrien Grand)
- LUCENE-9342: TotalHits' relation will be EQUAL_TO when the number of hits is lower than TopDocsCollector's numHits
(Tomás Fernández Löbbe)
- LUCENE-9353: Metadata of the terms dictionary moved to its own file, with the
`.tmd` extension. This allows checksums of metadata to be verified when
opening indices and helps save seeks when opening an index.
(Adrien Grand)
- LUCENE-9359: SegmentInfos#readCommit now always returns a
CorruptIndexException if the content of the file is invalid.
(Adrien Grand)
- LUCENE-9393: Make FunctionScoreQuery use ScoreMode.COMPLETE for creating the inner query weight when
ScoreMode.TOP_DOCS is requested.
(Tomás Fernández Löbbe)
- LUCENE-9392: Make FacetsConfig.DELIM_CHAR publicly accessible
(Ankur Goel)
- LUCENE-9397: UniformSplit supports encodable fields metadata.
(Bruno Roustant)
- LUCENE-9396: Improved truncation detection for points.
(Adrien Grand, Robert Muir)
- LUCENE-9402: Let MultiCollector handle minCompetitiveScore
(Tomás Fernández Löbbe, Adrien Grand)
- Optimizations (8)
- LUCENE-9254: UniformSplit keeps FST off-heap.
(Bruno Roustant)
- LUCENE-8103: DoubleValuesSource and QueryValueSource now use a TwoPhaseIterator if one is provided by the Query.
(Michele Palmia, David Smiley)
- LUCENE-9287: UsageTrackingQueryCachingPolicy no longer caches DocValuesFieldExistsQuery.
(Ignacio Vera)
- LUCENE-9286: FST.Arc.BitTable reads FST bytes directly. Arc is lightweight again and FSTEnum traversal faster.
(Bruno Roustant)
- LUCENE-7788: fail precommit on unparameterised log messages and examine for wasted work/objects
(Erick Erickson)
- LUCENE-9273: Speed up geometry queries by specialising Component2D spatial operations. Instead of using a generic
relate method for all relations, we use specialized methods for each one. In addition, the type of triangle is
computed at deserialization time, therefore we can be more selective when decoding points of a triangle.
(Ignacio Vera)
- LUCENE-9087: Always build trees with full leaves and lower the default value for maxPointsPerLeafNode to 512.
(Ignacio Vera)
- LUCENE-9148: Points now write their index in a separate file.
(Adrien Grand)
- Bug Fixes (14)
- LUCENE-9259: Fix wrong NGramFilterFactory argument name for preserveOriginal option
(Paul Pazderski)
- LUCENE-8849: DocValuesRewriteMethod.visit wasn't visiting its embedded query
(Michele Palmia, David Smiley)
- LUCENE-9258: DocTermsIndexDocValues assumed it was operating on a SortedDocValues (single valued) field when
it could be a multi-valued field used with a SortedSetSelector
(Michele Palmia)
- LUCENE-9164: Ensure IW processes all internal events before it closes itself on a rollback.
(Simon Willnauer, Nhat Nguyen, Dawid Weiss, Mike Mccandless)
- LUCENE-8908: Return default value from objectVal when doc doesn't match the query in QueryValueSource
(Bill Bell, hossman, Munendra S N, Michele Palmia)
- LUCENE-9133: Fix for potential NPE in TermFilteredPresearcher for empty fields
(Marvin Justice via Mike Drob)
- LUCENE-9309: Wait for #addIndexes merges when aborting merges.
(Simon Willnauer)
- LUCENE-9337: Ensure CMS updates its thread accounting datastructures consistently.
CMS today releases its lock after finishing a merge before it re-acquires it to update
the thread accounting datastructures. This causes threading issues where concurrently
finishing threads fail to pick up pending merges causing potential thread starvation on
forceMerge calls.
(Simon Willnauer)
- LUCENE-9314: Single-document monitor runs were using the less efficient MultiDocumentBatch
implementation.
(Pierre-Luc Perron, Alan Woodward)
- LUCENE-9362: Fix equality check in ExpressionValueSource#rewrite. This fixes rewriting of inner value sources.
(Dmitry Emets)
- LUCENE-9405: IndexWriter incorrectly calls closeMergeReaders twice when the merged segment is 100% deleted.
(Michael Froh, Simon Willnauer, Mike Mccandless, Mike Sokolov)
- LUCENE-9400: Tessellator might build illegal polygons when several holes share the same vertex.
(Ignacio Vera)
- LUCENE-9417: Tessellator might build illegal polygons when several holes are connected to the same
vertex.
(Ignacio Vera)
- LUCENE-9418: Fix ordered intervals over interleaved terms
(Alan Woodward)
- Other (12)
- LUCENE-9257: Always keep FST off-heap. FSTLoadMode, Reader attributes and openedFromWriter removed.
(Bruno Roustant)
- LUCENE-9272: Checksums of the terms index are now verified when
LeafReader#checkIntegrity is called rather than when opening the index.
(Adrien Grand)
- LUCENE-9270: Update Javadoc about normalizeEntry in the Kuromoji DictionaryBuilder.
(Namgyu Kim)
- LUCENE-9275: Make TestLatLonMultiPolygonShapeQueries more resilient for CONTAINS queries.
(Ignacio Vera)
- LUCENE-9244: Adjust TestLucene60PointsFormat#testEstimatePointCount2Dims so it does not fail when a point
is shared by multiple leaves.
(Ignacio Vera)
- LUCENE-9271: ByteBufferIndexInput was refactored to work on top of the
ByteBuffer API.
(Adrien Grand)
- LUCENE-9191: Make LineFileDocs's random seeking more efficient, making tests using LineFileDocs faster
(Robert Muir, Mike McCandless)
- LUCENE-9338: Refactors SimpleBindings to improve type safety and cycle detection
(Alan Woodward, Adrien Grand)
- LUCENE-9358: Change the way the multi-dimensional BKD tree builder generates the intermediate tree representation to be
equal to the one dimensional case to avoid unnecessary tree and leaves rotation.
(Ignacio Vera)
- LUCENE-9288: poll_mirrors.py release script can handle HTTPS mirrors.
(Ignacio Vera)
- LUCENE-9232: Fix or suppress 13 resource leak precommit warnings in lucene/replicator
(Andras Salamon via Erick Erickson)
- LUCENE-9398: Always keep BKD index off-heap. BKD reader does not implement Accountable any more.
(Ignacio Vera)
- Build (4)
- Upgrade forbiddenapis to version 3.0.1.
(Uwe Schindler)
- LUCENE-9376: Fix or suppress 20 resource leak precommit warnings in lucene/search
(Andras Salamon via Erick Erickson)
- LUCENE-9380: Fix auxiliary class warnings in Lucene
(Erick Erickson)
- LUCENE-9389: Enhance gradle logging calls validation: eliminate getMessage()
(Andras Salamon via Erick Erickson)
- Optimizations (1)
- LUCENE-9350: Partial reversion of LUCENE-9068; holding levenshtein automata on FuzzyQuery can end
up blowing up query caches which use query objects as cache keys, so building the automata is
now delayed to search time again.
(Alan Woodward, Mike Drob)
- Bug Fixes (1)
- LUCENE-9300: Fix corruption of the new gen field infos when doc values updates are applied on a segment created
externally and added to the index with IndexWriter#addIndexes(Directory).
(Jim Ferenczi, Adrien Grand)
- API Changes (9)
- LUCENE-9093: Not an API change but a change in behavior of the UnifiedHighlighter's LengthGoalBreakIterator that will
yield Passages sized a little differently due to the fact that the sizing pivot is now the center of the first match and
not its left edge.
- LUCENE-9116: PostingsWriterBase and PostingsReaderBase no longer support
setting a field's metadata via a `long[]`.
(Adrien Grand)
- LUCENE-9116: The FSTOrd postings format has been removed.
(Adrien Grand)
- LUCENE-8369: Remove obsolete spatial module.
(Nick Knize, David Smiley)
- LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes to core.
(Nick Knize)
- LUCENE-9218: XY geometries API works in float space.
(Ignacio Vera)
- LUCENE-9212: Intervals.multiterm() takes CompiledAutomaton rather than plain Automaton
(Alan Woodward)
- LUCENE-9150: Restore support for dynamic PlanetModel in spatial3d.
(Nick Knize)
- LUCENE-9171: QueryBuilder.newTermQuery() and .newSynonymQuery() now take boost parameters.
(Alessandro Benedetti, Alan Woodward)
- New Features (3)
- LUCENE-8903: Add LatLonShape and XYShape point query.
(Ignacio Vera)
- LUCENE-8707: Add LatLonShape and XYShape distance query.
(Ignacio Vera)
- LUCENE-9238: New XYPointField field and Queries for indexing, searching and sorting
cartesian points.
(Ignacio Vera)
- Improvements (12)
- LUCENE-9149: Increase data dimension limit in BKD.
(Nick Knize)
- LUCENE-9102: Add maxQueryLength option to DirectSpellchecker.
(Andy Webb via Bruno Roustant)
- LUCENE-9091: UnifiedHighlighter HTML escaping should only escape essentials
(Nándor Mátravölgyi)
- LUCENE-9105: UniformSplit postings format detects corrupted index and better handles IO exceptions.
(Bruno Roustant)
- LUCENE-9106: UniformSplit postings format allows extension of block/line serializers.
(Bruno Roustant)
- LUCENE-9093: UnifiedHighlighter's LengthGoalBreakIterator has a new fragmentAlignment option to better center the
first match in the passage. Also the sizing point now pivots at the center of the first match term and not its left
edge. This yields Passages that won't be identical to the previous behavior.
(Nándor Mátravölgyi, David Smiley)
- LUCENE-9153: Allow WhitespaceAnalyzer to set a maxTokenLength other than the default of 255
(Alan Woodward)
- LUCENE-9152: Improve line intersections with polygons when they are touching from the outside.
(Ignacio Vera)
- LUCENE-9123: Add new JapaneseTokenizer constructors with discardCompoundToken option that controls whether
the tokenizer emits original (compound) tokens when the mode is not NORMAL.
(Kazuaki Hiraga via Tomoko Uchida)
- LUCENE-9253: KoreanTokenizer now supports custom dictionaries (system, unknown).
(Namgyu Kim)
- LUCENE-9171: QueryBuilder can now use BoostAttributes on input token streams to selectively
boost particular terms or synonyms in parsed queries.
(Alessandro Benedetti, Alan Woodward)
- LUCENE-9298: Improve RAM accounting in BufferedUpdates when deleted doc IDs and terms are cleared.
(Yu Binglei, Simon Willnauer)
- Optimizations (10)
- LUCENE-9211: Add compression for Binary doc value fields.
(Mark Harwood)
- LUCENE-4702: Better compression of terms dictionaries.
(Adrien Grand)
- LUCENE-9228: Sort dvUpdates in the term order before applying if they all update a
single field to the same value. This optimization can reduce the flush time by around
20% for the docValues update use cases.
(Nhat Nguyen, Adrien Grand, Simon Willnauer)
- LUCENE-9245: Reduce AutomatonTermsEnum memory usage.
(Bruno Roustant, Robert Muir)
- LUCENE-9237: Faster UniformSplit intersect TermsEnum.
(Bruno Roustant)
- LUCENE-9260: LeafReader#checkIntegrity verifies checksums of CFS files.
(Adrien Grand)
- LUCENE-9068: FuzzyQuery builds its Automaton up-front
(Alan Woodward, Mike Drob)
- LUCENE-9113: Faster merging of SORTED/SORTED_SET doc values.
(Adrien Grand)
- LUCENE-9125: Optimize Automaton.step() with binary search and introduce Automaton.next().
(Bruno Roustant)
- LUCENE-9147: The index of stored fields and term vectors is now off-heap.
(Adrien Grand)
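As a rough illustration of the LUCENE-9125 entry above, stepping through a state's outgoing transitions can use binary search when the transitions are sorted by label range. The sketch below is not the Automaton class: the parallel-array layout, the class name, and the `-1` "no transition" result are assumptions made for the example.

```java
final class TransitionStepSketch {
    /**
     * Finds the destination state for label among transitions sorted by their
     * minimum label and covering non-overlapping [min, max] ranges.
     * Returns -1 if no transition accepts the label.
     */
    static int step(int[] mins, int[] maxs, int[] dests, int label) {
        int lo = 0;
        int hi = mins.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (label < mins[mid]) {
                hi = mid - 1; // label is left of this range
            } else if (label > maxs[mid]) {
                lo = mid + 1; // label is right of this range
            } else {
                return dests[mid]; // label falls inside this range
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] mins = {'a', 'k', 'x'};
        int[] maxs = {'c', 'm', 'z'};
        int[] dests = {1, 2, 3};
        System.out.println(step(mins, maxs, dests, 'l'));
        System.out.println(step(mins, maxs, dests, 'd'));
    }
}
```

Compared with a linear scan, the lookup cost grows logarithmically with the number of transitions leaving a state.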
- Bug Fixes (11)
- LUCENE-9084: Fix potential deadlock due to circular synchronization in AnalyzingInfixSuggester
(Paul Ward)
- LUCENE-9115: NRTCachingDirectory no longer caches files of unknown size.
(Adrien Grand)
- LUCENE-9144: Fix error message on OneDimensionBKDWriter when too many points are added to the writer.
(Ignacio Vera)
- LUCENE-9135: Make UniformSplit FieldMetadata counters long.
(Bruno Roustant)
- LUCENE-9200: Fix TieredMergePolicy to use double (not float) math to make its merging decisions, fixing
a corner-case bug uncovered by fun randomized tests
(Robert Muir, Mike McCandless)
- LUCENE-9099: Unordered and Ordered interval queries now correctly handle
repeated subterms - ordered intervals could supply an 'extra' minimized
interval, resulting in odd matches when combined with eg CONTAINS queries;
and unordered intervals would match duplicate subterms on the same position,
so a query for UNORDERED(foo, foo) would match a document containing 'foo'
only once.
(Alan Woodward)
- LUCENE-9250: Add support for Circle2d#intersectsLine around the dateline.
(Ignacio Vera)
- LUCENE-9243: Add fudge factor when creating a bounding box of a XYCircle.
(Ignacio Vera)
- LUCENE-9239: Circle2D#WithinTriangle detects properly if a triangle is Within distance.
(Ignacio Vera)
- LUCENE-9251: Fix bug in the polygon tessellator where edges with different value on #isEdgeFromPolygon
were not filtered out properly.
(Ignacio Vera)
- LUCENE-9263: Fix wrong transformation of distance in meters to radians in Geo3DPoint.
(Ignacio Vera)
- Other (6)
- LUCENE-9109: Backport some changes from master (except StackWalker) to improve
TestSecurityManager
(Uwe Schindler)
- LUCENE-9110: Backport refactored stack analysis in tests to use generalized
LuceneTestCase methods
(Uwe Schindler)
- LUCENE-9141: Simplify LatLonShapeXQuery API by adding a new abstract class called LatLonGeometry. Queries are
executed with input objects that extend this class.
(Ignacio Vera)
- LUCENE-9194: Simplify XYShapeXQuery API by adding a new abstract class called XYGeometry. Queries are
executed with input objects that extend this class.
(Ignacio Vera)
- LUCENE-9096: Simplification of CompressingTermVectorsWriter#flushOffsets.
(kkewwei via Adrien Grand)
- LUCENE-9225: Rectangle extends LatLonGeometry so it can be used in a geometry collection.
(Ignacio Vera)
- API Changes (1)
- LUCENE-9029: Deprecate SloppyMath toRadians/toDegrees in favor of Java Math.
(Jack Conradson via Adrien Grand)
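For code affected by the LUCENE-9029 deprecation above, the standard `java.lang.Math` methods can be called directly. A small sketch follows; the wrapper class and method names are hypothetical and exist only to make the example self-contained.

```java
final class AngleConversionSketch {
    /** Converts a latitude/longitude pair from degrees to radians with java.lang.Math. */
    static double[] toRadians(double latDegrees, double lonDegrees) {
        return new double[] {Math.toRadians(latDegrees), Math.toRadians(lonDegrees)};
    }

    /** Converts a latitude/longitude pair from radians back to degrees. */
    static double[] toDegrees(double latRadians, double lonRadians) {
        return new double[] {Math.toDegrees(latRadians), Math.toDegrees(lonRadians)};
    }

    public static void main(String[] args) {
        double[] radians = toRadians(90.0, 180.0);
        double[] degrees = toDegrees(radians[0], radians[1]);
        System.out.println(radians[0] + ", " + radians[1]);
        System.out.println(degrees[0] + ", " + degrees[1]);
    }
}
```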
- New Features (1)
- LUCENE-8620: Add CONTAINS support for LatLonShape and XYShape.
(Ignacio Vera)
- Improvements (7)
- LUCENE-9002: Skip costly caching clause in LRUQueryCache if it makes the query
many times slower.
(Guoqiang Jiang)
- LUCENE-9006: WordDelimiterGraphFilter's catenateAll token is now ordered before any token parts, like WDF did.
(David Smiley)
- LUCENE-9028: introducing Intervals.multiterm()
(Mikhail Khludnev)
- LUCENE-9018: ConcatenateGraphFilter now has a configurable separator.
(Stanislav Mikulchik, David Smiley)
- LUCENE-9036: ExitableDirectoryReader may interrupt scanning over DocValues
(Mikhail Khludnev)
- LUCENE-9062: QueryVisitor now has a consumeTermsMatching() method, allowing queries
that match a class of terms to pass a ByteRunAutomaton matching that class
back to the visitor.
(Alan Woodward, David Smiley)
- LUCENE-9073: IntervalQuery now reports its field in toString() and explain()
(Mikhail Khludnev)
- Optimizations (9)
- LUCENE-8928: When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits
to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.
(Ignacio Vera, Adrien Grand)
- BaseDirectoryReader no longer sums up the `LeafReader#numDocs` of its leaves
eagerly. This especially helps when creating views of readers that hide
documents, since computing the number of live documents is an expensive
operation.
(Adrien Grand)
- LUCENE-8992: TopFieldCollector and TopScoreDocCollector can now share minimum scores across leaves
concurrently.
(Adrien Grand, Atri Sharma, Jim Ferenczi)
- LUCENE-8932: BKDReader's index is now stored off-heap when the IndexInput is
an instance of ByteBufferIndexInput.
(Jack Conradson via Adrien Grand)
- LUCENE-9024: IntroSelector now falls back to the median of medians algorithm
instead of sorting when the maximum recursion level is exceeded, providing
better worst-case runtime.
(Paul Sanwald via Adrien Grand)
- LUCENE-8920: The denser arcs of FST now index labels with a bitset in order
to provide near constant time access.
(Bruno Roustant, Mike Sokolov via Adrien Grand)
- LUCENE-9027: Use SIMD instructions to decode postings.
(Adrien Grand)
- LUCENE-9049: Remove FST cached root arcs now redundant with labels indexed by bitset.
This frees some on-heap FST space.
(Jack Conradson via Bruno Roustant)
- LUCENE-9045: Do not use TreeMap/TreeSet in BlockTree and PerFieldPostingsFormat.
(Bruno Roustant)
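The fallback strategy named in the LUCENE-9024 entry above can be sketched in a standalone form: quickselect with a cheap pivot runs until a depth budget is used up, after which pivots come from the median of medians. This is not Lucene's IntroSelector; the class name, the budget of roughly 2*log2(n) rounds, and the use of `int[]` are assumptions for illustration. The array must be non-empty and `k` must be a valid index.

```java
final class IntroSelectSketch {
    /** Returns the k-th smallest element (0-based) of a, reordering a in place. */
    static int select(int[] a, int k) {
        int budget = 2 * (32 - Integer.numberOfLeadingZeros(a.length));
        return select(a, 0, a.length - 1, k, budget);
    }

    /** Selects within a[lo..hi] (inclusive); k is an absolute index in that range. */
    static int select(int[] a, int lo, int hi, int k, int budget) {
        while (lo < hi) {
            // Cheap middle pivot while the budget lasts, median of medians afterwards.
            int pivot = budget-- > 0 ? a[(lo + hi) >>> 1] : medianOfMedians(a, lo, hi);
            // Three-way partition: a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot.
            int lt = lo, i = lo, gt = hi;
            while (i <= gt) {
                if (a[i] < pivot) swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, gt--);
                else i++;
            }
            if (k < lt) hi = lt - 1;
            else if (k > gt) lo = gt + 1;
            else return a[k];
        }
        return a[lo];
    }

    /** Returns a pivot value taken from the medians of groups of five elements. */
    static int medianOfMedians(int[] a, int lo, int hi) {
        int n = hi - lo + 1;
        if (n <= 5) {
            insertionSort(a, lo, hi);
            return a[lo + (n - 1) / 2];
        }
        int medians = 0;
        for (int start = lo; start <= hi; start += 5) {
            int end = Math.min(start + 4, hi);
            insertionSort(a, start, end);
            // Collect each group's median at the front of the range.
            swap(a, lo + medians++, start + (end - start) / 2);
        }
        // Zero budget: the median of the medians is itself found deterministically.
        return select(a, lo, lo + medians - 1, lo + (medians - 1) / 2, 0);
    }

    static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int v = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > v) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

Calling `select(a, 0, a.length - 1, k, 0)` forces the median-of-medians path from the first round, which is a convenient way to exercise the fallback on its own.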
- Bug Fixes (7)
- LUCENE-9001: Fix race condition in SetOnce.
(Przemko Robakowski)
- LUCENE-9030: Fix WordnetSynonymParser behaviour so it behaves similar to
SolrSynonymParser.
(Christoph Buescher via Alan Woodward)
- LUCENE-9054: Fix reproduceJenkinsFailures.py to not overwrite junit XML files when retrying
(hossman)
- LUCENE-9031: UnsupportedOperationException on MatchesIterator.getQuery()
(Alan Woodward, Mikhail Khludnev)
- LUCENE-8996: maxScore was sometimes missing from distributed grouped responses.
(Julien Massenet, Diego Ceccarelli, Munendra S N, Christine Poerschke)
- LUCENE-9055: Fix the detection of lines crossing triangles through edge points.
(Ignacio Vera)
- LUCENE-9103: Disjunctions can miss some hits in some rare conditions.
(Adrien Grand)
- Other (6)
- LUCENE-8979: Code Cleanup: Use entryset for map iteration wherever possible. - Part 2
(Koen De Groote)
- LUCENE-8994: Code Cleanup - Pass values to list constructor instead of empty constructor followed by addAll().
(Koen De Groote)
- LUCENE-8746: Refactor EdgeTree - Introduce a Component tree that represents the tree of components (e.g. polygons).
Edge tree is now just a tree of edges.
(Ignacio Vera)
- LUCENE-9046: Fix wrong example in Javadoc of TermInSetQuery
(Namgyu Kim)
- LUCENE-8983: Add sandbox PhraseWildcardQuery to control multi-terms expansions in a phrase.
(Bruno Roustant)
- LUCENE-9067: Polygon2D#contains() is now thread safe.
(Ignacio Vera)
- Build (2)
- Upgrade forbiddenapis to version 2.7; upgrade Groovy to 2.4.17.
(Uwe Schindler)
- LUCENE-9041: Upgrade ecj to 3.19.0 to fix sporadic precommit javadoc issues
(Kevin Risden)
- Bug Fixes (1)
- LUCENE-9050: MultiTermIntervalsSource.visit() was not calling back to its
visitor.
(Alan Woodward)
- API Changes (5)
- LUCENE-8909: Deprecate IndexWriter#getFieldNames(), which returns the fields present in the index. After LUCENE-8316,
this method is no longer required.
(Adrien Grand, Munendra S N)
- LUCENE-8755: SpatialPrefixTreeFactory now consumes the "version" parsed with Lucene's Version class. The quad
and packed quad prefix trees are sensitive to this. It's recommended to pass the version, as you
should for analysis components for tokenized text; otherwise, changes to the encoding in future versions
may be incompatible with older indexes.
(Chongchen Chen, David Smiley)
- LUCENE-8956: QueryRescorer now only sorts the first topN hits instead of all
initial hits.
(Paul Sanwald via Adrien Grand)
- LUCENE-8921: IndexSearcher.termStatistics() no longer takes a TermStates; it takes the docFreq and totalTermFreq.
Do not call it if docFreq <= 0. The previous implementation survives as deprecated and final. It's removed in 9.0.
(Bruno Roustant, David Smiley, Alan Woodward)
- LUCENE-8990: PointValues#estimateDocCount(visitor) estimates the number of documents that would be matched by
the given IntersectVisitor. The method is used to compute the cost() of ScorerSuppliers instead of
PointValues#estimatePointCount(visitor).
(Ignacio Vera, Adrien Grand)
- New Features (6)
- LUCENE-8936: Add SpanishMinimalStemFilter
(vinod kumar via Tomoko Uchida)
- LUCENE-8764 LUCENE-8945: Add "export all terms and doc freqs" feature to Luke with delimiters.
(Leonardo Menezes, Amish Shah via Tomoko Uchida)
- LUCENE-8747: Composite Matches from multiple subqueries now allow access to
their submatches, and a new NamedMatches API allows marking of subqueries
and a simple way to find which subqueries have matched on a given document
(Alan Woodward, Jim Ferenczi)
- LUCENE-8769: Introduce Range Query For Multiple Connected Ranges
(Atri Sharma)
- LUCENE-8960: Introduce LatLonDocValuesPointInPolygonQuery for LatLonDocValuesField
(Ignacio Vera)
- LUCENE-8753: New UniformSplitPostingsFormat (name "UniformSplit") whose primary benefits are simplicity and
extensibility. New STUniformSplitPostingsFormat (name "SharedTermsUniformSplit") that shares a single internal
term dictionary across fields.
(Bruno Roustant, Juan Rodriguez, David Smiley)
- Improvements (15)
- LUCENE-8874: Show SPI names instead of class names in Luke Analysis tab.
(Tomoko Uchida)
- LUCENE-8894: Add APIs to find SPI names for Tokenizer/CharFilter/TokenFilter factory classes.
(Tomoko Uchida)
- LUCENE-8914: Move the logic for discarding inner nodes in FloatPointNearestNeighbor to the IntersectVisitor
so we take advantage of the change introduced in LUCENE-7862.
(Ignacio Vera)
- LUCENE-8955: Move the logic for discarding inner nodes in LatLonPoint NearestNeighbor to the IntersectVisitor
so we take advantage of the change introduced in LUCENE-7862.
(Ignacio Vera)
- LUCENE-8918: PhraseQuery throws exceptions at construction time if it is passed
null arguments.
(Alan Woodward)
- LUCENE-8916: GraphTokenStreamFiniteStrings preserves all Token attributes
through its finite strings TokenStreams
(Alan Woodward)
- LUCENE-8906: Expose Lucene50PostingsFormat.IntBlockTermState as public so that other postings formats can re-use it.
(Bruno Roustant)
- LUCENE-8942: Remove redundant parameters and improve visibility strictness in
LRUQueryCache
(Atri Sharma)
- SOLR-13663: Introduce <SpanPositionRange> into XML Query Parser
(Alessandro Benedetti via Mikhail Khludnev)
- LUCENE-8952: Use a sort key instead of true distance in NearestNeighbor
(Julie Tibshirani)
- LUCENE-8620: Tessellator labels each edge of the generated triangles according to whether it belongs to
the original polygon. This information is added to the triangle encoding.
(Ignacio Vera)
- LUCENE-8964: Fix geojson shape parsing on string arrays in properties
(Alexander Reelsen)
- LUCENE-8976: Use exact distance between point and bounding rectangle in FloatPointNearestNeighbor.
(Ignacio Vera)
- LUCENE-8966: The Korean analyzer now splits tokens on boundaries between digits and alphabetic characters.
(Jim Ferenczi)
- LUCENE-8984: MoreLikeThis MLT is biased for uncommon fields
(Andy Hind via Anshum Gupta)
- Optimizations (8)
- LUCENE-8922: DisjunctionMaxQuery more efficiently leverages impacts to skip
non-competitive hits.
(Adrien Grand)
- LUCENE-8935: BooleanQuery with no scoring clause can now early terminate the query when
the total hit count is not requested.
(Jim Ferenczi)
- LUCENE-8941: Matches on wildcard queries will defer building their full
disjunction until a MatchesIterator is pulled
(Alan Woodward)
- LUCENE-8755: spatial-extras quad and packed quad prefix trees now index points faster.
(Chongchen Chen, David Smiley)
- LUCENE-8860: add additional leaf node level optimizations in LatLonShapeBoundingBoxQuery.
(Igor Motov via Ignacio Vera)
- LUCENE-8968: Improve performance of WITHIN and DISJOINT queries for Shape queries by
doing just one pass whenever possible.
(Ignacio Vera)
- LUCENE-8939: Introduce shared count based early termination across multiple slices
(Atri Sharma)
- LUCENE-8980: Blocktree's seekExact now short-circuits to false if the term isn't in the min-max range of the segment.
Large performance gain for ID/time-like data when populated sequentially.
(Guoqiang Jiang)
- Bug Fixes (2)
- LUCENE-8755: spatial-extras quad and packed quad prefix trees could throw a
NullPointerException for certain cell edge coordinates
(Chongchen Chen, David Smiley)
- LUCENE-9005: BooleanQuery.visit() would pull subVisitors from its parent visitor, rather
than from a visitor for its own specific query. This could cause problems when BQ was
nested under another BQ. Instead, we now pull a MUST subvisitor, pass it to any MUST
subclauses, and then pull SHOULD, MUST_NOT and FILTER visitors from it rather than from
the parent.
(Alan Woodward)
- Other (7)
- LUCENE-8778 LUCENE-8911 LUCENE-8957: Define analyzer SPI names as static final fields and document the names in Javadocs.
(Tomoko Uchida, Uwe Schindler)
- LUCENE-8758: QuadPrefixTree: removed levelS and levelN fields which weren't used.
(Amish Shah)
- LUCENE-8975: Code Cleanup: Use entryset for map iteration wherever possible.
(Koen De Groote)
- LUCENE-8993, LUCENE-8807: Changed all repository and download references in build files
to HTTPS.
(Uwe Schindler)
- LUCENE-8998: Fix OverviewImplTest.testIsOptimized reproducible failure.
(Tomoko Uchida)
- LUCENE-8999: LuceneTestCase.expectThrows now propagates assert/assumption failures up to the test
without wrapping in a new assertion failure unless the caller has explicitly expected them
(hossman)
- LUCENE-8062: GlobalOrdinalsWithScoreQuery is no longer eligible for query caching.
(Jim Ferenczi)
- API Changes (3)
- LUCENE-8865: IndexSearcher now uses Executor instead of ExecutorService.
This change is fully backwards compatible since ExecutorService directly
extends Executor.
(Simon Willnauer)
- LUCENE-8856: Intervals queries have moved from the sandbox to the queries
module.
(Alan Woodward)
- LUCENE-8893: Intervals.wildcard() and Intervals.prefix() methods now take
BytesRef rather than String.
(Alan Woodward)
- New Features (10)
- LUCENE-8632: New XYShape Field and Queries for indexing and searching general cartesian
geometries.
(Nick Knize)
- LUCENE-8891: Snowball stemmer/analyzer for the Estonian language.
(Gert Morten Paimla via Tomoko Uchida)
- LUCENE-8815: Provide a DoubleValues implementation for retrieving the value of features without
requiring a separate numeric field. Note that as feature values are stored with only 8 bits of
mantissa, the values returned may differ slightly from the original values indexed.
(Colin Goodheart-Smithe via Adrien Grand)
- LUCENE-8803: Provide a FeatureSortField to allow sorting search hits by descending value of a
feature. This is exposed via the factory method FeatureField#newFeatureSort.
(Colin Goodheart-Smithe via Adrien