Skip to content

Conversation

@jpountz
Copy link
Contributor

@jpountz jpountz commented Apr 21, 2025

As per a recent bug (#14517), competitive iterators are hard to get right given how their state gets updated in place. This commit tries to make them more robust by extracting the logic of updating the state of a DocIdSetIterator to a shared class, which can then be tested on its own.

A side-effect is that it implements #intoBitSet on the doc competitive iterator and #docIDRunEnd on all competitive iterators.

As per a recent bug (apache#14517), competitive iterators are hard to get
right given how their state gets updated in place. This commit tries to
make them more robust by extracting the logic of updating the state of a
`DocIdSetIterator` to a shared class, which can then be tested on its
own.

A side-effect is that it implements `#intoBitSet` on the doc competitive
iterator and `#docIDRunEnd` on all competitive iterators.
Copy link
Contributor

@gf2121 gf2121 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification,

A side-effect is that it implements #intoBitSet on the doc competitive iterator and #docIDRunEnd on all competitive iterators.

By 'side-effect', do you mean another gain? I thought it means negative effects :)

public class MinDocIterator extends AbstractDocIdSetIterator {
final int segmentMinDoc;
final int maxDoc;
private DocIdSetIterator in = DocIdSetIterator.empty();
Copy link
Contributor

@gf2121 gf2121 Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It looks like UpdateableDocIdSetIterator always need to be updated after construction, would it be better to have a constructor taking a DocIdSetIterator?

@jpountz
Copy link
Contributor Author

jpountz commented Apr 22, 2025

By 'side-effect', do you mean another gain? I thought it means negative effects :)

Hopefully a gain, I still need to run benchmarks. This change also increases polymorphism, so I'm not sure if it will make things better or not.

@jpountz
Copy link
Contributor Author

jpountz commented Apr 24, 2025

I'm seeing a reproducible slowdown with this change:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      TermDTSort      399.37      (2.2%)      367.70      (2.3%)   -7.9% ( -12% -   -3%) 0.000
                   TermTitleSort       87.74      (4.2%)       83.02      (5.5%)   -5.4% ( -14% -    4%) 0.001
                IntervalsOrdered        2.31      (2.0%)        2.27      (7.1%)   -1.6% ( -10% -    7%) 0.324
                     CountPhrase        4.18      (2.2%)        4.13      (5.6%)   -1.3% (  -8% -    6%) 0.345
                        PKLookup      315.43      (5.0%)      312.54      (6.4%)   -0.9% ( -11% -   11%) 0.614
               TermDayOfYearSort      286.52      (1.8%)      284.23      (1.7%)   -0.8% (  -4% -    2%) 0.144
             CombinedAndHighHigh       11.51      (3.4%)       11.44      (5.9%)   -0.6% (  -9% -    8%) 0.690
              CombinedAndHighMed       38.75      (3.1%)       38.52      (5.2%)   -0.6% (  -8% -    7%) 0.663
                          Phrase       14.38      (1.5%)       14.30      (2.3%)   -0.6% (  -4% -    3%) 0.363
                 FilteredPrefix3      151.34      (3.7%)      150.70      (3.9%)   -0.4% (  -7% -    7%) 0.726
                 CountAndHighMed      308.56      (1.9%)      307.41      (1.5%)   -0.4% (  -3% -    3%) 0.498
                    FilteredTerm      159.62      (2.2%)      159.03      (2.3%)   -0.4% (  -4% -    4%) 0.606
                         Prefix3      159.86      (3.9%)      159.40      (4.1%)   -0.3% (  -7% -    8%) 0.822
                   TermMonthSort     3191.33      (3.5%)     3186.61      (3.8%)   -0.1% (  -7% -    7%) 0.899
                  FilteredIntNRQ      302.80      (1.2%)      302.51      (1.4%)   -0.1% (  -2% -    2%) 0.819
                    CombinedTerm       30.71      (4.7%)       30.69      (4.7%)   -0.1% (  -9% -    9%) 0.960
                FilteredOr3Terms      165.78      (1.3%)      165.83      (1.6%)    0.0% (  -2% -    3%) 0.951
         CountFilteredOrHighHigh      136.74      (1.0%)      136.79      (1.1%)    0.0% (  -1% -    2%) 0.906
                CountAndHighHigh      358.72      (2.2%)      358.91      (2.5%)    0.1% (  -4% -    4%) 0.944
          CountFilteredOrHighMed      148.20      (0.7%)      148.31      (0.8%)    0.1% (  -1% -    1%) 0.760
              FilteredOrHighHigh       67.56      (1.6%)       67.61      (1.9%)    0.1% (  -3% -    3%) 0.891
               FilteredOrHighMed      152.31      (1.0%)      152.49      (1.3%)    0.1% (  -2% -    2%) 0.755
                      DismaxTerm      477.31      (2.6%)      477.99      (2.5%)    0.1% (  -4% -    5%) 0.860
                  FilteredPhrase       33.05      (1.4%)       33.10      (1.7%)    0.1% (  -2% -    3%) 0.769
      FilteredOr2Terms2StopWords      147.00      (1.2%)      147.26      (1.6%)    0.2% (  -2% -    2%) 0.688
             FilteredOrStopWords       46.26      (2.0%)       46.34      (2.6%)    0.2% (  -4% -    4%) 0.799
                          Fuzzy1       99.78      (2.8%)       99.99      (3.3%)    0.2% (  -5% -    6%) 0.827
            FilteredAndStopWords       45.41      (2.1%)       45.50      (2.8%)    0.2% (  -4% -    5%) 0.785
              CombinedOrHighHigh       18.78      (2.9%)       18.82      (2.8%)    0.2% (  -5% -    6%) 0.813
                          IntNRQ      309.33      (1.1%)      310.04      (1.5%)    0.2% (  -2% -    2%) 0.580
               FilteredAnd3Terms      178.96      (1.7%)      179.39      (2.0%)    0.2% (  -3% -    4%) 0.683
                     CountOrMany       30.46      (1.9%)       30.55      (2.4%)    0.3% (  -3% -    4%) 0.656
                  FilteredOrMany       16.44      (1.5%)       16.49      (2.4%)    0.3% (  -3% -    4%) 0.628
     FilteredAnd2Terms2StopWords      177.94      (1.4%)      178.49      (1.6%)    0.3% (  -2% -    3%) 0.511
                          Fuzzy2       83.87      (2.4%)       84.14      (2.9%)    0.3% (  -4% -    5%) 0.704
               CombinedOrHighMed       71.67      (2.8%)       71.90      (2.9%)    0.3% (  -5% -    6%) 0.719
             CountFilteredOrMany       27.41      (1.8%)       27.51      (1.9%)    0.3% (  -3% -    4%) 0.558
                        Wildcard       91.57      (2.9%)       91.90      (2.9%)    0.4% (  -5% -    6%) 0.700
             FilteredAndHighHigh       65.89      (1.9%)       66.13      (2.2%)    0.4% (  -3% -    4%) 0.566
                      AndHighMed      133.85      (2.6%)      134.37      (2.1%)    0.4% (  -4% -    5%) 0.608
                 CountOrHighHigh      346.70      (2.1%)      348.06      (2.3%)    0.4% (  -3% -    4%) 0.573
              FilteredAndHighMed      125.52      (2.7%)      126.01      (2.7%)    0.4% (  -4% -    5%) 0.642
                          OrMany       18.95      (5.4%)       19.03      (4.8%)    0.5% (  -9% -   11%) 0.778
                     OrStopWords       32.30      (8.5%)       32.46      (8.0%)    0.5% ( -14% -   18%) 0.849
                 DismaxOrHighMed      167.23      (2.5%)      168.13      (1.9%)    0.5% (  -3% -    5%) 0.441
              Or2Terms2StopWords      157.02      (5.4%)      157.89      (4.8%)    0.6% (  -9% -   11%) 0.733
                  CountOrHighMed      362.34      (2.1%)      364.49      (1.5%)    0.6% (  -2% -    4%) 0.307
                       CountTerm     8625.42      (3.3%)     8677.74      (4.8%)    0.6% (  -7% -    9%) 0.644
                       And3Terms      169.76      (4.2%)      170.82      (4.8%)    0.6% (  -8% -   10%) 0.663
                AndMedOrHighHigh       65.46      (2.6%)       65.87      (2.6%)    0.6% (  -4% -    5%) 0.438
             CountFilteredPhrase       25.48      (2.8%)       25.64      (2.4%)    0.6% (  -4% -    5%) 0.434
                     AndHighHigh       42.72      (2.9%)       43.01      (2.4%)    0.7% (  -4% -    6%) 0.430
             And2Terms2StopWords      159.89      (3.9%)      160.98      (4.7%)    0.7% (  -7% -    9%) 0.619
                        Or3Terms      164.08      (5.3%)      165.21      (5.2%)    0.7% (  -9% -   11%) 0.680
                DismaxOrHighHigh      112.87      (2.5%)      113.71      (2.5%)    0.7% (  -4% -    5%) 0.347
                      OrHighHigh       51.15      (4.6%)       51.54      (3.8%)    0.8% (  -7% -    9%) 0.572
                 AndHighOrMedMed       46.46      (1.9%)       46.84      (1.4%)    0.8% (  -2% -    4%) 0.115
                            Term      431.42      (4.8%)      435.07      (4.2%)    0.8% (  -7% -   10%) 0.556
                    AndStopWords       29.45      (6.5%)       29.70      (7.3%)    0.9% ( -12% -   15%) 0.695
                       OrHighMed      188.50      (3.9%)      190.21      (3.4%)    0.9% (  -6% -    8%) 0.431
                      OrHighRare      255.64      (5.7%)      261.70      (6.3%)    2.4% (  -9% -   15%) 0.214

@jpountz
Copy link
Contributor Author

jpountz commented Apr 24, 2025

I could confirm that it's due to #docIdEnd() now being implemented on the competitive iterators. This triggers usage of DISIDocIdStream, which calls nextDoc() in a loop, and calling nextDoc() in a loop happens to be a slower way of iterating doc IDs than calling intoBitSet and then iterating set bits when blocks are encoded as a bit set.

I had initially introduced `DISIDocIdStream` to avoid introducing regressions
when `DenseConjunctionBulkScorer` started accepting single clauses. However,
benchmarks on apache#14532 suggested that going through `DISIDocIdStream` is slower
than loading docs into a bit set first and then iterating the bit set, when the
postings list has many of its blocks encoded as bit sets.

This makes sense, the way how `BitSetDocIdStream` iterates set bits saves a
number of operations compared with calling `FixedBitSet#nextSetBit` in a loop.

So I'm suggesting removing `DISIDocIdStream` for now for simplicity.
jpountz added a commit to jpountz/lucene that referenced this pull request Apr 24, 2025
I had initially introduced `DISIDocIdStream` to avoid introducing regressions
when `DenseConjunctionBulkScorer` started accepting single clauses. However,
benchmarks on apache#14532 suggested that going through `DISIDocIdStream` is slower
than loading docs into a bit set first and then iterating the bit set, when the
postings list has many of its blocks encoded as bit sets.

This makes sense, the way how `BitSetDocIdStream` iterates set bits saves a
number of operations compared with calling `FixedBitSet#nextSetBit` in a loop.

So I'm suggesting removing `DISIDocIdStream` for now for simplicity.
@jpountz
Copy link
Contributor Author

jpountz commented Apr 24, 2025

Benchmark results with #14550 applied:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                     OrStopWords       33.75      (8.8%)       32.53      (7.2%)   -3.6% ( -18% -   13%) 0.399
                IntervalsOrdered        2.29      (4.1%)        2.23      (5.3%)   -2.4% ( -11% -    7%) 0.334
                    AndStopWords       30.50      (5.6%)       29.81      (5.9%)   -2.3% ( -13% -    9%) 0.462
                          OrMany       19.52      (4.4%)       19.15      (5.7%)   -1.9% ( -11% -    8%) 0.483
              Or2Terms2StopWords      163.00      (5.5%)      159.95      (4.7%)   -1.9% ( -11% -    8%) 0.492
                        Or3Terms      169.56      (5.8%)      166.70      (4.8%)   -1.7% ( -11% -    9%) 0.553
                      OrHighHigh       52.71      (5.7%)       51.88      (3.8%)   -1.6% ( -10% -    8%) 0.542
                       And3Terms      175.92      (4.0%)      173.27      (4.4%)   -1.5% (  -9% -    7%) 0.504
             And2Terms2StopWords      165.34      (3.5%)      162.88      (3.9%)   -1.5% (  -8% -    6%) 0.456
                DismaxOrHighHigh      117.18      (2.3%)      115.70      (1.6%)   -1.3% (  -5% -    2%) 0.236
                            Term      441.32      (3.6%)      435.76      (4.9%)   -1.3% (  -9% -    7%) 0.586
                       OrHighMed      194.68      (4.8%)      192.28      (3.2%)   -1.2% (  -8% -    7%) 0.569
                 AndHighOrMedMed       47.26      (0.9%)       46.68      (0.9%)   -1.2% (  -2% -    0%) 0.011
                 DismaxOrHighMed      171.78      (2.7%)      170.05      (1.8%)   -1.0% (  -5% -    3%) 0.408
              CombinedOrHighHigh       19.21      (1.1%)       19.03      (0.9%)   -0.9% (  -2% -    1%) 0.074
                    CombinedTerm       31.67      (4.4%)       31.40      (3.9%)   -0.8% (  -8% -    7%) 0.703
               CombinedOrHighMed       73.65      (0.7%)       73.04      (0.9%)   -0.8% (  -2% -    0%) 0.059
             CombinedAndHighHigh       11.61      (0.8%)       11.51      (1.5%)   -0.8% (  -3% -    1%) 0.211
                          Fuzzy1      101.38      (1.6%)      100.61      (2.3%)   -0.8% (  -4% -    3%) 0.462
              CombinedAndHighMed       39.16      (1.2%)       38.87      (1.1%)   -0.8% (  -3% -    1%) 0.236
                          Fuzzy2       85.09      (1.5%)       84.46      (2.1%)   -0.7% (  -4% -    2%) 0.444
             FilteredOrStopWords       47.30      (1.6%)       47.01      (2.2%)   -0.6% (  -4% -    3%) 0.550
                  FilteredPhrase       33.45      (1.4%)       33.25      (1.4%)   -0.6% (  -3% -    2%) 0.439
                     AndHighHigh       43.56      (0.9%)       43.36      (1.4%)   -0.5% (  -2% -    1%) 0.460
                CountAndHighHigh      359.83      (1.3%)      358.29      (3.2%)   -0.4% (  -4% -    4%) 0.743
                         Prefix3      160.80      (2.7%)      160.13      (5.2%)   -0.4% (  -8% -    7%) 0.851
                 CountAndHighMed      309.90      (2.0%)      308.69      (1.6%)   -0.4% (  -3% -    3%) 0.682
               FilteredAnd3Terms      183.33      (2.0%)      182.64      (1.6%)   -0.4% (  -3% -    3%) 0.700
                      DismaxTerm      486.91      (1.4%)      485.15      (1.9%)   -0.4% (  -3% -    3%) 0.688
                      OrHighRare      274.71      (2.8%)      273.73      (4.4%)   -0.4% (  -7% -    7%) 0.857
            FilteredAndStopWords       46.11      (1.6%)       45.95      (1.8%)   -0.4% (  -3% -    3%) 0.701
              FilteredAndHighMed      128.57      (2.5%)      128.13      (2.6%)   -0.3% (  -5% -    4%) 0.801
                AndMedOrHighHigh       66.66      (0.9%)       66.43      (1.8%)   -0.3% (  -3% -    2%) 0.664
                   TermMonthSort     3290.83      (2.4%)     3281.44      (3.0%)   -0.3% (  -5% -    5%) 0.846
                     CountPhrase        4.21      (1.8%)        4.20      (4.4%)   -0.3% (  -6% -    6%) 0.882
          CountFilteredOrHighMed      149.22      (0.5%)      148.84      (0.7%)   -0.3% (  -1% -    0%) 0.449
      FilteredOr2Terms2StopWords      149.95      (1.2%)      149.59      (1.4%)   -0.2% (  -2% -    2%) 0.729
                      AndHighMed      136.50      (0.8%)      136.19      (0.8%)   -0.2% (  -1% -    1%) 0.616
                FilteredOr3Terms      168.18      (1.1%)      167.82      (1.5%)   -0.2% (  -2% -    2%) 0.751
                    FilteredTerm      160.37      (1.8%)      160.11      (1.1%)   -0.2% (  -3% -    2%) 0.838
                  CountOrHighMed      369.17      (1.4%)      368.63      (2.5%)   -0.1% (  -3% -    3%) 0.893
                 FilteredPrefix3      151.55      (2.8%)      151.43      (5.1%)   -0.1% (  -7% -    8%) 0.970
         CountFilteredOrHighHigh      137.62      (0.6%)      137.52      (0.8%)   -0.1% (  -1% -    1%) 0.845
     FilteredAnd2Terms2StopWords      181.24      (1.4%)      181.14      (1.2%)   -0.1% (  -2% -    2%) 0.939
                     CountOrMany       30.54      (2.1%)       30.52      (2.7%)   -0.1% (  -4% -    4%) 0.968
               FilteredOrHighMed      154.31      (1.1%)      154.23      (1.0%)   -0.0% (  -2% -    1%) 0.930
              FilteredOrHighHigh       68.64      (0.9%)       68.62      (1.2%)   -0.0% (  -2% -    2%) 0.962
                          Phrase       14.33      (1.9%)       14.33      (2.6%)   -0.0% (  -4% -    4%) 0.998
             FilteredAndHighHigh       66.83      (1.2%)       66.92      (1.6%)    0.1% (  -2% -    2%) 0.859
                        PKLookup      317.97      (5.1%)      318.50      (5.4%)    0.2% (  -9% -   11%) 0.952
                  FilteredIntNRQ      307.53      (0.9%)      308.09      (1.4%)    0.2% (  -2% -    2%) 0.778
               TermDayOfYearSort      290.78      (1.7%)      291.49      (2.8%)    0.2% (  -4% -    4%) 0.845
                  FilteredOrMany       16.73      (2.4%)       16.78      (1.8%)    0.3% (  -3% -    4%) 0.792
                        Wildcard       93.62      (3.1%)       93.92      (3.2%)    0.3% (  -5% -    6%) 0.847
             CountFilteredPhrase       26.00      (1.0%)       26.10      (1.8%)    0.4% (  -2% -    3%) 0.611
                 CountOrHighHigh      348.58      (1.2%)      350.68      (2.3%)    0.6% (  -2% -    4%) 0.545
                          IntNRQ      312.75      (1.2%)      314.80      (1.4%)    0.7% (  -1% -    3%) 0.338
                       CountTerm     8526.58      (2.2%)     8590.95      (4.1%)    0.8% (  -5% -    7%) 0.667
             CountFilteredOrMany       27.47      (1.7%)       27.69      (1.7%)    0.8% (  -2% -    4%) 0.376
                      TermDTSort      406.16      (2.8%)      414.94      (2.0%)    2.2% (  -2% -    7%) 0.098
                   TermTitleSort       87.88      (3.6%)       91.14      (4.6%)    3.7% (  -4% -   12%) 0.094

@github-actions
Copy link
Contributor

github-actions bot commented May 9, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 9, 2025
@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot removed the Stale label May 16, 2025
jpountz added a commit that referenced this pull request May 20, 2025
I had initially introduced `DISIDocIdStream` to avoid introducing regressions
when `DenseConjunctionBulkScorer` started accepting single clauses. However,
benchmarks on #14532 suggested that going through `DISIDocIdStream` is slower
than loading docs into a bit set first and then iterating the bit set, when the
postings list has many of its blocks encoded as bit sets.

This makes sense, the way how `BitSetDocIdStream` iterates set bits saves a
number of operations compared with calling `FixedBitSet#nextSetBit` in a loop.

So I'm suggesting removing `DISIDocIdStream` for now for simplicity.
@jpountz jpountz merged commit 81f2517 into apache:main May 20, 2025
7 checks passed
@jpountz jpountz deleted the robust_competitive_iterators branch May 20, 2025 21:12
jpountz added a commit that referenced this pull request May 20, 2025
I had initially introduced `DISIDocIdStream` to avoid introducing regressions
when `DenseConjunctionBulkScorer` started accepting single clauses. However,
benchmarks on #14532 suggested that going through `DISIDocIdStream` is slower
than loading docs into a bit set first and then iterating the bit set, when the
postings list has many of its blocks encoded as bit sets.

This makes sense, the way how `BitSetDocIdStream` iterates set bits saves a
number of operations compared with calling `FixedBitSet#nextSetBit` in a loop.

So I'm suggesting removing `DISIDocIdStream` for now for simplicity.
jpountz added a commit that referenced this pull request May 20, 2025
As per a recent bug (#14517), competitive iterators are hard to get
right given how their state gets updated in place. This commit tries to
make them more robust by extracting the logic of updating the state of a
`DocIdSetIterator` to a shared class, which can then be tested on its
own.

A side-effect is that it implements `#intoBitSet` on the doc competitive
iterator and `#docIDRunEnd` on all competitive iterators.
weizijun added a commit to weizijun/lucene that referenced this pull request May 27, 2025
* main: (32 commits)
  update os.makedirs with pathlib mkdir (apache#14710)
  Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674)
  Implement #docIDRunEnd() on PostingsEnum. (apache#14693)
  Speed up TermQuery (apache#14709)
  Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701)
  Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706)
  Use a temporary repository location to download certain ecj versions ("drops") (apache#14703)
  Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696)
  Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700)
  Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695)
  Speed up exhaustive evaluation. (apache#14679)
  Specify and test that IOContext is immutable (apache#14686)
  deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691)
  deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692)
  Clean up how the test framework creates asserting scorables. (apache#14452)
  Make competitive iterators more robust. (apache#14532)
  Remove DISIDocIdStream. (apache#14550)
  Implement AssertingPostingsEnum#intoBitSet. (apache#14675)
  Fix patience knn queries to work with seeded knn queries (apache#14688)
  Added toString() method to BytesRefBuilder (apache#14676)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants