Make competitive iterators more robust. #14532
As per a recent bug (apache#14517), competitive iterators are hard to get right given how their state gets updated in place. This commit tries to make them more robust by extracting the logic of updating the state of a `DocIdSetIterator` to a shared class, which can then be tested on its own. A side-effect is that it implements `#intoBitSet` on the doc competitive iterator and `#docIDRunEnd` on all competitive iterators.
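To illustrate the idea of isolating the in-place update logic in one small, separately testable class, here is a minimal self-contained sketch. It does not use Lucene's classes: `SimpleDocIterator`, `SortedArrayIterator` and `UpdatableIterator` are hypothetical names invented for this example, and the actual shared class in this PR may behave differently. The sketch assumes the wrapper keeps its own current doc ID when the delegate is swapped, so a new delegate can never move the iterator backwards.

```java
/** Hypothetical, simplified stand-in for a doc ID iterator; NOT Lucene's DocIdSetIterator. */
interface SimpleDocIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Current doc ID, -1 before iteration starts, NO_MORE_DOCS once exhausted. */
  int docID();

  /** Moves to the first doc ID that is greater than or equal to target. */
  int advance(int target);
}

/** Iterates over a sorted array of doc IDs (assumption: input is sorted, no duplicates). */
final class SortedArrayIterator implements SimpleDocIterator {
  private final int[] docs;
  private int idx = -1;

  SortedArrayIterator(int... docs) {
    this.docs = docs;
  }

  @Override
  public int docID() {
    if (idx < 0) {
      return -1;
    }
    return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
  }

  @Override
  public int advance(int target) {
    if (idx < 0) {
      idx = 0;
    }
    while (idx < docs.length && docs[idx] < target) {
      idx++;
    }
    return docID();
  }
}

/** Wrapper whose delegate can be replaced while iteration is in progress. */
final class UpdatableIterator implements SimpleDocIterator {
  private SimpleDocIterator in;
  private int doc = -1; // position owned by the wrapper, not by the delegate

  UpdatableIterator(SimpleDocIterator in) {
    this.in = in;
  }

  /** Swaps the delegate; the wrapper's current doc ID is intentionally left unchanged. */
  void update(SimpleDocIterator newIn) {
    this.in = newIn;
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int advance(int target) {
    int cur = in.docID();
    if (cur < target) {
      // A freshly installed delegate may be positioned behind us: catch it up first.
      cur = in.advance(target);
    }
    doc = cur;
    return doc;
  }

  int nextDoc() {
    return doc == NO_MORE_DOCS ? doc : advance(doc + 1);
  }
}
```

With this shape, the "what happens when the delegate changes mid-iteration" behavior lives in one place and can be unit-tested without involving any collector or comparator.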
gf2121
left a comment
I like the simplification.
> A side-effect is that it implements `#intoBitSet` on the doc competitive iterator and `#docIDRunEnd` on all competitive iterators.

By 'side-effect', do you mean another gain? I thought it meant a negative effect :)
public class MinDocIterator extends AbstractDocIdSetIterator {
  final int segmentMinDoc;
  final int maxDoc;
  private DocIdSetIterator in = DocIdSetIterator.empty();
Nit: It looks like `UpdateableDocIdSetIterator` always needs to be updated after construction. Would it be better to have a constructor that takes a `DocIdSetIterator`?
Hopefully a gain, I still need to run benchmarks. This change also increases polymorphism, so I'm not sure if it will make things better or not.

I'm seeing a reproducible slowdown with this change:

I could confirm that it's due to
I had initially introduced `DISIDocIdStream` to avoid introducing regressions when `DenseConjunctionBulkScorer` started accepting single clauses. However, benchmarks on apache#14532 suggested that going through `DISIDocIdStream` is slower than loading docs into a bit set first and then iterating the bit set, when the postings list has many of its blocks encoded as bit sets. This makes sense: the way `BitSetDocIdStream` iterates set bits saves a number of operations compared with calling `FixedBitSet#nextSetBit` in a loop. So I'm suggesting removing `DISIDocIdStream` for now for simplicity.
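For readers unfamiliar with why iterating a bit set word by word can need fewer operations than repeated `nextSetBit` calls, here is a small self-contained sketch. It is not the code of `BitSetDocIdStream` or `FixedBitSet`: the class `SetBitIteration`, its method names and the `base` parameter are hypothetical, and `java.util.BitSet` stands in for Lucene's bit set only to show the call pattern. Each `nextSetBit` call has to locate the word and mask it again, while the word-at-a-time loop loads every word once and peels off its lowest set bit until the word is empty.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical helper comparing two ways of enumerating the set bits of a long[]. */
final class SetBitIteration {

  /** Baseline: one nextSetBit call per set bit (java.util.BitSet used as a stand-in). */
  static List<Integer> viaNextSetBit(long[] words, int base) {
    BitSet bits = BitSet.valueOf(words);
    List<Integer> docs = new ArrayList<>();
    for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
      docs.add(base + i);
    }
    return docs;
  }

  /** Word at a time: load each 64-bit word once and consume its set bits. */
  static List<Integer> wordAtATime(long[] words, int base) {
    List<Integer> docs = new ArrayList<>();
    for (int w = 0; w < words.length; w++) {
      long word = words[w];
      while (word != 0) {
        int bit = Long.numberOfTrailingZeros(word); // index of the lowest set bit
        docs.add(base + (w << 6) + bit);            // w * 64 + bit, offset by base
        word &= word - 1;                           // clear the lowest set bit
      }
    }
    return docs;
  }
}
```

Both methods return the same doc IDs in the same order; only the amount of per-bit bookkeeping differs. Whether this translates into a measurable speedup depends on the workload, which is what the benchmarks in this thread were checking.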
Benchmark results with #14550 applied:
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
* main: (32 commits)
  * update os.makedirs with pathlib mkdir (apache#14710)
  * Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674)
  * Implement #docIDRunEnd() on PostingsEnum. (apache#14693)
  * Speed up TermQuery (apache#14709)
  * Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701)
  * Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706)
  * Use a temporary repository location to download certain ecj versions ("drops") (apache#14703)
  * Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696)
  * Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700)
  * Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695)
  * Speed up exhaustive evaluation. (apache#14679)
  * Specify and test that IOContext is immutable (apache#14686)
  * deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691)
  * deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692)
  * Clean up how the test framework creates asserting scorables. (apache#14452)
  * Make competitive iterators more robust. (apache#14532)
  * Remove DISIDocIdStream. (apache#14550)
  * Implement AssertingPostingsEnum#intoBitSet. (apache#14675)
  * Fix patience knn queries to work with seeded knn queries (apache#14688)
  * Added toString() method to BytesRefBuilder (apache#14676)
  * ...