-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Optimize AbstractKnnVectorQuery#createBitSet with intoBitset #14674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
| } | ||
| } | ||
| }; | ||
| return BitSet.of(filterIterator, maxDoc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of implementing intoBitSet on the iterator, we could do BitSet bitSet = BitSet.of(iterator, maxDoc); and then clear deleted docs with liveDocs.applyMask(bitSet, 0);? Ie. there wouldn't be a FilterDocIdSetIterator anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for feedback!
This is what i had in mind initially. The problem here is applyMask is requiring a FixedBitSet while BitSet#of may return a SparseFixedBitSet. So i have to fork the Bits#applyMask code for the sparse case, which could look like:
BitSet bitSet = BitSet.of(iterator, maxDoc);
if (liveDocs != null) {
if (liveDocs instanceof FixedBitSet fixedBitSet) {
liveDocs.applyMask(fixedBitSet, 0);
} else {
for (int i = bitSet.nextSetBit(0);
i != DocIdSetIterator.NO_MORE_DOCS;
i = i + 1 >= bitSet.length() ? DocIdSetIterator.NO_MORE_DOCS : bitSet.nextSetBit(i + 1)) {
if (bitSet.get(i) == false) {
bitSet.clear(i);
}
}
}
}
return bitSet;
Or another potential way i'm thinking is making Bits#applyMask accept a BitSet. Default impl works fine but specialized impls need an instanceof FixedBitset check.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is annoying... It's not great either but I guess I'd like to do the SparseFixedBitSet/FixedBitSet decision up-front, something like:
int threshold = maxDoc >> 7; // like BitSet#of
if (it.cost() >= threshold) {
// Create a FixedBitSet in an optimized way
} else {
// existing code
}|
It looks like a similar change can be made to |
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
… optimize_createBitSet
769b1b3 to
85b80b7
Compare
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
* main: (32 commits) update os.makedirs with pathlib mkdir (apache#14710) Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674) Implement #docIDRunEnd() on PostingsEnum. (apache#14693) Speed up TermQuery (apache#14709) Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701) Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706) Use a temporary repository location to download certain ecj versions ("drops") (apache#14703) Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696) Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700) Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695) Speed up exhaustive evaluation. (apache#14679) Specify and test that IOContext is immutable (apache#14686) deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691) deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692) Clean up how the test framework creates asserting scorables. (apache#14452) Make competitive iterators more robust. (apache#14532) Remove DISIDocIdStream. (apache#14550) Implement AssertingPostingsEnum#intoBitSet. (apache#14675) Fix patience knn queries to work with seeded knn queries (apache#14688) Added toString() method to BytesRefBuilder (apache#14676) ...
| new FilteredDocIdSetIterator(iterator) { | ||
| @Override | ||
| protected boolean match(int doc) { | ||
| return liveDocs == null || liveDocs.get(doc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if liveDocs == null we could return Bitset.of(iterator, maxDoc), no, and avoid the internal logic in the filtered iterator, which probably goes away with branch prediction 🤷
This tries to implement
intoBitsetto speed up the building of BitSet.