Skip to content

Conversation

@gf2121
Copy link
Contributor

@gf2121 gf2121 commented May 15, 2025

This tries to implement intoBitset to speed up the building of BitSet.

@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

}
}
};
return BitSet.of(filterIterator, maxDoc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of implementing intoBitSet on the iterator, we could do BitSet bitSet = BitSet.of(iterator, maxDoc); and then clear deleted docs with liveDocs.applyMask(bitSet, 0);? Ie. there wouldn't be a FilterDocIdSetIterator anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for feedback!

This is what i had in mind initially. The problem here is applyMask is requiring a FixedBitSet while BitSet#of may return a SparseFixedBitSet. So i have to fork the Bits#applyMask code for the sparse case, which could look like:

BitSet bitSet = BitSet.of(iterator, maxDoc);
if (liveDocs != null) {
  if (liveDocs instanceof FixedBitSet fixedBitSet) {
    liveDocs.applyMask(fixedBitSet, 0);
  } else {
    for (int i = bitSet.nextSetBit(0);
         i != DocIdSetIterator.NO_MORE_DOCS;
         i = i + 1 >= bitSet.length() ? DocIdSetIterator.NO_MORE_DOCS : bitSet.nextSetBit(i + 1)) {
      if (bitSet.get(i) == false) {
        bitSet.clear(i);
      }
    }
  }
}
return bitSet;

Or another potential way i'm thinking is making Bits#applyMask accept a BitSet. Default impl works fine but specialized impls need an instanceof FixedBitset check.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is annoying... It's not great either but I guess I'd like to do the SparseFixedBitSet/FixedBitSet decision up-front, something like:

int threshold = maxDoc >> 7; // like BitSet#of
if (it.cost() >= threshold) {
  // Create a FixedBitSet in an optimized way
} else {
  // existing code
}

@jpountz
Copy link
Contributor

jpountz commented May 15, 2025

It looks like a similar change can be made to AbstractVectorSimilarityQuery.

@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

@gf2121 gf2121 merged commit 3c118d7 into apache:main May 26, 2025
7 checks passed
weizijun added a commit to weizijun/lucene that referenced this pull request May 27, 2025
* main: (32 commits)
  update os.makedirs with pathlib mkdir (apache#14710)
  Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674)
  Implement #docIDRunEnd() on PostingsEnum. (apache#14693)
  Speed up TermQuery (apache#14709)
  Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701)
  Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706)
  Use a temporary repository location to download certain ecj versions ("drops") (apache#14703)
  Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696)
  Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700)
  Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695)
  Speed up exhaustive evaluation. (apache#14679)
  Specify and test that IOContext is immutable (apache#14686)
  deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691)
  deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692)
  Clean up how the test framework creates asserting scorables. (apache#14452)
  Make competitive iterators more robust. (apache#14532)
  Remove DISIDocIdStream. (apache#14550)
  Implement AssertingPostingsEnum#intoBitSet. (apache#14675)
  Fix patience knn queries to work with seeded knn queries (apache#14688)
  Added toString() method to BytesRefBuilder (apache#14676)
  ...
new FilteredDocIdSetIterator(iterator) {
@Override
protected boolean match(int doc) {
return liveDocs == null || liveDocs.get(doc);
Copy link
Contributor

@msokolov msokolov May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if liveDocs == null we could return Bitset.of(iterator, maxDoc), no, and avoid the internal logic in the filtered iterator, which probably goes away with branch prediction 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants