Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes the calculation of the average document length for BM25 and BM25STD scorers by ensuring totalDocsLen is properly updated when documents are replaced (MOD-122234).
- Updates
totalDocsLenwhen a document is replaced by subtracting the old document's length - Adds a test to verify average document length stays correct after document updates
- Documents that deletion handling still needs to be fixed with a TODO comment
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/indexer.c | Subtracts old document length from totalDocsLen when replacing a document |
| src/spec.c | Adds TODO comment noting that totalDocsLen update is still missing for deletion |
| tests/pytests/test_scorers.py | Adds test to verify average document length calculation for document updates |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7227 +/- ##
==========================================
- Coverage 85.03% 85.01% -0.03%
==========================================
Files 341 341
Lines 51671 51670 -1
Branches 12526 12526
==========================================
- Hits 43941 43929 -12
- Misses 7537 7549 +12
+ Partials 193 192 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7df792e to
eff09b8
Compare
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin 8.2
git worktree add -d .worktree/backport-7227-to-8.2 origin/8.2
cd .worktree/backport-7227-to-8.2
git switch --create backport-7227-to-8.2
git cherry-pick -x 25e65efd1c5cf6a91d1f70126d2cf287507fefe8 |
* Fix BM25STD underflow wraparound * Fix comment * Fix documentation and tweak test * Address review * Remove old test * Skip cluster on expiration test * Fix leak in test * Switch to rust ii (#6958) * Switch to Rust inverted index * Remove encoder and decoder getter usage It is no longer possible to get the encoders or decoders from the flags. Instead they are coupled to the inverted index or its reader. * Update `testNumericInverted` * Update testIndexFlags * Don't check buffer growth in test There is no reason for the test to check the internals of these structures. * Fix size test on flags * Fix size test for numeric index * Fix size test for tag index * Fix sizes on LLapi * Update fork tests to const * C GC updates * Update fork tests * Don't free deltas twice * Fix sizes in Python tests * Recreate index to have compression turned on * Remove impossible test This test requires a mutable access to the inverted index in a reader and direct access to the repair block call. Both of these are not possible so the test is removed. * Remove C benchmarks * Move header out of future * Remove from iterator benches * Fix symbols for GC benchmark * Update iterator benchmark FFI * Get num_entries on index directly * Trim bench dependencies * Fix filters in CPP tests The text inverted indices expect to have a field mask filter, not the `None` case. * Update debug assertions It is now possible to have an inverted index with no blocks in it. * Partially fix totalDocLen metric * Fix * Make test not pass by accident * Remove var * Update totalDocLen on deletions as well * Fix tests * Fix assertions * Clean up unused old dmd * Add assertions to llapi * Remove Delete API * Skip unnecesary cluster tests --------- Co-authored-by: Pieter <[email protected]> (cherry picked from commit 25e65ef)
|
Successfully created backport PR for |
* MOD-12234: Fix totalDocsLen updates (#7227) * Fix BM25STD underflow wraparound * Fix comment * Fix documentation and tweak test * Address review * Remove old test * Skip cluster on expiration test * Fix leak in test * Switch to rust ii (#6958) * Switch to Rust inverted index * Remove encoder and decoder getter usage It is no longer possible to get the encoders or decoders from the flags. Instead they are coupled to the inverted index or its reader. * Update `testNumericInverted` * Update testIndexFlags * Don't check buffer growth in test There is no reason for the test to check the internals of these structures. * Fix size test on flags * Fix size test for numeric index * Fix size test for tag index * Fix sizes on LLapi * Update fork tests to const * C GC updates * Update fork tests * Don't free deltas twice * Fix sizes in Python tests * Recreate index to have compression turned on * Remove impossible test This test requires a mutable access to the inverted index in a reader and direct access to the repair block call. Both of these are not possible so the test is removed. * Remove C benchmarks * Move header out of future * Remove from iterator benches * Fix symbols for GC benchmark * Update iterator benchmark FFI * Get num_entries on index directly * Trim bench dependencies * Fix filters in CPP tests The text inverted indices expect to have a field mask filter, not the `None` case. * Update debug assertions It is now possible to have an inverted index with no blocks in it. * Partially fix totalDocLen metric * Fix * Make test not pass by accident * Remove var * Update totalDocLen on deletions as well * Fix tests * Fix assertions * Clean up unused old dmd * Add assertions to llapi * Remove Delete API * Skip unnecesary cluster tests --------- Co-authored-by: Pieter <[email protected]> (cherry picked from commit 25e65ef) * Remove unrelated diffs --------- Co-authored-by: Raz Monsonego <[email protected]> Co-authored-by: Pieter <[email protected]>
* Fix BM25STD underflow wraparound * Fix comment * Fix documentation and tweak test * Address review * Remove old test * Skip cluster on expiration test * Fix leak in test * Switch to rust ii (#6958) * Switch to Rust inverted index * Remove encoder and decoder getter usage It is no longer possible to get the encoders or decoders from the flags. Instead they are coupled to the inverted index or its reader. * Update `testNumericInverted` * Update testIndexFlags * Don't check buffer growth in test There is no reason for the test to check the internals of these structures. * Fix size test on flags * Fix size test for numeric index * Fix size test for tag index * Fix sizes on LLapi * Update fork tests to const * C GC updates * Update fork tests * Don't free deltas twice * Fix sizes in Python tests * Recreate index to have compression turned on * Remove impossible test This test requires a mutable access to the inverted index in a reader and direct access to the repair block call. Both of these are not possible so the test is removed. * Remove C benchmarks * Move header out of future * Remove from iterator benches * Fix symbols for GC benchmark * Update iterator benchmark FFI * Get num_entries on index directly * Trim bench dependencies * Fix filters in CPP tests The text inverted indices expect to have a field mask filter, not the `None` case. * Update debug assertions It is now possible to have an inverted index with no blocks in it. * Partially fix totalDocLen metric * Fix * Make test not pass by accident * Remove var * Update totalDocLen on deletions as well * Fix tests * Fix assertions * Clean up unused old dmd * Add assertions to llapi * Remove Delete API * Skip unnecesary cluster tests --------- Co-authored-by: Pieter <[email protected]>
MOD-12234: Fix totalDocsLen updates (#7227) * Fix BM25STD underflow wraparound * Fix comment * Fix documentation and tweak test * Address review * Remove old test * Skip cluster on expiration test * Fix leak in test * Switch to rust ii (#6958) * Switch to Rust inverted index * Remove encoder and decoder getter usage It is no longer possible to get the encoders or decoders from the flags. Instead they are coupled to the inverted index or its reader. * Update `testNumericInverted` * Update testIndexFlags * Don't check buffer growth in test There is no reason for the test to check the internals of these structures. * Fix size test on flags * Fix size test for numeric index * Fix size test for tag index * Fix sizes on LLapi * Update fork tests to const * C GC updates * Update fork tests * Don't free deltas twice * Fix sizes in Python tests * Recreate index to have compression turned on * Remove impossible test This test requires a mutable access to the inverted index in a reader and direct access to the repair block call. Both of these are not possible so the test is removed. * Remove C benchmarks * Move header out of future * Remove from iterator benches * Fix symbols for GC benchmark * Update iterator benchmark FFI * Get num_entries on index directly * Trim bench dependencies * Fix filters in CPP tests The text inverted indices expect to have a field mask filter, not the `None` case. * Update debug assertions It is now possible to have an inverted index with no blocks in it. * Partially fix totalDocLen metric * Fix * Make test not pass by accident * Remove var * Update totalDocLen on deletions as well * Fix tests * Fix assertions * Clean up unused old dmd * Add assertions to llapi * Remove Delete API * Skip unnecesary cluster tests --------- Co-authored-by: Pieter <[email protected]>
Currently, we update
spec->stats.totalDocsLenby incrementing it only, such that we do not compensate for documents that are updated or deleted.This PR fixes that, such that we have a correct value for the total document length metric, that is used for scoring in the
BM25STD(default) andBM25scorers.We also clean up the unused old dmd saved in the
RSAddDocumentCtx.Note
Ensure
numDocumentsandtotalDocsLenare decremented on replace/delete, removeDocTable_Deletein favor ofDocTable_Pop, and clean up old metadata handling; update tests accordingly.DocTable_DeleteAPI withDocTable_Pop; update all call sites (indexer.c,spec.c,redisearch_api.c, tests).spec->stats.numDocumentsandspec->stats.totalDocsLenbymd->len; assert invariants; callDMD_Return(md).RSAddDocumentCtx->oldMdand related return/cleanup logic.DocTable_Deletedeclarations; keepDocTable_Pop{,R}helpers.DocTable_Pop; adjust expected BM25/BM25STD scores.Written by Cursor Bugbot for commit 36b8ce1. This will update automatically on new commits. Configure here.