Skip to content

MOD-12234: Fix totalDocsLen updates#7227

Merged
raz-mon merged 20 commits intomasterfrom
razmon-fix_totalDocLen
Nov 9, 2025
Merged

MOD-12234: Fix totalDocsLen updates#7227
raz-mon merged 20 commits intomasterfrom
razmon-fix_totalDocLen

Conversation

@raz-mon
Copy link
Collaborator

@raz-mon raz-mon commented Nov 5, 2025

Currently, we update spec->stats.totalDocsLen by incrementing it only, such that we do not compensate for documents that are updated or deleted.
This PR fixes that, such that we have a correct value for the total document length metric, that is used for scoring in the BM25STD (default) and BM25 scorers.

We also clean up the unused old dmd saved in the RSAddDocumentCtx.


Note

Ensure numDocuments and totalDocsLen are decremented on replace/delete, remove DocTable_Delete in favor of DocTable_Pop, and clean up old metadata handling; update tests accordingly.

  • Core/Stats & Deletion Flow:
    • Replace DocTable_Delete API with DocTable_Pop; update all call sites (indexer.c, spec.c, redisearch_api.c, tests).
    • On replace/delete: decrement spec->stats.numDocuments and spec->stats.totalDocsLen by md->len; assert invariants; call DMD_Return(md).
    • Maintain refcounts/comments; unchain TTL, remove from DocIdMap, and adjust sortables/memsize; on replace/delete remove VecSim vectors and geometry ids.
  • Document Context:
    • Remove RSAddDocumentCtx->oldMd and related return/cleanup logic.
  • Headers/API:
    • Remove DocTable_Delete declarations; keep DocTable_Pop{,R} helpers.
  • Tests:
    • Add BM25/BM25STD average-doc-length correctness test (updates and deletions).
    • Update C++/Python tests to use DocTable_Pop; adjust expected BM25/BM25STD scores.
    • Minor vecsim expectations updated.

Written by Cursor Bugbot for commit 36b8ce1. This will update automatically on new commits. Configure here.

@raz-mon raz-mon requested a review from Copilot November 5, 2025 09:04
@github-actions github-actions bot added the size:S label Nov 5, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the calculation of the average document length for BM25 and BM25STD scorers by ensuring totalDocsLen is properly updated when documents are replaced (MOD-122234).

  • Updates totalDocsLen when a document is replaced by subtracting the old document's length
  • Adds a test to verify average document length stays correct after document updates
  • Documents that deletion handling still needs to be fixed with a TODO comment

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/indexer.c Subtracts old document length from totalDocsLen when replacing a document
src/spec.c Adds TODO comment noting that totalDocsLen update is still missing for deletion
tests/pytests/test_scorers.py Adds test to verify average document length calculation for document updates

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added the size:M label Nov 5, 2025
@codecov
Copy link

codecov bot commented Nov 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.01%. Comparing base (d95056f) to head (36b8ce1).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7227      +/-   ##
==========================================
- Coverage   85.03%   85.01%   -0.03%     
==========================================
  Files         341      341              
  Lines       51671    51670       -1     
  Branches    12526    12526              
==========================================
- Hits        43941    43929      -12     
- Misses       7537     7549      +12     
+ Partials      193      192       -1     
Flag Coverage Δ
flow 84.25% <64.70%> (-0.06%) ⬇️
unit 51.56% <64.70%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

alonre24
alonre24 previously approved these changes Nov 5, 2025
GuyAv46
GuyAv46 previously approved these changes Nov 5, 2025
@raz-mon raz-mon enabled auto-merge November 9, 2025 08:47
GuyAv46
GuyAv46 previously approved these changes Nov 9, 2025
@raz-mon raz-mon added this pull request to the merge queue Nov 9, 2025
Merged via the queue into master with commit 25e65ef Nov 9, 2025
19 checks passed
@raz-mon raz-mon deleted the razmon-fix_totalDocLen branch November 9, 2025 12:23
@redisearch-backport-pull-request
Copy link
Contributor

Backport failed for 8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.2
git worktree add -d .worktree/backport-7227-to-8.2 origin/8.2
cd .worktree/backport-7227-to-8.2
git switch --create backport-7227-to-8.2
git cherry-pick -x 25e65efd1c5cf6a91d1f70126d2cf287507fefe8

redisearch-backport-pull-request bot pushed a commit that referenced this pull request Nov 9, 2025
* Fix BM25STD underflow wraparound

* Fix comment

* Fix documentation and tweak test

* Address review

* Remove old test

* Skip cluster on expiration test

* Fix leak in test

* Switch to rust ii (#6958)

* Switch to Rust inverted index

* Remove encoder and decoder getter usage

It is no longer possible to get the encoders or decoders from the flags.
Instead they are coupled to the inverted index or its reader.

* Update `testNumericInverted`

* Update testIndexFlags

* Don't check buffer growth in test

There is no reason for the test to check the internals of these structures.

* Fix size test on flags

* Fix size test for numeric index

* Fix size test for tag index

* Fix sizes on LLapi

* Update fork tests to const

* C GC updates

* Update fork tests

* Don't free deltas twice

* Fix sizes in Python tests

* Recreate index to have compression turned on

* Remove impossible test

This test requires a mutable access to the inverted index in a reader
and direct access to the repair block call. Both of these are not
possible so the test is removed.

* Remove C benchmarks

* Move header out of future

* Remove from iterator benches

* Fix symbols for GC benchmark

* Update iterator benchmark FFI

* Get num_entries on index directly

* Trim bench dependencies

* Fix filters in CPP tests

The text inverted indices expect to have a field mask filter, not the
`None` case.

* Update debug assertions

It is now possible to have an inverted index with no blocks in it.

* Partially fix totalDocLen metric

* Fix

* Make test not pass by accident

* Remove var

* Update totalDocLen on deletions as well

* Fix tests

* Fix assertions

* Clean up unused old dmd

* Add assertions to llapi

* Remove Delete API

* Skip unnecesary cluster tests

---------

Co-authored-by: Pieter <[email protected]>
(cherry picked from commit 25e65ef)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 8.4:

@raz-mon raz-mon mentioned this pull request Nov 9, 2025
github-merge-queue bot pushed a commit that referenced this pull request Nov 9, 2025
* MOD-12234: Fix totalDocsLen updates (#7227)

* Fix BM25STD underflow wraparound

* Fix comment

* Fix documentation and tweak test

* Address review

* Remove old test

* Skip cluster on expiration test

* Fix leak in test

* Switch to rust ii (#6958)

* Switch to Rust inverted index

* Remove encoder and decoder getter usage

It is no longer possible to get the encoders or decoders from the flags.
Instead they are coupled to the inverted index or its reader.

* Update `testNumericInverted`

* Update testIndexFlags

* Don't check buffer growth in test

There is no reason for the test to check the internals of these structures.

* Fix size test on flags

* Fix size test for numeric index

* Fix size test for tag index

* Fix sizes on LLapi

* Update fork tests to const

* C GC updates

* Update fork tests

* Don't free deltas twice

* Fix sizes in Python tests

* Recreate index to have compression turned on

* Remove impossible test

This test requires a mutable access to the inverted index in a reader
and direct access to the repair block call. Both of these are not
possible so the test is removed.

* Remove C benchmarks

* Move header out of future

* Remove from iterator benches

* Fix symbols for GC benchmark

* Update iterator benchmark FFI

* Get num_entries on index directly

* Trim bench dependencies

* Fix filters in CPP tests

The text inverted indices expect to have a field mask filter, not the
`None` case.

* Update debug assertions

It is now possible to have an inverted index with no blocks in it.

* Partially fix totalDocLen metric

* Fix

* Make test not pass by accident

* Remove var

* Update totalDocLen on deletions as well

* Fix tests

* Fix assertions

* Clean up unused old dmd

* Add assertions to llapi

* Remove Delete API

* Skip unnecesary cluster tests

---------

Co-authored-by: Pieter <[email protected]>
(cherry picked from commit 25e65ef)

* Remove unrelated diffs

---------

Co-authored-by: Raz Monsonego <[email protected]>
Co-authored-by: Pieter <[email protected]>
raz-mon added a commit that referenced this pull request Nov 9, 2025
* Fix BM25STD underflow wraparound

* Fix comment

* Fix documentation and tweak test

* Address review

* Remove old test

* Skip cluster on expiration test

* Fix leak in test

* Switch to rust ii (#6958)

* Switch to Rust inverted index

* Remove encoder and decoder getter usage

It is no longer possible to get the encoders or decoders from the flags.
Instead they are coupled to the inverted index or its reader.

* Update `testNumericInverted`

* Update testIndexFlags

* Don't check buffer growth in test

There is no reason for the test to check the internals of these structures.

* Fix size test on flags

* Fix size test for numeric index

* Fix size test for tag index

* Fix sizes on LLapi

* Update fork tests to const

* C GC updates

* Update fork tests

* Don't free deltas twice

* Fix sizes in Python tests

* Recreate index to have compression turned on

* Remove impossible test

This test requires a mutable access to the inverted index in a reader
and direct access to the repair block call. Both of these are not
possible so the test is removed.

* Remove C benchmarks

* Move header out of future

* Remove from iterator benches

* Fix symbols for GC benchmark

* Update iterator benchmark FFI

* Get num_entries on index directly

* Trim bench dependencies

* Fix filters in CPP tests

The text inverted indices expect to have a field mask filter, not the
`None` case.

* Update debug assertions

It is now possible to have an inverted index with no blocks in it.

* Partially fix totalDocLen metric

* Fix

* Make test not pass by accident

* Remove var

* Update totalDocLen on deletions as well

* Fix tests

* Fix assertions

* Clean up unused old dmd

* Add assertions to llapi

* Remove Delete API

* Skip unnecesary cluster tests

---------

Co-authored-by: Pieter <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Nov 10, 2025
MOD-12234: Fix totalDocsLen updates (#7227)

* Fix BM25STD underflow wraparound

* Fix comment

* Fix documentation and tweak test

* Address review

* Remove old test

* Skip cluster on expiration test

* Fix leak in test

* Switch to rust ii (#6958)

* Switch to Rust inverted index

* Remove encoder and decoder getter usage

It is no longer possible to get the encoders or decoders from the flags.
Instead they are coupled to the inverted index or its reader.

* Update `testNumericInverted`

* Update testIndexFlags

* Don't check buffer growth in test

There is no reason for the test to check the internals of these structures.

* Fix size test on flags

* Fix size test for numeric index

* Fix size test for tag index

* Fix sizes on LLapi

* Update fork tests to const

* C GC updates

* Update fork tests

* Don't free deltas twice

* Fix sizes in Python tests

* Recreate index to have compression turned on

* Remove impossible test

This test requires a mutable access to the inverted index in a reader
and direct access to the repair block call. Both of these are not
possible so the test is removed.

* Remove C benchmarks

* Move header out of future

* Remove from iterator benches

* Fix symbols for GC benchmark

* Update iterator benchmark FFI

* Get num_entries on index directly

* Trim bench dependencies

* Fix filters in CPP tests

The text inverted indices expect to have a field mask filter, not the
`None` case.

* Update debug assertions

It is now possible to have an inverted index with no blocks in it.

* Partially fix totalDocLen metric

* Fix

* Make test not pass by accident

* Remove var

* Update totalDocLen on deletions as well

* Fix tests

* Fix assertions

* Clean up unused old dmd

* Add assertions to llapi

* Remove Delete API

* Skip unnecesary cluster tests

---------

Co-authored-by: Pieter <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants