Skip to content

Don't lock shard holder for a long time in stream records transfer#8374

Merged
timvisee merged 4 commits into
fix-transfer-locking-shard-holderfrom
fix-transfer-locking-shard-holder-detach
Mar 13, 2026
Merged

Don't lock shard holder for a long time in stream records transfer#8374
timvisee merged 4 commits into
fix-transfer-locking-shard-holderfrom
fix-transfer-locking-shard-holder-detach

Conversation

@timvisee

Copy link
Copy Markdown
Member

Fix #8373

Splits transferring batches in stream records transfers into two. This ensures the shard holder read lock is only held for a very short time.

Stages:

  1. read - read the batch (quick, and holds the lock)
  2. write - send batch to remote and wait (takes long, but does not hold the lock)

The changes are a little bit invasive, but I don't see a nicer way of handling this problem.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --workspace --all-features command?

@timvisee timvisee changed the title Detach shard holder lock from sending update batch Fix holding shard holder lock for a long time in stream records transfers Mar 12, 2026
@timvisee timvisee changed the title Fix holding shard holder lock for a long time in stream records transfers Don't lock shard holder for a long time in stream records transfer Mar 12, 2026
Comment on lines +137 to +162
// Read batch under shard holder lock
let prepared_batch = {
let shard_holder = shard_holder.read().await;

let Some(replica_set) = shard_holder.get_shard(shard_id) else {
// Forward proxy gone?!
// That would be a programming error.
return Err(CollectionError::service_error(format!(
"Shard {shard_id} is not found"
)));
};

replica_set
.read_transfer_batch(offset, TRANSFER_BATCH_SIZE, None, merge_points)
.await?

// Shard holder lock is dropped here, but the forward proxy update lock is still
// held inside the prepared batch, preventing concurrent updates between reading
// and sending.
};

let result = replica_set
.transfer_batch(offset, TRANSFER_BATCH_SIZE, None, merge_points)
.await?;
// Send batch to remote shard without holding the shard holder lock.
// This is important because sending can take a very long time (especially the last
// batch which waits for the remote to fully process). Holding the shard holder lock
// during this time would block all other operations on the collection.
let result = prepared_batch.send(&remote_shard).await?;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical section is here.

First we get a prepared_batch which is quick.

Then we transfer the batch and wait for application. This can take a very long time. And now we don't hold the shard holder lock anymore.

@timvisee timvisee force-pushed the fix-transfer-locking-shard-holder-detach branch from 468aadf to bd8beac Compare March 12, 2026 14:22
@timvisee timvisee force-pushed the fix-transfer-locking-shard-holder-detach branch from bd8beac to bde82cd Compare March 12, 2026 14:28

This comment was marked as resolved.

@timvisee timvisee merged commit 783296b into fix-transfer-locking-shard-holder Mar 13, 2026
14 checks passed
@timvisee timvisee deleted the fix-transfer-locking-shard-holder-detach branch March 13, 2026 08:10
timvisee added a commit that referenced this pull request Mar 13, 2026
* Add basic test that shows shard holder is locked for too long

* Rework test, move into stream records test

* Fix typos

* Don't lock shard holder for a long time in stream records transfer (#8374)

* [ai] Detach shard holder lock from sending update batch

* Fix clippy warning

* Move stream records delay up to sleep in the middle of the batch

* remove unused code

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>
generall added a commit that referenced this pull request Mar 26, 2026
* Add basic test that shows shard holder is locked for too long

* Rework test, move into stream records test

* Fix typos

* Don't lock shard holder for a long time in stream records transfer (#8374)

* [ai] Detach shard holder lock from sending update batch

* Fix clippy warning

* Move stream records delay up to sleep in the middle of the batch

* remove unused code

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>
KShivendu pushed a commit that referenced this pull request Mar 30, 2026
* Add basic test that shows shard holder is locked for too long

* Rework test, move into stream records test

* Fix typos

* Don't lock shard holder for a long time in stream records transfer (#8374)

* [ai] Detach shard holder lock from sending update batch

* Fix clippy warning

* Move stream records delay up to sleep in the middle of the batch

* remove unused code

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants