Skip to content

Fix transfers may lock shard holder for too long#8373

Merged
timvisee merged 4 commits into
devfrom
fix-transfer-locking-shard-holder
Mar 13, 2026
Merged

Fix transfers may lock shard holder for too long#8373
timvisee merged 4 commits into
devfrom
fix-transfer-locking-shard-holder

Conversation

@timvisee
Copy link
Copy Markdown
Member

@timvisee timvisee commented Mar 12, 2026

Shard transfers may hold a shard holder lock for a very long time, which can cascade into blocking many operations and eventually freezing the whole cluster.

Some side effects we've seen in real clusters include unresponsive nodes, broken consensus and downtime.

Since we've implemented prevent_unoptimized and the update queue this is a lot more likely to happen.

For example, the last batch of a stream records transfer may block here for a very long time because the last batch sends with wait=true, which holds a shard holder read lock for all that time:

loop {
let shard_holder = shard_holder.read().await;
let Some(replica_set) = shard_holder.get_shard(shard_id) else {
// Forward proxy gone?!
// That would be a programming error.
return Err(CollectionError::service_error(format!(
"Shard {shard_id} is not found"
)));
};
let result = replica_set
.transfer_batch(offset, TRANSFER_BATCH_SIZE, None, merge_points)
.await?;
offset = result.next_page_offset;
progress.lock().add(result.count);
#[cfg(feature = "staging")]
if let Some(delay) = staging_delay {
tokio::time::sleep(delay).await;
}
// If this is the last batch, finalize
if offset.is_none() {
break;
}
}

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --workspace --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully ran tests with your changes locally?

@timvisee
Copy link
Copy Markdown
Member Author

Fix opened in #8374

…8374)

* [ai] Detach shard holder lock from sending update batch

* Fix clippy warning

* Move stream records delay up to sleep in the middle of the batch

* remove unused code

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>
@timvisee timvisee changed the title Draft: transfers may lock shard holder for too long Fix transfers may lock shard holder for too long Mar 13, 2026
@timvisee timvisee marked this pull request as ready for review March 13, 2026 08:11
@timvisee
Copy link
Copy Markdown
Member Author

@agourlay @generall Same as #8374, but including the test.

coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai Bot Mar 13, 2026
@timvisee timvisee merged commit 01a56ef into dev Mar 13, 2026
15 checks passed
@timvisee timvisee deleted the fix-transfer-locking-shard-holder branch March 13, 2026 08:48
generall added a commit that referenced this pull request Mar 26, 2026
* Add basic test that shows shard holder is locked for too long

* Rework test, move into stream records test

* Fix typos

* Don't lock shard holder for a long time in stream records transfer (#8374)

* [ai] Detach shard holder lock from sending update batch

* Fix clippy warning

* Move stream records delay up to sleep in the middle of the batch

* remove unused code

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>
KShivendu pushed a commit that referenced this pull request Mar 30, 2026
* Add basic test that shows shard holder is locked for too long

* Rework test, move into stream records test

* Fix typos

* Don't lock shard holder for a long time in stream records transfer (#8374)

* [ai] Detach shard holder lock from sending update batch

* Fix clippy warning

* Move stream records delay up to sleep in the middle of the batch

* remove unused code

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>

---------

Co-authored-by: Andrey Vasnetsov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants