Clear shard data before snapshot recovery transfer #8782
I've seen this failure multiple times now: Currently investigating.
Fixed in cd99636.

A user request might temporarily see the replica disappear while it is cleared. In my opinion this is expected behavior. This resulted in the test falling through the 'wait all to be active' check.

I'm not 100% confident this fixes all CI issues yet, because it's hard to track what actually happens here. However, local runs no longer show the problem. AI also couldn't find any more issues:

Let's consider it fixed and take another look if CI trips on it again.
028d85a to e9280e7
* [ai] On shard snapshot transfer recovery, drop existing shard before recovery
* [ai] Add integration test to assert clearing behavior
* [ai] Debug assert our replica is not active when we clear it
* [ai] Tweak assertion
* Fix flaky test, replica may temporarily not be visible
* Replace debug assertion with runtime error
Alternative for #8697
When a shard snapshot transfer targets an existing shard, the receiver needs extra disk headroom. Specifically, the snapshot is first downloaded and only then replaces the existing data, so the receiver needs about double the size of the shard in disk space.
This is a blocker on some customer deployments that have very large shards. We'd also like to start enabling snapshot transfer by default everywhere because it has better performance characteristics.
This PR adjusts the shard snapshot recovery process. It now clears the target shard and then downloads the snapshot. This is safe because the shard transfer itself marks the shard as dead, and so it will not be used in reads. Clearing first prevents needing the headroom.
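To make the headroom argument concrete, here is a minimal sketch of how the two orderings affect peak disk usage on the receiver, plus the guard against clearing an active replica. It is a simplified model, not the code in this PR: `RecoveryOrder`, `ReplicaState`, `peak_disk_usage` and `clear_before_recovery` are hypothetical names chosen for illustration, and sizes are plain byte counts.

```rust
/// Order of operations on the receiver (hypothetical model, not the PR's types).
#[derive(Clone, Copy, Debug, PartialEq)]
enum RecoveryOrder {
    /// Old behavior: download the snapshot, then replace the existing data.
    DownloadThenReplace,
    /// New behavior: clear the existing data, then download the snapshot.
    ClearThenDownload,
}

/// Simplified replica state; only the two states relevant here are modeled.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ReplicaState {
    Active,
    Dead,
}

/// Peak disk usage in bytes while recovering a shard from a snapshot.
fn peak_disk_usage(existing_bytes: u64, snapshot_bytes: u64, order: RecoveryOrder) -> u64 {
    match order {
        // Old data and downloaded snapshot are on disk at the same time.
        RecoveryOrder::DownloadThenReplace => existing_bytes.saturating_add(snapshot_bytes),
        // Old data is gone before the download starts, so they never overlap.
        RecoveryOrder::ClearThenDownload => existing_bytes.max(snapshot_bytes),
    }
}

/// Clear the local shard data, but refuse with a runtime error
/// (rather than a debug assertion) if the replica is still active.
fn clear_before_recovery(state: ReplicaState, existing_bytes: &mut u64) -> Result<(), String> {
    if state == ReplicaState::Active {
        return Err("refusing to clear an active replica".to_string());
    }
    *existing_bytes = 0;
    Ok(())
}
```

In this model, a 100-byte shard receiving a 100-byte snapshot peaks at 200 bytes with `DownloadThenReplace` and at 100 bytes with `ClearThenDownload`, which is the doubling described above.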
All Submissions:
dev branch. Did you create your branch from dev?
New Feature Submissions:
cargo +nightly fmt --all command prior to submission?
cargo clippy --workspace --all-features command?
Changes to Core Features: