Skip to content

fix(snapshot): ensure atomic snapshot persistence with rollback on failure (#8357)#8474

Merged
timvisee merged 3 commits into
qdrant:devfrom
krapcys1-maker:fix/8357-shard-snapshot-cleanup-v2
Apr 15, 2026
Merged

fix(snapshot): ensure atomic snapshot persistence with rollback on failure (#8357)#8474
timvisee merged 3 commits into
qdrant:devfrom
krapcys1-maker:fix/8357-shard-snapshot-cleanup-v2

Conversation

@krapcys1-maker

@krapcys1-maker krapcys1-maker commented Mar 22, 2026

Copy link
Copy Markdown
Contributor

Fixes #8357

Summary

Fixes partial snapshot persistence so snapshot/checksum remain all-or-nothing and cleanup is deterministic on failure.

Problem

Snapshot persistence could leave partial artifacts when one persist step failed, causing inconsistent on-disk state and resource leakage.

What changed

  • Keep persistence/rollback ownership explicit in snapshot storage path.
  • Ensure rollback removes already-persisted snapshot when checksum persist fails.

Solution

Persist snapshot/checksum via guarded temporary paths, then commit atomically with rollback of already-persisted artifacts on downstream failure.

Why

This enforces all-or-nothing behavior for snapshot + checksum and prevents orphaned files after interrupted persistence.

Tests

  • cargo test -p collection common::snapshots_manager::tests::test_localfs_store_file_rolls_back_snapshot_if_checksum_persist_fails -- --nocapture
  • cargo test -p collection common::snapshots_manager::tests -- --nocapture

@coderabbitai

coderabbitai Bot commented Mar 22, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The change modifies SnapshotStorageLocalFS::store_file to implement atomic snapshot persistence by writing the checksum to a temporary file before promoting it to the final location. After persisting the snapshot file, the code now explicitly persists the temporary checksum file to its final path. Error handling was added to roll back the snapshot file if checksum promotion fails, preventing partial on-disk states where a snapshot exists without a properly promoted checksum. The previous approach using keep() on the checksum temp handle was removed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • timvisee
  • ffuugoo
  • agourlay
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly summarizes the main change: making snapshot persistence atomic with rollback on failure.
Description check ✅ Passed Description is related to the changeset, explaining the problem and solution regarding snapshot persistence atomicity and rollback.
Linked Issues check ✅ Passed Changes align with issue #8357 requirements: temporary files are persisted via guarded paths with atomic commit and rollback of already-persisted artifacts on failure.
Out of Scope Changes check ✅ Passed All changes are scoped to snapshot persistence logic in snapshots_manager.rs; no unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/per_collection_metrics_test.sh (1)

31-44: Consider adding gRPC per-collection metrics validation.

The test validates REST per-collection metrics (rest_collection_responses_total) but doesn't validate gRPC per-collection metrics (grpc_collection_responses_total). Since this PR also adds gRPC collection tracking via src/tonic/api/points_api.rs, consider adding a gRPC call (e.g., via grpcurl) and validating the corresponding metric.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/per_collection_metrics_test.sh` around lines 31 - 44, Add validation
for gRPC per-collection metrics by invoking a gRPC request (e.g., using grpcurl)
against the gRPC endpoint exercised by src/tonic/api/points_api.rs so that a
grpc_collection_responses_total metric is emitted, then mirror the existing REST
checks in tests/per_collection_metrics_test.sh to grep for
"grpc_collection_responses_total.*collection=\"$TEST_COLLECTION\"" and fail the
test if not found; also add a global gRPC metric check analogous to the existing
rest_responses_total check (e.g., grep for "grpc_responses_total{.*method=") to
ensure global gRPC metrics are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/collection/src/common/snapshots_manager.rs`:
- Line 313: Replace the implicit conversion using into() with an explicit from()
call: locate the return that currently uses Err(err.into()) in
snapshots_manager.rs and change it to Err(CollectionError::from(err)) so the
error is constructed explicitly via the CollectionError::from(...) conversion.

---

Nitpick comments:
In `@tests/per_collection_metrics_test.sh`:
- Around line 31-44: Add validation for gRPC per-collection metrics by invoking
a gRPC request (e.g., using grpcurl) against the gRPC endpoint exercised by
src/tonic/api/points_api.rs so that a grpc_collection_responses_total metric is
emitted, then mirror the existing REST checks in
tests/per_collection_metrics_test.sh to grep for
"grpc_collection_responses_total.*collection=\"$TEST_COLLECTION\"" and fail the
test if not found; also add a global gRPC metric check analogous to the existing
rest_responses_total check (e.g., grep for "grpc_responses_total{.*method=") to
ensure global gRPC metrics are present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: db933773-f41d-43f5-aede-154077975dac

📥 Commits

Reviewing files that changed from the base of the PR and between 8733d9f and deb8151.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • Cargo.toml
  • lib/collection/src/common/snapshots_manager.rs
  • src/actix/actix_telemetry.rs
  • src/common/metrics.rs
  • src/common/telemetry_ops/requests_telemetry.rs
  • src/tonic/api/points_api.rs
  • src/tonic/tonic_telemetry.rs
  • tests/per_collection_metrics_test.sh

target_path.display(),
);
}
return Err(err.into());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify implicit into() conversion at this return site
rg -nP --type=rust 'return\s+Err\(\s*err\.into\(\)\s*\);' lib/collection/src/common/snapshots_manager.rs

Repository: qdrant/qdrant

Length of output: 96


🏁 Script executed:

# Read the context around line 313 to understand the types and function signature
sed -n '300,330p' lib/collection/src/common/snapshots_manager.rs

Repository: qdrant/qdrant

Length of output: 1034


🏁 Script executed:

# Find the function containing line 313 to see its signature and return type
sed -n '280,313p' lib/collection/src/common/snapshots_manager.rs | head -40

Repository: qdrant/qdrant

Length of output: 1697


🏁 Script executed:

# Find the function signature - search backwards for async fn
sed -n '250,315p' lib/collection/src/common/snapshots_manager.rs

Repository: qdrant/qdrant

Length of output: 2986


🏁 Script executed:

# Find CollectionResult definition and TempPath type
rg -n 'type CollectionResult' lib/collection/src/ --type rust | head -5

Repository: qdrant/qdrant

Length of output: 159


🏁 Script executed:

# Find TempPath definition and understand what persist returns
rg -n 'TempPath' lib/collection/src/common/snapshots_manager.rs | head -10

Repository: qdrant/qdrant

Length of output: 556


🏁 Script executed:

# Check the imports and what tempfile errors look like
head -20 lib/collection/src/common/snapshots_manager.rs

Repository: qdrant/qdrant

Length of output: 707


🏁 Script executed:

# Check if CollectionError implements From<io::Error>
rg -n 'impl.*From.*io::Error' lib/collection/src/ --type rust -A 2

Repository: qdrant/qdrant

Length of output: 319


Use explicit conversion instead of into()

At Line 313, replace Err(err.into()) with Err(CollectionError::from(err)) to follow repository Rust style guidance.

Suggested change
-            return Err(err.into());
+            return Err(CollectionError::from(err));

As per coding guidelines: "**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return Err(err.into());
return Err(CollectionError::from(err));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/collection/src/common/snapshots_manager.rs` at line 313, Replace the
implicit conversion using into() with an explicit from() call: locate the return
that currently uses Err(err.into()) in snapshots_manager.rs and change it to
Err(CollectionError::from(err)) so the error is constructed explicitly via the
CollectionError::from(...) conversion.

@krapcys1-maker krapcys1-maker force-pushed the fix/8357-shard-snapshot-cleanup-v2 branch from deb8151 to 48777d6 Compare March 22, 2026 09:22
@krapcys1-maker krapcys1-maker changed the title fix(snapshots): clean up partial shard snapshot persistence failures fix(snapshot): ensure atomic snapshot persistence with rollback on failure (#8357) Mar 22, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
lib/collection/src/common/snapshots_manager.rs (1)

313-313: ⚠️ Potential issue | 🟡 Minor

Use explicit conversion at Line 313

Replace Err(err.into()) with explicit CollectionError::from(err) to match repository Rust style.

Suggested change
-            return Err(err.into());
+            return Err(CollectionError::from(err));
#!/bin/bash
# Verify this implicit conversion pattern in the reviewed file
rg -nP --type=rust 'return\s+Err\(\s*err\.into\(\)\s*\);' lib/collection/src/common/snapshots_manager.rs

As per coding guidelines: "**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/collection/src/common/snapshots_manager.rs` at line 313, Replace the
implicit conversion Err(err.into()) with an explicit CollectionError::from(err)
at the return site in snapshots_manager.rs; locate the occurrence of
Err(err.into()) inside the snapshots manager logic (the function containing the
snapshots_manager error return) and change it to return
Err(CollectionError::from(err)) so it matches the repository Rust style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@lib/collection/src/common/snapshots_manager.rs`:
- Line 313: Replace the implicit conversion Err(err.into()) with an explicit
CollectionError::from(err) at the return site in snapshots_manager.rs; locate
the occurrence of Err(err.into()) inside the snapshots manager logic (the
function containing the snapshots_manager error return) and change it to return
Err(CollectionError::from(err)) so it matches the repository Rust style.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2878ced-ef99-47ec-8832-354bac761dae

📥 Commits

Reviewing files that changed from the base of the PR and between deb8151 and 48777d6.

📒 Files selected for processing (1)
  • lib/collection/src/common/snapshots_manager.rs

@krapcys1-maker krapcys1-maker force-pushed the fix/8357-shard-snapshot-cleanup-v2 branch from 38b7cdd to 81fc2ad Compare March 28, 2026 19:21

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/collection/src/common/snapshots_manager.rs`:
- Around line 302-306: When checksum_path_tmp.persist(&checksum_path) fails, do
not ignore errors from the rollback tokio_fs::remove_file(target_path); instead
capture its Result and return an explicit error that includes both the original
persist error and any rollback error (or at least return the rollback error if
it fails), so the caller knows the rollback did not complete; update the error
handling around checksum_path_tmp.persist, checksum_path and
tokio_fs::remove_file(target_path) to propagate a combined or prioritized error
rather than discarding the remove_file failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b65fb82c-b936-4439-b585-9b7f560e3505

📥 Commits

Reviewing files that changed from the base of the PR and between 81fc2ad and 7e4e5ea.

📒 Files selected for processing (1)
  • lib/collection/src/common/snapshots_manager.rs

Comment on lines +302 to +306
if let Err(err) = checksum_path_tmp.persist(&checksum_path) {
// Snapshot persisted but checksum promotion failed: rollback to
// avoid leaving partial state.
let _ = tokio_fs::remove_file(target_path).await;
return Err(err.error.into());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t swallow rollback failures here.

If persist(&checksum_path) fails and remove_file(target_path) fails too, this branch still returns only the checksum error while leaving the .snapshot behind. Because missing checksum files are treated as None downstream, that orphan can still show up in snapshot listings, so the all-or-nothing guarantee is still broken on this path.

Suggested change
         if let Err(err) = checksum_path_tmp.persist(&checksum_path) {
             // Snapshot persisted but checksum promotion failed: rollback to
             // avoid leaving partial state.
-            let _ = tokio_fs::remove_file(target_path).await;
-            return Err(err.error.into());
+            if let Err(rollback_err) = tokio_fs::remove_file(target_path).await {
+                return Err(CollectionError::service_error(format!(
+                    "failed to roll back snapshot {} after checksum persist failure for {}: {rollback_err}; original error: {}",
+                    target_path.display(),
+                    checksum_path.display(),
+                    err.error,
+                )));
+            }
+            return Err(CollectionError::from(err.error));
         }

Based on learnings: "prefer fail-fast error handling by returning explicit errors rather than silently continuing with potentially corrupted state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/collection/src/common/snapshots_manager.rs` around lines 302 - 306, When
checksum_path_tmp.persist(&checksum_path) fails, do not ignore errors from the
rollback tokio_fs::remove_file(target_path); instead capture its Result and
return an explicit error that includes both the original persist error and any
rollback error (or at least return the rollback error if it fails), so the
caller knows the rollback did not complete; update the error handling around
checksum_path_tmp.persist, checksum_path and tokio_fs::remove_file(target_path)
to propagate a combined or prioritized error rather than discarding the
remove_file failure.

@timvisee timvisee left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@timvisee timvisee merged commit c46bdc1 into qdrant:dev Apr 15, 2026
15 checks passed
timvisee pushed a commit that referenced this pull request May 8, 2026
…ilure (#8357) (#8474)

* fix: issue #8357 patch from pipeline run

* fix(snapshot): correct temp path ownership and rollback on checksum persist failure

* chore(ci): retrigger failed s3 snapshot workflow
@timvisee timvisee mentioned this pull request May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants