fix(snapshot): ensure atomic snapshot persistence with rollback on failure (#8357)#8474
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe change modifies Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/per_collection_metrics_test.sh (1)
31-44: Consider adding gRPC per-collection metrics validation.The test validates REST per-collection metrics (
rest_collection_responses_total) but doesn't validate gRPC per-collection metrics (grpc_collection_responses_total). Since this PR also adds gRPC collection tracking viasrc/tonic/api/points_api.rs, consider adding a gRPC call (e.g., viagrpcurl) and validating the corresponding metric.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/per_collection_metrics_test.sh` around lines 31 - 44, Add validation for gRPC per-collection metrics by invoking a gRPC request (e.g., using grpcurl) against the gRPC endpoint exercised by src/tonic/api/points_api.rs so that a grpc_collection_responses_total metric is emitted, then mirror the existing REST checks in tests/per_collection_metrics_test.sh to grep for "grpc_collection_responses_total.*collection=\"$TEST_COLLECTION\"" and fail the test if not found; also add a global gRPC metric check analogous to the existing rest_responses_total check (e.g., grep for "grpc_responses_total{.*method=") to ensure global gRPC metrics are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/collection/src/common/snapshots_manager.rs`:
- Line 313: Replace the implicit conversion using into() with an explicit from()
call: locate the return that currently uses Err(err.into()) in
snapshots_manager.rs and change it to Err(CollectionError::from(err)) so the
error is constructed explicitly via the CollectionError::from(...) conversion.
---
Nitpick comments:
In `@tests/per_collection_metrics_test.sh`:
- Around line 31-44: Add validation for gRPC per-collection metrics by invoking
a gRPC request (e.g., using grpcurl) against the gRPC endpoint exercised by
src/tonic/api/points_api.rs so that a grpc_collection_responses_total metric is
emitted, then mirror the existing REST checks in
tests/per_collection_metrics_test.sh to grep for
"grpc_collection_responses_total.*collection=\"$TEST_COLLECTION\"" and fail the
test if not found; also add a global gRPC metric check analogous to the existing
rest_responses_total check (e.g., grep for "grpc_responses_total{.*method=") to
ensure global gRPC metrics are present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: db933773-f41d-43f5-aede-154077975dac
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
Cargo.tomllib/collection/src/common/snapshots_manager.rssrc/actix/actix_telemetry.rssrc/common/metrics.rssrc/common/telemetry_ops/requests_telemetry.rssrc/tonic/api/points_api.rssrc/tonic/tonic_telemetry.rstests/per_collection_metrics_test.sh
| target_path.display(), | ||
| ); | ||
| } | ||
| return Err(err.into()); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify implicit into() conversion at this return site
rg -nP --type=rust 'return\s+Err\(\s*err\.into\(\)\s*\);' lib/collection/src/common/snapshots_manager.rsRepository: qdrant/qdrant
Length of output: 96
🏁 Script executed:
# Read the context around line 313 to understand the types and function signature
sed -n '300,330p' lib/collection/src/common/snapshots_manager.rsRepository: qdrant/qdrant
Length of output: 1034
🏁 Script executed:
# Find the function containing line 313 to see its signature and return type
sed -n '280,313p' lib/collection/src/common/snapshots_manager.rs | head -40Repository: qdrant/qdrant
Length of output: 1697
🏁 Script executed:
# Find the function signature - search backwards for async fn
sed -n '250,315p' lib/collection/src/common/snapshots_manager.rsRepository: qdrant/qdrant
Length of output: 2986
🏁 Script executed:
# Find CollectionResult definition and TempPath type
rg -n 'type CollectionResult' lib/collection/src/ --type rust | head -5Repository: qdrant/qdrant
Length of output: 159
🏁 Script executed:
# Find TempPath definition and understand what persist returns
rg -n 'TempPath' lib/collection/src/common/snapshots_manager.rs | head -10Repository: qdrant/qdrant
Length of output: 556
🏁 Script executed:
# Check the imports and what tempfile errors look like
head -20 lib/collection/src/common/snapshots_manager.rsRepository: qdrant/qdrant
Length of output: 707
🏁 Script executed:
# Check if CollectionError implements From<io::Error>
rg -n 'impl.*From.*io::Error' lib/collection/src/ --type rust -A 2Repository: qdrant/qdrant
Length of output: 319
Use explicit conversion instead of into()
At Line 313, replace Err(err.into()) with Err(CollectionError::from(err)) to follow repository Rust style guidance.
Suggested change
- return Err(err.into());
+ return Err(CollectionError::from(err));As per coding guidelines: "**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| return Err(err.into()); | |
| return Err(CollectionError::from(err)); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/collection/src/common/snapshots_manager.rs` at line 313, Replace the
implicit conversion using into() with an explicit from() call: locate the return
that currently uses Err(err.into()) in snapshots_manager.rs and change it to
Err(CollectionError::from(err)) so the error is constructed explicitly via the
CollectionError::from(...) conversion.
deb8151 to
48777d6
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
lib/collection/src/common/snapshots_manager.rs (1)
313-313:⚠️ Potential issue | 🟡 MinorUse explicit conversion at Line 313
Replace
Err(err.into())with explicitCollectionError::from(err)to match repository Rust style.Suggested change
- return Err(err.into()); + return Err(CollectionError::from(err));#!/bin/bash # Verify this implicit conversion pattern in the reviewed file rg -nP --type=rust 'return\s+Err\(\s*err\.into\(\)\s*\);' lib/collection/src/common/snapshots_manager.rsAs per coding guidelines: "
**/*.rs: Prefer explicitSomeType::from(x)over implicitx.into()in Rust".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/collection/src/common/snapshots_manager.rs` at line 313, Replace the implicit conversion Err(err.into()) with an explicit CollectionError::from(err) at the return site in snapshots_manager.rs; locate the occurrence of Err(err.into()) inside the snapshots manager logic (the function containing the snapshots_manager error return) and change it to return Err(CollectionError::from(err)) so it matches the repository Rust style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@lib/collection/src/common/snapshots_manager.rs`:
- Line 313: Replace the implicit conversion Err(err.into()) with an explicit
CollectionError::from(err) at the return site in snapshots_manager.rs; locate
the occurrence of Err(err.into()) inside the snapshots manager logic (the
function containing the snapshots_manager error return) and change it to return
Err(CollectionError::from(err)) so it matches the repository Rust style.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e2878ced-ef99-47ec-8832-354bac761dae
📒 Files selected for processing (1)
lib/collection/src/common/snapshots_manager.rs
38b7cdd to
81fc2ad
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/collection/src/common/snapshots_manager.rs`:
- Around line 302-306: When checksum_path_tmp.persist(&checksum_path) fails, do
not ignore errors from the rollback tokio_fs::remove_file(target_path); instead
capture its Result and return an explicit error that includes both the original
persist error and any rollback error (or at least return the rollback error if
it fails), so the caller knows the rollback did not complete; update the error
handling around checksum_path_tmp.persist, checksum_path and
tokio_fs::remove_file(target_path) to propagate a combined or prioritized error
rather than discarding the remove_file failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b65fb82c-b936-4439-b585-9b7f560e3505
📒 Files selected for processing (1)
lib/collection/src/common/snapshots_manager.rs
| if let Err(err) = checksum_path_tmp.persist(&checksum_path) { | ||
| // Snapshot persisted but checksum promotion failed: rollback to | ||
| // avoid leaving partial state. | ||
| let _ = tokio_fs::remove_file(target_path).await; | ||
| return Err(err.error.into()); |
There was a problem hiding this comment.
Don’t swallow rollback failures here.
If persist(&checksum_path) fails and remove_file(target_path) fails too, this branch still returns only the checksum error while leaving the .snapshot behind. Because missing checksum files are treated as None downstream, that orphan can still show up in snapshot listings, so the all-or-nothing guarantee is still broken on this path.
Suggested change
if let Err(err) = checksum_path_tmp.persist(&checksum_path) {
// Snapshot persisted but checksum promotion failed: rollback to
// avoid leaving partial state.
- let _ = tokio_fs::remove_file(target_path).await;
- return Err(err.error.into());
+ if let Err(rollback_err) = tokio_fs::remove_file(target_path).await {
+ return Err(CollectionError::service_error(format!(
+ "failed to roll back snapshot {} after checksum persist failure for {}: {rollback_err}; original error: {}",
+ target_path.display(),
+ checksum_path.display(),
+ err.error,
+ )));
+ }
+ return Err(CollectionError::from(err.error));
}Based on learnings: "prefer fail-fast error handling by returning explicit errors rather than silently continuing with potentially corrupted state".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/collection/src/common/snapshots_manager.rs` around lines 302 - 306, When
checksum_path_tmp.persist(&checksum_path) fails, do not ignore errors from the
rollback tokio_fs::remove_file(target_path); instead capture its Result and
return an explicit error that includes both the original persist error and any
rollback error (or at least return the rollback error if it fails), so the
caller knows the rollback did not complete; update the error handling around
checksum_path_tmp.persist, checksum_path and tokio_fs::remove_file(target_path)
to propagate a combined or prioritized error rather than discarding the
remove_file failure.
Fixes #8357
Summary
Fixes partial snapshot persistence so snapshot/checksum remain all-or-nothing and cleanup is deterministic on failure.
Problem
Snapshot persistence could leave partial artifacts when one persist step failed, causing inconsistent on-disk state and resource leakage.
What changed
Solution
Persist snapshot/checksum via guarded temporary paths, then commit atomically with rollback of already-persisted artifacts on downstream failure.
Why
This enforces all-or-nothing behavior for snapshot + checksum and prevents orphaned files after interrupted persistence.
Tests
cargo test -p collection common::snapshots_manager::tests::test_localfs_store_file_rolls_back_snapshot_if_checksum_persist_fails -- --nocapturecargo test -p collection common::snapshots_manager::tests -- --nocapture