fix(core): fix checkpoint restore by removing phantom tables directories from disk#6614
Conversation
… checkpoint restore
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThis PR adds orphan directory cleanup logic to the checkpoint recovery process. When recovering from a checkpoint, the agent now identifies and removes destination table directories that don't exist in the checkpoint snapshot. Two new tests validate both the normal cleanup flow and failure scenarios. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java (2)
75-75: Consider explicit static imports over wildcard for clarity.Wildcard static imports (
import static io.questdb.cairo.TableUtils.*;) can make it less obvious where identifiers likeTXN_FILE_NAMEoriginate, especially when multiple utility classes are in use. If the project's style guide permits wildcards for well-known utility classes, this is acceptable; otherwise, consider using explicit imports.
839-841: Clarify the checkpoint directory guard logic.The guard at line 841 correctly prevents deleting the checkpoint directory when it's configured inside the db directory, but the logic is subtle. Consider enhancing the comment to explain the specific scenario:
-// Check if the checkpoint dir is configured to be inside the db dir, -// so we do not delete checkpoint directory itself -if (!Utf8s.startsWith(srcPath, dstPath)) { +// Check if the checkpoint dir is configured to be inside the db dir, +// so we do not delete the checkpoint directory itself. +// Example: if checkpoint is /data/db/.checkpoint/db/ and we're examining +// /data/db/.checkpoint/, startsWith() returns true, so we skip deletion. +if (!Utf8s.startsWith(srcPath, dstPath)) {This would help future maintainers quickly understand why we check if the checkpoint path starts with the current directory path (rather than the more intuitive opposite).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.javacore/src/test/java/io/questdb/test/griffin/CheckpointTest.java
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-07T00:59:31.522Z
Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: In QuestDB's Cairo engine, transaction (_txn) files have a strong invariant: they are never truncated below TX_BASE_HEADER_SIZE. Once created, they are either fully formed (size >= header size) or completely removed along with the entire table directory when the table is dropped.
Applied to files:
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java
📚 Learning: 2025-11-07T00:59:31.522Z
Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: When checking for transaction file validity in QuestDB, use `ff.length(path) >= TX_BASE_HEADER_SIZE` instead of `ff.exists(path)`. The length check provides stronger guarantees under extreme system load by ensuring the file system catalog is synchronized and the file is fully formed, preventing SIGBUS errors when memory mapping the file.
Applied to files:
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java
🧬 Code graph analysis (2)
core/src/test/java/io/questdb/test/griffin/CheckpointTest.java (1)
core/src/main/java/io/questdb/std/str/Utf8s.java (1)
Utf8s(49-2199)
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java (1)
core/src/main/java/io/questdb/cairo/TableUtils.java (1)
TableUtils(83-2144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: New pull request (Check Changes Check changes)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (4)
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java (2)
296-296: LGTM! Consistent usage of statically imported constant.The references to
TXN_FILE_NAMEare now consistent with the static import. No functional changes introduced.Also applies to: 336-336, 370-370, 453-453, 775-775, 797-797
830-864: Well-implemented orphan directory cleanup logic.The new recovery logic correctly identifies and removes destination table directories that don't exist in the checkpoint snapshot. Key strengths:
- Proper path management using
isDirOrSoftLinkDirNoDotswithrootLenparameter- Appropriate error handling that aborts startup on failure with clear guidance
- Correct handling of the edge case where checkpoint directory is inside the db directory (line 841)
- Good logging at advisory and critical levels
The implementation ensures the database is restored to the exact state captured in the checkpoint.
core/src/test/java/io/questdb/test/griffin/CheckpointTest.java (2)
998-1044: LGTM! Well-structured failure scenario test.This test properly validates the error handling when orphan directory removal fails:
- Custom
FilesFacadecorrectly simulatesrmdirfailure for the target path- Assertions verify both the exception type and the presence of key error message components
- Finally block ensures checkpoint cleanup even when recovery fails
- Test coverage includes the critical failure path where startup must abort
1046-1093: LGTM! Comprehensive happy path test.This test effectively validates the core orphan directory cleanup functionality:
- Clear test flow: checkpoint → drop → create new table → recover
- Data assertions confirm the original table is fully restored with correct content
- Negative assertion verifies the orphan table (created after checkpoint) is removed
- Test coverage includes the primary use case for the feature
[PR Coverage check]😍 pass : 23 / 23 (100.00%) file detail
|
Summary
Fixes checkpoint restore to properly handle tables created after the checkpoint was taken. Previously, when restoring from a checkpoint, table directories created after the checkpoint would remain on disk, potentially causing inconsistencies or orphaned data.
Problem
When a checkpoint is restored:
Solution
After recovering the checkpoint contents, the code now: