Skip to content

fix(core): fix checkpoint restore by removing phantom tables directories from disk#6614

Merged
bluestreak01 merged 4 commits intomasterfrom
fix-checkpoint-restore-handle-dropped-tables
Jan 11, 2026
Merged

fix(core): fix checkpoint restore by removing phantom tables directories from disk#6614
bluestreak01 merged 4 commits intomasterfrom
fix-checkpoint-restore-handle-dropped-tables

Conversation

@ideoma
Copy link
Copy Markdown
Collaborator

@ideoma ideoma commented Jan 8, 2026

Summary

Fixes checkpoint restore to properly handle tables created after the checkpoint was taken. Previously, when restoring from a checkpoint, table directories created after the checkpoint would remain on disk, potentially causing inconsistencies or orphaned data.

Problem

When a checkpoint is restored:

  1. A checkpoint is taken at time T1
  2. New tables are created after T1 (e.g., tiesto)
  3. On restore, the checkpoint recovery process would restore tables from the checkpoint but leave the newly created table directories in place
  4. These "phantom" tables could be in an inconsistent state since they were not part of the checkpoint

Solution

After recovering the checkpoint contents, the code now:

  1. Iterates over all table directories in the database root
  2. Compares each directory against the checkpoint contents
  3. Removes any table directories that do not exist in the checkpoint
  4. Properly handles edge cases (e.g., checkpoint directory inside db directory)
  5. Fails fast with a clear error message if orphan directory removal fails

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 8, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR adds orphan directory cleanup logic to the checkpoint recovery process. When recovering from a checkpoint, the agent now identifies and removes destination table directories that don't exist in the checkpoint snapshot. Two new tests validate both the normal cleanup flow and failure scenarios.

Changes

Cohort / File(s) Change Summary
Core checkpoint recovery logic
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java
Added orphan directory removal in recover() method that iterates destination tables and deletes those absent from checkpoint. Includes guards to skip checkpoint directory and error handling with logging. Refactored to use static TXN_FILE_NAME import instead of qualified references.
Checkpoint recovery tests
core/src/test/java/io/questdb/test/griffin/CheckpointTest.java
Added two test methods: testCheckpointWithTableDropAndTableCreate() validates normal recovery after table drop/recreate sequence; testCheckpointRestoreOrphanDirRemovalFailure() validates error handling when directory removal fails using custom TestFilesFacade.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Bug

Suggested reviewers

  • bluestreak01
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive No pull request description was provided by the author, making it impossible to assess whether the intended context and rationale are communicated. Add a description explaining the bug being fixed, its impact, and how the changes address the issue.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: removing phantom table directories during checkpoint restore, which aligns with the core modifications in DatabaseCheckpointAgent.java.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ideoma
Copy link
Copy Markdown
Collaborator Author

ideoma commented Jan 8, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ideoma ideoma changed the title fix(core): remove table directories that not included in the checkpoint on checkpoint restore fix(core): fix checkpoint restore by removing phantom tables directories from disk Jan 8, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java (2)

75-75: Consider explicit static imports over wildcard for clarity.

Wildcard static imports (import static io.questdb.cairo.TableUtils.*;) can make it less obvious where identifiers like TXN_FILE_NAME originate, especially when multiple utility classes are in use. If the project's style guide permits wildcards for well-known utility classes, this is acceptable; otherwise, consider using explicit imports.


839-841: Clarify the checkpoint directory guard logic.

The guard at line 841 correctly prevents deleting the checkpoint directory when it's configured inside the db directory, but the logic is subtle. Consider enhancing the comment to explain the specific scenario:

-// Check if the checkpoint dir is configured to be inside the db dir,
-// so we do not delete checkpoint directory itself
-if (!Utf8s.startsWith(srcPath, dstPath)) {
+// Check if the checkpoint dir is configured to be inside the db dir,
+// so we do not delete the checkpoint directory itself.
+// Example: if checkpoint is /data/db/.checkpoint/db/ and we're examining
+// /data/db/.checkpoint/, startsWith() returns true, so we skip deletion.
+if (!Utf8s.startsWith(srcPath, dstPath)) {

This would help future maintainers quickly understand why we check if the checkpoint path starts with the current directory path (rather than the more intuitive opposite).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d51a22 and 7c37f74.

📒 Files selected for processing (2)
  • core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java
  • core/src/test/java/io/questdb/test/griffin/CheckpointTest.java
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-07T00:59:31.522Z
Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: In QuestDB's Cairo engine, transaction (_txn) files have a strong invariant: they are never truncated below TX_BASE_HEADER_SIZE. Once created, they are either fully formed (size >= header size) or completely removed along with the entire table directory when the table is dropped.

Applied to files:

  • core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java
📚 Learning: 2025-11-07T00:59:31.522Z
Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: When checking for transaction file validity in QuestDB, use `ff.length(path) >= TX_BASE_HEADER_SIZE` instead of `ff.exists(path)`. The length check provides stronger guarantees under extreme system load by ensuring the file system catalog is synchronized and the file is fully formed, preventing SIGBUS errors when memory mapping the file.

Applied to files:

  • core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java
🧬 Code graph analysis (2)
core/src/test/java/io/questdb/test/griffin/CheckpointTest.java (1)
core/src/main/java/io/questdb/std/str/Utf8s.java (1)
  • Utf8s (49-2199)
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java (1)
core/src/main/java/io/questdb/cairo/TableUtils.java (1)
  • TableUtils (83-2144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (4)
core/src/main/java/io/questdb/cairo/DatabaseCheckpointAgent.java (2)

296-296: LGTM! Consistent usage of statically imported constant.

The references to TXN_FILE_NAME are now consistent with the static import. No functional changes introduced.

Also applies to: 336-336, 370-370, 453-453, 775-775, 797-797


830-864: Well-implemented orphan directory cleanup logic.

The new recovery logic correctly identifies and removes destination table directories that don't exist in the checkpoint snapshot. Key strengths:

  • Proper path management using isDirOrSoftLinkDirNoDots with rootLen parameter
  • Appropriate error handling that aborts startup on failure with clear guidance
  • Correct handling of the edge case where checkpoint directory is inside the db directory (line 841)
  • Good logging at advisory and critical levels

The implementation ensures the database is restored to the exact state captured in the checkpoint.

core/src/test/java/io/questdb/test/griffin/CheckpointTest.java (2)

998-1044: LGTM! Well-structured failure scenario test.

This test properly validates the error handling when orphan directory removal fails:

  • Custom FilesFacade correctly simulates rmdir failure for the target path
  • Assertions verify both the exception type and the presence of key error message components
  • Finally block ensures checkpoint cleanup even when recovery fails
  • Test coverage includes the critical failure path where startup must abort

1046-1093: LGTM! Comprehensive happy path test.

This test effectively validates the core orphan directory cleanup functionality:

  • Clear test flow: checkpoint → drop → create new table → recover
  • Data assertions confirm the original table is fully restored with correct content
  • Negative assertion verifies the orphan table (created after checkpoint) is removed
  • Test coverage includes the primary use case for the feature

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 23 / 23 (100.00%)

file detail

path covered line new line coverage
🔵 io/questdb/cairo/DatabaseCheckpointAgent.java 23 23 100.00%

@bluestreak01 bluestreak01 merged commit ee4c668 into master Jan 11, 2026
43 checks passed
@bluestreak01 bluestreak01 deleted the fix-checkpoint-restore-handle-dropped-tables branch January 11, 2026 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants