Skip to content

fix(core): fix missing partition dir on table reader open when writing identical data with dedup#6479

Merged
bluestreak01 merged 4 commits intomasterfrom
fix-missing-partition
Dec 2, 2025
Merged

fix(core): fix missing partition dir on table reader open when writing identical data with dedup#6479
bluestreak01 merged 4 commits intomasterfrom
fix-missing-partition

Conversation

@ideoma
Copy link
Copy Markdown
Collaborator

@ideoma ideoma commented Dec 1, 2025

Fixes storage error that can result in transient query failures withan error message similar to:

2025-11-28T12:05:43.150214Z C i.q.c.h.p.JsonQueryProcessorState [5526041371967] error [msg=`Partition '2025-11-28T075000-000001' does not exist in table 'crypto_price_1m' directory. Run [ALTER TABLE table_name FORCE DROP PARTITION LIST '2025-11-28T075000-000001'] to repair the table or the database from the backup.`, errno=0, q=`

The bug is that PartitionPurgeJob can see a partition version while scanning the directories that will not be committed by the TableWriter and removed before the commit. However, PartitionPurgeJob scans partitions before reading _txn file, and when it reads the _txn file, the transaction can already be committed, so it counts those "phantom" deleted partitions as valid versions and assumes that they can be used by TableReaders.

The fix is to check that the next partition exists before deleting the overwritten partition version.

Step by step:

The race condition occurs when:

  1. O3PartitionPurgeJob scans partition directories and finds versions .0, .1, .2
  2. Meanwhile, TableWriter creates partition .2 temporarily for a potential copy-on-write operation
  3. TableWriter determines copy-on-write isn't needed (e.g., identical data with dedup), removes .2, and commits
  4. O3PartitionPurgeJob reads _txn file (now updated), sees the commit succeeded
  5. O3PartitionPurgeJob calculates that .0 and .1 are safe to delete because .2 exists
  6. But .2 was already removed - it was a "phantom" partition
  7. TableReader later tries to open the partition but finds no valid version

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 1, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Changes add a validation guard in the partition purge job to detect phantom partitions (those created and removed before transaction commit) and introduces an error counter in PartitionOverwriteControl to track partition overwrite errors. Tests are enhanced with assertions verifying the counter remains zero and include a new phantom partition scenario test.

Changes

Cohort / File(s) Summary
Phantom partition validation and error tracking
core/src/main/java/io/questdb/cairo/O3PartitionPurgeJob.java, core/src/main/java/io/questdb/cairo/PartitionOverwriteControl.java
Added existence check for partition directory in purge processing to skip phantom partitions with informational logging. Introduced AtomicInteger errorCount field and public getErrorCount() getter in PartitionOverwriteControl; increments counter before throwing partition overwrite exceptions.
Test validation and scenarios
core/src/test/java/io/questdb/test/cairo/o3/O3PartitionPurgeTest.java
Added assertions across multiple test methods verifying error counter equals zero post-purge. Introduced new testPhantomPartition() test method simulating partition disappearance via FilesFacade override to validate purge proceeds without errors.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Addition of straightforward existence check guard in purge job
  • Basic atomic counter initialization and getter in control class
  • Repetitive test assertion additions across multiple methods

Suggested labels

Bug, storage

Suggested reviewers

  • bluestreak01

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main fix: preventing errors from phantom partitions when deduplicating identical data.
Description check ✅ Passed The description comprehensively explains the race condition, its consequences, and the implemented fix with clear step-by-step context.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ideoma ideoma changed the title fix(core): fix missing partition dir on table reader open when writing identical data with dedupg identical data with dedupg identical data with dedup fix(core): fix missing partition dir on table reader open when writing identical data with dedup Dec 1, 2025
@ideoma
Copy link
Copy Markdown
Collaborator Author

ideoma commented Dec 1, 2025

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 1, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

1 similar comment
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 1, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2131fbe and c7b6842.

📒 Files selected for processing (3)
  • core/src/main/java/io/questdb/cairo/O3PartitionPurgeJob.java (1 hunks)
  • core/src/main/java/io/questdb/cairo/PartitionOverwriteControl.java (3 hunks)
  • core/src/test/java/io/questdb/test/cairo/o3/O3PartitionPurgeTest.java (21 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
core/src/main/java/io/questdb/cairo/O3PartitionPurgeJob.java (1)
core/src/main/java/io/questdb/cairo/TableUtils.java (1)
  • TableUtils (82-2127)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (36)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
  • GitHub Check: New pull request (Coverage Report Coverage Report)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-2)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-2)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-base)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x86-graal)
  • GitHub Check: New pull request (Trigger Enterprise CI Trigger Enterprise Pipeline)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x86-graal)
  • GitHub Check: New pull request (Check Changes Check changes)
🔇 Additional comments (5)
core/src/main/java/io/questdb/cairo/PartitionOverwriteControl.java (3)

33-33: LGTM!

Appropriate import for the error counter field.


40-40: LGTM!

The error counter field is correctly declared as final AtomicInteger, ensuring thread-safe access in the concurrent environment where partition purge operations occur.


81-83: LGTM!

The error counter implementation correctly increments before throwing the exception and provides a public getter for test verification. The AtomicInteger ensures thread-safe operations.

Also applies to: 104-104

core/src/test/java/io/questdb/test/cairo/o3/O3PartitionPurgeTest.java (2)

137-137: LGTM!

The error count assertions are correctly added to all test methods to verify that purge operations complete without triggering partition overwrite errors. This ensures the phantom partition fix doesn't introduce false positives.

Also applies to: 210-210, 248-248, 276-276, 311-311, 402-402, 439-439, 480-480, 532-532, 669-669, 716-716, 750-750, 810-810, 830-830, 881-881, 933-933, 975-975, 1032-1032, 1094-1094, 1156-1156


573-635: Excellent test coverage for the phantom partition scenario.

The test effectively simulates the race condition described in the PR objectives:

  1. Creates partition versions and a phantom directory that will be scanned
  2. Mocks FilesFacade.exists() to return false for the phantom partition, simulating its removal between directory scanning and the existence check
  3. Verifies the purge job aborts gracefully without recording errors

This validates the fix in O3PartitionPurgeJob.processPartition0 handles phantom partitions correctly.

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 10 / 11 (90.91%)

file detail

path covered line new line coverage
🔵 io/questdb/cairo/PartitionOverwriteControl.java 2 3 66.67%
🔵 io/questdb/cairo/O3PartitionPurgeJob.java 8 8 100.00%

@bluestreak01 bluestreak01 merged commit 33a3e6d into master Dec 2, 2025
43 checks passed
@bluestreak01 bluestreak01 deleted the fix-missing-partition branch December 2, 2025 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants