fix(core): fix missing partition dir on table reader open when writing identical data with dedup#6479
Conversation
…g identical data with dedup
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughChanges add a validation guard in the partition purge job to detect phantom partitions (those created and removed before transaction commit) and introduces an error counter in PartitionOverwriteControl to track partition overwrite errors. Tests are enhanced with assertions verifying the counter remains zero and include a new phantom partition scenario test. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
1 similar comment
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
core/src/main/java/io/questdb/cairo/O3PartitionPurgeJob.java(1 hunks)core/src/main/java/io/questdb/cairo/PartitionOverwriteControl.java(3 hunks)core/src/test/java/io/questdb/test/cairo/o3/O3PartitionPurgeTest.java(21 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
core/src/main/java/io/questdb/cairo/O3PartitionPurgeJob.java (1)
core/src/main/java/io/questdb/cairo/TableUtils.java (1)
TableUtils(82-2127)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (36)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
- GitHub Check: New pull request (Coverage Report Coverage Report)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
- GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
- GitHub Check: New pull request (Hosted Running tests on windows-other-2)
- GitHub Check: New pull request (Hosted Running tests on windows-other-1)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo-2)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo-1)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin-base)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x86-graal)
- GitHub Check: New pull request (Trigger Enterprise CI Trigger Enterprise Pipeline)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (SelfHosted Cairo tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Cairo tests on linux-x86-graal)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Cairo tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Griffin tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin tests on linux-x86-graal)
- GitHub Check: New pull request (Check Changes Check changes)
🔇 Additional comments (5)
core/src/main/java/io/questdb/cairo/PartitionOverwriteControl.java (3)
33-33: LGTM!Appropriate import for the error counter field.
40-40: LGTM!The error counter field is correctly declared as
final AtomicInteger, ensuring thread-safe access in the concurrent environment where partition purge operations occur.
81-83: LGTM!The error counter implementation correctly increments before throwing the exception and provides a public getter for test verification. The AtomicInteger ensures thread-safe operations.
Also applies to: 104-104
core/src/test/java/io/questdb/test/cairo/o3/O3PartitionPurgeTest.java (2)
137-137: LGTM!The error count assertions are correctly added to all test methods to verify that purge operations complete without triggering partition overwrite errors. This ensures the phantom partition fix doesn't introduce false positives.
Also applies to: 210-210, 248-248, 276-276, 311-311, 402-402, 439-439, 480-480, 532-532, 669-669, 716-716, 750-750, 810-810, 830-830, 881-881, 933-933, 975-975, 1032-1032, 1094-1094, 1156-1156
573-635: Excellent test coverage for the phantom partition scenario.The test effectively simulates the race condition described in the PR objectives:
- Creates partition versions and a phantom directory that will be scanned
- Mocks
FilesFacade.exists()to return false for the phantom partition, simulating its removal between directory scanning and the existence check- Verifies the purge job aborts gracefully without recording errors
This validates the fix in
O3PartitionPurgeJob.processPartition0handles phantom partitions correctly.
[PR Coverage check]😍 pass : 10 / 11 (90.91%) file detail
|
Fixes storage error that can result in transient query failures withan error message similar to:
The bug is that
PartitionPurgeJobcan see a partition version while scanning the directories that will not be committed by the TableWriter and removed before the commit. However,PartitionPurgeJobscans partitions before reading_txnfile, and when it reads the_txnfile, the transaction can already be committed, so it counts those "phantom" deleted partitions as valid versions and assumes that they can be used by TableReaders.The fix is to check that the next partition exists before deleting the overwritten partition version.
Step by step:
The race condition occurs when: