Skip to content

test(core): run concurrent reader operation in WAL writer fuzz tests#6524

Merged
bluestreak01 merged 11 commits intomasterfrom
puzpuzpuz_concurrent_fuzz_reads
Dec 13, 2025
Merged

test(core): run concurrent reader operation in WAL writer fuzz tests#6524
bluestreak01 merged 11 commits intomasterfrom
puzpuzpuz_concurrent_fuzz_reads

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

@puzpuzpuz puzpuzpuz commented Dec 11, 2025

Adds concurrent reader operation to WAL writer fuzz tests to verify that the column files from intermediate txns are not corrupted.

@puzpuzpuz puzpuzpuz self-assigned this Dec 11, 2025
@puzpuzpuz puzpuzpuz added the Core Related to storage, data type, etc. label Dec 11, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 11, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces a rollback mechanism for column version tracking and extends the fuzz testing framework with query operations. Core changes include adding a rollback() method to ColumnVersionWriter, updating TableWriter to validate and reconcile column version state with transaction files during initialization, and introducing a new FuzzQueryOperation class and associated queryProb parameter for query-based fuzzing in the testing suite.

Changes

Cohort / File(s) Summary
Column Version Recovery
core/src/main/java/io/questdb/cairo/ColumnVersionWriter.java, core/src/main/java/io/questdb/cairo/TableWriter.java
Added rollback() method to ColumnVersionWriter that decrements version and reloads state. Updated TableWriter constructor to validate column version against transaction file state, with automatic rollback if off-by-one or hard failure on unrecoverable mismatch. Updated TxWriter method calls to pass denseSymbolMapWriters parameter.
Fuzz Query Operations
core/src/test/java/io/questdb/test/fuzz/FuzzQueryOperation.java
New class implementing FuzzTransactionOperation that executes SQL queries on tables with specified limits, handles concurrency failures and missing table conditions gracefully.
Fuzz Framework Configuration
core/src/test/java/io/questdb/test/cairo/fuzz/FuzzRunner.java, core/src/test/java/io/questdb/test/cairo/fuzz/AbstractFuzzTest.java, core/src/test/java/io/questdb/test/fuzz/FuzzTransactionGenerator.java
Added queryProb parameter to fuzz probability configuration, introduced overloaded setFuzzProbabilities methods, and extended generateSet to include query operation generation logic.
Fuzz Test Cases
core/src/test/java/io/questdb/test/cairo/fuzz/O3MaxLagFuzzTest.java, core/src/test/java/io/questdb/test/cairo/fuzz/O3ParquetPartitionFuzzTest.java, core/src/test/java/io/questdb/test/cairo/fuzz/WalWriterFuzzTest.java
Updated FuzzTransactionGenerator.generateSet calls to include queryProb argument (0.0 or 0.01 depending on test), shifting subsequent parameters accordingly.
Fuzz Test Infrastructure
core/src/test/java/io/questdb/test/cairo/fuzz/FailureFileFacade.java
Prefixed failure messages with [failure-facade] tag for improved error traceability without altering exception semantics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • TableWriter constructor validation logic: Verify state reconciliation between column version and transaction file, ensure rollback and failure conditions are correct.
  • FuzzQueryOperation implementation: Confirm exception handling strategy, ensure concurrent failure detection via FailureFileFacade is sound.
  • FuzzTransactionGenerator changes: Verify probability normalization includes new queryProb parameter and that query operation generation integrates correctly without disrupting existing operations.
  • Cross-file consistency: Ensure all test file updates correctly pass the new queryProb argument with appropriate defaults (0.0 or 0.01).

Possibly related PRs

Suggested reviewers

  • ideoma

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.27% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding concurrent reader operation to WAL writer fuzz tests.
Description check ✅ Passed The description is related to the changeset, explaining the purpose of adding concurrent reader operations to verify column file integrity.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@puzpuzpuz puzpuzpuz changed the title chore(test): run concurrent reader thread in WAL writer fuzz tests test(core): run concurrent reader thread in WAL writer fuzz tests Dec 11, 2025
@puzpuzpuz puzpuzpuz changed the title test(core): run concurrent reader thread in WAL writer fuzz tests test(core): run concurrent reader operation in WAL writer fuzz tests Dec 12, 2025
@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

@ideoma ColumnVersionWriter#rollback() left version in inconsistent state. I fixed that here: 82566f7. PTAL

@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
core/src/test/java/io/questdb/test/fuzz/FuzzQueryOperation.java (1)

90-90: Consider documenting the return value semantics.

The method always returns false. While this appears intentional (read-only operations don't "succeed" in a way that affects transaction state), a brief comment explaining the return value semantics would improve maintainability.

         if (th instanceof TableReferenceOutOfDateException) {
             return false;
         }
         throw th;
     }
-    return false;
+    return false; // read-only operation, no state change to report
 }
core/src/main/java/io/questdb/cairo/TableWriter.java (1)

442-457: Avoid version - 1 underflow and make mismatch logic clearer
The intended case is “_cv is one step ahead of _txn” (cv == txn + 1). Using subtraction is a tiny correctness risk and less readable; also cache values to avoid repeated reads.

-            if (columnVersionWriter.getVersion() != txWriter.getColumnVersion()) {
-                if (columnVersionWriter.getVersion() - 1 == txWriter.getColumnVersion()) {
+            final long cvVersion = columnVersionWriter.getVersion();
+            final long txnCvVersion = txWriter.getColumnVersion();
+            if (cvVersion != txnCvVersion) {
+                if (cvVersion == txnCvVersion + 1) {
                     // This is case when transaction was aborted during column version change
                     // we need to use previous version of _cv file, which should still be available for reading
                     LOG.info().$("rolling back column version change to match _txn file [table=").$(tableToken)
-                            .$(", txnVersion=").$(txWriter.getColumnVersion())
-                            .$(", actualFileVersion=").$(columnVersionWriter.getVersion())
+                            .$(", txnVersion=").$(txnCvVersion)
+                            .$(", actualFileVersion=").$(cvVersion)
                             .I$();
                     columnVersionWriter.rollback();
                 } else {
                     throw CairoException.critical(0).put("Unrecoverable storage corruption detected: column version mismatch [table=").put(tableToken)
-                            .put(", txnVersion=").put(txWriter.getColumnVersion())
-                            .put(", actualFileVersion=").put(columnVersionWriter.getVersion())
+                            .put(", txnVersion=").put(txnCvVersion)
+                            .put(", actualFileVersion=").put(cvVersion)
                             .put(']');
                 }
             }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3137b86 and 7774928.

📒 Files selected for processing (10)
  • core/src/main/java/io/questdb/cairo/ColumnVersionWriter.java (3 hunks)
  • core/src/main/java/io/questdb/cairo/TableWriter.java (2 hunks)
  • core/src/test/java/io/questdb/test/cairo/fuzz/AbstractFuzzTest.java (2 hunks)
  • core/src/test/java/io/questdb/test/cairo/fuzz/FailureFileFacade.java (2 hunks)
  • core/src/test/java/io/questdb/test/cairo/fuzz/FuzzRunner.java (8 hunks)
  • core/src/test/java/io/questdb/test/cairo/fuzz/O3MaxLagFuzzTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/cairo/fuzz/O3ParquetPartitionFuzzTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/cairo/fuzz/WalWriterFuzzTest.java (15 hunks)
  • core/src/test/java/io/questdb/test/fuzz/FuzzQueryOperation.java (1 hunks)
  • core/src/test/java/io/questdb/test/fuzz/FuzzTransactionGenerator.java (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-07T00:59:31.522Z
Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: In QuestDB's Cairo engine, transaction (_txn) files have a strong invariant: they are never truncated below TX_BASE_HEADER_SIZE. Once created, they are either fully formed (size >= header size) or completely removed along with the entire table directory when the table is dropped.

Applied to files:

  • core/src/test/java/io/questdb/test/fuzz/FuzzQueryOperation.java
  • core/src/main/java/io/questdb/cairo/TableWriter.java
📚 Learning: 2025-11-19T12:21:00.062Z
Learnt from: jerrinot
Repo: questdb/questdb PR: 6413
File: core/src/test/java/io/questdb/test/cutlass/pgwire/PGJobContextTest.java:11982-12002
Timestamp: 2025-11-19T12:21:00.062Z
Learning: QuestDB Java tests use a deterministic random seed. The test utilities (e.g., io.questdb.test.tools.TestUtils and io.questdb.std.Rnd) produce reproducible sequences, so rnd_* functions (including rnd_uuid4) yield deterministic outputs across runs. Do not flag tests in core/src/test/** that assert against values produced by rnd_* as flaky due to randomness.

Applied to files:

  • core/src/test/java/io/questdb/test/fuzz/FuzzTransactionGenerator.java
  • core/src/test/java/io/questdb/test/cairo/fuzz/WalWriterFuzzTest.java
  • core/src/test/java/io/questdb/test/cairo/fuzz/FuzzRunner.java
📚 Learning: 2025-11-07T00:59:31.522Z
Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: When checking for transaction file validity in QuestDB, use `ff.length(path) >= TX_BASE_HEADER_SIZE` instead of `ff.exists(path)`. The length check provides stronger guarantees under extreme system load by ensuring the file system catalog is synchronized and the file is fully formed, preventing SIGBUS errors when memory mapping the file.

Applied to files:

  • core/src/test/java/io/questdb/test/cairo/fuzz/FailureFileFacade.java
🧬 Code graph analysis (3)
core/src/test/java/io/questdb/test/fuzz/FuzzQueryOperation.java (9)
core/src/main/java/io/questdb/cairo/CairoException.java (1)
  • CairoException (39-429)
core/src/main/java/io/questdb/cairo/TableToken.java (1)
  • TableToken (38-192)
core/src/main/java/io/questdb/cairo/sql/TableReferenceOutOfDateException.java (1)
  • TableReferenceOutOfDateException (33-93)
core/src/main/java/io/questdb/griffin/SqlException.java (1)
  • SqlException (37-268)
core/src/main/java/io/questdb/griffin/SqlExecutionContextImpl.java (1)
  • SqlExecutionContextImpl (59-499)
core/src/main/java/io/questdb/std/LongList.java (1)
  • LongList (35-636)
core/src/main/java/io/questdb/std/Rnd.java (1)
  • Rnd (39-491)
core/src/main/java/io/questdb/std/str/StringSink.java (1)
  • StringSink (31-194)
core/src/test/java/io/questdb/test/cairo/fuzz/FailureFileFacade.java (1)
  • FailureFileFacade (39-559)
core/src/test/java/io/questdb/test/cairo/fuzz/FailureFileFacade.java (2)
core/src/main/java/io/questdb/cairo/CairoException.java (1)
  • CairoException (39-429)
core/src/main/java/io/questdb/std/Os.java (1)
  • Os (41-357)
core/src/main/java/io/questdb/cairo/TableWriter.java (1)
core/src/main/java/io/questdb/cairo/CairoException.java (1)
  • CairoException (39-429)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (36)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-2)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-2)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-base)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x86-graal)
  • GitHub Check: New pull request (Trigger Enterprise CI Trigger Enterprise Pipeline)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x86-graal)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (12)
core/src/test/java/io/questdb/test/cairo/fuzz/AbstractFuzzTest.java (1)

267-292: LGTM!

The new setFuzzProbabilities overload correctly propagates the queryProb parameter to the fuzzer, maintaining consistency with the existing parameter ordering. The original overload without queryProb provides backward compatibility for tests that don't require query fuzzing.

core/src/test/java/io/questdb/test/cairo/fuzz/O3ParquetPartitionFuzzTest.java (1)

178-182: LGTM!

Setting probabilityOfQuery to 0.0 correctly disables query operations for this parquet partition fuzz test, which focuses on data integrity validation rather than concurrent read testing.

core/src/test/java/io/questdb/test/cairo/fuzz/O3MaxLagFuzzTest.java (1)

164-168: LGTM!

The addition of 0.0 for probabilityOfQuery correctly adapts this test to the new generateSet signature while keeping query fuzzing disabled, which is appropriate for this O3 max lag focused test.

core/src/test/java/io/questdb/test/cairo/fuzz/WalWriterFuzzTest.java (1)

118-120: LGTM!

The updated setFuzzProbabilities calls correctly add the new queryProb parameter. Using 0.01 (1% probability) for query operations provides good coverage for concurrent read testing without overwhelming the fuzz tests. This pattern is consistently applied across all test methods.

core/src/test/java/io/questdb/test/cairo/fuzz/FailureFileFacade.java (1)

274-274: LGTM!

The [failure-facade] prefix improves debuggability by clearly identifying that these errors originate from the simulated failure mechanism rather than actual I/O failures.

core/src/test/java/io/questdb/test/fuzz/FuzzTransactionGenerator.java (1)

210-216: LGTM!

The query operation generation follows the established pattern. The limit calculation (±(1 to 1000)) correctly tests both positive (head) and negative (tail) LIMIT clauses. Queries appropriately don't increment metaVersion or waitBarrierVersion since they're read-only operations.

core/src/test/java/io/questdb/test/fuzz/FuzzQueryOperation.java (1)

41-91: Well-structured concurrent read operation for fuzz testing.

The implementation correctly:

  1. Tracks FailureFileFacade state to distinguish between simulated failures and real errors
  2. Handles expected exceptions during concurrent modifications (TableReferenceOutOfDateException, table-does-not-exist cases)
  3. Uses try-with-resources for proper SqlExecutionContext cleanup

The query format <tableName> LIMIT <limit> will effectively read column data to verify intermediate transaction files aren't corrupted, fulfilling the PR objective.

core/src/main/java/io/questdb/cairo/ColumnVersionWriter.java (1)

165-174: Implementation is correct; edge cases are guarded by calling code.

The rollback() method correctly decrements the version and reloads state from the previous data area via the double-write (A/B) scheme. The edge case concern—where version - 1 could be 0 or negative—is protected by the guard in TableWriter: rollback() is only called when columnVersionWriter.getVersion() - 1 == txWriter.getColumnVersion(), which ensures version >= 1 before decrement. If version were 0, the condition would fail and a critical error would be thrown instead.

core/src/test/java/io/questdb/test/cairo/fuzz/FuzzRunner.java (2)

110-110: LGTM - queryProb integration is well-structured.

The new queryProb field follows the established pattern for probability configuration in fuzz tests. The backward-compatible overload ensures existing call sites continue to work while new tests can specify query probability. The value is properly threaded through to FuzzTransactionGenerator.generateSet on line 496, which aligns with the PR's goal of adding concurrent reader operations to WAL writer fuzz tests.

Also applies to: 496-496, 609-667


149-156: Good practice - final keywords improve code clarity.

Making these local variables final prevents accidental reassignment and clearly communicates intent, especially helpful in this multi-threaded context where these collections are shared across threads.

core/src/main/java/io/questdb/cairo/TableWriter.java (2)

442-457: Add a targeted test to exercise the ctor rollback branch (coverage currently 0% for new lines)
Right now this logic is likely untested (and PR notes show coverage gaps). A small regression test that creates a table dir with _cv at txn+1 and _txn at txn (or mocks via existing test harness helpers) would lock in the intended recovery behavior and prevent future regressions.


3902-3906: No changes needed—API refactoring is complete and correct.

All call sites to txWriter.bumpMetadataAndColumnStructureVersion() and txWriter.bumpMetadataVersion() have been updated to pass the denseSymbolMapWriters parameter. The parameter is initialized once in the TableWriter constructor and is never null. No old no-arg signatures remain for direct TxWriter calls.

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😞 fail : 4 / 18 (22.22%)

file detail

path covered line new line coverage
🔵 io/questdb/cairo/ColumnVersionWriter.java 1 5 20.00%
🔵 io/questdb/cairo/TableWriter.java 3 13 23.08%

@bluestreak01 bluestreak01 merged commit 00b9115 into master Dec 13, 2025
42 of 43 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_concurrent_fuzz_reads branch December 13, 2025 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Core Related to storage, data type, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants