Skip to content

fix(sql): fix internal error when concurrent threads try to drop a non-wal table#6462

Merged
bluestreak01 merged 1 commit intomasterfrom
jh_concurrent_drop_fix
Nov 27, 2025
Merged

fix(sql): fix internal error when concurrent threads try to drop a non-wal table#6462
bluestreak01 merged 1 commit intomasterfrom
jh_concurrent_drop_fix

Conversation

@jerrinot
Copy link
Copy Markdown
Contributor

@jerrinot jerrinot commented Nov 27, 2025

When multiple clients try to drop the same non-WAL table, then there is a small chance of getting an internal error.
This issue affects only legacy (non-WAL) tables. This change fixes the problem.

Details for reviewers:
It was caught by testConcurrentCreateDropRemove(), it's reproducible locally, it helps to decrease tableCount parameter in the test. This is the stacktrace:

io.questdb.cairo.CairoException: [2] could not remove table [table=tab3~]
2025-11-26T23:30:10.6649157Z 	at io.questdb.cairo.CairoException.instance(CairoException.java:387)
2025-11-26T23:30:10.6649711Z 	at io.questdb.cairo.CairoException.critical(CairoException.java:81)
2025-11-26T23:30:10.6650259Z 	at io.questdb.cairo.CairoEngine.dropTableOrMatView(CairoEngine.java:595)
2025-11-26T23:30:10.6650879Z 	at io.questdb.griffin.SqlCompilerImpl.executeDropTable(SqlCompilerImpl.java:3944)
2025-11-26T23:30:10.6651448Z 	at io.questdb.griffin.SqlCompilerImpl.execute(SqlCompilerImpl.java:490)
2025-11-26T23:30:10.6652025Z 	at io.questdb.cairo.pool.SqlCompilerPool$C.execute(SqlCompilerPool.java:151)
2025-11-26T23:30:10.6652650Z 	at io.questdb.griffin.engine.ops.GenericDropOperation.execute(GenericDropOperation.java:40)
2025-11-26T23:30:10.6653239Z 	at io.questdb.cairo.CairoEngine.execute(CairoEngine.java:253)
2025-11-26T23:30:10.6653757Z 	at io.questdb.cairo.CairoEngine.execute(CairoEngine.java:631)
2025-11-26T23:30:10.6654225Z 	at io.questdb.cairo.CairoEngine.execute(CairoEngine.java:625)
2025-11-26T23:30:10.6654732Z 	at io.questdb.test.AbstractCairoTest.execute(AbstractCairoTest.java:1569)
2025-11-26T23:30:10.6655323Z 	at io.questdb.test.cairo.TableNameRegistryTest.lambda$testConcurrentCreateDropRemove$4(TableNameRegistryTest.java:228)
2025-11-26T23:30:10.6655799Z 	...1 more

Root cause: CairoEngine unlocks the table too early. So another thread can acquire a lock (for now a non-existing table), tries to delete the directory, but the directory no longer exists -> failure with errno=2.

@jerrinot jerrinot added Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution labels Nov 27, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 27, 2025

Walkthrough

The dropTableOrMatView method in CairoEngine.java has been modified to append the current thread ID to error messages on unlink failure and to reorder cleanup operations, moving table registry removal and scoreboard entry removal to occur before the unlock in the finally block rather than after.

Changes

Cohort / File(s) Summary
Engine table drop cleanup reordering
core/src/main/java/io/questdb/cairo/CairoEngine.java
Modified error messaging to include thread ID on unlink failure; reordered registry and scoreboard cleanup to execute before unlock instead of after, altering the synchronization sequence for non-WAL table drops

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • The reordering of cleanup operations relative to unlocking introduces subtle concurrency implications that require careful verification
  • Thread ID addition to error messages is straightforward but the core change involves lock/unlock sequence timing
  • Single-file change limits scope but the threading semantics demand attention

Suggested reviewers

  • bluestreak01
  • RaphDal
  • ideoma

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: fixing a concurrency bug in non-WAL table dropping, which aligns with the changeset's focus on reordering unlock/registry operations to prevent race conditions.
Description check ✅ Passed The description is directly related to the changeset, providing context about the concurrent drop bug, the affected scenario, reproducible test case, stack trace, and root cause analysis.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch jh_concurrent_drop_fix

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a626ff8 and 82839fd.

📒 Files selected for processing (1)
  • core/src/main/java/io/questdb/cairo/CairoEngine.java (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-07T00:59:31.522Z
Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: In QuestDB's Cairo engine, transaction (_txn) files have a strong invariant: they are never truncated below TX_BASE_HEADER_SIZE. Once created, they are either fully formed (size >= header size) or completely removed along with the entire table directory when the table is dropped.

Applied to files:

  • core/src/main/java/io/questdb/cairo/CairoEngine.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (33)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
  • GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x86-graal)
  • GitHub Check: New pull request (SelfHosted Cairo tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-x64-zfs)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-2)
  • GitHub Check: New pull request (Hosted Running tests on windows-other-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-2)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo-1)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin-base)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (SelfHosted Griffin tests on linux-arm64)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (Check Changes Check changes)
🔇 Additional comments (2)
core/src/main/java/io/questdb/cairo/CairoEngine.java (2)

596-596: Good addition: Thread ID in error message aids concurrent debugging.

Including the thread ID in the error message is helpful for diagnosing concurrent drop scenarios, especially for the race condition this PR addresses.


598-604: LGTM: Race condition fix correctly moves registry cleanup before unlock.

This change correctly addresses the TOCTOU race condition described in the PR. By removing the table from the registry and scoreboard before unlocking (rather than after), other threads can no longer acquire a lock on a table whose directory has been deleted.

The ordering is now:

  1. Remove table directory (line 594)
  2. Drop from registry (line 599)
  3. Remove scoreboard entry (line 604)
  4. Unlock in finally block (line 606)

This ensures that once a table's directory is successfully removed, no other thread can see or lock that table before the cleanup completes.

Minor consideration: If tableNameRegistry.dropTable() or scoreboardPool.remove() throw exceptions after the directory has been deleted, the system would be in an inconsistent state (directory gone, but still registered). However, the unlock will still execute in the finally block, preventing deadlocks. Given that these operations are unlikely to fail in normal circumstances and the critical fix they enable, this is acceptable.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jerrinot
Copy link
Copy Markdown
Contributor Author

Happy Thursday, dear @coderabbitai, would you be so kind as to review this humble PR?

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 27, 2025

@jerrinot, happy Thursday to you as well! I'd be delighted to review this PR. Let me take a look at the changes to fix the concurrent drop table race condition. 🐰

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 2 / 3 (66.67%)

file detail

path covered line new line coverage
🔵 io/questdb/cairo/CairoEngine.java 2 3 66.67%

@bluestreak01 bluestreak01 merged commit ec3f84c into master Nov 27, 2025
44 checks passed
@bluestreak01 bluestreak01 deleted the jh_concurrent_drop_fix branch November 27, 2025 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants