fix(sql): fix internal error when concurrent threads try to drop a non-wal table by jerrinot · Pull Request #6462 · questdb/questdb

jerrinot · 2025-11-27T09:08:56Z

When multiple clients try to drop the same non-WAL table, then there is a small chance of getting an internal error.
This issue affects only legacy (non-WAL) tables. This change fixes the problem.

Details for reviewers:
It was caught by testConcurrentCreateDropRemove(), it's reproducible locally, it helps to decrease tableCount parameter in the test. This is the stacktrace:

io.questdb.cairo.CairoException: [2] could not remove table [table=tab3~]
2025-11-26T23:30:10.6649157Z 	at io.questdb.cairo.CairoException.instance(CairoException.java:387)
2025-11-26T23:30:10.6649711Z 	at io.questdb.cairo.CairoException.critical(CairoException.java:81)
2025-11-26T23:30:10.6650259Z 	at io.questdb.cairo.CairoEngine.dropTableOrMatView(CairoEngine.java:595)
2025-11-26T23:30:10.6650879Z 	at io.questdb.griffin.SqlCompilerImpl.executeDropTable(SqlCompilerImpl.java:3944)
2025-11-26T23:30:10.6651448Z 	at io.questdb.griffin.SqlCompilerImpl.execute(SqlCompilerImpl.java:490)
2025-11-26T23:30:10.6652025Z 	at io.questdb.cairo.pool.SqlCompilerPool$C.execute(SqlCompilerPool.java:151)
2025-11-26T23:30:10.6652650Z 	at io.questdb.griffin.engine.ops.GenericDropOperation.execute(GenericDropOperation.java:40)
2025-11-26T23:30:10.6653239Z 	at io.questdb.cairo.CairoEngine.execute(CairoEngine.java:253)
2025-11-26T23:30:10.6653757Z 	at io.questdb.cairo.CairoEngine.execute(CairoEngine.java:631)
2025-11-26T23:30:10.6654225Z 	at io.questdb.cairo.CairoEngine.execute(CairoEngine.java:625)
2025-11-26T23:30:10.6654732Z 	at io.questdb.test.AbstractCairoTest.execute(AbstractCairoTest.java:1569)
2025-11-26T23:30:10.6655323Z 	at io.questdb.test.cairo.TableNameRegistryTest.lambda$testConcurrentCreateDropRemove$4(TableNameRegistryTest.java:228)
2025-11-26T23:30:10.6655799Z 	...1 more

Root cause: CairoEngine unlocks the table too early. So another thread can acquire a lock (for now a non-existing table), tries to delete the directory, but the directory no longer exists -> failure with errno=2.

…same non-wal table

coderabbitai · 2025-11-27T09:09:03Z

Walkthrough

The dropTableOrMatView method in CairoEngine.java has been modified to append the current thread ID to error messages on unlink failure and to reorder cleanup operations, moving table registry removal and scoreboard entry removal to occur before the unlock in the finally block rather than after.

Changes

Cohort / File(s)	Summary
Engine table drop cleanup reordering `core/src/main/java/io/questdb/cairo/CairoEngine.java`	Modified error messaging to include thread ID on unlink failure; reordered registry and scoreboard cleanup to execute before unlock instead of after, altering the synchronization sequence for non-WAL table drops

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The reordering of cleanup operations relative to unlocking introduces subtle concurrency implications that require careful verification
Thread ID addition to error messages is straightforward but the core change involves lock/unlock sequence timing
Single-file change limits scope but the threading semantics demand attention

Suggested reviewers

bluestreak01
RaphDal
ideoma

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: fixing a concurrency bug in non-WAL table dropping, which aligns with the changeset's focus on reordering unlock/registry operations to prevent race conditions.
Description check	✅ Passed	The description is directly related to the changeset, providing context about the concurrent drop bug, the affected scenario, reproducible test case, stack trace, and root cause analysis.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch jh_concurrent_drop_fix

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a626ff8 and 82839fd.

📒 Files selected for processing (1)

core/src/main/java/io/questdb/cairo/CairoEngine.java (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-11-07T00:59:31.522Z

Learnt from: bluestreak01
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2025-11-07T00:59:31.522Z
Learning: In QuestDB's Cairo engine, transaction (_txn) files have a strong invariant: they are never truncated below TX_BASE_HEADER_SIZE. Once created, they are either fully formed (size >= header size) or completely removed along with the entire table directory when the table is dropped.

Applied to files:

core/src/main/java/io/questdb/cairo/CairoEngine.java

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (33)

GitHub Check: New pull request (SelfHosted Running tests with cover on linux-other)
GitHub Check: New pull request (SelfHosted Running tests with cover on linux-pgwire)
GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-sub)
GitHub Check: New pull request (SelfHosted Running tests with cover on linux-cairo-root)
GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz2)
GitHub Check: New pull request (SelfHosted Running tests with cover on linux-fuzz1)
GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-sub)
GitHub Check: New pull request (SelfHosted Running tests with cover on linux-griffin-root)
GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
GitHub Check: New pull request (SelfHosted Other tests on linux-x86-graal)
GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
GitHub Check: New pull request (SelfHosted Cairo tests on linux-x64-zfs)
GitHub Check: New pull request (SelfHosted Cairo tests on linux-x86-graal)
GitHub Check: New pull request (SelfHosted Griffin tests on linux-x86-graal)
GitHub Check: New pull request (SelfHosted Cairo tests on linux-arm64)
GitHub Check: New pull request (SelfHosted Griffin tests on linux-x64-zfs)
GitHub Check: New pull request (Hosted Running tests on windows-other-2)
GitHub Check: New pull request (Hosted Running tests on windows-other-1)
GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
GitHub Check: New pull request (Hosted Running tests on windows-cairo-2)
GitHub Check: New pull request (Hosted Running tests on windows-cairo-1)
GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
GitHub Check: New pull request (Hosted Running tests on windows-griffin-sub)
GitHub Check: New pull request (Hosted Running tests on windows-griffin-base)
GitHub Check: New pull request (Hosted Running tests on mac-other)
GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
GitHub Check: New pull request (Hosted Running tests on mac-cairo)
GitHub Check: New pull request (SelfHosted Griffin tests on linux-arm64)
GitHub Check: New pull request (Hosted Running tests on mac-griffin)
GitHub Check: New pull request (Check Changes Check changes)

🔇 Additional comments (2)

core/src/main/java/io/questdb/cairo/CairoEngine.java (2)

596-596: Good addition: Thread ID in error message aids concurrent debugging.

Including the thread ID in the error message is helpful for diagnosing concurrent drop scenarios, especially for the race condition this PR addresses.

598-604: LGTM: Race condition fix correctly moves registry cleanup before unlock.

This change correctly addresses the TOCTOU race condition described in the PR. By removing the table from the registry and scoreboard before unlocking (rather than after), other threads can no longer acquire a lock on a table whose directory has been deleted.

The ordering is now:

Remove table directory (line 594)

Drop from registry (line 599)

Remove scoreboard entry (line 604)

Unlock in finally block (line 606)

This ensures that once a table's directory is successfully removed, no other thread can see or lock that table before the cleanup completes.

Minor consideration: If tableNameRegistry.dropTable() or scoreboardPool.remove() throw exceptions after the directory has been deleted, the system would be in an inconsistent state (directory gone, but still registered). However, the unlock will still execute in the finally block, preventing deadlocks. Given that these operations are unlikely to fail in normal circumstances and the critical fix they enable, this is acceptable.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

jerrinot · 2025-11-27T09:19:01Z

Happy Thursday, dear @coderabbitai, would you be so kind as to review this humble PR?

coderabbitai · 2025-11-27T09:19:17Z

@jerrinot, happy Thursday to you as well! I'd be delighted to review this PR. Let me take a look at the changes to fix the concurrent drop table race condition. 🐰

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

glasstiger · 2025-11-27T09:32:35Z

[PR Coverage check]

😍 pass : 2 / 3 (66.67%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/cairo/CairoEngine.java	2	3	66.67%

fix(sql): fix internal error when concurrent threads try to drop the …

82839fd

…same non-wal table

jerrinot added Bug Incorrect or unexpected behavior SQL Issues or changes relating to SQL execution labels Nov 27, 2025

ideoma approved these changes Nov 27, 2025

View reviewed changes

bluestreak01 merged commit ec3f84c into master Nov 27, 2025
44 checks passed

bluestreak01 deleted the jh_concurrent_drop_fix branch November 27, 2025 11:14

coderabbitai bot mentioned this pull request Dec 18, 2025

feat(core): release disk space faster on table drop #6555

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sql): fix internal error when concurrent threads try to drop a non-wal table#6462

fix(sql): fix internal error when concurrent threads try to drop a non-wal table#6462
bluestreak01 merged 1 commit intomasterfrom
jh_concurrent_drop_fix

jerrinot commented Nov 27, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Nov 27, 2025 •

edited

Loading

Uh oh!

jerrinot commented Nov 27, 2025

Uh oh!

coderabbitai bot commented Nov 27, 2025

Uh oh!

glasstiger commented Nov 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jerrinot commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

jerrinot commented Nov 27, 2025

Uh oh!

coderabbitai bot commented Nov 27, 2025

Uh oh!

glasstiger commented Nov 27, 2025

[PR Coverage check]

file detail

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jerrinot commented Nov 27, 2025 •

edited

Loading

coderabbitai bot commented Nov 27, 2025 •

edited

Loading