fix(core): fix potential File Descriptor leak on table drop by ideoma · Pull Request #6053 · questdb/questdb

ideoma · 2025-08-14T17:59:47Z

A file descriptor leak was found by fuzz tests. This PR fixes a file descriptor leak that occurs when a table is dropped and WAL (Write-Ahead Log) files are being purged concurrently with TableWriter operations. The leak happens due to double-closing of file descriptors when exceptions occur during WAL file operations. The scenario is

A table is dropped
WAL Purge job starts deleting the WAL files
TableWriter may be at this moment try to open the WAL files that are being deleted
TableWriter gets an exception on WAL file open and closes the open WAL files
Then, on closing the TableWriter object, the WAL files are closed the second time and fail because of the double close guard in FdCache.
The last exception aborts TableWriter cleanup and leads to a file descriptor leak.

The fix is to prevent double close attempt of WAL files in this situation.

coderabbitai · 2025-08-14T17:59:54Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Relocates mmapWalColumns call in TableWriter to before the try block; refactors TableWriterSegmentFileCache to add conservative rollback and FD-cache cleanup on partial mmap failures; updates BitmapIndexConcurrentTest to use per-thread SQL execution contexts and extend executor termination timeout.

Changes

Cohort / File(s)	Summary
TableWriter WAL mapping change `core/src/main/java/io/questdb/cairo/TableWriter.java`	Moved `mmapWalColumns` invocation out of the try block so mapping failures are handled by `mmapWalColumns` itself and do not interact with the method-level finally WAL-close logic.
Segment file cache mmap & FD cleanup `core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java`	Added initial mapped-columns size snapshot and stricter rollback on mmap failures; free partially mapped columns on error, call `closeWalFiles(...)` on failure, reset `walMappedColumns` position, and ensure cached FD lists are cleared/returned to pool to avoid leaks/double-closes.
Concurrent test: per-thread SQL context `core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java`	Create and use a per-thread SQL execution context (try-with-resources) for update threads, call `engine.execute(updateSql, ctx)`, clear thread-locals via `Path.clearThreadLocals()`, and increase awaitTermination timeout from 5s to 60s with an assertion.

Sequence Diagram(s)

sequenceDiagram
  participant TW as TableWriter
  participant SFC as TableWriterSegmentFileCache
  participant WAL as WAL Files

  TW->>SFC: mmapWalColumns()
  alt mmap fails
    SFC->>SFC: free partial mapped columns
    SFC->>WAL: closeWalFiles(...)
    SFC-->>TW: throw exception
    note right of TW: method-level finally won't double-close WAL
  else mmap succeeds
    TW->>TW: try { processWalCommitBlock }
    TW-->>WAL: finally { close WAL files }
  end

sequenceDiagram
  participant Test as BitmapIndexConcurrentTest
  participant Eng as Engine
  participant Ctx as Thread-local SQL Context

  loop per update thread
    Test->>Ctx: create (TestUtils.createSqlExecutionCtx)
    Test->>Eng: execute(updateSql, Ctx)
    Eng-->>Test: result
    Test->>Ctx: close (try-with-resources)
    Test->>Test: Path.clearThreadLocals()
  end
  Test->>Test: awaitTermination(60s) and assert terminated

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

bluestreak01

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix-double-fd-close

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

Copilot

Pull Request Overview

This PR fixes a file descriptor leak that occurs when a table is dropped and WAL (Write-Ahead Log) files are being purged concurrently with TableWriter operations. The leak happens due to double-closing of file descriptors when exceptions occur during WAL file operations.

Adds proper cleanup logic to prevent double-closing of file descriptors in exception scenarios
Moves WAL column mapping outside try-catch blocks to avoid duplicate cleanup
Implements cache removal and resource cleanup in finally blocks

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
TableWriterSegmentFileCache.java	Adds finally block for FD cache cleanup and exception handling improvements to prevent double-close scenarios
TableWriter.java	Moves WAL column mapping outside try block to prevent duplicate cleanup on exceptions

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java

Co-authored-by: Copilot <[email protected]>

ideoma · 2025-08-15T11:30:35Z

@coderabbitai review

coderabbitai · 2025-08-15T11:30:41Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

208-209: Good call capturing pre-mapping size for precise cleanup

Saving initialSize before mapping ensures the catch path targets only newly added mappings for cleanup, avoiding both leaks and accidental over-closing.

The earlier suggestion to use walMappedColumns.size() in the catch would have resulted in no-op cleanup for newly added columns. Using initialSize is correct here.

🧹 Nitpick comments (4)

core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (3)

179-183: Avoid String.format in hot loop to reduce allocations

This is a tight loop; String.format allocates and is slower. Simple concatenation is sufficient and cheaper.

Apply this diff:

-                                String updateSql = String.format(
-                                        "UPDATE trades SET symbol = '%s' WHERE id = %d",
-                                        randomSymbol, randomId
-                                );
+                                String updateSql = "UPDATE trades SET symbol = '" + randomSymbol + "' WHERE id = " + randomId;

172-194: Optional: mirror per-thread execution context for readers to eliminate any shared-context risk

Writers and updaters now use thread-local contexts; readers still share sqlExecutionContext (Line 229). It may be safe, but for symmetry and to preempt subtle races, consider a per-thread context for readers too.

Example (outside the selected lines, for the reader runnable body):

try (var readCtx = TestUtils.createSqlExecutionCtx(engine)) {
    while (!stopFlag.get() && errorCount.get() == 0) {
        // ...
        try (RecordCursorFactory factory = select(querySql)) {
            try (RecordCursor cursor = factory.getCursor(readCtx)) {
                // ...
            }
        }
    }
}

281-281: Harden executor shutdown with a forced fallback to avoid lingering threads on failure

If termination exceeds 60s, issue shutdownNow and wait briefly again. This prevents leaked threads when an assertion fails.

Apply this diff:

-        Assert.assertTrue("failed to terminate threads within 60s", executor.awaitTermination(60, TimeUnit.SECONDS));
+        if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
+            executor.shutdownNow();
+            Assert.assertTrue(
+                    "failed to terminate threads within 60s (forced shutdown issued)",
+                    executor.awaitTermination(10, TimeUnit.SECONDS)
+            );
+        }

core/src/main/java/io/questdb/cairo/TableWriter.java (1)

7655-7660: Nit: tighten the comment wording for clarity

Minor grammar/wording improvements make the intent clearer and reference the double-close guard explicitly.

-        // Don't move the line to mmap Wal column inside the following try block,
-        // This call, if failed will close the WAL files correctly on its own
-        // putting it inside the try block will cause the WAL files to be closed twice in the finally block
-        // in case of the exception.
+        // Important: do not move the mmapWalColumns call inside the try block below.
+        // If mmapWalColumns fails, it closes the WAL files itself.
+        // Putting it inside the try would cause the finally block to close the WAL files again,
+        // tripping the FdCache double-close guard in case of an exception.
         segmentFileCache.mmapWalColumns(segmentCopyInfo, metadata, path);

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7454770 and 6ce8519.

📒 Files selected for processing (3)

core/src/main/java/io/questdb/cairo/TableWriter.java (1 hunks)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (3 hunks)
core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)

GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
GitHub Check: New pull request (Hosted Running tests on windows-other)
GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
GitHub Check: New pull request (Hosted Running tests on windows-cairo)
GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
GitHub Check: New pull request (Hosted Running tests on windows-griffin)
GitHub Check: New pull request (Hosted Running tests on mac-other)
GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
GitHub Check: New pull request (Hosted Running tests on mac-cairo)
GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
GitHub Check: New pull request (Hosted Running tests on mac-griffin)
GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
GitHub Check: New pull request (Check Changes Check changes)
GitHub Check: build
GitHub Check: build

🔇 Additional comments (3)

core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (1)

172-187: LGTM: per-thread SqlExecutionContext with try-with-resources prevents cross-thread state contamination

Creating a thread-local SqlExecutionContext and closing it via try-with-resources is the right move here. It isolates UPDATE execution, avoids shared state races, and guarantees cleanup even on failures. The explicit Path.clearThreadLocals() in the outer finally complements this well.

core/src/main/java/io/questdb/cairo/TableWriter.java (1)

7655-7660: Good fix: mmapWalColumns moved outside try to prevent double-close and FD leak

Placing segmentFileCache.mmapWalColumns(...) before the try avoids the finally-block double-close when mapping fails (since the mapper handles its own cleanup on failure). This aligns with the described FD leak scenario and is a safe, minimal change.

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

326-341: Robust nested cleanup around WAL column mapping

The inner try/catch ensures partial mappings are freed and cached FDs are closed on failure, preventing leaks when mapping multiple segments. This aligns with the PR’s goal of avoiding double-close and leak scenarios.

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java

ideoma · 2025-08-15T12:21:06Z

@coderabbitai review

coderabbitai · 2025-08-15T12:21:12Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

303-314: Critical: stale fdCacheKey removal in finally can double-remove and corrupt walFdCache state

If mmap fails, the catch path calls closeWalFiles(true, ...), which removes the cached FD entry and returns the LongList to the pool. The finally block still executes walFdCache.removeAt(fdCacheKey) using the stale index and decrements walFdCacheSize again. This risks:

Removing a different entry (stale index) from walFdCache
Double-decrementing walFdCacheSize
Double-pushing the same LongList to the pool (if fds is still non-null in other paths)

Re-resolve the key in finally and only remove if the key still exists. Also avoid double-pushing by handling the fds reference accordingly.

Apply this diff:

-            // Now that the FDs are used in the column objects, remove them from the cache.
-            // to avoid double close in case of exceptions.
-            if (fdCacheKey < 0) {
-                walFdCache.removeAt(fdCacheKey);
-                walFdCacheSize--;
-            }
-
-            if (fds != null) {
-                fds.clear();
-                walFdCacheListPool.push(fds);
-            }
+            // Remove cached FDs only if they still exist (they may have been removed in the catch branch).
+            final int currentKey = walFdCache.keyIndex(walSegmentId);
+            if (currentKey < 0) {
+                final LongList cached = walFdCache.valueAt(currentKey);
+                walFdCache.removeAt(currentKey);
+                walFdCacheSize--;
+                cached.clear();
+                walFdCacheListPool.push(cached);
+            } else if (fds != null) {
+                // No cache entry remains; just return the list we took earlier, if any.
+                fds.clear();
+                walFdCacheListPool.push(fds);
+            }

🧹 Nitpick comments (1)

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

297-301: Using initialSize for partial rollback is correct

Passing initialSize to closeWalFiles(true, walSegmentId, initialSize) closes only the newly mapped columns within this method, which is the intended partial rollback behavior. The explicit walMappedColumns.setPos(initialSize) is redundant given closeWalFiles already sets the position but harmless.

If you want to reduce redundancy, you can remove the setPos(initialSize) line since closeWalFiles already handles it.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6ce8519 and 38d6570.

📒 Files selected for processing (1)

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (3 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)

GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
GitHub Check: New pull request (Hosted Running tests on windows-other)
GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
GitHub Check: New pull request (Hosted Running tests on windows-cairo)
GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
GitHub Check: New pull request (Hosted Running tests on windows-griffin)
GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
GitHub Check: New pull request (Hosted Running tests on mac-other)
GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
GitHub Check: New pull request (Hosted Running tests on mac-cairo)
GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
GitHub Check: New pull request (Hosted Running tests on mac-griffin)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
GitHub Check: New pull request (Check Changes Check changes)
GitHub Check: build
GitHub Check: build

🔇 Additional comments (1)

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

331-347: Robust failure handling in mmapWalColumns looks good

The inner try-catch ensures that on any partial failure:

All mapped columns are freed and the list is cleared.

Cached FDs are closed to avoid leaks.

The exception is rethrown, preserving original failure semantics.

This aligns with the PR goal to prevent FD leaks during concurrent WAL purges.

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java

glasstiger · 2025-08-15T13:34:19Z

[PR Coverage check]

😍 pass : 19 / 23 (82.61%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/cairo/TableWriterSegmentFileCache.java	18	22	81.82%
🔵	io/questdb/cairo/TableWriter.java	1	1	100.00%

fix(core): fix potential double fd leak on table drop

4490c4e

ideoma requested a review from Copilot August 14, 2025 17:59

Copilot AI reviewed Aug 14, 2025

View reviewed changes

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java Show resolved Hide resolved

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java Show resolved Hide resolved

ideoma and others added 2 commits August 14, 2025 19:03

Apply suggestion from @Copilot

722d353

Co-authored-by: Copilot <[email protected]>

code simplification

e777a61

ideoma changed the title ~~fix(core): fix potential double fd leak on table drop~~ fix(core): fix potential File Descriptor leak on table drop Aug 14, 2025

ideoma added 2 commits August 15, 2025 12:29

fix failed on CI BitmapIndexConcurrentTest

68a7025

Merge branch 'master' into fix-double-fd-close

6ce8519

coderabbitai bot reviewed Aug 15, 2025

View reviewed changes

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java Show resolved Hide resolved

applied rabbit review suggestion

38d6570

coderabbitai bot reviewed Aug 15, 2025

View reviewed changes

bluestreak01 reviewed Aug 15, 2025

View reviewed changes

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java Outdated Show resolved Hide resolved

bluestreak01 reviewed Aug 15, 2025

View reviewed changes

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java Outdated Show resolved Hide resolved

address review nits

e21abae

bluestreak01 added Bug Incorrect or unexpected behavior storage labels Aug 15, 2025

bluestreak01 approved these changes Aug 15, 2025

View reviewed changes

bluestreak01 merged commit f34f936 into master Aug 15, 2025
35 checks passed

bluestreak01 deleted the fix-double-fd-close branch August 15, 2025 14:09

Conversation

ideoma commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

ideoma commented Aug 15, 2025

Uh oh!

coderabbitai bot commented Aug 15, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ideoma commented Aug 15, 2025

Uh oh!

coderabbitai bot commented Aug 15, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

glasstiger commented Aug 15, 2025

[PR Coverage check]

file detail

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ideoma commented Aug 14, 2025 •

edited

Loading

coderabbitai bot commented Aug 14, 2025 •

edited

Loading