Skip to content

fix(core): fix potential File Descriptor leak on table drop#6053

Merged
bluestreak01 merged 7 commits intomasterfrom
fix-double-fd-close
Aug 15, 2025
Merged

fix(core): fix potential File Descriptor leak on table drop#6053
bluestreak01 merged 7 commits intomasterfrom
fix-double-fd-close

Conversation

@ideoma
Copy link
Copy Markdown
Collaborator

@ideoma ideoma commented Aug 14, 2025

A file descriptor leak was found by fuzz tests. This PR fixes a file descriptor leak that occurs when a table is dropped and WAL (Write-Ahead Log) files are being purged concurrently with TableWriter operations. The leak happens due to double-closing of file descriptors when exceptions occur during WAL file operations. The scenario is

  • A table is dropped
  • WAL Purge job starts deleting the WAL files
  • TableWriter may be at this moment try to open the WAL files that are being deleted
  • TableWriter gets an exception on WAL file open and closes the open WAL files
  • Then, on closing the TableWriter object, the WAL files are closed the second time and fail because of the double close guard in FdCache.
  • The last exception aborts TableWriter cleanup and leads to a file descriptor leak.

The fix is to prevent double close attempt of WAL files in this situation.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Aug 14, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Relocates mmapWalColumns call in TableWriter to before the try block; refactors TableWriterSegmentFileCache to add conservative rollback and FD-cache cleanup on partial mmap failures; updates BitmapIndexConcurrentTest to use per-thread SQL execution contexts and extend executor termination timeout.

Changes

Cohort / File(s) Summary
TableWriter WAL mapping change
core/src/main/java/io/questdb/cairo/TableWriter.java
Moved mmapWalColumns invocation out of the try block so mapping failures are handled by mmapWalColumns itself and do not interact with the method-level finally WAL-close logic.
Segment file cache mmap & FD cleanup
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java
Added initial mapped-columns size snapshot and stricter rollback on mmap failures; free partially mapped columns on error, call closeWalFiles(...) on failure, reset walMappedColumns position, and ensure cached FD lists are cleared/returned to pool to avoid leaks/double-closes.
Concurrent test: per-thread SQL context
core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java
Create and use a per-thread SQL execution context (try-with-resources) for update threads, call engine.execute(updateSql, ctx), clear thread-locals via Path.clearThreadLocals(), and increase awaitTermination timeout from 5s to 60s with an assertion.

Sequence Diagram(s)

sequenceDiagram
  participant TW as TableWriter
  participant SFC as TableWriterSegmentFileCache
  participant WAL as WAL Files

  TW->>SFC: mmapWalColumns()
  alt mmap fails
    SFC->>SFC: free partial mapped columns
    SFC->>WAL: closeWalFiles(...)
    SFC-->>TW: throw exception
    note right of TW: method-level finally won't double-close WAL
  else mmap succeeds
    TW->>TW: try { processWalCommitBlock }
    TW-->>WAL: finally { close WAL files }
  end
Loading
sequenceDiagram
  participant Test as BitmapIndexConcurrentTest
  participant Eng as Engine
  participant Ctx as Thread-local SQL Context

  loop per update thread
    Test->>Ctx: create (TestUtils.createSqlExecutionCtx)
    Test->>Eng: execute(updateSql, Ctx)
    Eng-->>Test: result
    Test->>Ctx: close (try-with-resources)
    Test->>Test: Path.clearThreadLocals()
  end
  Test->>Test: awaitTermination(60s) and assert terminated
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • bluestreak01
✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-double-fd-close

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@ideoma ideoma requested a review from Copilot August 14, 2025 17:59
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a file descriptor leak that occurs when a table is dropped and WAL (Write-Ahead Log) files are being purged concurrently with TableWriter operations. The leak happens due to double-closing of file descriptors when exceptions occur during WAL file operations.

  • Adds proper cleanup logic to prevent double-closing of file descriptors in exception scenarios
  • Moves WAL column mapping outside try-catch blocks to avoid duplicate cleanup
  • Implements cache removal and resource cleanup in finally blocks

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
TableWriterSegmentFileCache.java Adds finally block for FD cache cleanup and exception handling improvements to prevent double-close scenarios
TableWriter.java Moves WAL column mapping outside try block to prevent duplicate cleanup on exceptions

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@ideoma ideoma changed the title fix(core): fix potential double fd leak on table drop fix(core): fix potential File Descriptor leak on table drop Aug 14, 2025
@ideoma
Copy link
Copy Markdown
Collaborator Author

ideoma commented Aug 15, 2025

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Aug 15, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

208-209: Good call capturing pre-mapping size for precise cleanup

Saving initialSize before mapping ensures the catch path targets only newly added mappings for cleanup, avoiding both leaks and accidental over-closing.

The earlier suggestion to use walMappedColumns.size() in the catch would have resulted in no-op cleanup for newly added columns. Using initialSize is correct here.

🧹 Nitpick comments (4)
core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (3)

179-183: Avoid String.format in hot loop to reduce allocations

This is a tight loop; String.format allocates and is slower. Simple concatenation is sufficient and cheaper.

Apply this diff:

-                                String updateSql = String.format(
-                                        "UPDATE trades SET symbol = '%s' WHERE id = %d",
-                                        randomSymbol, randomId
-                                );
+                                String updateSql = "UPDATE trades SET symbol = '" + randomSymbol + "' WHERE id = " + randomId;

172-194: Optional: mirror per-thread execution context for readers to eliminate any shared-context risk

Writers and updaters now use thread-local contexts; readers still share sqlExecutionContext (Line 229). It may be safe, but for symmetry and to preempt subtle races, consider a per-thread context for readers too.

Example (outside the selected lines, for the reader runnable body):

try (var readCtx = TestUtils.createSqlExecutionCtx(engine)) {
    while (!stopFlag.get() && errorCount.get() == 0) {
        // ...
        try (RecordCursorFactory factory = select(querySql)) {
            try (RecordCursor cursor = factory.getCursor(readCtx)) {
                // ...
            }
        }
    }
}

281-281: Harden executor shutdown with a forced fallback to avoid lingering threads on failure

If termination exceeds 60s, issue shutdownNow and wait briefly again. This prevents leaked threads when an assertion fails.

Apply this diff:

-        Assert.assertTrue("failed to terminate threads within 60s", executor.awaitTermination(60, TimeUnit.SECONDS));
+        if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
+            executor.shutdownNow();
+            Assert.assertTrue(
+                    "failed to terminate threads within 60s (forced shutdown issued)",
+                    executor.awaitTermination(10, TimeUnit.SECONDS)
+            );
+        }
core/src/main/java/io/questdb/cairo/TableWriter.java (1)

7655-7660: Nit: tighten the comment wording for clarity

Minor grammar/wording improvements make the intent clearer and reference the double-close guard explicitly.

-        // Don't move the line to mmap Wal column inside the following try block,
-        // This call, if failed will close the WAL files correctly on its own
-        // putting it inside the try block will cause the WAL files to be closed twice in the finally block
-        // in case of the exception.
+        // Important: do not move the mmapWalColumns call inside the try block below.
+        // If mmapWalColumns fails, it closes the WAL files itself.
+        // Putting it inside the try would cause the finally block to close the WAL files again,
+        // tripping the FdCache double-close guard in case of an exception.
         segmentFileCache.mmapWalColumns(segmentCopyInfo, metadata, path);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7454770 and 6ce8519.

📒 Files selected for processing (3)
  • core/src/main/java/io/questdb/cairo/TableWriter.java (1 hunks)
  • core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (3 hunks)
  • core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (Hosted Running tests on windows-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
  • GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
  • GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (3)
core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (1)

172-187: LGTM: per-thread SqlExecutionContext with try-with-resources prevents cross-thread state contamination

Creating a thread-local SqlExecutionContext and closing it via try-with-resources is the right move here. It isolates UPDATE execution, avoids shared state races, and guarantees cleanup even on failures. The explicit Path.clearThreadLocals() in the outer finally complements this well.

core/src/main/java/io/questdb/cairo/TableWriter.java (1)

7655-7660: Good fix: mmapWalColumns moved outside try to prevent double-close and FD leak

Placing segmentFileCache.mmapWalColumns(...) before the try avoids the finally-block double-close when mapping fails (since the mapper handles its own cleanup on failure). This aligns with the described FD leak scenario and is a safe, minimal change.

core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

326-341: Robust nested cleanup around WAL column mapping

The inner try/catch ensures partial mappings are freed and cached FDs are closed on failure, preventing leaks when mapping multiple segments. This aligns with the PR’s goal of avoiding double-close and leak scenarios.

@ideoma
Copy link
Copy Markdown
Collaborator Author

ideoma commented Aug 15, 2025

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Aug 15, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

303-314: Critical: stale fdCacheKey removal in finally can double-remove and corrupt walFdCache state

If mmap fails, the catch path calls closeWalFiles(true, ...), which removes the cached FD entry and returns the LongList to the pool. The finally block still executes walFdCache.removeAt(fdCacheKey) using the stale index and decrements walFdCacheSize again. This risks:

  • Removing a different entry (stale index) from walFdCache
  • Double-decrementing walFdCacheSize
  • Double-pushing the same LongList to the pool (if fds is still non-null in other paths)

Re-resolve the key in finally and only remove if the key still exists. Also avoid double-pushing by handling the fds reference accordingly.

Apply this diff:

-            // Now that the FDs are used in the column objects, remove them from the cache.
-            // to avoid double close in case of exceptions.
-            if (fdCacheKey < 0) {
-                walFdCache.removeAt(fdCacheKey);
-                walFdCacheSize--;
-            }
-
-            if (fds != null) {
-                fds.clear();
-                walFdCacheListPool.push(fds);
-            }
+            // Remove cached FDs only if they still exist (they may have been removed in the catch branch).
+            final int currentKey = walFdCache.keyIndex(walSegmentId);
+            if (currentKey < 0) {
+                final LongList cached = walFdCache.valueAt(currentKey);
+                walFdCache.removeAt(currentKey);
+                walFdCacheSize--;
+                cached.clear();
+                walFdCacheListPool.push(cached);
+            } else if (fds != null) {
+                // No cache entry remains; just return the list we took earlier, if any.
+                fds.clear();
+                walFdCacheListPool.push(fds);
+            }
🧹 Nitpick comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

297-301: Using initialSize for partial rollback is correct

Passing initialSize to closeWalFiles(true, walSegmentId, initialSize) closes only the newly mapped columns within this method, which is the intended partial rollback behavior. The explicit walMappedColumns.setPos(initialSize) is redundant given closeWalFiles already sets the position but harmless.

If you want to reduce redundancy, you can remove the setPos(initialSize) line since closeWalFiles already handles it.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6ce8519 and 38d6570.

📒 Files selected for processing (1)
  • core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
  • GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (Hosted Running tests on windows-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)

331-347: Robust failure handling in mmapWalColumns looks good

The inner try-catch ensures that on any partial failure:

  • All mapped columns are freed and the list is cleared.
  • Cached FDs are closed to avoid leaks.
  • The exception is rethrown, preserving original failure semantics.

This aligns with the PR goal to prevent FD leaks during concurrent WAL purges.

@bluestreak01 bluestreak01 added Bug Incorrect or unexpected behavior storage labels Aug 15, 2025
@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 19 / 23 (82.61%)

file detail

path covered line new line coverage
🔵 io/questdb/cairo/TableWriterSegmentFileCache.java 18 22 81.82%
🔵 io/questdb/cairo/TableWriter.java 1 1 100.00%

@bluestreak01 bluestreak01 merged commit f34f936 into master Aug 15, 2025
35 checks passed
@bluestreak01 bluestreak01 deleted the fix-double-fd-close branch August 15, 2025 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect or unexpected behavior storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants