fix(core): fix potential File Descriptor leak on table drop#6053
fix(core): fix potential File Descriptor leak on table drop#6053bluestreak01 merged 7 commits intomasterfrom
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughRelocates mmapWalColumns call in TableWriter to before the try block; refactors TableWriterSegmentFileCache to add conservative rollback and FD-cache cleanup on partial mmap failures; updates BitmapIndexConcurrentTest to use per-thread SQL execution contexts and extend executor termination timeout. Changes
Sequence Diagram(s)sequenceDiagram
participant TW as TableWriter
participant SFC as TableWriterSegmentFileCache
participant WAL as WAL Files
TW->>SFC: mmapWalColumns()
alt mmap fails
SFC->>SFC: free partial mapped columns
SFC->>WAL: closeWalFiles(...)
SFC-->>TW: throw exception
note right of TW: method-level finally won't double-close WAL
else mmap succeeds
TW->>TW: try { processWalCommitBlock }
TW-->>WAL: finally { close WAL files }
end
sequenceDiagram
participant Test as BitmapIndexConcurrentTest
participant Eng as Engine
participant Ctx as Thread-local SQL Context
loop per update thread
Test->>Ctx: create (TestUtils.createSqlExecutionCtx)
Test->>Eng: execute(updateSql, Ctx)
Eng-->>Test: result
Test->>Ctx: close (try-with-resources)
Test->>Test: Path.clearThreadLocals()
end
Test->>Test: awaitTermination(60s) and assert terminated
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a file descriptor leak that occurs when a table is dropped and WAL (Write-Ahead Log) files are being purged concurrently with TableWriter operations. The leak happens due to double-closing of file descriptors when exceptions occur during WAL file operations.
- Adds proper cleanup logic to prevent double-closing of file descriptors in exception scenarios
- Moves WAL column mapping outside try-catch blocks to avoid duplicate cleanup
- Implements cache removal and resource cleanup in finally blocks
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| TableWriterSegmentFileCache.java | Adds finally block for FD cache cleanup and exception handling improvements to prevent double-close scenarios |
| TableWriter.java | Moves WAL column mapping outside try block to prevent duplicate cleanup on exceptions |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Co-authored-by: Copilot <[email protected]>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)
208-209: Good call capturing pre-mapping size for precise cleanupSaving
initialSizebefore mapping ensures the catch path targets only newly added mappings for cleanup, avoiding both leaks and accidental over-closing.The earlier suggestion to use
walMappedColumns.size()in the catch would have resulted in no-op cleanup for newly added columns. UsinginitialSizeis correct here.
🧹 Nitpick comments (4)
core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (3)
179-183: Avoid String.format in hot loop to reduce allocationsThis is a tight loop; String.format allocates and is slower. Simple concatenation is sufficient and cheaper.
Apply this diff:
- String updateSql = String.format( - "UPDATE trades SET symbol = '%s' WHERE id = %d", - randomSymbol, randomId - ); + String updateSql = "UPDATE trades SET symbol = '" + randomSymbol + "' WHERE id = " + randomId;
172-194: Optional: mirror per-thread execution context for readers to eliminate any shared-context riskWriters and updaters now use thread-local contexts; readers still share sqlExecutionContext (Line 229). It may be safe, but for symmetry and to preempt subtle races, consider a per-thread context for readers too.
Example (outside the selected lines, for the reader runnable body):
try (var readCtx = TestUtils.createSqlExecutionCtx(engine)) { while (!stopFlag.get() && errorCount.get() == 0) { // ... try (RecordCursorFactory factory = select(querySql)) { try (RecordCursor cursor = factory.getCursor(readCtx)) { // ... } } } }
281-281: Harden executor shutdown with a forced fallback to avoid lingering threads on failureIf termination exceeds 60s, issue shutdownNow and wait briefly again. This prevents leaked threads when an assertion fails.
Apply this diff:
- Assert.assertTrue("failed to terminate threads within 60s", executor.awaitTermination(60, TimeUnit.SECONDS)); + if (!executor.awaitTermination(60, TimeUnit.SECONDS)) { + executor.shutdownNow(); + Assert.assertTrue( + "failed to terminate threads within 60s (forced shutdown issued)", + executor.awaitTermination(10, TimeUnit.SECONDS) + ); + }core/src/main/java/io/questdb/cairo/TableWriter.java (1)
7655-7660: Nit: tighten the comment wording for clarityMinor grammar/wording improvements make the intent clearer and reference the double-close guard explicitly.
- // Don't move the line to mmap Wal column inside the following try block, - // This call, if failed will close the WAL files correctly on its own - // putting it inside the try block will cause the WAL files to be closed twice in the finally block - // in case of the exception. + // Important: do not move the mmapWalColumns call inside the try block below. + // If mmapWalColumns fails, it closes the WAL files itself. + // Putting it inside the try would cause the finally block to close the WAL files again, + // tripping the FdCache double-close guard in case of an exception. segmentFileCache.mmapWalColumns(segmentCopyInfo, metadata, path);
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
core/src/main/java/io/questdb/cairo/TableWriter.java(1 hunks)core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java(3 hunks)core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (Check Changes Check changes)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (3)
core/src/test/java/io/questdb/test/cairo/BitmapIndexConcurrentTest.java (1)
172-187: LGTM: per-thread SqlExecutionContext with try-with-resources prevents cross-thread state contaminationCreating a thread-local SqlExecutionContext and closing it via try-with-resources is the right move here. It isolates UPDATE execution, avoids shared state races, and guarantees cleanup even on failures. The explicit Path.clearThreadLocals() in the outer finally complements this well.
core/src/main/java/io/questdb/cairo/TableWriter.java (1)
7655-7660: Good fix: mmapWalColumns moved outside try to prevent double-close and FD leakPlacing
segmentFileCache.mmapWalColumns(...)before the try avoids the finally-block double-close when mapping fails (since the mapper handles its own cleanup on failure). This aligns with the described FD leak scenario and is a safe, minimal change.core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)
326-341: Robust nested cleanup around WAL column mappingThe inner try/catch ensures partial mappings are freed and cached FDs are closed on failure, preventing leaks when mapping multiple segments. This aligns with the PR’s goal of avoiding double-close and leak scenarios.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)
303-314: Critical: stale fdCacheKey removal in finally can double-remove and corrupt walFdCache stateIf mmap fails, the catch path calls closeWalFiles(true, ...), which removes the cached FD entry and returns the LongList to the pool. The finally block still executes walFdCache.removeAt(fdCacheKey) using the stale index and decrements walFdCacheSize again. This risks:
- Removing a different entry (stale index) from walFdCache
- Double-decrementing walFdCacheSize
- Double-pushing the same LongList to the pool (if fds is still non-null in other paths)
Re-resolve the key in finally and only remove if the key still exists. Also avoid double-pushing by handling the fds reference accordingly.
Apply this diff:
- // Now that the FDs are used in the column objects, remove them from the cache. - // to avoid double close in case of exceptions. - if (fdCacheKey < 0) { - walFdCache.removeAt(fdCacheKey); - walFdCacheSize--; - } - - if (fds != null) { - fds.clear(); - walFdCacheListPool.push(fds); - } + // Remove cached FDs only if they still exist (they may have been removed in the catch branch). + final int currentKey = walFdCache.keyIndex(walSegmentId); + if (currentKey < 0) { + final LongList cached = walFdCache.valueAt(currentKey); + walFdCache.removeAt(currentKey); + walFdCacheSize--; + cached.clear(); + walFdCacheListPool.push(cached); + } else if (fds != null) { + // No cache entry remains; just return the list we took earlier, if any. + fds.clear(); + walFdCacheListPool.push(fds); + }
🧹 Nitpick comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)
297-301: Using initialSize for partial rollback is correctPassing initialSize to closeWalFiles(true, walSegmentId, initialSize) closes only the newly mapped columns within this method, which is the intended partial rollback behavior. The explicit walMappedColumns.setPos(initialSize) is redundant given closeWalFiles already sets the position but harmless.
If you want to reduce redundancy, you can remove the setPos(initialSize) line since closeWalFiles already handles it.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (Check Changes Check changes)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (1)
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java (1)
331-347: Robust failure handling in mmapWalColumns looks goodThe inner try-catch ensures that on any partial failure:
- All mapped columns are freed and the list is cleared.
- Cached FDs are closed to avoid leaks.
- The exception is rethrown, preserving original failure semantics.
This aligns with the PR goal to prevent FD leaks during concurrent WAL purges.
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/questdb/cairo/TableWriterSegmentFileCache.java
Outdated
Show resolved
Hide resolved
[PR Coverage check]😍 pass : 19 / 23 (82.61%) file detail
|
A file descriptor leak was found by fuzz tests. This PR fixes a file descriptor leak that occurs when a table is dropped and WAL (Write-Ahead Log) files are being purged concurrently with TableWriter operations. The leak happens due to double-closing of file descriptors when exceptions occur during WAL file operations. The scenario is
The fix is to prevent double close attempt of WAL files in this situation.