fix(core): potential connection leakage on Windows#6243
fix(core): potential connection leakage on Windows#6243bluestreak01 merged 11 commits intomasterfrom
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughNetwork I/O dispatchers refactor identifier handling (introduce centralized ID sequencing, switch int→long IDs on macOS, remove per-OS generators), adjust Kqueue API to use long data and expose OS FDs, minor logging/format tweaks, add a volatile listener field, add null checks in tests, and tweak test timeouts/socket options. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant App as App code
participant Dispatcher as AbstractIODispatcher
participant OS as OS I/O (kqueue/epoll/IOCP)
participant Conn as Connection
rect rgb(240,248,255)
note over Dispatcher: Centralized ID sequencing
App->>Dispatcher: register(fd)
Dispatcher->>Dispatcher: opId = nextOpId()
Dispatcher->>OS: arm read/write with fd, opId (long on macOS)
end
OS-->>Dispatcher: event(fd, data=eventId/opId)
Dispatcher->>Dispatcher: eventId = nextEventId()
Dispatcher->>Conn: handleSocketOperation(fd, eventId/opId)
alt heartbeat
Dispatcher->>Dispatcher: correlate heartbeat by opId
else normal I/O
Dispatcher->>OS: re-arm as needed
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
core/src/main/java/io/questdb/network/IODispatcherOsx.java (1)
231-238: Suspend event is re-registered during heartbeat instead of de-registeredComment says "de-register", but code calls readFD(), which re-arms the event and then deletes its tracking row. This will fire an untracked kqueue event later and log "unexpected event id". Replace with removeReadFD().
- } else { - final long eventId = pendingEvents.get(eventRow, EVM_ID); - keventWriter.prepare().readFD(suspendEvent.getFd(), eventId).done(); - pendingEvents.deleteRow(eventRow); - } + } else { + // De-register the suspend event while heartbeat is in-flight. + keventWriter.prepare().removeReadFD(suspendEvent.getFd()).done(); + pendingEvents.deleteRow(eventRow); + }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
core/src/main/c/osx/kqueue.c(1 hunks)core/src/main/java/io/questdb/cairo/pool/AbstractPool.java(1 hunks)core/src/main/java/io/questdb/network/AbstractIODispatcher.java(3 hunks)core/src/main/java/io/questdb/network/IODispatcherLinux.java(11 hunks)core/src/main/java/io/questdb/network/IODispatcherOsx.java(19 hunks)core/src/main/java/io/questdb/network/IODispatcherWindows.java(2 hunks)core/src/main/java/io/questdb/network/Kqueue.java(3 hunks)core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverFuzzTest.java(1 hunks)core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverTest.java(4 hunks)
🧰 Additional context used
🪛 ast-grep (0.39.6)
core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverTest.java
[info] 303-303: "Detected use of a Java socket that is not encrypted. As a result, the
traffic could be read by an attacker intercepting the network traffic. Use
an SSLSocket created by 'SSLSocketFactory' or 'SSLServerSocketFactory'
instead."
Context: new Socket(sockaddr, fd)
Note: [CWE-319] Cleartext Transmission of Sensitive Information [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(unencrypted-socket-java)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (Check Changes Check changes)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (2)
core/src/main/java/io/questdb/network/IODispatcherOsx.java (1)
66-68: Parity sanity: event vs operation IDsisEventId() assumes event IDs are odd and op IDs are even. Please confirm nextOpId()/nextEventId() in AbstractIODispatcher guarantee this invariant across all dispatchers.
Also applies to: 266-267, 322-329
core/src/main/java/io/questdb/network/Kqueue.java (1)
84-90: Confirm EV flags for listen()listen() now uses EV_ADD only (no EV_ONESHOT). That’s likely correct for a persistent server socket. Please confirm this change was intentional.
[PR Coverage check]😞 fail : 15 / 71 (21.13%) file detail
|
|
@CodeRabbit review. Although this is merged, it is possible a bug was introduced. Events from IODispatcher might be lost occasionally. Re-check the whole change |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 82256 @bluestreak01, I'll perform a thorough analysis of the changes to identify potential event loss issues. 🏁 Script executed: Length of output: 2377 @bluestreak01, I've identified a critical bug in the ID generation logic that can cause event loss. Here's my analysis: 🔴 Critical Bug: Broken ID Ordering InvariantRoot CauseIn all three dispatcher implementations ( IODispatcherLinux.java (lines 241-256): private boolean processRegistrations(long timestamp) {
// ...
while ((cursor = interestSubSeq.next()) > -1) {
final IOEvent<C> event = interestQueue.get(cursor);
final C context = event.context;
final int requestedOperation = event.operation;
final long srcOpId = context.getAndResetHeartbeatId();
interestSubSeq.done(cursor);
useful = true;
final long opId = nextOpId(); // ❌ ID GENERATED HERE
final long fd = context.getFd();
int operation = requestedOperation;
final SuspendEvent suspendEvent = context.getSuspendEvent();
int epollCmd = EpollAccessor.EPOLL_CTL_MOD;
if (requestedOperation == IOOperation.HEARTBEAT) {
assert srcOpId != -1;
int heartbeatRow = pendingHeartbeats.binarySearch(srcOpId, OPM_ID);
if (heartbeatRow < 0) {
LOG.info().$("could not find heartbeat, connection must be already closed [fd=").$(fd)
.$(", srcId=").$(srcOpId)
.I$();
continue; // ❌ EARLY EXIT - ID WASTED, ORDERING BROKEN
}The same pattern exists in Why This Causes Event Loss
Monotonic Ordering RequirementThe code relies heavily on binary search over
Binary search requires that "the range must be sorted" and "if it is not sorted, the results are undefined." While the IDs in
Affected Code Paths🏁 Script executed: Length of output: 8487 🏁 Script executed: Length of output: 101 🏁 Script executed: Length of output: 958 🏁 Script executed: Length of output: 2767 🔴 Confirmed: TWO Bug Locations Per DispatcherAll three dispatchers ( Bug #1: Heartbeat Not Found
Bug #2: Plaintext Already Buffered
🔍 Event Loss MechanismHere's how events can be lost: Scenario: While unlikely that a wasted ID would be returned by the kernel, the real problem is that the code assumes monotonic ordering for correctness. The gaps indicate a design flaw. 🔧 Recommended FixMove private boolean processRegistrations(long timestamp) {
long cursor;
boolean useful = false;
while ((cursor = interestSubSeq.next()) > -1) {
final IOEvent<C> event = interestQueue.get(cursor);
final C context = event.context;
final int requestedOperation = event.operation;
final long srcOpId = context.getAndResetHeartbeatId();
interestSubSeq.done(cursor);
useful = true;
final long fd = context.getFd();
// ❌ REMOVED: final long opId = nextOpId();
int operation = requestedOperation;
final SuspendEvent suspendEvent = context.getSuspendEvent();
int epollCmd = EpollAccessor.EPOLL_CTL_MOD;
if (requestedOperation == IOOperation.HEARTBEAT) {
assert srcOpId != -1;
int heartbeatRow = pendingHeartbeats.binarySearch(srcOpId, OPM_ID);
if (heartbeatRow < 0) {
LOG.info().$("could not find heartbeat, connection must be already closed [fd=").$(fd)
.$(", srcId=").$(srcOpId)
.I$();
continue; // ✓ Early exit OK - no ID generated yet
} else {
final long opId = nextOpId(); // ✓ Generate ID only when needed
epollCmd = EpollAccessor.EPOLL_CTL_ADD;
operation = (int) pendingHeartbeats.get(heartbeatRow, OPM_OPERATION);
// ... rest of heartbeat handling with opId ...
}
} else {
if (requestedOperation == IOOperation.READ && suspendEvent == null && context.getSocket().isMorePlaintextBuffered()) {
publishOperation(IOOperation.READ, context);
continue; // ✓ Early exit OK - no ID generated yet
}
final long opId = nextOpId(); // ✓ Generate ID only when needed
// ... rest of normal operation handling with opId ...
}
}
return useful;
}📊 Additional Concerns
✅ SummaryThe bug exists in all three platform-specific dispatchers and can cause:
Recommendation: Apply the fix to move Would you like me to create an issue to track this fix? |
The connections were leaked here due to broken operation id order:
questdb/core/src/main/java/io/questdb/network/IODispatcherWindows.java
Lines 81 to 84 in bdd0fbf
The operation id order was broken due to missing id assignment here (Linux and OSX dispatchers have that assignment):
questdb/core/src/main/java/io/questdb/network/IODispatcherWindows.java
Lines 149 to 151 in bdd0fbf
The bug may have manifested in the following scenario: you open a TCP connection, but write to it only after some time, so that heartbeat mechanism has the time to kick in.
Also, includes the following:
AbstractIODispatcher, so that it's common across all I/O dispatcher implementations.intfor operation id which is easy to overflow. On overflow, we'd face the same problem with broken operation id order invariant inpendingandpendingHeartbeatlists. This is fixed by switching tolongcounter as in the other two dispatchers.volatilein all pool implementations to avoid data races in tests.