Skip to content

fix(core): potential connection leakage on Windows#6243

Merged
bluestreak01 merged 11 commits intomasterfrom
puzpuzpuz_flaky_ilp_tests
Oct 9, 2025
Merged

fix(core): potential connection leakage on Windows#6243
bluestreak01 merged 11 commits intomasterfrom
puzpuzpuz_flaky_ilp_tests

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

@puzpuzpuz puzpuzpuz commented Oct 8, 2025

The connections were leaked here due to broken operation id order:

int heartbeatRow = pendingHeartbeats.binarySearch(srcOpId, OPM_ID);
if (heartbeatRow < 0) {
continue; // The connection is already closed.
} else {

The operation id order was broken due to missing id assignment here (Linux and OSX dispatchers have that assignment):

protected void pendingAdded(int index) {
pending.set(index, OPM_OPERATION, initialBias == IODispatcherConfiguration.BIAS_READ ? IOOperation.READ : IOOperation.WRITE);
}

The bug may have manifested in the following scenario: you open a TCP connection, but write to it only after some time, so that heartbeat mechanism has the time to kick in.

Also, includes the following:

  • Moves operation and event id generation code to AbstractIODispatcher, so that it's common across all I/O dispatcher implementations.
  • OSX dispatcher used an int for operation id which is easy to overflow. On overflow, we'd face the same problem with broken operation id order invariant in pending and pendingHeartbeat lists. This is fixed by switching to long counter as in the other two dispatchers.
  • Marks listener field as volatile in all pool implementations to avoid data races in tests.

@puzpuzpuz puzpuzpuz self-assigned this Oct 8, 2025
@puzpuzpuz puzpuzpuz added DO NOT MERGE These changes should not be merged to main branch Schrödinger's bug labels Oct 8, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 8, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Network I/O dispatchers refactor identifier handling (introduce centralized ID sequencing, switch int→long IDs on macOS, remove per-OS generators), adjust Kqueue API to use long data and expose OS FDs, minor logging/format tweaks, add a volatile listener field, add null checks in tests, and tweak test timeouts/socket options.

Changes

Cohort / File(s) Summary of Changes
Native kqueue formatting
core/src/main/c/osx/kqueue.c
Whitespace adjustments around casts in return statements; no logic change.
Pool listener volatility
core/src/main/java/io/questdb/cairo/pool/AbstractPool.java
Make eventListener volatile with comment; no behavioral change elsewhere.
Dispatcher ID sequencing (core base)
core/src/main/java/io/questdb/network/AbstractIODispatcher.java
Add private idSeq and protected nextEventId()/nextOpId(); change pendingAdded from abstract to no-op concrete; integrate ID generation where pending is queued.
Linux dispatcher cleanup
core/src/main/java/io/questdb/network/IODispatcherLinux.java
Remove local ID sequencing (idSeq, nextEventId/nextOpId) and pendingAdded override; adjust logs/errno formatting; rely on existing IDs.
macOS dispatcher ID width and FD handling
core/src/main/java/io/questdb/network/IODispatcherOsx.java
Switch IDs from int to long across pending, suspend, heartbeat, and logging; replace LongHashSet→IntHashSet for handled OS FDs; adapt to Kqueue long data; compare FDs via Files.toOsFd; remove nextEventId/nextOpId and pendingAdded override; adjust KeventWriter API usage.
Windows dispatcher adjustments
core/src/main/java/io/questdb/network/IODispatcherWindows.java
Remove local op ID generation (idSeq/nextOpId); enhance heartbeat not-found logging; minor log formatting changes.
Kqueue API data width and FD exposure
core/src/main/java/io/questdb/network/Kqueue.java
Change data access from int→long (getData, readFD, writeFD, internal commonFd); write DATA_OFFSET as long; remove getFd(), add getOsFd(); update listen flags usage.
Line TCP fuzz test null-guards
core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverFuzzTest.java
Add null checks before table permit/return/mapping to avoid NPEs.
Line TCP test timeout and socket opts
core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverTest.java
Override getTimeout() to 10 minutes; set TCP no-delay and keep-alive in newSocket(); add Dates import.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as App code
  participant Dispatcher as AbstractIODispatcher
  participant OS as OS I/O (kqueue/epoll/IOCP)
  participant Conn as Connection

  rect rgb(240,248,255)
    note over Dispatcher: Centralized ID sequencing
    App->>Dispatcher: register(fd)
    Dispatcher->>Dispatcher: opId = nextOpId()
    Dispatcher->>OS: arm read/write with fd, opId (long on macOS)
  end

  OS-->>Dispatcher: event(fd, data=eventId/opId)
  Dispatcher->>Dispatcher: eventId = nextEventId()
  Dispatcher->>Conn: handleSocketOperation(fd, eventId/opId)
  alt heartbeat
    Dispatcher->>Dispatcher: correlate heartbeat by opId
  else normal I/O
    Dispatcher->>OS: re-arm as needed
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • mtopolnik

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.84% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title directly indicates that the pull request fixes a connection leakage issue on Windows, which is the primary problem addressed by the changeset, and correctly uses the conventional fix(core) prefix to signal a core bug fix. It is concise, specific, and understandable to teammates scanning the history.
Description Check ✅ Passed The description clearly outlines the root cause of the connection leakage bug in the Windows dispatcher, references the relevant code locations, and summarizes the additional refactoring changes for id generation, OSX counter adjustments, and volatile listener fields. It directly relates to the provided diff and gives sufficient context for reviewers.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@puzpuzpuz puzpuzpuz added Bug Incorrect or unexpected behavior Core Related to storage, data type, etc. and removed DO NOT MERGE These changes should not be merged to main branch Schrödinger's bug labels Oct 9, 2025
@puzpuzpuz puzpuzpuz changed the title test(ilp): harden ILP/TCP tests fix(core): potential connection leakage on Windows Oct 9, 2025
@bluestreak01
Copy link
Copy Markdown
Member

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 9, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/main/java/io/questdb/network/IODispatcherOsx.java (1)

231-238: Suspend event is re-registered during heartbeat instead of de-registered

Comment says "de-register", but code calls readFD(), which re-arms the event and then deletes its tracking row. This will fire an untracked kqueue event later and log "unexpected event id". Replace with removeReadFD().

-                } else {
-                    final long eventId = pendingEvents.get(eventRow, EVM_ID);
-                    keventWriter.prepare().readFD(suspendEvent.getFd(), eventId).done();
-                    pendingEvents.deleteRow(eventRow);
-                }
+                } else {
+                    // De-register the suspend event while heartbeat is in-flight.
+                    keventWriter.prepare().removeReadFD(suspendEvent.getFd()).done();
+                    pendingEvents.deleteRow(eventRow);
+                }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdd0fbf and 8ca493b.

📒 Files selected for processing (9)
  • core/src/main/c/osx/kqueue.c (1 hunks)
  • core/src/main/java/io/questdb/cairo/pool/AbstractPool.java (1 hunks)
  • core/src/main/java/io/questdb/network/AbstractIODispatcher.java (3 hunks)
  • core/src/main/java/io/questdb/network/IODispatcherLinux.java (11 hunks)
  • core/src/main/java/io/questdb/network/IODispatcherOsx.java (19 hunks)
  • core/src/main/java/io/questdb/network/IODispatcherWindows.java (2 hunks)
  • core/src/main/java/io/questdb/network/Kqueue.java (3 hunks)
  • core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverFuzzTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverTest.java (4 hunks)
🧰 Additional context used
🪛 ast-grep (0.39.6)
core/src/test/java/io/questdb/test/cutlass/line/tcp/AbstractLineTcpReceiverTest.java

[info] 303-303: "Detected use of a Java socket that is not encrypted. As a result, the
traffic could be read by an attacker intercepting the network traffic. Use
an SSLSocket created by 'SSLSocketFactory' or 'SSLServerSocketFactory'
instead."
Context: new Socket(sockaddr, fd)
Note: [CWE-319] Cleartext Transmission of Sensitive Information [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(unencrypted-socket-java)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
  • GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
  • GitHub Check: New pull request (Hosted Running tests on windows-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
  • GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
  • GitHub Check: New pull request (Check Changes Check changes)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (2)
core/src/main/java/io/questdb/network/IODispatcherOsx.java (1)

66-68: Parity sanity: event vs operation IDs

isEventId() assumes event IDs are odd and op IDs are even. Please confirm nextOpId()/nextEventId() in AbstractIODispatcher guarantee this invariant across all dispatchers.

Also applies to: 266-267, 322-329

core/src/main/java/io/questdb/network/Kqueue.java (1)

84-90: Confirm EV flags for listen()

listen() now uses EV_ADD only (no EV_ONESHOT). That’s likely correct for a persistent server socket. Please confirm this change was intentional.

bluestreak01
bluestreak01 previously approved these changes Oct 9, 2025
@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😞 fail : 15 / 71 (21.13%)

file detail

path covered line new line coverage
🔵 io/questdb/network/IODispatcherWindows.java 0 8 00.00%
🔵 io/questdb/network/IODispatcherOsx.java 0 28 00.00%
🔵 io/questdb/network/Kqueue.java 0 4 00.00%
🔵 io/questdb/network/IODispatcherLinux.java 9 25 36.00%
🔵 io/questdb/network/AbstractIODispatcher.java 6 6 100.00%

@bluestreak01 bluestreak01 merged commit eabbadc into master Oct 9, 2025
34 of 35 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_flaky_ilp_tests branch October 9, 2025 17:28
@bluestreak01
Copy link
Copy Markdown
Member

@CodeRabbit review. Although this is merged, it is possible a bug was introduced. Events from IODispatcher might be lost occasionally. Re-check the whole change

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 30, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the complete content of all modified dispatcher files
echo "=== AbstractIODispatcher.java ==="
cat core/src/main/java/io/questdb/network/AbstractIODispatcher.java

echo -e "\n=== IODispatcherLinux.java ==="
cat core/src/main/java/io/questdb/network/IODispatcherLinux.java

echo -e "\n=== IODispatcherOsx.java ==="
cat core/src/main/java/io/questdb/network/IODispatcherOsx.java

echo -e "\n=== IODispatcherWindows.java ==="
cat core/src/main/java/io/questdb/network/IODispatcherWindows.java

echo -e "\n=== Kqueue.java ==="
cat core/src/main/java/io/questdb/network/Kqueue.java

Length of output: 82256


@bluestreak01, I'll perform a thorough analysis of the changes to identify potential event loss issues.


🏁 Script executed:

#!/bin/bash
# Search for all usages of nextOpId and nextEventId to trace ID generation patterns
echo "=== nextOpId() call sites ==="
rg -n "nextOpId\(\)" --type java

echo -e "\n=== nextEventId() call sites ==="
rg -n "nextEventId\(\)" --type java

echo -e "\n=== Binary searches on OPM_ID ==="
rg -n "binarySearch.*OPM_ID" --type java

echo -e "\n=== Early returns/continues in processRegistrations ==="
rg -n -A2 -B2 "processRegistrations.*continue" --type java

Length of output: 2377


@bluestreak01, I've identified a critical bug in the ID generation logic that can cause event loss. Here's my analysis:

🔴 Critical Bug: Broken ID Ordering Invariant

Root Cause

In all three dispatcher implementations (IODispatcherLinux, IODispatcherOsx, IODispatcherWindows), the processRegistrations() method calls nextOpId() before validating whether the operation should proceed:

IODispatcherLinux.java (lines 241-256):

private boolean processRegistrations(long timestamp) {
    // ...
    while ((cursor = interestSubSeq.next()) > -1) {
        final IOEvent<C> event = interestQueue.get(cursor);
        final C context = event.context;
        final int requestedOperation = event.operation;
        final long srcOpId = context.getAndResetHeartbeatId();
        interestSubSeq.done(cursor);

        useful = true;
        final long opId = nextOpId();  // ❌ ID GENERATED HERE
        final long fd = context.getFd();

        int operation = requestedOperation;
        final SuspendEvent suspendEvent = context.getSuspendEvent();
        int epollCmd = EpollAccessor.EPOLL_CTL_MOD;
        if (requestedOperation == IOOperation.HEARTBEAT) {
            assert srcOpId != -1;

            int heartbeatRow = pendingHeartbeats.binarySearch(srcOpId, OPM_ID);
            if (heartbeatRow < 0) {
                LOG.info().$("could not find heartbeat, connection must be already closed [fd=").$(fd)
                        .$(", srcId=").$(srcOpId)
                        .I$();
                continue;  // ❌ EARLY EXIT - ID WASTED, ORDERING BROKEN
            }

The same pattern exists in IODispatcherOsx.java (line 261) and IODispatcherWindows.java (line 68).

Why This Causes Event Loss

  1. ID Generation: When nextOpId() is called, it increments idSeq and returns an even number (e.g., 200)
  2. Early Continue: If the heartbeat reference is not found, the method continues without adding anything to pending
  3. Wasted ID: The generated ID (200) is never used
  4. Next Iteration: The next valid operation gets ID 202
  5. Ordering Break: If operations are later looked up by their original ID, or if the binary search expects strict monotonic ordering in the insertion sequence, events can be lost or lookups can fail

Monotonic Ordering Requirement

The code relies heavily on binary search over OPM_ID:

  • pending.binarySearch(id, OPM_ID) - appears 9 times across all dispatchers
  • pendingHeartbeats.binarySearch(srcOpId, OPM_ID) - appears 4 times
  • pendingEvents.binarySearch(id, ...) - appears 4 times

Binary search requires that "the range must be sorted" and "if it is not sorted, the results are undefined."

While the IDs in pending might still be monotonically increasing (since new rows are always appended), the gaps in ID sequence mean that:

  • IDs from epoll/kqueue/select events might reference wasted IDs
  • Binary searches expecting contiguous or predictable ID patterns will fail
  • Event correlation between different matrices becomes unreliable

Affected Code Paths


🏁 Script executed:

#!/bin/bash
# Find all places where nextOpId is called early and could be wasted
echo "=== processRegistrations with early continues ==="
rg -n -B5 -A15 "final long opId = nextOpId\(\)" core/src/main/java/io/questdb/network/

echo -e "\n=== All continue statements after nextOpId ==="
rg -n -B10 "nextOpId.*continue" core/src/main/java/io/questdb/network/ || echo "Pattern not found, checking manually..."

# Check for the specific pattern
echo -e "\n=== Heartbeat lookup failures leading to continue ==="
rg -n -A3 "could not find heartbeat" core/src/main/java/io/questdb/network/

Length of output: 8487


🏁 Script executed:

#!/bin/bash
# Check for other early continue paths after nextOpId is called
echo "=== All continue statements in processRegistrations ==="
rg -n -B15 -A2 "^\s*continue;" core/src/main/java/io/questdb/network/IODispatcher*.java | grep -A15 -B15 "processRegistrations"

echo -e "\n=== Plaintext buffered early return path ==="
rg -n -B5 -A3 "isMorePlaintextBuffered.*publishOperation.*continue" core/src/main/java/io/questdb/network/

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Extract the full processRegistrations method from each dispatcher
echo "=== IODispatcherLinux processRegistrations ==="
sed -n '/private boolean processRegistrations/,/^    }/p' core/src/main/java/io/questdb/network/IODispatcherLinux.java | grep -n "continue"

echo -e "\n=== IODispatcherOsx processRegistrations ==="
sed -n '/private boolean processRegistrations/,/^    }/p' core/src/main/java/io/questdb/network/IODispatcherOsx.java | grep -n "continue"

echo -e "\n=== IODispatcherWindows processRegistrations ==="
sed -n '/private boolean processRegistrations/,/^    }/p' core/src/main/java/io/questdb/network/IODispatcherWindows.java | grep -n "continue"

echo -e "\n=== Now check isMorePlaintextBuffered pattern ==="
rg -n "isMorePlaintextBuffered" core/src/main/java/io/questdb/network/IODispatcher*.java

Length of output: 958


🏁 Script executed:

#!/bin/bash
# Get context around isMorePlaintextBuffered to see if it's after nextOpId
echo "=== IODispatcherLinux isMorePlaintextBuffered context ==="
rg -n -B10 -A5 "isMorePlaintextBuffered" core/src/main/java/io/questdb/network/IODispatcherLinux.java

echo -e "\n=== IODispatcherOsx isMorePlaintextBuffered context ==="
rg -n -B10 -A5 "isMorePlaintextBuffered" core/src/main/java/io/questdb/network/IODispatcherOsx.java

echo -e "\n=== IODispatcherWindows isMorePlaintextBuffered context ==="
rg -n -B10 -A5 "isMorePlaintextBuffered" core/src/main/java/io/questdb/network/IODispatcherWindows.java

Length of output: 2767


🔴 Confirmed: TWO Bug Locations Per Dispatcher

All three dispatchers (IODispatcherLinux, IODispatcherOsx, IODispatcherWindows) waste generated operation IDs in two locations within processRegistrations():

Bug #1: Heartbeat Not Found

  • Lines: 258 (Linux), 273 (OSX), 80 (Windows)
  • Scenario: When a heartbeat response arrives but the original heartbeat entry is missing from pendingHeartbeats
  • Impact: The opId generated at the start of the loop is wasted

Bug #2: Plaintext Already Buffered

  • Lines: 282 (Linux), 296 (OSX), 103 (Windows)
  • Scenario: When a READ operation is registered but TLS socket already has decrypted plaintext buffered
  • Impact: Operation is immediately published without adding to pending, wasting the opId

🔍 Event Loss Mechanism

Here's how events can be lost:

Scenario:

1. Connection A registers READ → opId = 100 (even number)
   - Plaintext buffered → continue (ID 100 wasted)
2. Connection B registers WRITE → opId = 102
   - Added to pending[0] with ID 102
3. Connection C registers READ → opId = 104
   - Added to pending[1] with ID 104
4. epoll/kqueue returns event with ID 102
5. Binary search: pending.binarySearch(102, OPM_ID)
   - pending has: [102, 104]
   - Search finds row 0 ✓

Later:
6. Connection D registers HEARTBEAT → opId = 106
   - Heartbeat not found → continue (ID 106 wasted)
7. Connection E registers READ → opId = 108
   - Added to pending[2] with ID 108
8. Connection F registers WRITE → opId = 110
   - Added to pending[3] with ID 110
9. epoll/kqueue returns event with ID 106 (delayed/resent)
10. Binary search: pending.binarySearch(106, OPM_ID)
    - pending has: [102, 104, 108, 110]
    - 106 not found → returns negative insertion point
11. LOG.critical "internal error: epoll returned unexpected id"
12. Event LOST ❌

While unlikely that a wasted ID would be returned by the kernel, the real problem is that the code assumes monotonic ordering for correctness. The gaps indicate a design flaw.

🔧 Recommended Fix

Move nextOpId() call to after all validation checks:

private boolean processRegistrations(long timestamp) {
    long cursor;
    boolean useful = false;
    while ((cursor = interestSubSeq.next()) > -1) {
        final IOEvent<C> event = interestQueue.get(cursor);
        final C context = event.context;
        final int requestedOperation = event.operation;
        final long srcOpId = context.getAndResetHeartbeatId();
        interestSubSeq.done(cursor);

        useful = true;
        final long fd = context.getFd();
        
        // ❌ REMOVED: final long opId = nextOpId();  

        int operation = requestedOperation;
        final SuspendEvent suspendEvent = context.getSuspendEvent();
        int epollCmd = EpollAccessor.EPOLL_CTL_MOD;
        if (requestedOperation == IOOperation.HEARTBEAT) {
            assert srcOpId != -1;

            int heartbeatRow = pendingHeartbeats.binarySearch(srcOpId, OPM_ID);
            if (heartbeatRow < 0) {
                LOG.info().$("could not find heartbeat, connection must be already closed [fd=").$(fd)
                        .$(", srcId=").$(srcOpId)
                        .I$();
                continue;  // ✓ Early exit OK - no ID generated yet
            } else {
                final long opId = nextOpId();  // ✓ Generate ID only when needed
                epollCmd = EpollAccessor.EPOLL_CTL_ADD;
                operation = (int) pendingHeartbeats.get(heartbeatRow, OPM_OPERATION);
                
                // ... rest of heartbeat handling with opId ...
            }
        } else {
            if (requestedOperation == IOOperation.READ && suspendEvent == null && context.getSocket().isMorePlaintextBuffered()) {
                publishOperation(IOOperation.READ, context);
                continue;  // ✓ Early exit OK - no ID generated yet
            }

            final long opId = nextOpId();  // ✓ Generate ID only when needed
            
            // ... rest of normal operation handling with opId ...
        }
    }
    return useful;
}

📊 Additional Concerns

  1. Non-atomic increment: idSeq++ in nextOpId() and nextEventId() is not thread-safe, but this appears acceptable since each dispatcher runs on a single thread.

  2. Suspend event correlation: The pendingEvents matrix uses EVM_OPERATION_ID to store references to operation IDs. If operation IDs have gaps, suspend event lookups could be affected.

  3. Test coverage: The PR has 0% coverage for Windows (8 new lines) and OSX (28 new lines), making these bugs difficult to detect.

✅ Summary

The bug exists in all three platform-specific dispatchers and can cause:

  • Event loss when wasted IDs are later returned by the kernel
  • Binary search failures due to broken monotonic ordering assumptions
  • Connection hangs when operations are not properly tracked

Recommendation: Apply the fix to move nextOpId() generation after all early-exit conditions in processRegistrations() for all three dispatchers.

Would you like me to create an issue to track this fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect or unexpected behavior Core Related to storage, data type, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants