fix(core): fix server failing to accept connections under high-load #6334

bluestreak01 merged 19 commits into master
Conversation
On busy Linux systems, when many clients try to open new connections simultaneously, the server could fail to accept some of them. This manifests as client connection timeouts.
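To make the failure mode concrete, here is a minimal, hypothetical Java sketch (not QuestDB code; all names are illustrative) of the edge-triggered accept problem: the kernel reports readiness once per edge, so if the accept loop exits early, the connections still queued get no further notification unless the server remembers to retry.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of an edge-triggered accept loop. The names and structure are
// illustrative only; real code would call accept(2) until EWOULDBLOCK.
public class EdgeTriggeredAccept {

    // Drains the pending-connection backlog. Returns true when the queue was
    // fully emptied, false on early exit (e.g. a per-round accept limit),
    // mirroring the boolean status this PR introduces.
    static boolean drain(Deque<Integer> backlog, int maxPerRound) {
        int accepted = 0;
        while (!backlog.isEmpty()) {
            if (accepted == maxPerRound) {
                return false; // connections remain queued: must retry later
            }
            backlog.pollFirst(); // stands in for accept(2)
            accepted++;
        }
        return true; // queue drained: safe to wait for the next edge
    }

    // Simulates one edge notification for 5 queued clients with a limit of 3
    // accepts per round; returns how many clients would be left stranded.
    static int strandedAfterOneRound() {
        Deque<Integer> backlog = new ArrayDeque<>();
        for (int fd = 0; fd < 5; fd++) {
            backlog.addLast(fd);
        }
        drain(backlog, 3); // early exit: without a retry flag, epoll stays silent
        return backlog.size();
    }

    public static void main(String[] args) {
        System.out.println("stranded clients: " + strandedAfterOneRound());
    }
}
```

With edge-triggered epoll the leftover clients never generate another readiness event, which is exactly why a fix has to remember that the queue was not fully drained and retry on a later loop iteration.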
Important: Review skipped. Auto reviews are disabled on this repository; please check the settings in the CodeRabbit UI. You can disable this status message in the CodeRabbit configuration.

Walkthrough
The PR adds JVM garbage collection flags to CI configurations, refactors IO dispatcher methods to return boolean status indicators instead of void, updates platform-specific dispatcher implementations to consume these return values for control-flow decisions, and extends test infrastructure with accept-loop timeout configuration and improved error handling.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Suggested labels
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai please review this little piggie 🐷
@jerrinot 🐷 This is a well-crafted fix for a subtle edge-triggered epoll issue! Here's my review:

The Good 🎯

Core Fix is Correct.

Smart Optimization:

```java
// If a new connection event arrived in this round, the main loop's accept()
// would have drained all connections, resetting pendingAccept to false
if (pendingAccept && isListening()) {
    pendingAccept = !accept(timestamp);
    useful = true;
}
```

This avoids unnecessary syscalls when the queue was already drained by the main loop.

Good Documentation. Test Coverage.

Minor Observations 🔍

Nitpick 🤏

The Verdict ✅
This fix properly addresses the high-load connection acceptance issue. The implementation is correct, well-documented, and includes good test coverage. Ship it! 🚢
… on very slow CI.
…eive just a single notification, no matter how many connections are waiting
@CodeRabbit review. This PR fixes message loss around timeouts. Could you check if message loss in IODispatchers is possible under any other circumstances?
@bluestreak01, great question! I've analyzed the IODispatcher implementations across all platforms. Your fix addresses the connection acceptance issue excellently. Here's what I found regarding other potential message loss scenarios:

✅ Areas That Look Good
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- ci/templates/hosted-jobs.yml (1 hunks)
- ci/test-pipeline.yml (1 hunks)
- core/src/main/java/io/questdb/network/AbstractIODispatcher.java (4 hunks)
- core/src/main/java/io/questdb/network/IODispatcherLinux.java (3 hunks)
- core/src/main/java/io/questdb/network/IODispatcherOsx.java (1 hunks)
- core/src/main/java/io/questdb/network/IODispatcherWindows.java (1 hunks)
- core/src/test/java/io/questdb/test/TestListener.java (3 hunks)
- core/src/test/java/io/questdb/test/cutlass/pgwire/BasePGTest.java (2 hunks)
- core/src/test/java/io/questdb/test/cutlass/pgwire/PGArraysTest.java (1 hunks)
- core/src/test/java/io/questdb/test/cutlass/pgwire/PGJobContextTest.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (Check Changes Check changes)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (18)
core/src/test/java/io/questdb/test/cutlass/pgwire/BasePGTest.java (2)
94-94: LGTM! Good default timeout value. The field declaration provides a sensible default of 500ms and allows test subclasses to customize the accept loop timeout for exercising different timing scenarios.
589-592: LGTM! Configuration properly propagated. The override correctly propagates the test-configurable `acceptLoopTimeout` to the PG server configuration, enabling tests to exercise various timeout values.

core/src/test/java/io/questdb/test/cutlass/pgwire/PGArraysTest.java (2)
81-81: LGTM! Randomization supports timing bug detection. Randomizing the accept loop timeout (10-509 ms) helps catch timing-dependent issues related to the high-load accept fix, as intended by the PR.
87-87: LGTM! Helpful diagnostic logging. Including the accept loop timeout in the diagnostic log aids debugging if timing-related test failures occur.
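The randomized-timeout idea described above can be sketched in a few lines. This is an illustrative standalone snippet with hypothetical names, not the tests' actual code:

```java
import java.util.Random;

// Illustrative sketch: pick a per-run accept-loop timeout in [10, 509] ms so
// repeated test runs exercise different timing windows of the accept loop.
public class RandomAcceptTimeout {

    static long pickMillis(Random rnd) {
        return 10 + rnd.nextInt(500); // nextInt(500) yields 0..499 -> 10..509
    }

    public static void main(String[] args) {
        // Logging the chosen value (as the tests do) makes a timing-dependent
        // failure reproducible from the test output.
        System.out.println("acceptLoopTimeout=" + pickMillis(new Random()) + "ms");
    }
}
```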
core/src/test/java/io/questdb/test/cutlass/pgwire/PGJobContextTest.java (1)
235-241: LGTM! Test infrastructure enhancement supports timing-dependent testing. The randomization of `acceptLoopTimeout` (10-509 ms) helps validate the connection acceptance fix under varying timeout conditions, which aligns with the PR's objective to address high-load connection failures. The logging addition aids in diagnosing timing-related issues during test runs.

ci/templates/hosted-jobs.yml (1)
121-121: Consistency confirmed with test-pipeline.yml. The addition of `-XX:+UseParallelGC` here mirrors the change in ci/test-pipeline.yml, which is good. This template covers a wide range of test configurations across macOS, Windows, and Linux, so the change has broad coverage. Please confirm that the verification from the ci/test-pipeline.yml review passes (checking for consistent MAVEN_OPTS definitions across all CI configs).
core/src/main/java/io/questdb/network/AbstractIODispatcher.java (2)
342-437: LGTM! Well-documented greedy accept loop with drain tracking. The boolean return value correctly distinguishes between a fully drained accept queue (EWOULDBLOCK) and early exit due to timeout or connection limit. This is essential for the edge-triggered epoll fix, as the caller can now detect when connections remain queued and retry on the next iteration to avoid stranding. The comment explaining why `acceptEndTime` uses `clock.getTicks()` rather than the passed `timestamp` is helpful.
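As a rough illustration of the drain-vs-early-exit contract described above (hypothetical names and simplified structure, not the actual `AbstractIODispatcher` code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Sketch of a greedy accept loop bounded by a wall-clock deadline. Re-reading
// the clock each round (rather than reusing a stale loop-entry timestamp)
// keeps the bound accurate when individual accepts are slow.
public class BoundedAcceptLoop {

    // Returns true iff the backlog was fully drained before the deadline.
    static boolean accept(Deque<Integer> backlog, LongSupplier clock, long acceptEndTime) {
        while (!backlog.isEmpty()) {
            if (clock.getAsLong() >= acceptEndTime) {
                return false; // timed out with connections still queued
            }
            backlog.pollFirst(); // stands in for accept(2)
        }
        return true; // drained: kernel will raise a fresh edge for new clients
    }

    public static void main(String[] args) {
        Deque<Integer> backlog = new ArrayDeque<>();
        for (int fd = 0; fd < 3; fd++) {
            backlog.addLast(fd);
        }
        // a clock that never reaches the deadline -> full drain
        System.out.println("drained: " + accept(backlog, () -> 0L, 100L));
    }
}
```

The boolean result is what lets the caller decide whether it may safely block on the next epoll wait or must schedule a retry.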
480-491: LGTM! Boolean return propagates disconnect processing status. The method now correctly returns whether useful work occurred (processing disconnects or closing the listener due to timeout). This enables platform-specific dispatchers to make informed control-flow decisions.
core/src/main/java/io/questdb/network/IODispatcherOsx.java (1)
401-463: LGTM! Properly integrates processDisconnects return value. The method now correctly captures the boolean result from `processDisconnects(timestamp)` and incorporates it into the final return via the OR expression. This ensures that disconnect processing contributes to the "useful work" determination.

core/src/main/java/io/questdb/network/IODispatcherWindows.java (1)
158-167: LGTM! Improved error handling with disconnect status. The method now correctly captures the result from `processDisconnects(timestamp)` and returns it on select failure, rather than unconditionally returning `false`. This ensures that any useful disconnect processing is properly reflected in the return value, even when select fails.

core/src/main/java/io/questdb/network/IODispatcherLinux.java (5)
36-36: LGTM! State flag for tracking incomplete accept operations. The `pendingAccept` flag correctly tracks when `accept()` exits early without fully draining the queue, enabling retry on the next iteration to prevent connection stranding with edge-triggered epoll.
387-387: LGTM! Captures disconnect processing status. Correctly integrates the boolean return from `processDisconnects(timestamp)` into the `useful` flag, aligning with the refactored API.
391-392: Excellent! Key fix for edge-triggered epoll connection stranding. The modified condition `if (n > 0 || pendingAccept)` ensures that accept processing occurs even when epoll returns no new events, preventing connections from being stranded when the previous `accept()` call exited early due to timeout or connection limit. The `acceptProcessed` flag prevents redundant accept attempts within the same iteration.
402-404: LGTM! Correct state tracking for accept operation. The logic correctly sets `pendingAccept = !accept(timestamp)` so that:

- When `accept()` returns `true` (fully drained), `pendingAccept` becomes `false`
- When `accept()` returns `false` (early exit), `pendingAccept` remains `true` for retry

The `acceptProcessed` flag prevents redundant accept calls within the same iteration.
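The interplay of the two flags can be modeled as one event-loop tick. This is a hypothetical simplification (epoll's event count faked as a plain boolean, listening state assumed true), not the actual `IODispatcherLinux` code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of one dispatcher tick: accept on a new epoll event OR a
// leftover pendingAccept, with a post-check so a pending accept is never
// skipped when no event arrives.
public class DispatcherTick {
    private final Deque<Integer> backlog = new ArrayDeque<>();
    private final int maxPerTick;
    private boolean pendingAccept;

    DispatcherTick(int maxPerTick) {
        this.maxPerTick = maxPerTick;
    }

    void enqueue(int n) {
        for (int i = 0; i < n; i++) {
            backlog.addLast(i);
        }
    }

    // Returns true when the backlog was fully drained by this call.
    private boolean accept() {
        for (int i = 0; i < maxPerTick; i++) {
            if (backlog.isEmpty()) {
                return true;
            }
            backlog.pollFirst();
        }
        return backlog.isEmpty();
    }

    // newEvent models "epoll reported activity on the listen socket".
    boolean tick(boolean newEvent) {
        boolean acceptProcessed = false;
        boolean useful = false;
        if (newEvent || pendingAccept) {         // main loop: event or leftover
            pendingAccept = !accept();
            acceptProcessed = true;
            useful = true;
        }
        if (pendingAccept && !acceptProcessed) { // post-check retry
            pendingAccept = !accept();
            useful = true;
        }
        return useful;
    }

    int queued() {
        return backlog.size();
    }
}
```

Feeding 5 connections with `maxPerTick = 2` and a single `tick(true)` leaves 3 queued but `pendingAccept` set, so subsequent `tick(false)` calls keep draining until the backlog is empty.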
413-418: Excellent! Retry mechanism prevents connection stranding across iterations. This post-check correctly retries `accept()` when:

- `pendingAccept` is `true` (connections remain from previous iteration)
- `acceptProcessed` is `false` (accept wasn't already called in the epoll event loop)
- `isListening()` is `true` (server socket is active)

The comment about branch prediction is valuable: since `pendingAccept` is normally `false` in steady state, testing it first makes the branch highly predictable for the CPU.

core/src/test/java/io/questdb/test/TestListener.java (3)
52-54: LGTM! Good defensive null check. The null check correctly handles the case where `ThreadMXBean.getThreadInfo()` returns null for threads that terminated between retrieving thread IDs and fetching thread info.
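For context, a minimal standalone demonstration of the race the null check guards against, using the real `java.lang.management` API (the method and class names here are illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// getThreadInfo(long[]) returns an array that can contain null entries for
// threads that died between getAllThreadIds() and getThreadInfo(), so every
// element must be null-checked before use.
public class SafeThreadDump {

    static int dumpLiveThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.getAllThreadIds();
        int dumped = 0;
        for (ThreadInfo info : bean.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread terminated in the meantime: skip it
            }
            System.out.println(info.getThreadName() + " " + info.getThreadState());
            dumped++;
        }
        return dumped;
    }

    public static void main(String[] args) {
        System.out.println("dumped " + dumpLiveThreads() + " threads");
    }
}
```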
122-125: LGTM! Clearer error message. The `[kind=java]` label helpfully distinguishes the Java thread dumper from the native thread dumper, improving diagnostics when failures occur.
134-142: LGTM! Error handling improves test infrastructure resilience. The try-catch block ensures that failures in the native thread dumper are logged rather than silently terminating the monitoring thread. Catching `Throwable` is appropriate here for maximum resilience in test monitoring infrastructure.
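The resilience pattern the reviewer describes boils down to never letting one failed diagnostic kill the watchdog thread. A generic sketch of that pattern (not the `TestListener` code; names are hypothetical):

```java
// Runs a periodic diagnostic task, swallowing (and counting) any failure so
// one bad round never terminates the monitoring loop. Catching Throwable is
// deliberate: even an Error from a native helper should not stop monitoring.
public class ResilientMonitor {

    static int runRounds(Runnable task, int rounds) {
        int failures = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                task.run();
            } catch (Throwable t) {
                failures++; // in real code: log t and carry on
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Runnable flaky = () -> {
            calls[0]++;
            if (calls[0] % 2 == 0) {
                throw new RuntimeException("native dumper failed");
            }
        };
        // every round runs despite two failures along the way
        System.out.println("failures: " + runRounds(flaky, 4));
    }
}
```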
@bluestreak01 I pushed 2 additional commits as discussed:
[PR Coverage check] 😍 pass: 31 / 40 (77.50%) (file detail)