fix(net): resolve 1GB upload hang and macos build (Issue #1001 regression) #1035
Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
Enables reqwest keepalives in Rio (data stream) and TCP keepalives in Server (accepted sockets) to prevent hangs during large transfers if a peer dies.
Guards profiling logic with `#[cfg(target_os = "linux")]` and provides no-ops for other platforms to prevent compilation failures on macOS.
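The platform gating described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the module name `profiling_sketch` and the functions `init` and `is_supported` are hypothetical, and the real Linux branch would call into the profiling crates rather than return a constant.

```rust
// Hypothetical sketch of cfg-gated profiling with no-op stubs.
// Names are illustrative and are not taken from rustfs/src/profiling.rs.

#[cfg(target_os = "linux")]
pub mod profiling_sketch {
    /// On Linux the real module would start the profilers here.
    /// This sketch only reports that profiling is available.
    pub fn init() -> bool {
        true
    }

    pub fn is_supported() -> bool {
        true
    }
}

#[cfg(not(target_os = "linux"))]
pub mod profiling_sketch {
    /// No-op stub: lets callers compile on macOS and other platforms
    /// without pulling in Linux-only dependencies.
    pub fn init() -> bool {
        false
    }

    pub fn is_supported() -> bool {
        false
    }
}
```

Because both variants expose the same function signatures, call sites such as `profiling_sketch::init()` need no `cfg` attributes of their own.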
Pull request overview
This PR addresses a critical regression from Issue #1001 where large file uploads (1GB+) would hang when a node experiences power loss. The fix implements a comprehensive keepalive strategy across both the control plane (gRPC) and data plane (HTTP/streaming), along with fixing a macOS/AArch64 build failure in the profiling module.
Key Changes:
- Data Plane: Configured TCP and HTTP/2 keepalives in the `reqwest` client (10s TCP, 5s HTTP/2) to detect dead connections during large file streaming operations
- Server Side: Enforced `SO_KEEPALIVE` on incoming TCP connections with aggressive timeouts (10s idle, 5s interval, 3 retries) to forcefully close dead client sockets
- Cross-Platform Build: Gated Linux-specific profiling code (`jemalloc_pprof`, `pprof`) with `#[cfg(target_os = "linux")]` to resolve macOS compilation failures
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| rustfs/src/server/http.rs | Adds TCP keepalive configuration (10s/5s/3) on incoming connections to detect and close dead client sockets |
| rustfs/src/profiling.rs | Wraps all profiling code in #[cfg(target_os = "linux")] module with stub implementations for non-Linux platforms |
| crates/rio/src/http_reader.rs | Configures HTTP client with TCP keepalive (10s), HTTP/2 keepalive (5s/3s), and connection timeout (5s) to prevent upload hangs |
| docs/cluster_recovery.md | Documents the keepalive strategy covering control plane (gRPC), data plane (streaming), and build stability improvements |
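To make the timeout figures in the table concrete, here is a small arithmetic sketch of worst-case dead-peer detection time. It assumes the usual Linux-style TCP keepalive semantics (idle time, then one probe per interval, giving up after the retry count) and assumes that "5s/3s" for HTTP/2 means a ping interval plus a ping timeout. Both function names are hypothetical, and the actual detection time depends on the OS and on library behaviour.

```rust
use std::time::Duration;

/// Rough worst-case time for TCP keepalive to declare a silent peer dead:
/// the idle period, followed by `retries` unanswered probes sent one
/// `interval` apart. Assumes Linux-style semantics.
pub fn tcp_keepalive_worst_case(idle: Duration, interval: Duration, retries: u32) -> Duration {
    idle + interval * retries
}

/// Rough worst-case time for an HTTP/2 keepalive to notice a dead peer:
/// one ping interval plus the time allowed for the ping acknowledgement.
pub fn http2_keepalive_worst_case(ping_interval: Duration, ping_timeout: Duration) -> Duration {
    ping_interval + ping_timeout
}
```

Under these assumptions, the server-side setting (10s idle, 5s interval, 3 retries) gives roughly 25 seconds, and a 5s/3s HTTP/2 keepalive gives roughly 8 seconds, in line with the "~8 seconds" figure quoted for dead-node detection.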
Hey @weisd @houseme @Jitterx69, the repair was unsuccessful. #1035
It did not meet expectations.
I think the keepalive fix helps the data plane (file transfers) fail faster than normal. Other than that:
It could be possible to fix via:
I will start debugging if you think my instincts are sound. @loverustfs
Okay, let's try this solution.


This PR follows up on #1025 to fix a regression where large file uploads would hang on node power-off.
Changes:
- `rio` client (TCP 10s, HTTP/2 5s).
- `SO_KEEPALIVE` on incoming connections.

Verified with `cargo test -p rustfs-rio`.