
Conversation

@Jitterx69
Contributor

This PR follows up on #1025 to fix a regression where large file uploads would hang on node power-off.

Changes:

  1. Data Plane: Enabled keepalives in rio client (TCP 10s, HTTP/2 5s).
  2. Server: Enforced SO_KEEPALIVE on incoming connections.
  3. Build: Fixed macOS/AArch64 compilation by gating profiling code.

Verified with cargo test -p rustfs-rio.
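For reference, a minimal sketch of what change 1 looks like with reqwest's builder API; the structure and exact values in crates/rio/src/http_reader.rs may differ:

```rust
use std::time::Duration;

// Sketch of the data-plane client configuration; the values here mirror
// the PR summary, not necessarily the actual code.
fn build_rio_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Probe the TCP connection every 10s so the OS notices a dead peer.
        .tcp_keepalive(Duration::from_secs(10))
        // Send HTTP/2 PING frames every 5s, even on idle connections.
        .http2_keep_alive_interval(Duration::from_secs(5))
        .http2_keep_alive_while_idle(true)
        // Fail fast when a node is unreachable at connect time.
        .connect_timeout(Duration::from_secs(5))
        .build()
}
```

The HTTP/2 PINGs matter because a half-open TCP connection can otherwise sit idle for minutes before the OS-level keepalive gives up.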

Jitterx69 and others added 8 commits December 6, 2025 16:08
Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
Enables reqwest keepalives in Rio (data stream) and TCP keepalives in Server (accepted sockets) to prevent hangs during large transfers if a peer dies.
Guard profiling logic with `#[cfg(target_os = "linux")]` and provide no-ops for other platforms to prevent compilation failures on macOS.
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a critical regression from Issue #1001 where large file uploads (1GB+) would hang when a node experiences power loss. The fix implements a comprehensive keepalive strategy across both the control plane (gRPC) and data plane (HTTP/streaming), along with fixing a macOS/AArch64 build failure in the profiling module.

Key Changes:

  • Data Plane: Configured TCP and HTTP/2 keepalives in the reqwest client (10s TCP, 5s HTTP/2) to detect dead connections during large file streaming operations
  • Server Side: Enforced SO_KEEPALIVE on incoming TCP connections with aggressive timeouts (10s idle, 5s interval, 3 retries) to forcefully close dead client sockets
  • Cross-Platform Build: Gated Linux-specific profiling code (jemalloc_pprof, pprof) with #[cfg(target_os = "linux")] to resolve macOS compilation failures
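A rough sketch of the server-side enforcement (10s idle, 5s interval, 3 retries), assuming the socket2 crate with its `all` feature on a Unix target; the helper name is hypothetical and the actual code in rustfs/src/server/http.rs may differ:

```rust
use std::time::Duration;

use socket2::{SockRef, TcpKeepalive};
use tokio::net::TcpStream;

// Hypothetical helper: apply SO_KEEPALIVE with aggressive timeouts to an
// accepted socket so a dead client is detected within ~25s worst case.
fn enforce_keepalive(stream: &TcpStream) -> std::io::Result<()> {
    let keepalive = TcpKeepalive::new()
        .with_time(Duration::from_secs(10)) // idle time before the first probe
        .with_interval(Duration::from_secs(5)) // gap between unanswered probes
        .with_retries(3); // probes sent before the socket is dropped
    SockRef::from(stream).set_tcp_keepalive(&keepalive)
}
```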

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `rustfs/src/server/http.rs` | Adds TCP keepalive configuration (10s idle / 5s interval / 3 retries) on incoming connections to detect and close dead client sockets |
| `rustfs/src/profiling.rs` | Wraps all profiling code in a `#[cfg(target_os = "linux")]` module with stub implementations for non-Linux platforms |
| `crates/rio/src/http_reader.rs` | Configures the HTTP client with TCP keepalive (10s), HTTP/2 keepalive (5s interval / 3s timeout), and a 5s connection timeout to prevent upload hangs |
| `docs/cluster_recovery.md` | Documents the keepalive strategy covering the control plane (gRPC), the data plane (streaming), and build stability improvements |
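The profiling fix follows the usual cfg-gating pattern; the module and function names below are illustrative, not the actual rustfs/src/profiling.rs API:

```rust
// Real implementation (jemalloc_pprof, pprof) is compiled only on Linux.
#[cfg(target_os = "linux")]
mod profiling {
    pub fn start_profiling() {
        // ... start pprof / jemalloc profiling here ...
    }
}

// No-op stubs keep the rest of the codebase compiling on macOS and other
// non-Linux targets, including AArch64 macOS.
#[cfg(not(target_os = "linux"))]
mod profiling {
    pub fn start_profiling() {}
}
```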


@weisd merged commit cd6a26b into rustfs:main on Dec 7, 2025
18 checks passed
@loverustfs
Contributor

Hey @weisd @houseme @Jitterx69,

The fix was unsuccessful. See #1035.

  1. During the upload, one server lost power, and the rustfsadmin user was unable to log in.
  2. The upload stopped.
  3. The performance page could not be displayed.

It did not meet expectations.


@Jitterx69
Contributor Author

> The fix was unsuccessful. See #1035. […] It did not meet expectations.

My read: the keepalive fix helps the data plane (file transfers) fail faster than before, but it does not cover everything. Beyond that:

  1. Metadata operations (login, server stats) may be blocking on the dead node because of Raft quorum; this needs priority.
  2. If connections to the dead node don't time out properly at the gRPC/protos level, the cluster hangs.
  3. The console may be making synchronous calls to cluster metadata endpoints, which can stall behind the dead node.
  4. If the rustfsadmin session was pinned to the failed node, login attempts would hang.

Possible fixes:

  1. Add dead-connection eviction.
  2. Implement graceful degradation in the console.
  3. Make IAM sync non-blocking so it does not block auth.
  4. Add connection eviction on error in PeerRestClient (sketched below).
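For item 4, a hypothetical sketch of evict-on-error; the pool shape and names are assumptions, not the actual PeerRestClient API, but the idea is that any failed call drops the cached connection so later requests dial fresh instead of reusing a socket pinned to a dead node:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct PeerClient; // stand-in for the real per-peer client

struct PeerPool {
    // Cached clients keyed by peer address.
    clients: Mutex<HashMap<String, PeerClient>>,
}

impl PeerPool {
    // Drop the cached client; the next request to this peer reconnects.
    fn evict(&self, peer: &str) {
        self.clients.lock().unwrap().remove(peer);
    }

    // Wrap an RPC result: on any error, evict before propagating so no
    // later request blocks on the same dead connection.
    fn check<T, E>(&self, peer: &str, result: Result<T, E>) -> Result<T, E> {
        if result.is_err() {
            self.evict(peer);
        }
        result
    }
}
```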

I will start debugging if you think my instincts are right. @loverustfs

@loverustfs
Contributor

> My read: the keepalive fix helps the data plane (file transfers) fail faster than before […] I will start debugging if you think my instincts are right. @loverustfs

Okay, let's try this solution.
@Jitterx69 Thank you!
