
Conversation

@Jitterx69
Contributor

This PR follows up on #1025 to fix a regression where large file uploads would hang on node power-off.

Changes:

  1. Data Plane: Enabled keepalives in rio client (TCP 10s, HTTP/2 5s).
  2. Server: Enforced SO_KEEPALIVE on incoming connections.
  3. Build: Fixed macOS/AArch64 compilation by gating profiling code.

Verified with cargo test -p rustfs-rio.
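For reference, a minimal sketch of what change 1 looks like with reqwest's builder API; the structure and exact values in crates/rio/src/http_reader.rs may differ:

```rust
use std::time::Duration;

// Sketch of the data-plane client configuration; the values here mirror
// the PR summary, not necessarily the actual code.
fn build_rio_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Probe the TCP connection every 10s so the OS notices a dead peer.
        .tcp_keepalive(Duration::from_secs(10))
        // Send HTTP/2 PING frames every 5s, even on idle connections.
        .http2_keep_alive_interval(Duration::from_secs(5))
        .http2_keep_alive_while_idle(true)
        // Fail fast when a node is unreachable at connect time.
        .connect_timeout(Duration::from_secs(5))
        .build()
}
```

The HTTP/2 PINGs matter because a half-open TCP connection can otherwise sit idle for minutes before the OS-level keepalive gives up.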

Jitterx69 and others added 8 commits December 6, 2025 16:08
Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
Enables reqwest keepalives in Rio (data stream) and TCP keepalives in Server (accepted sockets) to prevent hangs during large transfers if a peer dies.
Guard profiling logic with `#[cfg(target_os = "linux")]` and provide no-ops for other platforms to prevent compilation failures on macOS.
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a critical regression from Issue #1001 where large file uploads (1GB+) would hang when a node experiences power loss. The fix implements a comprehensive keepalive strategy across both the control plane (gRPC) and data plane (HTTP/streaming), along with fixing a macOS/AArch64 build failure in the profiling module.

Key Changes:

  • Data Plane: Configured TCP and HTTP/2 keepalives in the reqwest client (10s TCP, 5s HTTP/2) to detect dead connections during large file streaming operations
  • Server Side: Enforced SO_KEEPALIVE on incoming TCP connections with aggressive timeouts (10s idle, 5s interval, 3 retries) to forcefully close dead client sockets
  • Cross-Platform Build: Gated Linux-specific profiling code (jemalloc_pprof, pprof) with #[cfg(target_os = "linux")] to resolve macOS compilation failures
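A rough sketch of the server-side enforcement (10s idle, 5s interval, 3 retries), assuming the socket2 crate with its `all` feature on a Unix target; the helper name is hypothetical and the actual code in rustfs/src/server/http.rs may differ:

```rust
use std::time::Duration;

use socket2::{SockRef, TcpKeepalive};
use tokio::net::TcpStream;

// Hypothetical helper: apply SO_KEEPALIVE with aggressive timeouts to an
// accepted socket so a dead client is detected within ~25s worst case.
fn enforce_keepalive(stream: &TcpStream) -> std::io::Result<()> {
    let keepalive = TcpKeepalive::new()
        .with_time(Duration::from_secs(10)) // idle time before the first probe
        .with_interval(Duration::from_secs(5)) // gap between unanswered probes
        .with_retries(3); // probes sent before the socket is dropped
    SockRef::from(stream).set_tcp_keepalive(&keepalive)
}
```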

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `rustfs/src/server/http.rs` | Adds TCP keepalive configuration (10s idle / 5s interval / 3 retries) on incoming connections to detect and close dead client sockets |
| `rustfs/src/profiling.rs` | Wraps all profiling code in a `#[cfg(target_os = "linux")]` module with stub implementations for non-Linux platforms |
| `crates/rio/src/http_reader.rs` | Configures the HTTP client with TCP keepalive (10s), HTTP/2 keepalive (5s interval / 3s timeout), and a 5s connection timeout to prevent upload hangs |
| `docs/cluster_recovery.md` | Documents the keepalive strategy covering the control plane (gRPC), the data plane (streaming), and build stability improvements |
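The profiling fix follows the usual cfg-gating pattern; the module and function names below are illustrative, not the actual rustfs/src/profiling.rs API:

```rust
// Real implementation (jemalloc_pprof, pprof) is compiled only on Linux.
#[cfg(target_os = "linux")]
mod profiling {
    pub fn start_profiling() {
        // ... start pprof / jemalloc profiling here ...
    }
}

// No-op stubs keep the rest of the codebase compiling on macOS and other
// non-Linux targets, including AArch64 macOS.
#[cfg(not(target_os = "linux"))]
mod profiling {
    pub fn start_profiling() {}
}
```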


@weisd merged commit cd6a26b into rustfs:main on Dec 7, 2025
18 checks passed
@loverustfs
Contributor

Hey @weisd @houseme @Jitterx69,

The fix was unsuccessful. See #1035.

  1. During the upload, one server lost power, and the rustfsadmin user was unable to log in.
  2. The upload stopped.
  3. The performance page could not be displayed.

It did not meet expectations.


@Jitterx69
Contributor Author

> The fix was unsuccessful. See #1035. […] It did not meet expectations.

My read: the keepalive fix helps the data plane (file transfers) fail faster than before, but it does not cover everything. Beyond that:

  1. Metadata operations (login, server stats) may be blocking on the dead node because of Raft quorum; this needs priority.
  2. If connections to the dead node don't time out properly at the gRPC/protos level, the cluster hangs.
  3. The console may be making synchronous calls to cluster metadata endpoints, which can stall behind the dead node.
  4. If the rustfsadmin session was pinned to the failed node, login attempts would hang.

Possible fixes:

  1. Add dead-connection eviction.
  2. Implement graceful degradation in the console.
  3. Make IAM sync non-blocking so it does not block auth.
  4. Add connection eviction on error in PeerRestClient (sketched below).
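For item 4, a hypothetical sketch of evict-on-error; the pool shape and names are assumptions, not the actual PeerRestClient API, but the idea is that any failed call drops the cached connection so later requests dial fresh instead of reusing a socket pinned to a dead node:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct PeerClient; // stand-in for the real per-peer client

struct PeerPool {
    // Cached clients keyed by peer address.
    clients: Mutex<HashMap<String, PeerClient>>,
}

impl PeerPool {
    // Drop the cached client; the next request to this peer reconnects.
    fn evict(&self, peer: &str) {
        self.clients.lock().unwrap().remove(peer);
    }

    // Wrap an RPC result: on any error, evict before propagating so no
    // later request blocks on the same dead connection.
    fn check<T, E>(&self, peer: &str, result: Result<T, E>) -> Result<T, E> {
        if result.is_err() {
            self.evict(peer);
        }
        result
    }
}
```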

I will start debugging if you think my instincts are right. @loverustfs

@loverustfs
Contributor

> My read: the keepalive fix helps the data plane (file transfers) fail faster than before […] I will start debugging if you think my instincts are right. @loverustfs

Okay, let's try this solution.
@Jitterx69 Thank you!
