Skip reconnect for unrecoverable transport disconnects#10096
Skip reconnect for unrecoverable transport disconnects#10096kevinyang372 merged 1 commit intomasterfrom
Conversation
|
I'm starting a first review of this pull request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Comment Powered by Oz |
There was a problem hiding this comment.
Overview
This PR adds a transport-level reconnectability check and skips the reconnect loop when SSH reports certain exit statuses.
Concerns
- SSH exit status 255 is treated as always non-reconnectable, but OpenSSH also returns 255 when a remote command is killed by a signal or terminates without sending an exit-status. Those cases can happen when the remote proxy or daemon dies while the ControlMaster is still usable, so this change can skip the intended reconnect path for recoverable failures.
Verdict
Found: 0 critical, 1 important, 0 suggestions
Request changes
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
| /// ControlMaster is futile. | ||
| fn is_reconnectable(&self, exit_status: Option<&RemoteServerExitStatus>) -> bool { | ||
| match exit_status { | ||
| Some(s) => s.code != Some(255) && !s.signal_killed, |
There was a problem hiding this comment.
) ## Description We are seeing ~850 Sentry errors in 2 days ([WARP-CLIENT-BETA-STABLE-7JNH](https://warpdotdev.sentry.io/issues/7456268110/)) from doomed reconnect attempts after SSH disconnects caused by system sleep / network loss. When the Mac sleeps, the SSH TCP connection dies but the ControlMaster process stays alive locally (no keepalives configured). The `remote-server-proxy` reader task detects EOF and triggers reconnect, but reconnecting through the same ControlMaster is futile since its TCP connection is dead. Both attempts fail with "Response channel closed before receiving a reply" and the errors get reported to Sentry. **Root cause**: the reconnect flow had no way to distinguish a recoverable disconnect (remote server process crashed, SSH connection still alive) from an unrecoverable one (SSH connection itself is dead). **Fix**: Add `is_reconnectable(exit_status)` to the `RemoteTransport` trait so each transport can decide whether a reconnect is viable. `SshTransport` returns `false` when the exit code is 255 (SSH connection-level error) or the process was signal-killed, indicating the ControlMaster connection is dead. `mark_session_disconnected` consults the transport before entering the reconnect loop. This is a required trait method (no default impl) so future transports must explicitly consider reconnectability. ### Changes - **`RemoteTransport` trait** (`transport.rs`): Added required `is_reconnectable` method - **`SshTransport`** (`ssh_transport.rs`): Implements `is_reconnectable` — returns `false` for exit code 255 / signal kill - **`RemoteServerManager`** (`manager.rs`): `mark_session_disconnected` calls `transport.is_reconnectable()` before attempting reconnect; extracted `finalize_disconnect` helper to deduplicate the disconnect-and-emit pattern ## Linked Issue - Sentry: [WARP-CLIENT-BETA-STABLE-7JNH](https://warpdotdev.sentry.io/issues/7456268110/) ## Testing - `cargo clippy` passes on both `remote_server` and `warp` crates - All 48 `remote_server` tests pass ## Agent Mode - [x] Warp Agent Mode - This PR was created via Warp's AI Agent Mode [Conversation](https://staging.warp.dev/conversation/ead2a14e-5ddd-4fe6-9a04-5ce7f48ec84f) <!-- CHANGELOG-BUG-FIX: Fixed unnecessary reconnect attempts for remote SSH sessions after system sleep, reducing error noise -->

Description
We are seeing ~850 Sentry errors in 2 days (WARP-CLIENT-BETA-STABLE-7JNH) from doomed reconnect attempts after SSH disconnects caused by system sleep / network loss.
When the Mac sleeps, the SSH TCP connection dies but the ControlMaster process stays alive locally (no keepalives configured). The
remote-server-proxyreader task detects EOF and triggers reconnect, but reconnecting through the same ControlMaster is futile since its TCP connection is dead. Both attempts fail with "Response channel closed before receiving a reply" and the errors get reported to Sentry.Root cause: the reconnect flow had no way to distinguish a recoverable disconnect (remote server process crashed, SSH connection still alive) from an unrecoverable one (SSH connection itself is dead).
Fix: Add
is_reconnectable(exit_status)to theRemoteTransporttrait so each transport can decide whether a reconnect is viable.SshTransportreturnsfalsewhen the exit code is 255 (SSH connection-level error) or the process was signal-killed, indicating the ControlMaster connection is dead.mark_session_disconnectedconsults the transport before entering the reconnect loop.This is a required trait method (no default impl) so future transports must explicitly consider reconnectability.
Changes
RemoteTransporttrait (transport.rs): Added requiredis_reconnectablemethodSshTransport(ssh_transport.rs): Implementsis_reconnectable— returnsfalsefor exit code 255 / signal killRemoteServerManager(manager.rs):mark_session_disconnectedcallstransport.is_reconnectable()before attempting reconnect; extractedfinalize_disconnecthelper to deduplicate the disconnect-and-emit patternLinked Issue
Testing
cargo clippypasses on bothremote_serverandwarpcratesremote_servertests passAgent Mode
Conversation