Skip to content

Skip reconnect for unrecoverable transport disconnects#10096

Merged
kevinyang372 merged 1 commit intomasterfrom
kevin/only-attempt-reconnect-on-reconnectable-errors
May 4, 2026
Merged

Skip reconnect for unrecoverable transport disconnects#10096
kevinyang372 merged 1 commit intomasterfrom
kevin/only-attempt-reconnect-on-reconnectable-errors

Conversation

@kevinyang372
Copy link
Copy Markdown
Member

@kevinyang372 kevinyang372 commented May 4, 2026

Description

We are seeing ~850 Sentry errors in 2 days (WARP-CLIENT-BETA-STABLE-7JNH) from doomed reconnect attempts after SSH disconnects caused by system sleep / network loss.

When the Mac sleeps, the SSH TCP connection dies but the ControlMaster process stays alive locally (no keepalives configured). The remote-server-proxy reader task detects EOF and triggers reconnect, but reconnecting through the same ControlMaster is futile since its TCP connection is dead. Both attempts fail with "Response channel closed before receiving a reply" and the errors get reported to Sentry.

Root cause: the reconnect flow had no way to distinguish a recoverable disconnect (remote server process crashed, SSH connection still alive) from an unrecoverable one (SSH connection itself is dead).

Fix: Add is_reconnectable(exit_status) to the RemoteTransport trait so each transport can decide whether a reconnect is viable. SshTransport returns false when the exit code is 255 (SSH connection-level error) or the process was signal-killed, indicating the ControlMaster connection is dead. mark_session_disconnected consults the transport before entering the reconnect loop.

This is a required trait method (no default impl) so future transports must explicitly consider reconnectability.

Changes

  • RemoteTransport trait (transport.rs): Added required is_reconnectable method
  • SshTransport (ssh_transport.rs): Implements is_reconnectable — returns false for exit code 255 / signal kill
  • RemoteServerManager (manager.rs): mark_session_disconnected calls transport.is_reconnectable() before attempting reconnect; extracted finalize_disconnect helper to deduplicate the disconnect-and-emit pattern

Linked Issue

Testing

  • cargo clippy passes on both remote_server and warp crates
  • All 48 remote_server tests pass

Agent Mode

  • Warp Agent Mode - This PR was created via Warp's AI Agent Mode

Conversation

@cla-bot cla-bot Bot added the cla-signed label May 4, 2026
@oz-for-oss
Copy link
Copy Markdown
Contributor

oz-for-oss Bot commented May 4, 2026

@kevinyang372

I'm starting a first review of this pull request.

You can view the conversation on Warp.

I completed the review and no human review was requested for this pull request.

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

Copy link
Copy Markdown
Member Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@kevinyang372 kevinyang372 changed the title Only reconnect on reconnect-able errors Skip reconnect for unrecoverable transport disconnects May 4, 2026
@kevinyang372 kevinyang372 requested a review from moirahuang May 4, 2026 22:25
@kevinyang372 kevinyang372 merged commit 39ff0d2 into master May 4, 2026
36 checks passed
@kevinyang372 kevinyang372 deleted the kevin/only-attempt-reconnect-on-reconnectable-errors branch May 4, 2026 22:35
Copy link
Copy Markdown
Contributor

@oz-for-oss oz-for-oss Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR adds a transport-level reconnectability check and skips the reconnect loop when SSH reports certain exit statuses.

Concerns

  • SSH exit status 255 is treated as always non-reconnectable, but OpenSSH also returns 255 when a remote command is killed by a signal or terminates without sending an exit-status. Those cases can happen when the remote proxy or daemon dies while the ControlMaster is still usable, so this change can skip the intended reconnect path for recoverable failures.

Verdict

Found: 0 critical, 1 important, 0 suggestions

Request changes

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

/// ControlMaster is futile.
fn is_reconnectable(&self, exit_status: Option<&RemoteServerExitStatus>) -> bool {
match exit_status {
Some(s) => s.code != Some(255) && !s.signal_killed,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ [IMPORTANT] OpenSSH returns 255 not only for ControlMaster/socket failures, but also when the remote command dies by signal or no exit status is received; treating every 255 as non-reconnectable skips reconnect for recoverable proxy/daemon crashes.

wolverine2k pushed a commit to wolverine2k/warp that referenced this pull request May 5, 2026
)

## Description

We are seeing ~850 Sentry errors in 2 days
([WARP-CLIENT-BETA-STABLE-7JNH](https://warpdotdev.sentry.io/issues/7456268110/))
from doomed reconnect attempts after SSH disconnects caused by system
sleep / network loss.

When the Mac sleeps, the SSH TCP connection dies but the ControlMaster
process stays alive locally (no keepalives configured). The
`remote-server-proxy` reader task detects EOF and triggers reconnect,
but reconnecting through the same ControlMaster is futile since its TCP
connection is dead. Both attempts fail with "Response channel closed
before receiving a reply" and the errors get reported to Sentry.

**Root cause**: the reconnect flow had no way to distinguish a
recoverable disconnect (remote server process crashed, SSH connection
still alive) from an unrecoverable one (SSH connection itself is dead).

**Fix**: Add `is_reconnectable(exit_status)` to the `RemoteTransport`
trait so each transport can decide whether a reconnect is viable.
`SshTransport` returns `false` when the exit code is 255 (SSH
connection-level error) or the process was signal-killed, indicating the
ControlMaster connection is dead. `mark_session_disconnected` consults
the transport before entering the reconnect loop.

This is a required trait method (no default impl) so future transports
must explicitly consider reconnectability.

### Changes
- **`RemoteTransport` trait** (`transport.rs`): Added required
`is_reconnectable` method
- **`SshTransport`** (`ssh_transport.rs`): Implements `is_reconnectable`
— returns `false` for exit code 255 / signal kill
- **`RemoteServerManager`** (`manager.rs`): `mark_session_disconnected`
calls `transport.is_reconnectable()` before attempting reconnect;
extracted `finalize_disconnect` helper to deduplicate the
disconnect-and-emit pattern

## Linked Issue
- Sentry:
[WARP-CLIENT-BETA-STABLE-7JNH](https://warpdotdev.sentry.io/issues/7456268110/)

## Testing
- `cargo clippy` passes on both `remote_server` and `warp` crates
- All 48 `remote_server` tests pass

## Agent Mode
- [x] Warp Agent Mode - This PR was created via Warp's AI Agent Mode


[Conversation](https://staging.warp.dev/conversation/ead2a14e-5ddd-4fe6-9a04-5ce7f48ec84f)

<!--
CHANGELOG-BUG-FIX: Fixed unnecessary reconnect attempts for remote SSH
sessions after system sleep, reducing error noise
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants