fix(transport): add SSE keepalive and configure HTTP server timeouts for proxy chain #139

@polaz

Description

Problem

SSE connections through the proxy chain (Cloudflare → Envoy → Node.js) are killed after ~125 seconds of idle time because no keepalive mechanism exists.

Evidence from Production Logs

Envoy access logs show every SSE GET request terminating at almost exactly 125s with the DR (DownstreamRemoteReset) response flag and 0 response body bytes:

[2026-01-23T10:23:54] "GET / HTTP/2" 200 DR 0 0 125032 1 "claude-code/2.1.17"
[2026-01-23T10:26:01] "GET / HTTP/2" 200 DR 0 0 125014 2 "claude-code/2.1.17"
[2026-01-23T10:28:07] "GET / HTTP/2" 200 DR 0 0 125030 2 "claude-code/2.1.17"
[2026-01-23T10:30:13] "GET / HTTP/2" 200 DR 0 0 125009 1 "claude-code/2.1.17"
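
The tight clustering of these durations points to a fixed timer rather than random network failures. A minimal sketch of that check follows. It assumes the third numeric field after the response flag is the request duration in milliseconds (consistent with the ~125s figure above); the issue does not show the actual access-log format, and `parseAccessLog` is a hypothetical helper.

```typescript
// The four log lines quoted above, verbatim.
const logLines: string[] = [
  '[2026-01-23T10:23:54] "GET / HTTP/2" 200 DR 0 0 125032 1 "claude-code/2.1.17"',
  '[2026-01-23T10:26:01] "GET / HTTP/2" 200 DR 0 0 125014 2 "claude-code/2.1.17"',
  '[2026-01-23T10:28:07] "GET / HTTP/2" 200 DR 0 0 125030 2 "claude-code/2.1.17"',
  '[2026-01-23T10:30:13] "GET / HTTP/2" 200 DR 0 0 125009 1 "claude-code/2.1.17"',
];

interface AccessLogEntry {
  timestamp: string;
  status: number;
  flags: string;
  durationMs: number;
}

// ASSUMED layout: [timestamp] "request" status flags <n> <n> duration_ms ...
const pattern = /^\[([^\]]+)\] "[^"]*" (\d{3}) (\S+) (\d+) (\d+) (\d+)/;

function parseAccessLog(line: string): AccessLogEntry | null {
  const m = pattern.exec(line);
  if (m === null) return null;
  return {
    timestamp: m[1],
    status: Number(m[2]),
    flags: m[3],
    durationMs: Number(m[6]),
  };
}

const entries = logLines
  .map(parseAccessLog)
  .filter((e): e is AccessLogEntry => e !== null);

const durations = entries.map((e) => e.durationMs);
// Difference between the longest and shortest connection lifetime.
const spreadMs = Math.max(...durations) - Math.min(...durations);
console.log(durations, spreadMs);
```

For the four lines above, the lifetimes differ by only 23 ms, which is what a fixed idle timeout somewhere in the chain would produce.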

Proxy Chain Configuration (Verified)

Client → Cloudflare (idle timeout ~100-125s) → Envoy (lb1) → Node.js (192.168.95.53:3333)

Envoy config is correct (timeouts are disabled or effectively unlimited for this route):

  • Route: timeout: 0s, idle_timeout: 0s
  • HCM: stream_idle_timeout: 86400s
  • Cluster: TCP keepalive configured (300s/30s/5 probes)

Cloudflare is the bottleneck: it kills connections that carry no data for ~100-125s.

Sub-Problems

1. No SSE Heartbeat/Keepalive

The MCP SDK's StreamableHTTPServerTransport has no built-in ping/heartbeat mechanism. During long tool calls (up to 47s with retries in enhancedFetch), the SSE stream is completely idle.

SSE spec supports comment-based keepalives:

: ping\n\n

These are ignored by clients but keep the connection alive through proxies.
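
To illustrate why comment lines are safe to send, here is a minimal sketch of how a client-side parser treats them. `parseSseEvents` is a hypothetical, simplified helper and not part of the MCP SDK; it handles only `data:` fields and comments and assumes LF line endings.

```typescript
// Hypothetical, simplified SSE parser: handles only `data:` fields and comments.
// Real clients implement the full event-stream format.
function parseSseEvents(raw: string): string[] {
  const events: string[] = [];
  // Events are separated by a blank line.
  for (const block of raw.split("\n\n")) {
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith(":")) continue; // comment line: ignored
      if (line.startsWith("data:")) {
        // Strip the field name and one optional leading space.
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
    }
    // A block with no data lines (e.g. only a comment) dispatches nothing.
    if (dataLines.length > 0) events.push(dataLines.join("\n"));
  }
  return events;
}

// Two keepalive comments around one real event.
const received = parseSseEvents(': ping\n\ndata: {"ok":true}\n\n: ping\n\n');
console.log(received);
```

Only the `data:` event reaches the application; the two `: ping` blocks produce bytes on the wire, which is all a proxy's idle timer needs to see.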

2. Node.js HTTP Server Default Timeouts

No explicit timeout configuration on the HTTP server:

  • keepAliveTimeout defaults to 5000ms (Node.js 24)
  • headersTimeout defaults to 60000ms

When more than 5 seconds pass between tool calls on the same HTTP/1.1 connection, Node.js closes the idle socket, and the next request a proxy sends on it fails with a connection reset.

Proposed Solution

SSE Heartbeat (every 30s)

After the SSE stream is established, send periodic keepalive comments:

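// In server.ts, after the SSE connection is established.
// Assumes `res` is the Node.js http.ServerResponse carrying the SSE stream.
const heartbeatInterval = setInterval(() => {
  // Stop once the response can no longer be written to
  if (res.writableEnded || res.destroyed) {
    clearInterval(heartbeatInterval);
    return;
  }
  res.write(": ping\n\n"); // SSE comment line, ignored by clients
}, 30000); // Every 30 seconds, well under Cloudflare's ~100s idle limit

// Clean up when the client disconnects or the stream ends
res.on('close', () => clearInterval(heartbeatInterval));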
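
For the integration test in the checklist, it may help to package the heartbeat as a helper that can be exercised without a real socket. The sketch below is one possible shape, not the project's implementation: `SseSink`, `sendHeartbeat`, and `startSseHeartbeat` are hypothetical names, and the interface lists only the members the heartbeat touches (Node's `http.ServerResponse` is expected to satisfy it).

```typescript
// Structural type with only the members the heartbeat needs.
// Node's http.ServerResponse is expected to satisfy it.
interface SseSink {
  write(chunk: string): boolean;
  readonly writableEnded: boolean;
  readonly destroyed: boolean;
}

// Writes one SSE comment. Returns false when the sink is no longer usable.
function sendHeartbeat(sink: SseSink): boolean {
  if (sink.writableEnded || sink.destroyed) return false;
  try {
    sink.write(": ping\n\n");
    return true;
  } catch {
    return false;
  }
}

// Starts the periodic heartbeat and returns a function that stops it.
function startSseHeartbeat(sink: SseSink, intervalMs: number = 30000): () => void {
  const timer = setInterval(() => {
    // Stop by itself once the sink reports it is closed or a write fails.
    if (!sendHeartbeat(sink)) clearInterval(timer);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A caller would keep the returned function and register it as the `close` listener on the response, so the timer never outlives the connection.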

HTTP Server Timeouts

import * as http from "node:http";

// `app` is assumed to be the existing request handler in server.ts
const httpServer = http.createServer(app);
httpServer.keepAliveTimeout = 620000; // 620s, above Cloudflare's max (600s Enterprise)
httpServer.headersTimeout = 625000;   // Must be > keepAliveTimeout
httpServer.timeout = 0;               // No socket timeout for streaming
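
The three values are related, so one option is to derive them in a single place instead of hard-coding each. The sketch below is an assumption about how that could look: `applyProxyTimeouts` and `TimeoutConfigurable` are hypothetical names, and the 5s margin simply reproduces the 620000/625000 pair above.

```typescript
// Only the properties this helper sets; a Node http.Server has all three.
interface TimeoutConfigurable {
  keepAliveTimeout: number;
  headersTimeout: number;
  timeout: number;
}

// Hypothetical helper: keeps headersTimeout strictly above keepAliveTimeout.
function applyProxyTimeouts(
  server: TimeoutConfigurable,
  keepAliveMs: number = 620000,
  headersMarginMs: number = 5000,
): void {
  if (!Number.isFinite(keepAliveMs) || keepAliveMs <= 0) {
    throw new RangeError("keepAliveMs must be a positive number");
  }
  if (!Number.isFinite(headersMarginMs) || headersMarginMs <= 0) {
    throw new RangeError("headersMarginMs must be a positive number");
  }
  server.keepAliveTimeout = keepAliveMs;
  server.headersTimeout = keepAliveMs + headersMarginMs;
  server.timeout = 0; // no socket inactivity timeout, needed for long-lived SSE
}
```

Called as `applyProxyTimeouts(httpServer)`, this yields the same settings as the snippet above while making the "headersTimeout > keepAliveTimeout" rule impossible to break by editing one number.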

Fix Checklist

  • Add SSE keepalive ping (: ping\n\n every 30s) to Streamable HTTP transport
  • Configure keepAliveTimeout on HTTP server (620s for proxy compatibility)
  • Configure headersTimeout on HTTP server (> keepAliveTimeout)
  • Set server.timeout = 0 for SSE streaming support
  • Make heartbeat interval configurable via env var (GITLAB_SSE_HEARTBEAT_MS)
  • Add integration test: verify SSE stream survives > 125s with heartbeat
  • Document proxy chain timeout requirements
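
For the configurable interval, a minimal sketch of reading `GITLAB_SSE_HEARTBEAT_MS` is shown here. The fallback and clamping behavior are assumptions the issue does not specify: invalid or non-positive values fall back to 30s, and values are capped at a hypothetical 90s ceiling so a misconfiguration cannot exceed Cloudflare's ~100s idle limit.

```typescript
const DEFAULT_HEARTBEAT_MS = 30000;
// ASSUMED ceiling, chosen to stay below Cloudflare's ~100s idle limit.
const MAX_HEARTBEAT_MS = 90000;

// Hypothetical helper: turns the raw env value into a usable interval.
function resolveHeartbeatMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_HEARTBEAT_MS;
  const parsed = Number(raw);
  // Reject NaN, fractions, zero, and negatives.
  if (!Number.isInteger(parsed) || parsed <= 0) return DEFAULT_HEARTBEAT_MS;
  return Math.min(parsed, MAX_HEARTBEAT_MS);
}

console.log(resolveHeartbeatMs("15000"));
```

In `server.ts` this would be called once at startup with `process.env.GITLAB_SSE_HEARTBEAT_MS`, and the result passed to whatever sets up the heartbeat timer.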

Impact

Serious. All SSE connections die after ~2 minutes, whether or not the transport routing fix (#138) is in place. This can be developed in parallel with #138.

Files to Modify

  • src/server.ts — HTTP server timeout configuration + SSE heartbeat setup

Relationship

Independent of #138 (transport routing). Can be developed and tested in parallel. Both fixes are required for stable production operation.
