fix(transport): add SSE keepalive and configure HTTP server timeouts for proxy chain #139
Description
Problem
SSE connections through the proxy chain (Cloudflare → Envoy → Node.js) are killed after ~125 seconds of idle time because no keepalive mechanism exists.
Evidence from Production Logs
Envoy access logs show all SSE GET requests dying at ~125s (125009-125032 ms in the samples below) with the DR (Downstream Reset) flag and 0 response body bytes:
[2026-01-23T10:23:54] "GET / HTTP/2" 200 DR 0 0 125032 1 "claude-code/2.1.17"
[2026-01-23T10:26:01] "GET / HTTP/2" 200 DR 0 0 125014 2 "claude-code/2.1.17"
[2026-01-23T10:28:07] "GET / HTTP/2" 200 DR 0 0 125030 2 "claude-code/2.1.17"
[2026-01-23T10:30:13] "GET / HTTP/2" 200 DR 0 0 125009 1 "claude-code/2.1.17"
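To make the pattern in these lines easier to check across a larger log, the following is a minimal parsing sketch. The field order (status, response flags, bytes received, bytes sent, duration in ms, upstream service time, user agent) is an assumption inferred from the sample lines, not taken from the actual Envoy access log format string, and the names parseAccessLogLine and isIdleKill are hypothetical.

```typescript
// Assumed field order, inferred from the sample lines in this issue:
// [timestamp] "METHOD PATH PROTOCOL" status flags bytesReceived bytesSent durationMs upstreamMs "user-agent"
interface AccessLogEntry {
  timestamp: string;
  method: string;
  path: string;
  protocol: string;
  status: number;
  flags: string;
  bytesReceived: number;
  bytesSent: number;
  durationMs: number;
  userAgent: string;
}

const LINE_PATTERN =
  /^\[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) (\d+) (\d+) (\d+) (\S+) "([^"]*)"$/;

// Returns null for lines that do not match the assumed layout.
function parseAccessLogLine(line: string): AccessLogEntry | null {
  const m = LINE_PATTERN.exec(line.trim());
  if (m === null) return null;
  return {
    timestamp: m[1],
    method: m[2],
    path: m[3],
    protocol: m[4],
    status: Number(m[5]),
    flags: m[6],
    bytesReceived: Number(m[7]),
    bytesSent: Number(m[8]),
    durationMs: Number(m[9]),
    userAgent: m[11],
  };
}

// Hypothetical heuristic: a downstream reset with no body bytes after a long idle period.
function isIdleKill(entry: AccessLogEntry, minDurationMs = 100_000): boolean {
  return entry.flags === "DR" && entry.bytesSent === 0 && entry.durationMs >= minDurationMs;
}
```

Running isIdleKill over every parsed GET line would show whether any SSE stream ever outlives the idle window.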
Proxy Chain Configuration (Verified)
Client → Cloudflare (idle timeout ~100-125s) → Envoy (lb1) → Node.js (192.168.95.53:3333)
Envoy config is correct (no timeouts for this route):
- Route: timeout: 0s, idle_timeout: 0s
- HCM: stream_idle_timeout: 86400s
- Cluster: TCP keepalive configured (300s/30s/5 probes)
Cloudflare is the bottleneck — kills connections with no data after ~100-125s.
Sub-Problems
1. No SSE Heartbeat/Keepalive
The MCP SDK's StreamableHTTPServerTransport has no built-in ping/heartbeat mechanism. During long tool calls (up to 47s with retries in enhancedFetch), the SSE stream is completely idle.
SSE spec supports comment-based keepalives:
: ping\n\n
These are ignored by clients but keep the connection alive through proxies.
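As an illustration of why comment lines are safe to send, here is a minimal sketch of SSE stream parsing that follows the spec's rules for comments and blank lines. It is not the MCP SDK's parser; parseSse is a hypothetical name, and the id and retry fields are deliberately skipped to keep the example short.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Parses a complete SSE text payload into dispatched events.
// Lines starting with ":" are comments and never produce an event.
function parseSse(payload: string): SseEvent[] {
  const events: SseEvent[] = [];
  let eventType = "";
  let dataLines: string[] = [];

  for (const line of payload.split(/\r\n|\n|\r/)) {
    if (line === "") {
      // A blank line dispatches the pending event, but only if it carried data.
      if (dataLines.length > 0) {
        events.push({ event: eventType || "message", data: dataLines.join("\n") });
      }
      eventType = "";
      dataLines = [];
      continue;
    }
    if (line.startsWith(":")) continue; // comment line, e.g. ": ping"

    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // strip one leading space

    if (field === "event") eventType = value;
    else if (field === "data") dataLines.push(value);
    // "id" and "retry" are ignored in this sketch.
  }
  return events;
}
```

Feeding it `": ping\n\ndata: hello\n\n: ping\n\n"` yields a single message event: the pings put bytes on the wire for the proxies but never reach application code.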
2. Node.js HTTP Server Default Timeouts
No explicit timeout configuration on the HTTP server:
- keepAliveTimeout defaults to 5000ms (Node.js 24)
- headersTimeout defaults to 60000ms
Between tool calls on the same HTTP/1.1 connection, an idle gap longer than 5 seconds lets Node.js close the socket, which the proxy then sees as a connection reset.
Proposed Solution
SSE Heartbeat (every 30s)
After SSE stream is established, send periodic keepalive comments:
// In server.ts, after SSE connection is established
const heartbeatInterval = setInterval(() => {
  try {
    controller.enqueue(encoder.encode(": ping\n\n"));
  } catch {
    clearInterval(heartbeatInterval);
  }
}, 30000); // Every 30 seconds, well under Cloudflare's 100s limit
// Clean up on stream close
stream.on('close', () => clearInterval(heartbeatInterval));
HTTP Server Timeouts
const httpServer = http.createServer(app);
httpServer.keepAliveTimeout = 620000; // 620s — above Cloudflare's max (600s Enterprise)
httpServer.headersTimeout = 625000; // Must be > keepAliveTimeout
httpServer.timeout = 0; // No socket timeout for streaming
Fix Checklist
- Add SSE keepalive ping (:ping\n\n every 30s) to Streamable HTTP transport
- Configure keepAliveTimeout on HTTP server (620s for proxy compatibility)
- Configure headersTimeout on HTTP server (> keepAliveTimeout)
- Set server.timeout = 0 for SSE streaming support
- Make heartbeat interval configurable via env var (GITLAB_SSE_HEARTBEAT_MS)
- Add integration test: verify SSE stream survives > 125s with heartbeat
- Document proxy chain timeout requirements
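One possible shape for the configurable heartbeat item is sketched below. The env var name GITLAB_SSE_HEARTBEAT_MS and the 30s default come from this issue; the 90s upper clamp, the SseSink interface, and the function names resolveHeartbeatMs and startHeartbeat are assumptions made for illustration, not existing code.

```typescript
const DEFAULT_HEARTBEAT_MS = 30_000;
// Assumption: clamp below the ~100s idle limit so a bad value cannot disable the fix.
const MAX_HEARTBEAT_MS = 90_000;

// Reads GITLAB_SSE_HEARTBEAT_MS and falls back to the default on missing or invalid input.
function resolveHeartbeatMs(env: Record<string, string | undefined>): number {
  const raw = env["GITLAB_SSE_HEARTBEAT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_HEARTBEAT_MS;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) return DEFAULT_HEARTBEAT_MS;
  return Math.min(parsed, MAX_HEARTBEAT_MS);
}

// Minimal surface needed from the response stream; Node's http.ServerResponse fits it.
interface SseSink {
  write(chunk: string): unknown;
  once(event: "close", listener: () => void): unknown;
}

// Starts periodic SSE comment pings and returns a function that stops them.
function startHeartbeat(sink: SseSink, intervalMs: number): () => void {
  const timer = setInterval(() => {
    try {
      sink.write(": ping\n\n");
    } catch {
      clearInterval(timer); // stream is unusable; stop pinging
    }
  }, intervalMs);
  const stop = (): void => clearInterval(timer);
  sink.once("close", stop); // clean up when the client or proxy disconnects
  return stop;
}
```

A caller would pass `resolveHeartbeatMs(process.env)` as the interval. Accepting a narrow interface rather than a concrete response type also lets the integration test drive the heartbeat with a fake sink and short intervals instead of waiting out the full 125s.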
Impact
Serious — All SSE connections die after ~2 minutes regardless of transport routing fix (#138). This can be developed in parallel with #138.
Files to Modify
src/server.ts: HTTP server timeout configuration + SSE heartbeat setup
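A rough sketch of how the timeout part of that change could be isolated into one helper follows. The 620s, 625s, and 0 values are the ones proposed in this issue; the helper name applyProxyTimeouts and its parameters are hypothetical, and the actual layout of src/server.ts is not shown here.

```typescript
import * as http from "node:http";

// Applies the timeouts proposed in this issue so that Node.js never closes
// an idle keep-alive socket before the proxies in front of it do.
function applyProxyTimeouts(
  server: http.Server,
  proxyMaxIdleMs = 600_000, // Cloudflare's stated maximum (Enterprise), per this issue
  marginMs = 20_000,
): http.Server {
  server.keepAliveTimeout = proxyMaxIdleMs + marginMs; // 620000 with the defaults
  server.headersTimeout = server.keepAliveTimeout + 5_000; // must exceed keepAliveTimeout
  server.timeout = 0; // no socket inactivity timeout, needed for long-lived SSE streams
  return server;
}
```

Usage would look like `const httpServer = applyProxyTimeouts(http.createServer(app));`, where `app` is the existing request handler. Deriving headersTimeout from keepAliveTimeout keeps the ordering constraint intact if the proxy limit is changed later.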
Relationship
Independent of #138 (transport routing). Can be developed and tested in parallel. Both fixes are required for stable production operation.