♻️ refactor(phase3d): migrate pod exec WebSocket to kc-agent (Fixes #5406, Refs #7993) #8168
Conversation
Phase 3d of the pod-SA → kc-agent refactor. The `/ws/exec` WebSocket now runs under kc-agent using the user's own kubeconfig rather than the backend pod ServiceAccount. The target cluster's apiserver enforces RBAC natively against the caller's identity, so the SubjectAccessReview workaround the backend handler had to simulate (#8120, #5406) is no longer needed: the shell fails synchronously at stream-open time if the user's kubeconfig doesn't permit pods/exec on the target pod. Fixes #5406.

`pkg/agent/server_exec.go` (new file, ~520 lines): `handleExec` is the kc-agent side of the pod exec WebSocket. The flow mirrors the backend handler minus the pod-SA-specific layers:

1. Token validation via the standard `s.validateToken(r)` path (Authorization header or `?token=` query-param fallback that kc-agent's /ws already uses); no first-message JWT handshake.
2. Gorilla WebSocket upgrade.
3. Read the `exec_init` JSON frame (cluster/namespace/pod/container/command/tty/cols/rows).
4. Resolve the clientset + REST config from the shared `*k8s.MultiClusterClient` (the user's kubeconfig).
5. Build the SPDY exec stream and spin up ping-ticker and read-loop goroutines under a shared `writeMu`.
6. Block on `executor.StreamWithContext`; on exit, drain the reader goroutine (#7048 / #7778 ordering preserved) and send the exit frame.

What we kept from the backend handler because it's independent of the pod-SA problem:

- The stdin drop telemetry (`agentExecStdinDropCount` + rate-limited WARN log on power-of-two session counts) from PR 7995.
- The `agentExecMaxStdinBytes` frame cap (1 MiB).
- The #7048 / #7778 ordering: drain the reader before closing the resize queue channel to avoid a send-on-closed panic.
- Ping/pong keepalive so half-open TCP is detected within `agentExecPongTimeout`.
- Terminal size queue for TTY resize events.
- `writeMu` serializing every `WriteMessage` so stdout/stderr/ping/exit/error frames never race.
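The shared `writeMu` in step 5 can be reduced to a stdlib-only sketch. The `lockedWriter` type and `writeConcurrently` helper below are illustrative names, not the handler's real ones, and the real code guards gorilla's `WriteMessage` rather than a byte buffer:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strings"
	"sync"
)

// lockedWriter is a hypothetical stdlib-only stand-in for the handler's
// writeMu pattern: every frame write (stdout/stderr/ping/exit/error
// frames in the real handler) takes the mutex, so goroutines writing
// concurrently can never interleave bytes within a frame.
type lockedWriter struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (lw *lockedWriter) WriteFrame(frame string) {
	lw.mu.Lock()
	defer lw.mu.Unlock()
	lw.buf.WriteString(frame)
}

// writeConcurrently fires n goroutines at one lockedWriter and returns
// the frames that landed in the buffer, sorted for determinism.
func writeConcurrently(n int) []string {
	lw := &lockedWriter{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			lw.WriteFrame(fmt.Sprintf("frame-%d\n", id))
		}(i)
	}
	wg.Wait()
	frames := strings.Split(strings.TrimSuffix(lw.buf.String(), "\n"), "\n")
	sort.Strings(frames)
	return frames
}

func main() {
	fmt.Println(writeConcurrently(4)) // [frame-0 frame-1 frame-2 frame-3]
}
```

Without the mutex, two goroutines calling the underlying writer concurrently could interleave partial frames; with it, each frame lands whole, which is exactly the property the exit/error frames rely on.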
What we dropped because kc-agent uses the user's kubeconfig:

- The first-message JWT auth handshake (`authMessage` type). Auth is on the HTTP upgrade now.
- The `execAuthorizer` interface + `authorizePodExec` + the four sentinel errors `errExecAuthorizerUnavailable` / `errExecMissingUserSubject` / `errExecSARFailed` / `errExecRBACDenied`. The whole SubjectAccessReview dance was there to simulate the user's identity against the pod SA's clientset (#8120, #8137, #8139, #8140); kc-agent's clientset IS the user's identity, so the apiserver enforces natively.
- The per-user `execSessionsRegistry` / `registerExecSession` / `unregisterExecSession` / `CancelUserExecSessions`. kc-agent is a per-user local process with no logout concept: if the user closes their browser tab, the WebSocket closes, the read loop exits, and `execCancel` fires via defer. #6024's cross-session cleanup is no longer needed.

Named constants instead of literals, per the repo rule: `agentExecMaxStdinBytes` (1 MiB), `agentExecPingInterval` (30s), `agentExecPongTimeout` (45s), `agentExecWriteDeadline` (10s, explicit because gorilla, unlike gofiber, has no implicit write timeout), `agentExecStdinBufferSize` (32), `agentExecResizeBufferSize` (4), `agentExecDefaultCols` (80), `agentExecDefaultRows` (24).

`pkg/agent/server.go`: `mux.HandleFunc("/ws/exec", s.handleExec)` registered next to /ws with an explanatory comment.

`pkg/api/handlers/exec.go` + `exec_test.go` (deleted): the entire 791-line backend handler and its 395-line test file are gone. The architecture is cleaner now: there's one code path for exec, it runs against the user's identity, and the whole class of "who is the console SA and what can it impersonate" concerns is retired alongside the code.

`pkg/api/handlers/auth.go`: the Logout handler used to call `CancelUserExecSessions` (#6024) to tear down exec streams that were running as the pod SA when the user's JWT was revoked. kc-agent exec is per-user local, so there are no cross-session exec streams to cancel on logout.
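The kept #7048/#7778 ordering invariant (drain the reader, then close the resize channel) boils down to this stdlib sketch; `resizeCh`, `done`, and `drainAfterDone` are hypothetical names standing in for the handler's internals:

```go
package main

import "fmt"

// drainAfterDone reproduces the kept ordering: the reader goroutine
// sends resize events on resizeCh, so the channel may only be closed
// after the reader signals done; closing first would panic the sender.
func drainAfterDone() int {
	resizeCh := make(chan [2]uint16, 4) // buffered like agentExecResizeBufferSize
	done := make(chan struct{})

	go func() {
		defer close(done)
		for i := 0; i < 3; i++ {
			resizeCh <- [2]uint16{80, 24} // agentExecDefaultCols / agentExecDefaultRows
		}
	}()

	<-done          // 1. join the reader first (#7048)
	close(resizeCh) // 2. only now is close safe (#7778)

	n := 0
	for range resizeCh { // drain whatever the reader buffered
		n++
	}
	return n
}

func main() {
	fmt.Println("resize events drained:", drainAfterDone()) // prints 3
}
```

Reversing steps 1 and 2 converts a clean shutdown into a send-on-closed-channel panic whenever a resize event is in flight, which is the race the original backend fix addressed.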
The SSE cancellation (#6029) stays; it's unrelated.

`pkg/api/server.go`: the backend /ws/exec route registration is gone. A comment at the same line points readers at `pkg/agent/server_exec.go` and `web/src/hooks/useExecSession.ts` for the replacement.

`web/src/hooks/useExecSession.ts`: the WebSocket URL is now `LOCAL_AGENT_WS_URL.replace(/\/ws$/, '/ws/exec')` instead of `${window.location.host}/ws/exec`, so the browser connects to the user's own kc-agent at 127.0.0.1:8585. The first-message auth send is removed (kc-agent validates on upgrade). The `STORAGE_KEY_TOKEN` import is dropped; `LOCAL_AGENT_WS_URL` is added.

`web/src/hooks/__tests__/useExecSession.test.ts` (40/40 pass): three assertion updates: sentMessages[0] is now `exec_init` (not `auth`), and sentMessages[1] is no longer used. The "no auth token on connect" test is removed (that failure mode no longer exists on the frontend; kc-agent would 401 on upgrade instead, which the "constructor throwing" test already covers).

Refs #7993. Fixes #5406.

Signed-off-by: Andrew Anderson <[email protected]>
Thank you for your contribution! Your PR has been merged.
Pull request overview
Migrates pod exec WebSocket handling from the backend (pod ServiceAccount) to the local kc-agent, so the SPDY exec stream runs under the user’s kubeconfig and the target apiserver enforces RBAC natively (Phase 3d of #7993; closes #5406).
Changes:
- Added the kc-agent `/ws/exec` WebSocket handler and registered the route.
- Removed the backend `/ws/exec` handler + tests and unregistered the backend route.
- Updated the frontend exec hook/tests to connect to kc-agent and send `exec_init` as the first frame.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/src/hooks/useExecSession.ts | Switches exec WebSocket target to LOCAL_AGENT_WS_URL and removes first-message auth preamble. |
| web/src/hooks/__tests__/useExecSession.test.ts | Updates tests to reflect exec_init being the first/only pre-stream message. |
| pkg/api/server.go | Removes backend /ws/exec registration and documents migration to kc-agent. |
| pkg/api/handlers/auth.go | Removes logout-time exec-session cancellation (exec no longer runs in backend). |
| pkg/agent/server_exec.go | New kc-agent exec WebSocket handler implementing SPDY exec bridging and terminal I/O adapters. |
| pkg/agent/server.go | Registers kc-agent /ws/exec route. |
| pkg/api/handlers/exec.go | Deleted backend exec handler (pod-SA based). |
| pkg/api/handlers/exec_test.go | Deleted backend exec handler tests. |
Comments suppressed due to low confidence (1)
web/src/hooks/useExecSession.ts:239
- When KC_AGENT_TOKEN is configured, kc-agent only accepts WebSocket auth via Authorization header or the ?token= query parameter (browsers can’t set Authorization on the WebSocket handshake). This hook currently constructs the URL without a token, so /ws/exec will 401 in hardened installs. Include the stored token as a query param (or other supported mechanism) when present, and keep demo-token excluded if that’s the established policy elsewhere.
```ts
const wsUrl = LOCAL_AGENT_WS_URL.replace(/\/ws$/, '/ws/exec')
let ws: WebSocket
try {
  ws = new WebSocket(wsUrl)
```
```go
execErr := executor.StreamWithContext(execCtx, streamOpts)

// #7048 — Wait for the reader goroutine to finish before closing the
// size queue channel. The reader may still be pushing resize events;
// closing first would panic on send-to-closed (#7778).
<-done
close(sizeQueue.ch)
```
Waiting on <-done immediately after executor.StreamWithContext can deadlock when the remote command exits while the WebSocket is still open: the read-loop goroutine can block indefinitely in conn.ReadMessage() (pongs don’t make ReadMessage return), so done never closes and the handler never sends the exit frame. Consider proactively closing the WebSocket (or writing a close control frame) and/or cancelling the session context when StreamWithContext returns, then wait for the read loop to exit before closing sizeQueue.ch.
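The shape of the suggested fix can be sketched with stdlib primitives only; `io.Pipe` stands in for the WebSocket connection, and `unblockThenWait` is a hypothetical name:

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// unblockThenWait shows the fix the review proposes: when
// StreamWithContext returns, proactively close the transport the read
// loop is blocked on, and only then wait for the loop to exit.
func unblockThenWait() error {
	pr, pw := io.Pipe() // stands in for the conn that ReadMessage blocks on
	done := make(chan struct{})

	// Read loop: blocks in Read exactly like conn.ReadMessage would.
	go func() {
		defer close(done)
		buf := make([]byte, 1)
		for {
			if _, err := pr.Read(buf); err != nil {
				return // unblocked by the close below
			}
		}
	}()

	// ... executor.StreamWithContext has returned here ...
	pw.CloseWithError(io.EOF) // unblock the reader first
	select {
	case <-done: // now this wait cannot hang
		return nil
	case <-time.After(time.Second):
		return fmt.Errorf("read loop did not exit")
	}
}

func main() {
	fmt.Println(unblockThenWait()) // <nil>
}
```

Skipping the close and waiting on `done` directly reproduces the reported deadlock: the reader stays parked in its blocking read, `done` never closes, and the exit frame is never sent.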
```go
if init.Cluster == "" || init.Namespace == "" || init.Pod == "" {
	agentExecWriteError(conn, "Missing cluster, namespace, or pod")
	return
}
```
The init frame values are used for kubeconfig context lookup and to build the pods/exec request, but they aren’t validated. Other kc-agent handlers validate user-supplied identifiers with validateKubeContext and validateDNS1123Label to prevent path traversal / command injection / unexpected lookups. Apply the same validation here (cluster context, namespace, pod, and container) before calling GetClient/GetRestConfig and before logging them.
```go
if err := validateKubeContext(init.Cluster); err != nil {
	agentExecWriteError(conn, "Invalid cluster")
	return
}
if err := validateDNS1123Label(init.Namespace); err != nil {
	agentExecWriteError(conn, "Invalid namespace")
	return
}
if err := validateDNS1123Label(init.Pod); err != nil {
	agentExecWriteError(conn, "Invalid pod")
	return
}
if init.Container != "" {
	if err := validateDNS1123Label(init.Container); err != nil {
		agentExecWriteError(conn, "Invalid container")
		return
	}
}
```
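For readers unfamiliar with the helper, here is a minimal stdlib sketch of what `validateDNS1123Label` plausibly enforces; the regex follows RFC 1123 label rules, but kc-agent's actual implementation may differ:

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123Label matches an RFC 1123 label: lowercase alphanumerics with
// interior hyphens, starting and ending alphanumeric. This is a sketch
// of the validation the review suggests, not kc-agent's real code.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

func validateDNS1123Label(v string) error {
	if len(v) == 0 || len(v) > 63 {
		return fmt.Errorf("label must be 1-63 characters, got %d", len(v))
	}
	if !dns1123Label.MatchString(v) {
		return fmt.Errorf("invalid label %q", v)
	}
	return nil
}

func main() {
	fmt.Println(validateDNS1123Label("nginx-7d9c") == nil)    // true
	fmt.Println(validateDNS1123Label("../etc/passwd") == nil) // false: rejects traversal
	fmt.Println(validateDNS1123Label("Default") == nil)       // false: uppercase
}
```

Because the namespace and pod names are spliced into the pods/exec request path, rejecting anything outside this alphabet closes the path-traversal and unexpected-lookup cases the review calls out.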
```go
	return
}
defer conn.Close()
```
The gorilla WebSocket connection here doesn’t set a read limit. Without conn.SetReadLimit(...), an authenticated client can send an arbitrarily large frame and force large allocations before the JSON/Data-size checks run. Mirror handleWebSocket by setting an appropriate read limit (e.g., slightly above the max expected JSON frame size / 1 MiB).
```go
// Bound inbound frame size so an authenticated client cannot force large
// allocations before JSON / message-size validation runs.
const agentExecMaxWebSocketFrameBytes int64 = 1 << 20 // 1 MiB
conn.SetReadLimit(agentExecMaxWebSocketFrameBytes)
```
```go
// Pod exec WebSocket moved to kc-agent (#7993 Phase 3d-A). Runs the
// SPDY exec stream under the user's kubeconfig so the target cluster's
// apiserver enforces RBAC natively — no SubjectAccessReview dance is
// needed, unlike pkg/api/handlers/exec.go which uses the pod SA and
// has to simulate the user identity. Handler in server_exec.go.
mux.HandleFunc("/ws/exec", s.handleExec)
```
This adds a new sensitive WebSocket endpoint (/ws/exec) guarded by validateToken. To prevent auth regressions, it should be included in the agent’s endpoint auth coverage table (pkg/agent/endpoint_auth_test.go sensitiveEndpoints + handler resolver) the same way /ws is.
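The coverage-table idea can be sketched with `net/http/httptest`; `requireToken` and `endpointStatuses` are hypothetical stand-ins, not kc-agent's actual helpers:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireToken is a hypothetical stand-in for kc-agent's validateToken
// guard: reject any request carrying neither an Authorization header
// nor a ?token= query parameter.
func requireToken(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" && r.URL.Query().Get("token") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next(w, r)
	}
}

// endpointStatuses hits every endpoint in the sensitive table without a
// token and reports the status codes; each entry must come back 401.
func endpointStatuses() (map[string]int, error) {
	mux := http.NewServeMux()
	ok := func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }
	mux.HandleFunc("/ws", requireToken(ok))
	mux.HandleFunc("/ws/exec", requireToken(ok)) // the new endpoint joins the table
	srv := httptest.NewServer(mux)
	defer srv.Close()

	statuses := make(map[string]int)
	for _, ep := range []string{"/ws", "/ws/exec"} {
		resp, err := http.Get(srv.URL + ep)
		if err != nil {
			return nil, err
		}
		resp.Body.Close()
		statuses[ep] = resp.StatusCode
	}
	return statuses, nil
}

func main() {
	statuses, err := endpointStatuses()
	if err != nil {
		panic(err)
	}
	fmt.Println(statuses["/ws"], statuses["/ws/exec"]) // 401 401
}
```

A table-driven test like this fails loudly the moment someone registers a sensitive route without the auth guard, which is the regression class the comment wants covered.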
Post-merge build verification passed ✅: both the Go and frontend builds compiled successfully against the merge commit.
❌ Post-Merge Verification: failed. Commit:
Fixes #8175.

The `builds wss: URL when page uses https` test in useExecSession-expand was asserting the pre-phase3d URL shape, `wss://<page-host>/ws/exec`, built from `window.location.protocol`. After #8168 migrated pod exec to kc-agent (#7993), the hook uses the `LOCAL_AGENT_WS_URL` constant (`ws://127.0.0.1:8585/ws`) unconditionally, because the SPDY exec stream now runs under the user's own kubeconfig on the local loopback rather than through the page origin. The page protocol no longer influences the URL, so the old assertion has been stale since #8168 merged.

This is a pre-existing flake that Coverage Suite run #760 happened to surface on top of b828c87 (#8169). #8169 only touched useDashboardHealth, MSW handlers, and useDashboardHealth.test.ts, none of which are related to exec-session URL construction. Confirmed by running the test in isolation and together with useDashboardHealth.test.ts (both pass after the fix; the isolated failure reproduces on unmodified main against useExecSession.ts:235).

Fix: rename the test to "builds kc-agent /ws/exec URL regardless of page protocol" and assert the new, stable target. The rationale is documented inline with a source pointer to useExecSession.ts:235 so future readers don't reintroduce the old assertion.

Signed-off-by: Andy Anderson <[email protected]>
Summary
Phase 3d of the pod-SA → kc-agent refactor (#7993). Pod exec WebSocket now runs under kc-agent using the user's own kubeconfig rather than the backend pod ServiceAccount. The target cluster's apiserver enforces RBAC natively against the caller's identity, so the SubjectAccessReview workaround the backend handler had to simulate (#8120, #5406) is no longer needed.
Fixes #5406. Refs #7993.
What's in the PR
New — `pkg/agent/server_exec.go` (519 lines): kc-agent's `handleExec`, route `/ws/exec`, plus the `agentWSWriter` / `agentWSReader` / `agentTerminalSizeQueue` adapters for gorilla/websocket. Mirrors the backend handler's behaviour where it's independent of the pod-SA problem (stdin drop telemetry from PR 7995, rate-limited WARN log on power-of-two counts, 1 MiB frame cap, ping/pong keepalive, #7048/#7778 reader-drain-before-close ordering, writeMu serialization, TTY resize queue).
Deleted — `pkg/api/handlers/exec.go` (791 lines) + `pkg/api/handlers/exec_test.go` (395 lines). The whole backend handler and its tests are gone. The #8120 / #8137 / #8139 / #8140 SubjectAccessReview machinery — ~200 lines of simulated-user RBAC checking — retires with it because kc-agent's clientset IS the user's identity.
Wired up — `pkg/agent/server.go` registers `/ws/exec`; `pkg/api/server.go` drops the backend registration; `pkg/api/handlers/auth.go`'s Logout handler no longer calls `CancelUserExecSessions` (that whole cross-session registry was there to cancel exec streams running as the pod SA on JWT revocation — kc-agent exec is per-user local, so closing the browser tab is enough).
Frontend — `web/src/hooks/useExecSession.ts` now connects to `LOCAL_AGENT_WS_URL.replace(/\/ws$/, '/ws/exec')` instead of `${window.location.host}/ws/exec`, and sends `exec_init` as the first frame instead of an `auth` preamble. 40/40 vitest cases pass; 3 assertion updates (sentMessages[0] is now init, not auth) and one test removed (the "no auth token on connect" error mode no longer exists frontend-side — kc-agent would 401 on upgrade).
Diff size
8 files, +576 / -1238 — the migration is a net 662-line delete because the pod-SA workaround machinery goes away.
Test plan
Risk
The apiserver's native RBAC check replaces the simulated one. Users who relied on the old `github:` impersonation subject for their cluster RBAC bindings will need to re-bind against their actual kubeconfig identity — but that's the correct direction: RBAC on the target cluster should reflect the real user, not a console-minted synthetic subject.
In-cluster deployments without a local kc-agent lose exec entirely. This is the stated architectural intent of #7993 — a console admin who's not a k8s cluster-admin no longer gets to exec into arbitrary pods via the pod SA.