Skip to content

♻️ refactor(phase3d): migrate pod exec WebSocket to kc-agent (Fixes #5406, Refs #7993)#8168

Merged
clubanderson merged 1 commit intomainfrom
fix/phase3d-exec-via-agent
Apr 15, 2026
Merged

♻️ refactor(phase3d): migrate pod exec WebSocket to kc-agent (Fixes #5406, Refs #7993)#8168
clubanderson merged 1 commit intomainfrom
fix/phase3d-exec-via-agent

Conversation

@clubanderson
Copy link
Copy Markdown
Collaborator

Summary

Phase 3d of the pod-SA → kc-agent refactor (#7993). Pod exec WebSocket now runs under kc-agent using the user's own kubeconfig rather than the backend pod ServiceAccount. The target cluster's apiserver enforces RBAC natively against the caller's identity, so the SubjectAccessReview workaround the backend handler had to simulate (#8120, #5406) is no longer needed.

Fixes #5406. Refs #7993.

What's in the PR

New — `pkg/agent/server_exec.go` (519 lines): kc-agent's `handleExec`, route `/ws/exec`, plus the `agentWSWriter` / `agentWSReader` / `agentTerminalSizeQueue` adapters for gorilla/websocket. Mirrors the backend handler's behaviour where it's independent of the pod-SA problem (stdin drop telemetry from PR 7995, rate-limited WARN log on power-of-two counts, 1 MiB frame cap, ping/pong keepalive, #7048/#7778 reader-drain-before-close ordering, writeMu serialization, TTY resize queue).

Deleted — `pkg/api/handlers/exec.go` (791 lines) + `pkg/api/handlers/exec_test.go` (395 lines). The whole backend handler and its tests are gone. The #8120 / #8137 / #8139 / #8140 SubjectAccessReview machinery — ~200 lines of simulated-user RBAC checking — retires with it because kc-agent's clientset IS the user's identity.

Wired up — `pkg/agent/server.go` registers `/ws/exec`; `pkg/api/server.go` drops the backend registration; `pkg/api/handlers/auth.go`'s Logout handler no longer calls `CancelUserExecSessions` (that whole cross-session registry was there to cancel exec streams running as the pod SA on JWT revocation — kc-agent exec is per-user local, so closing the browser tab is enough).

Frontend — `web/src/hooks/useExecSession.ts` now connects to `LOCAL_AGENT_WS_URL.replace(//ws$/, '/ws/exec')` instead of `${window.location.host}/ws/exec`, and sends `exec_init` as the first frame instead of an `auth` preamble. 40/40 vitest cases pass; 3 assertion updates (sentMessages[0] is now init, not auth) and one test removed (the "no auth token on connect" error mode no longer exists frontend-side — kc-agent would 401 on upgrade).

Diff size

8 files, +576 / -1238 — the migration is a net 662-line delete because the pod-SA workaround machinery goes away.

Test plan

  • `go build ./...` green
  • `go vet ./...` clean
  • `go test ./pkg/api/handlers/ -run TestAuth` — Logout handler still passes after dropping `CancelUserExecSessions`
  • `npx vitest run src/hooks/tests/useExecSession.test.ts` — 40/40 pass
  • `npm run build` — green, all post-build safety checks pass
  • Manual: open a Web Terminal on a running pod via the browser UI, confirm the exec stream reaches the pod and stdin/stdout flow correctly
  • Manual: try to exec into a pod the user's kubeconfig doesn't have pods/exec on — confirm a clean apiserver 403 error instead of the old pre-flight SAR deny

Risk

The apiserver's native RBAC check replaces the simulated one. Users who relied on the old `github:` impersonation subject for their cluster RBAC bindings will need to re-bind against their actual kubeconfig identity — but that's the correct direction: RBAC on the target cluster should reflect the real user, not a console-minted synthetic subject.

In-cluster deployments without a local kc-agent lose exec entirely. This is the stated architectural intent of #7993 — a console admin who's not a k8s cluster-admin no longer gets to exec into arbitrary pods via the pod SA.

…5406)

Phase 3d of the pod-SA → kc-agent refactor. The /ws/exec WebSocket now
runs under kc-agent using the user's own kubeconfig rather than the
backend pod ServiceAccount. The target cluster's apiserver enforces RBAC
natively against the caller's identity, so the SubjectAccessReview
workaround the backend handler had to simulate (#8120, #5406) is no
longer needed — the shell fails synchronously at stream-open time if
the user's kubeconfig doesn't permit pods/exec on the target pod.

Fixes #5406.

pkg/agent/server_exec.go (new file, ~520 lines):
  handleExec is the kc-agent side of the pod exec WebSocket. The flow
  mirrors the backend handler minus the pod-SA-specific layers:
    1. Token validation via the standard s.validateToken(r) path
       (Authorization header or ?token= query param fallback that
       kc-agent's /ws already uses) — no first-message JWT handshake.
    2. Gorilla WebSocket upgrade.
    3. Read the exec_init JSON frame (cluster/namespace/pod/container/
       command/tty/cols/rows).
    4. Resolve the clientset + REST config from the shared
       *k8s.MultiClusterClient (the user's kubeconfig).
    5. Build the SPDY exec stream and spin up ping ticker + read loop
       goroutines under a shared writeMu.
    6. Block on executor.StreamWithContext; on exit, drain the reader
       goroutine (#7048 / #7778 ordering preserved) and send the exit
       frame.

  What we kept from the backend handler because it's independent of
  the pod-SA problem:
    - The stdin drop telemetry (agentExecStdinDropCount +
      rate-limited WARN log on power-of-two session counts) from PR 7995.
    - The agentExecMaxStdinBytes frame cap (1 MiB).
    - The #7048 / #7778 ordering: drain reader before closing the
      resize queue channel to avoid send-on-closed panic.
    - Ping/pong keepalive so half-open TCP is detected within
      agentExecPongTimeout.
    - Terminal size queue for TTY resize events.
    - writeMu serializing every WriteMessage so stdout/stderr/ping/
      exit/error frames never race.

  What we dropped because kc-agent uses the user's kubeconfig:
    - First-message JWT auth handshake (authMessage type). Auth is on
      the HTTP upgrade now.
    - execAuthorizer interface + authorizePodExec + the four sentinel
      errors errExecAuthorizerUnavailable / errExecMissingUserSubject
      / errExecSARFailed / errExecRBACDenied. The whole SubjectAccess-
      Review dance was there to simulate the user's identity against
      the pod SA's clientset (#8120, #8137, #8139, #8140); kc-agent's
      clientset IS the user's identity, so the apiserver enforces
      natively.
    - Per-user execSessionsRegistry / registerExecSession /
      unregisterExecSession / CancelUserExecSessions. kc-agent is a
      per-user local process with no logout concept — if the user
      closes their browser tab the WebSocket closes, the read loop
      exits, and execCancel fires via defer. #6024's cross-session
      cleanup is no longer needed.

  Named constants instead of literals per the repo rule:
  agentExecMaxStdinBytes (1 MiB), agentExecPingInterval (30s),
  agentExecPongTimeout (45s), agentExecWriteDeadline (10s — explicit
  because gorilla unlike gofiber does not have an implicit write
  timeout), agentExecStdinBufferSize (32), agentExecResizeBufferSize
  (4), agentExecDefaultCols (80), agentExecDefaultRows (24).

pkg/agent/server.go:
  mux.HandleFunc("/ws/exec", s.handleExec) registered next to /ws with
  an explanatory comment.

pkg/api/handlers/exec.go + exec_test.go (deleted):
  The entire 791-line backend handler and its 395-line test file are
  gone. The architecture is cleaner now — there's one code path for
  exec, it runs against the user's identity, and the whole class of
  "who is the console SA and what can it impersonate" concerns is
  retired alongside the code.

pkg/api/handlers/auth.go:
  The Logout handler used to call CancelUserExecSessions (#6024) to
  tear down exec streams that were running as the pod SA when the
  user's JWT was revoked. kc-agent exec is per-user local, so there
  are no cross-session exec streams to cancel on logout. The SSE
  cancellation (#6029) stays; it's unrelated.

pkg/api/server.go:
  The backend /ws/exec route registration is gone. A comment at the
  same line points readers at pkg/agent/server_exec.go and
  web/src/hooks/useExecSession.ts for the replacement.

web/src/hooks/useExecSession.ts:
  The WebSocket URL is now LOCAL_AGENT_WS_URL.replace(/\/ws$/, '/ws/exec')
  instead of ${window.location.host}/ws/exec so the browser connects
  to the user's own kc-agent at 127.0.0.1:8585. The first-message
  auth send is removed (kc-agent validates on upgrade). STORAGE_KEY_TOKEN
  import dropped; LOCAL_AGENT_WS_URL added.

web/src/hooks/__tests__/useExecSession.test.ts (40/40 pass):
  Three assertion updates — sentMessages[0] is now exec_init (not
  auth), sentMessages[1] is no longer used. The "no auth token on
  connect" test is removed (that failure mode no longer exists on the
  frontend; kc-agent would 401 on upgrade instead, which the
  "constructor throwing" test already covers).

Refs #7993. Fixes #5406.

Signed-off-by: Andrew Anderson <[email protected]>
Copilot AI review requested due to automatic review settings April 15, 2026 15:51
@kubestellar-prow kubestellar-prow Bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 15, 2026
@kubestellar-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign eeshaansa for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 15, 2026

Deploy Preview for kubestellarconsole ready!

Name Link
🔨 Latest commit 20c5fcd
🔍 Latest deploy log https://app.netlify.com/projects/kubestellarconsole/deploys/69dfb41c78373300087bf200
😎 Deploy Preview https://deploy-preview-8168.console-deploy-preview.kubestellar.io
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@clubanderson clubanderson added the ai-generated Pull request generated by AI label Apr 15, 2026
@kubestellar-prow kubestellar-prow Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@clubanderson clubanderson merged commit 3f1af70 into main Apr 15, 2026
34 of 35 checks passed
@kubestellar-prow kubestellar-prow Bot deleted the fix/phase3d-exec-via-agent branch April 15, 2026 15:57
@github-actions
Copy link
Copy Markdown
Contributor

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Migrates pod exec WebSocket handling from the backend (pod ServiceAccount) to the local kc-agent, so the SPDY exec stream runs under the user’s kubeconfig and the target apiserver enforces RBAC natively (Phase 3d of #7993; closes #5406).

Changes:

  • Added kc-agent /ws/exec WebSocket handler and registered the route.
  • Removed backend /ws/exec handler + tests and unregistered the backend route.
  • Updated the frontend exec hook/tests to connect to kc-agent and send exec_init as the first frame.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
web/src/hooks/useExecSession.ts Switches exec WebSocket target to LOCAL_AGENT_WS_URL and removes first-message auth preamble.
web/src/hooks/tests/useExecSession.test.ts Updates tests to reflect exec_init being the first/only pre-stream message.
pkg/api/server.go Removes backend /ws/exec registration and documents migration to kc-agent.
pkg/api/handlers/auth.go Removes logout-time exec-session cancellation (exec no longer runs in backend).
pkg/agent/server_exec.go New kc-agent exec WebSocket handler implementing SPDY exec bridging and terminal I/O adapters.
pkg/agent/server.go Registers kc-agent /ws/exec route.
pkg/api/handlers/exec.go Deleted backend exec handler (pod-SA based).
pkg/api/handlers/exec_test.go Deleted backend exec handler tests.
Comments suppressed due to low confidence (1)

web/src/hooks/useExecSession.ts:239

  • When KC_AGENT_TOKEN is configured, kc-agent only accepts WebSocket auth via Authorization header or the ?token= query parameter (browsers can’t set Authorization on the WebSocket handshake). This hook currently constructs the URL without a token, so /ws/exec will 401 in hardened installs. Include the stored token as a query param (or other supported mechanism) when present, and keep demo-token excluded if that’s the established policy elsewhere.
    const wsUrl = LOCAL_AGENT_WS_URL.replace(/\/ws$/, '/ws/exec')

    let ws: WebSocket
    try {
      ws = new WebSocket(wsUrl)

Comment thread pkg/agent/server_exec.go
Comment on lines +487 to +493
execErr := executor.StreamWithContext(execCtx, streamOpts)

// #7048 — Wait for the reader goroutine to finish before closing the
// size queue channel. The reader may still be pushing resize events;
// closing first would panic on send-to-closed (#7778).
<-done
close(sizeQueue.ch)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting on <-done immediately after executor.StreamWithContext can deadlock when the remote command exits while the WebSocket is still open: the read-loop goroutine can block indefinitely in conn.ReadMessage() (pongs don’t make ReadMessage return), so done never closes and the handler never sends the exit frame. Consider proactively closing the WebSocket (or writing a close control frame) and/or cancelling the session context when StreamWithContext returns, then wait for the read loop to exit before closing sizeQueue.ch.

Copilot uses AI. Check for mistakes.
Comment thread pkg/agent/server_exec.go
if init.Cluster == "" || init.Namespace == "" || init.Pod == "" {
agentExecWriteError(conn, "Missing cluster, namespace, or pod")
return
}
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The init frame values are used for kubeconfig context lookup and to build the pods/exec request, but they aren’t validated. Other kc-agent handlers validate user-supplied identifiers with validateKubeContext and validateDNS1123Label to prevent path traversal / command injection / unexpected lookups. Apply the same validation here (cluster context, namespace, pod, and container) before calling GetClient/GetRestConfig and before logging them.

Suggested change
}
}
if err := validateKubeContext(init.Cluster); err != nil {
agentExecWriteError(conn, "Invalid cluster")
return
}
if err := validateDNS1123Label(init.Namespace); err != nil {
agentExecWriteError(conn, "Invalid namespace")
return
}
if err := validateDNS1123Label(init.Pod); err != nil {
agentExecWriteError(conn, "Invalid pod")
return
}
if init.Container != "" {
if err := validateDNS1123Label(init.Container); err != nil {
agentExecWriteError(conn, "Invalid container")
return
}
}

Copilot uses AI. Check for mistakes.
Comment thread pkg/agent/server_exec.go
return
}
defer conn.Close()

Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gorilla WebSocket connection here doesn’t set a read limit. Without conn.SetReadLimit(...), an authenticated client can send an arbitrarily large frame and force large allocations before the JSON/Data-size checks run. Mirror handleWebSocket by setting an appropriate read limit (e.g., slightly above the max expected JSON frame size / 1 MiB).

Suggested change
// Bound inbound frame size so an authenticated client cannot force large
// allocations before JSON / message-size validation runs.
const agentExecMaxWebSocketFrameBytes int64 = 1 << 20 // 1 MiB
conn.SetReadLimit(agentExecMaxWebSocketFrameBytes)

Copilot uses AI. Check for mistakes.
Comment thread pkg/agent/server.go
Comment on lines +546 to +551
// Pod exec WebSocket moved to kc-agent (#7993 Phase 3d-A). Runs the
// SPDY exec stream under the user's kubeconfig so the target cluster's
// apiserver enforces RBAC natively — no SubjectAccessReview dance is
// needed, unlike pkg/api/handlers/exec.go which uses the pod SA and
// has to simulate the user identity. Handler in server_exec.go.
mux.HandleFunc("/ws/exec", s.handleExec)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a new sensitive WebSocket endpoint (/ws/exec) guarded by validateToken. To prevent auth regressions, it should be included in the agent’s endpoint auth coverage table (pkg/agent/endpoint_auth_test.go sensitiveEndpoints + handler resolver) the same way /ws is.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown
Contributor

Post-merge build verification passed

Both Go and frontend builds compiled successfully against merge commit 3f1af7032be8b0e751e0cf62a32d8f2fe4ac572b.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Post-Merge Verification: failed

Commit: 3f1af7032be8b0e751e0cf62a32d8f2fe4ac572b
Specs run: smoke.spec.ts
Report: https://github.com/kubestellar/console/actions/runs/24464581390

clubanderson added a commit that referenced this pull request Apr 15, 2026
Fixes #8175

The `builds wss: URL when page uses https` test in useExecSession-expand
was asserting the pre-phase3d URL shape — `wss://<page-host>/ws/exec` —
built from `window.location.protocol`. After #8168 migrated pod exec to
kc-agent (#7993), the hook uses the `LOCAL_AGENT_WS_URL` constant
(`ws://127.0.0.1:8585/ws`) unconditionally, because the SPDY exec stream
now runs under the user's own kubeconfig on the local loopback rather
than through the page origin. The page protocol no longer influences the
URL, so the old assertion has been stale since #8168 merged.

This is a pre-existing flake that Coverage Suite run #760 happened to
surface on top of b828c87 (#8169). #8169 only touched
useDashboardHealth, MSW handlers, and useDashboardHealth.test.ts — none
of which are related to exec session URL construction. Confirmed by
running the test in isolation and together with useDashboardHealth.test.ts
(both pass after the fix; isolated failure reproduces on unmodified
main against useExecSession.ts:235).

Fix: rename the test to "builds kc-agent /ws/exec URL regardless of
page protocol" and assert the new, stable target. Documented the
rationale inline with source pointer to useExecSession.ts:235 so future
readers don't reintroduce the old assertion.

Signed-off-by: Andy Anderson <[email protected]>
clubanderson added a commit that referenced this pull request Apr 15, 2026
Fixes #8175

The `builds wss: URL when page uses https` test in useExecSession-expand
was asserting the pre-phase3d URL shape — `wss://<page-host>/ws/exec` —
built from `window.location.protocol`. After #8168 migrated pod exec to
kc-agent (#7993), the hook uses the `LOCAL_AGENT_WS_URL` constant
(`ws://127.0.0.1:8585/ws`) unconditionally, because the SPDY exec stream
now runs under the user's own kubeconfig on the local loopback rather
than through the page origin. The page protocol no longer influences the
URL, so the old assertion has been stale since #8168 merged.

This is a pre-existing flake that Coverage Suite run #760 happened to
surface on top of b828c87 (#8169). #8169 only touched
useDashboardHealth, MSW handlers, and useDashboardHealth.test.ts — none
of which are related to exec session URL construction. Confirmed by
running the test in isolation and together with useDashboardHealth.test.ts
(both pass after the fix; isolated failure reproduces on unmodified
main against useExecSession.ts:235).

Fix: rename the test to "builds kc-agent /ws/exec URL regardless of
page protocol" and assert the new, stable target. Documented the
rationale inline with source pointer to useExecSession.ts:235 so future
readers don't reintroduce the old assertion.

Signed-off-by: Andy Anderson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-generated Pull request generated by AI dco-signoff: yes Indicates the PR's author has signed the DCO. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pod exec allows any authenticated user to open a shell in any pod.

2 participants