refactor: backend pod SA must not be used for user-initiated k8s actions (umbrella — was CheckCanI) #7993

@clubanderson

Description


Architectural rule (confirmed 2026-04-14)

The console running on a cluster is not supposed to give anyone elevated access to the cluster via the pod ServiceAccount. It is supposed to work just like localhost: each user brings their own kc-agent + kubeconfig for their own use — not shared. As long as GPU reservation continues to work, it is fine to break in-cluster functionality for users who don't have a local kc-agent.

Pod SA may only be used for:

  1. Bootstrapping the console as a Deployment (frontend + internal console state).
  2. GPU reservation exception — namespace create + ResourceQuota create/update/delete (namespaces.go GPU path, mcp_resources.go ResourceQuota handlers).
  3. Self-upgrade exception — `self_upgrade.go` patching its own Deployment.

Every other k8s operation against a managed cluster MUST go through kc-agent at `LOCAL_AGENT_HTTP_URL` / `LOCAL_AGENT_WS_URL` (`ws://127.0.0.1:8585`), which loads the user's kubeconfig and respects per-cluster RBAC automatically via the apiserver.

This was originally filed (from #7979) as a narrow CheckCanI SSAR identity bug. The audit revealed it's an architectural migration gap — some handlers already route through kc-agent, many don't. CheckCanI is a symptom, not the disease.

Phase 0 audit — violation inventory

pkg/api/handlers/ call sites (mutating or dynamic)

| File | Handler | Verb + resource | Classification |
|---|---|---|---|
| workloads.go:199 | DeployWorkload | create Deployment/STS/DS/Svc/CM/Secret bundle | MIGRATE |
| workloads.go:506/568/576/625 | node-label group flow | patch Node `kubestellar.io/group` | MIGRATE |
| workloads.go:1112 | ScaleWorkload | patch/scale Deployment/STS | MIGRATE |
| workloads.go:1139 | DeleteWorkload | delete Deployment bundle | MIGRATE |
| namespaces.go:98 | CreateNamespace | create Namespace | SPLIT: GPU path KEEP, general MIGRATE |
| namespaces.go:138 | DeleteNamespace | delete Namespace | SPLIT: same policy |
| namespaces.go:279 | GrantNamespaceAccess | create RoleBinding | MIGRATE |
| namespaces.go:321 | RevokeNamespaceAccess | delete RoleBinding | MIGRATE |
| rbac.go:461 | CreateServiceAccount | create ServiceAccount | MIGRATE |
| rbac.go:554 | CreateRoleBinding | create RoleBinding | MIGRATE |
| mcs.go:229 | CreateServiceExport | create ServiceExport | MIGRATE |
| mcs.go:256 | DeleteServiceExport | delete ServiceExport | MIGRATE |
| mcp_resources.go:225 | InstallGPUHealthCronJob | create CronJob+RBAC | MIGRATE |
| mcp_resources.go:261 | UninstallGPUHealthCronJob | delete CronJob+RBAC | MIGRATE |
| mcp_resources.go:892 | CreateOrUpdateResourceQuota | create/update ResourceQuota | KEEP (GPU) |
| mcp_resources.go:928 | DeleteResourceQuota | delete ResourceQuota | KEEP (GPU) |
| gitops.go (13 exec sites + 1 dyn) | helm / kubectl / argocd / git | all verbs | MIGRATE — shells out with pod kubeconfig |
| self_upgrade.go:113/126/148/388 | — | status/apply own Deployment | KEEP (self-upgrade) |
| console_persistence.go + console_resources.go (~13 write sites) | CRUD ManagedWorkload / ClusterGroup / WorkloadDeployment CRs | create/update/delete CR | MIGRATE |
| custom_resources.go:171, crds.go:89, admission_webhooks.go:113, service_exports.go:94 | — | list (read-only) | MIGRATE (view-leak) |
| exec.go:363/369 | WS exec via SPDY | pod exec | DELETE — local kc-agent WS already handles, closes #5406 |
| sse.go (15 sites) + mcp_resources.go/mcp_workloads.go/mcp_cluster.go/rbac.go/gateway.go/mcs.go/topology.go | reads (~120 sites) | list/get various | MIGRATE (view-leak, Phase 4.5) |
| k8s/rbac.go:289-312 | CheckClusterAdminAccess | SSAR via shared client | DELETE — guard used in namespaces.go is invalid; replaced by kc-agent routing |
| k8s/rbac.go:619-648 | CheckCanI | SSAR via shared client | DELETE or make GPU-specific — no general-purpose consumer after migration |

Frontend call sites to migrate

  • `hooks/useWorkloads.ts` (deploy, scale, delete)
  • `hooks/useUsers.ts` (service accounts, bindings)
  • `hooks/useMCS.ts` (service exports)
  • `hooks/useArgoCD.ts` (sync, applicationsets, detect-drift)
  • `components/gitops/SyncDialog.tsx` (sync, detect-drift)
  • `components/namespaces/{CreateNamespaceModal,NamespaceManager,GrantAccessModal}.tsx`
  • `hooks/mcp/storage.ts` — ResourceQuota paths KEEP (GPU)
  • `hooks/useCachedData.ts` — GPU health cronjob → MIGRATE
  • `components/drilldown/RemediationConsole.tsx` — MCP ops tools (per-tool review)

kc-agent coverage gaps (new routes needed in `pkg/agent/server.go`)

kc-agent currently exposes only one mutating k8s route, `POST /scale`. All of the routes below need to be added:

  • `POST /workloads/deploy` (bundle create — replaces `DeployWorkload`, ~400 LOC port)
  • `POST /workloads/delete`
  • `POST/DELETE /namespaces` (general create/delete)
  • `POST/DELETE /rolebindings` (namespace access grant/revoke + rbac.CreateRoleBinding)
  • `POST /serviceaccounts`
  • `POST/DELETE /serviceexports` (mcs)
  • `POST /gitops/helm-{rollback,uninstall,upgrade}` (new shell-out handlers)
  • `POST /gitops/detect-drift` + `POST /gitops/sync`
  • `POST /argocd/sync`
  • `POST /gpu-health-cronjob` (install/uninstall — MIGRATE classification)
  • `POST/PUT/DELETE /console-cr/*` (ManagedWorkload, ClusterGroup, WorkloadDeployment)
  • Optional: `POST /node-label` for `kubestellar.io/group` patches

Phase plan (refined against real scope)

| Phase | Scope | Size | Blocked by |
|---|---|---|---|
| 1 | workloads.go: Scale (frontend swap to existing `/scale`); Deploy + Delete (port bundling logic to kc-agent); decide node-label routing | Medium (~400 LOC port) | — |
| 1.5 | rbac.go + mcs.go: CreateServiceAccount, CreateRoleBinding, Create/Delete ServiceExport | Small — new `/rolebindings` route used in Phase 2 | — |
| 2 | namespaces.go SPLIT: keep GPU-reservation path, migrate general create/delete + grant/revoke | Medium | 1.5 |
| 2.5 | console_persistence.go: migrate CR writes via new agent `/console-cr/*` routes | Medium | — |
| 3a | kc-agent helm handlers (rollback/uninstall/upgrade) | Medium — shell-out wrappers | — |
| 3b | kc-agent drift-detect + kubectl-sync handlers | Small | — |
| 3c | kc-agent ArgoCD handlers (sync + CR update) | Medium | — |
| 3d | Delete backend exec handler (local kc-agent already serves) — closes #5406 | Small | — |
| 3e | Migrate InstallGPUHealthCronJob + node-label to kc-agent | Small | — |
| 4 | Delete backend gitops.go handlers; frontend migration to new agent routes | Medium | 3a/3b/3c |
| 4.5 | Read-leak cleanup: migrate ~150 list/get call sites in sse/mcp_*/rbac/gateway/mcs/topology/crds/admission_webhooks/custom_resources/service_exports | Large-LOC, mechanical | — |
| 5 | Rename `MultiClusterClient` → `PrivilegedClient`; add CI lint blocking new `k8sClient.*Create/Delete/Update/Patch` in `pkg/api/handlers/` outside allowlist; delete/narrow `CheckCanI` and `CheckClusterAdminAccess` | Small | all prior |

Closes on merge

This issue, plus #5406 (the documented backend exec-handler limitation; that handler is deleted in Phase 3d).

Expected user-visible effect

In-cluster users without a local kc-agent lose destructive operations as each phase lands; this is the stated architectural intent. GPU reservation continues to work throughout (the only pod-SA path for user-initiated actions). Local-mode users are unaffected because the backend falls through to `~/.kube/config`, which is what kc-agent uses anyway.

Status

  • Phase 0 audit: complete (this issue body)
  • Phase 1: awaiting user green-light to launch

    Labels

    `help wanted`, `kind/bug`
