fix(backend): SSE/WS safety, exec goroutine join, nightly E2E hardening (#7041-#7057) #7069
Conversation
…singleflight, evictor shutdown

- #7041 — DisconnectUser now sends a nil sentinel through client.send so the writer goroutine sends the close frame, maintaining single-writer semantics instead of racing conn.WriteMessage.
- #7042 — Hub.Close() iterates all clients and closes their send channels so writer goroutines unblock immediately on shutdown.
- #7043 — Server.Shutdown() now calls handlers.StopSSECacheEvictor() to stop the background evictor goroutine.
- #7044 — SSE cache key includes the user ID to prevent cross-user data leakage between admin and viewer roles.
- #7045 — SSE cold-cache fetches use singleflight to coalesce concurrent Kubernetes API calls for the same cache key.
- #7046 — GetEventsStream clamps the limit query parameter to maxWarningEventsLimit (500).
- #7049 — Hub broadcast copies the client slice under RLock before iterating, preventing concurrent modification races.
- #7050 — writeSSEEvent strips \n and \r from eventName to prevent SSE frame injection.

Signed-off-by: Andy Anderson <[email protected]>
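For the #7041 item above, a minimal sketch of the single-writer pattern: one goroutine owns all conn.WriteMessage calls, and a nil sentinel on client.send tells it to emit the close frame. The writePump name, the Client fields, and the use of gorilla/websocket are assumptions for illustration, not the project's actual code.

```go
package handlers

import "github.com/gorilla/websocket"

// Client fields assumed for this sketch.
type Client struct {
	conn *websocket.Conn
	send chan []byte
}

// writePump is the single writer: every frame, including the close frame
// triggered by DisconnectUser's nil sentinel, goes through this goroutine.
func (c *Client) writePump() {
	defer c.conn.Close()
	for msg := range c.send {
		if msg == nil { // nil sentinel from DisconnectUser (#7041)
			_ = c.conn.WriteMessage(websocket.CloseMessage,
				websocket.FormatCloseMessage(websocket.CloseNormalClosure, "disconnected"))
			return
		}
		if err := c.conn.WriteMessage(websocket.TextMessage, msg); err != nil {
			return
		}
	}
	// send channel closed by the hub (#7042): exit and let defer close the conn.
}
```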
- #7047 — Close the terminalSizeQueue channel after StreamWithContext returns so the SPDY executor's internal goroutine calling Next() terminates.
- #7048 — Wait for the reader goroutine (<-done) before writing the exit message to ensure cleanup ordering.

Signed-off-by: Andy Anderson <[email protected]>
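For the #7047 item, a minimal sketch of a channel-backed terminal size queue: closing the channel makes Next return nil, which is the signal that stops the executor's resize goroutine. The sizeQueue type is illustrative; only the remotecommand.TerminalSize/TerminalSizeQueue names are the real client-go API.

```go
package handlers

import "k8s.io/client-go/tools/remotecommand"

// sizeQueue is a hypothetical channel-backed remotecommand.TerminalSizeQueue.
// Closing ch after StreamWithContext returns makes Next return nil, which is
// what the executor's internal resize goroutine needs in order to exit.
type sizeQueue struct {
	ch chan remotecommand.TerminalSize
}

func (q *sizeQueue) Next() *remotecommand.TerminalSize {
	size, ok := <-q.ch
	if !ok {
		return nil // queue closed (#7047): executor goroutine terminates
	}
	return &size
}
```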
…eflight, semaphore

- #7052 — prewarm() uses context cancellation so inner goroutines are cancelled on timeout instead of abandoned.
- #7053 — GetRuns() uses singleflight to coalesce concurrent cold-cache fetches into a single fetchAll call.
- #7054 — All nightly E2E GitHub API calls use resolveGitHubAPIBase() instead of hardcoded api.github.com, supporting GitHub Enterprise.
- #7055 — Error response bodies are read with io.LimitReader (10 KB cap) to prevent unbounded memory use on large HTML error pages.
- #7056 — classifyFailures uses a semaphore (maxConcurrentClassify=5) to cap concurrent detectGPUFailure goroutines.
- #7057 — getCachedOrFetchPRs uses singleflight to coalesce concurrent cold-cache paginated GitHub PR fetches.

Signed-off-by: Andy Anderson <[email protected]>
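For the #7056 item, a minimal sketch of the semaphore pattern: a buffered channel caps how many detectGPUFailure calls run at once. Only classifyFailures, detectGPUFailure, and maxConcurrentClassify are names taken from the commit message; the WorkflowRun type and the failedRuns argument are placeholders so the sketch stands on its own.

```go
package handlers

import "sync"

// Placeholder type and helper so the sketch compiles on its own.
type WorkflowRun struct{ ID int64 }

func detectGPUFailure(run WorkflowRun) { /* inspects job logs in the real code */ }

const maxConcurrentClassify = 5

// classifyFailures sketch (#7056): a buffered channel used as a semaphore
// allows at most maxConcurrentClassify goroutines to run detectGPUFailure
// concurrently, while a WaitGroup joins them before returning.
func classifyFailures(failedRuns []WorkflowRun) {
	sem := make(chan struct{}, maxConcurrentClassify)
	var wg sync.WaitGroup
	for _, run := range failedRuns {
		wg.Add(1)
		go func(run WorkflowRun) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release the slot
			detectGPUFailure(run)
		}(run)
	}
	wg.Wait()
}
```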
Pull request overview
This PR hardens backend streaming and background-work logic to prevent goroutine leaks, reduce cache stampedes, and improve safety around WebSocket/SSE framing and shutdown cleanup.
Changes:
- Improves WebSocket hub shutdown/disconnect behavior and broadcast slice safety.
- Strengthens SSE caching (user-scoped cache keys, singleflight coalescing) and clamps events stream limits.
- Hardens nightly E2E GitHub fetching (singleflight, GHE base resolution, bounded error-body reads, bounded concurrency) and exec session teardown ordering.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| pkg/api/server.go | Stops SSE cache evictor during server shutdown to avoid leaked background goroutines. |
| pkg/api/handlers/websocket.go | Adjusts hub broadcast iteration/copying, adds hub-wide close cleanup, and uses a nil sentinel to preserve single-writer WebSocket semantics on disconnect. |
| pkg/api/handlers/sse.go | Sanitizes SSE event names, scopes SSE cache by user, coalesces cold-cache fetches with singleflight, and clamps events stream limit. |
| pkg/api/handlers/nightly_e2e.go | Adds context-based prewarm timeout handling, singleflight for cold-cache fetch coalescing, GHE-aware GitHub API base, bounded error body reads, and concurrency limiting for failure classification. |
| pkg/api/handlers/feedback.go | Adds singleflight coalescing for PR cache cold fetches and defines the default GitHub API base constant used by resolver. |
| pkg/api/handlers/exec.go | Closes terminal resize queue and attempts to join the reader goroutine before writing the exit message. |
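For the sse.go sanitization noted in the table above (#7050 in the commit list), here is a minimal sketch of what stripping CR/LF from the event name looks like; the exact writeSSEEvent signature is an assumption.

```go
package handlers

import (
	"fmt"
	"io"
	"strings"
)

// Sketch of #7050: remove "\n" and "\r" from the event name before it is
// interpolated into the SSE frame, so a caller-influenced name cannot inject
// extra "event:" or "data:" lines into the stream.
func writeSSEEvent(w io.Writer, eventName string, data []byte) error {
	eventName = strings.NewReplacer("\n", "", "\r", "").Replace(eventName)
	_, err := fmt.Fprintf(w, "event: %s\ndata: %s\n\n", eventName, data)
	return err
}
```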
```go
// #7042 — Close every client.send channel so writer goroutines unblock
// immediately instead of waiting for TCP connections to be forcibly closed.
// Previously only Run's unregister case closed send channels, but Run exits
// as soon as h.done is closed — leaving every writer goroutine stranded.
func (h *Hub) Close() {
	h.closeOnce.Do(func() {
		close(h.done)

		h.mu.Lock()
		for client := range h.clients {
			close(client.send)
			delete(h.clients, client)
		}
		h.userIndex = make(map[uuid.UUID][]*Client)
		h.mu.Unlock()
```
Hub.Close closes all client.send channels from outside the Hub.Run goroutine. Hub.Run’s broadcast case can still be sending on client.send when Close runs, and sending on a concurrently closed channel will panic. Consider moving the per-client send-channel closures into Hub.Run (e.g., handle <-h.done by locking and closing all client.send before returning), or otherwise synchronize so Close cannot close channels while Run is still broadcasting.
Suggested change (drop the per-client close loop from Close; only signal h.done here):

```go
// Per-client send channels must not be closed here because Hub.Run may still
// be broadcasting to them until it observes h.done and exits. That cleanup
// must happen from Hub.Run's shutdown path so sends and closes are serialized.
func (h *Hub) Close() {
	h.closeOnce.Do(func() {
		close(h.done)
```
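For completeness, a minimal sketch of the Run-side shutdown the suggestion points to; the hub's field shapes (clients map, register/unregister/broadcast channels, RWMutex) are assumed from the surrounding snippets rather than taken from the actual file.

```go
// Sketch: Run is the only goroutine that sends on client.send, so it is also
// the safe place to close those channels. Close (above) only signals h.done;
// Run drains the client map under the lock and returns.
func (h *Hub) Run() {
	for {
		select {
		case <-h.done:
			h.mu.Lock()
			for client := range h.clients {
				close(client.send)
				delete(h.clients, client)
			}
			h.userIndex = make(map[uuid.UUID][]*Client)
			h.mu.Unlock()
			return
		case client := <-h.register:
			h.mu.Lock()
			h.clients[client] = true
			h.mu.Unlock()
		case client := <-h.unregister:
			h.mu.Lock()
			if _, ok := h.clients[client]; ok {
				delete(h.clients, client)
				close(client.send)
			}
			h.mu.Unlock()
		case msg := <-h.broadcast:
			// #7049: copy clients under RLock, then send outside the lock.
			h.mu.RLock()
			targets := make([]*Client, 0, len(h.clients))
			for client := range h.clients {
				targets = append(targets, client)
			}
			h.mu.RUnlock()
			for _, client := range targets {
				select {
				case client.send <- msg:
				default: // slow client: drop rather than block the hub
				}
			}
		}
	}
}
```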
```go
select {
case client.send <- nil: // nil sentinel triggers close in writer
default:
	// Channel full — force-close the connection so the reader/writer
	// goroutines exit on their next I/O call.
```
DisconnectUser does a non-blocking send on client.send. If the client is concurrently unregistered (unregister case closes client.send), this send can panic with “send on closed channel”. To make this robust, synchronize sends with the hub lock (e.g., hold h.mu.RLock during the send attempt / check client still registered) or wrap the send in a helper that safely recovers and falls back to client.closeConn().
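A minimal sketch of the lock-guarded option described above; trySendDisconnect is a hypothetical helper name, and it assumes the unregister path closes client.send only while holding the same h.mu.

```go
// Sketch: only attempt the nil-sentinel send while the client is verifiably
// still registered, so the send cannot race the unregister path that closes
// client.send under the same lock.
func (h *Hub) trySendDisconnect(client *Client) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	if _, ok := h.clients[client]; !ok {
		return // already unregistered; its send channel may be closed
	}
	select {
	case client.send <- nil: // nil sentinel triggers close in the writer
	default:
		client.closeConn() // channel full: force-close the connection instead
	}
}
```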
```go
// #7048 — Wait for the reader goroutine to finish before writing the
// exit message, ensuring cleanup ordering (close(stdinCh), execCancel)
// happens before the handler returns.
<-done
```
Waiting on <-done here can deadlock. If the remote exec finishes normally, executor.StreamWithContext can return while the reader goroutine is still blocked in c.ReadMessage(), so done is never closed and the handler never progresses to write the exit message (or return). You likely need to actively terminate the reader before waiting (e.g., call execCancel() and close the WebSocket / set a short read deadline, or wait with a timeout and then force-close).
Suggested change:

```go
// Unblock any reader goroutine that may still be stuck in ReadMessage()
// now that the remote exec session has completed.
if err := c.SetReadDeadline(time.Now()); err != nil {
	slog.Debug("[Exec] failed to set read deadline while shutting down exec reader", "error", err)
}
// #7048 — Wait for the reader goroutine to finish before writing the
// exit message, ensuring cleanup ordering (close(stdinCh), execCancel)
// happens before the handler returns. Do not wait forever: if ReadMessage()
// does not unblock promptly, continue so the handler cannot deadlock.
select {
case <-done:
case <-time.After(250 * time.Millisecond):
	slog.Warn("[Exec] timed out waiting for reader goroutine to exit after exec completion")
}
```
```go
		default:
		}
		runs, err := h.fetchWorkflowRuns(wf)
		ch <- result{idx: idx, runs: runs, err: err}
	}(i, wf)
```
fetchAllWithContext is intended to cancel in-flight work on ctx.Done(), but the goroutine calls h.fetchWorkflowRuns(wf) which creates requests without ctx. If ctx is cancelled after this point, the HTTP call won’t be interrupted and fetchAllWithContext can still block until the client timeout. Consider threading ctx through (fetchWorkflowRuns(ctx, wf) + http.NewRequestWithContext) and checking ctx in any loops.
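A minimal sketch of the ctx plumbing the comment asks for; the return type, the h.httpClient and h.repo fields, and the workflow-runs URL shape are illustrative, while http.NewRequestWithContext, io.LimitReader, and resolveGitHubAPIBase (the helper added in this PR) are real names. It assumes the usual context, fmt, io, and net/http imports.

```go
// Sketch: build the request with the caller's context so cancelling ctx
// (for example, the prewarm timeout in #7052) aborts the in-flight HTTP call.
func (h *NightlyE2EHandler) fetchWorkflowRuns(ctx context.Context, workflowFile string) ([]byte, error) {
	url := fmt.Sprintf("%s/repos/%s/actions/workflows/%s/runs",
		resolveGitHubAPIBase(), h.repo, workflowFile)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := h.httpClient.Do(req)
	if err != nil {
		return nil, err // includes context.Canceled / DeadlineExceeded
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// #7055: cap error-body reads so a large HTML error page stays bounded.
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 10*1024))
		return nil, fmt.Errorf("GitHub API returned %d: %s", resp.StatusCode, body)
	}
	return io.ReadAll(resp.Body)
}
```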
```go
// fetchAllWithContext is the context-aware version of fetchAll (#7052).
// When ctx is cancelled, HTTP requests made by sub-goroutines will be
// interrupted instead of running to completion.
func (h *NightlyE2EHandler) fetchAllWithContext(ctx context.Context) (*NightlyE2EResponse, error) {
```
The comment on fetchAllWithContext says ctx cancellation will interrupt HTTP requests, but fetchWorkflowRuns/detectGPUFailure (and other helpers called by fetchAll) currently build requests without ctx, so cancellation won’t actually stop in-flight GitHub calls. Either plumb ctx through to all HTTP calls (NewRequestWithContext / req.WithContext) or adjust the comment and timeout strategy accordingly.
```go
// #7045 — Use singleflight to coalesce concurrent cold-cache
// fetches for the same cache key into one Kubernetes API call.
v, fetchErr, _ := sseFetchGroup.Do(cKey, func() (interface{}, error) {
	return fetchFn(ctx, clusterName)
})
```
sseFetchGroup.Do uses the per-request ctx (derived from streamCtx) inside the singleflight function. If the first caller’s ctx is cancelled (e.g., client disconnect) it will cancel the shared Kubernetes API call and propagate that error to other concurrent callers that are still active. To avoid cross-request cancellation, consider running the shared fetch with a context independent of any one client (e.g., context.WithTimeout(context.Background(), timeout)) and have each caller select on its own ctx while waiting (Group.DoChan + select).
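A minimal sketch of the DoChan variant the comment recommends, assuming it sits inside a helper that returns (interface{}, error); the 30-second timeout is an assumption, while sseFetchGroup, cKey, fetchFn, and clusterName come from the snippet above.

```go
// The winning caller runs the fetch under its own context, so no single
// client's disconnect can cancel it for the other waiters.
resCh := sseFetchGroup.DoChan(cKey, func() (interface{}, error) {
	fetchCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return fetchFn(fetchCtx, clusterName)
})

// Each caller waits on its own ctx; giving up here does not stop the shared fetch.
select {
case res := <-resCh:
	if res.Err != nil {
		return nil, res.Err
	}
	return res.Val, nil
case <-ctx.Done():
	return nil, ctx.Err()
}
```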
Closes #7041
Closes #7042
Closes #7043
Closes #7044
Closes #7045
Closes #7046
Closes #7047
Closes #7048
Closes #7049
Closes #7050
Closes #7052
Closes #7053
Closes #7054
Closes #7055
Closes #7056
Closes #7057
Summary
SSE/WebSocket (8 issues): #7041, #7042, #7043, #7044, #7045, #7046, #7049, #7050
Exec (2 issues): #7047, #7048
Nightly E2E + Caching (6 issues): #7052, #7053, #7054, #7055, #7056, #7057
Test plan
- `go build ./...` passes
- `go vet ./...` passes