fix: Go backend batch 2 — error handling, validation, constants, race fixes#7815
fix: Go backend batch 2 — error handling, validation, constants, race fixes#7815clubanderson merged 2 commits intomainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
✅ Deploy Preview for kubestellarconsole canceled.
|
|
👋 Hey @clubanderson — thanks for opening this PR!
This is an automated message. |
There was a problem hiding this comment.
Pull request overview
This PR improves robustness of the Go backend by tightening validation, standardizing HTTP status/error handling, and addressing concurrency/race issues (notably in exec/WebSocket handling and operator caching).
Changes:
- Fixes exec/WebSocket error handling and shutdown ordering to avoid panics and better surface failures.
- Standardizes several HTTP status codes/constants and introduces clearer error/timeout constants.
- Adds validation improvements (CardID UUID, basic email format) and concurrency coalescing (singleflight) to reduce duplicate work.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/api/handlers/user.go | Adds basic email format validation on update. |
| pkg/api/handlers/topology.go | Replaces raw 503 with Fiber status constant. |
| pkg/api/handlers/orbit.go | Makes orbit mission IDs more collision-resistant via ms timestamp + random suffix. |
| pkg/api/handlers/notifications.go | Uses middleware helper for user ID extraction. |
| pkg/api/handlers/mcp_workloads.go | Logs per-cluster GetServices failures during aggregation. |
| pkg/api/handlers/mcp_resources.go | Replaces raw 400 literals with fiber.StatusBadRequest. |
| pkg/api/handlers/gitops.go | Adds singleflight.Group to coalesce operator cache-miss fetches. |
| pkg/api/handlers/exec.go | Handles exec_started write failures; fixes channel close ordering to prevent send-on-closed panic. |
| pkg/api/handlers/events.go | Returns 400 on invalid card_id UUID instead of silently ignoring. |
| pkg/api/handlers/crds.go | Uses 503 on cluster discovery failures; updates service-unavailable constant usage. |
| pkg/api/handlers/analytics_proxy.go | Extracts upstream timeout constant for the analytics proxy client. |
| pkg/api/handlers/admission_webhooks.go | Uses 503 on cluster discovery failures; updates service-unavailable constant usage. |
| pkg/agent/server_http.go | Returns proper non-200 error status via writeJSONError for k8s fetch failures. |
| pkg/agent/local_clusters.go | Adds debug logging when progress broadcast is dropped; logs kill errors on timeout. |
| entrypoint.sh | Switches shebang to bash to support bash-only wait -n. |
| if updates.Email != "" { | ||
| if !strings.Contains(updates.Email, "@") { | ||
| return fiber.NewError(fiber.StatusBadRequest, "invalid email format") | ||
| } |
There was a problem hiding this comment.
New behavior returns 400 for emails missing '@', but user handler tests don’t cover this validation path. Add a unit test asserting UpdateCurrentUser rejects an invalid email (and does not call UpdateUser) to prevent regressions.
| if input.CardID != "" { | ||
| cardID, err := uuid.Parse(input.CardID) | ||
| if err == nil { | ||
| event.CardID = &cardID | ||
| if err != nil { | ||
| return fiber.NewError(fiber.StatusBadRequest, "invalid card_id: must be a valid UUID") | ||
| } | ||
| event.CardID = &cardID |
There was a problem hiding this comment.
New behavior returns 400 for invalid card_id UUIDs, but the existing events handler tests don’t cover this negative case. Add a test that posts an invalid card_id and asserts a 400 response (and that RecordEvent is not called).
| @@ -1,4 +1,4 @@ | |||
| #!/bin/sh | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
Switching the shebang to #!/bin/bash will break the published container image: the runtime stage in Dockerfile is Alpine and does not install bash, so the entrypoint will fail with “/bin/bash: not found”. Either install bash in the runtime image or rewrite the script to be POSIX sh compatible (avoid wait -n).
| operators, err := h.fetchOperatorsFromCluster(ctx, cluster) | ||
| if err != nil { |
There was a problem hiding this comment.
singleflight.Do runs the fetch using the first caller’s ctx. If that request context is canceled (client disconnect/timeout), it can abort the shared fetch and cause other concurrent callers (with still-valid contexts) to receive empty results/errors. Consider detaching cancellation for the shared work (e.g., context.WithoutCancel(ctx) plus an internal timeout) so one caller can’t inadvertently cancel the coalesced fetch.
| if m.ID == "" { | ||
| m.ID = "orbit-" + time.Now().Format("20060102150405") | ||
| // Use millisecond-precision timestamp plus a random suffix to avoid | ||
| // collisions when two missions are created in the same second (#7800). | ||
| m.ID = "orbit-" + time.Now().Format("20060102150405.000") + "-" + generateOrbitSuffix() | ||
| } |
There was a problem hiding this comment.
Orbit mission IDs now include a millisecond timestamp plus generateOrbitSuffix(), but there’s no test ensuring the ID is auto-generated when missing and includes the expected suffix/format. Adding a unit test for CreateMission ID generation (and basic uniqueness across multiple creates) would lock in the collision fix.
| if _, err := rand.Read(b); err != nil { | ||
| // Fallback: use nanosecond component if crypto/rand fails | ||
| return time.Now().Format("0000") | ||
| } |
There was a problem hiding this comment.
The crypto/rand failure fallback is broken: time.Now().Format("0000") returns the literal string "0000" (Go’s time.Format uses a reference layout), so the suffix becomes non-random and can reintroduce ID collisions. Use a real entropy/time-based fallback (e.g., derive 4 hex chars from time.Now().UnixNano() or use math/rand seeded once) instead of time.Format("0000").
| operators, err := h.fetchOperatorsFromCluster(ctx, cluster) | ||
| if err != nil { |
There was a problem hiding this comment.
Same singleflight/context issue here: the coalesced fetch uses the first caller’s ctx, so cancellation/timeout in one request can cancel the shared fetch for other concurrent callers. Detach cancellation for the shared work (while still honoring an internal timeout) to avoid cross-request interference.
| clusters, listErr = h.k8sClient.ListClusters(ctx) | ||
| if listErr != nil { | ||
| return c.Status(500).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) | ||
| return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) |
There was a problem hiding this comment.
This branch returns fiber.StatusServiceUnavailable directly instead of using the file-level statusServiceUnavailableCRD constant that’s already defined and used above. Using the constant here too keeps the status code policy centralized and avoids divergence if it changes again.
| return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) | |
| return c.Status(statusServiceUnavailableCRD).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) |
| clusters, listErr = h.k8sClient.ListClusters(ctx) | ||
| if listErr != nil { | ||
| return c.Status(500).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) | ||
| return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) |
There was a problem hiding this comment.
This branch returns fiber.StatusServiceUnavailable directly instead of using the file-level statusServiceUnavailableWebhook constant defined just above. Consider using the constant here as well to keep the status code choice consistent and maintainable.
| return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) | |
| return c.Status(statusServiceUnavailableWebhook).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false}) |
Signed-off-by: Andy Anderson <[email protected]>
|
Thank you for your contribution! Your PR has been merged. Check out what's new:
Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey |
|
Post-merge build verification passed ✅ Both Go and frontend builds compiled successfully against merge commit |
✅ Post-Merge Verification: passedCommit: |
Two production-breaking bugs and four cleanups/tests flagged by Copilot on the squash-merged #7815 (tracked by #7855): CRITICAL - entrypoint.sh: revert shebang from #!/bin/bash back to #!/bin/sh. The runtime image is alpine:3.20 which ships BusyBox ash — bash is not installed, so the container would fail with "/bin/bash: not found". Rewrote the `wait -n` bashism as a POSIX poll loop so the script also runs under plain ash. - pkg/api/handlers/orbit.go: the crypto/rand failure fallback used `time.Now().Format("0000")`, which returns the literal string "0000" because "0" is not a valid reference-layout token in Go's time package. That made the fallback suffix constant and guessable. Replaced with `fmt.Sprintf("%04x", time.Now().UnixNano()&0xffff)` so the fallback still produces variable hex. CLEANUPS - pkg/api/handlers/crds.go: use the existing file-level statusServiceUnavailableCRD constant instead of the literal. - pkg/api/handlers/admission_webhooks.go: same for statusServiceUnavailableWebhook. - pkg/api/handlers/gitops.go: detach the singleflight fetch context from the caller's ctx using context.WithoutCancel (Go 1.21+) plus a fresh WithTimeout, so one client disconnect no longer aborts the shared fetch for every waiter coalesced behind it. TESTS - pkg/api/handlers/user_test.go: verify UpdateCurrentUser returns 400 on an email missing "@" and does NOT call store.UpdateUser. - pkg/api/handlers/events_test.go: verify POST /events with an invalid card_id UUID returns 400 and does NOT insert. - pkg/api/handlers/orbit_test.go: verify an auto-generated mission ID matches orbit-<14-digit-datetime>.<3-digit-ms>-<4-hex-suffix>, and add a direct unit test for generateOrbitSuffix confirming the 4-hex shape plus variability (catches the old "0000" regression). Fixes #7855 Signed-off-by: Andy Anderson <[email protected]>
Two production-breaking bugs and four cleanups/tests flagged by Copilot on the squash-merged #7815 (tracked by #7855): CRITICAL - entrypoint.sh: revert shebang from #!/bin/bash back to #!/bin/sh. The runtime image is alpine:3.20 which ships BusyBox ash — bash is not installed, so the container would fail with "/bin/bash: not found". Rewrote the `wait -n` bashism as a POSIX poll loop so the script also runs under plain ash. - pkg/api/handlers/orbit.go: the crypto/rand failure fallback used `time.Now().Format("0000")`, which returns the literal string "0000" because "0" is not a valid reference-layout token in Go's time package. That made the fallback suffix constant and guessable. Replaced with `fmt.Sprintf("%04x", time.Now().UnixNano()&0xffff)` so the fallback still produces variable hex. CLEANUPS - pkg/api/handlers/crds.go: use the existing file-level statusServiceUnavailableCRD constant instead of the literal. - pkg/api/handlers/admission_webhooks.go: same for statusServiceUnavailableWebhook. - pkg/api/handlers/gitops.go: detach the singleflight fetch context from the caller's ctx using context.WithoutCancel (Go 1.21+) plus a fresh WithTimeout, so one client disconnect no longer aborts the shared fetch for every waiter coalesced behind it. TESTS - pkg/api/handlers/user_test.go: verify UpdateCurrentUser returns 400 on an email missing "@" and does NOT call store.UpdateUser. - pkg/api/handlers/events_test.go: verify POST /events with an invalid card_id UUID returns 400 and does NOT insert. - pkg/api/handlers/orbit_test.go: verify an auto-generated mission ID matches orbit-<14-digit-datetime>.<3-digit-ms>-<4-hex-suffix>, and add a direct unit test for generateOrbitSuffix confirming the 4-hex shape plus variability (catches the old "0000" regression). Fixes #7855 Signed-off-by: Andy Anderson <[email protected]>
Summary
sizeQueue.chexec_startedWebSocket write fails instead of silently discardingslog.WarnwhenGetServicesfails for a cluster in multi-cluster aggregationKill()error inrunWithTimeoutto surface zombie process failuresslog.DebugwhenbroadcastProgresshas no listener, replacing silent no-opsingleflight.Groupin operator cache to prevent check-then-act double-fetch racefiber.StatusServiceUnavailablefiber.StatusServiceUnavailable503literal in topology.go withfiber.StatusServiceUnavailableanalyticsUpstreamTimeoutconstant for analytics proxy HTTP client@) before storingmiddleware.GetUserID(c)in notification config handlers instead of rawc.Locals400literals in mcp_resources.go withfiber.StatusBadRequestwriteJSONErrorwith HTTP 503 instead of implicit 200#!/bin/shto#!/bin/bash(uses bash-onlywait -n)Test plan
go build ./...passesgo vet ./...passesFixes #7778, #7779, #7780, #7781, #7782, #7783, #7793, #7794, #7795, #7796, #7797, #7798, #7799, #7800, #7801, #7802, #7803
🤖 Generated with Claude Code