Skip to content

fix: Go backend batch 2 — error handling, validation, constants, race fixes#7815

Merged
clubanderson merged 2 commits intomainfrom
fix/go-backend-batch2
Apr 13, 2026
Merged

fix: Go backend batch 2 — error handling, validation, constants, race fixes#7815
clubanderson merged 2 commits intomainfrom
fix/go-backend-batch2

Conversation

@clubanderson
Copy link
Copy Markdown
Collaborator

Summary

Test plan

  • go build ./... passes
  • go vet ./... passes
  • CI build/lint passes
  • Verify exec WebSocket sessions work without panics
  • Verify operator listing doesn't double-fetch under concurrent load

Fixes #7778, #7779, #7780, #7781, #7782, #7783, #7793, #7794, #7795, #7796, #7797, #7798, #7799, #7800, #7801, #7802, #7803

🤖 Generated with Claude Code

@clubanderson clubanderson added the ai-generated Pull request generated by AI label Apr 13, 2026
Copilot AI review requested due to automatic review settings April 13, 2026 23:19
@kubestellar-prow kubestellar-prow Bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 13, 2026
@kubestellar-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 13, 2026

Deploy Preview for kubestellarconsole canceled.

Name Link
🔨 Latest commit e215c26
🔍 Latest deploy log https://app.netlify.com/projects/kubestellarconsole/deploys/69dd7c8e8b0fbd00080d2003

@github-actions
Copy link
Copy Markdown
Contributor

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@kubestellar-prow kubestellar-prow Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 13, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves robustness of the Go backend by tightening validation, standardizing HTTP status/error handling, and addressing concurrency/race issues (notably in exec/WebSocket handling and operator caching).

Changes:

  • Fixes exec/WebSocket error handling and shutdown ordering to avoid panics and better surface failures.
  • Standardizes several HTTP status codes/constants and introduces clearer error/timeout constants.
  • Adds validation improvements (CardID UUID, basic email format) and concurrency coalescing (singleflight) to reduce duplicate work.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pkg/api/handlers/user.go Adds basic email format validation on update.
pkg/api/handlers/topology.go Replaces raw 503 with Fiber status constant.
pkg/api/handlers/orbit.go Makes orbit mission IDs more collision-resistant via ms timestamp + random suffix.
pkg/api/handlers/notifications.go Uses middleware helper for user ID extraction.
pkg/api/handlers/mcp_workloads.go Logs per-cluster GetServices failures during aggregation.
pkg/api/handlers/mcp_resources.go Replaces raw 400 literals with fiber.StatusBadRequest.
pkg/api/handlers/gitops.go Adds singleflight.Group to coalesce operator cache-miss fetches.
pkg/api/handlers/exec.go Handles exec_started write failures; fixes channel close ordering to prevent send-on-closed panic.
pkg/api/handlers/events.go Returns 400 on invalid card_id UUID instead of silently ignoring.
pkg/api/handlers/crds.go Uses 503 on cluster discovery failures; updates service-unavailable constant usage.
pkg/api/handlers/analytics_proxy.go Extracts upstream timeout constant for the analytics proxy client.
pkg/api/handlers/admission_webhooks.go Uses 503 on cluster discovery failures; updates service-unavailable constant usage.
pkg/agent/server_http.go Returns proper non-200 error status via writeJSONError for k8s fetch failures.
pkg/agent/local_clusters.go Adds debug logging when progress broadcast is dropped; logs kill errors on timeout.
entrypoint.sh Switches shebang to bash to support bash-only wait -n.

Comment thread pkg/api/handlers/user.go
Comment on lines 55 to +58
if updates.Email != "" {
if !strings.Contains(updates.Email, "@") {
return fiber.NewError(fiber.StatusBadRequest, "invalid email format")
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behavior returns 400 for emails missing '@', but user handler tests don’t cover this validation path. Add a unit test asserting UpdateCurrentUser rejects an invalid email (and does not call UpdateUser) to prevent regressions.

Copilot uses AI. Check for mistakes.
Comment on lines 42 to +47
if input.CardID != "" {
cardID, err := uuid.Parse(input.CardID)
if err == nil {
event.CardID = &cardID
if err != nil {
return fiber.NewError(fiber.StatusBadRequest, "invalid card_id: must be a valid UUID")
}
event.CardID = &cardID
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behavior returns 400 for invalid card_id UUIDs, but the existing events handler tests don’t cover this negative case. Add a test that posts an invalid card_id and asserts a 400 response (and that RecordEvent is not called).

Copilot uses AI. Check for mistakes.
Comment thread entrypoint.sh
@@ -1,4 +1,4 @@
#!/bin/sh
#!/bin/bash
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching the shebang to #!/bin/bash will break the published container image: the runtime stage in Dockerfile is Alpine and does not install bash, so the entrypoint will fail with “/bin/bash: not found”. Either install bash in the runtime image or rewrite the script to be POSIX sh compatible (avoid wait -n).

Copilot uses AI. Check for mistakes.
Comment on lines +755 to +756
operators, err := h.fetchOperatorsFromCluster(ctx, cluster)
if err != nil {
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

singleflight.Do runs the fetch using the first caller’s ctx. If that request context is canceled (client disconnect/timeout), it can abort the shared fetch and cause other concurrent callers (with still-valid contexts) to receive empty results/errors. Consider detaching cancellation for the shared work (e.g., context.WithoutCancel(ctx) plus an internal timeout) so one caller can’t inadvertently cancel the coalesced fetch.

Copilot uses AI. Check for mistakes.
Comment thread pkg/api/handlers/orbit.go
Comment on lines 152 to 156
if m.ID == "" {
m.ID = "orbit-" + time.Now().Format("20060102150405")
// Use millisecond-precision timestamp plus a random suffix to avoid
// collisions when two missions are created in the same second (#7800).
m.ID = "orbit-" + time.Now().Format("20060102150405.000") + "-" + generateOrbitSuffix()
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orbit mission IDs now include a millisecond timestamp plus generateOrbitSuffix(), but there’s no test ensuring the ID is auto-generated when missing and includes the expected suffix/format. Adding a unit test for CreateMission ID generation (and basic uniqueness across multiple creates) would lock in the collision fix.

Copilot uses AI. Check for mistakes.
Comment thread pkg/api/handlers/orbit.go
Comment on lines +23 to +26
if _, err := rand.Read(b); err != nil {
// Fallback: use nanosecond component if crypto/rand fails
return time.Now().Format("0000")
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crypto/rand failure fallback is broken: time.Now().Format("0000") returns the literal string "0000" (Go’s time.Format uses a reference layout), so the suffix becomes non-random and can reintroduce ID collisions. Use a real entropy/time-based fallback (e.g., derive 4 hex chars from time.Now().UnixNano() or use math/rand seeded once) instead of time.Format("0000").

Copilot uses AI. Check for mistakes.
Comment on lines +836 to +837
operators, err := h.fetchOperatorsFromCluster(ctx, cluster)
if err != nil {
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same singleflight/context issue here: the coalesced fetch uses the first caller’s ctx, so cancellation/timeout in one request can cancel the shared fetch for other concurrent callers. Detach cancellation for the shared work (while still honoring an internal timeout) to avoid cross-request interference.

Copilot uses AI. Check for mistakes.
Comment thread pkg/api/handlers/crds.go
clusters, listErr = h.k8sClient.ListClusters(ctx)
if listErr != nil {
return c.Status(500).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})
return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch returns fiber.StatusServiceUnavailable directly instead of using the file-level statusServiceUnavailableCRD constant that’s already defined and used above. Using the constant here too keeps the status code policy centralized and avoids divergence if it changes again.

Suggested change
return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})
return c.Status(statusServiceUnavailableCRD).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})

Copilot uses AI. Check for mistakes.
clusters, listErr = h.k8sClient.ListClusters(ctx)
if listErr != nil {
return c.Status(500).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})
return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch returns fiber.StatusServiceUnavailable directly instead of using the file-level statusServiceUnavailableWebhook constant defined just above. Consider using the constant here as well to keep the status code choice consistent and maintainable.

Suggested change
return c.Status(fiber.StatusServiceUnavailable).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})
return c.Status(statusServiceUnavailableWebhook).JSON(fiber.Map{"error": "cluster discovery failed", "isDemoData": false})

Copilot uses AI. Check for mistakes.
@clubanderson clubanderson merged commit 6fb2ada into main Apr 13, 2026
35 of 39 checks passed
@kubestellar-prow kubestellar-prow Bot deleted the fix/go-backend-batch2 branch April 13, 2026 23:47
@github-actions
Copy link
Copy Markdown
Contributor

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

This was referenced Apr 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Post-merge build verification passed

Both Go and frontend builds compiled successfully against merge commit 6fb2ada3724325a98e11ce696f77127393bb6434.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Post-Merge Verification: passed

Commit: 6fb2ada3724325a98e11ce696f77127393bb6434
Specs run: smoke.spec.ts
Report: https://github.com/kubestellar/console/actions/runs/24372854917

clubanderson added a commit that referenced this pull request Apr 14, 2026
Two production-breaking bugs and four cleanups/tests flagged by Copilot
on the squash-merged #7815 (tracked by #7855):

CRITICAL

- entrypoint.sh: revert shebang from #!/bin/bash back to #!/bin/sh.
  The runtime image is alpine:3.20 which ships BusyBox ash — bash is
  not installed, so the container would fail with
  "/bin/bash: not found". Rewrote the `wait -n` bashism as a POSIX
  poll loop so the script also runs under plain ash.

- pkg/api/handlers/orbit.go: the crypto/rand failure fallback used
  `time.Now().Format("0000")`, which returns the literal string
  "0000" because "0" is not a valid reference-layout token in Go's
  time package. That made the fallback suffix constant and guessable.
  Replaced with `fmt.Sprintf("%04x", time.Now().UnixNano()&0xffff)`
  so the fallback still produces variable hex.

CLEANUPS

- pkg/api/handlers/crds.go: use the existing file-level
  statusServiceUnavailableCRD constant instead of the literal.
- pkg/api/handlers/admission_webhooks.go: same for
  statusServiceUnavailableWebhook.
- pkg/api/handlers/gitops.go: detach the singleflight fetch context
  from the caller's ctx using context.WithoutCancel (Go 1.21+) plus
  a fresh WithTimeout, so one client disconnect no longer aborts the
  shared fetch for every waiter coalesced behind it.

TESTS

- pkg/api/handlers/user_test.go: verify UpdateCurrentUser returns 400
  on an email missing "@" and does NOT call store.UpdateUser.
- pkg/api/handlers/events_test.go: verify POST /events with an
  invalid card_id UUID returns 400 and does NOT insert.
- pkg/api/handlers/orbit_test.go: verify an auto-generated mission
  ID matches orbit-<14-digit-datetime>.<3-digit-ms>-<4-hex-suffix>,
  and add a direct unit test for generateOrbitSuffix confirming the
  4-hex shape plus variability (catches the old "0000" regression).

Fixes #7855

Signed-off-by: Andy Anderson <[email protected]>
clubanderson added a commit that referenced this pull request Apr 14, 2026
Two production-breaking bugs and four cleanups/tests flagged by Copilot
on the squash-merged #7815 (tracked by #7855):

CRITICAL

- entrypoint.sh: revert shebang from #!/bin/bash back to #!/bin/sh.
  The runtime image is alpine:3.20 which ships BusyBox ash — bash is
  not installed, so the container would fail with
  "/bin/bash: not found". Rewrote the `wait -n` bashism as a POSIX
  poll loop so the script also runs under plain ash.

- pkg/api/handlers/orbit.go: the crypto/rand failure fallback used
  `time.Now().Format("0000")`, which returns the literal string
  "0000" because "0" is not a valid reference-layout token in Go's
  time package. That made the fallback suffix constant and guessable.
  Replaced with `fmt.Sprintf("%04x", time.Now().UnixNano()&0xffff)`
  so the fallback still produces variable hex.

CLEANUPS

- pkg/api/handlers/crds.go: use the existing file-level
  statusServiceUnavailableCRD constant instead of the literal.
- pkg/api/handlers/admission_webhooks.go: same for
  statusServiceUnavailableWebhook.
- pkg/api/handlers/gitops.go: detach the singleflight fetch context
  from the caller's ctx using context.WithoutCancel (Go 1.21+) plus
  a fresh WithTimeout, so one client disconnect no longer aborts the
  shared fetch for every waiter coalesced behind it.

TESTS

- pkg/api/handlers/user_test.go: verify UpdateCurrentUser returns 400
  on an email missing "@" and does NOT call store.UpdateUser.
- pkg/api/handlers/events_test.go: verify POST /events with an
  invalid card_id UUID returns 400 and does NOT insert.
- pkg/api/handlers/orbit_test.go: verify an auto-generated mission
  ID matches orbit-<14-digit-datetime>.<3-digit-ms>-<4-hex-suffix>,
  and add a direct unit test for generateOrbitSuffix confirming the
  4-hex shape plus variability (catches the old "0000" regression).

Fixes #7855

Signed-off-by: Andy Anderson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-generated Pull request generated by AI dco-signoff: yes Indicates the PR's author has signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

close(sizeQueue.ch) called before <-done causing guaranteed send on closed channel panic

2 participants