🐛 fix(gpu): centralize GPU resource names + fix multi-accelerator sum + tests (Issue 9204 follow-up)#9216
Conversation
Addresses three Copilot review findings on kubestellar/console PR #9204:
### 1. Multi-accelerator undercount (real bug)
Previously, `GetPods` in `client_resources.go` did:

```go
for name, qty := range c.Resources.Requests {
	if isGPUResourceName(name) {
		ci.GPURequested = int(qty.Value())
	}
}
```
A container requesting more than one accelerator (e.g., nvidia.com/gpu=1
+ habana.ai/gaudi=2) had ci.GPURequested overwritten per matching name,
so the final value depended on Go's randomized map iteration order and
undercounted the total.
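A minimal sketch of the accumulation fix in isolation (the PR goes further and centralizes this into a shared helper, described under Fix below); `c`, `ci`, and `isGPUResourceName` are assumed from the surrounding `GetPods` code:

```go
// Sketch only: accumulate across all matching accelerator names instead of
// overwriting, so nvidia.com/gpu=1 + habana.ai/gaudi=2 yields 3 regardless
// of map iteration order.
ci.GPURequested = 0
for name, qty := range c.Resources.Requests {
	if isGPUResourceName(name) {
		ci.GPURequested += int(qty.Value())
	}
}
```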
### 2. Duplicate list between client_resources.go and client_gpu.go
Both files hard-coded the same six resource names. Any future addition
risked drifting one path behind the other — the exact class of bug Issue
9090 originally reported.
### 3. No test coverage for the new names
The new Intel and Gaudi names landed without a unit test that would fail
if a future refactor dropped one.
### Fix
Added `pkg/k8s/gpu_resources.go` as the single source of truth:
- `GPUResourceNames` — exported slice
- `IsGPUResourceName(name)` — exact-match predicate
- `SumGPURequested(rl)` — sums across ALL known names (fixes #1)
`client_resources.go` now calls `SumGPURequested` on Requests, then on Limits
as a fallback. `client_gpu.go` replaces its six repeated `if ok` blocks
with a single `SumGPURequested` call too, so both pod and node paths read
from the same list.
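As a rough sketch of what the shared helper and the Requests-then-Limits fallback could look like (the helper name and behavior come from this PR; the caller shape and its field names here are illustrative assumptions, and `GPUResourceNames` is the list defined in `gpu_resources.go`):

```go
package k8s

import corev1 "k8s.io/api/core/v1"

// SumGPURequested sums the quantities of every known GPU/accelerator resource
// name present in rl. Iterating the shared GPUResourceNames slice keeps the
// result independent of Go's randomized map iteration order.
func SumGPURequested(rl corev1.ResourceList) int {
	total := 0
	for _, name := range GPUResourceNames {
		if qty, ok := rl[name]; ok {
			total += int(qty.Value())
		}
	}
	return total
}

// Caller-side pattern (illustrative): prefer Requests, fall back to Limits
// when no accelerator request is set on the container.
func gpuRequestedForContainer(c corev1.Container) int {
	if n := SumGPURequested(c.Resources.Requests); n > 0 {
		return n
	}
	return SumGPURequested(c.Resources.Limits)
}
```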
`gpu_resources_test.go` covers:
- `IsGPUResourceName` on all six known names (true) and 7 adjacent names, including `google.com/tpu`, `intel.com/xpu`, and `nvidia.com/mig-4g.20gb` (false)
- `SumGPURequested` on empty, single-name, multi-accelerator (the regression case — returns 3 for nvidia=1 + gaudi=2; see the sketch below), all-six-summed, and non-GPU resources ignored.
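A sketch of the multi-accelerator regression case as a standalone test (the actual test in the PR is table-driven; this single-case form is just for illustration):

```go
package k8s

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func TestSumGPURequestedMultiAccelerator(t *testing.T) {
	rl := corev1.ResourceList{
		"nvidia.com/gpu":  resource.MustParse("1"),
		"habana.ai/gaudi": resource.MustParse("2"),
		"cpu":             resource.MustParse("500m"), // non-GPU: must be ignored
	}
	if got := SumGPURequested(rl); got != 3 {
		t.Fatalf("SumGPURequested() = %d, want 3", got)
	}
}
```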
Signed-off-by: Andrew Anderson <[email protected]>
✅ Deploy Preview for kubestellarconsole ready!
👋 Hey @clubanderson — thanks for opening this PR!
This is an automated message.
Pull request overview
This PR follows up on GPU accounting issues by centralizing the Kubernetes GPU/accelerator resource-name list and fixing pod/node GPU requested totals when multiple accelerator resources are requested in the same container.
Changes:
- Added shared GPU resource helpers (`GPUResourceNames`, `IsGPUResourceName`, `SumGPURequested`) and updated callers to use the shared summation logic.
- Fixed multi-accelerator undercount by summing across all known GPU resource names (instead of overwriting per match).
- Added unit tests covering known/unknown GPU resource names and regression coverage for multi-accelerator sums.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/k8s/gpu_resources.go | Introduces centralized GPU resource-name list + helper functions, including deterministic summation across all known names. |
| pkg/k8s/gpu_resources_test.go | Adds tests for name classification and SumGPURequested regression coverage (multi-accelerator sum). |
| pkg/k8s/client_resources.go | Switches pod container GPURequested calculation to SumGPURequested (Requests, fallback to Limits). |
| pkg/k8s/client_gpu.go | Replaces repeated per-resource checks with a single SumGPURequested call for node allocation accounting. |
```go
var GPUResourceNames = []corev1.ResourceName{
	"nvidia.com/gpu",
	"amd.com/gpu",
	"gpu.intel.com/i915",
	"habana.ai/gaudi",
	"habana.ai/gaudi2",
	"intel.com/gaudi",
}
```
GPUResourceNames is exported as a mutable slice. Since it’s only referenced within the k8s package, consider making the list unexported (and, if needed externally, expose a function that returns a defensive copy) to prevent accidental mutation and reduce public API surface.
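If that suggestion is taken, one possible shape (illustrative only; not what this PR currently ships) is an unexported slice with a copy-returning accessor:

```go
// gpuResourceNames stays unexported so callers outside the package cannot
// mutate the canonical list in place.
var gpuResourceNames = []corev1.ResourceName{
	"nvidia.com/gpu",
	"amd.com/gpu",
	"gpu.intel.com/i915",
	"habana.ai/gaudi",
	"habana.ai/gaudi2",
	"intel.com/gaudi",
}

// GPUResourceNames returns a defensive copy, letting external packages read
// the list without being able to change it.
func GPUResourceNames() []corev1.ResourceName {
	out := make([]corev1.ResourceName, len(gpuResourceNames))
	copy(out, gpuResourceNames)
	return out
}
```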
| "nvidia.com/gpu", | ||
| "amd.com/gpu", | ||
| "gpu.intel.com/i915", | ||
| "habana.ai/gaudi", | ||
| "habana.ai/gaudi2", | ||
| "intel.com/gaudi", |
The comment says to keep GPUResourceNames “sorted by vendor”, but the current order mixes Intel/Habana entries (Intel appears both before and after the Habana names). Either adjust the ordering to match the comment or update the comment to reflect the intended grouping.
| "nvidia.com/gpu", | |
| "amd.com/gpu", | |
| "gpu.intel.com/i915", | |
| "habana.ai/gaudi", | |
| "habana.ai/gaudi2", | |
| "intel.com/gaudi", | |
| "amd.com/gpu", | |
| "habana.ai/gaudi", | |
| "habana.ai/gaudi2", | |
| "gpu.intel.com/i915", | |
| "intel.com/gaudi", | |
| "nvidia.com/gpu", |
```go
// IsGPUResourceName reports whether the given Kubernetes resource name is a
// known GPU / AI accelerator. Exact match; no vendor-prefix heuristics.
func IsGPUResourceName(name corev1.ResourceName) bool {
	for _, n := range GPUResourceNames {
		if name == n {
			return true
		}
	}
	return false
}
```
IsGPUResourceName is exported but (currently) only used by gpu_resources_test.go. If this helper is intended to be internal, consider unexporting it to avoid growing the public API; if it’s intended to be public, consider adding at least one non-test usage to justify the exported surface.
```go
// Keep each row's expected sum honest: the hand-written totals prove the
// SumGPURequested contract (sum across ALL known GPU resource names) without
// looping the implementation under test.
```
The comment about “each row's expected sum” and “SumGPURequested contract” sits above TestIsGPUResourceName, but it describes the SumGPURequested table test below. Consider moving/rewriting the comment so it documents the correct test and doesn’t confuse readers.
Suggested change:

```diff
-// Keep each row's expected sum honest: the hand-written totals prove the
-// SumGPURequested contract (sum across ALL known GPU resource names) without
-// looping the implementation under test.
+// TestIsGPUResourceName verifies that the known GPU resource names are
+// accepted and that adjacent-but-unsupported resource names are rejected.
```
Thank you for your contribution! Your PR has been merged. Check out what's new:
Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey
Post-merge build verification passed ✅ Both Go and frontend builds compiled successfully against the merge commit.
Summary
Follow-up PR addressing the 3 Copilot review findings on #9204:
- `ci.GPURequested` was overwritten per matching resource name, so a container requesting both `nvidia.com/gpu` and `habana.ai/gaudi` returned whichever came last in Go's randomized map iteration.
- The same six resource names were hard-coded in both `client_resources.go` and `client_gpu.go` — the exact drift surface that Issue 9090 originally suffered.

Changes
- `pkg/k8s/gpu_resources.go` — single source of truth:
  - `GPUResourceNames` (exported slice)
  - `IsGPUResourceName(name)` predicate
  - `SumGPURequested(rl)` — sums across ALL known names (fixes #1)
- `pkg/k8s/client_resources.go` — uses `SumGPURequested` on Requests, falls back to Limits.
- `pkg/k8s/client_gpu.go` — replaces six repeated `if ok` blocks with a single `SumGPURequested` call on the Gaudi-era path, so pod and node trackers read the same list.
- `pkg/k8s/gpu_resources_test.go`:
  - `TestIsGPUResourceName` — 6 known names (true) + 7 adjacent names including `google.com/tpu`, `intel.com/xpu`, `nvidia.com/mig-4g.20gb` (false)
  - `TestSumGPURequested` — empty, single-name, multi-accelerator regression (nvidia=1 + gaudi=2 → 3), all-six-summed, non-GPU ignored

Context
Related PR: #9204 (original #9090 fix, merged 2026-04-20).
Copilot review comments addressed:
- `client_resources.go:39` — centralize shared list
- `client_resources.go:107` — add unit test coverage
- `client_resources.go:106` — sum instead of overwrite (bug)