
🐛 fix(gpu): centralize GPU resource names + fix multi-accelerator sum + tests (Issue 9204 follow-up) #9216

Merged
clubanderson merged 1 commit into main from fix/gpu-resources-centralize-sum-9204fu on Apr 20, 2026
Conversation

@clubanderson
Collaborator

Summary

Follow-up PR addressing the 3 Copilot review findings on #9204:

  1. Multi-accelerator undercount (real bug): ci.GPURequested was overwritten per matching resource name, so a container requesting both nvidia.com/gpu and habana.ai/gaudi returned whichever came last in Go's randomized map iteration.
  2. Duplicate list between client_resources.go and client_gpu.go — exact drift surface that Issue 9090 originally suffered.
  3. No test coverage for the Intel / Gaudi names introduced in 🐛 fix(gpu): track Intel and Gaudi GPU requests in pod resource views (Issue 9090) #9204.

Changes

  • New pkg/k8s/gpu_resources.go — single source of truth for the shared GPU resource-name list and its helpers (GPUResourceNames, IsGPUResourceName, SumGPURequested).
  • pkg/k8s/client_resources.go — uses SumGPURequested on Requests, falls back to Limits.
  • pkg/k8s/client_gpu.go — replaces six repeated if ok blocks with a single SumGPURequested call in the node allocation path, so pod and node trackers read the same list.
  • New pkg/k8s/gpu_resources_test.go:
    • TestIsGPUResourceName — 6 known names (true) + 7 adjacent names including google.com/tpu, intel.com/xpu, nvidia.com/mig-4g.20gb (false)
    • TestSumGPURequested — empty, single-name, multi-accelerator regression (nvidia=1 + gaudi=2 → 3), all-six-summed, non-GPU ignored

Context

Related PR: #9204 (original #9090 fix, merged 2026-04-20).
Copilot review comments addressed:

  • client_resources.go:39 — centralize shared list
  • client_resources.go:107 — add unit test coverage
  • client_resources.go:106 — sum instead of overwrite (bug)

… + tests (Issue 9204 follow-up)

Addresses 3 Copilot review findings on PR kubestellar/console Issue 9204:

### 1. Multi-accelerator undercount (real bug)

Previously GetPods in client_resources.go did:
    for name, qty := range c.Resources.Requests {
        if isGPUResourceName(name) {
            ci.GPURequested = int(qty.Value())
        }
    }
A container requesting more than one accelerator (e.g., nvidia.com/gpu=1
+ habana.ai/gaudi=2) had ci.GPURequested overwritten per matching name,
so the final value depended on Go's randomized map iteration order and
undercounted the total.
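The difference between the overwrite and the fix can be sketched with simplified stand-in types (plain map[string]int64 instead of the real corev1.ResourceList and resource.Quantity; buggySum and fixedSum are hypothetical names for illustration, not the repository's functions):

```go
package main

import "fmt"

// Stand-in for the known GPU resource-name check.
var gpuNames = map[string]bool{
	"nvidia.com/gpu":  true,
	"habana.ai/gaudi": true,
}

// buggySum mirrors the old behavior: each match overwrites the
// previous value, so the result depends on map iteration order.
func buggySum(requests map[string]int64) int64 {
	var got int64
	for name, qty := range requests {
		if gpuNames[name] {
			got = qty // overwrite, not accumulate
		}
	}
	return got
}

// fixedSum accumulates across all matching names.
func fixedSum(requests map[string]int64) int64 {
	var total int64
	for name, qty := range requests {
		if gpuNames[name] {
			total += qty
		}
	}
	return total
}

func main() {
	req := map[string]int64{"nvidia.com/gpu": 1, "habana.ai/gaudi": 2}
	fmt.Println(buggySum(req)) // 1 or 2, depending on iteration order
	fmt.Println(fixedSum(req)) // always 3
}
```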

### 2. Duplicate list between client_resources.go and client_gpu.go

Both files hard-coded the same six resource names. Any future addition
risked drifting one path behind the other — the exact class of bug Issue
9090 originally reported.

### 3. No test coverage for the new names

The new Intel and Gaudi names landed without a unit test that would fail
if a future refactor dropped one.

### Fix

Added pkg/k8s/gpu_resources.go as the single source of truth:
  - `GPUResourceNames` — exported slice
  - `IsGPUResourceName(name)` — exact-match predicate
  - `SumGPURequested(rl)` — sums across ALL known names (fixes #1)

client_resources.go now calls SumGPURequested on Requests, then on Limits
as a fallback. client_gpu.go replaces its six repeated `if ok` blocks
with a single SumGPURequested call too, so both pod and node paths read
from the same list.
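A minimal sketch of the centralized helper and the Requests-then-Limits fallback, again with simplified stand-in types (the real code operates on corev1.ResourceList; gpuForContainer is a hypothetical name for the caller-side logic, not necessarily what client_resources.go calls it):

```go
package main

import "fmt"

// Stand-in for the shared GPUResourceNames slice. Iterating the fixed
// slice (rather than the resource map) keeps the summation deterministic.
var gpuResourceNames = []string{
	"nvidia.com/gpu", "amd.com/gpu", "gpu.intel.com/i915",
	"habana.ai/gaudi", "habana.ai/gaudi2", "intel.com/gaudi",
}

// sumGPURequested mirrors SumGPURequested over a simplified resource list.
func sumGPURequested(rl map[string]int64) int {
	var total int64
	for _, name := range gpuResourceNames {
		total += rl[name] // absent keys contribute zero
	}
	return int(total)
}

// gpuForContainer mirrors the Requests-first, Limits-fallback behavior.
func gpuForContainer(requests, limits map[string]int64) int {
	if n := sumGPURequested(requests); n > 0 {
		return n
	}
	return sumGPURequested(limits)
}

func main() {
	requests := map[string]int64{}               // no GPU requests set
	limits := map[string]int64{"amd.com/gpu": 4} // limit only
	fmt.Println(gpuForContainer(requests, limits)) // falls back to Limits: 4
}
```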

gpu_resources_test.go covers:
  - IsGPUResourceName on all six known names (true) and 7 adjacent names
    including google.com/tpu / intel.com/xpu / nvidia.com/mig-4g.20gb (false)
  - SumGPURequested on empty, single-name, multi-accelerator (the
    regression case — returns 3 for nvidia=1+gaudi=2), all-six-summed, and
    non-GPU resources ignored.
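The table-driven cases above can be sketched as follows, using the same simplified stand-in types as before (the real test exercises SumGPURequested over corev1.ResourceList values; this is an illustrative approximation, not the repository's test file):

```go
package main

import "fmt"

// Stand-in for the shared list; the real code uses []corev1.ResourceName.
var gpuResourceNames = []string{
	"nvidia.com/gpu", "amd.com/gpu", "gpu.intel.com/i915",
	"habana.ai/gaudi", "habana.ai/gaudi2", "intel.com/gaudi",
}

func sumGPURequested(rl map[string]int64) int {
	var total int64
	for _, name := range gpuResourceNames {
		total += rl[name] // missing keys contribute zero
	}
	return int(total)
}

func main() {
	cases := []struct {
		name string
		rl   map[string]int64
		want int
	}{
		{"empty", nil, 0},
		{"single name", map[string]int64{"nvidia.com/gpu": 2}, 2},
		{"multi-accelerator regression",
			map[string]int64{"nvidia.com/gpu": 1, "habana.ai/gaudi": 2}, 3},
		{"all six summed", map[string]int64{
			"nvidia.com/gpu": 1, "amd.com/gpu": 1, "gpu.intel.com/i915": 1,
			"habana.ai/gaudi": 1, "habana.ai/gaudi2": 1, "intel.com/gaudi": 1,
		}, 6},
		{"non-GPU ignored", map[string]int64{"cpu": 4, "google.com/tpu": 8}, 0},
	}
	for _, c := range cases {
		if got := sumGPURequested(c.rl); got != c.want {
			panic(fmt.Sprintf("%s: got %d, want %d", c.name, got, c.want))
		}
	}
	fmt.Println("all cases pass")
}
```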

Signed-off-by: Andrew Anderson <[email protected]>
Copilot AI review requested due to automatic review settings April 20, 2026 18:11
@kubestellar-prow kubestellar-prow Bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 20, 2026
@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign eeshaansa for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify Bot commented Apr 20, 2026

Deploy Preview for kubestellarconsole ready!

Name Link
🔨 Latest commit 726f538
🔍 Latest deploy log https://app.netlify.com/projects/kubestellarconsole/deploys/69e66c4cf5c2c70008315d73
😎 Deploy Preview https://deploy-preview-9216.console-deploy-preview.kubestellar.io

@github-actions
Contributor

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@kubestellar-prow kubestellar-prow Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 20, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR follows up on GPU accounting issues by centralizing the Kubernetes GPU/accelerator resource-name list and fixing pod/node GPU requested totals when multiple accelerator resources are requested in the same container.

Changes:

  • Added shared GPU resource helpers (GPUResourceNames, IsGPUResourceName, SumGPURequested) and updated callers to use the shared summation logic.
  • Fixed multi-accelerator undercount by summing across all known GPU resource names (instead of overwriting per match).
  • Added unit tests covering known/unknown GPU resource names and regression coverage for multi-accelerator sums.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
pkg/k8s/gpu_resources.go Introduces centralized GPU resource-name list + helper functions, including deterministic summation across all known names.
pkg/k8s/gpu_resources_test.go Adds tests for name classification and SumGPURequested regression coverage (multi-accelerator sum).
pkg/k8s/client_resources.go Switches pod container GPURequested calculation to SumGPURequested (Requests, fallback to Limits).
pkg/k8s/client_gpu.go Replaces repeated per-resource checks with a single SumGPURequested call for node allocation accounting.

Comment thread pkg/k8s/gpu_resources.go
Comment on lines +15 to +22
var GPUResourceNames = []corev1.ResourceName{
"nvidia.com/gpu",
"amd.com/gpu",
"gpu.intel.com/i915",
"habana.ai/gaudi",
"habana.ai/gaudi2",
"intel.com/gaudi",
}

Copilot AI Apr 20, 2026


GPUResourceNames is exported as a mutable slice. Since it’s only referenced within the k8s package, consider making the list unexported (and, if needed externally, expose a function that returns a defensive copy) to prevent accidental mutation and reduce public API surface.

Comment thread pkg/k8s/gpu_resources.go
Comment on lines +16 to +21
"nvidia.com/gpu",
"amd.com/gpu",
"gpu.intel.com/i915",
"habana.ai/gaudi",
"habana.ai/gaudi2",
"intel.com/gaudi",

Copilot AI Apr 20, 2026


The comment says to keep GPUResourceNames “sorted by vendor”, but the current order mixes Intel/Habana entries (Intel appears both before and after the Habana names). Either adjust the ordering to match the comment or update the comment to reflect the intended grouping.

Suggested change
- "nvidia.com/gpu",
- "amd.com/gpu",
- "gpu.intel.com/i915",
- "habana.ai/gaudi",
- "habana.ai/gaudi2",
- "intel.com/gaudi",
+ "amd.com/gpu",
+ "habana.ai/gaudi",
+ "habana.ai/gaudi2",
+ "gpu.intel.com/i915",
+ "intel.com/gaudi",
+ "nvidia.com/gpu",

Comment thread pkg/k8s/gpu_resources.go
Comment on lines +24 to +33
// IsGPUResourceName reports whether the given Kubernetes resource name is a
// known GPU / AI accelerator. Exact match; no vendor-prefix heuristics.
func IsGPUResourceName(name corev1.ResourceName) bool {
for _, n := range GPUResourceNames {
if name == n {
return true
}
}
return false
}

Copilot AI Apr 20, 2026


IsGPUResourceName is exported but (currently) only used by gpu_resources_test.go. If this helper is intended to be internal, consider unexporting it to avoid growing the public API; if it’s intended to be public, consider adding at least one non-test usage to justify the exported surface.

Comment thread pkg/k8s/gpu_resources_test.go
Comment on lines +10 to +12
// Keep each row's expected sum honest: the hand-written totals prove the
// SumGPURequested contract (sum across ALL known GPU resource names) without
// looping the implementation under test.

Copilot AI Apr 20, 2026


The comment about “each row's expected sum” and “SumGPURequested contract” sits above TestIsGPUResourceName, but it describes the SumGPURequested table test below. Consider moving/rewriting the comment so it documents the correct test and doesn’t confuse readers.

Suggested change
- // Keep each row's expected sum honest: the hand-written totals prove the
- // SumGPURequested contract (sum across ALL known GPU resource names) without
- // looping the implementation under test.
+ // TestIsGPUResourceName verifies that the known GPU resource names are
+ // accepted and that adjacent-but-unsupported resource names are rejected.

@clubanderson clubanderson merged commit 9e3fbf1 into main Apr 20, 2026
35 of 36 checks passed
@clubanderson clubanderson deleted the fix/gpu-resources-centralize-sum-9204fu branch April 20, 2026 18:16
@github-actions
Contributor

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

@github-actions
Contributor

Post-merge build verification passed

Both Go and frontend builds compiled successfully against merge commit 9e3fbf1502f966b4a82db3924df2c7c7125b203a.


Labels

dco-signoff: yes Indicates the PR's author has signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tier/2-standard

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants