Skip to content

fix(fingerprint): match GPU SKUs on token boundaries not substrings#1350

Merged
mchmarny merged 4 commits into
mainfrom
fix/1349-gpu-sku-token-matching
Jun 15, 2026
Merged

fix(fingerprint): match GPU SKUs on token boundaries not substrings#1350
mchmarny merged 4 commits into
mainfrom
fix/1349-gpu-sku-token-matching

Conversation

@mchmarny

Copy link
Copy Markdown
Member

Summary

Fix the GPU SKU classifier so it matches accelerator product names on whole-token boundaries instead of raw substrings, eliminating silent misclassification (e.g. L40S read as l40, RTX A1000 read as a100).

Motivation / Context

ParseGPUSKU matched nvidia-smi/NFD product names with strings.Contains, so any SKU whose name contained a shorter registered pattern was collapsed onto the wrong accelerator — and it bypassed the unknown-sku fallback that surfaces registry gaps. This surfaced on an Oracle OKE cluster where nvidia.com/gpu.product=NVIDIA-L40S fingerprinted as l40. The same trap silently mislabels RTX A1000 as datacenter a100. The pattern table already carried two hand-patched workarounds for the identical root cause (GB200-before-B200 ordering; GH200 exclusion).

Fixes: #1349
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/fingerprint — snapshot SKU classification)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)

Implementation Notes

  • ParseGPUSKU now upper-cases and tokenizes the product name on whitespace and hyphens, then matches registry entries as contiguous token sequences. Hyphen splitting normalizes nvidia-smi (NVIDIA A100 80GB) and NFD label (NVIDIA-A100-80GB) forms identically.
  • Because tokens match exactly, entry order is no longer significant, so the GB200-before-B200 ordering and the GH200 exclusion workarounds are removed. Unrecognized SKUs resolve to "", preserving the unknown-sku signal.
  • No enum/API/recipe surface changed. Adding a SKU as a first-class selectable accelerator (criteria enum, OpenAPI, docs) is intentionally out of scope and should land with actual recipe coverage to avoid a "no matching recipe" footgun.

Testing

make test   # PASS — total coverage 77.1% (gpu_sku.go funcs 100%)
make lint   # PASS — 0 issues; AGENTS.md in sync

TestParseGPUSKU expanded to 70 cases: every supported SKU across its real nvidia-smi and NFD-label product-name variants (H100/H200/A100/B200/GB200/L40/RTX PRO 6000), plus an exhaustive decoy set proving look-alikes do not misclassify — GH200, L40S, L4/L20/L2, A800/A40/A30/A16/A10/A2, H800/H20, B100, RTX A1000/A2000/A4000/A5000/A6000, RTX 6000 Ada, Quadro RTX 6000, GeForce RTX 4090/5090, V100/T4/P100, Grace.

Note: make qualify's e2e (KWOK/kind) and grype scan stages were not run locally (they don't exercise this code path and require cluster/tooling); CI covers them.

Risk Assessment

  • Low — Isolated change to one pure-Go function, exhaustively unit-tested, easy to revert.

Rollout notes: Snapshots of unsupported SKUs (e.g. L40S) now correctly fingerprint as unknown-sku rather than a wrong-but-valid accelerator. No previously-supported SKU changes classification (covered by tests). N/A migration.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (N/A — no user-facing surface changed)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

ParseGPUSKU used strings.Contains against the product name, so any SKU
whose name contained a shorter registered pattern was silently collapsed
onto the wrong accelerator (e.g. L40S -> l40, RTX A1000 -> a100), which
also defeated the unknown-sku fallback that surfaces registry gaps.

Tokenize the product name on whitespace and hyphens and match whole
contiguous token sequences instead. This structurally eliminates the
collision class and removes the GB200-before-B200 ordering and GH200
exclusion workarounds. Hyphen splitting also normalizes nvidia-smi
(space-separated) and NFD label (hyphenated) forms identically.

Adds a decoy test matrix asserting L40S, RTX A1000, L4, A800, H800, and
GH200 do not false-map while all supported SKUs still resolve.

Closes #1349
@mchmarny mchmarny requested a review from a team as a code owner June 13, 2026 10:54
@mchmarny mchmarny added area/collector theme/recipes Recipe expansion, overlays, mixins, and component registry labels Jun 13, 2026
@mchmarny mchmarny self-assigned this Jun 13, 2026
@mchmarny mchmarny added the theme/recipes Recipe expansion, overlays, mixins, and component registry label Jun 13, 2026
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 60b149a3-6546-47b8-8a4c-f35b1394c708

📥 Commits

Reviewing files that changed from the base of the PR and between 5086d9c and 4d4430a.

📒 Files selected for processing (1)
  • pkg/fingerprint/gpu_sku_test.go

📝 Walkthrough

Walkthrough

This PR refactors the GPU SKU classifier to eliminate false positive matches caused by substring matching. The registry is redefined from string patterns to token sequences, and the matching logic switches to splitting input on whitespace and hyphens, then searching for contiguous token-sequence matches. Two helper functions implement tokenization and sequence matching. Test coverage is expanded with multiple SKU variant formats and comprehensive collision-guard cases to verify that similar-named SKUs (L40S, RTX A1000, GH200, etc.) no longer misclassify.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and concisely describes the core change: replacing substring matching with token-boundary matching for GPU SKU detection, which directly addresses the root cause of silent misclassification documented in issue #1349.
Description check ✅ Passed The PR description comprehensively explains the motivation, implementation details, test coverage, risk assessment, and commit message, all directly related to the GPU SKU token-boundary matching changes in the codebase.
Linked Issues check ✅ Passed The code changes fully implement all acceptance criteria from issue #1349: token-boundary matching on spaces and hyphens, unrecognized SKUs return empty string, ordering/exclusion workarounds removed, and test matrix expanded to 70 cases covering supported SKUs and extensive decoy patterns.
Out of Scope Changes check ✅ Passed All changes are scoped to pkg/fingerprint/gpu_sku.go and its test file, directly addressing the substring-matching bug and its tests. No external APIs, enums, or unrelated functionality were modified.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/1349-gpu-sku-token-matching

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 77.1%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.1%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/fingerprint 98.22% (+0.15%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/fingerprint/gpu_sku.go 100.00% (ø) 20 (+13) 20 (+13) 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@ArangoGutierrez ArangoGutierrez left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix — token matching is the right call, and it lets you delete both existing workarounds (GB200-before-B200 ordering, GH200 exclusion) instead of adding a third. The behavior reads correctly: single-token SKUs match exactly, RTX PRO 6000 matches as a contiguous run, and the look-alikes fall through to "".

Code review

  • Correctness looks right. tokenizeProductName splitting on whitespace and hyphens normalizes the nvidia-smi and NFD-label forms to identical tokens, and containsTokenSequence's bounds are correct.
  • The decoy half of the test table is what makes this trustworthy: L40S, RTX A1000, GH200, B100 all assert "", so the guard discriminates rather than just confirming the happy path. Nice that both the product-name and NFD-label forms are covered.

Go review

  • Pure function, no error or concurrency surface, nothing for the linter. The per-call allocation in tokenizeProductName runs once per node, so no concern.
  • One nit inline: the registry is order-independent now.

No blockers. The unknown-sku fallback is intact and it's a clean revert if anything regresses.

Comment thread pkg/fingerprint/gpu_sku.go
Comment thread pkg/fingerprint/gpu_sku_test.go
mchmarny and others added 2 commits June 15, 2026 06:27
The comment implied GB200 matched correctly due to registry ordering,
but exact-token matching makes it correct by construction — GB200 and
B200 tokenize to distinct sequences regardless of entry order.

@ArangoGutierrez ArangoGutierrez left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed at 4d4430a. The only change since my review is the GB200 comment fix (2e1a02a), which addresses the ordering-dependency note I left — ParseGPUSKU and all test assertions are unchanged. CI is green except tests/E2E and analyze still running; the merge gate holds the merge until those pass. No blockers. Approving to unblock the SMI refactor in #1352.

@mchmarny mchmarny disabled auto-merge June 15, 2026 14:00
@mchmarny mchmarny merged commit 410f5a3 into main Jun 15, 2026
50 of 66 checks passed
@mchmarny mchmarny deleted the fix/1349-gpu-sku-token-matching branch June 15, 2026 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M theme/recipes Recipe expansion, overlays, mixins, and component registry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: GPU SKU classifier substring match silently mismaps SKUs

2 participants