fix(fingerprint): match GPU SKUs on token boundaries not substrings#1350
Conversation
ParseGPUSKU used strings.Contains against the product name, so any SKU whose name contained a shorter registered pattern was silently collapsed onto the wrong accelerator (e.g. L40S -> l40, RTX A1000 -> a100), which also defeated the unknown-sku fallback that surfaces registry gaps. Tokenize the product name on whitespace and hyphens and match whole contiguous token sequences instead. This structurally eliminates the collision class and removes the GB200-before-B200 ordering and GH200 exclusion workarounds. Hyphen splitting also normalizes nvidia-smi (space-separated) and NFD label (hyphenated) forms identically. Adds a decoy test matrix asserting L40S, RTX A1000, L4, A800, H800, and GH200 do not false-map while all supported SKUs still resolve. Closes #1349
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR refactors the GPU SKU classifier to eliminate false positive matches caused by substring matching. The registry is redefined from string patterns to token sequences, and the matching logic switches to splitting input on whitespace and hyphens, then searching for contiguous token-sequence matches. Two helper functions implement tokenization and sequence matching. Test coverage is expanded with multiple SKU variant formats and comprehensive collision-guard cases to verify that similar-named SKUs (L40S, RTX A1000, GH200, etc.) no longer misclassify. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Nice fix — token matching is the right call, and it lets you delete both existing workarounds (GB200-before-B200 ordering, GH200 exclusion) instead of adding a third. The behavior reads correctly: single-token SKUs match exactly, RTX PRO 6000 matches as a contiguous run, and the look-alikes fall through to "".
Code review
- Correctness looks right.
tokenizeProductNamesplitting on whitespace and hyphens normalizes the nvidia-smi and NFD-label forms to identical tokens, andcontainsTokenSequence's bounds are correct. - The decoy half of the test table is what makes this trustworthy: L40S, RTX A1000, GH200, B100 all assert "", so the guard discriminates rather than just confirming the happy path. Nice that both the product-name and NFD-label forms are covered.
Go review
- Pure function, no error or concurrency surface, nothing for the linter. The per-call allocation in
tokenizeProductNameruns once per node, so no concern. - One nit inline: the registry is order-independent now.
No blockers. The unknown-sku fallback is intact and it's a clean revert if anything regresses.
The comment implied GB200 matched correctly due to registry ordering, but exact-token matching makes it correct by construction — GB200 and B200 tokenize to distinct sequences regardless of entry order.
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Re-reviewed at 4d4430a. The only change since my review is the GB200 comment fix (2e1a02a), which addresses the ordering-dependency note I left — ParseGPUSKU and all test assertions are unchanged. CI is green except tests/E2E and analyze still running; the merge gate holds the merge until those pass. No blockers. Approving to unblock the SMI refactor in #1352.
Summary
Fix the GPU SKU classifier so it matches accelerator product names on whole-token boundaries instead of raw substrings, eliminating silent misclassification (e.g. L40S read as
l40, RTX A1000 read asa100).Motivation / Context
ParseGPUSKUmatchednvidia-smi/NFD product names withstrings.Contains, so any SKU whose name contained a shorter registered pattern was collapsed onto the wrong accelerator — and it bypassed theunknown-skufallback that surfaces registry gaps. This surfaced on an Oracle OKE cluster wherenvidia.com/gpu.product=NVIDIA-L40Sfingerprinted asl40. The same trap silently mislabelsRTX A1000as datacentera100. The pattern table already carried two hand-patched workarounds for the identical root cause (GB200-before-B200 ordering; GH200 exclusion).Fixes: #1349
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/fingerprint— snapshot SKU classification)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)Implementation Notes
ParseGPUSKUnow upper-cases and tokenizes the product name on whitespace and hyphens, then matches registry entries as contiguous token sequences. Hyphen splitting normalizes nvidia-smi (NVIDIA A100 80GB) and NFD label (NVIDIA-A100-80GB) forms identically."", preserving theunknown-skusignal.Testing
TestParseGPUSKUexpanded to 70 cases: every supported SKU across its real nvidia-smi and NFD-label product-name variants (H100/H200/A100/B200/GB200/L40/RTX PRO 6000), plus an exhaustive decoy set proving look-alikes do not misclassify — GH200, L40S, L4/L20/L2, A800/A40/A30/A16/A10/A2, H800/H20, B100, RTX A1000/A2000/A4000/A5000/A6000, RTX 6000 Ada, Quadro RTX 6000, GeForce RTX 4090/5090, V100/T4/P100, Grace.Note:
make qualify's e2e (KWOK/kind) and grype scan stages were not run locally (they don't exercise this code path and require cluster/tooling); CI covers them.Risk Assessment
Rollout notes: Snapshots of unsupported SKUs (e.g. L40S) now correctly fingerprint as
unknown-skurather than a wrong-but-valid accelerator. No previously-supported SKU changes classification (covered by tests). N/A migration.Checklist
make testwith-race)make lint)git commit -S)