fix(ci): query GPU snapshot by subtype name instead of index #509

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/gpu-snapshot-validate-subtype
Apr 8, 2026

@yuanchen8911

Contributor

Summary

Fix GPU CI validation failure caused by #502 adding a hardware subtype before smi in GPU measurements.

Root Cause

PR #502 (feat: wire NFDHardwareDetector into production snapshot pipeline) added a Phase 1 hardware subtype that appears before the existing smi subtype in the GPU measurement array:

measurements:
  - type: GPU
    subtypes:
      - name: hardware   # NEW — Phase 1 NFD detection (no gpu.model field)
      - name: smi        # Existing — has gpu.model, gpu-count, etc.

The snapshot validation action (.github/actions/gpu-snapshot-validate/action.yml) used subtypes[0] to read gpu.model, which now hits hardware instead of smi. The result is GPU model: null, so all H100 GPU tests fail.

Confirmed on main: https://github.com/NVIDIA/aicr/actions/runs/24151637858

Fix

Query by subtype name instead of index:

- GPU_MODEL=$(yq eval '... | .subtypes[0].data["gpu.model"]' snapshot.yaml)
+ GPU_MODEL=$(yq eval '... | .subtypes[] | select(.name == "smi") | .data["gpu.model"]' snapshot.yaml)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Other: CI action (.github/actions/gpu-snapshot-validate)

Checklist

  • Commits are cryptographically signed (git commit -S)

PR NVIDIA#502 added a Phase 1 "hardware" subtype before the existing "smi"
subtype in GPU measurements. The snapshot validation action used
subtypes[0] to read gpu.model, which now hits "hardware" (no model
field) instead of "smi", causing GPU model: null on all H100 runners.

Fix: query by subtype name (select(.name == "smi")) instead of index.

Signed-off-by: Yuan Chen <[email protected]>
@mchmarny mchmarny enabled auto-merge (squash) April 8, 2026 18:52
@mchmarny mchmarny merged commit c334bfc into NVIDIA:main Apr 8, 2026
17 of 18 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 8, 2026
Fix follow-up to NVIDIA#509 — the YAML field is 'subtype' not 'name':

  subtypes:
    - subtype: smi    # correct field
      data:
        gpu.model: NVIDIA H100 NVL

select(.name == "smi") returns empty; select(.subtype == "smi") works.

Signed-off-by: Yuan Chen <[email protected]>
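
Both failures above were silent: the index lookup produced null and the .name selector produced an empty string, and neither stopped the job at the query step. A minimal sketch of a guard that would surface either case early is shown below. The function name require_gpu_model and the error text are hypothetical and are not part of the action; the sketch only assumes that the query result is available as a string.

```shell
#!/bin/sh
# Hypothetical helper (not part of the gpu-snapshot-validate action):
# reject an empty or "null" query result instead of passing it on to
# later test steps.
require_gpu_model() {
  model="$1"
  # An empty string means the selector matched nothing;
  # "null" means the subtype matched but has no gpu.model field.
  if [ -z "$model" ] || [ "$model" = "null" ]; then
    echo "error: gpu.model not found in snapshot (check the subtype selector)" >&2
    return 1
  fi
  printf '%s\n' "$model"
}
```

In the action, the argument would be the GPU_MODEL value produced by the yq query from the Fix section, for example require_gpu_model "$GPU_MODEL" || exit 1. Checking for both the empty string and the literal null covers the two failure modes reported in this thread.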