Skip to content

feat(validators): enhance inference performance validation#1133

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-configurable-model
Jun 2, 2026
Merged

feat(validators): enhance inference performance validation#1133
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/inference-perf-configurable-model

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Makes the inference-perf performance check configurable per recipe, defaults it
to a real-workload model, and backs it with a model-weights cache — so
aicr validate --phase performance reflects GPU compute (not serving overhead)
and works out of the box. Three headline changes: per-recipe model/concurrency,
model cache on by default, and per-accelerator gate thresholds re-measured on
real hardware
.

Draft / WIP — opening for early review. Full make qualify (e2e + scan) is
pending before marking ready; make test (-race) and make lint pass locally.

Motivation / Context

Fixes #1116.

The old check ran a tiny smoke-test model (Qwen3-0.6B, conc 16) that never loaded
the GPU and used global-only knobs. Characterization across RTX PRO 6000, H100
(EKS + GKE), and GB200 showed an 8B model at conc 256/GPU is the throughput sweet
spot and that the per-accelerator thresholds needed real numbers.

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update

Component(s) Affected

  • Recipe engine / data (pkg/recipe, recipes/overlays, recipes/validators/catalog.yaml)
  • Validator (pkg/validator, validators/performance)
  • Docs/examples (docs/)

Implementation Notes

  • Per-recipe inputs: inference-model and inference-concurrency-per-gpu read
    from validation.performance.constraints, resolving recipe > catalog env > default
    (symmetric with the throughput/TTFT thresholds). Defaults: Qwen/Qwen3-8B at 256/GPU.
  • Model-weights cache (PVC) on by default: one populate download shared by all
    workers (offline), avoiding per-IP HF 429s; …MODEL_CACHE_SIZE=off to disable.
    Fails fast with guidance when no StorageClass is resolvable instead of hanging.
  • EKS default StorageClass: aws-ebs-csi-driver provisions a default gp3 SC
    (defaultStorageClass.enabled); GKE already has standard-rwo.
  • Per-accelerator gates in the *-inference-dynamo overlays re-measured at
    8B/256, with model + concurrency pinned alongside.
  • Configurable timeouts: workload-ready / endpoint-readiness / health timeouts
    (AICR_INFERENCE_PERF_WORKLOAD_READY_TIMEOUT, …HEALTH_TIMEOUT) for large-model
    bring-up. The inference-perf catalog timeout budget now accounts for the
    cache-populate phase (50m → 65m), and the facade ValidationOperationTimeout was
    raised to 75m (> the largest catalog check timeout) with a catalog-vs-facade
    guard test.
  • Security/scoping (Codex review): model passed to AIPerf via the AICR_MODEL
    env referenced as "$AICR_MODEL" (no shell command-substitution); HF_TOKEN
    forwarded only to the inference-perf validator.
  • Input hardening (review rounds):
    • Resolved model ID validated against the HF repo-ID character set — rejects
      YAML/shell metacharacters before it reaches the Dynamo template.
    • SERVED_MODEL_NAME: "${MODEL}" quoted in the deploy template so a
      scalar-looking model ID (pure-numeric / boolean-like) parses as a string, not
      int/bool/null (+ parse test).
    • Populate container carries explicit CPU/memory requests
      (100m/512Mi) so it schedules under ResourceQuota / LimitRange — but
      no memory limit: it streams tens of GB to the PVC and a hard limit
      OOMKills large-model downloads via page cache on cgroup v2 (caught on a
      live 8B GKE run).
    • Scale the inference-throughput gate to the free-GPU count actually
      benchmarked. The workload is sized to a node's free GPUs, so on a shared
      node the absolute full-node gate would false-fail a healthy per-GPU result;
      the gate is scaled by freeGPUs / nodeGPUs (TTFT p99, a latency, is not
      scaled). Full-node runs are a no-op.
    • Preserve the model-cache populate Job's timeout/unavailable error
      classification (PropagateOrWrap instead of flattening to ErrCodeInternal).
    • Propagate the validator pod's imagePullSecrets to the model-cache populate
      Job (parity with the AIPerf Job) so private-mirror / air-gapped clusters can
      pull the cache worker image.
    • Reject non-positive cache sizes (0Gi / -1Gi) with ErrCodeInvalidRequest.
    • Clear a stale HF-token Secret when the token is unset; carry resourceVersion
      on Secret update; bound the cache PVC/Job and HF-token Secret API calls with
      named timeouts.
    • Recognize the legacy beta default-StorageClass annotation.

Testing

Commands run:

  • make test (-race) — pass; total coverage 76.4% (≥ 70% floor).
  • make lint (golangci-lint + yamllint + license/agents-sync) — 0 issues.

Validated end to end on live clusters with the compiled defaults (Qwen3-8B,
256 concurrency/GPU, cache on; no tuning flags beyond the per-region
registry/tag, plus MODEL_CACHE_STORAGE_CLASS=gp2 on GB200 which has no default
StorageClass):

Cluster GPUs throughput TTFT p99 peak GPU util
EKS RTX PRO 6000 8 58,434 tok/s 458 ms 100%
GKE H100 8 107,742 tok/s 796 ms 100%
EKS H100 (aicr2, earlier) 8 ~108,790 tok/s ~688 ms 100%
EKS GB200 4 69,072 tok/s 989 ms ~80%

All passed their per-accelerator gates. GB200 reaches only ~80% util on 8B (it is
over-provisioned for a model this small) — the documented case for the remaining
precision/TP work. The on-by-default cache bound to each cluster's StorageClass
(standard-rwo on GKE, ebs-csi-default-sc on the AICR-provisioned EKS, gp2
on GB200), and the no-default-SC fast-fail was confirmed on GB200 (~6 s, with
actionable guidance).

Re-verified the head-of-branch build end to end on GKE H100 (validator image
rebuilt from the latest commit): per-recipe model/concurrency resolved from the
recipe, cache on by default bound zero-config to the cluster's standard-rwo
default SC (100Gi RWO aicr-model-cache), and the populate container scheduled
with its new 100m/512Mi requests + 4Gi limit.

  • Pending before ready-for-review: full make qualify (e2e + grype scan).

Risk Assessment

  • Medium — Touches the inference-perf validator, recipe overlays, and the
    EBS-CSI default-SC; default model/concurrency changed (heavier default run).

Rollout notes: the default inference-perf run now uses 8B at conc 256 with the
cache on (slower, ~6–7 min, benefits from the cache); disable via
AICR_INFERENCE_PERF_MODEL_CACHE_SIZE=off.

⚠️ Cluster-wide default StorageClass — wants conscious maintainer sign-off.
This PR sets defaultStorageClass.enabled: true on aws-ebs-csi-driver, which
provisions a cluster-default gp3 SC (ebs-csi-default-sc) on every
AICR-deployed EKS cluster that pulls in the component — not just inference
recipes
; training overlays inherit it too. Two consequences a maintainer
should approve knowingly:

  1. If a cluster already has a default SC, this creates a two-default ambiguity
    (Kubernetes treats multiple defaults as undefined) — the operator must unset
    the other default.
  2. Any unrelated PVC that previously failed-fast on "no default SC" will now
    silently bind gp3, which can mask a misconfiguration.

This is intentional (EKS ships no default SC, so the model-weights cache PVC
would otherwise hang Pending), and is the one change flagged in review
(@njhensley) as needing explicit maintainer approval. The inline NOTE: in
recipes/components/aws-ebs-csi-driver/values.yaml documents the same caveat.

HF_TOKEN forwarding (reviewed, accepted): the token reaches the validator pod
as a plaintext pod-spec env (workers/cache Job already use secretKeyRef). The
secretKeyRef-on-validator alternative was considered and declined — narrow
upside, doesn't remove the runtime plaintext, low-sensitivity rate-limit token in
an ephemeral per-run namespace. The trade-off is documented inline at
pkg/validator/v1/job_plan_internal.go.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR makes the inference-perf validator configurable per recipe and adds a PVC-backed model-weights cache. It wires inference-model and inference-concurrency-per-gpu from recipe constraints (recipe > catalog env > compiled default Qwen/Qwen3-8B at 256/GPU), forwards HF token only to inference-perf jobs, adds env-configurable readiness/health timeouts, validates tuning knobs fail-closed, pre-populates an optional model cache (PVC + populate Job) and injects it into workers, passes resolved model via AICR_MODEL to AIPerf (avoiding shell interpolation), updates overlays and catalog docs with measured gates, and enables a cluster-default StorageClass in the AWS EBS CSI driver chart.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The PR comprehensively addresses all objectives from issue #1116: per-recipe config via constraints [inference-model, inference-concurrency-per-gpu], model-weights cache on-by-default with fast-fail StorageClass validation, updated per-accelerator gates re-measured on real hardware, and security scoping (AICR_MODEL env, HF_TOKEN forwarding).
Out of Scope Changes check ✅ Passed All changes align with the stated objectives: documentation updates explain new constraints/cache behavior; code changes implement per-recipe config resolution, model cache PVC/Job logic, HF_TOKEN forwarding, and timeout budgets; recipe overlays pin model/concurrency/thresholds consistently. AWS EBS CSI default StorageClass enablement directly enables cache zero-config on EKS.
Title check ✅ Passed The title 'feat(validators): enhance inference performance validation' directly reflects the main changes: configurable per-recipe inference model/concurrency, model-cache enablement, and re-measured performance gates across multiple validator overlays.
Description check ✅ Passed The pull request description directly addresses the changeset, explaining the three headline changes (per-recipe model/concurrency, model cache on by default, re-measured gates), implementation details, testing, risk assessment, and rollout notes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/validator.md`:
- Around line 160-172: Update the example under validation.performance so the
inference-ttft-p99 threshold matches the later examples/overlay guidance; change
the value for the key inference-ttft-p99 in the YAML block to use the unified
threshold (e.g., "<= 1000" or "<= 2000" as used elsewhere) so it no longer
conflicts with the rest of the doc.

In `@docs/user/validation.md`:
- Around line 169-170: Update the wording in docs/user/validation.md where the
CLI example shows "aicr validate --data <overlay>" to correctly indicate the
expected external data directory (use "aicr validate --data <dir>" or similar),
i.e., replace "<overlay>" with "<dir>" in the paragraph referencing the `aicr
validate --data` usage so it clearly reflects that `--data` expects a directory
path rather than an overlay identifier.

In `@pkg/validator/v1/job_plan_internal.go`:
- Around line 104-110: The code appends catalog-provided env vars after the
orchestrator-forwarded HF_TOKEN, allowing entry.Env to override the forwarded
token; update the loop that appends entry.Env to skip any EnvVar with Name ==
"HF_TOKEN" (or generally dedupe by checking existing env slice for that name) so
the forwarded token set when entry.Name == inferencePerfCheckName always wins;
reference the variables/idents env, entry.Env, inferencePerfCheckName, and
corev1.EnvVar when making the change.

In `@recipes/overlays/b200-gke-cos-inference-dynamo.yaml`:
- Around line 76-92: The constraints block currently contains UNTESTED
PLACEHOLDER thresholds for the B200 overlay (see constraint names
inference-model, inference-concurrency-per-gpu, inference-throughput,
inference-ttft-p99) while conformance keeps the inference-perf gate enabled;
before merging either replace these placeholder values with measured B200
results from the referenced GKE testbed run (use real throughput and p99 TTFT
numbers) or remove/disable the inference-perf gate in conformance until
validated results are available so the recipe cannot pass with
non-representative limits.

In `@validators/performance/inference_perf_constraint.go`:
- Around line 878-881: The update path for the existing HF token Secret is
missing the required resourceVersion; before calling secrets.Update(ctx.Ctx,
secret, ... ) on the IsAlreadyExists branch, set secret.ResourceVersion to the
existing secret's ResourceVersion (metadata.resourceVersion) so the update
includes the current version; locate the AlreadyExists block around
secrets.Update and assign secret.ResourceVersion =
existingSecret.ResourceVersion (or the variable holding the fetched secret)
prior to calling secrets.Update to avoid API rejections.

In `@validators/performance/model_cache.go`:
- Around line 124-125: clusterHasDefaultStorageClass only checks the GA
annotation constant (defaultStorageClassAnnotation) and misses clusters using
the legacy beta key; update the check in clusterHasDefaultStorageClass (and any
use of defaultStorageClassAnnotation) to consider both
"storageclass.kubernetes.io/is-default-class" and
"storageclass.beta.kubernetes.io/is-default-class" as indicators of a default
StorageClass, returning true if either annotation on scs.Items[i].Annotations
equals "true".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4675d810-4969-4fe1-b074-4b120dbf8e9f

📥 Commits

Reviewing files that changed from the base of the PR and between 0479e45 and 2094d03.

📒 Files selected for processing (20)
  • docs/contributor/data.md
  • docs/contributor/validator.md
  • docs/integrator/data-flow.md
  • docs/integrator/recipe-development.md
  • docs/user/cli-reference.md
  • docs/user/validation.md
  • pkg/validator/v1/job_plan_internal.go
  • pkg/validator/v1/job_plan_test.go
  • recipes/components/aws-ebs-csi-driver/values.yaml
  • recipes/overlays/b200-gke-cos-inference-dynamo.yaml
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go
  • validators/performance/model_cache.go
  • validators/performance/model_cache_test.go
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread docs/contributor/validator.md
Comment thread docs/user/validation.md Outdated
Comment thread pkg/validator/v1/job_plan_internal.go Outdated
Comment thread recipes/overlays/b200-gke-cos-inference-dynamo.yaml Outdated
Comment thread validators/performance/inference_perf_constraint.go Outdated
Comment thread validators/performance/model_cache.go Outdated
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch 2 times, most recently from 6bf61be to badaa91 Compare June 1, 2026 17:43

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml`:
- Around line 74-86: The inference-throughput floor "value: \">= 50000\"" is too
low for the H100 gate given the measured baseline ≈108,790 tok/s; update the
constraints block (constraint with name "inference-throughput") for the H100
overlay that targets model "inference-model: Qwen/Qwen3-8B" and concurrency
"inference-concurrency-per-gpu: \"256\"" to use a more appropriate floor (e.g.,
">= 90000") so the gate catches real regressions while still allowing the
evaluator's ~10% run-to-run variance; ensure this change is only applied to this
H100-specific overlay rather than a shared accelerator minimum.

In `@validators/performance/inference_perf_test.go`:
- Around line 992-1002: The test currently only checks that
ensureHFTokenSecret(ctx, ns) returns no error but doesn't assert the secret
content changed; after calling ensureHFTokenSecret, retrieve the Secret from the
fake client (via client.CoreV1().Secrets(ns).Get(...)) and assert that the
StringData or Data for key hfTokenSecretKey equals the new env value ("hf_new")
and that the Secret name matches hfTokenSecretName (and optionally
ResourceVersion changed), so the subtest verifies the secret was actually
updated by ensureHFTokenSecret.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: faa4ff86-372d-4ba0-b80b-f64639c00b56

📥 Commits

Reviewing files that changed from the base of the PR and between 2094d03 and 6bf61be.

📒 Files selected for processing (20)
  • docs/contributor/data.md
  • docs/contributor/validator.md
  • docs/integrator/data-flow.md
  • docs/integrator/recipe-development.md
  • docs/user/cli-reference.md
  • docs/user/validation.md
  • pkg/validator/v1/job_plan_internal.go
  • pkg/validator/v1/job_plan_test.go
  • recipes/components/aws-ebs-csi-driver/values.yaml
  • recipes/overlays/b200-gke-cos-inference-dynamo.yaml
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go
  • validators/performance/model_cache.go
  • validators/performance/model_cache_test.go
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml Outdated
Comment thread validators/performance/inference_perf_test.go Outdated
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch from badaa91 to 7c5a33e Compare June 1, 2026 17:50

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 873-905: The Secret API calls (secrets.Get/Delete/Update/Create)
use ctx.Ctx directly and need a bounded context; wrap this block in a timed
context using ctxWithTimeout, errCtx := context.WithTimeout(ctx.Ctx,
defaults.DiagnosticTimeout) (or another named default), defer errCtx() (cancel)
and replace ctx.Ctx with the derived context when calling secrets.Get,
secrets.Delete, secrets.Update and secrets.Create; keep hfTokenSecretName,
hfTokenSecretKey and the existing logic for exists/secret.ResourceVersion
unchanged so only the context usage for these I/O calls is updated.

In `@validators/performance/model_cache.go`:
- Around line 175-186: The PVC/Job creation logic (buildModelCachePVC,
buildModelCachePopulateJob, pvc, jobName, modelCachePopulateJobPrefix) currently
reuses existing artifacts when apierrors.IsAlreadyExists(err), causing stale
caches for different config.model or cache settings; instead compute a short
config hash or generation key from the resolved config (including config.model,
storage class, cache flags, runID/generation), add it as an annotation or suffix
to the PVC and Job, and in the AlreadyExists path fetch the existing PVC/Job,
compare the stored annotation/hash to the current one and if they differ delete
and recreate the resource (or always create with the hash-suffixed name) so the
populate Job and PVC are regenerated when config changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1451a4e2-bbb7-4198-b378-0f6e089df059

📥 Commits

Reviewing files that changed from the base of the PR and between 6bf61be and badaa91.

📒 Files selected for processing (20)
  • docs/contributor/data.md
  • docs/contributor/validator.md
  • docs/integrator/data-flow.md
  • docs/integrator/recipe-development.md
  • docs/user/cli-reference.md
  • docs/user/validation.md
  • pkg/validator/v1/job_plan_internal.go
  • pkg/validator/v1/job_plan_test.go
  • recipes/components/aws-ebs-csi-driver/values.yaml
  • recipes/overlays/b200-gke-cos-inference-dynamo.yaml
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go
  • validators/performance/model_cache.go
  • validators/performance/model_cache_test.go
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread validators/performance/inference_perf_constraint.go Outdated
Comment thread validators/performance/model_cache.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml (1)

74-88: ⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Throughput floor inconsistent with measured baseline and inline rationale.

The comment states the floor carries "margin beyond the evaluator's 10% tolerance" to catch regressions, but >= 50000 against measured ≈ 108,790 tok/s permits a ~54% throughput drop before failing. With the evaluator's 10% tolerance, the effective passing floor is ~98k tok/s * 0.9 = ~88k, yet the gate only enforces 50k. This 50k value is reused across accelerators with very different baselines (RTX PRO 6000 @ 59k, GB200 @ 66k, H100 @ 108k), making it a shared accelerator-family minimum rather than an H100-specific regression gate.

Recommend tightening to >= 90000 so the gate reflects H100 capability and catches meaningful regressions per the inline rationale.

🔧 Suggested tighter H100 floor
         - name: inference-throughput
-          value: ">= 50000"
+          value: ">= 90000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml` around lines 74 - 88,
The throughput floor for the constraint named inference-throughput is far too
low for the H100 baseline and contradicts the inline rationale; update the
constraint value from ">= 50000" to ">= 90000" (or make the gate
accelerator-specific) so the gate reflects the H100 measured throughput
(~108,790 tok/s) and enforces a meaningful regression threshold consistent with
the evaluator's ~10% tolerance; ensure the change is applied to the constraints
block where inference-throughput is defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/h100-gke-cos-inference-dynamo.yaml`:
- Around line 74-88: The throughput floor in the constraints block
("inference-throughput" value) is too low (">= 50000") relative to the stated
H100 baseline (~108,164 tok/s) and the inline rationale; update the
inference-throughput constraint to a tighter H100-specific floor (recommend
setting value to ">= 90000") so the gate captures meaningful regressions and
aligns with the comment's described margin beyond the evaluator's 10% tolerance.

In `@recipes/validators/catalog.yaml`:
- Around line 70-79: The comment around the timeout: 65m value is ambiguous
about whether "cache-populate (10m)" is included in the 50m baseline or counted
separately; update the comment near the timeout field to explicitly state that
the 50m worst-case sum includes the nominal WORKLOAD_READY_TIMEOUT (10m) for
cache-populate but the extra 15m buffer (65m − 50m) is intended to accommodate
cache-populate overruns and deferred cleanup, and reference the relevant knobs
(CheckExecutionTimeout, WORKLOAD_READY_TIMEOUT, cache-populate) so readers
understand that large-model cache-populate can exceed its nominal 10m and the
extra time covers that overrun plus cleanup.

In `@validators/performance/inference_perf_constraint.go`:
- Around line 1389-1393: Update the stale comment for resolveInferenceModel so
it no longer calls the fallback the “default smoke-test model”; change the
wording to state that unsetting or empty AICR_INFERENCE_PERF_MODEL falls back to
the real-workload default (Qwen/Qwen3-8B). Mention AICR_INFERENCE_PERF_MODEL and
resolveInferenceModel in the comment to make it clear which override and
function are affected.

In `@validators/performance/model_cache.go`:
- Around line 241-258: The populate Job's container (Name: "populate", Image:
cacheWorkerImage) lacks resource requests/limits; add a Resources field on that
v1.Container with a v1.ResourceRequirements struct setting sensible Requests and
Limits (e.g. cpu and memory) so the pod can be scheduled and prevented from
OOMing during snapshot_download; update the Container literal inside the model
cache job creation in validators/performance/model_cache.go to include
Resources: v1.ResourceRequirements{Requests: v1.ResourceList{"cpu":
resource.MustParse("250m"), "memory": resource.MustParse("512Mi")}, Limits:
v1.ResourceList{"cpu": resource.MustParse("1"), "memory":
resource.MustParse("4Gi")}} (adjust values as needed or make them configurable).

---

Duplicate comments:
In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml`:
- Around line 74-88: The throughput floor for the constraint named
inference-throughput is far too low for the H100 baseline and contradicts the
inline rationale; update the constraint value from ">= 50000" to ">= 90000" (or
make the gate accelerator-specific) so the gate reflects the H100 measured
throughput (~108,790 tok/s) and enforces a meaningful regression threshold
consistent with the evaluator's ~10% tolerance; ensure the change is applied to
the constraints block where inference-throughput is defined.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d1809639-6996-4f96-abc7-85bf3568d5f0

📥 Commits

Reviewing files that changed from the base of the PR and between badaa91 and 7c5a33e.

📒 Files selected for processing (22)
  • docs/contributor/data.md
  • docs/contributor/validator.md
  • docs/integrator/data-flow.md
  • docs/integrator/recipe-development.md
  • docs/user/cli-reference.md
  • docs/user/validation.md
  • pkg/defaults/timeouts.go
  • pkg/defaults/timeouts_test.go
  • pkg/validator/v1/job_plan_internal.go
  • pkg/validator/v1/job_plan_test.go
  • recipes/components/aws-ebs-csi-driver/values.yaml
  • recipes/overlays/b200-gke-cos-inference-dynamo.yaml
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go
  • validators/performance/model_cache.go
  • validators/performance/model_cache_test.go
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread recipes/overlays/h100-gke-cos-inference-dynamo.yaml Outdated
Comment thread recipes/validators/catalog.yaml Outdated
Comment thread validators/performance/inference_perf_constraint.go Outdated
Comment thread validators/performance/model_cache.go
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch 3 times, most recently from c4deb44 to 211b4ca Compare June 1, 2026 18:13

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
validators/performance/inference_perf_constraint.go (2)

1389-1393: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the stale default-model wording.

Line 1392 still says this falls back to the "default smoke-test model", but the compiled default is now Qwen/Qwen3-8B. That wording is now misleading when reading the precedence docs in code.

As per coding guidelines, "Ensure code comments are accurate and helpful".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/inference_perf_constraint.go` around lines 1389 -
1393, Update the stale comment in resolveInferenceModel: change the phrase that
says it "falls back to the default smoke-test model" to explicitly name the
compiled default (e.g., "compiled default Qwen/Qwen3-8B") and/or clarify that an
unset/empty AICR_INFERENCE_PERF_MODEL will use the compiled default model ID
(Qwen/Qwen3-8B) so the precedence docs reflect the current default behavior.

866-905: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound the Secret CRUD calls with a named timeout.

These Kubernetes Get/Delete/Update/Create calls still use ctx.Ctx directly, so a slow apiserver can consume most of the validator budget before workload setup even starts. Use a derived timeout context here, like the other K8s writes in this file.

Suggested fix
 func ensureHFTokenSecret(ctx *validators.Context, namespace string) error {
 	secrets := ctx.Clientset.CoreV1().Secrets(namespace)
 	token := strings.TrimSpace(os.Getenv(envHFToken))
+	secretCtx, cancel := context.WithTimeout(ctx.Ctx, defaults.DiagnosticTimeout)
+	defer cancel()
 
 	// Read any existing secret first so an update carries the current
 	// resourceVersion (an Update without it can be rejected by the apiserver)
 	// and so an unset token can clear a stale one.
-	existing, getErr := secrets.Get(ctx.Ctx, hfTokenSecretName, metav1.GetOptions{})
+	existing, getErr := secrets.Get(secretCtx, hfTokenSecretName, metav1.GetOptions{})
 	if getErr != nil && !apierrors.IsNotFound(getErr) {
 		return errors.Wrap(errors.ErrCodeInternal, "failed to read HF token secret", getErr)
 	}
 	exists := getErr == nil
@@
 	if token == "" {
 		// Anonymous run: delete any leftover token so a reused per-run namespace
 		// can't silently inject stale credentials via the workers' optional
 		// secretKeyRefs.
 		if exists {
-			if err := secrets.Delete(ctx.Ctx, hfTokenSecretName, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
+			if err := secrets.Delete(secretCtx, hfTokenSecretName, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
 				return errors.Wrap(errors.ErrCodeInternal, "failed to delete stale HF token secret", err)
 			}
 			slog.Info("Cleared stale Hugging Face token secret (HF_TOKEN unset; anonymous downloads)", "namespace", namespace)
 		}
@@
 	}
 	if exists {
 		secret.ResourceVersion = existing.ResourceVersion
-		if _, err := secrets.Update(ctx.Ctx, secret, metav1.UpdateOptions{}); err != nil {
+		if _, err := secrets.Update(secretCtx, secret, metav1.UpdateOptions{}); err != nil {
 			return errors.Wrap(errors.ErrCodeInternal, "failed to update HF token secret", err)
 		}
 		return nil
 	}
-	if _, err := secrets.Create(ctx.Ctx, secret, metav1.CreateOptions{}); err != nil {
+	if _, err := secrets.Create(secretCtx, secret, metav1.CreateOptions{}); err != nil {
 		return errors.Wrap(errors.ErrCodeInternal, "failed to create HF token secret", err)
 	}

As per coding guidelines, "All I/O operations must use a context with timeout" and "Use context.WithTimeout(ctx, timeout) with named constants like defaults.HTTPClientTimeout or defaults.ServerHandlerTimeout for all I/O deadlines".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/inference_perf_constraint.go` around lines 866 - 905,
The secret CRUD calls in ensureHFTokenSecret use ctx.Ctx directly and must be
bounded with a timeout; wrap calls to secrets.Get, Delete, Update, and Create
with a derived context created via context.WithTimeout(ctx.Ctx,
defaults.HTTPClientTimeout) (or the project's chosen timeout constant), assign
that to a local timedCtx and defer cancel() before each K8s call (or reuse one
timedCtx per function scope) and replace ctx.Ctx with timedCtx when invoking
secrets.Get/Delete/Update/Create to ensure the operations respect the configured
deadline.
recipes/overlays/h100-gke-cos-inference-dynamo.yaml (1)

74-88: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Throughput floor too loose for H100 regression detection.

The >= 50000 floor against measured ≈108,164 tok/s allows a ~54% throughput drop before the gate fails. This is the same loose floor reused across multiple accelerators, making it a shared minimum rather than an H100-specific regression gate as the inline comment describes.

Recommend tightening to >= 90000 to catch meaningful H100 regressions while tolerating normal run-to-run variance, consistent with the stated intent and the H100 EKS overlay.

🔧 Tighten H100 regression floor
         - name: inference-throughput
-          value: ">= 50000"
+          value: ">= 90000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/h100-gke-cos-inference-dynamo.yaml` around lines 74 - 88,
The inference throughput floor is too loose for H100; update the constraints
block for the H100 overlay by changing the inference-throughput constraint's
value from ">= 50000" to ">= 90000" so the gate detects meaningful H100
regressions; locate the constraints list (entries with name: inference-model,
name: inference-concurrency-per-gpu, name: inference-throughput) and only adjust
the value for inference-throughput to ">= 90000".
recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml (1)

74-88: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Throughput floor too loose for H100 regression detection.

The >= 50000 floor against measured ≈108,790 tok/s allows a ~54% throughput drop before the gate fails. Even with the evaluator's 10% tolerance, this permits a ~59% regression from baseline — too wide to catch meaningful H100 degradation as intended by the inline "catches real regressions without flaking" rationale.

Recommend tightening to >= 90000 to align with the measured baseline and provide the described margin while still catching real regressions on H100.

🔧 Tighten H100 regression floor
         - name: inference-throughput
-          value: ">= 50000"
+          value: ">= 90000"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml` around lines 74 - 88,
The inference throughput floor in the constraints block is too low: update the
constraint with name inference-throughput (currently value ">= 50000") to a
tighter floor ">= 90000" so the H100 gate aligns with the measured ~108,790
tok/s baseline and catches meaningful regressions as described in the header
comment.
recipes/validators/catalog.yaml (1)

70-79: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify timeout budget arithmetic.

The comment lists phases that sum to 55m (10+5+10+5+5+15), but states "50m worst-case phase sum." Either the arithmetic needs correction or the relationship between "model-cache populate" and "workload-ready" needs clarification (are they the same phase or distinct steps?).

🔍 Suggested clarification approach

If cache-populate and workload-ready are the same timeout, remove the duplicate:

-    # model-cache populate (up to WORKLOAD_READY_TIMEOUT, 10m; cache is on by
-    # default) + namespace-termination (5m) + workload-ready (10m) + health (5m)
+    # namespace-termination (5m) + workload-ready including model-cache populate
+    # (up to WORKLOAD_READY_TIMEOUT, 10m; cache is on by default) + health (5m)

Or if they're distinct, correct the sum to 55m and adjust the explanation accordingly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/validators/catalog.yaml` around lines 70 - 79, The comment above the
timeout: 65m value has inconsistent arithmetic — it lists phase durations that
sum to 55m (10+5+10+5+5+15) yet claims a "50m worst-case phase sum" and
references model-cache populate vs workload-ready ambiguously; update the
comment in recipes/validators/catalog.yaml to either (A) clarify that
"model-cache populate" and "workload-ready" are the same phase (remove the
duplicate duration) if that is true, or (B) correct the summed total to 55m and
adjust the explanatory text (including references to CheckExecutionTimeout and
WORKLOAD_READY_TIMEOUT) so the timeout rationale consistently explains why
timeout: 65m was chosen.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/model_cache.go`:
- Around line 175-183: In ensureModelCache, wrap the PVC and Job Create calls
with per-operation contexts instead of using ctx.Ctx directly: for the PVC
created via buildModelCachePVC and
ctx.Clientset.CoreV1().PersistentVolumeClaims(...).Create use a
context.WithTimeout(ctx.Ctx, defaults.DiagnosticTimeout) (remember to defer
cancel()) and pass that new ctx to Create; similarly, for the Job created via
buildModelCachePopulateJob and ctx.Clientset.BatchV1().Jobs(...).Create use
context.WithTimeout(ctx.Ctx, defaults.K8sJobCreationTimeout) (defer cancel())
and pass that timed context to Create; keep existing error handling (including
apierrors.IsAlreadyExists) unchanged.

---

Duplicate comments:
In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml`:
- Around line 74-88: The inference throughput floor in the constraints block is
too low: update the constraint with name inference-throughput (currently value
">= 50000") to a tighter floor ">= 90000" so the H100 gate aligns with the
measured ~108,790 tok/s baseline and catches meaningful regressions as described
in the header comment.

In `@recipes/overlays/h100-gke-cos-inference-dynamo.yaml`:
- Around line 74-88: The inference throughput floor is too loose for H100;
update the constraints block for the H100 overlay by changing the
inference-throughput constraint's value from ">= 50000" to ">= 90000" so the
gate detects meaningful H100 regressions; locate the constraints list (entries
with name: inference-model, name: inference-concurrency-per-gpu, name:
inference-throughput) and only adjust the value for inference-throughput to ">=
90000".

In `@recipes/validators/catalog.yaml`:
- Around line 70-79: The comment above the timeout: 65m value has inconsistent
arithmetic — it lists phase durations that sum to 55m (10+5+10+5+5+15) yet
claims a "50m worst-case phase sum" and references model-cache populate vs
workload-ready ambiguously; update the comment in
recipes/validators/catalog.yaml to either (A) clarify that "model-cache
populate" and "workload-ready" are the same phase (remove the duplicate
duration) if that is true, or (B) correct the summed total to 55m and adjust the
explanatory text (including references to CheckExecutionTimeout and
WORKLOAD_READY_TIMEOUT) so the timeout rationale consistently explains why
timeout: 65m was chosen.

In `@validators/performance/inference_perf_constraint.go`:
- Around line 1389-1393: Update the stale comment in resolveInferenceModel:
change the phrase that says it "falls back to the default smoke-test model" to
explicitly name the compiled default (e.g., "compiled default Qwen/Qwen3-8B")
and/or clarify that an unset/empty AICR_INFERENCE_PERF_MODEL will use the
compiled default model ID (Qwen/Qwen3-8B) so the precedence docs reflect the
current default behavior.
- Around line 866-905: The secret CRUD calls in ensureHFTokenSecret use ctx.Ctx
directly and must be bounded with a timeout; wrap calls to secrets.Get, Delete,
Update, and Create with a derived context created via
context.WithTimeout(ctx.Ctx, defaults.HTTPClientTimeout) (or the project's
chosen timeout constant), assign that to a local timedCtx and defer cancel()
before each K8s call (or reuse one timedCtx per function scope) and replace
ctx.Ctx with timedCtx when invoking secrets.Get/Delete/Update/Create to ensure
the operations respect the configured deadline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 27fe1d73-e447-453a-9ad7-94d78210ce25

📥 Commits

Reviewing files that changed from the base of the PR and between 7c5a33e and 0503224.

📒 Files selected for processing (22)
  • docs/contributor/data.md
  • docs/contributor/validator.md
  • docs/integrator/data-flow.md
  • docs/integrator/recipe-development.md
  • docs/user/cli-reference.md
  • docs/user/validation.md
  • pkg/defaults/timeouts.go
  • pkg/defaults/timeouts_test.go
  • pkg/validator/v1/job_plan_internal.go
  • pkg/validator/v1/job_plan_test.go
  • recipes/components/aws-ebs-csi-driver/values.yaml
  • recipes/overlays/b200-gke-cos-inference-dynamo.yaml
  • recipes/overlays/gb200-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/h100-gke-cos-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/validators/catalog.yaml
  • validators/performance/inference_perf_constraint.go
  • validators/performance/inference_perf_test.go
  • validators/performance/model_cache.go
  • validators/performance/model_cache_test.go
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Comment thread validators/performance/model_cache.go Outdated
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch from 211b4ca to e9b7691 Compare June 1, 2026 18:34
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch 3 times, most recently from da8238b to a24dffd Compare June 1, 2026 19:41
@yuanchen8911 yuanchen8911 changed the title [WIP] feat(validators): per-recipe inference config + model cache [WIP] feat(validators): enhance inference performance validation Jun 1, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch 4 times, most recently from 2f4b534 to 0801cfb Compare June 1, 2026 23:33
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 2, 2026 16:19
@yuanchen8911 yuanchen8911 requested review from a team as code owners June 2, 2026 16:19
@yuanchen8911 yuanchen8911 requested a review from mchmarny June 2, 2026 16:27

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecturally sound and unusually well-tested — fail-closed env parsing, the "$AICR_MODEL" shell-injection avoidance (with a direct $(...) test), the catalog-vs-facade timeout guard test, and the cgroup-v2 page-cache reasoning behind the populate container's missing memory limit are all excellent.

My only real gating item is a conscious decision on the cluster-wide default StorageClass flip (inline below); everything else is nits, one low-severity security note, and a couple of process confirmations.

Process confirmations before flipping out of Draft:

  1. recipes/components/aws-ebs-csi-driver/values.yaml changed — per CLAUDE.md, run make bom-docs and commit docs/user/container-images.md if it changes. Toggling defaultStorageClass.enabled renders only a StorageClass (no new images) so it's almost certainly a no-op, but bom-check isn't in the merge gate, so worth confirming.
  2. PR notes make qualify (e2e + grype) is still pending — that's the gate that matters most for an XL validator change that touches cluster-wide defaults.

LGTM once the StorageClass decision is settled and make qualify is green.

Comment thread recipes/components/aws-ebs-csi-driver/values.yaml
Comment thread pkg/validator/v1/job_plan_internal.go
Comment thread validators/performance/model_cache.go
Comment thread validators/performance/model_cache.go
Comment thread validators/performance/model_cache.go
Comment thread recipes/overlays/b200-gke-cos-inference-dynamo.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch 2 times, most recently from d14294d to 02679d7 Compare June 2, 2026 17:12
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

@njhensley thanks for the thorough review — here's how each point was addressed (all in 02679d79, rebased on current main):

# Comment Resolution
1 Cluster-wide default StorageClass — wants conscious sign-off Kept intentionally (EKS ships no default SC, so the model-cache PVC would otherwise hang Pending). Two-default caveat documented inline in values.yaml; added a prominent ⚠️ rollout-note callout in the PR description (blast radius beyond inference, two-default ambiguity, silent-bind risk) for explicit maintainer approval.
2 HF_TOKEN plaintext on validator pod (non-blocker) Considered the orchestrator-creates-Secret + secretKeyRef pattern and declined as an accepted trade-off: narrow upside (keeps it out of the pod spec, where get pods is more widely granted than get secrets) that doesn't remove the runtime plaintext, for a low-sensitivity rate-limit token in an ephemeral namespace. Rationale documented inline at pkg/validator/v1/job_plan_internal.go, with a "revisit for higher-privilege credentials" note.
3 cacheWorkerImage not run through ResolveImage (air-gap) Added inline note: pinned to nvcr.io, not registry-rewritten, imagePullSecrets only help for nvcr.io creds — matches the Dynamo worker image (drift-guarded by TestCacheWorkerImageMatchesTemplate); full mirror/air-gap support is a separate follow-up.
4 Frontend pod also mounts the RO cache (nit) Confirmed intentional (per-service injection) and harmless (co-located RWO PVC, full-repo download resolves offline); added a clarifying comment to injectModelCacheMounts.
5 Missing unit test for no-default-SC fail-fast Covered — TestEnsureModelCache_NoDefaultStorageClassFailsFast (fake clientset, no default SC, empty MODEL_CACHE_STORAGE_CLASS → asserts ErrCodeInvalidRequest) plus TestClusterHasDefaultStorageClass.
6 Empty checks list — skip not error? Confirmed clean: FilterEntriesByValidation returns an empty set, the phase loop is a no-op, and WarnPhasesAgainstRecipe only slog.Warns — so b200-gke-cos-inference-dynamo with checks: [] skips the performance gate cleanly.

Inline replies posted under each thread as well.

@yuanchen8911 yuanchen8911 requested a review from njhensley June 2, 2026 17:14
njhensley
njhensley previously approved these changes Jun 2, 2026

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch 2 times, most recently from 2217aaf to 4db0cdd Compare June 2, 2026 18:05
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) June 2, 2026 18:08

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work — the code, comments, and test coverage are all top-shelf. Every failure mode I went looking for (cgroup-v2 page-cache OOMKill, scalar YAML coercion of model IDs, shell injection via the model name, missing default StorageClass, two-default-SC ambiguity, namespace reuse leaving a stale HF_TOKEN, image registry-rewrite parity, RWO-PVC binding under WaitForFirstConsumer) was already addressed with a fail-closed pattern, an inline comment explaining the why, and a regression test.

Three medium notes and two nits inline. Nothing blocking the substance.

CI: 130 pass, 9 expected skips (Tier 3 nightly, the standard *-skip push-only jobs, CodeQL on a non-scheduled run) — fully green for what runs on a PR.

Leaving as COMMENT rather than APPROVE since the body says full make qualify (e2e + grype scan) is pending before ready-for-review. Happy to flip to APPROVE once those run.

One overall thought: the EBS-CSI default-SC change is the one piece of this PR with a blast radius beyond inference-perf — flagging it inline so the catalog doc gets a parallel note for operators reading from that direction.

Comment thread recipes/components/aws-ebs-csi-driver/values.yaml
Comment thread validators/performance/model_cache.go
Comment thread pkg/validator/v1/job_plan_internal.go Outdated
Comment thread validators/performance/model_cache.go
Comment thread validators/performance/model_cache.go
… default

Make the inference-perf benchmark configurable per recipe, default it to a
real-workload model, and back it with a model-weights cache so the performance
gate reflects GPU compute rather than serving overhead and works out of the box.

- Per-recipe inputs: inference-model and inference-concurrency-per-gpu read from
  validation.performance.constraints, resolving recipe > catalog env
  (AICR_INFERENCE_PERF_*) > compiled default — symmetric with the
  throughput/TTFT thresholds.
- Defaults: Qwen/Qwen3-8B at 256 concurrency/GPU (empirical throughput sweet
  spot across RTX PRO 6000, GB200, and H100 on EKS/GKE).
- Model-weights cache (PVC) on by default: one populate download shared by all
  workers (HF_HUB_OFFLINE), avoiding per-IP Hugging Face 429s; disable with
  AICR_INFERENCE_PERF_MODEL_CACHE_SIZE=off. Fails fast with guidance when no
  StorageClass is resolvable instead of hanging until timeout.
- EKS: aws-ebs-csi-driver provisions a default gp3 StorageClass
  (defaultStorageClass.enabled) so the cache works zero-config; GKE already has
  standard-rwo.
- Per-accelerator gate thresholds in the *-inference-dynamo overlays re-measured
  at 8B/256; model + concurrency pinned alongside them.
- Configurable workload-ready / endpoint-readiness timeouts for large-model
  bring-up.
- Security/scoping (Codex review): pass the model to AIPerf via the AICR_MODEL
  container env referenced as "$AICR_MODEL" (no shell interpolation / command
  substitution); forward HF_TOKEN only to the inference-perf validator.
- Edge-case hardening (review): account for the cache-populate phase in the
  inference-perf timeout budget (catalog 50m -> 65m); clear a stale HF-token
  Secret when HF_TOKEN is unset and carry resourceVersion on Secret update;
  reject non-positive cache sizes (e.g. 0Gi/-1Gi) with ErrCodeInvalidRequest;
  validate the resolved model ID against the HF repo-ID character set (reject
  YAML/shell metacharacters before it reaches the Dynamo template); raise the
  facade ValidationOperationTimeout to 75m (> the 65m catalog check timeout) and
  add a catalog-vs-facade guard test; recognize the legacy beta default-SC
  annotation; bound the model-cache PVC/Job and HF-token Secret API calls with
  named timeouts; give the cache-populate container CPU/memory requests for
  schedulability (no memory limit — a limit OOMKills large-model downloads via
  page cache on cgroup v2, confirmed on a live 8B GKE run); scale the
  inference-throughput gate to the free-GPU count actually benchmarked (the
  full-node gate would false-fail a healthy per-GPU result on a partially
  occupied node, since the workload is sized to free GPUs); preserve
  WaitForJobCompletion's timeout/unavailable classification on the model-cache
  populate Job (PropagateOrWrap instead of flattening to ErrCodeInternal);
  propagate the validator pod's imagePullSecrets to the model-cache populate Job
  (parity with the AIPerf Job — lets a private-mirror/air-gapped cluster pull
  the cache worker image).
- Docs: validator, validation, recipe-development, data, data-flow,
  cli-reference (incl. MODEL_CACHE_STORAGE_CLASS as a catalog/--data knob).

Refs NVIDIA#1116

- Review (CodeRabbit): clarify the H100 inference-throughput comments to state
  that >= 50000 is a deliberately conservative floor shared across the
  inference overlays (the evaluator already scales it to the benchmarked GPU
  count and applies a 10% tolerance), not an H100-tuned regression gate; disable
  the inference-perf gate on b200-gke-cos-inference-dynamo until a reference
  benchmark exists, since a hard gate with untested placeholder thresholds can
  pass against non-representative limits.
@yuanchen8911 yuanchen8911 force-pushed the feat/inference-perf-configurable-model branch from 81fd3d7 to 0b97d69 Compare June 2, 2026 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve inference performance validation: tunable, per-accelerator params wired into recipes

3 participants