feat(validators): enhance inference performance validation#1133
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR makes the inference-perf validator configurable per recipe and adds a PVC-backed model-weights cache. It wires inference-model and inference-concurrency-per-gpu from recipe constraints (recipe > catalog env > compiled default Qwen/Qwen3-8B at 256/GPU), forwards HF token only to inference-perf jobs, adds env-configurable readiness/health timeouts, validates tuning knobs fail-closed, pre-populates an optional model cache (PVC + populate Job) and injects it into workers, passes resolved model via AICR_MODEL to AIPerf (avoiding shell interpolation), updates overlays and catalog docs with measured gates, and enables a cluster-default StorageClass in the AWS EBS CSI driver chart. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/validator.md`:
- Around line 160-172: Update the example under validation.performance so the
inference-ttft-p99 threshold matches the later examples/overlay guidance; change
the value for the key inference-ttft-p99 in the YAML block to use the unified
threshold (e.g., "<= 1000" or "<= 2000" as used elsewhere) so it no longer
conflicts with the rest of the doc.
In `@docs/user/validation.md`:
- Around line 169-170: Update the wording in docs/user/validation.md where the
CLI example shows "aicr validate --data <overlay>" to correctly indicate the
expected external data directory (use "aicr validate --data <dir>" or similar),
i.e., replace "<overlay>" with "<dir>" in the paragraph referencing the `aicr
validate --data` usage so it clearly reflects that `--data` expects a directory
path rather than an overlay identifier.
In `@pkg/validator/v1/job_plan_internal.go`:
- Around line 104-110: The code appends catalog-provided env vars after the
orchestrator-forwarded HF_TOKEN, allowing entry.Env to override the forwarded
token; update the loop that appends entry.Env to skip any EnvVar with Name ==
"HF_TOKEN" (or generally dedupe by checking existing env slice for that name) so
the forwarded token set when entry.Name == inferencePerfCheckName always wins;
reference the variables/idents env, entry.Env, inferencePerfCheckName, and
corev1.EnvVar when making the change.
In `@recipes/overlays/b200-gke-cos-inference-dynamo.yaml`:
- Around line 76-92: The constraints block currently contains UNTESTED
PLACEHOLDER thresholds for the B200 overlay (see constraint names
inference-model, inference-concurrency-per-gpu, inference-throughput,
inference-ttft-p99) while conformance keeps the inference-perf gate enabled;
before merging either replace these placeholder values with measured B200
results from the referenced GKE testbed run (use real throughput and p99 TTFT
numbers) or remove/disable the inference-perf gate in conformance until
validated results are available so the recipe cannot pass with
non-representative limits.
In `@validators/performance/inference_perf_constraint.go`:
- Around line 878-881: The update path for the existing HF token Secret is
missing the required resourceVersion; before calling secrets.Update(ctx.Ctx,
secret, ... ) on the IsAlreadyExists branch, set secret.ResourceVersion to the
existing secret's ResourceVersion (metadata.resourceVersion) so the update
includes the current version; locate the AlreadyExists block around
secrets.Update and assign secret.ResourceVersion =
existingSecret.ResourceVersion (or the variable holding the fetched secret)
prior to calling secrets.Update to avoid API rejections.
In `@validators/performance/model_cache.go`:
- Around line 124-125: clusterHasDefaultStorageClass only checks the GA
annotation constant (defaultStorageClassAnnotation) and misses clusters using
the legacy beta key; update the check in clusterHasDefaultStorageClass (and any
use of defaultStorageClassAnnotation) to consider both
"storageclass.kubernetes.io/is-default-class" and
"storageclass.beta.kubernetes.io/is-default-class" as indicators of a default
StorageClass, returning true if either annotation on scs.Items[i].Annotations
equals "true".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 4675d810-4969-4fe1-b074-4b120dbf8e9f
📒 Files selected for processing (20)
docs/contributor/data.mddocs/contributor/validator.mddocs/integrator/data-flow.mddocs/integrator/recipe-development.mddocs/user/cli-reference.mddocs/user/validation.mdpkg/validator/v1/job_plan_internal.gopkg/validator/v1/job_plan_test.gorecipes/components/aws-ebs-csi-driver/values.yamlrecipes/overlays/b200-gke-cos-inference-dynamo.yamlrecipes/overlays/gb200-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-gke-cos-inference-dynamo.yamlrecipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yamlrecipes/validators/catalog.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.govalidators/performance/model_cache.govalidators/performance/model_cache_test.govalidators/performance/testdata/inference/dynamo-deployment.yaml
6bf61be to
badaa91
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml`:
- Around line 74-86: The inference-throughput floor "value: \">= 50000\"" is too
low for the H100 gate given the measured baseline ≈108,790 tok/s; update the
constraints block (constraint with name "inference-throughput") for the H100
overlay that targets model "inference-model: Qwen/Qwen3-8B" and concurrency
"inference-concurrency-per-gpu: \"256\"" to use a more appropriate floor (e.g.,
">= 90000") so the gate catches real regressions while still allowing the
evaluator's ~10% run-to-run variance; ensure this change is only applied to this
H100-specific overlay rather than a shared accelerator minimum.
In `@validators/performance/inference_perf_test.go`:
- Around line 992-1002: The test currently only checks that
ensureHFTokenSecret(ctx, ns) returns no error but doesn't assert the secret
content changed; after calling ensureHFTokenSecret, retrieve the Secret from the
fake client (via client.CoreV1().Secrets(ns).Get(...)) and assert that the
StringData or Data for key hfTokenSecretKey equals the new env value ("hf_new")
and that the Secret name matches hfTokenSecretName (and optionally
ResourceVersion changed), so the subtest verifies the secret was actually
updated by ensureHFTokenSecret.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: faa4ff86-372d-4ba0-b80b-f64639c00b56
📒 Files selected for processing (20)
docs/contributor/data.mddocs/contributor/validator.mddocs/integrator/data-flow.mddocs/integrator/recipe-development.mddocs/user/cli-reference.mddocs/user/validation.mdpkg/validator/v1/job_plan_internal.gopkg/validator/v1/job_plan_test.gorecipes/components/aws-ebs-csi-driver/values.yamlrecipes/overlays/b200-gke-cos-inference-dynamo.yamlrecipes/overlays/gb200-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-gke-cos-inference-dynamo.yamlrecipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yamlrecipes/validators/catalog.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.govalidators/performance/model_cache.govalidators/performance/model_cache_test.govalidators/performance/testdata/inference/dynamo-deployment.yaml
badaa91 to
7c5a33e
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/inference_perf_constraint.go`:
- Around line 873-905: The Secret API calls (secrets.Get/Delete/Update/Create)
use ctx.Ctx directly and need a bounded context; wrap this block in a timed
context using ctxWithTimeout, errCtx := context.WithTimeout(ctx.Ctx,
defaults.DiagnosticTimeout) (or another named default), defer errCtx() (cancel)
and replace ctx.Ctx with the derived context when calling secrets.Get,
secrets.Delete, secrets.Update and secrets.Create; keep hfTokenSecretName,
hfTokenSecretKey and the existing logic for exists/secret.ResourceVersion
unchanged so only the context usage for these I/O calls is updated.
In `@validators/performance/model_cache.go`:
- Around line 175-186: The PVC/Job creation logic (buildModelCachePVC,
buildModelCachePopulateJob, pvc, jobName, modelCachePopulateJobPrefix) currently
reuses existing artifacts when apierrors.IsAlreadyExists(err), causing stale
caches for different config.model or cache settings; instead compute a short
config hash or generation key from the resolved config (including config.model,
storage class, cache flags, runID/generation), add it as an annotation or suffix
to the PVC and Job, and in the AlreadyExists path fetch the existing PVC/Job,
compare the stored annotation/hash to the current one and if they differ delete
and recreate the resource (or always create with the hash-suffixed name) so the
populate Job and PVC are regenerated when config changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1451a4e2-bbb7-4198-b378-0f6e089df059
📒 Files selected for processing (20)
docs/contributor/data.mddocs/contributor/validator.mddocs/integrator/data-flow.mddocs/integrator/recipe-development.mddocs/user/cli-reference.mddocs/user/validation.mdpkg/validator/v1/job_plan_internal.gopkg/validator/v1/job_plan_test.gorecipes/components/aws-ebs-csi-driver/values.yamlrecipes/overlays/b200-gke-cos-inference-dynamo.yamlrecipes/overlays/gb200-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-gke-cos-inference-dynamo.yamlrecipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yamlrecipes/validators/catalog.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.govalidators/performance/model_cache.govalidators/performance/model_cache_test.govalidators/performance/testdata/inference/dynamo-deployment.yaml
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml (1)
74-88:⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoffThroughput floor inconsistent with measured baseline and inline rationale.
The comment states the floor carries "margin beyond the evaluator's 10% tolerance" to catch regressions, but
>= 50000against measured≈ 108,790 tok/spermits a ~54% throughput drop before failing. With the evaluator's 10% tolerance, the effective passing floor is ~98k tok/s * 0.9 = ~88k, yet the gate only enforces 50k. This 50k value is reused across accelerators with very different baselines (RTX PRO 6000 @ 59k, GB200 @ 66k, H100 @ 108k), making it a shared accelerator-family minimum rather than an H100-specific regression gate.Recommend tightening to
>= 90000so the gate reflects H100 capability and catches meaningful regressions per the inline rationale.🔧 Suggested tighter H100 floor
- name: inference-throughput - value: ">= 50000" + value: ">= 90000"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml` around lines 74 - 88, The throughput floor for the constraint named inference-throughput is far too low for the H100 baseline and contradicts the inline rationale; update the constraint value from ">= 50000" to ">= 90000" (or make the gate accelerator-specific) so the gate reflects the H100 measured throughput (~108,790 tok/s) and enforces a meaningful regression threshold consistent with the evaluator's ~10% tolerance; ensure the change is applied to the constraints block where inference-throughput is defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/h100-gke-cos-inference-dynamo.yaml`:
- Around line 74-88: The throughput floor in the constraints block
("inference-throughput" value) is too low (">= 50000") relative to the stated
H100 baseline (~108,164 tok/s) and the inline rationale; update the
inference-throughput constraint to a tighter H100-specific floor (recommend
setting value to ">= 90000") so the gate captures meaningful regressions and
aligns with the comment's described margin beyond the evaluator's 10% tolerance.
In `@recipes/validators/catalog.yaml`:
- Around line 70-79: The comment around the timeout: 65m value is ambiguous
about whether "cache-populate (10m)" is included in the 50m baseline or counted
separately; update the comment near the timeout field to explicitly state that
the 50m worst-case sum includes the nominal WORKLOAD_READY_TIMEOUT (10m) for
cache-populate but the extra 15m buffer (65m − 50m) is intended to accommodate
cache-populate overruns and deferred cleanup, and reference the relevant knobs
(CheckExecutionTimeout, WORKLOAD_READY_TIMEOUT, cache-populate) so readers
understand that large-model cache-populate can exceed its nominal 10m and the
extra time covers that overrun plus cleanup.
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1389-1393: Update the stale comment for resolveInferenceModel so
it no longer calls the fallback the “default smoke-test model”; change the
wording to state that unsetting or empty AICR_INFERENCE_PERF_MODEL falls back to
the real-workload default (Qwen/Qwen3-8B). Mention AICR_INFERENCE_PERF_MODEL and
resolveInferenceModel in the comment to make it clear which override and
function are affected.
In `@validators/performance/model_cache.go`:
- Around line 241-258: The populate Job's container (Name: "populate", Image:
cacheWorkerImage) lacks resource requests/limits; add a Resources field on that
v1.Container with a v1.ResourceRequirements struct setting sensible Requests and
Limits (e.g. cpu and memory) so the pod can be scheduled and prevented from
OOMing during snapshot_download; update the Container literal inside the model
cache job creation in validators/performance/model_cache.go to include
Resources: v1.ResourceRequirements{Requests: v1.ResourceList{"cpu":
resource.MustParse("250m"), "memory": resource.MustParse("512Mi")}, Limits:
v1.ResourceList{"cpu": resource.MustParse("1"), "memory":
resource.MustParse("4Gi")}} (adjust values as needed or make them configurable).
---
Duplicate comments:
In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml`:
- Around line 74-88: The throughput floor for the constraint named
inference-throughput is far too low for the H100 baseline and contradicts the
inline rationale; update the constraint value from ">= 50000" to ">= 90000" (or
make the gate accelerator-specific) so the gate reflects the H100 measured
throughput (~108,790 tok/s) and enforces a meaningful regression threshold
consistent with the evaluator's ~10% tolerance; ensure the change is applied to
the constraints block where inference-throughput is defined.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d1809639-6996-4f96-abc7-85bf3568d5f0
📒 Files selected for processing (22)
docs/contributor/data.mddocs/contributor/validator.mddocs/integrator/data-flow.mddocs/integrator/recipe-development.mddocs/user/cli-reference.mddocs/user/validation.mdpkg/defaults/timeouts.gopkg/defaults/timeouts_test.gopkg/validator/v1/job_plan_internal.gopkg/validator/v1/job_plan_test.gorecipes/components/aws-ebs-csi-driver/values.yamlrecipes/overlays/b200-gke-cos-inference-dynamo.yamlrecipes/overlays/gb200-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-gke-cos-inference-dynamo.yamlrecipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yamlrecipes/validators/catalog.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.govalidators/performance/model_cache.govalidators/performance/model_cache_test.govalidators/performance/testdata/inference/dynamo-deployment.yaml
c4deb44 to
211b4ca
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (5)
validators/performance/inference_perf_constraint.go (2)
1389-1393:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix the stale default-model wording.
Line 1392 still says this falls back to the "default smoke-test model", but the compiled default is now
Qwen/Qwen3-8B. That wording is now misleading when reading the precedence docs in code.As per coding guidelines, "Ensure code comments are accurate and helpful".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/performance/inference_perf_constraint.go` around lines 1389 - 1393, Update the stale comment in resolveInferenceModel: change the phrase that says it "falls back to the default smoke-test model" to explicitly name the compiled default (e.g., "compiled default Qwen/Qwen3-8B") and/or clarify that an unset/empty AICR_INFERENCE_PERF_MODEL will use the compiled default model ID (Qwen/Qwen3-8B) so the precedence docs reflect the current default behavior.
866-905:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBound the Secret CRUD calls with a named timeout.
These Kubernetes
Get/Delete/Update/Createcalls still usectx.Ctxdirectly, so a slow apiserver can consume most of the validator budget before workload setup even starts. Use a derived timeout context here, like the other K8s writes in this file.Suggested fix
func ensureHFTokenSecret(ctx *validators.Context, namespace string) error { secrets := ctx.Clientset.CoreV1().Secrets(namespace) token := strings.TrimSpace(os.Getenv(envHFToken)) + secretCtx, cancel := context.WithTimeout(ctx.Ctx, defaults.DiagnosticTimeout) + defer cancel() // Read any existing secret first so an update carries the current // resourceVersion (an Update without it can be rejected by the apiserver) // and so an unset token can clear a stale one. - existing, getErr := secrets.Get(ctx.Ctx, hfTokenSecretName, metav1.GetOptions{}) + existing, getErr := secrets.Get(secretCtx, hfTokenSecretName, metav1.GetOptions{}) if getErr != nil && !apierrors.IsNotFound(getErr) { return errors.Wrap(errors.ErrCodeInternal, "failed to read HF token secret", getErr) } exists := getErr == nil @@ if token == "" { // Anonymous run: delete any leftover token so a reused per-run namespace // can't silently inject stale credentials via the workers' optional // secretKeyRefs. if exists { - if err := secrets.Delete(ctx.Ctx, hfTokenSecretName, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) { + if err := secrets.Delete(secretCtx, hfTokenSecretName, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) { return errors.Wrap(errors.ErrCodeInternal, "failed to delete stale HF token secret", err) } slog.Info("Cleared stale Hugging Face token secret (HF_TOKEN unset; anonymous downloads)", "namespace", namespace) } @@ } if exists { secret.ResourceVersion = existing.ResourceVersion - if _, err := secrets.Update(ctx.Ctx, secret, metav1.UpdateOptions{}); err != nil { + if _, err := secrets.Update(secretCtx, secret, metav1.UpdateOptions{}); err != nil { return errors.Wrap(errors.ErrCodeInternal, "failed to update HF token secret", err) } return nil } - if _, err := secrets.Create(ctx.Ctx, secret, metav1.CreateOptions{}); err != nil { + if _, err := secrets.Create(secretCtx, secret, metav1.CreateOptions{}); err != nil { return errors.Wrap(errors.ErrCodeInternal, "failed to create HF token secret", err) }As per coding guidelines, "All I/O operations must use a context with timeout" and "Use
context.WithTimeout(ctx, timeout)with named constants likedefaults.HTTPClientTimeoutordefaults.ServerHandlerTimeoutfor all I/O deadlines".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/performance/inference_perf_constraint.go` around lines 866 - 905, The secret CRUD calls in ensureHFTokenSecret use ctx.Ctx directly and must be bounded with a timeout; wrap calls to secrets.Get, Delete, Update, and Create with a derived context created via context.WithTimeout(ctx.Ctx, defaults.HTTPClientTimeout) (or the project's chosen timeout constant), assign that to a local timedCtx and defer cancel() before each K8s call (or reuse one timedCtx per function scope) and replace ctx.Ctx with timedCtx when invoking secrets.Get/Delete/Update/Create to ensure the operations respect the configured deadline.recipes/overlays/h100-gke-cos-inference-dynamo.yaml (1)
74-88:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winThroughput floor too loose for H100 regression detection.
The
>= 50000floor against measured≈108,164 tok/sallows a ~54% throughput drop before the gate fails. This is the same loose floor reused across multiple accelerators, making it a shared minimum rather than an H100-specific regression gate as the inline comment describes.Recommend tightening to
>= 90000to catch meaningful H100 regressions while tolerating normal run-to-run variance, consistent with the stated intent and the H100 EKS overlay.🔧 Tighten H100 regression floor
- name: inference-throughput - value: ">= 50000" + value: ">= 90000"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/h100-gke-cos-inference-dynamo.yaml` around lines 74 - 88, The inference throughput floor is too loose for H100; update the constraints block for the H100 overlay by changing the inference-throughput constraint's value from ">= 50000" to ">= 90000" so the gate detects meaningful H100 regressions; locate the constraints list (entries with name: inference-model, name: inference-concurrency-per-gpu, name: inference-throughput) and only adjust the value for inference-throughput to ">= 90000".recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml (1)
74-88:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winThroughput floor too loose for H100 regression detection.
The
>= 50000floor against measured≈108,790 tok/sallows a ~54% throughput drop before the gate fails. Even with the evaluator's 10% tolerance, this permits a ~59% regression from baseline — too wide to catch meaningful H100 degradation as intended by the inline "catches real regressions without flaking" rationale.Recommend tightening to
>= 90000to align with the measured baseline and provide the described margin while still catching real regressions on H100.🔧 Tighten H100 regression floor
- name: inference-throughput - value: ">= 50000" + value: ">= 90000"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml` around lines 74 - 88, The inference throughput floor in the constraints block is too low: update the constraint with name inference-throughput (currently value ">= 50000") to a tighter floor ">= 90000" so the H100 gate aligns with the measured ~108,790 tok/s baseline and catches meaningful regressions as described in the header comment.recipes/validators/catalog.yaml (1)
70-79:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClarify timeout budget arithmetic.
The comment lists phases that sum to 55m (10+5+10+5+5+15), but states "50m worst-case phase sum." Either the arithmetic needs correction or the relationship between "model-cache populate" and "workload-ready" needs clarification (are they the same phase or distinct steps?).
🔍 Suggested clarification approach
If cache-populate and workload-ready are the same timeout, remove the duplicate:
- # model-cache populate (up to WORKLOAD_READY_TIMEOUT, 10m; cache is on by - # default) + namespace-termination (5m) + workload-ready (10m) + health (5m) + # namespace-termination (5m) + workload-ready including model-cache populate + # (up to WORKLOAD_READY_TIMEOUT, 10m; cache is on by default) + health (5m)Or if they're distinct, correct the sum to 55m and adjust the explanation accordingly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/validators/catalog.yaml` around lines 70 - 79, The comment above the timeout: 65m value has inconsistent arithmetic — it lists phase durations that sum to 55m (10+5+10+5+5+15) yet claims a "50m worst-case phase sum" and references model-cache populate vs workload-ready ambiguously; update the comment in recipes/validators/catalog.yaml to either (A) clarify that "model-cache populate" and "workload-ready" are the same phase (remove the duplicate duration) if that is true, or (B) correct the summed total to 55m and adjust the explanatory text (including references to CheckExecutionTimeout and WORKLOAD_READY_TIMEOUT) so the timeout rationale consistently explains why timeout: 65m was chosen.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/model_cache.go`:
- Around line 175-183: In ensureModelCache, wrap the PVC and Job Create calls
with per-operation contexts instead of using ctx.Ctx directly: for the PVC
created via buildModelCachePVC and
ctx.Clientset.CoreV1().PersistentVolumeClaims(...).Create use a
context.WithTimeout(ctx.Ctx, defaults.DiagnosticTimeout) (remember to defer
cancel()) and pass that new ctx to Create; similarly, for the Job created via
buildModelCachePopulateJob and ctx.Clientset.BatchV1().Jobs(...).Create use
context.WithTimeout(ctx.Ctx, defaults.K8sJobCreationTimeout) (defer cancel())
and pass that timed context to Create; keep existing error handling (including
apierrors.IsAlreadyExists) unchanged.
---
Duplicate comments:
In `@recipes/overlays/h100-eks-ubuntu-inference-dynamo.yaml`:
- Around line 74-88: The inference throughput floor in the constraints block is
too low: update the constraint with name inference-throughput (currently value
">= 50000") to a tighter floor ">= 90000" so the H100 gate aligns with the
measured ~108,790 tok/s baseline and catches meaningful regressions as described
in the header comment.
In `@recipes/overlays/h100-gke-cos-inference-dynamo.yaml`:
- Around line 74-88: The inference throughput floor is too loose for H100;
update the constraints block for the H100 overlay by changing the
inference-throughput constraint's value from ">= 50000" to ">= 90000" so the
gate detects meaningful H100 regressions; locate the constraints list (entries
with name: inference-model, name: inference-concurrency-per-gpu, name:
inference-throughput) and only adjust the value for inference-throughput to ">=
90000".
In `@recipes/validators/catalog.yaml`:
- Around line 70-79: The comment above the timeout: 65m value has inconsistent
arithmetic — it lists phase durations that sum to 55m (10+5+10+5+5+15) yet
claims a "50m worst-case phase sum" and references model-cache populate vs
workload-ready ambiguously; update the comment in
recipes/validators/catalog.yaml to either (A) clarify that "model-cache
populate" and "workload-ready" are the same phase (remove the duplicate
duration) if that is true, or (B) correct the summed total to 55m and adjust the
explanatory text (including references to CheckExecutionTimeout and
WORKLOAD_READY_TIMEOUT) so the timeout rationale consistently explains why
timeout: 65m was chosen.
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1389-1393: Update the stale comment in resolveInferenceModel:
change the phrase that says it "falls back to the default smoke-test model" to
explicitly name the compiled default (e.g., "compiled default Qwen/Qwen3-8B")
and/or clarify that an unset/empty AICR_INFERENCE_PERF_MODEL will use the
compiled default model ID (Qwen/Qwen3-8B) so the precedence docs reflect the
current default behavior.
- Around line 866-905: The secret CRUD calls in ensureHFTokenSecret use ctx.Ctx
directly and must be bounded with a timeout; wrap calls to secrets.Get, Delete,
Update, and Create with a derived context created via
context.WithTimeout(ctx.Ctx, defaults.HTTPClientTimeout) (or the project's
chosen timeout constant), assign that to a local timedCtx and defer cancel()
before each K8s call (or reuse one timedCtx per function scope) and replace
ctx.Ctx with timedCtx when invoking secrets.Get/Delete/Update/Create to ensure
the operations respect the configured deadline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 27fe1d73-e447-453a-9ad7-94d78210ce25
📒 Files selected for processing (22)
docs/contributor/data.mddocs/contributor/validator.mddocs/integrator/data-flow.mddocs/integrator/recipe-development.mddocs/user/cli-reference.mddocs/user/validation.mdpkg/defaults/timeouts.gopkg/defaults/timeouts_test.gopkg/validator/v1/job_plan_internal.gopkg/validator/v1/job_plan_test.gorecipes/components/aws-ebs-csi-driver/values.yamlrecipes/overlays/b200-gke-cos-inference-dynamo.yamlrecipes/overlays/gb200-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/h100-gke-cos-inference-dynamo.yamlrecipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yamlrecipes/validators/catalog.yamlvalidators/performance/inference_perf_constraint.govalidators/performance/inference_perf_test.govalidators/performance/model_cache.govalidators/performance/model_cache_test.govalidators/performance/testdata/inference/dynamo-deployment.yaml
211b4ca to
e9b7691
Compare
da8238b to
a24dffd
Compare
2f4b534 to
0801cfb
Compare
njhensley
left a comment
There was a problem hiding this comment.
Architecturally sound and unusually well-tested — fail-closed env parsing, the "$AICR_MODEL" shell-injection avoidance (with a direct $(...) test), the catalog-vs-facade timeout guard test, and the cgroup-v2 page-cache reasoning behind the populate container's missing memory limit are all excellent.
My only real gating item is a conscious decision on the cluster-wide default StorageClass flip (inline below); everything else is nits, one low-severity security note, and a couple of process confirmations.
Process confirmations before flipping out of Draft:
recipes/components/aws-ebs-csi-driver/values.yamlchanged — per CLAUDE.md, runmake bom-docsand commitdocs/user/container-images.mdif it changes. TogglingdefaultStorageClass.enabledrenders only a StorageClass (no new images) so it's almost certainly a no-op, butbom-checkisn't in the merge gate, so worth confirming.- PR notes
make qualify(e2e + grype) is still pending — that's the gate that matters most for an XL validator change that touches cluster-wide defaults.
LGTM once the StorageClass decision is settled and make qualify is green.
d14294d to
02679d7
Compare
|
@njhensley thanks for the thorough review — here's how each point was addressed (all in
Inline replies posted under each thread as well. |
2217aaf to
4db0cdd
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Excellent work — the code, comments, and test coverage are all top-shelf. Every failure mode I went looking for (cgroup-v2 page-cache OOMKill, scalar YAML coercion of model IDs, shell injection via the model name, missing default StorageClass, two-default-SC ambiguity, namespace reuse leaving a stale HF_TOKEN, image registry-rewrite parity, RWO-PVC binding under WaitForFirstConsumer) was already addressed with a fail-closed pattern, an inline comment explaining the why, and a regression test.
Three medium notes and two nits inline. Nothing blocking the substance.
CI: 130 pass, 9 expected skips (Tier 3 nightly, the standard *-skip push-only jobs, CodeQL on a non-scheduled run) — fully green for what runs on a PR.
Leaving as COMMENT rather than APPROVE since the body says full make qualify (e2e + grype scan) is pending before ready-for-review. Happy to flip to APPROVE once those run.
One overall thought: the EBS-CSI default-SC change is the one piece of this PR with a blast radius beyond inference-perf — flagging it inline so the catalog doc gets a parallel note for operators reading from that direction.
4db0cdd to
81fd3d7
Compare
… default Make the inference-perf benchmark configurable per recipe, default it to a real-workload model, and back it with a model-weights cache so the performance gate reflects GPU compute rather than serving overhead and works out of the box. - Per-recipe inputs: inference-model and inference-concurrency-per-gpu read from validation.performance.constraints, resolving recipe > catalog env (AICR_INFERENCE_PERF_*) > compiled default — symmetric with the throughput/TTFT thresholds. - Defaults: Qwen/Qwen3-8B at 256 concurrency/GPU (empirical throughput sweet spot across RTX PRO 6000, GB200, and H100 on EKS/GKE). - Model-weights cache (PVC) on by default: one populate download shared by all workers (HF_HUB_OFFLINE), avoiding per-IP Hugging Face 429s; disable with AICR_INFERENCE_PERF_MODEL_CACHE_SIZE=off. Fails fast with guidance when no StorageClass is resolvable instead of hanging until timeout. - EKS: aws-ebs-csi-driver provisions a default gp3 StorageClass (defaultStorageClass.enabled) so the cache works zero-config; GKE already has standard-rwo. - Per-accelerator gate thresholds in the *-inference-dynamo overlays re-measured at 8B/256; model + concurrency pinned alongside them. - Configurable workload-ready / endpoint-readiness timeouts for large-model bring-up. - Security/scoping (Codex review): pass the model to AIPerf via the AICR_MODEL container env referenced as "$AICR_MODEL" (no shell interpolation / command substitution); forward HF_TOKEN only to the inference-perf validator. - Edge-case hardening (review): account for the cache-populate phase in the inference-perf timeout budget (catalog 50m -> 65m); clear a stale HF-token Secret when HF_TOKEN is unset and carry resourceVersion on Secret update; reject non-positive cache sizes (e.g. 0Gi/-1Gi) with ErrCodeInvalidRequest; validate the resolved model ID against the HF repo-ID character set (reject YAML/shell metacharacters before it reaches the Dynamo template); raise the facade ValidationOperationTimeout to 75m (> the 65m catalog check timeout) and add a catalog-vs-facade guard test; recognize the legacy beta default-SC annotation; bound the model-cache PVC/Job and HF-token Secret API calls with named timeouts; give the cache-populate container CPU/memory requests for schedulability (no memory limit — a limit OOMKills large-model downloads via page cache on cgroup v2, confirmed on a live 8B GKE run); scale the inference-throughput gate to the free-GPU count actually benchmarked (the full-node gate would false-fail a healthy per-GPU result on a partially occupied node, since the workload is sized to free GPUs); preserve WaitForJobCompletion's timeout/unavailable classification on the model-cache populate Job (PropagateOrWrap instead of flattening to ErrCodeInternal); propagate the validator pod's imagePullSecrets to the model-cache populate Job (parity with the AIPerf Job — lets a private-mirror/air-gapped cluster pull the cache worker image). - Docs: validator, validation, recipe-development, data, data-flow, cli-reference (incl. MODEL_CACHE_STORAGE_CLASS as a catalog/--data knob). Refs NVIDIA#1116 - Review (CodeRabbit): clarify the H100 inference-throughput comments to state that >= 50000 is a deliberately conservative floor shared across the inference overlays (the evaluator already scales it to the benchmarked GPU count and applies a 10% tolerance), not an H100-tuned regression gate; disable the inference-perf gate on b200-gke-cos-inference-dynamo until a reference benchmark exists, since a hard gate with untested placeholder thresholds can pass against non-representative limits.
81fd3d7 to
0b97d69
Compare
Summary
Makes the
inference-perfperformance check configurable per recipe, defaults itto a real-workload model, and backs it with a model-weights cache — so
aicr validate --phase performancereflects GPU compute (not serving overhead)and works out of the box. Three headline changes: per-recipe model/concurrency,
model cache on by default, and per-accelerator gate thresholds re-measured on
real hardware.
Motivation / Context
Fixes #1116.
The old check ran a tiny smoke-test model (Qwen3-0.6B, conc 16) that never loaded
the GPU and used global-only knobs. Characterization across RTX PRO 6000, H100
(EKS + GKE), and GB200 showed an 8B model at conc 256/GPU is the throughput sweet
spot and that the per-accelerator thresholds needed real numbers.
Type of Change
Component(s) Affected
pkg/recipe, recipes/overlays, recipes/validators/catalog.yaml)pkg/validator,validators/performance)docs/)Implementation Notes
inference-modelandinference-concurrency-per-gpureadfrom
validation.performance.constraints, resolving recipe > catalog env > default(symmetric with the throughput/TTFT thresholds). Defaults: Qwen/Qwen3-8B at 256/GPU.
workers (offline), avoiding per-IP HF 429s;
…MODEL_CACHE_SIZE=offto disable.Fails fast with guidance when no StorageClass is resolvable instead of hanging.
aws-ebs-csi-driverprovisions a default gp3 SC(
defaultStorageClass.enabled); GKE already hasstandard-rwo.*-inference-dynamooverlays re-measured at8B/256, with model + concurrency pinned alongside.
(
AICR_INFERENCE_PERF_WORKLOAD_READY_TIMEOUT,…HEALTH_TIMEOUT) for large-modelbring-up. The inference-perf catalog timeout budget now accounts for the
cache-populate phase (50m → 65m), and the facade
ValidationOperationTimeoutwasraised to 75m (> the largest catalog check timeout) with a catalog-vs-facade
guard test.
AICR_MODELenv referenced as
"$AICR_MODEL"(no shell command-substitution);HF_TOKENforwarded only to the inference-perf validator.
YAML/shell metacharacters before it reaches the Dynamo template.
SERVED_MODEL_NAME: "${MODEL}"quoted in the deploy template so ascalar-looking model ID (pure-numeric / boolean-like) parses as a string, not
int/bool/null (+ parse test).
(
100m/512Mi) so it schedules under ResourceQuota / LimitRange — butno memory limit: it streams tens of GB to the PVC and a hard limit
OOMKills large-model downloads via page cache on cgroup v2 (caught on a
live 8B GKE run).
inference-throughputgate to the free-GPU count actuallybenchmarked. The workload is sized to a node's free GPUs, so on a shared
node the absolute full-node gate would false-fail a healthy per-GPU result;
the gate is scaled by
freeGPUs / nodeGPUs(TTFT p99, a latency, is notscaled). Full-node runs are a no-op.
classification (
PropagateOrWrapinstead of flattening toErrCodeInternal).imagePullSecretsto the model-cache populateJob (parity with the AIPerf Job) so private-mirror / air-gapped clusters can
pull the cache worker image.
0Gi/-1Gi) withErrCodeInvalidRequest.resourceVersionon Secret update; bound the cache PVC/Job and HF-token Secret API calls with
named timeouts.
Testing
Commands run:
make test(-race) — pass; total coverage 76.4% (≥ 70% floor).make lint(golangci-lint + yamllint + license/agents-sync) — 0 issues.Validated end to end on live clusters with the compiled defaults (Qwen3-8B,
256 concurrency/GPU, cache on; no tuning flags beyond the per-region
registry/tag, plus
MODEL_CACHE_STORAGE_CLASS=gp2on GB200 which has no defaultStorageClass):
All passed their per-accelerator gates. GB200 reaches only ~80% util on 8B (it is
over-provisioned for a model this small) — the documented case for the remaining
precision/TP work. The on-by-default cache bound to each cluster's StorageClass
(
standard-rwoon GKE,ebs-csi-default-scon the AICR-provisioned EKS,gp2on GB200), and the no-default-SC fast-fail was confirmed on GB200 (~6 s, with
actionable guidance).
Re-verified the head-of-branch build end to end on GKE H100 (validator image
rebuilt from the latest commit): per-recipe model/concurrency resolved from the
recipe, cache on by default bound zero-config to the cluster's
standard-rwodefault SC (100Gi RWO
aicr-model-cache), and the populate container scheduledwith its new
100m/512Mirequests +4Gilimit.make qualify(e2e + grype scan).Risk Assessment
EBS-CSI default-SC; default model/concurrency changed (heavier default run).
Rollout notes: the default inference-perf run now uses 8B at conc 256 with the
cache on (slower, ~6–7 min, benefits from the cache); disable via
AICR_INFERENCE_PERF_MODEL_CACHE_SIZE=off.HF_TOKEN forwarding (reviewed, accepted): the token reaches the validator pod
as a plaintext pod-spec env (workers/cache Job already use
secretKeyRef). ThesecretKeyRef-on-validator alternative was considered and declined — narrowupside, doesn't remove the runtime plaintext, low-sensitivity rate-limit token in
an ephemeral per-run namespace. The trade-off is documented inline at
pkg/validator/v1/job_plan_internal.go.Checklist
make testwith-race)make lint)git commit -S)