fix(validators): bump dynamo runtime image 0.9.0 -> 1.0.2 (fixes #1192) #1193
Pull request was converted to draft
The inference-perf validator intermittently times out on healthy clusters
(~50% of runs) waiting for the inference endpoint to serve, even though the
DynamoGraphDeployment reports successful and all workers are Ready. Root cause
is a known upstream dynamo bug: dynamo-frontend 0.9.0 hits a discovery-stream
panic ("Unfold must not be polled after it returned Poll::Ready(None)" in
futures-util 0.3.31), which leaves the v1/instances KV bucket unpopulated so
the HTTP router never registers worker backends (ai-dynamo/dynamo#7328).
The upstream fix (futures-util -> 0.3.32, ai-dynamo/dynamo#7346) first shipped
in dynamo v1.0.0 and was never backported to the 0.9.x line. AICR already pins
the dynamo-platform operator chart at 1.0.2 (recipes/registry.yaml), but the
workload runtime image tags stayed at 0.9.0 -- so a 1.0.2 operator scheduled
0.9.0 pods carrying the panic.
Align the workload runtime images with the operator by bumping every
dynamo-frontend / vllm-runtime pin from 0.9.0 to 1.0.2:
- validators/performance/testdata/inference/dynamo-deployment.yaml
- validators/performance/model_cache.go (cache-warmer image)
- tests/manifests/dynamo-vllm-smoke-test.yaml (+ stale version comment)
- demos/workloads/inference/vllm-agg.yaml
- pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml
Also refresh stale 0.9.0 chart-version comments in
tests/chainsaw/ai-conformance/cluster/assert-dynamo.yaml (comment-only; the
file asserts only that the operator/grove Deployments are Available).
The DGD CRD apiVersion (nvidia.com/v1alpha1) is unchanged -- the 1.0.2 operator
already serves it, which is what the validator has been running against.
Verified on b40 (RTX PRO 6000) at 2048 concurrency: endpoint served with zero
Unfold panics / zero "bucket missing" events (vs 24 / 97 on 0.9.0), throughput
73,993 tok/s, TTFT p99 537 ms, phase passed.
Fixes NVIDIA#1192
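The panic and "bucket missing" counts quoted above can be reproduced from a captured frontend log with a simple line scan. The following is a minimal sketch, not the tooling used for this PR: the function and type names are hypothetical, and the marker substrings are taken from the log lines quoted in this message.

```go
package main

import (
	"bufio"
	"strings"
)

// Marker substrings copied from the log lines quoted in the commit message.
const (
	unfoldPanicMarker   = "Unfold must not be polled after it returned"
	bucketMissingMarker = "bucket missing"
)

// MarkerCounts holds how many log lines contained each marker.
// (Hypothetical type, for illustration only.)
type MarkerCounts struct {
	UnfoldPanics  int
	BucketMissing int
}

// CountMarkers scans log text line by line and counts the lines that
// contain each marker. It returns the scanner error, if any.
func CountMarkers(logs string) (MarkerCounts, error) {
	var c MarkerCounts
	sc := bufio.NewScanner(strings.NewReader(logs))
	// Raise the per-line limit so long log lines are not rejected.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, unfoldPanicMarker) {
			c.UnfoldPanics++
		}
		if strings.Contains(line, bucketMissingMarker) {
			c.BucketMissing++
		}
	}
	return c, sc.Err()
}
```

Feeding it the frontend pod log from a 0.9.0 run and a 1.0.2 run would give the before/after pair; both counts being zero on 1.0.2 is the signal the verification relies on.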
mchmarny
left a comment
Tight, well-evidenced fix. Root cause traced cleanly to the upstream futures-util 0.3.31 Unfold bug (fixed in dynamo 1.0.0), and the before/after capture (panic markers 24→0, 97→0; TIMEOUT→PASS) is deterministic enough to put the flake to bed. Sweep confirms all 9 refs bumped — no stragglers. Splitting the pkg/defaults centralization into a follow-up is the right call. Rebase before merge.
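A sweep like the one mentioned in the review can be sketched as a regular-expression scan over file contents. This is an illustrative assumption about how such a sweep might look, not the command that was actually run: `FindStaleRefs`, `StaleRef`, and the in-memory `files` map are hypothetical, and only the two image basenames named in this PR are matched.

```go
package main

import (
	"regexp"
	"sort"
)

// imageRef matches the two dynamo image basenames named in this PR,
// followed by a tag. Group 1 is the basename, group 2 is the tag.
var imageRef = regexp.MustCompile(`\b(dynamo-frontend|vllm-runtime):([A-Za-z0-9._-]+)`)

// StaleRef describes one image reference whose tag differs from the wanted one.
// (Hypothetical type, for illustration only.)
type StaleRef struct {
	File  string
	Image string
	Tag   string
}

// FindStaleRefs scans file contents (keyed by file name) and returns every
// matched image reference whose tag is not equal to want, sorted by file
// name and then image name so the output is stable.
func FindStaleRefs(files map[string]string, want string) []StaleRef {
	var stale []StaleRef
	for name, content := range files {
		for _, m := range imageRef.FindAllStringSubmatch(content, -1) {
			if m[2] != want {
				stale = append(stale, StaleRef{File: name, Image: m[1], Tag: m[2]})
			}
		}
	}
	sort.Slice(stale, func(i, j int) bool {
		if stale[i].File != stale[j].File {
			return stale[i].File < stale[j].File
		}
		return stale[i].Image < stale[j].Image
	})
	return stale
}
```

An empty result for `want = "1.0.2"` corresponds to "no stragglers". A real sweep would read the files from disk; the map keeps the sketch self-contained.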
Summary
Bump the dynamo runtime image (dynamo-frontend, vllm-runtime) from 0.9.0 to 1.0.2 across all 5 workload/validator manifests, aligning the runtime with the already-deployed dynamo-platform 1.0.2 operator and clearing a known upstream frontend panic.
Motivation / Context
The inference-perf validator intermittently (~50% of runs) fails with [TIMEOUT] timed out waiting for inference endpoint to serve requests, even though the DynamoGraphDeployment reports successful, all workers are Ready (8/8), and the frontend pod is 1/1. Live frontend-log capture traced this to a known upstream dynamo bug:
- dynamo-frontend:0.9.0 hits thread 'tokio-runtime-worker' panicked … Unfold must not be polled after it returned Poll::Ready(None) (futures-util 0.3.31). The discovery stream dies and the v1/instances KV bucket is never populated (KVStoreDiscovery::list: bucket missing for query=AllEndpoints), so the HTTP router has no worker backends and /v1/chat/completions never serves.
- The upstream fix (futures-util → 0.3.32) first shipped in dynamo v1.0.0 and was never backported to 0.9.x. Verified from each tag's Cargo.lock (0.9.0/0.9.1 = 0.3.31 buggy; 1.0.0+ = 0.3.32 fixed).
Why we still ran the buggy version: operator/runtime version skew. AICR pins the dynamo version in two independent places. The dynamo-platform operator chart was bumped 0.9.0 → 1.0.2 (#459), but the workload runtime image tags are hardcoded literals in 5 files the chart bump never touches, so a 1.0.2 operator was scheduling 0.9.0 pods carrying the panic. This PR fixes the drift.
Fixes: #1192
Related: ai-dynamo/dynamo#7328, ai-dynamo/dynamo#7346, #459
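The version skew described above comes down to comparing each runtime image tag against the operator chart version. A minimal sketch of such a comparison follows, under stated assumptions: `imageTag` and `CheckDrift` are hypothetical names, the chart version is passed in as a plain string rather than read from recipes/registry.yaml, and this is not the drift check planned for the follow-up.

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag returns the tag portion of an image reference, or "" if the
// reference has no tag. Only the segment after the last "/" is inspected,
// so a registry port (for example "host:5000/img") is not mistaken for a tag.
func imageTag(ref string) string {
	if i := strings.Index(ref, "@"); i >= 0 {
		ref = ref[:i] // drop a trailing digest, if present
	}
	name := ref[strings.LastIndex(ref, "/")+1:]
	if i := strings.LastIndex(name, ":"); i >= 0 {
		return name[i+1:]
	}
	return ""
}

// CheckDrift returns an error listing every image whose tag differs from
// the chart version. A leading "v" is ignored on both sides.
func CheckDrift(chartVersion string, images []string) error {
	want := strings.TrimPrefix(chartVersion, "v")
	var drifted []string
	for _, img := range images {
		if got := strings.TrimPrefix(imageTag(img), "v"); got != want {
			drifted = append(drifted, img)
		}
	}
	if len(drifted) > 0 {
		return fmt.Errorf("runtime images do not match chart version %s: %s",
			want, strings.Join(drifted, ", "))
	}
	return nil
}
```

With the chart at 1.0.2, a 0.9.0 runtime tag would make `CheckDrift` return an error, which is the situation this PR corrects by hand.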
Type of Change
Component(s) Affected
- pkg/validator: validators/performance testdata + cache-warmer image
- tests/manifests, demos/workloads, pkg/evidence/cncf (dynamo workload manifests)
Implementation Notes
- dynamo-frontend / vllm-runtime image refs 0.9.0 → 1.0.2 (the only dynamo image references in the repo, verified by a full sweep; no other dynamo image basenames exist):
  - validators/performance/testdata/inference/dynamo-deployment.yaml
  - validators/performance/model_cache.go (cache-warmer image literal)
  - tests/manifests/dynamo-vllm-smoke-test.yaml (+ stale v0.9.0 version comment → v1.0.2)
  - demos/workloads/inference/vllm-agg.yaml
  - pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml
- apiVersion: nvidia.com/v1alpha1 is unchanged. The 1.0.2 operator already serves it (it is what the validator has been running against), so the runtime bump is drop-in for our DGD spec.
- Centralizing the image tag (a pkg/defaults constant) and a drift-prevention check (asserting the runtime image tag matches the dynamo-platform chart version) are intentionally split into a separate follow-up so this PR stays a minimal, cherry-pickable fix.
Testing
A version-string bump in manifest image tags + one Go const cannot regress e2e (KWOK/no-GPU; does not pull these images) or scan (Go-binary vuln scan unaffected by a manifest tag). The relevant gate is lint + the package tests, both green.
Confirmed on hardware (b40 / RTX PRO 6000, 2048 concurrency). Built the validator image from this branch, pushed to ECR, and ran the inference-perf phase against the live cluster with the deployed frontend on dynamo-frontend:1.0.2:
- Unfold … Poll::Ready(None) panic: 24 on 0.9.0, 0 on 1.0.2
- bucket missing for query=AllEndpoints: 97 on 0.9.0, 0 on 1.0.2
The fix is deterministic (futures-util 0.3.32 removes the buggy Unfold path), so the zero-panic capture, not a single non-flaking run, is the conclusive evidence.
nvcr.io/.../vllm-runtime:1.0.2 (~12.3 GB) pulled cleanly with the cluster's existing nvcr.io creds (same registry path as 0.9.0).
Risk Assessment
Rollout notes: No migration. Pulls dynamo-frontend:1.0.2 / vllm-runtime:1.0.2 (must be reachable via existing nvcr.io pull creds). Reverting restores 0.9.0.
Checklist
- make test with -race: go test ./validators/performance/... green
- make lint
- TestCacheWorkerImageMatchesTemplate still guards the cache image; no new functionality
- git commit -S