fix(validators): bump dynamo runtime image 0.9.0 -> 1.0.2 (fixes #1192) by yuanchen8911 · Pull Request #1193 · NVIDIA/aicr

yuanchen8911 · 2026-06-04T02:14:51Z

Summary

Bump the dynamo runtime image (dynamo-frontend, vllm-runtime) from 0.9.0 to 1.0.2 across all 5 workload/validator files (four YAML manifests and one Go constant), aligning the runtime with the already-deployed dynamo-platform 1.0.2 operator and clearing a known upstream frontend panic.

Motivation / Context

The inference-perf validator intermittently (~50% of runs) fails with [TIMEOUT] timed out waiting for inference endpoint to serve requests, even though the DynamoGraphDeployment reports successful, all workers are Ready (8/8), and the frontend pod is 1/1. Live frontend-log capture traced this to a known upstream dynamo bug:

dynamo-frontend:0.9.0 hits thread 'tokio-runtime-worker' panicked … Unfold must not be polled after it returned Poll::Ready(None) (futures-util 0.3.31). The discovery stream dies and the v1/instances KV bucket is never populated (KVStoreDiscovery::list: bucket missing for query=AllEndpoints), so the HTTP router has no worker backends and /v1/chat/completions never serves.
Upstream: ai-dynamo/dynamo#7328. Fixed by ai-dynamo/dynamo#7346 (futures-util → 0.3.32), first shipped in dynamo v1.0.0, never backported to 0.9.x. Verified from each tag's Cargo.lock (0.9.0/0.9.1 = 0.3.31 buggy; 1.0.0+ = 0.3.32 fixed).
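The Cargo.lock check described above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for this PR: the helper names (`futures_util_versions`, `is_fixed`) and the sample lock text are hypothetical, and the regex assumes the usual Cargo.lock layout where `version` directly follows `name` inside each `[[package]]` entry.

```python
import re

# First futures-util release carrying the upstream fix, per the PR description.
FIXED = (0, 3, 32)

# Assumes the standard Cargo.lock layout: name line, then version line.
PACKAGE_RE = re.compile(
    r'\[\[package\]\]\s*\nname = "futures-util"\s*\nversion = "([^"]+)"'
)


def futures_util_versions(cargo_lock_text):
    """Return every futures-util version pinned in a Cargo.lock document."""
    return PACKAGE_RE.findall(cargo_lock_text)


def is_fixed(version):
    """True if a plain x.y.z version is at or above the fixed release."""
    return tuple(int(part) for part in version.split(".")[:3]) >= FIXED


# Hypothetical lock excerpt for illustration only, not copied from any tag.
SAMPLE_LOCK = '''
[[package]]
name = "futures-util"
version = "0.3.31"

[[package]]
name = "example-crate"
version = "1.2.3"
'''

for v in futures_util_versions(SAMPLE_LOCK):
    print(v, "fixed" if is_fixed(v) else "buggy")
```

Running it against the Cargo.lock of each dynamo tag would give the per-tag classification the description reports; pre-release version strings would need extra parsing that this sketch omits.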

Why we still ran the buggy version — operator/runtime version skew. AICR pins the dynamo version in two independent places. The dynamo-platform operator chart was bumped 0.9.0 → 1.0.2 (#459), but the workload runtime image tags are hardcoded literals in 5 files the chart bump never touches — so a 1.0.2 operator was scheduling 0.9.0 pods carrying the panic. This PR fixes the drift.

Fixes: #1192
Related: ai-dynamo/dynamo#7328, ai-dynamo/dynamo#7346, #459

Type of Change

Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

Validator (pkg/validator) — validators/performance testdata + cache-warmer image
Other: tests/manifests, demos/workloads, pkg/evidence/cncf (dynamo workload manifests)

Implementation Notes

Bumped all 9 dynamo-frontend / vllm-runtime image refs 0.9.0 → 1.0.2 (the only dynamo image references in the repo — verified by a full sweep; no other dynamo image basenames exist):
- validators/performance/testdata/inference/dynamo-deployment.yaml
- validators/performance/model_cache.go (cache-warmer image literal)
- tests/manifests/dynamo-vllm-smoke-test.yaml (+ stale v0.9.0 version comment → v1.0.2)
- demos/workloads/inference/vllm-agg.yaml
- pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml
DGD CRD apiVersion: nvidia.com/v1alpha1 is unchanged — the 1.0.2 operator already serves it (it is what the validator has been running against), so the runtime bump is drop-in for our DGD spec.
The single-source-of-truth refactor (a pkg/defaults constant) and a drift-prevention check (asserting the runtime image tag matches the dynamo-platform chart version) are intentionally split into a separate follow-up so this PR stays a minimal, cherry-pickable fix.
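For context, the deferred drift-prevention check could look something like the sketch below. It is an illustration under stated assumptions, not the follow-up's actual design: `find_drift`, the file names, and the sample contents are hypothetical, and the chart version is passed in as a plain string rather than read from the registry file.

```python
import re

# Matches the two runtime image basenames named in this PR, followed by a tag.
IMAGE_TAG = re.compile(r"\b(dynamo-frontend|vllm-runtime):([0-9][0-9A-Za-z._-]*)")


def find_drift(files, chart_version):
    """Return (path, line number, image, tag) for every tag != chart_version.

    `files` maps a path to its text content; reading from disk is left out
    so the sketch stays self-contained.
    """
    drift = []
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            for image, tag in IMAGE_TAG.findall(line):
                if tag != chart_version:
                    drift.append((path, lineno, image, tag))
    return drift


# Hypothetical inputs; the registry path is elided here as it is in the PR text.
SAMPLE_FILES = {
    "example/deployment.yaml": (
        "image: nvcr.io/.../dynamo-frontend:1.0.2\n"
        "image: nvcr.io/.../vllm-runtime:0.9.0\n"
    ),
    "example/cache.go": 'const cacheWorkerImage = "nvcr.io/.../vllm-runtime:1.0.2"\n',
}

for hit in find_drift(SAMPLE_FILES, "1.0.2"):
    print("drift:", hit)
```

The same scan doubles as the "full sweep" for stragglers: an empty result against the chart version means every reference it recognizes has been bumped.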

Testing

make lint   # golangci-lint 0 issues, yamllint, license, agents-sync, docs, bom-pinning — all pass
go test ./validators/performance/...   # ok (incl. TestCacheWorkerImageMatchesTemplate)

A version-string bump in manifest image tags plus one Go const cannot regress the e2e job (KWOK, no GPU; it does not pull these images) or the scan job (the Go-binary vulnerability scan is unaffected by a manifest tag). The relevant gates are lint and the package tests, both green.

Confirmed on hardware (b40 / RTX PRO 6000, 2048 concurrency). Built the validator image from this branch, pushed to ECR, and ran the inference-perf phase against the live cluster with the deployed frontend on dynamo-frontend:1.0.2:

| Signal | 0.9.0 (flaked run) | 1.0.2 (this branch) |
| --- | --- | --- |
| `Unfold … Poll::Ready(None)` panic | 24× | 0 |
| `bucket missing for query=AllEndpoints` | 97× | 0 |
| endpoint serving signals | 0 (timed out) | served |
| outcome | TIMEOUT / FAIL | PASS |

phase=performance status=passed  duration=14m58s
Inference throughput: 73,993 tokens/sec   (also ~24% higher than 0.9.0's 59,636 @ 2048)
Inference TTFT p99:   537 ms              (constraint ≤ 1000 ms ✓)

The fix is deterministic (futures-util 0.3.32 removes the buggy Unfold path), so the zero-panic capture — not a single non-flaking run — is the conclusive evidence. nvcr.io/.../vllm-runtime:1.0.2 (~12.3 GB) pulled cleanly with the cluster's existing nvcr.io creds (same registry path as 0.9.0).
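The panic and bucket-missing counts in the table could be tallied from a captured frontend log with something like this sketch. The marker strings come from the messages quoted in this PR; the function name and the sample log lines are hypothetical, and the panic line keeps the same elision as the description.

```python
# Substrings quoted in the PR description; a line counts once per marker it contains.
SIGNALS = {
    "unfold_panic": "Unfold must not be polled after it returned Poll::Ready(None)",
    "bucket_missing": "bucket missing for query=AllEndpoints",
}


def count_signals(log_text):
    """Count log lines containing each known failure marker."""
    counts = {name: 0 for name in SIGNALS}
    for line in log_text.splitlines():
        for name, needle in SIGNALS.items():
            if needle in line:
                counts[name] += 1
    return counts


# Made-up log excerpt for illustration only.
SAMPLE_LOG = "\n".join([
    "... Unfold must not be polled after it returned Poll::Ready(None)",
    "KVStoreDiscovery::list: bucket missing for query=AllEndpoints",
    "KVStoreDiscovery::list: bucket missing for query=AllEndpoints",
    "an unrelated log line",
])

print(count_signals(SAMPLE_LOG))
```

A run on the fixed image is expected to report zero for both markers, which is the comparison the table summarizes.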

Risk Assessment

Low — Isolated version-string bump, easy to revert; aligns runtime with the operator already in production.

Rollout notes: No migration. Pulls dynamo-frontend:1.0.2 / vllm-runtime:1.0.2 (must be reachable via existing nvcr.io pull creds). Reverting restores 0.9.0.

Checklist

Tests pass locally (make test with -race) — go test ./validators/performance/... green
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A (existing TestCacheWorkerImageMatchesTemplate still guards the cache image; no new functionality)
I updated docs if user-facing behavior changed — N/A (no user-facing surface change)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-06-04T02:19:43Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 632f5537-acdf-4cc0-924c-373141c97d9b

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb05d2 and 2d1f120.

📒 Files selected for processing (6)

demos/workloads/inference/vllm-agg.yaml
pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml
tests/chainsaw/ai-conformance/cluster/assert-dynamo.yaml
tests/manifests/dynamo-vllm-smoke-test.yaml
validators/performance/model_cache.go
validators/performance/testdata/inference/dynamo-deployment.yaml

📝 Walkthrough

This PR bumps the dynamo-frontend and vllm-runtime container image tags from 0.9.0 to 1.0.2 in demo and evidence manifests, test manifests and comments, the validator cacheWorkerImage constant, and validator testdata deployment fixtures.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related issues

dynamo: centralize runtime image reference — SSOT + registry-override parity + drift guard #1194: addresses the same runtime image-version drift by updating frontend and vllm-runtime image tags including the cacheWorkerImage constant.

Suggested labels

size/M, area/validator

Suggested reviewers

mchmarny
njhensley

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: bumping dynamo runtime image tags from 0.9.0 to 1.0.2 across the codebase, and it references the issue being fixed. |
| Description check | ✅ Passed | The description comprehensively explains the bug fix, root cause analysis, testing, and risk assessment, all directly related to the changeset of updating dynamo image versions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

The inference-perf validator intermittently times out on healthy clusters (~50% of runs) waiting for the inference endpoint to serve, even though the DynamoGraphDeployment reports successful and all workers are Ready. Root cause is a known upstream dynamo bug: dynamo-frontend 0.9.0 hits a discovery-stream panic ("Unfold must not be polled after it returned Poll::Ready(None)" in futures-util 0.3.31), which leaves the v1/instances KV bucket unpopulated so the HTTP router never registers worker backends (ai-dynamo/dynamo#7328).

The upstream fix (futures-util -> 0.3.32, ai-dynamo/dynamo#7346) first shipped in dynamo v1.0.0 and was never backported to the 0.9.x line.

AICR already pins the dynamo-platform operator chart at 1.0.2 (recipes/registry.yaml), but the workload runtime image tags stayed at 0.9.0 -- so a 1.0.2 operator scheduled 0.9.0 pods carrying the panic.

Align the workload runtime images with the operator by bumping every dynamo-frontend / vllm-runtime pin from 0.9.0 to 1.0.2:
- validators/performance/testdata/inference/dynamo-deployment.yaml
- validators/performance/model_cache.go (cache-warmer image)
- tests/manifests/dynamo-vllm-smoke-test.yaml (+ stale version comment)
- demos/workloads/inference/vllm-agg.yaml
- pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml

Also refresh stale 0.9.0 chart-version comments in tests/chainsaw/ai-conformance/cluster/assert-dynamo.yaml (comment-only; the file asserts only that the operator/grove Deployments are Available).

The DGD CRD apiVersion (nvidia.com/v1alpha1) is unchanged -- the 1.0.2 operator already serves it, which is what the validator has been running against.

Verified on b40 (RTX PRO 6000) at 2048 concurrency: endpoint served with zero Unfold panics / zero "bucket missing" events (vs 24 / 97 on 0.9.0), throughput 73,993 tok/s, TTFT p99 537 ms, phase passed.

Fixes NVIDIA#1192

mchmarny

Tight, well-evidenced fix. Root cause traced cleanly to the upstream futures-util 0.3.31 Unfold bug (fixed in dynamo 1.0.0), and the before/after capture (panic markers 24→0, 97→0; TIMEOUT→PASS) is deterministic enough to put the flake to bed. Sweep confirms all 9 refs bumped — no stragglers. Splitting the pkg/defaults centralization into a follow-up is the right call. Rebase before merge.

yuanchen8911 added the bug label Jun 4, 2026

github-actions Bot added area/tests size/S labels Jun 4, 2026

yuanchen8911 mentioned this pull request Jun 4, 2026

dynamo: centralize runtime image reference — SSOT + registry-override parity + drift guard #1194

Open

5 tasks

yuanchen8911 force-pushed the fix/1192-dynamo-runtime-bump branch from 0d249fa to 2eb05d2 Compare June 4, 2026 03:03

yuanchen8911 marked this pull request as ready for review June 4, 2026 03:07

yuanchen8911 requested a review from a team as a code owner June 4, 2026 03:08

yuanchen8911 requested review from ArangoGutierrez, dims, lalitadithya, lockwobr and mchmarny June 4, 2026 03:24

yuanchen8911 enabled auto-merge (squash) June 4, 2026 04:06

yuanchen8911 marked this pull request as draft June 4, 2026 04:24

auto-merge was automatically disabled June 4, 2026 04:24
Pull request was converted to draft

mchmarny assigned yuanchen8911 Jun 4, 2026

yuanchen8911 force-pushed the fix/1192-dynamo-runtime-bump branch from 2eb05d2 to 2d1f120 Compare June 4, 2026 14:20

yuanchen8911 marked this pull request as ready for review June 4, 2026 14:20

yuanchen8911 enabled auto-merge (squash) June 4, 2026 15:25

This was referenced Jun 4, 2026

inference-perf: validator times out on healthy cluster when dynamo frontend discovery bootstrap races (false negative) #1192

Closed

fix(validators): update and tune inference performance validation #1196

Merged

mchmarny approved these changes Jun 4, 2026

Merge branch 'main' into fix/1192-dynamo-runtime-bump

d83fb7b

mchmarny disabled auto-merge June 4, 2026 16:51

mchmarny merged commit 19dece7 into NVIDIA:main Jun 4, 2026
2 checks passed

yuanchen8911 mentioned this pull request Jun 4, 2026

inference-perf: stochastic worker-stall / throughput degradation on EKS H100 at 2048 concurrency #1197

Closed

yuanchen8911 mentioned this pull request Jul 1, 2026

feat(recipes): bump dynamo-platform and runtime to 1.2.1 #1581

Merged

11 tasks

mchmarny merged 2 commits into NVIDIA:main from yuanchen8911:fix/1192-dynamo-runtime-bump
