fix(validators): bump dynamo runtime image 0.9.0 -> 1.0.2 (fixes #1192) #1193

Merged
mchmarny merged 2 commits into NVIDIA:main from yuanchen8911:fix/1192-dynamo-runtime-bump on Jun 4, 2026

@yuanchen8911 yuanchen8911 commented Jun 4, 2026

Contributor

Summary

Bump the dynamo runtime image (dynamo-frontend, vllm-runtime) from 0.9.0 to 1.0.2 across all 5 workload/validator manifests, aligning the runtime with the already-deployed dynamo-platform 1.0.2 operator and clearing a known upstream frontend panic.

Motivation / Context

The inference-perf validator intermittently (~50% of runs) fails with [TIMEOUT] timed out waiting for inference endpoint to serve requests, even though the DynamoGraphDeployment reports successful, all workers are Ready (8/8), and the frontend pod is 1/1. Live frontend-log capture traced this to a known upstream dynamo bug:

  • dynamo-frontend:0.9.0 hits thread 'tokio-runtime-worker' panicked … Unfold must not be polled after it returned Poll::Ready(None) (futures-util 0.3.31). The discovery stream dies and the v1/instances KV bucket is never populated (KVStoreDiscovery::list: bucket missing for query=AllEndpoints), so the HTTP router has no worker backends and /v1/chat/completions never serves.
  • Upstream: ai-dynamo/dynamo#7328. Fixed by ai-dynamo/dynamo#7346 (futures-util 0.3.32), first shipped in dynamo v1.0.0 and never backported to 0.9.x. Verified from each tag's Cargo.lock (0.9.0/0.9.1 = 0.3.31, buggy; 1.0.0+ = 0.3.32, fixed).
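
The Cargo.lock check described above can be repeated mechanically. The following is a minimal sketch, not the procedure actually used for this PR: `lockedVersions` is a hypothetical helper, the sample lockfile is a hand-written fragment (the `example-crate` entry is a placeholder), and it assumes the usual Cargo.lock layout where each `[[package]]` table has a `name` line followed by a `version` line.

```go
package main

import (
	"fmt"
	"strings"
)

// lockedVersions returns every version recorded for pkg in the text of a
// Cargo.lock file. Hypothetical helper for illustration only.
func lockedVersions(lock, pkg string) []string {
	var versions []string
	want := fmt.Sprintf("name = %q", pkg)
	inPkg := false
	for _, raw := range strings.Split(lock, "\n") {
		line := strings.TrimSpace(raw)
		switch {
		case line == "[[package]]":
			// A new package table starts; forget the previous one.
			inPkg = false
		case line == want:
			inPkg = true
		case inPkg && strings.HasPrefix(line, "version = "):
			v := strings.Trim(strings.TrimPrefix(line, "version = "), `"`)
			versions = append(versions, v)
			inPkg = false
		}
	}
	return versions
}

func main() {
	// Hand-written fragment; "example-crate" is a placeholder entry.
	sample := `[[package]]
name = "example-crate"
version = "1.2.3"

[[package]]
name = "futures-util"
version = "0.3.31"
`
	fmt.Println(lockedVersions(sample, "futures-util"))
}
```

Running it against the Cargo.lock checked out at each dynamo tag would show which tags still carry 0.3.31.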

Why we still ran the buggy version — operator/runtime version skew. AICR pins the dynamo version in two independent places. The dynamo-platform operator chart was bumped 0.9.0 → 1.0.2 (#459), but the workload runtime image tags are hardcoded literals in 5 files the chart bump never touches — so a 1.0.2 operator was scheduling 0.9.0 pods carrying the panic. This PR fixes the drift.
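
To make the skew concrete, here is a small sketch of what comparing the two pins could look like. It is an illustration under assumptions, not code from this repo: `imageTag` and `driftedImages` are hypothetical names, and `example.registry` stands in for the real registry path, which is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag returns the tag of a container image reference, or "" if the
// reference has no tag. Hypothetical helper for illustration only.
func imageTag(ref string) string {
	// Drop a digest suffix, if any.
	if i := strings.Index(ref, "@"); i >= 0 {
		ref = ref[:i]
	}
	// Only a colon after the last slash is a tag separator; an earlier
	// colon belongs to a registry port such as localhost:5000.
	slash := strings.LastIndex(ref, "/")
	colon := strings.LastIndex(ref, ":")
	if colon > slash {
		return ref[colon+1:]
	}
	return ""
}

// driftedImages returns the refs whose tag differs from chartVersion.
func driftedImages(chartVersion string, refs []string) []string {
	var drifted []string
	for _, ref := range refs {
		if imageTag(ref) != chartVersion {
			drifted = append(drifted, ref)
		}
	}
	return drifted
}

func main() {
	// Placeholder registry; the tags mirror the skew described above.
	refs := []string{
		"example.registry/dynamo-frontend:1.0.2",
		"example.registry/vllm-runtime:0.9.0",
	}
	fmt.Println(driftedImages("1.0.2", refs))
}
```

With the operator chart at 1.0.2, any ref this returns is a runtime image that the chart bump did not reach.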

Fixes: #1192
Related: ai-dynamo/dynamo#7328, ai-dynamo/dynamo#7346, #459

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Validator (pkg/validator) — validators/performance testdata + cache-warmer image
  • Other: tests/manifests, demos/workloads, pkg/evidence/cncf (dynamo workload manifests)

Implementation Notes

  • Bumped all 9 dynamo-frontend / vllm-runtime image refs 0.9.0 → 1.0.2 (the only dynamo image references in the repo — verified by a full sweep; no other dynamo image basenames exist):
    • validators/performance/testdata/inference/dynamo-deployment.yaml
    • validators/performance/model_cache.go (cache-warmer image literal)
    • tests/manifests/dynamo-vllm-smoke-test.yaml (+ stale v0.9.0 version comment → v1.0.2)
    • demos/workloads/inference/vllm-agg.yaml
    • pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml
  • DGD CRD apiVersion: nvidia.com/v1alpha1 is unchanged — the 1.0.2 operator already serves it (it is what the validator has been running against), so the runtime bump is drop-in for our DGD spec.
  • The single-source-of-truth refactor (a pkg/defaults constant) and a drift-prevention check (asserting the runtime image tag matches the dynamo-platform chart version) are intentionally split into a separate follow-up so this PR stays a minimal, cherry-pickable fix.
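
For reference, the kind of sweep mentioned in the first bullet can be sketched as a scan for the two image basenames and their tags. This is an assumed approach, not the command actually run for this PR: `findPins` and the `pin` type are hypothetical, the sample manifest text is made up, and `example.registry` is a placeholder.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// imageRef matches the two dynamo image basenames followed by a tag.
var imageRef = regexp.MustCompile(`\b(dynamo-frontend|vllm-runtime):([A-Za-z0-9._-]+)`)

// pin records one image reference found in a file's content.
type pin struct {
	Line  int
	Image string
	Tag   string
}

// findPins returns every dynamo image pin in content, with 1-based line numbers.
func findPins(content string) []pin {
	var pins []pin
	for i, line := range strings.Split(content, "\n") {
		for _, m := range imageRef.FindAllStringSubmatch(line, -1) {
			pins = append(pins, pin{Line: i + 1, Image: m[1], Tag: m[2]})
		}
	}
	return pins
}

func main() {
	// Made-up manifest fragment with a placeholder registry.
	manifest := "image: example.registry/dynamo-frontend:1.0.2\n" +
		"image: example.registry/vllm-runtime:0.9.0\n"
	for _, p := range findPins(manifest) {
		if p.Tag != "1.0.2" {
			fmt.Printf("line %d: %s is still pinned to %s\n", p.Line, p.Image, p.Tag)
		}
	}
}
```

Applied to each file in the list above, an empty result after the bump would indicate no stragglers for these two basenames. It does not detect images under other names.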

Testing

make lint   # golangci-lint 0 issues, yamllint, license, agents-sync, docs, bom-pinning — all pass
go test ./validators/performance/...   # ok (incl. TestCacheWorkerImageMatchesTemplate)

A version-string bump in manifest image tags plus one Go const cannot regress e2e (KWOK/no-GPU; does not pull these images) or scan (the Go-binary vuln scan is unaffected by a manifest tag). The relevant gates are lint and the package tests, both green.

Confirmed on hardware (b40 / RTX PRO 6000, 2048 concurrency). Built the validator image from this branch, pushed to ECR, and ran the inference-perf phase against the live cluster with the deployed frontend on dynamo-frontend:1.0.2:

| Signal | 0.9.0 (flaked run) | 1.0.2 (this branch) |
| --- | --- | --- |
| Unfold … Poll::Ready(None) panic | 24× | 0 |
| bucket missing for query=AllEndpoints | 97× | 0 |
| endpoint serving signals | 0 (timed out) | served |
| outcome | TIMEOUT / FAIL | PASS |

phase=performance status=passed  duration=14m58s
Inference throughput: 73,993 tokens/sec   (also ~24% higher than 0.9.0's 59,636 @ 2048)
Inference TTFT p99:   537 ms              (constraint ≤ 1000 ms ✓)

The fix is deterministic (futures-util 0.3.32 removes the buggy Unfold path), so the zero-panic capture — not a single non-flaking run — is the conclusive evidence. nvcr.io/.../vllm-runtime:1.0.2 (~12.3 GB) pulled cleanly with the cluster's existing nvcr.io creds (same registry path as 0.9.0).
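
The marker counts in the table can be reproduced from a captured frontend log with a simple substring count. The sketch below assumes the log is available as plain text; `countMarkers` is a hypothetical helper, and the sample lines reuse only the message fragments quoted in this description, elision included.

```go
package main

import (
	"fmt"
	"strings"
)

// markers are the two log fragments quoted in the PR description.
var markers = []string{
	"Unfold must not be polled after it returned Poll::Ready(None)",
	"bucket missing for query=AllEndpoints",
}

// countMarkers returns, per marker, the number of log lines containing it.
// Hypothetical helper for illustration only.
func countMarkers(logText string) map[string]int {
	counts := make(map[string]int, len(markers))
	for _, m := range markers {
		counts[m] = 0
	}
	for _, line := range strings.Split(logText, "\n") {
		for _, m := range markers {
			if strings.Contains(line, m) {
				counts[m]++
			}
		}
	}
	return counts
}

func main() {
	// Sample built from the fragments quoted above, not a real capture.
	sample := "thread 'tokio-runtime-worker' panicked … Unfold must not be polled after it returned Poll::Ready(None)\n" +
		"KVStoreDiscovery::list: bucket missing for query=AllEndpoints\n" +
		"KVStoreDiscovery::list: bucket missing for query=AllEndpoints\n"
	counts := countMarkers(sample)
	for _, m := range markers {
		fmt.Printf("%dx %s\n", counts[m], m)
	}
}
```

A zero count for both markers on a 1.0.2 capture is the signal the table reports.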

Risk Assessment

  • Low — Isolated version-string bump, easy to revert; aligns runtime with the operator already in production.

Rollout notes: No migration. Pulls dynamo-frontend:1.0.2 / vllm-runtime:1.0.2 (must be reachable via existing nvcr.io pull creds). Reverting restores 0.9.0.

Checklist

  • Tests pass locally (make test with -race) — go test ./validators/performance/... green
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (existing TestCacheWorkerImageMatchesTemplate still guards the cache image; no new functionality)
  • I updated docs if user-facing behavior changed — N/A (no user-facing surface change)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

coderabbitai Bot commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 632f5537-acdf-4cc0-924c-373141c97d9b

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb05d2 and 2d1f120.

📒 Files selected for processing (6)
  • demos/workloads/inference/vllm-agg.yaml
  • pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml
  • tests/chainsaw/ai-conformance/cluster/assert-dynamo.yaml
  • tests/manifests/dynamo-vllm-smoke-test.yaml
  • validators/performance/model_cache.go
  • validators/performance/testdata/inference/dynamo-deployment.yaml

📝 Walkthrough

This PR bumps the dynamo-frontend and vllm-runtime container image tags from 0.9.0 to 1.0.2 in demo and evidence manifests, test manifests and comments, the validator cacheWorkerImage constant, and validator testdata deployment fixtures.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

size/M, area/validator

Suggested reviewers

  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: bumping dynamo runtime image tags from 0.9.0 to 1.0.2 across the codebase, and it references the issue being fixed. |
| Description check | ✅ Passed | The description comprehensively explains the bug fix, root cause analysis, testing, and risk assessment, all directly related to the changeset of updating dynamo image versions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@yuanchen8911 yuanchen8911 force-pushed the fix/1192-dynamo-runtime-bump branch from 0d249fa to 2eb05d2 Compare June 4, 2026 03:03
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 4, 2026 03:07
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner June 4, 2026 03:08
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) June 4, 2026 04:06
@yuanchen8911 yuanchen8911 marked this pull request as draft June 4, 2026 04:24
auto-merge was automatically disabled June 4, 2026 04:24

Pull request was converted to draft

The inference-perf validator intermittently times out on healthy clusters
(~50% of runs) waiting for the inference endpoint to serve, even though the
DynamoGraphDeployment reports successful and all workers are Ready. Root cause
is a known upstream dynamo bug: dynamo-frontend 0.9.0 hits a discovery-stream
panic ("Unfold must not be polled after it returned Poll::Ready(None)" in
futures-util 0.3.31), which leaves the v1/instances KV bucket unpopulated so
the HTTP router never registers worker backends (ai-dynamo/dynamo#7328).

The upstream fix (futures-util -> 0.3.32, ai-dynamo/dynamo#7346) first shipped
in dynamo v1.0.0 and was never backported to the 0.9.x line. AICR already pins
the dynamo-platform operator chart at 1.0.2 (recipes/registry.yaml), but the
workload runtime image tags stayed at 0.9.0 -- so a 1.0.2 operator scheduled
0.9.0 pods carrying the panic.

Align the workload runtime images with the operator by bumping every
dynamo-frontend / vllm-runtime pin from 0.9.0 to 1.0.2:
- validators/performance/testdata/inference/dynamo-deployment.yaml
- validators/performance/model_cache.go (cache-warmer image)
- tests/manifests/dynamo-vllm-smoke-test.yaml (+ stale version comment)
- demos/workloads/inference/vllm-agg.yaml
- pkg/evidence/cncf/scripts/manifests/dynamo-vllm-agg.yaml

Also refresh stale 0.9.0 chart-version comments in
tests/chainsaw/ai-conformance/cluster/assert-dynamo.yaml (comment-only; the
file asserts only that the operator/grove Deployments are Available).

The DGD CRD apiVersion (nvidia.com/v1alpha1) is unchanged -- the 1.0.2 operator
already serves it, which is what the validator has been running against.

Verified on b40 (RTX PRO 6000) at 2048 concurrency: endpoint served with zero
Unfold panics / zero "bucket missing" events (vs 24 / 97 on 0.9.0), throughput
73,993 tok/s, TTFT p99 537 ms, phase passed.

Fixes NVIDIA#1192

@mchmarny mchmarny left a comment

Member

Tight, well-evidenced fix. Root cause traced cleanly to the upstream futures-util 0.3.31 Unfold bug (fixed in dynamo 1.0.0), and the before/after capture (panic markers 24→0, 97→0; TIMEOUT→PASS) is deterministic enough to put the flake to bed. Sweep confirms all 9 refs bumped — no stragglers. Splitting the pkg/defaults centralization into a follow-up is the right call. Rebase before merge.

@mchmarny mchmarny disabled auto-merge June 4, 2026 16:51
@mchmarny mchmarny merged commit 19dece7 into NVIDIA:main Jun 4, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

inference-perf: validator times out on healthy cluster when dynamo frontend discovery bootstrap races (false negative)

2 participants