feat(validation): switch inference-perf frontend routing to least-loaded by yuanchen8911 · Pull Request #1399 · NVIDIA/aicr

yuanchen8911 · 2026-06-22T16:32:22Z

Summary

Switch the inference-perf validation workload's Dynamo frontend from KV-cache-aware routing (DYN_ROUTER_MODE=kv) to load-aware least-loaded routing (DYN_ROUTER_MODE=least-loaded).

Motivation / Context

The inference-perf frontend routed by KV overlap, which can keep sending a transiently-slow worker its full share of requests. On EKS H100 at the 2048-concurrency saturation knee this produced stochastic worker-stall / throughput degradation — TTFT p99 spiking from ~700 ms to 4–77 s and aggregate throughput dropping 30–85% — while GKE H100 stayed consistently clean. least-loaded balances by each worker's active in-flight load, so a backed-up worker stops receiving new requests until it drains, removing the positive-feedback loop behind the stall.
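As a rough illustration of the selection rule described above (not Dynamo's actual router code), least-loaded routing amounts to picking the worker with the fewest in-flight requests. The `Worker` class and `pick_least_loaded` function are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    in_flight: int = 0  # requests dispatched to this worker and not yet completed


def pick_least_loaded(workers):
    """Return the worker with the fewest in-flight requests.

    Ties go to the earliest worker in the list (min() keeps the first minimum).
    """
    return min(workers, key=lambda w: w.in_flight)


# w1 is backed up, so it is skipped until its in-flight count drains.
workers = [Worker("w0", 3), Worker("w1", 9), Worker("w2", 2)]
chosen = pick_least_loaded(workers)
chosen.in_flight += 1  # the router records the newly dispatched request
```

Because the backed-up worker never has the smallest count, it receives nothing new until completions bring its in-flight number back down to its peers' level.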

Why least-loaded is the best fit for the AICR performance benchmark. The performance phase drives the workload with AIPerf under deliberately uniform, reproducible conditions — a single served model, a pinned prompt pool, fixed input/output token counts (stddev 0), and greedy decoding (temperature: 0) — against a homogeneous pool of identical single-GPU vLLM workers (one H100 per worker via DRA). Under those conditions KV-overlap (kv) routing has no advantage to exploit: there is no worker heterogeneity to weigh, and the synthetic prompts carry no meaningful cross-request prefix-cache reuse for the router to capitalize on. What KV routing does do is bias requests toward an "overlap-matching" worker, which lets a transiently-slow worker keep accumulating load — exactly the saturation-knee failure mode in #1197. Least-loaded optimizes for the one signal that actually governs tail latency in this benchmark — instantaneous per-worker queue depth — so it is the most appropriate mode for AICR's AIPerf-driven throughput/TTFT measurement. (KV-aware routing remains the right default for production traffic with real prompt-prefix reuse; this change scopes only the benchmark workload.)
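The feedback loop described above can be sketched with a toy queue simulation. This is a simplified model under stated assumptions, not a reproduction of the EKS runs: `simulate` is a hypothetical helper, the "fixed-share" policy is only a stand-in for any router that ignores queue depth (it does not model Dynamo's KV router), and the stalled worker is assumed to complete nothing at all.

```python
def simulate(policy, steps=100, num_workers=4, stalled=0):
    """Return per-worker queue depths after `steps` ticks.

    Each tick, num_workers requests arrive; every healthy worker completes up
    to 2 requests, while the stalled worker completes none.
    """
    queues = [0] * num_workers
    next_rr = 0
    for _ in range(steps):
        # Dispatch this tick's arrivals one at a time.
        for _ in range(num_workers):
            if policy == "fixed-share":
                # Every worker gets an equal share regardless of backlog.
                target = next_rr % num_workers
                next_rr += 1
            elif policy == "least-loaded":
                # Send the request to the shallowest queue (ties: lowest index).
                target = min(range(num_workers), key=lambda w: queues[w])
            else:
                raise ValueError(f"unknown policy: {policy}")
            queues[target] += 1
        # Service phase: healthy workers drain, the stalled worker does not.
        for w in range(num_workers):
            service = 0 if w == stalled else 2
            queues[w] = max(0, queues[w] - service)
    return queues


fixed = simulate("fixed-share")
least = simulate("least-loaded")
print("fixed-share  queue depths:", fixed)
print("least-loaded queue depths:", least)
```

In this toy model the stalled worker's backlog grows by one request per tick under the fixed-share policy, whereas under least-loaded it stops growing once its queue is deeper than its peers'.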

Fixes: N/A
Related: #1197, #1043 (epic)

Type of Change

Bug fix (non-breaking change that fixes an issue)
Documentation update

Component(s) Affected

Validator (pkg/validator)
Docs/examples (docs/, examples/)

Implementation Notes

One-line behavior change: DYN_ROUTER_MODE on the Frontend container in validators/performance/testdata/inference/dynamo-deployment.yaml flips kv → least-loaded.
least-loaded is a first-class Dynamo 1.2 router mode (set via the same frontend env var). No image/runtime bump required — the manifest already targets Dynamo 1.2.0.
Workers keep the vLLM ZMQ KV-cache event publisher; least-loaded simply does not consume those events. Left in place to avoid widening scope and to keep kv mode trivially restorable.
The gateway-epp routing path (--router-mode direct sidecars, external endpoint-picker) is unchanged and out of scope.
Docs updated to match: docs/user/validation.md, docs/contributor/validator.md, and the investigation report docs/contributor/inference-perf-fluctuation.md.

Testing

yamllint validators/performance/testdata/inference/dynamo-deployment.yaml   # clean

YAML, testdata, and docs only. The sole Go change is an expanded comment on resolveRoutingMode, with no logic changed, so test/e2e/Go-lint should not regress. End-to-end behavior (degradation no longer reproduces under least-loaded on EKS H100) needs a cluster run before this leaves draft.

Risk Assessment

Low — Single env-value change in a validator testdata manifest, easy to revert; kv mode remains available.

Rollout notes: Affects only the inference-perf validation workload, not user-deployed recipes/bundles.

Checklist

Tests pass locally (make test with -race): N/A, no Go logic changes
Linter passes (yamllint on changed manifest)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A (testdata value change)
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-06-22T16:33:31Z

🌿 Preview your docs: https://nvidia-preview-feat-dynamo-frontend-least-loaded-1197.docs.buildwithfern.com/aicr

coderabbitai · 2026-06-22T16:35:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ea34fd6c-19bc-4e53-9fde-a22cf4e7b16f

📥 Commits

Reviewing files that changed from the base of the PR and between 2534e10 and b6a205f.

📒 Files selected for processing (5)

docs/contributor/inference-perf-fluctuation.md
docs/contributor/validator.md
docs/user/validation.md
validators/performance/inference_perf_constraint.go
validators/performance/testdata/inference/dynamo-deployment.yaml

📝 Walkthrough

The DYN_ROUTER_MODE environment variable in the Dynamo vLLM deployment test configuration is changed from kv to least-loaded, with added inline comments explaining the in-flight-load-based balancing behavior. The corresponding code comment for resolveRoutingMode is expanded to document both routing modes. Three documentation files are updated in parallel: docs/contributor/inference-perf-fluctuation.md replaces its long-term KV-cache routing plan with least-loaded routing and clarifies version-skew prevention, docs/contributor/validator.md updates the inference-perf routing methodology description, and docs/user/validation.md rewrites the dynamo-router mode description to reflect the new default. All references to KV-cache event consumption by the router are removed or clarified as unused by least-loaded mode.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

feat(validator): make inference-perf routing strategy configurable (AICR_INFERENCE_PERF_ROUTER_MODE) #1374: This PR hardcodes DYN_ROUTER_MODE=least-loaded in the deployment YAML, which is the same field the issue proposes making configurable via an AICR_INFERENCE_PERF_ROUTER_MODE environment variable to allow A/B testing without image edits.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: switching inference-perf frontend routing from KV-cache-aware to least-loaded mode.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the motivation, implementation, and impact of the routing mode switch.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

mchmarny

Tightly-scoped, well-reasoned change. The env flip is on the Frontend container, the worker's vLLM ZMQ KV-events publisher is correctly retained (matching the "least-loaded doesn't consume them, kept for trivial revert" rationale), and all three doc files plus the resolveRoutingMode comment are updated coherently. CI is green.

One concern blocks my approval — not the diff itself, but verification:

The Testing section gates this on an EKS H100 cluster run ("degradation no longer reproduces under least-loaded ... needs a cluster run before this leaves draft"), but the PR is no longer a draft and no run is recorded. The whole premise — that least-loaded stops the worker-stall, and that least-loaded is even an accepted DYN_ROUTER_MODE value for the Dynamo 1.2 frontend — rests on that run. A bad enum value would only surface at frontend startup on a real cluster, so I'd want the deploy + non-reproduction confirmed before merge. Everything is reversible (kv restorable), so this is a "verify, then merge," not a redesign.

Non-blocking nit: CodeRabbit flagged the tension with #1374 (make DYN_ROUTER_MODE configurable via AICR_INFERENCE_PERF_ROUTER_MODE rather than hardcoding). Hardcoding is consistent with the current template, so fine to keep here and let #1374 generalize it later — worth a one-line note in the PR that this is intentional.

Holding as COMMENT rather than approve pending the cluster confirmation.

The inference-perf workload's Dynamo frontend defaulted to KV-cache-aware routing (DYN_ROUTER_MODE=kv), which routes by KV overlap and can keep sending a transiently-slow worker its full share of requests. On EKS H100 at the saturation knee this produced stochastic worker-stall / throughput degradation (TTFT p99 spiking 4-77s, throughput dropping 30-85%), while GKE H100 stayed clean. Switch the default to load-aware least-loaded routing (DYN_ROUTER_MODE=least-loaded), which balances by each worker's active in-flight load so a backed-up worker stops receiving new requests until it drains. Workers still publish vLLM KV-cache events; least-loaded simply does not consume them. Updates the user/contributor docs and the investigation report to reflect the new default. Refs NVIDIA#1197

yuanchen8911 · 2026-06-22T19:00:12Z

Validated on EKS H100 (aicr3) before it left draft. inference-perf passed at 8-GPU/2048 concurrency: 136,054 tok/s, TTFT p99 703 ms. The stall from #1197 didn't reproduce. least-loaded does a much better job than round-robin and kv-aware here — performance is much more consistent and predictable, and better overall.

yuanchen8911 added area/validator area/docs labels Jun 22, 2026

github-actions Bot added size/S and removed area/validator labels Jun 22, 2026

yuanchen8911 force-pushed the feat/dynamo-frontend-least-loaded-1197 branch from 9157f89 to 2534e10 Compare June 22, 2026 16:58

github-actions Bot added size/M and removed size/S labels Jun 22, 2026

yuanchen8911 mentioned this pull request Jun 22, 2026

inference-perf: stochastic worker-stall / throughput degradation on EKS H100 at 2048 concurrency #1197

Closed

yuanchen8911 marked this pull request as ready for review June 22, 2026 17:46

yuanchen8911 requested a review from a team as a code owner June 22, 2026 17:46

yuanchen8911 requested a review from mchmarny June 22, 2026 17:46

mchmarny reviewed Jun 22, 2026

yuanchen8911 force-pushed the feat/dynamo-frontend-least-loaded-1197 branch from 2534e10 to b6a205f Compare June 22, 2026 18:49

mchmarny assigned yuanchen8911 Jun 22, 2026

yuanchen8911 requested a review from mchmarny June 22, 2026 19:00

mchmarny approved these changes Jun 22, 2026

mchmarny merged commit d57f7ac into NVIDIA:main Jun 22, 2026
32 checks passed
