feat(validation): switch inference-perf frontend routing to least-loaded #1399

Merged: mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:feat/dynamo-frontend-least-loaded-1197 on Jun 22, 2026
yuanchen8911 (Contributor) commented Jun 22, 2026

Summary

Switch the inference-perf validation workload's Dynamo frontend from KV-cache-aware routing (DYN_ROUTER_MODE=kv) to load-aware least-loaded routing (DYN_ROUTER_MODE=least-loaded).

Motivation / Context

The inference-perf frontend routed by KV overlap, which can keep sending a transiently-slow worker its full share of requests. On EKS H100 at the 2048-concurrency saturation knee this produced stochastic worker-stall / throughput degradation — TTFT p99 spiking from ~700 ms to 4–77 s and aggregate throughput dropping 30–85% — while GKE H100 stayed consistently clean. least-loaded balances by each worker's active in-flight load, so a backed-up worker stops receiving new requests until it drains, removing the positive-feedback loop behind the stall.

Why least-loaded is the best fit for the AICR performance benchmark. The performance phase drives the workload with AIPerf under deliberately uniform, reproducible conditions — a single served model, a pinned prompt pool, fixed input/output token counts (stddev 0), and greedy decoding (temperature: 0) — against a homogeneous pool of identical single-GPU vLLM workers (one H100 per worker via DRA). Under those conditions KV-overlap (kv) routing has no advantage to exploit: there is no worker heterogeneity to weigh, and the synthetic prompts carry no meaningful cross-request prefix-cache reuse for the router to capitalize on. What KV routing does do is bias requests toward an "overlap-matching" worker, which lets a transiently-slow worker keep accumulating load — exactly the saturation-knee failure mode in #1197. Least-loaded optimizes for the one signal that actually governs tail latency in this benchmark — instantaneous per-worker queue depth — so it is the most appropriate mode for AICR's AIPerf-driven throughput/TTFT measurement. (KV-aware routing remains the right default for production traffic with real prompt-prefix reuse; this change scopes only the benchmark workload.)
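
The feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch, not Dynamo's router: `Worker`, `simulate`, the tick model, and the `fixed-share` policy are all hypothetical, and `fixed-share` is only a stand-in for any load-blind policy that keeps giving a slow worker its full share of requests.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    in_flight: int = 0     # requests currently queued or running on this worker
    stalled: bool = False  # a stalled worker completes nothing (assumption)


def pick_least_loaded(workers):
    # Fewest in-flight requests wins; ties go to the earliest worker in the list.
    return min(workers, key=lambda w: w.in_flight)


def simulate(policy, ticks=50, arrivals_per_tick=4, service_per_tick=2):
    # Four identical workers, one of which (w0) is stalled for the whole run.
    workers = [Worker("w0", stalled=True), Worker("w1"), Worker("w2"), Worker("w3")]
    sent = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if policy == "least-loaded":
                target = pick_least_loaded(workers)
            else:
                # "fixed-share": load-blind rotation, every worker gets an equal share.
                target = workers[sent % len(workers)]
            target.in_flight += 1
            sent += 1
        # Healthy workers drain up to service_per_tick requests per tick.
        for w in workers:
            if not w.stalled:
                w.in_flight = max(0, w.in_flight - service_per_tick)
    return {w.name: w.in_flight for w in workers}


if __name__ == "__main__":
    print("fixed-share :", simulate("fixed-share"))
    print("least-loaded:", simulate("least-loaded"))
```

With the default parameters, the stalled worker's backlog grows by one request per tick under `fixed-share`, while under `least-loaded` it stops receiving requests once its queue is deeper than its peers'.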

Fixes: N/A
Related: #1197, #1043 (epic)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update

Component(s) Affected

  • Validator (pkg/validator)
  • Docs/examples (docs/, examples/)

Implementation Notes

  • One-line behavior change: DYN_ROUTER_MODE on the Frontend container in validators/performance/testdata/inference/dynamo-deployment.yaml flips kv → least-loaded.
  • least-loaded is a first-class Dynamo 1.2 router mode (set via the same frontend env var). No image/runtime bump required — the manifest already targets Dynamo 1.2.0.
  • Workers keep the vLLM ZMQ KV-cache event publisher; least-loaded simply does not consume those events. Left in place to avoid widening scope and to keep kv mode trivially restorable.
  • The gateway-epp routing path (--router-mode direct sidecars, external endpoint-picker) is unchanged and out of scope.
  • Docs updated to match: docs/user/validation.md, docs/contributor/validator.md, and the investigation report docs/contributor/inference-perf-fluctuation.md.

Testing

yamllint validators/performance/testdata/inference/dynamo-deployment.yaml   # clean

YAML + testdata/docs only — no Go source changed, so test/e2e/Go-lint cannot regress. End-to-end behavior (degradation no longer reproduces under least-loaded on EKS H100) needs a cluster run before this leaves draft.

Risk Assessment

  • Low — Single env-value change in a validator testdata manifest, easy to revert; kv mode remains available.

Rollout notes: Affects only the inference-perf validation workload, not user-deployed recipes/bundles.

Checklist

  • Tests pass locally (make test with -race) — N/A, no Go changes
  • Linter passes (yamllint on changed manifest)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (testdata value change)
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

coderabbitai Bot commented Jun 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ea34fd6c-19bc-4e53-9fde-a22cf4e7b16f

📥 Commits

Reviewing files that changed from the base of the PR and between 2534e10 and b6a205f.

📒 Files selected for processing (5)
  • docs/contributor/inference-perf-fluctuation.md
  • docs/contributor/validator.md
  • docs/user/validation.md
  • validators/performance/inference_perf_constraint.go
  • validators/performance/testdata/inference/dynamo-deployment.yaml

Walkthrough

The DYN_ROUTER_MODE environment variable in the Dynamo vLLM deployment test configuration is changed from kv to least-loaded, with added inline comments explaining the in-flight-load-based balancing behavior. The corresponding code comment for resolveRoutingMode is expanded to document both routing modes. Three documentation files are updated in parallel: docs/contributor/inference-perf-fluctuation.md replaces its long-term KV-cache routing plan with least-loaded routing and clarifies version-skew prevention, docs/contributor/validator.md updates the inference-perf routing methodology description, and docs/user/validation.md rewrites the dynamo-router mode description to reflect the new default. All references to KV-cache event consumption by the router are removed or clarified as unused by least-loaded mode.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

🚥 Pre-merge checks: ✅ 4 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: switching inference-perf frontend routing from KV-cache-aware to least-loaded mode. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the motivation, implementation, and impact of the routing mode switch. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


yuanchen8911 force-pushed the feat/dynamo-frontend-least-loaded-1197 branch from 9157f89 to 2534e10 on June 22, 2026 16:58
github-actions Bot added size/M and removed size/S labels on Jun 22, 2026
yuanchen8911 marked this pull request as ready for review on June 22, 2026 17:46
yuanchen8911 requested a review from a team as a code owner on June 22, 2026 17:46
yuanchen8911 requested a review from mchmarny on June 22, 2026 17:46

mchmarny (Member) left a comment

Tightly-scoped, well-reasoned change. The env flip is on the Frontend container, the worker's vLLM ZMQ KV-events publisher is correctly retained (matching the "least-loaded doesn't consume them, kept for trivial revert" rationale), and all three doc files plus the resolveRoutingMode comment are updated coherently. CI is green.

One concern blocks my approval — not the diff itself, but verification:

  • The Testing section gates this on an EKS H100 cluster run ("degradation no longer reproduces under least-loaded ... needs a cluster run before this leaves draft"), but the PR is no longer a draft and no run is recorded. The whole premise — that least-loaded stops the worker-stall, and that least-loaded is even an accepted DYN_ROUTER_MODE value for the Dynamo 1.2 frontend — rests on that run. A bad enum value would only surface at frontend startup on a real cluster, so I'd want the deploy + non-reproduction confirmed before merge. Everything is reversible (kv restorable), so this is a "verify, then merge," not a redesign.

Non-blocking nit: CodeRabbit flagged the tension with #1374 (make DYN_ROUTER_MODE configurable via AICR_INFERENCE_PERF_ROUTER_MODE rather than hardcoding). Hardcoding is consistent with the current template, so fine to keep here and let #1374 generalize it later — worth a one-line note in the PR that this is intentional.

Holding as COMMENT rather than approve pending the cluster confirmation.
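
The two review points above (an invalid mode string only failing at frontend startup, and the #1374 idea of reading the mode from AICR_INFERENCE_PERF_ROUTER_MODE) could be combined along these lines. This is a hedged sketch only: `resolve_router_mode` is a hypothetical helper, not the repository's `resolveRoutingMode`, and the `ALLOWED_MODES` set is an assumption assembled from mode names mentioned in this thread rather than from the Dynamo documentation.

```python
import os

# Assumed set of accepted mode strings (not verified against Dynamo 1.2).
ALLOWED_MODES = {"kv", "least-loaded", "round-robin"}

# Environment variable name proposed in #1374.
ENV_VAR = "AICR_INFERENCE_PERF_ROUTER_MODE"


def resolve_router_mode(environ=None, default="least-loaded"):
    """Return the router mode to render into DYN_ROUTER_MODE.

    Falls back to `default` when the variable is unset or empty, and rejects
    unknown values early instead of letting the frontend fail at startup.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR, "").strip().lower()
    if not raw:
        return default
    if raw not in ALLOWED_MODES:
        raise ValueError(
            f"{ENV_VAR}={raw!r} is not one of {sorted(ALLOWED_MODES)}"
        )
    return raw


if __name__ == "__main__":
    print(resolve_router_mode({}))                # falls back to the default
    print(resolve_router_mode({ENV_VAR: " KV "}))  # normalized to "kv"
```

Validating before the manifest is applied would surface a typo as a local error rather than as a crash-looping frontend pod, but it does not replace the cluster run requested above, since the allowed set here is itself an assumption.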

The inference-perf workload's Dynamo frontend defaulted to KV-cache-aware
routing (DYN_ROUTER_MODE=kv), which routes by KV overlap and can keep
sending a transiently-slow worker its full share of requests. On EKS H100
at the saturation knee this produced stochastic worker-stall / throughput
degradation (TTFT p99 spiking 4-77s, throughput dropping 30-85%), while
GKE H100 stayed clean.

Switch the default to load-aware least-loaded routing
(DYN_ROUTER_MODE=least-loaded), which balances by each worker's active
in-flight load so a backed-up worker stops receiving new requests until it
drains. Workers still publish vLLM KV-cache events; least-loaded simply
does not consume them.

Updates the user/contributor docs and the investigation report to reflect
the new default.

Refs NVIDIA#1197
yuanchen8911 force-pushed the feat/dynamo-frontend-least-loaded-1197 branch from 2534e10 to b6a205f on June 22, 2026 18:49

yuanchen8911 (Contributor, Author) commented

Validated on EKS H100 (aicr3) before it left draft. inference-perf passed at 8-GPU/2048 concurrency: 136,054 tok/s, TTFT p99 703 ms. The stall from #1197 didn't reproduce. least-loaded does a much better job than round-robin and kv-aware here — performance is much more consistent and predictable, and better overall.

yuanchen8911 requested a review from mchmarny on June 22, 2026 19:00
mchmarny merged commit d57f7ac into NVIDIA:main on Jun 22, 2026
32 checks passed