feat(validation): switch inference-perf frontend routing to least-loaded#1399
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThe Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
9157f89 to
2534e10
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Tightly-scoped, well-reasoned change. The env flip is on the Frontend container, the worker's vLLM ZMQ KV-events publisher is correctly retained (matching the "least-loaded doesn't consume them, kept for trivial revert" rationale), and all three doc files plus the resolveRoutingMode comment are updated coherently. CI is green.
One concern blocks my approval — not the diff itself, but verification:
- The Testing section gates this on an EKS H100 cluster run ("degradation no longer reproduces under least-loaded ... needs a cluster run before this leaves draft"), but the PR is no longer a draft and no run is recorded. The whole premise — that least-loaded stops the worker-stall, and that
least-loadedis even an acceptedDYN_ROUTER_MODEvalue for the Dynamo 1.2 frontend — rests on that run. A bad enum value would only surface at frontend startup on a real cluster, so I'd want the deploy + non-reproduction confirmed before merge. Everything is reversible (kvrestorable), so this is a "verify, then merge," not a redesign.
Non-blocking nit: CodeRabbit flagged the tension with #1374 (make DYN_ROUTER_MODE configurable via AICR_INFERENCE_PERF_ROUTER_MODE rather than hardcoding). Hardcoding is consistent with the current template, so fine to keep here and let #1374 generalize it later — worth a one-line note in the PR that this is intentional.
Holding as COMMENT rather than approve pending the cluster confirmation.
The inference-perf workload's Dynamo frontend defaulted to KV-cache-aware routing (DYN_ROUTER_MODE=kv), which routes by KV overlap and can keep sending a transiently-slow worker its full share of requests. On EKS H100 at the saturation knee this produced stochastic worker-stall / throughput degradation (TTFT p99 spiking 4-77s, throughput dropping 30-85%), while GKE H100 stayed clean. Switch the default to load-aware least-loaded routing (DYN_ROUTER_MODE=least-loaded), which balances by each worker's active in-flight load so a backed-up worker stops receiving new requests until it drains. Workers still publish vLLM KV-cache events; least-loaded simply does not consume them. Updates the user/contributor docs and the investigation report to reflect the new default. Refs NVIDIA#1197
2534e10 to
b6a205f
Compare
|
Validated on EKS H100 (aicr3) before it left draft. inference-perf passed at 8-GPU/2048 concurrency: 136,054 tok/s, TTFT p99 703 ms. The stall from #1197 didn't reproduce. least-loaded does a much better job than round-robin and kv-aware here — performance is much more consistent and predictable, and better overall. |
Summary
Switch the
inference-perfvalidation workload's Dynamo frontend from KV-cache-aware routing (DYN_ROUTER_MODE=kv) to load-aware least-loaded routing (DYN_ROUTER_MODE=least-loaded).Motivation / Context
The inference-perf frontend routed by KV overlap, which can keep sending a transiently-slow worker its full share of requests. On EKS H100 at the 2048-concurrency saturation knee this produced stochastic worker-stall / throughput degradation — TTFT p99 spiking from ~700 ms to 4–77 s and aggregate throughput dropping 30–85% — while GKE H100 stayed consistently clean.
least-loadedbalances by each worker's active in-flight load, so a backed-up worker stops receiving new requests until it drains, removing the positive-feedback loop behind the stall.Why least-loaded is the best fit for the AICR performance benchmark. The performance phase drives the workload with AIPerf under deliberately uniform, reproducible conditions — a single served model, a pinned prompt pool, fixed input/output token counts (stddev 0), and greedy decoding (
temperature: 0) — against a homogeneous pool of identical single-GPU vLLM workers (one H100 per worker via DRA). Under those conditions KV-overlap (kv) routing has no advantage to exploit: there is no worker heterogeneity to weigh, and the synthetic prompts carry no meaningful cross-request prefix-cache reuse for the router to capitalize on. What KV routing does do is bias requests toward an "overlap-matching" worker, which lets a transiently-slow worker keep accumulating load — exactly the saturation-knee failure mode in #1197. Least-loaded optimizes for the one signal that actually governs tail latency in this benchmark — instantaneous per-worker queue depth — so it is the most appropriate mode for AICR's AIPerf-driven throughput/TTFT measurement. (KV-aware routing remains the right default for production traffic with real prompt-prefix reuse; this change scopes only the benchmark workload.)Fixes: N/A
Related: #1197, #1043 (epic)
Type of Change
Component(s) Affected
pkg/validator)docs/,examples/)Implementation Notes
DYN_ROUTER_MODEon the Frontend container invalidators/performance/testdata/inference/dynamo-deployment.yamlflipskv→least-loaded.least-loadedis a first-class Dynamo 1.2 router mode (set via the same frontend env var). No image/runtime bump required — the manifest already targets Dynamo 1.2.0.kvmode trivially restorable.gateway-epprouting path (--router-mode directsidecars, external endpoint-picker) is unchanged and out of scope.docs/user/validation.md,docs/contributor/validator.md, and the investigation reportdocs/contributor/inference-perf-fluctuation.md.Testing
yamllint validators/performance/testdata/inference/dynamo-deployment.yaml # cleanYAML + testdata/docs only — no Go source changed, so test/e2e/Go-lint cannot regress. End-to-end behavior (degradation no longer reproduces under least-loaded on EKS H100) needs a cluster run before this leaves draft.
Risk Assessment
kvmode remains available.Rollout notes: Affects only the inference-perf validation workload, not user-deployed recipes/bundles.
Checklist
make testwith-race) — N/A, no Go changesgit commit -S)