| 1 | +# Host Performance Regression Tests |
| 2 | + |
| 3 | +## Purpose |
| 4 | + |
| 5 | +These tests detect **host (CPU) performance regressions** in the PyExecutor |
| 6 | +pipeline using a two-layer approach: |
| 7 | + |
| 8 | +- **Layer 1 (E2E)**: Run real models with `trtllm-serve` on host-overhead-dominant |
| 9 | + workloads via `test_perf_sanity.py`. Standard serving metrics (ITL, TPOT, |
| 10 | + throughput) show whether a regression occurred. |
| 11 | +- **Layer 2 (Module)**: Benchmark individual modules (scheduler, sampler, |
| 12 | + resource manager) in isolation to pinpoint *which* module regressed. |
| 13 | + |
| 14 | +## Layer 1: E2E Tests |
| 15 | + |
| 16 | +E2E host perf tests reuse the existing `test_perf_sanity.py` infrastructure with |
| 17 | +host-overhead-dominant YAML configs in `tests/scripts/perf-sanity/aggregated/host_perf_*.yaml`. |
| 18 | + |
| 19 | +### Why these workloads detect host regressions |
| 20 | + |
| 21 | +| Factor | Choice | Effect | |
| 22 | +|--------|--------|--------| |
| 23 | +| Model size | Small (8B-16B) | Fast GPU kernels, host overhead exposed | |
| 24 | +| Batch size | Small (1-32) | GPU not saturated, host scheduling overhead visible | |
| 25 | +| Sequence length | Short (ISL=128, OSL=128-256) | High iteration rate, many scheduling cycles | |
| 26 | +| GPU count | 1 | No communication overhead | |
| 27 | + |
| 28 | +### Models |
| 29 | + |
| 30 | +- **DeepSeek-V3-Lite (FP8)**: MLA attention + MoE architecture. Exercises |
| 31 | + attention DP scheduling and expert routing host paths. |
| 32 | +- **Llama-3.1-8B (FP16)**: Dense model baseline. Covers core scheduler/sampler |
| 33 | + overhead without MoE/MLA-specific paths. |
| 34 | + |
| 35 | +### E2E Metrics |
| 36 | + |
| 37 | +- **Mean ITL**: Inter-token latency, the most direct proxy for per-iteration host overhead |
| 38 | +- **Mean TPOT**: Time per output token; includes both host and GPU time per token |
| 39 | +- **P99 ITL**: Catches outlier iterations (GC pauses, scheduling spikes) |
| 40 | +- **Token throughput**: Overall throughput sanity check |
| 41 | + |
| 42 | +### Running E2E tests |
| 43 | + |
| 44 | +```bash |
| 45 | +# Run a specific host perf config through perf_sanity |
| 46 | +pytest tests/integration/defs/perf/test_perf_sanity.py -v \ |
| 47 | + -k "host_perf_llama8b-llama8b_fp16_bs8_128_256" \ |
| 48 | + --output-dir ./host_perf_results |
| 49 | +``` |
| 50 | + |
| 51 | +Requires 1 GPU and `LLM_MODELS_ROOT` set to the model weights directory. |
| 52 | + |
| 53 | +## Layer 2: Module Tests |
| 54 | + |
| 55 | +### Scheduler (`test_module_scheduler.py`) |
| 56 | + |
| 57 | +Benchmarks `schedule_request()` latency at various batch sizes. Runs entirely |
| 58 | +on CPU; no GPU or model weights are required. Measures the Python-side overhead of |
| 59 | +the `SimpleUnifiedScheduler` (capacity check + micro-batch scheduling). |
| 60 | + |
| 61 | +```bash |
| 62 | +pytest tests/integration/defs/perf/host_perf/test_module_scheduler.py -v -s |
| 63 | +``` |
| 64 | + |
| 65 | +### Module Metrics |
| 66 | + |
| 67 | +- **Mean latency (µs)**: Average per-call cost |
| 68 | +- **P50/P99 latency (µs)**: Distribution characteristics |
| 69 | +- **Calls/sec**: Throughput under stress |
| 70 | + |
| 71 | +## Adding new configs |
| 72 | + |
| 73 | +### E2E configs |
| 74 | +1. Create or edit a `host_perf_*.yaml` file in `tests/scripts/perf-sanity/aggregated/` |
| 75 | +2. Follow the existing YAML format (see `host_perf_deepseek_v3_lite.yaml`) |
| 76 | +3. Keep configs host-overhead-dominant: small batch, short sequences, small models |
| 77 | +4. Add the test entry to `l0_b200.yml` as `perf/test_perf_sanity.py::test_e2e[aggr_upload-{yaml_name}-{server_name}]` |
| 78 | + |
| 79 | +### Module tests |
| 80 | +1. Create a `test_module_<name>.py` file |
| 81 | +2. Set up minimal real objects with synthetic workload state |
| 82 | +3. Time the target function in a tight loop (1000+ calls) |
| 83 | +4. Report mean/P50/P99 latency per call |
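
The module-test steps above (time the target in a tight loop, then report mean/P50/P99 latency per call) could be sketched roughly as follows. This is a minimal illustration, not the harness used in the repository: `benchmark`, `percentile`, and `fake_schedule_request` are hypothetical names, the workload is a stand-in for a real module call, and deriving calls/sec from the mean latency is an assumption about how that metric might be computed.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def percentile(sorted_samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct / 100.0 * len(sorted_samples))
    return sorted_samples[max(rank, 1) - 1]


def benchmark(fn: Callable[[], object], calls: int = 1000,
              warmup: int = 100) -> Dict[str, float]:
    """Time fn() in a tight loop and summarize per-call latency in microseconds."""
    # Warm-up calls are not recorded; they let caches and lazy init settle.
    for _ in range(warmup):
        fn()

    samples_us: List[float] = []
    for _ in range(calls):
        start = time.perf_counter_ns()
        fn()
        samples_us.append((time.perf_counter_ns() - start) / 1_000)

    samples_us.sort()
    mean_us = statistics.fmean(samples_us)
    return {
        "mean_us": mean_us,
        "p50_us": percentile(samples_us, 50),
        "p99_us": percentile(samples_us, 99),
        # Assumption: throughput is derived from the mean per-call latency.
        "calls_per_sec": 1e6 / mean_us if mean_us > 0 else float("inf"),
    }


def fake_schedule_request() -> list:
    """Hypothetical stand-in workload; a real test would call the module under test."""
    return sorted(range(64, 0, -1))


if __name__ == "__main__":
    for name, value in benchmark(fake_schedule_request).items():
        print(f"{name}: {value:.2f}")
```

Timing each call individually, rather than the whole loop, is what makes the P50/P99 figures available; the trade-off is that the timer's own overhead is included in every sample, which matters for very short calls.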