Commit 7876dc7

[TRTLLM-11324][perf] Add host performance regression test suite for PyExecutor (#12148)
Signed-off-by: Yukun He <[email protected]>
1 parent 8449c18 commit 7876dc7

13 files changed

Lines changed: 1422 additions & 0 deletions
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# Host Performance Regression Tests

## Purpose

These tests detect **host (CPU) performance regressions** in the PyExecutor
pipeline using a two-layer approach:

- **Layer 1 (E2E)**: Run real models with `trtllm-serve` on host-overhead-dominant
  workloads via `test_perf_sanity.py`. Standard metrics (ITL, TPOT, throughput)
  catch regressions.
- **Layer 2 (Module)**: Isolated benchmarks of individual modules (scheduler,
  sampler, resource manager). Pinpoint *which* module regressed.

## Layer 1: E2E Tests

E2E host perf tests reuse the existing `test_perf_sanity.py` infrastructure with
host-overhead-dominant YAML configs in `tests/scripts/perf-sanity/aggregated/host_perf_*.yaml`.

### Why these workloads detect host regressions

| Factor | Choice | Effect |
|--------|--------|--------|
| Model size | Small (8B-16B) | Fast GPU kernels, host overhead exposed |
| Batch size | Small (1-32) | GPU not saturated, host scheduling overhead visible |
| Sequence length | Short (ISL=128, OSL=128-256) | High iteration rate, many scheduling cycles |
| GPU count | 1 | No communication overhead |

### Models

- **DeepSeek-V3-Lite (FP8)**: MLA attention + MoE architecture. Exercises
  attention DP scheduling and expert routing host paths.
- **Llama-3.1-8B (FP16)**: Dense model baseline. Covers core scheduler/sampler
  overhead without MoE/MLA-specific paths.

### E2E Metrics

- **Mean ITL**: Inter-token latency, the most direct proxy for per-iteration host overhead
- **Mean TPOT**: Time per output token, which includes both host and GPU time per token
- **P99 ITL**: Catches outlier iterations (GC pauses, scheduling spikes)
- **Token throughput**: Overall throughput sanity check
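
As a rough illustration of how the latency metrics above relate to each other, the sketch below derives them from per-token arrival timestamps. The input format (one list of millisecond timestamps per request) and the helper names `percentile` and `summarize_e2e` are assumptions made for this example; they are not part of `test_perf_sanity.py`, which may aggregate these values differently.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def summarize_e2e(requests_ms):
    """Summarize latency for a list of requests.

    Each request is a list of output-token arrival times in milliseconds
    (hypothetical input format, chosen for this sketch only).
    """
    all_gaps = []  # every inter-token gap, pooled across requests
    tpots = []     # one time-per-output-token value per request
    for times in requests_ms:
        if len(times) < 2:
            continue  # a single token has no inter-token gap
        all_gaps.extend(b - a for a, b in zip(times, times[1:]))
        # TPOT here: decode time after the first token / remaining tokens.
        tpots.append((times[-1] - times[0]) / (len(times) - 1))
    if not all_gaps:
        raise ValueError("need at least one request with two or more tokens")
    return {
        "mean_itl_ms": mean(all_gaps),
        "p99_itl_ms": percentile(all_gaps, 99),
        "mean_tpot_ms": mean(tpots),
    }


if __name__ == "__main__":
    # Two synthetic requests; the second contains one slow iteration.
    print(summarize_e2e([[0, 10, 20, 30], [0, 10, 50]]))
```

In this sketch mean ITL pools every gap across requests, while mean TPOT averages one value per request, so a single slow iteration moves P99 ITL far more than either mean.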

### Running E2E tests

```bash
# Run a specific host perf config through perf_sanity
pytest tests/integration/defs/perf/test_perf_sanity.py -v \
    -k "host_perf_llama8b-llama8b_fp16_bs8_128_256" \
    --output-dir ./host_perf_results
```

Requires one GPU and `LLM_MODELS_ROOT` set to the model weights directory.

## Layer 2: Module Tests

### Scheduler (`test_module_scheduler.py`)

Benchmarks `schedule_request()` latency at various batch sizes. Runs entirely
on CPU; no GPU or model weights are required. Tests the Python-side overhead of
the `SimpleUnifiedScheduler` (capacity check + micro-batch scheduling).

```bash
pytest tests/integration/defs/perf/host_perf/test_module_scheduler.py -v -s
```

### Module Metrics

- **Mean latency (µs)**: Average per-call cost
- **P50/P99 latency (µs)**: Distribution characteristics
- **Calls/sec**: Throughput under stress

## Adding new configs

### E2E configs
1. Create or edit a `host_perf_*.yaml` file in `tests/scripts/perf-sanity/aggregated/`
2. Follow the existing YAML format (see `host_perf_deepseek_v3_lite.yaml`)
3. Keep configs host-overhead-dominant: small batch, short sequences, small models
4. Add the test entry to `l0_b200.yml` as `perf/test_perf_sanity.py::test_e2e[aggr_upload-{yaml_name}-{server_name}]`
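
The entry template in step 4 can be filled in mechanically. The helper below is a hypothetical convenience for this example, not something the repository provides. The split into `yaml_name` and `server_name` is an assumption inferred from the `-k` filter used in the E2E run command.

```python
def perf_sanity_test_id(yaml_name, server_name, mode="aggr_upload"):
    """Build a test-list entry from the template in step 4.

    Assumption: yaml_name is the config file name without the .yaml
    extension, and server_name is a server config name defined inside it.
    """
    return f"perf/test_perf_sanity.py::test_e2e[{mode}-{yaml_name}-{server_name}]"


if __name__ == "__main__":
    # Names taken from the -k filter in the E2E run command.
    print(perf_sanity_test_id("host_perf_llama8b", "llama8b_fp16_bs8_128_256"))
```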

### Module tests
1. Create a `test_module_<name>.py` file
2. Set up minimal real objects with synthetic workload state
3. Time the target function in a tight loop (1000+ calls)
4. Report mean/P50/P99 latency per call
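
The four steps above can be sketched as a small harness. This is a minimal illustration under stated assumptions: `benchmark` and `test_module_example` are hypothetical names, and the sorted-list workload is a stand-in for a real module call such as a scheduler step. The existing module tests may be structured differently.

```python
import math
import time
from statistics import mean, median


def benchmark(fn, calls=1000, warmup=100):
    """Time fn() in a tight loop and return per-call latency stats in µs."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before measuring
    samples_us = []
    for _ in range(calls):
        start = time.perf_counter_ns()
        fn()
        samples_us.append((time.perf_counter_ns() - start) / 1000.0)
    ordered = sorted(samples_us)
    p99 = ordered[max(1, math.ceil(0.99 * calls)) - 1]  # nearest-rank P99
    total_s = sum(samples_us) / 1e6
    return {
        "mean_us": mean(samples_us),
        "p50_us": median(samples_us),
        "p99_us": p99,
        "calls_per_sec": calls / total_s if total_s > 0 else float("inf"),
    }


def test_module_example():
    # Step 2: synthetic workload state (a stand-in for real module objects).
    state = list(range(256))
    # Step 3: time the target function in a tight loop.
    stats = benchmark(lambda: sorted(state, reverse=True))
    # Step 4: report mean/P50/P99 latency per call.
    print(
        f"mean={stats['mean_us']:.1f}us "
        f"p50={stats['p50_us']:.1f}us p99={stats['p99_us']:.1f}us"
    )
    assert stats["p50_us"] <= stats["p99_us"]
```

The warmup loop keeps one-time costs out of the samples. Sorting the samples once gives both P50 and P99 without a second timing pass.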
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
{
    "_description": "Baseline median latencies (µs) collected on B200 GPU, TRT-LLM 1.3.0rc7. Used for soft regression detection (warning-only, no test failure).",
    "_regression_factor": 2.0,
    "_collected": "2026-03-18",
    "scheduler": {
        "production_gen_only_bs8": 11.8,
        "production_mixed_32gen_4ctx": 40.5
    },
    "sampler": {
        "greedy_bs8": 43.5,
        "stopwords_bs32": 103.1
    },
    "kv_cache": {
        "gen_bs8": 9.5,
        "ctx_bs8": 12.7
    }
}
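
To make the "soft regression detection" described in `_description` concrete, the sketch below shows one way such a baseline file could be consumed. It is an assumption-laden illustration: `check_soft_regression` is a hypothetical helper, and the strict `>` comparison against `baseline * _regression_factor` is a guess at the rule. The actual logic in `regression_helper` is not shown in this excerpt.

```python
import json
import warnings

# Excerpt of the baseline file above, inlined so the sketch is self-contained.
BASELINE_JSON = """
{
    "_regression_factor": 2.0,
    "scheduler": {"production_gen_only_bs8": 11.8},
    "sampler": {"greedy_bs8": 43.5}
}
"""


def check_soft_regression(baselines, module, case, measured_median_us):
    """Warn (never fail) when a measured median exceeds baseline * factor.

    Returns True if a warning was issued. Unknown module/case pairs are
    ignored so that new benchmarks can be added before a baseline exists.
    """
    factor = baselines.get("_regression_factor", 2.0)
    baseline = baselines.get(module, {}).get(case)
    if baseline is None:
        return False
    regressed = measured_median_us > baseline * factor
    if regressed:
        warnings.warn(
            f"{module}/{case}: median {measured_median_us:.1f} us exceeds "
            f"{factor}x baseline ({baseline} us)"
        )
    return regressed


if __name__ == "__main__":
    baselines = json.loads(BASELINE_JSON)
    # 20.0 us is below the 2.0 * 11.8 = 23.6 us threshold, so no warning.
    check_soft_regression(baselines, "scheduler", "production_gen_only_bs8", 20.0)
    # 30.0 us is above the threshold, so a warning is emitted.
    check_soft_regression(baselines, "scheduler", "production_gen_only_bs8", 30.0)
```

Warning instead of failing keeps noisy shared CI machines from blocking merges while still surfacing large slowdowns.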
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared pytest fixtures for host_perf module-level tests.

Provides a session-scoped fixture that uploads accumulated module-level
performance results (scheduler, sampler, kv_cache) to OpenSearch at the
end of the test session.
"""

import pytest


@pytest.fixture(scope="session", autouse=True)
def module_perf_db_finalizer():
    """Upload accumulated module perf results to OpenSearch after all tests."""
    yield
    from .regression_helper import get_collected_results, post_module_perf_to_db

    if get_collected_results():
        try:
            post_module_perf_to_db()
        except Exception as e:
            from defs.trt_test_alternative import print_info

            print_info(f"[module_perf_db] Failed to upload to OpenSearch: {e}")
