diff --git a/tests/integration/defs/perf/host_perf/README.md b/tests/integration/defs/perf/host_perf/README.md
Lines changed: 83 additions & 0 deletions
diff --git a/tests/integration/defs/perf/host_perf/__init__.py b/tests/integration/defs/perf/host_perf/__init__.py
Lines changed: 14 additions & 0 deletions
diff --git a/tests/integration/defs/perf/host_perf/baselines_b200.json b/tests/integration/defs/perf/host_perf/baselines_b200.json
Lines changed: 17 additions & 0 deletions
diff --git a/tests/integration/defs/perf/host_perf/conftest.py b/tests/integration/defs/perf/host_perf/conftest.py
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Host Performance Regression Tests
+
+## Purpose
+
+These tests detect **host (CPU) performance regressions** in the PyExecutor
+pipeline using a two-layer approach:
+
+- **Layer 1 (E2E)**: Run real models with `trtllm-serve` on host-overhead-dominant
+  workloads via `test_perf_sanity.py`. Standard metrics (ITL, TPOT, throughput)
+  catch regressions.
+- **Layer 2 (Module)**: Benchmark individual modules (scheduler, sampler,
+  resource manager) in isolation to pinpoint *which* module regressed.
+
+## Layer 1: E2E Tests
+
+E2E host perf tests reuse the existing `test_perf_sanity.py` infrastructure with
+host-overhead-dominant YAML configs in `tests/scripts/perf-sanity/aggregated/host_perf_*.yaml`.
+
+### Why these workloads detect host regressions
+
+| Factor | Choice | Effect |
+|--------|--------|--------|
+| Model size | Small (8B-16B) | Fast GPU kernels, host overhead exposed |
+| Batch size | Small (1-32) | GPU not saturated, host scheduling overhead visible |
+| Sequence length | Short (ISL=128, OSL=128-256) | High iteration rate, many scheduling cycles |
+| GPU count | 1 | No communication overhead |
+
+### Models
+
+- **DeepSeek-V3-Lite (FP8)**: MLA attention + MoE architecture. Exercises
+  attention DP scheduling and expert routing host paths.
+- **Llama-3.1-8B (FP16)**: Dense model baseline. Covers core scheduler/sampler
+  overhead without MoE/MLA-specific paths.
+
+### E2E Metrics
+
+- **Mean ITL**: Inter-token latency; the most direct proxy for per-iteration host overhead
+- **Mean TPOT**: Time per output token; includes both host and GPU time per token
+- **P99 ITL**: Catches outlier iterations (GC pauses, scheduling spikes)
+- **Token throughput**: Overall throughput sanity check
+
+### Running E2E tests
+
+```bash
+# Run a specific host perf config through perf_sanity
+pytest tests/integration/defs/perf/test_perf_sanity.py -v \
+    -k "host_perf_llama8b-llama8b_fp16_bs8_128_256" \
+    --output-dir ./host_perf_results
+```
+
+Requires one GPU and `LLM_MODELS_ROOT` set to the model weights directory.
+
+## Layer 2: Module Tests
+
+### Scheduler (`test_module_scheduler.py`)
+
+Benchmarks `schedule_request()` latency at various batch sizes. The benchmark
+runs entirely on the CPU and needs no GPU or model weights. It measures the
+Python-side overhead of `SimpleUnifiedScheduler` (capacity check plus micro-batch scheduling).
+
+```bash
+pytest tests/integration/defs/perf/host_perf/test_module_scheduler.py -v -s
+```
+
+### Module Metrics
+
+- **Mean latency (µs)**: Average per-call cost
+- **P50/P99 latency (µs)**: Distribution characteristics
+- **Calls/sec**: Throughput under stress
+
+## Adding new configs
+
+### E2E configs
+1. Create or edit a `host_perf_*.yaml` file in `tests/scripts/perf-sanity/aggregated/`
+2. Follow the existing YAML format (see `host_perf_deepseek_v3_lite.yaml`)
+3. Keep configs host-overhead-dominant: small batch, short sequences, small models
+4. Add the test entry to `l0_b200.yml` as `perf/test_perf_sanity.py::test_e2e[aggr_upload-{yaml_name}-{server_name}]`
+
+### Module tests
+1. Create a `test_module_<name>.py` file
+2. Set up minimal real objects with synthetic workload state
+3. Time the target function in a tight loop (1000+ calls)
+4. Report mean/P50/P99 latency per call
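The module-test steps in the README (time the target function in a tight loop, then report mean/P50/P99 latency per call) can be sketched as a small timing harness. This is a minimal illustration and not the helper used in this PR: `benchmark` and `percentile` are hypothetical names, and the warmup count and nearest-rank percentile method are assumptions.

```python
import math
import statistics
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an ascending-sorted, non-empty list."""
    rank = math.ceil(pct * len(sorted_samples) / 100)
    return sorted_samples[max(rank, 1) - 1]


def benchmark(fn, calls=1000, warmup=100):
    """Time fn() in a tight loop; return per-call latency stats in microseconds."""
    # Warm up caches and lazy initialization before measuring (assumed count).
    for _ in range(warmup):
        fn()

    samples_us = []
    for _ in range(calls):
        start_ns = time.perf_counter_ns()
        fn()
        samples_us.append((time.perf_counter_ns() - start_ns) / 1_000)

    samples_us.sort()
    mean_us = statistics.fmean(samples_us)
    return {
        "mean_us": mean_us,
        "p50_us": percentile(samples_us, 50),
        "p99_us": percentile(samples_us, 99),
        # Throughput derived from the mean; guarded against a zero mean.
        "calls_per_sec": 1e6 / mean_us if mean_us > 0 else float("inf"),
    }
```

A module test would pass a closure over its synthetic workload state, for example `benchmark(lambda: scheduler.schedule_request(...))`, where the arguments depend on the module under test.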
@@ -0,0 +1,14 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
@@ -0,0 +1,17 @@
+{
+    "_description": "Baseline median latencies (µs) collected on B200 GPU, TRT-LLM 1.3.0rc7. Used for soft regression detection (warning-only, no test failure).",
+    "_regression_factor": 2.0,
+    "_collected": "2026-03-18",
+    "scheduler": {
+        "production_gen_only_bs8": 11.8,
+        "production_mixed_32gen_4ctx": 40.5
+    },
+    "sampler": {
+        "greedy_bs8": 43.5,
+        "stopwords_bs32": 103.1
+    },
+    "kv_cache": {
+        "gen_bs8": 9.5,
+        "ctx_bs8": 12.7
+    }
+}
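The baseline file's description calls for soft, warning-only regression detection using `_regression_factor`. A minimal sketch of such a check follows. It assumes a regression means the measured median exceeds baseline times factor; the function name `check_soft_regression` is hypothetical, and the PR's actual logic lives in `regression_helper`, which is not shown here.

```python
import json
import warnings

# Excerpt of baselines_b200.json, inlined so the sketch is self-contained.
BASELINES = json.loads("""
{
    "_regression_factor": 2.0,
    "scheduler": {"production_gen_only_bs8": 11.8}
}
""")


def check_soft_regression(baselines, module, scenario, measured_median_us):
    """Warn (never fail) if the measured median exceeds baseline * factor."""
    baseline_us = baselines[module][scenario]
    factor = baselines["_regression_factor"]
    limit_us = baseline_us * factor
    regressed = measured_median_us > limit_us
    if regressed:
        # Assumption: a warning is enough; the test itself still passes.
        warnings.warn(
            f"{module}/{scenario}: measured {measured_median_us:.1f} us exceeds "
            f"{limit_us:.1f} us ({factor}x baseline of {baseline_us} us)"
        )
    return regressed
```

With the values above, a measured median of 30.0 µs for `production_gen_only_bs8` would trigger a warning (limit 23.6 µs), while 20.0 µs would not.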
@@ -0,0 +1,37 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Shared pytest fixtures for host_perf module-level tests.
+
+Provides a session-scoped fixture that uploads accumulated module-level
+performance results (scheduler, sampler, kv_cache) to OpenSearch at the
+end of the test session.
+"""
+
+import pytest
+
+
+@pytest.fixture(scope="session", autouse=True)
+def module_perf_db_finalizer():
+    """Upload accumulated module perf results to OpenSearch after all tests."""
+    yield
+    from .regression_helper import get_collected_results, post_module_perf_to_db
+
+    if get_collected_results():
+        try:
+            post_module_perf_to_db()
+        except Exception as e:
+            from defs.trt_test_alternative import print_info
+
+            print_info(f"[module_perf_db] Failed to upload to OpenSearch: {e}")