This page provides a high-level overview of the vLLM Semantic Router project: its purpose, architecture, key capabilities, and how it fits into the LLM inference stack. For detailed architectural components, see Architecture. For installation and deployment, see Getting Started. For configuration details, see Configuration Reference.
Sources: README.md31-48
vLLM Semantic Router is an intelligent routing layer that operates as an Envoy ExtProc (External Processor) server, providing system-level intelligence for Mixture-of-Models (MoM) deployments. It intercepts HTTP requests via Envoy's ext_proc filter, evaluates multiple signals to make routing decisions, applies security policies, and caches responses semantically.
The system addresses five fundamental questions:
Architecture Position:
The router operates at the gateway layer, intercepting all traffic through Envoy's ext_proc protocol. This architecture enables transparent request inspection, modification, and intelligent routing without requiring changes to client applications or backend models.
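To make the ext_proc integration concrete, below is a minimal, hedged sketch of an External Processor gRPC server using Envoy's go-control-plane bindings. It shows only the Recv/Send loop over `ProcessingRequest` messages; the handler bodies, port, and server struct are illustrative assumptions, not the project's actual implementation.

```go
package main

import (
	"io"
	"log"
	"net"

	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

// extProcServer is a bare-bones External Processor; the real router keeps
// per-request state and evaluates signals, caching, and security here.
type extProcServer struct {
	extprocv3.UnimplementedExternalProcessorServer
}

// Process receives one ProcessingRequest per phase and must answer each
// with a ProcessingResponse before Envoy continues the request.
func (s *extProcServer) Process(stream extprocv3.ExternalProcessor_ProcessServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}

		var resp *extprocv3.ProcessingResponse
		switch req.Request.(type) {
		case *extprocv3.ProcessingRequest_RequestHeaders:
			resp = &extprocv3.ProcessingResponse{
				Response: &extprocv3.ProcessingResponse_RequestHeaders{
					RequestHeaders: &extprocv3.HeadersResponse{},
				},
			}
		case *extprocv3.ProcessingRequest_RequestBody:
			// Signal extraction, decision evaluation, and cache lookup
			// would happen here before the request reaches a model.
			resp = &extprocv3.ProcessingResponse{
				Response: &extprocv3.ProcessingResponse_RequestBody{
					RequestBody: &extprocv3.BodyResponse{},
				},
			}
		default:
			// Response phases elided in this sketch.
			resp = &extprocv3.ProcessingResponse{}
		}
		if err := stream.Send(resp); err != nil {
			return err
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":50051") // assumed ExtProc port
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	extprocv3.RegisterExternalProcessorServer(srv, &extProcServer{})
	log.Fatal(srv.Serve(lis))
}
```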
Sources: README.md31-48 Diagrams 1 and 2 from high-level architecture
System Components with Code Locations:
Key Code Locations:
| Component | File(s) | Primary Functions |
|---|---|---|
| ExtProc Server | cmd/main.go | main(), gRPC server setup |
| Request Processing | src/semantic-router/router/extproc.go | ProcessRequestHeaders(), ProcessRequestBody() |
| Response Processing | src/semantic-router/router/extproc.go | ProcessResponseHeaders(), ProcessResponseBody() |
| Classifier | src/semantic-router/classification/classifier.go | EvaluateAllSignals(), ClassifyDomain(), ClassifyPII() |
| Decision Engine | src/semantic-router/engine/decision.go | EvaluateDecisions(), EvaluateRules() |
| Plugins | src/semantic-router/plugins/ | SemanticCachePlugin, SystemPromptPlugin, HallucinationPlugin |
| Cache Backends | src/semantic-router/cache/ | InMemoryCache, MilvusCache, RedisCache, HybridCache |
| CGO Bindings | candle-binding/semantic-router.go | InitCandleBertClassifier(), ClassifyDomain(), GenerateEmbedding() |
| Rust FFI | ffi/init.rs ffi/classify.rs ffi/embedding.rs | ffi_init_*, ffi_classify_*, ffi_generate_embedding |
Sources: Diagram 1 from high-level architecture, src/vllm-sr/supervisord.conf1-53 README.md52-128
vLLM Semantic Router provides four major capability areas:
| Capability | Description | Implementation |
|---|---|---|
| Signal-Driven Routing | Extracts 7 signal types and combines them via AND/OR rules to select optimal models | Classifier.EvaluateAllSignals() → DecisionEngine.EvaluateDecisions() |
| Semantic Caching | Caches responses by embedding similarity to reduce latency and costs | HNSW index, Milvus, Redis, Hybrid backends in src/semantic-router/cache/ |
| Content Safety | Detects and blocks PII, jailbreak attempts, and hallucinations via ML models | PII/Jailbreak/Hallucination classifiers via Rust FFI in ffi/classify.rs |
| Observability | Exposes Prometheus metrics, OpenTelemetry traces, and VSR headers | :9190/metrics, Jaeger OTLP, custom headers x-vsr-* |
The router evaluates 7 signal types in parallel and combines them to make decisions:
Signal Evaluation Path:
1. Classifier.EvaluateAllSignals() in src/semantic-router/classification/classifier.go calls the individual signal evaluators
2. The evaluators invoke the CGO bindings in candle-binding/semantic-router.go (a simplified sketch of this parallel fan-out appears below)
3. DecisionEngine.EvaluateDecisions() in src/semantic-router/engine/decision.go combines the resulting signals

Security is enforced via a plugin chain:
- PIIClassifier.ClassifyPII(), which calls ffi_classify_pii_modernbert
- JailbreakClassifier.ClassifyJailbreak()

For details on PII configuration, see Security and PII Policies. For hallucination detection, see Hallucination Detection and Mitigation.
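As an illustration of the parallel evaluation step, here is a small sketch of fanning out several signal sources with goroutines and collecting their results before the decision engine runs. The `Signal` shape, `evaluator` interface, and function name are assumptions for illustration, not the project's actual API.

```go
package classification

import "sync"

// Signal is a hypothetical, simplified view of one routing signal.
type Signal struct {
	Name       string  // e.g. "domain", "pii", "jailbreak"
	Label      string  // predicted class
	Confidence float64 // model confidence in [0, 1]
}

// evaluator abstracts a single signal source (keyword match, BERT
// classifier via CGO, embedding similarity, ...).
type evaluator interface {
	Evaluate(query string) Signal
}

// EvaluateAllSignals runs every evaluator concurrently and returns the
// collected signals, keyed by signal name, for the decision engine.
func EvaluateAllSignals(query string, evals []evaluator) map[string]Signal {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		signals = make(map[string]Signal, len(evals))
	)
	for _, ev := range evals {
		wg.Add(1)
		go func(ev evaluator) {
			defer wg.Done()
			sig := ev.Evaluate(query)
			mu.Lock()
			signals[sig.Name] = sig
			mu.Unlock()
		}(ev)
	}
	wg.Wait()
	return signals
}
```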
Sources: README.md31-48 website/docs/intro.md19-50 website/docs/overview/signal-driven-decisions.md8-146 Diagram 4 from high-level architecture
The table below summarizes the key technologies behind each capability area:
| Capability | Description | Key Technologies |
|---|---|---|
| Intelligent Routing | Route requests to optimal models based on intent classification, domain detection, embeddings, and user feedback | ModernBERT classifiers, LoRA adapters, Qwen3/Gemma embeddings |
| Semantic Caching | Cache responses using embedding similarity to reduce costs and latency | In-memory HNSW, Milvus, Redis, Hybrid backends |
| Content Safety | Detect and block PII, jailbreak attempts, and hallucinations | Fine-tuned BERT models, token-level classifiers, NLI explainers |
| Observability | Monitor decisions, performance, and system health | Prometheus metrics, Jaeger tracing, Grafana dashboards |
The system supports multiple signal types for routing decisions:
These signals can be combined using AND/OR logic in decision rules. For details, see Signals and Rules and Decisions and Routing Logic.
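The sketch below illustrates how AND/OR combination over such signals could look. The rule shape, field names, and confidence threshold are illustrative assumptions, not the router's actual decision-rule schema.

```go
package engine

// Signal mirrors the simplified signal shape from the earlier sketch.
type Signal struct {
	Label      string
	Confidence float64
}

// Condition matches one signal against an expected label with a
// minimum confidence (all names here are illustrative).
type Condition struct {
	Signal        string
	Equals        string
	MinConfidence float64
}

// Rule combines conditions with either "and" or "or" semantics and
// names the model to route to when it matches.
type Rule struct {
	Operator    string // "and" | "or"
	Conditions  []Condition
	TargetModel string
}

func (c Condition) matches(signals map[string]Signal) bool {
	s, ok := signals[c.Signal]
	return ok && s.Label == c.Equals && s.Confidence >= c.MinConfidence
}

// Evaluate returns true when the rule's conditions are satisfied under
// its AND/OR operator.
func (r Rule) Evaluate(signals map[string]Signal) bool {
	if r.Operator == "or" {
		for _, c := range r.Conditions {
			if c.matches(signals) {
				return true
			}
		}
		return false
	}
	for _, c := range r.Conditions { // default: "and"
		if !c.matches(signals) {
			return false
		}
	}
	return len(r.Conditions) > 0
}
```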
For PII configuration details, see Security and PII Policies. For hallucination detection, see TruthLens Hallucination Detection.
Sources: README.md31-48 website/sidebars.ts56-102 website/docs/tutorials/content-safety/pii-detection.md8-36 website/docs/tutorials/content-safety/hallucination-detection.md7-23
ExtProc Protocol Processing Phases:
| Phase | Handler Function | Key Actions |
|---|---|---|
| RequestHeaders | ProcessRequestHeaders() | Initialize RequestContext, extract tracing context, parse headers |
| RequestBody | ProcessRequestBody() | Extract signals, evaluate decisions, check security, lookup cache, mutate headers |
| ResponseHeaders | ProcessResponseHeaders() | Add VSR headers (x-vsr-selected-category, x-vsr-selected-model), record TTFT |
| ResponseBody | ProcessResponseBody() | Hallucination detection, cache storage, final metric recording |
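To make the ResponseHeaders row concrete, a handler could attach the x-vsr-* headers through Envoy's header-mutation response. This is a hedged sketch built from go-control-plane types; the helper name and the idea that the selected category and model come from request context are assumptions, not the router's actual code.

```go
package router

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// vsrHeadersResponse builds the ResponseHeaders-phase answer that adds
// routing metadata headers to the response flowing back to the client.
func vsrHeadersResponse(category, model string) *extprocv3.ProcessingResponse {
	set := func(k, v string) *corev3.HeaderValueOption {
		return &corev3.HeaderValueOption{Header: &corev3.HeaderValue{Key: k, Value: v}}
	}
	return &extprocv3.ProcessingResponse{
		Response: &extprocv3.ProcessingResponse_ResponseHeaders{
			ResponseHeaders: &extprocv3.HeadersResponse{
				Response: &extprocv3.CommonResponse{
					HeaderMutation: &extprocv3.HeaderMutation{
						SetHeaders: []*corev3.HeaderValueOption{
							set("x-vsr-selected-category", category),
							set("x-vsr-selected-model", model),
						},
					},
				},
			},
		},
	}
}
```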
Key Function Call Chain:
main() [cmd/main.go]
→ NewRouter() [router/router.go]
→ router.Process() [router/extproc.go]
→ ProcessRequestBody() [router/extproc.go]
→ classifier.EvaluateAllSignals() [classification/classifier.go]
→ ClassifyDomain() [via CGO to ffi_classify_domain_modernbert]
→ ClassifyPII() [via CGO to ffi_classify_pii_modernbert]
→ MatchKeywords() [classification/keyword.go]
→ FindSimilar() [classification/embedding.go]
→ engine.EvaluateDecisions() [engine/decision.go]
→ EvaluateRules() [engine/rules.go]
→ SelectBestModel() [engine/selection.go]
→ plugins.ApplyChain() [plugins/manager.go]
→ SemanticCachePlugin.FindSimilar() [plugins/semantic_cache.go]
→ JailbreakPlugin.Check() [plugins/jailbreak.go]
→ PIIPlugin.Check() [plugins/pii.go]
→ ProcessResponseBody() [router/extproc.go]
→ HallucinationPlugin.Detect() [plugins/hallucination.go]
→ cache.Set() [cache interface]
Sources: Diagram 2 from high-level architecture, src/vllm-sr/supervisord.conf10-20 README.md52-128
Key Code Paths:
- Envoy's ext_proc filter calls the ProcessingRequest gRPC method
- Classifier.EvaluateAllSignals() in src/semantic-router/classification/classifier.go extracts signals
- CGO bindings in candle-binding/semantic-router.go call Rust functions in ffi/classify.rs
- EvaluateDecisionsWithSignals() in src/semantic-router/engine/decision.go selects the route
- Cache backends live in src/semantic-router/cache/ (HNSW, Milvus, Redis)

Sources: Diagram 2 from high-level architecture, src/vllm-sr/supervisord.conf10-20
vLLM Semantic Router is built using a multi-language architecture optimized for different concerns:
| Language | Components | Why |
|---|---|---|
| Go | Router core, servers, decision engine, cache, observability | • High performance, low latency • Excellent concurrency primitives • Rich networking libraries • Production-ready HTTP/gRPC |
| Rust | ML models, embeddings, classification | • Zero-cost abstractions • Memory safety without GC • Candle framework for ML • CUDA/Flash Attention support |
| CGO | Go ↔ Rust boundary | • Zero-copy FFI • Thread-safe initialization (OnceLock) • Minimal overhead |
| Python | CLI, config generation, orchestration | • User-friendly CLI interface • Easy config templating • Docker/subprocess management |
| JavaScript | Dashboard frontend | • React for interactive UI • Playground for testing • Observability integration |
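For the CGO row above, the following is a hedged sketch of what a Go-side wrapper over a Rust export can look like. The exported C symbol, its signature, the library name, and the linker flags are illustrative assumptions rather than the project's actual FFI contract, and the sketch copies the string for clarity instead of the zero-copy path the real bindings aim for.

```go
// Package candle wraps hypothetical Rust FFI exports built from the
// candle-binding crate. The symbol name and signature below are
// illustrative assumptions, not the project's real ABI.
package candle

/*
#cgo LDFLAGS: -L${SRCDIR}/target/release -lcandle_semantic_router
#include <stdlib.h>

// Assumed C export: returns a class index for the given UTF-8 text.
extern int ffi_classify_domain_modernbert(const char* text);
*/
import "C"

import "unsafe"

// ClassifyDomain passes a Go string across the FFI boundary and returns
// the predicted domain class index.
func ClassifyDomain(text string) int {
	ctext := C.CString(text)
	defer C.free(unsafe.Pointer(ctext)) // release the C copy of the string
	return int(C.ffi_classify_domain_modernbert(ctext))
}
```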
Key Design Patterns:
- Thread-safe, one-time model initialization via OnceLock in ffi/init.rs

Sources: Diagram 4 from high-level architecture, README.md58-122 src/vllm-sr/cli/docker_cli.py1-489
The router uses a layered configuration system:
| File | Purpose | Location |
|---|---|---|
| config.yaml | User configuration (decisions, models, endpoints) | User-provided, mounted to container |
| router-defaults.yaml | System defaults (inline models, external models, global settings) | Embedded in container at /app/.vllm-sr/ |
| envoy.yaml | Generated Envoy configuration | Generated by cli/config_generator.py |
Configuration can be loaded from:
- A local config.yaml file (the default), mounted into the container
- Kubernetes resources, when config_source: kubernetes is set

For details, see Configuration File Structure and Dynamic Configuration with Kubernetes.
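A hedged sketch of the layering itself (system defaults first, then user overrides) is shown below. The struct fields are hypothetical; the real config.yaml and router-defaults.yaml schemas are richer.

```go
package config

import (
	"os"

	"gopkg.in/yaml.v3"
)

// Config is a deliberately tiny, hypothetical slice of the real schema.
type Config struct {
	DefaultModel   string  `yaml:"default_model"`
	CacheThreshold float64 `yaml:"cache_threshold"`
	ConfigSource   string  `yaml:"config_source"` // e.g. "file" or "kubernetes"
}

// Load reads router-defaults.yaml first, then overlays the user's
// config.yaml on top, so user-provided values win over system defaults.
func Load(defaultsPath, userPath string) (Config, error) {
	var cfg Config
	for _, path := range []string{defaultsPath, userPath} {
		data, err := os.ReadFile(path)
		if err != nil {
			return cfg, err
		}
		// Unmarshalling into the same struct overwrites only the
		// fields present in the later file.
		if err := yaml.Unmarshal(data, &cfg); err != nil {
			return cfg, err
		}
	}
	return cfg, nil
}
```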
The system can be deployed in multiple ways:
Docker Compose Components:
From src/vllm-sr/supervisord.conf:
- start-router.sh
- start-dashboard.sh

Production Stack Components:
From deploy/stack/supervisord.stack.conf:
Sources: src/vllm-sr/supervisord.conf1-53 deploy/stack/supervisord.stack.conf1-145 website/docs/installation/k8s/dynamo.md1-869
| Model Type | Implementation | Purpose | Code Location |
|---|---|---|---|
| BERT Similarity | Traditional BERT | Intent matching via dot product | ffi/classify.rs |
| ModernBERT Classifiers | ModernBERT-base/large | Domain/PII/Jailbreak/Fact-check/Feedback detection | ffi/classify.rs |
| LoRA Adapters | Fine-tuned adapters | Intent/PII/Jailbreak with extensibility | ffi/classify.rs |
| Qwen3 Embeddings | Alibaba Qwen3 | 1024-dim embeddings, 32K context, high quality | ffi/embedding.rs |
| Gemma Embeddings | Google Gemma | 768-dim embeddings, 8K context, balanced | ffi/embedding.rs |
| Hallucination Detector | Token-level classifier | Identifies unsupported claims in LLM outputs | Used in TruthLens |
| NLI Explainer | Natural Language Inference | Provides explanations for hallucinated spans | Used in TruthLens |
The DualPathUnifiedClassifier in Rust intelligently routes between LoRA and Traditional models:
This enables extensibility via LoRA while maintaining backward compatibility.
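To illustrate the dual-path idea without reproducing the actual Rust implementation, here is a hedged Go sketch of preferring a LoRA adapter when one is registered for a task and falling back to the traditional classifier otherwise; all type and field names are hypothetical.

```go
package classify

// classifyFn stands in for either a LoRA-adapter or a traditional
// ModernBERT classification call.
type classifyFn func(text string) (label string, confidence float64)

// DualPathClassifier mirrors, in simplified form, the idea of routing
// between LoRA adapters and traditional classifiers per task.
type DualPathClassifier struct {
	lora        map[string]classifyFn // task -> LoRA-adapter path
	traditional map[string]classifyFn // task -> full-model path
}

// Classify prefers a registered LoRA adapter for the task and falls
// back to the traditional classifier, keeping existing tasks working
// while new ones are added via adapters.
func (d *DualPathClassifier) Classify(task, text string) (string, float64, bool) {
	if fn, ok := d.lora[task]; ok {
		label, conf := fn(text)
		return label, conf, true
	}
	if fn, ok := d.traditional[task]; ok {
		label, conf := fn(text)
		return label, conf, true
	}
	return "", 0, false // unknown task
}
```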
Sources: Diagram 4 from high-level architecture, website/docs/tutorials/content-safety/pii-detection.md16-36 website/docs/tutorials/content-safety/hallucination-detection.md16-23
| Endpoint | Port | Type | Content |
|---|---|---|---|
| /metrics | 9190 | Prometheus | Router metrics (decisions, cache hits, latency) |
| /metrics | 9090 | Prometheus | Prometheus server aggregation |
| Jaeger UI | 16686 | HTTP | Distributed traces |
| Grafana | 3000 | HTTP | Dashboards and visualizations |
| Dashboard | 8700 | HTTP | Embedded observability tools |
The router exposes metrics for monitoring:
- vsr_decisions_total, vsr_decision_duration_seconds
- vsr_cache_hits_total, vsr_cache_misses_total
- vsr_classification_duration_seconds
- pii_detections_total, pii_policy_violations_total
- truthlens_detections_total, truthlens_score

For detailed metrics documentation, see Metrics and Observability.
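A hedged sketch of exposing such counters with the standard Prometheus Go client follows. The metric names are taken from the list above; the label names, help strings, and package layout are assumptions.

```go
package observability

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counts routing decisions by selected model.
	decisionsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "vsr_decisions_total", Help: "Routing decisions made."},
		[]string{"model"},
	)
	// Cache hit/miss counters used to track semantic-cache effectiveness.
	cacheHitsTotal = prometheus.NewCounter(
		prometheus.CounterOpts{Name: "vsr_cache_hits_total", Help: "Semantic cache hits."},
	)
	cacheMissesTotal = prometheus.NewCounter(
		prometheus.CounterOpts{Name: "vsr_cache_misses_total", Help: "Semantic cache misses."},
	)
)

func init() {
	prometheus.MustRegister(decisionsTotal, cacheHitsTotal, cacheMissesTotal)
}

// Serve exposes the registered metrics on the router's metrics port.
func Serve() error {
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(":9190", nil)
}
```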
Sources: README.md130-136 Diagram 1 from high-level architecture
Installation:
Basic Usage:
What vllm-sr serve does:
- Honors the HF_ENDPOINT, HF_TOKEN, and HF_HOME env vars for Hugging Face model access
- Generates envoy.yaml from your config.yaml via EnvoyConfigGenerator.generate() in src/vllm-sr/cli/config_generator.py
- Starts supervisord managing:
  - router (ExtProc server :50051)
  - envoy (HTTP listener :8801)
  - dashboard (Web UI :8700)
  - llmkatan (Mock LLM for testing)
- Sets the container file-descriptor limit (VLLM_SR_NOFILE_LIMIT, default: 65536)

Container Runtime:
The CLI detects Docker or Podman automatically. Override:
Image Pull Policies:
For detailed setup, see Prerequisites and Requirements and Installation Methods. For Kubernetes, see Kubernetes Deployment.
Sources: README.md55-122 src/vllm-sr/README.md1-78 src/vllm-sr/cli/docker_cli.py28-149 src/vllm-sr/supervisord.conf1-53
Route different query types to specialized models:
Reduce API costs by caching similar queries:
A query like "What is machine learning?" can return cached results for "Define machine learning" or "Explain ML concepts."
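A minimal sketch of the lookup idea (embed the query, compare by cosine similarity, return a hit above a threshold) follows. The threshold value, type names, and linear scan are illustrative; the production backends use an HNSW index, Milvus, or Redis instead.

```go
package cache

import "math"

// entry pairs a stored response with the embedding of its query.
type entry struct {
	embedding []float64
	response  string
}

// SemanticCache is a toy in-memory cache; real backends replace the
// linear scan with an HNSW index, Milvus, or Redis.
type SemanticCache struct {
	threshold float64 // e.g. 0.85 cosine similarity
	entries   []entry
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Lookup returns a cached response when a previously seen query is
// semantically close enough, so "Define machine learning" can hit the
// entry stored for "What is machine learning?".
func (c *SemanticCache) Lookup(queryEmbedding []float64) (string, bool) {
	best, bestSim := "", 0.0
	for _, e := range c.entries {
		if sim := cosine(queryEmbedding, e.embedding); sim > bestSim {
			best, bestSim = e.response, sim
		}
	}
	if bestSim >= c.threshold {
		return best, true
	}
	return "", false
}

// Store saves a query embedding together with its model response.
func (c *SemanticCache) Store(queryEmbedding []float64, response string) {
	c.entries = append(c.entries, entry{embedding: queryEmbedding, response: response})
}
```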
Enforce PII policies per model:
Queries containing SSN, credit cards, or email addresses are automatically blocked for models without PII clearance.
Verify that LLM responses are grounded in retrieved documents:
The system detects and mitigates responses that fabricate information not present in the RAG context.
For more examples, see the Capabilities section in the full documentation.
Sources: website/sidebars.ts56-102 website/docs/tutorials/content-safety/hallucination-detection.md125-158
Sources: website/sidebars.ts1-153