vLLM Semantic Router v0.1 Iris: The First Major Release

· One min read
Xunzhuo Liu
Intelligent Routing @vLLM

We are thrilled to announce the release of vLLM Semantic Router v0.1, codename Iris—our first major release that marks a transformative milestone for intelligent LLM routing. Since our experimental launch in September 2025, we've witnessed extraordinary community growth: over 600 Pull Requests merged, 300+ Issues addressed, and contributions from more than 50 outstanding engineers worldwide.

In Greek mythology, Iris (Ἶρις) served as the divine messenger who bridged the realms of gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. This symbolism perfectly captures what vLLM Semantic Router v0.1 achieves: a bridge between users and diverse AI models, intelligently routing requests across different LLM providers and architectures.

Synced from official vLLM Blog: vLLM Semantic Router v0.1 Iris: The First Major Release


AMD × vLLM Semantic Router: Building the System Intelligence Together

· One min read
Xunzhuo Liu
Intelligent Routing @vLLM

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in how we think about AI system architecture.

AMD has been a long-term technology partner for the vLLM community, from accelerating the vLLM inference engine on AMD GPUs and ROCm™ Software to now co-building the next layer of the AI stack: intelligent routing and governance for Mixture-of-Models (MoM) systems.

Synced from official vLLM Blog: AMD × vLLM Semantic Router: Building the System Intelligence Together


Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

· One min read
Xunzhuo Liu
Intelligent Routing @vLLM
Huamin Chen
Distinguished Engineer @ Red Hat

Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right in front of them.

Building on our Signal-Decision Architecture, we introduce HaluGate—a conditional, token-level hallucination detection pipeline that catches unsupported claims before they reach your users. No LLM-as-judge. No Python runtime. Just fast, explainable verification at the point of delivery.
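The core idea can be illustrated with a heavily simplified sketch (this is not HaluGate's actual pipeline, which uses learned token-level classifiers rather than string matching): flag answer tokens whose factual content is not supported by the tool output the model was given.

```python
# Toy illustration of token-level grounding checks: flag numeric tokens
# in the answer that do not appear in the tool output (the ground truth).
# Real detectors use learned token-level classifiers, not string matching.
import re

def ungrounded_tokens(tool_output: str, answer: str) -> list[str]:
    """Return numeric tokens in the answer that the tool output
    does not support."""
    evidence = set(re.findall(r"[\w.]+", tool_output.lower()))
    suspects = []
    for tok in re.findall(r"[\w.]+", answer.lower()):
        if tok[0].isdigit() and tok not in evidence:
            suspects.append(tok)
    return suspects

tool_output = "AAPL closing price on 2025-11-03: 231.42 USD"
answer = "Apple closed at 245.10 USD on that day."
print(ungrounded_tokens(tool_output, answer))  # ['245.10']
```

The price 245.10 never appears in the tool output, so it is flagged; a production verifier would score each token's support with a trained model instead of exact matching.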

Synced from official vLLM Blog: Token-Level Truth: Real-Time Hallucination Detection for Production LLMs


Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale

· One min read
Xunzhuo Liu
Intelligent Routing @vLLM

The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where user queries are classified into one of 14 MMLU domain categories, and then routed to corresponding models. While this worked for basic scenarios, we quickly discovered its limitations when building production AI systems for enterprises.
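In outline, classification-based routing looks like the following sketch (a hypothetical illustration, not the router's actual code; the category and model names are invented, and the keyword classifier stands in for a fine-tuned model):

```python
# Sketch of classification-based routing: classify a query into a
# domain category, then look up a model for that category.
# Toy category -> model mapping; names are illustrative only.
CATEGORY_TO_MODEL = {
    "math": "model-math",
    "computer_science": "model-code",
    "law": "model-general",
}

def classify(query: str) -> str:
    """Stand-in for a fine-tuned classifier; keyword matching for demo."""
    q = query.lower()
    if any(w in q for w in ("integral", "derivative", "equation")):
        return "math"
    if any(w in q for w in ("python", "compile", "algorithm")):
        return "computer_science"
    return "law"

def route(query: str) -> str:
    return CATEGORY_TO_MODEL[classify(query)]

print(route("How do I compile a Python extension?"))  # model-code
```

The weakness is visible even in the sketch: every query must collapse into exactly one category, which discards context that enterprise routing decisions actually need.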

Synced from official vLLM Blog: Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale


Semantic Tool Selection: Building Smarter AI Agents with Context-Aware Routing

· 11 min read
Xunzhuo Liu
Intelligent Routing @vLLM
Huamin Chen
Distinguished Engineer @ Red Hat

Anthropic recently published an insightful blog post on code execution with MCP, highlighting a critical challenge in modern AI systems: as agents connect to more tools, loading all tool definitions upfront becomes increasingly inefficient. Their solution—using code execution to load tools on-demand—demonstrates how established software engineering patterns can dramatically improve agent efficiency.

This resonates deeply with our experience building the vLLM Semantic Router. We've observed the same problem from a different angle: when AI agents have access to hundreds or thousands of tools, how do they know which tools are relevant for a given task?

Our solution: semantic tool selection—using semantic similarity to automatically select the most relevant tools for each user query before the request even reaches the LLM.
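The mechanism can be sketched in a few lines (a hedged illustration, not the router's implementation: the tool names are invented, and a toy bag-of-words embedding stands in for a real sentence-embedding model):

```python
# Sketch of semantic tool selection: embed the query and each tool
# description, rank tools by cosine similarity, and pass only the
# top-k tool definitions to the LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system uses a learned model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

TOOLS = {
    "get_weather": "fetch the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "search_docs": "search internal documentation for a query",
}

def select_tools(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda t: cosine(q, embed(TOOLS[t])),
                    reverse=True)
    return ranked[:k]

print(select_tools("what is the weather forecast in Paris?", k=1))
```

With thousands of registered tools, only the top-k survivors are serialized into the prompt, so token cost stays flat as the tool catalog grows.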

From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA

· 9 min read
Ivar Flakstad
Machine Learning @ Hugging Face
OneZero-Y
LLM Inference
Huamin Chen
Distinguished Engineer @ Red Hat
Xunzhuo Liu
Intelligent Routing @vLLM

Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number of models. This post examines how a recent refactoring of the vLLM Semantic Router's Rust-based classification layer addresses this problem through architectural modularity, Low-Rank Adaptation (LoRA), and concurrency optimization.
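The arithmetic behind the LoRA approach can be sketched as follows (toy dimensions, not the router's real model sizes): instead of N fully fine-tuned copies of a base encoder, serve one shared base weight plus a low-rank delta B·A per task.

```python
# Why LoRA cuts the cost of serving many classifiers: one shared base
# weight W0 plus a rank-r adapter (B @ A) per task, instead of N full
# fine-tuned copies. Dimensions below are illustrative.
import numpy as np

d, r, n_tasks = 768, 8, 14          # hidden dim, LoRA rank, task count
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))    # shared base projection

full_params = n_tasks * d * d                 # N independent models
lora_params = d * d + n_tasks * 2 * d * r     # shared base + adapters
print(full_params, lora_params)               # adapters are far smaller

# Per-task forward pass: y = x @ (W0 + B @ A).T, computed as
# x @ W0.T + (x @ A.T) @ B.T so the low-rank path stays cheap.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))                # standard LoRA init: delta starts at 0
x = rng.standard_normal((1, d))
y = x @ W0.T + (x @ A.T) @ B.T
assert np.allclose(y, x @ W0.T)     # zero-init delta leaves base output intact
```

Because the base forward pass is shared, the marginal cost of each additional classification task is just the small adapter, which is what breaks the linear growth described above.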

Synced from the official vLLM Blog.

Background: From BERT to a Modular System

The previous implementation relied primarily on BERT and ModernBERT for intent and jailbreak classification. While ModernBERT performs well for English text classification tasks, it has the following limitations:

  • Language Coverage: The original ModernBERT's multilingual support is limited compared to models trained on more diverse datasets. (Note: mmBERT, a massively multilingual variant of ModernBERT supporting 1800+ languages, was released after this refactoring began and represents an alternative approach to the multilingual challenge)
  • Context Length: While ModernBERT extends context to 8,192 tokens using RoPE, models like Qwen3-Embedding support up to 32,768 tokens, which is beneficial for very long document processing
  • Model Coupling: Classification logic was tightly coupled to specific model architectures, making it difficult to add new models

These constraints motivated a broader refactoring that would enable the system to support multiple model types while maintaining performance. The modular architecture means that newer models like mmBERT can be integrated alongside Qwen3-Embedding and EmbeddingGemma, allowing the router to select the most appropriate model for each task.
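A hypothetical registry sketch shows what per-task model selection in a modular router could look like (the selection policy and the `ModelInfo` structure here are invented for illustration; only the context lengths come from the discussion above):

```python
# Sketch of per-task model selection: each embedding model advertises
# its capabilities, and the router picks the best fit for a request.
from dataclasses import dataclass

@dataclass
class ModelInfo:
    name: str
    max_tokens: int
    multilingual: bool

REGISTRY = [
    ModelInfo("ModernBERT", 8192, False),
    ModelInfo("Qwen3-Embedding", 32768, True),
]

def pick_model(input_tokens: int, needs_multilingual: bool) -> str:
    candidates = [
        m for m in REGISTRY
        if m.max_tokens >= input_tokens
        and (m.multilingual or not needs_multilingual)
    ]
    if not candidates:
        raise ValueError("no model satisfies the request")
    # Prefer the smallest sufficient context window.
    return min(candidates, key=lambda m: m.max_tokens).name

print(pick_model(4000, needs_multilingual=False))   # ModernBERT
print(pick_model(20000, needs_multilingual=True))   # Qwen3-Embedding
```

Adding a new model such as mmBERT then becomes a registry entry rather than a code change to the classification logic.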

Architectural Restructuring

Semantic Router Q4 2025 Roadmap: Journey to Iris

· 15 min read
Xunzhuo Liu
Intelligent Routing @vLLM
Huamin Chen
Distinguished Engineer @ Red Hat
Chen Wang
Senior Staff Research Scientist @ IBM
Yue Zhu
Staff Research Scientist @ IBM

As we approach the end of 2025, we're excited to share our Q4 2025 roadmap for vLLM Semantic Router. This quarter marks a significant milestone in our project's evolution as we prepare for our first major release: v0.1, codename "Iris", expected in late 2025 to early 2026.
