| Documentation | Rust SDK | Python SDK | Discord |
- Anthropic Messages API: mistralrs serve now exposes Anthropic-compatible /v1/messages and /v1/messages/count_tokens endpoints alongside the OpenAI-compatible /v1 API. Guide
- v0.8.2 CUDA performance: CUDA graphs, FlashInfer paged kernels, and MoE optimizations deliver strong results on GB10, B200, and H100 SXM. Benchmarks
- Agentic runtime: web search, local Python code execution with model feedback, session management, and custom tool hooks. Guide
- Gemma 4: fully multimodal, with text, image, video, and audio input. Guide | Video setup
- MXFP4 ISQ quantization: MXFP4 with optimized decode kernels for faster, smaller models. Quantization docs
v0.8.2 CUDA benchmarks
Mean tokens per second across prompt lengths and decode depths from 128 to 16384 tokens. Decode uses 256 generated tokens. See the full v0.8.2 report for commands, model revisions, host metadata, and appendix tables.
Q8 prefill TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0
| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 7395.7 | 3973.7 |
| Gemma 4 E4B | B200 | 27705.6 | 11992.4 |
| Gemma 4 E4B | H100 SXM | 26220.6 | 11702.1 |
| Gemma 4 26B-A4B | GB10 | 2947.0 | 2178.5 |
| Gemma 4 26B-A4B | B200 | 12725.3 | 8503.4 |
| Gemma 4 26B-A4B | H100 SXM | 12362.3 | 8055.1 |
Q8 decode TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0
| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 44.1 | 40.5 |
| Gemma 4 E4B | B200 | 241.4 | 194.4 |
| Gemma 4 E4B | H100 SXM | 223.1 | 183.0 |
| Gemma 4 26B-A4B | GB10 | 46.8 | 46.4 |
| Gemma 4 26B-A4B | B200 | 210.9 | 192.2 |
| Gemma 4 26B-A4B | H100 SXM | 199.8 | 183.9 |
BF16 prefill TPS: mistral.rs BF16 vs vLLM BF16
| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 5838.9 | 5812.9 |
| Gemma 4 E4B | B200 | 43547.8 | 39431.2 |
| Gemma 4 E4B | H100 SXM | 35852.2 | 39293.7 |
| Gemma 4 26B-A4B | GB10 | 592.2 | 3878.6 |
| Gemma 4 26B-A4B | B200 | 3467.3 | 28532.8 |
| Gemma 4 26B-A4B | H100 SXM | 2766.0 | 26295.9 |
BF16 decode TPS: mistral.rs BF16 vs vLLM BF16
| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 25.1 | 18.8 |
| Gemma 4 E4B | B200 | 202.6 | 196.2 |
| Gemma 4 E4B | H100 SXM | 174.4 | 153.0 |
| Gemma 4 26B-A4B | GB10 | 26.9 | 23.2 |
| Gemma 4 26B-A4B | B200 | 159.6 | 220.2 |
| Gemma 4 26B-A4B | H100 SXM | 138.7 | 148.0 |
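The tables list absolute throughput; a relative view can make them easier to compare. The following is a minimal sketch that copies a few rows from the prefill tables above and computes the ratio of mistral.rs to the baseline engine. The helper names (`speedups`, `report`) are illustrative and not part of any mistral.rs tooling.

```python
# Mean tokens/s copied from the prefill tables above:
# (model, hardware, mistral.rs TPS, baseline TPS).
Q8_PREFILL_VS_LLAMA_CPP = [
    ("Gemma 4 E4B", "GB10", 7395.7, 3973.7),
    ("Gemma 4 E4B", "B200", 27705.6, 11992.4),
    ("Gemma 4 26B-A4B", "H100 SXM", 12362.3, 8055.1),
]
BF16_PREFILL_VS_VLLM = [
    ("Gemma 4 E4B", "B200", 43547.8, 39431.2),
    ("Gemma 4 26B-A4B", "GB10", 592.2, 3878.6),
    ("Gemma 4 26B-A4B", "B200", 3467.3, 28532.8),
]


def speedups(rows):
    """Map (model, hardware) to mistral.rs TPS divided by baseline TPS."""
    return {(model, hw): round(ours / base, 2) for model, hw, ours, base in rows}


def report(title, rows):
    """Format the ratios as a small text report; values above 1 favor mistral.rs."""
    lines = [title]
    for (model, hw), ratio in speedups(rows).items():
        lines.append(f"  {model} on {hw}: {ratio:.2f}x")
    return "\n".join(lines)


print(report("Q8 prefill vs llama.cpp", Q8_PREFILL_VS_LLAMA_CPP))
print(report("BF16 prefill vs vLLM", BF16_PREFILL_VS_VLLM))
```

A ratio below 1 means the baseline was faster for that row, as with BF16 prefill on Gemma 4 26B-A4B.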
- Any Hugging Face model, zero config: Just mistralrs run -m user/model. Architecture, quantization format, and chat template are auto-detected.
- True multimodality: Text, vision, video, and audio input, plus speech generation, image generation, and embeddings in one engine.
- Smart quantization: --quant automatically selects the best quantization format for the requested level, using a prebuilt UQFF if one is published and otherwise applying ISQ. Docs
- OpenAI + Anthropic compatible serving: The same mistralrs serve process exposes OpenAI-compatible /v1 endpoints and Anthropic-compatible Messages endpoints.
- Prometheus metrics: mistralrs serve exposes a /metrics endpoint in Prometheus format, recording per-request counts and latency labeled by method, route, and status. Docs
- Built-in web UI: Served at /ui by default. Shows reasoning, code execution, plots, and files inline. Edit any message and the new branch runs with its own Python state. Pass --no-ui to disable.
- Hardware-aware: mistralrs tune benchmarks your system and picks optimal quantization + device mapping.
- Flexible SDKs: Python package and Rust crate to build your projects.
- Native agentic support: built-in agentic loop with web search, local Python code execution with model feedback, session management, and custom tool hooks.
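Because the /metrics endpoint mentioned in the list above uses the Prometheus text format, its output can be read with a few lines of standard-library Python. This is a rough sketch only: the metric name `http_requests_total` and the sample values are hypothetical placeholders, so check the metrics docs for the names the server actually emits.

```python
import re

# Hypothetical sample in Prometheus text exposition format.
# The real metric names exposed by the server may differ.
SAMPLE = '''\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="POST",route="/v1/chat/completions",status="200"} 12
http_requests_total{method="POST",route="/v1/messages",status="200"} 3
http_requests_total{method="GET",route="/metrics",status="200"} 5
'''

# metric_name{labels} value [optional timestamp]
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def requests_by_route(samples, metric="http_requests_total"):
    """Sum one counter's samples per value of the `route` label."""
    totals = {}
    for name, labels, value in samples:
        if name == metric:
            route = labels.get("route", "")
            totals[route] = totals.get(route, 0.0) + value
    return totals


print(requests_by_route(parse_metrics(SAMPLE)))
```

In practice the text would come from an HTTP GET against the server's /metrics path rather than from an inline string.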
Linux/macOS:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
Manual installation & other platforms
# Interactive chat
mistralrs run -m Qwen/Qwen3-4B
# One-shot prompt (no interactive session)
mistralrs run -m Qwen/Qwen3-4B -i "What is the capital of France?"
# One-shot with an image
mistralrs run -m google/gemma-4-E4B-it --image photo.jpg -i "Describe this image"
# Agentic REPL: search + code execution from the terminal
mistralrs run --agent -m Qwen/Qwen3-4B
# Start an API server with the built-in web UI
mistralrs serve -m google/gemma-4-E4B-it
For the server command, visit http://localhost:1234/ui for the web chat interface. OpenAI-compatible clients use http://localhost:1234/v1; Anthropic-compatible clients use http://localhost:1234.
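To make the two base URLs concrete, here is a minimal standard-library sketch that builds one request for each API style. Several details are assumptions rather than facts from this README: the /v1/chat/completions path follows the usual OpenAI convention, the request bodies follow the public OpenAI and Anthropic schemas, and the model name "default" is borrowed from the Python SDK example. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234"  # default address quoted above


def _post(url, body):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_openai_chat_request(prompt, model="default", max_tokens=256):
    """OpenAI-style chat completion; the path is assumed from the OpenAI convention."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return _post(f"{BASE_URL}/v1/chat/completions", body)


def build_anthropic_messages_request(prompt, model="default", max_tokens=256):
    """Anthropic-style Messages request against the /v1/messages endpoint."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return _post(f"{BASE_URL}/v1/messages", body)


def send(request, timeout=60):
    """Send a prepared request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `send(build_openai_chat_request("Hello!"))` should return a JSON object; in the standard OpenAI response shape the reply text sits under `choices[0].message.content`.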
The CLI is designed to be zero-config: just point it at a model and go.
- Auto-detection: Automatically detects model architecture, quantization format, and chat template
- All-in-one: Single binary for chat, server, benchmarks, and web UI (run, serve, bench)
- Hardware tuning: Run mistralrs tune to automatically benchmark and configure optimal settings for your hardware
- Format-agnostic: Works with Hugging Face models, GGUF files, and UQFF quantizations seamlessly
# Auto-tune for your hardware and emit a config file
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml
# Run using the generated config
mistralrs from-config -f config.toml
# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
mistralrs doctor
Performance
- Continuous batching support by default on all devices.
- CUDA with FlashAttention V2/V3, Metal, and multi-GPU/distributed inference
- PagedAttention for high-throughput continuous batching on CUDA or Apple Silicon, plus prefix caching (including multimodal)
Quantization (full docs)
- In-situ quantization (ISQ) of any Hugging Face model
- GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB support
- ⭐ Per-layer topology: Fine-tune quantization per layer for optimal quality/speed
- ⭐ Auto-select fastest quant method for your hardware
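As a rough guide to what different bit widths mean for memory, weight storage scales linearly with bits per weight. The sketch below is back-of-the-envelope arithmetic under stated assumptions: a nominal 4 billion parameters (a round number, not a measured model size), with no allowance for quantization scales, KV cache, or activations, so real footprints will be larger.

```python
def weight_bytes(num_params, bits_per_weight):
    """Idealized weight storage: parameters * bits / 8, ignoring all overhead."""
    return num_params * bits_per_weight / 8


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


NOMINAL_PARAMS = 4_000_000_000  # assumption: a nominal "4B" model

for bits in (16, 8, 4, 2):
    size = to_gib(weight_bytes(NOMINAL_PARAMS, bits))
    print(f"{bits:>2}-bit weights: about {size:.2f} GiB")
```

Halving the bit width halves the idealized weight footprint, which is why 4-bit quantization is a common starting point on memory-constrained hardware.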
Flexibility
- LoRA & X-LoRA with weight merging
- AnyMoE: Create mixture-of-experts on any base model
- Multiple models: Load/unload at runtime
Agentic Features
- Integrated tool calling with grammar enforcement and strict schema mode
- ⭐ Server-side agentic loop: auto-execute tools and feed results back
- ⭐ Python code execution: persistent Jupyter-like sessions with matplotlib capture and multimodal feedback
- ⭐ Web search integration with embedding-based ranking
- ⭐ Tool dispatch URL: POST tool calls to your own endpoint
- ⭐ MCP client: Connect to external tools via Process, HTTP, or WebSocket
- Python/Rust tool callbacks for custom execution
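The tool-calling items above boil down to two pieces: a schema that describes a tool to the model, and a callback that runs when the model calls it. Here is a minimal client-side sketch using the OpenAI-style tool format. The `get_weather` tool, the `dispatch` helper, and the placement of the `strict` flag are illustrative assumptions, not the mistral.rs API; see the agentic docs for the exact fields the server accepts.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool implementation; a real one would query a service."""
    return {"city": city, "forecast": "sunny"}


# OpenAI-style tool definition with a JSON Schema for the arguments.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
                "additionalProperties": False,
            },
            "strict": True,  # assumption: strict schema mode is requested this way
        },
    }
]

# Map tool names to local Python callbacks.
CALLBACKS = {"get_weather": get_weather}


def dispatch(tool_call):
    """Run the callback named in an OpenAI-style tool call and wrap the result
    as a `tool` message that can be appended to the conversation."""
    function = tool_call["function"]
    name = function["name"]
    if name not in CALLBACKS:
        raise KeyError(f"unknown tool: {name}")
    arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
    result = CALLBACKS[name](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

A client-managed loop would send `TOOLS` with the request, pass each returned tool call to `dispatch`, and append the resulting message before asking the model to continue; the server-side agentic loop listed above automates that cycle.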
Text Models
- Granite 4.0
- SmolLM 3
- DeepSeek V3
- GPT-OSS
- DeepSeek V2
- Qwen 3 Next
- Qwen 3 MoE
- Phi 3.5 MoE
- Qwen 3
- GLM 4
- GLM-4.7-Flash
- GLM-4.7 (MoE)
- Gemma 2
- Qwen 2
- Starcoder 2
- Phi 3
- Mixtral
- Phi 2
- Gemma
- Llama
- Mistral
Multimodal Models
- Qwen 3.5
- Qwen 3.5 MoE
- Qwen 3-VL
- Qwen 3-VL MoE
- Gemma 3n
- Llama 4
- Gemma 3
- Mistral 3
- Phi 4 multimodal
- Qwen 2.5-VL
- MiniCPM-O
- Llama 3.2 Vision
- Qwen 2-VL
- Idefics 3
- Idefics 2
- LLaVA Next
- LLaVA
- Phi 3V
Speech Models
- Voxtral (ASR/speech-to-text)
- Dia
Image Generation Models
- FLUX
Embedding Models
- Embedding Gemma
- Qwen 3 Embedding
Request a new model | Full compatibility tables
pip install mistralrs # or mistralrs-cuda, mistralrs-metal, mistralrs-mkl, mistralrs-accelerate

from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)
res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)

Python SDK | Installation | Examples | Cookbook
cargo add mistralrs

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, MultimodalModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = MultimodalModelBuilder::new("google/gemma-4-E4B-it")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Hello!",
    );

    let response = model.send_chat_request(messages).await?;
    println!("{:?}", response.choices[0].message.content);
    Ok(())
}

For quick containerized deployment:
docker pull ghcr.io/ericlbuehler/mistral.rs:latest
docker run --gpus all -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
  serve -m Qwen/Qwen3-4B
For production use, we recommend installing the CLI directly for maximum flexibility.
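A container may take a while to download and load the model before it accepts connections. One possible way to wait for it from a script is sketched below. It only checks that the published port (1234 in the command above) accepts TCP connections and says nothing about whether the model has finished loading. The `wait_for_port` helper is hypothetical.

```python
import socket
import time


def wait_for_port(host="localhost", port=1234, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass.

    Returns True once a connection is accepted, False if the deadline expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A deployment script could call `wait_for_port(timeout=300)` after `docker run` and only then start sending requests.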
For the complete reference, see the Documentation.
Quick Links:
- CLI Reference - All commands and options
- OpenAI-compatible APIs - OpenAI-compatible Chat Completions, Responses, tools, and media endpoints
- Anthropic Messages API - Anthropic-compatible Messages, streaming, tool use, and token counting
- HTTP API - OpenAI-compatible and Anthropic-compatible endpoints
- Quantization - ISQ, GGUF, GPTQ, and more
- Multi-GPU and Distributed - NCCL TP, P2P layer mapping, multi-node, and ring
- Device Mapping - Layer placement and CPU offloading
- MCP Integration - Connect to external tools via Process, HTTP, or WebSocket
- Troubleshooting - Common issues and solutions
- Configuration - Environment variables for configuration
Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.
This project would not be possible without the excellent work at Candle. Thank you to all contributors!
mistral.rs is not affiliated with Mistral AI.

