mistral.rs

Fast, flexible LLM inference.

| Documentation | Rust SDK | Python SDK | Discord |

Latest

  • Anthropic Messages API: mistralrs serve now exposes Anthropic-compatible /v1/messages and /v1/messages/count_tokens endpoints alongside the OpenAI-compatible /v1 API. Guide
  • v0.8.2 CUDA performance: CUDA graphs, FlashInfer paged kernels, and MoE optimizations deliver strong results on GB10, B200, and H100 SXM. Benchmarks
  • Agentic runtime: web search, local Python code execution with model feedback, session management, and custom tool hooks. Guide
  • Gemma 4: fully multimodal, with text, image, video, and audio input. Guide | Video setup
  • MXFP4 ISQ quantization: MXFP4 with optimized decode kernels for faster, smaller models. Quantization docs

Benchmarks

v0.8.2 CUDA benchmarks

Mean tokens per second across prompt lengths and decode depths from 128 to 16384 tokens. Decode uses 256 generated tokens. See the full v0.8.2 report for commands, model revisions, host metadata, and appendix tables.

Q8 prefill TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0

| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 7395.7 | 3973.7 |
| Gemma 4 E4B | B200 | 27705.6 | 11992.4 |
| Gemma 4 E4B | H100 SXM | 26220.6 | 11702.1 |
| Gemma 4 26B-A4B | GB10 | 2947.0 | 2178.5 |
| Gemma 4 26B-A4B | B200 | 12725.3 | 8503.4 |
| Gemma 4 26B-A4B | H100 SXM | 12362.3 | 8055.1 |

Q8 decode TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0

| Model | Hardware | mistral.rs | llama.cpp |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 44.1 | 40.5 |
| Gemma 4 E4B | B200 | 241.4 | 194.4 |
| Gemma 4 E4B | H100 SXM | 223.1 | 183.0 |
| Gemma 4 26B-A4B | GB10 | 46.8 | 46.4 |
| Gemma 4 26B-A4B | B200 | 210.9 | 192.2 |
| Gemma 4 26B-A4B | H100 SXM | 199.8 | 183.9 |

BF16 prefill TPS: mistral.rs BF16 vs vLLM BF16

| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 5838.9 | 5812.9 |
| Gemma 4 E4B | B200 | 43547.8 | 39431.2 |
| Gemma 4 E4B | H100 SXM | 35852.2 | 39293.7 |
| Gemma 4 26B-A4B | GB10 | 592.2 | 3878.6 |
| Gemma 4 26B-A4B | B200 | 3467.3 | 28532.8 |
| Gemma 4 26B-A4B | H100 SXM | 2766.0 | 26295.9 |

BF16 decode TPS: mistral.rs BF16 vs vLLM BF16

| Model | Hardware | mistral.rs | vLLM |
|---|---|---|---|
| Gemma 4 E4B | GB10 | 25.1 | 18.8 |
| Gemma 4 E4B | B200 | 202.6 | 196.2 |
| Gemma 4 E4B | H100 SXM | 174.4 | 153.0 |
| Gemma 4 26B-A4B | GB10 | 26.9 | 23.2 |
| Gemma 4 26B-A4B | B200 | 159.6 | 220.2 |
| Gemma 4 26B-A4B | H100 SXM | 138.7 | 148.0 |
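
The absolute numbers are easier to compare as ratios. The sketch below divides the mistral.rs figure by the baseline for a few rows copied from the tables above. The `speedup` helper and the row selection are illustrative only and are not part of the benchmark tooling.

```python
# Relative throughput: mistral.rs TPS divided by the baseline engine's TPS.
# All figures are copied from the v0.8.2 tables; `speedup` is an illustrative helper.


def speedup(mistralrs_tps: float, baseline_tps: float) -> float:
    """Return the ratio mistral.rs / baseline (>1 faster, <1 slower)."""
    return mistralrs_tps / baseline_tps


rows = [
    # (benchmark, model, hardware, mistral.rs TPS, baseline TPS)
    ("Q8 prefill vs llama.cpp", "Gemma 4 E4B", "GB10", 7395.7, 3973.7),
    ("Q8 decode vs llama.cpp", "Gemma 4 E4B", "B200", 241.4, 194.4),
    ("BF16 prefill vs vLLM", "Gemma 4 26B-A4B", "GB10", 592.2, 3878.6),
    ("BF16 decode vs vLLM", "Gemma 4 26B-A4B", "B200", 159.6, 220.2),
]

for bench, model, hardware, ours, theirs in rows:
    print(f"{bench:26} {model:16} {hardware:5} {speedup(ours, theirs):.2f}x")
```

A ratio below 1 means the baseline engine is faster on that row, which is the case for the two Gemma 4 26B-A4B BF16 rows selected here.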

Why mistral.rs?

  • Any Hugging Face model, zero config: Just mistralrs run -m user/model. Architecture, quantization format, and chat template are auto-detected.
  • True multimodality: Text, vision, video, and audio input, plus speech generation, image generation, and embeddings, all in one engine.
  • Smart quantization: --quant selects the best quantization format for the requested level, using a prebuilt UQFF if one is published and applying ISQ otherwise. Docs
  • OpenAI + Anthropic compatible serving: The same mistralrs serve process exposes OpenAI-compatible /v1 endpoints and Anthropic-compatible Messages endpoints.
  • Prometheus metrics: mistralrs serve exposes a /metrics endpoint in Prometheus format, recording per-request counts and latency labeled by method, route, and status. Docs
  • Built-in web UI: Served at /ui by default. Shows reasoning, code execution, plots, and files inline. Edit any message and the new branch runs with its own Python state. Pass --no-ui to disable.
  • Hardware-aware: mistralrs tune benchmarks your system and picks optimal quantization + device mapping.
  • Flexible SDKs: Python package and Rust crate to build your projects.
  • Native agentic support: built-in agentic loop with web search, local Python code execution with model feedback, session management, and custom tool hooks.
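
As a rough illustration of the Prometheus metrics bullet, the sketch below fetches /metrics and parses the standard Prometheus text exposition format into a dictionary. It assumes the server listens on localhost:1234 (the address used in Quick Start). The function names are hypothetical, and no specific metric names are assumed, since the exact series mistral.rs exports are documented elsewhere.

```python
import urllib.request

# Assumption: default address from Quick Start; adjust if the server runs elsewhere.
METRICS_URL = "http://localhost:1234/metrics"


def parse_samples(text: str) -> dict[str, float]:
    """Map each series (metric name plus labels) to its sample value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if rest:
            # rest[0] is the value; an optional timestamp may follow.
            samples[series] = float(rest[0])
    return samples


def fetch_metrics(url: str = METRICS_URL) -> dict[str, float]:
    """Scrape the endpoint once and return the parsed samples."""
    with urllib.request.urlopen(url) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

With mistralrs serve running, `fetch_metrics()` returns one entry per series. In practice a Prometheus server would scrape the endpoint directly, so a parser like this is mainly useful for quick checks.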

Quick Start

Install

Linux/macOS:

curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

Windows (PowerShell):

irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

Manual installation & other platforms

Run Your First Model

# Interactive chat
mistralrs run -m Qwen/Qwen3-4B

# One-shot prompt (no interactive session)
mistralrs run -m Qwen/Qwen3-4B -i "What is the capital of France?"

# One-shot with an image
mistralrs run -m google/gemma-4-E4B-it --image photo.jpg -i "Describe this image"

# Agentic REPL: search + code execution from the terminal
mistralrs run --agent -m Qwen/Qwen3-4B

# Start an API server with the built-in web UI
mistralrs serve -m google/gemma-4-E4B-it

For the server command, visit http://localhost:1234/ui for the web chat interface. OpenAI-compatible clients use http://localhost:1234/v1; Anthropic-compatible clients use http://localhost:1234.
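
A minimal sketch of talking to both API surfaces from Python, using only the standard library. It assumes the server is running on localhost:1234, that the chat route follows the usual OpenAI path /v1/chat/completions, and that the model name "default" (borrowed from the Python SDK example) is accepted. The helper names are hypothetical, and the response shapes follow the public OpenAI and Anthropic formats rather than anything confirmed here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234"  # default address mentioned above


def _post(path: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def openai_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """OpenAI-style chat completion request (route assumed from the OpenAI API)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return _post("/v1/chat/completions", body)


def anthropic_messages_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Anthropic-style Messages request; max_tokens is required by that format."""
    body = {
        "model": model,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    return _post("/v1/messages", body)


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def main() -> None:
    """Call this once `mistralrs serve` is running."""
    reply = send(openai_chat_request("What is the capital of France?"))
    print(reply["choices"][0]["message"]["content"])
    reply = send(anthropic_messages_request("What is the capital of France?"))
    print(reply["content"][0]["text"])
```

Existing OpenAI or Anthropic client libraries should work the same way once their base URL points at the addresses above.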

The mistralrs CLI

The CLI is designed to be zero-config: just point it at a model and go.

  • Auto-detection: Automatically detects model architecture, quantization format, and chat template
  • All-in-one: Single binary for chat, server, benchmarks, and web UI (run, serve, bench)
  • Hardware tuning: Run mistralrs tune to automatically benchmark and configure optimal settings for your hardware
  • Format-agnostic: Works with Hugging Face models, GGUF files, and UQFF quantizations seamlessly

# Auto-tune for your hardware and emit a config file
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml

# Run using the generated config
mistralrs from-config -f config.toml

# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
mistralrs doctor

Full CLI documentation

UI Demo

What Makes It Fast

Performance

Quantization (full docs)

  • In-situ quantization (ISQ) of any Hugging Face model
  • GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB support
  • Per-layer topology: Tune quantization layer by layer for the best quality/speed trade-off
  • ⭐ Auto-select fastest quant method for your hardware

Flexibility

Agentic Features

Full feature documentation

Supported Models

Text Models
  • Granite 4.0
  • SmolLM 3
  • DeepSeek V3
  • GPT-OSS
  • DeepSeek V2
  • Qwen 3 Next
  • Qwen 3 MoE
  • Phi 3.5 MoE
  • Qwen 3
  • GLM 4
  • GLM-4.7-Flash
  • GLM-4.7 (MoE)
  • Gemma 2
  • Qwen 2
  • Starcoder 2
  • Phi 3
  • Mixtral
  • Phi 2
  • Gemma
  • Llama
  • Mistral

Multimodal Models
  • Qwen 3.5
  • Qwen 3.5 MoE
  • Qwen 3-VL
  • Qwen 3-VL MoE
  • Gemma 3n
  • Llama 4
  • Gemma 3
  • Mistral 3
  • Phi 4 multimodal
  • Qwen 2.5-VL
  • MiniCPM-O
  • Llama 3.2 Vision
  • Qwen 2-VL
  • Idefics 3
  • Idefics 2
  • LLaVA Next
  • LLaVA
  • Phi 3V

Speech Models
  • Voxtral (ASR/speech-to-text)
  • Dia

Image Generation Models
  • FLUX

Embedding Models
  • Embedding Gemma
  • Qwen 3 Embedding

Request a new model | Full compatibility tables

Python SDK

pip install mistralrs  # or mistralrs-cuda, mistralrs-metal, mistralrs-mkl, mistralrs-accelerate

from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)

Python SDK | Installation | Examples | Cookbook

Rust SDK

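cargo add mistralrs
# The example below also depends on anyhow and tokio
cargo add anyhow
cargo add tokio --features full
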
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, MultimodalModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = MultimodalModelBuilder::new("google/gemma-4-E4B-it")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Hello!",
    );

    let response = model.send_chat_request(messages).await?;

    println!("{:?}", response.choices[0].message.content);

    Ok(())
}

API Docs | Crate | Examples

Docker

For quick containerized deployment:

docker pull ghcr.io/ericlbuehler/mistral.rs:latest
docker run --gpus all -p 1234:1234 ghcr.io/ericlbuehler/mistral.rs:latest \
  serve -m Qwen/Qwen3-4B

Docker images

For production use, we recommend installing the CLI directly for maximum flexibility.

Documentation

For complete documentation, see the Documentation.

Quick Links:

Contributing

Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.

Credits

This project would not be possible without the excellent work at Candle. Thank you to all contributors!

mistral.rs is not affiliated with Mistral AI.
