Load models 10x faster. Serve 10 models with 1 GPU.
Docs • Quick Start • OSDI'24 Paper
ServerlessLLM loads models 6-10x faster than SafeTensors, enabling true serverless deployment where multiple models efficiently share GPU resources.
| Model | Scenario | SafeTensors | ServerlessLLM | Speedup |
|---|---|---|---|---|
| Qwen/Qwen3-32B | Random | 20.6s | 3.2s | 6.40x |
| Qwen/Qwen3-32B | Cached | 12.5s | 1.3s | 9.95x |
| DeepSeek-R1-Distill-Qwen-32B | Random | 19.1s | 3.2s | 5.93x |
| DeepSeek-R1-Distill-Qwen-32B | Cached | 10.2s | 1.2s | 8.58x |
| Llama-3.1-8B-Instruct | Random | 4.4s | 0.7s | 6.54x |
Results obtained on NVIDIA H100 GPUs with NVMe SSD. "Random" simulates serverless multi-model serving; "Cached" shows repeated loading of the same model.
ServerlessLLM is a fast, low-cost system for deploying multiple AI models on shared GPUs, with three core innovations:
- ⚡ Ultra-Fast Checkpoint Loading: Custom storage format with O_DIRECT I/O loads models 6-10x faster than state-of-the-art checkpoint loaders
- 🔄 GPU Multiplexing: Multiple models share GPUs with fast switching and intelligent scheduling
- 🎯 Unified Inference + Fine-Tuning: Seamlessly integrates LLM serving with LoRA fine-tuning on shared resources
Result: Serve 10 models on 1 GPU, fine-tune on-demand, and serve a base model + 100s of LoRA adapters.
Don't have Docker? Jump to Use the Fast Loader in Your Code for a Docker-free example.
```bash
# Download the docker-compose.yml file
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml

# Set model storage location
export MODEL_FOLDER=/path/to/models

# Launch cluster (head node + worker with GPU)
docker compose up -d

# Wait for the cluster to be ready
docker logs -f sllm_head

# Deploy a model
docker exec sllm_head /opt/conda/envs/head/bin/sllm deploy --model Qwen/Qwen3-0.6B --backend transformers

# Query the model through the OpenAI-compatible endpoint
curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is ServerlessLLM?"}],
    "temperature": 0.7
  }'
```

That's it! Your model is now serving requests with an OpenAI-compatible API.
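Because the endpoint is OpenAI-compatible, the same request also works with the official `openai` Python client. A minimal sketch, assuming the client is installed and that no API key is enforced (so a placeholder key is passed):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local ServerlessLLM endpoint
client = OpenAI(base_url="http://127.0.0.1:8343/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is ServerlessLLM?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```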
Use ServerlessLLM Store standalone to speed up torch-based model loading.
```bash
pip install serverless-llm-store

# Save a model in the ServerlessLLM checkpoint format
sllm-store save --model Qwen/Qwen3-0.6B --backend transformers

# Start the store server first
sllm-store start --storage-path ./models --mem-pool-size 4GB
```

```python
from transformers import AutoTokenizer

from sllm_store.transformers import load_model

# Load model (6-10x faster than from_pretrained!)
model = load_model(
    "Qwen/Qwen3-0.6B",
    device_map="auto",
    torch_dtype="float16"
)

# Use as a normal PyTorch/Transformers model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("What is ServerlessLLM?", return_tensors="pt").to(model.device)
output = model.generate(**inputs)
```

How it works:
- Custom binary format optimized for sequential reads
- O_DIRECT I/O bypassing OS page cache
- Pinned memory pool for DMA-accelerated GPU transfers
- Parallel multi-threaded loading
- 6-10x faster than the SafeTensors checkpoint loader
- Supports both NVIDIA and AMD GPUs
- Works with vLLM, Transformers, and custom models
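To make the "How it works" list above concrete, here is a toy sketch of the core idea, not the actual ServerlessLLM implementation: read a checkpoint chunk with O_DIRECT into a pinned host buffer, then issue an asynchronous copy to the GPU. It assumes Linux, a CUDA build of PyTorch, and a chunk size that is a multiple of the 4 KiB block alignment O_DIRECT requires.

```python
import os
import torch

BLOCK = 4096  # O_DIRECT needs block-aligned sizes, offsets, and buffer addresses

def load_chunk_to_gpu(path: str, nbytes: int) -> torch.Tensor:
    assert nbytes % BLOCK == 0, "chunk size must be block-aligned for O_DIRECT"

    # Pinned (page-locked) host memory lets the GPU DMA directly from host RAM;
    # CUDA pinned allocations are page-aligned, which O_DIRECT also requires.
    host = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)

    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # bypass the OS page cache
    with os.fdopen(fd, "rb", buffering=0) as f:
        f.readinto(memoryview(host.numpy()))

    # Asynchronous host-to-device copy; several threads can run this in parallel
    # over different chunks to overlap disk reads with GPU transfers.
    return host.to("cuda", non_blocking=True)
```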
📖 Docs: Fast Loading Guide | ROCm Guide
- Run 10+ models on 1 GPU with fast switching
- Storage-aware scheduling minimizes loading time
- Auto-scale instances per model (scale to zero when idle)
- Live migration for zero-downtime resource optimization
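To make "storage-aware scheduling" concrete, here is a toy sketch, not the actual ServerlessLLM scheduler (see the OSDI'24 paper for that): among candidate servers, place the model where it can start serving soonest, given where its checkpoint currently lives. The bandwidth numbers are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    has_in_memory: bool        # checkpoint already cached in the host memory pool
    has_on_ssd: bool           # checkpoint on this server's local NVMe SSD
    ssd_gbps: float = 5.0      # placeholder NVMe read bandwidth (GB/s)
    network_gbps: float = 1.0  # placeholder bandwidth for fetching a remote copy

def estimated_load_seconds(s: Server, model_gb: float) -> float:
    if s.has_in_memory:
        return 0.5                      # only the host-to-GPU transfer remains
    if s.has_on_ssd:
        return model_gb / s.ssd_gbps    # load from local SSD
    return model_gb / s.network_gbps    # must fetch the checkpoint first

def place(model_gb: float, servers: list[Server]) -> Server:
    return min(servers, key=lambda s: estimated_load_seconds(s, model_gb))

servers = [
    Server("gpu-0", has_in_memory=False, has_on_ssd=True),
    Server("gpu-1", has_in_memory=True, has_on_ssd=True),
    Server("gpu-2", has_in_memory=False, has_on_ssd=False),
]
print(place(model_gb=16.0, servers=servers).name)  # -> gpu-1
```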
📖 Docs: Deployment Guide
- Integrates LLM serving with serverless LoRA fine-tuning
- Deploys fine-tuned adapters for inference on-demand
- Serves a base model + 100s of LoRA adapters efficiently
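For intuition about "a base model + 100s of LoRA adapters": an adapter is a small set of low-rank weights applied on top of a shared base model. A minimal PEFT sketch of that composition, not how ServerlessLLM manages adapters internally; the adapter path is a hypothetical placeholder.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# One shared base model ...
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# ... plus a lightweight LoRA adapter on top (placeholder path)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
```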
📖 Docs: Fine-Tuning Guide
- Deploy embedding models alongside LLMs
- Provides an OpenAI-compatible /v1/embeddings endpoint
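Since the endpoint follows the OpenAI embeddings API, the standard client works here as well. A minimal sketch, assuming the same base URL as the quick start above; the model name is a placeholder for whichever embedding model you have deployed.

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8343/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",  # placeholder embedding model
    input=["ServerlessLLM loads checkpoints 6-10x faster than SafeTensors."],
)
print(len(resp.data[0].embedding))  # embedding dimensionality
```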
💡 Example: RAG Example
- OpenAI-compatible API (drop-in replacement)
- Docker and Kubernetes deployment
- Multi-node clusters with distributed scheduling
📖 Docs: Deployment Guide | API Reference
- NVIDIA GPUs: Compute capability 7.0+ (V100, A100, H100, RTX 3060+)
- AMD GPUs: ROCm 6.2+ (MI100, MI200 series) - Experimental
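A quick way to check whether an NVIDIA card meets the compute-capability requirement above, assuming a CUDA build of PyTorch is installed:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}")
    print("meets the 7.0+ requirement" if (major, minor) >= (7, 0) else "below the 7.0+ requirement")
else:
    print("No CUDA GPU detected")
```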
More Examples: ./examples/
- Discord: Join our community - Get help, share ideas
- GitHub Issues: Report bugs
- WeChat: QR Code - Chinese language support
- Contributing: See CONTRIBUTING.md
Maintained by 10+ contributors worldwide. Community contributions are welcome!
If you use ServerlessLLM in your research, please cite our OSDI'24 paper:
```bibtex
@inproceedings{fu2024serverlessllm,
  title={ServerlessLLM: Low-Latency Serverless Inference for Large Language Models},
  author={Fu, Yao and Xue, Leyang and Huang, Yeqi and Brabete, Andrei-Octavian and Ustiugov, Dmitrii and Patel, Yuvraj and Mai, Luo},
  booktitle={OSDI'24},
  year={2024}
}
```

Apache 2.0 - See LICENSE
⭐ Star this repo if ServerlessLLM helps you!