Load models 10x faster. Serve 10 models with 1 GPU.
Docs • Quick Start • OSDI'24 Paper
ServerlessLLM loads models 6-10x faster than SafeTensors, enabling true serverless deployment where multiple models efficiently share GPU resources.
| Model | Scenario | SafeTensors | ServerlessLLM | Speedup |
|---|---|---|---|---|
| Qwen/Qwen3-32B | Random | 20.6s | 3.2s | 6.40x |
| Qwen/Qwen3-32B | Cached | 12.5s | 1.3s | 9.95x |
| DeepSeek-R1-Distill-Qwen-32B | Random | 19.1s | 3.2s | 5.93x |
| DeepSeek-R1-Distill-Qwen-32B | Cached | 10.2s | 1.2s | 8.58x |
| Llama-3.1-8B-Instruct | Random | 4.4s | 0.7s | 6.54x |
Results obtained on NVIDIA H100 GPUs with NVMe SSD. "Random" simulates serverless multi-model serving; "Cached" shows repeated loading of the same model.
ServerlessLLM is a fast, low-cost system for deploying multiple AI models on shared GPUs, with three core innovations:
- ⚡ Ultra-Fast Checkpoint Loading: Custom storage format with O_DIRECT I/O loads models 6-10x faster than state-of-the-art checkpoint loaders
- 🔄 GPU Multiplexing: Multiple models share GPUs with fast switching and intelligent scheduling
- 🎯 Unified Inference + Fine-Tuning: Seamlessly integrates LLM serving with LoRA fine-tuning on shared resources
Result: Serve 10 models on 1 GPU, fine-tune on-demand, and serve a base model + 100s of LoRA adapters.
Don't have Docker? Jump to Use the Fast Loader in Your Code for a Docker-free example.
```bash
# Download the docker-compose.yml file
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml

# Set model storage location
export MODEL_FOLDER=/path/to/models

# Launch cluster (head node + worker with GPU)
docker compose up -d

# Wait for the cluster to be ready
docker logs -f sllm_head

# Deploy a model
docker exec sllm_head /opt/conda/envs/head/bin/sllm deploy --model Qwen/Qwen3-0.6B --backend transformers

# Query the model through the OpenAI-compatible endpoint
curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is ServerlessLLM?"}],
    "temperature": 0.7
  }'
```

That's it! Your model is now serving requests with an OpenAI-compatible API.
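Because the endpoint is OpenAI-compatible, the same request also works with the official `openai` Python client. A minimal sketch, assuming the client is installed and that no API key is enforced (so a placeholder key is passed):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local ServerlessLLM endpoint
client = OpenAI(base_url="http://127.0.0.1:8343/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is ServerlessLLM?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```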
Use ServerlessLLM Store standalone to speed up torch-based model loading.
```bash
pip install serverless-llm-store

# Save a model in the ServerlessLLM checkpoint format
sllm-store save --model Qwen/Qwen3-0.6B --backend transformers

# Start the store server first
sllm-store start --storage-path ./models --mem-pool-size 4GB
```

```python
from transformers import AutoTokenizer

from sllm_store.transformers import load_model

# Load model (6-10x faster than from_pretrained!)
model = load_model(
    "Qwen/Qwen3-0.6B",
    device_map="auto",
    torch_dtype="float16"
)

# Use as a normal PyTorch/Transformers model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("What is ServerlessLLM?", return_tensors="pt").to(model.device)
output = model.generate(**inputs)
```

How it works:
- Custom binary format optimized for sequential reads
- O_DIRECT I/O bypassing OS page cache
- Pinned memory pool for DMA-accelerated GPU transfers
- Parallel multi-threaded loading
- 6-10x faster than the SafeTensors checkpoint loader
- Supports both NVIDIA and AMD GPUs
- Works with vLLM, Transformers, and custom models
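To make the "How it works" list above concrete, here is a toy sketch of the core idea, not the actual ServerlessLLM implementation: read a checkpoint chunk with O_DIRECT into a pinned host buffer, then issue an asynchronous copy to the GPU. It assumes Linux, a CUDA build of PyTorch, and a chunk size that is a multiple of the 4 KiB block alignment O_DIRECT requires.

```python
import os
import torch

BLOCK = 4096  # O_DIRECT needs block-aligned sizes, offsets, and buffer addresses

def load_chunk_to_gpu(path: str, nbytes: int) -> torch.Tensor:
    assert nbytes % BLOCK == 0, "chunk size must be block-aligned for O_DIRECT"

    # Pinned (page-locked) host memory lets the GPU DMA directly from host RAM;
    # CUDA pinned allocations are page-aligned, which O_DIRECT also requires.
    host = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)

    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # bypass the OS page cache
    with os.fdopen(fd, "rb", buffering=0) as f:
        f.readinto(memoryview(host.numpy()))

    # Asynchronous host-to-device copy; several threads can run this in parallel
    # over different chunks to overlap disk reads with GPU transfers.
    return host.to("cuda", non_blocking=True)
```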
📖 Docs: Fast Loading Guide | ROCm Guide
- Run 10+ models on 1 GPU with fast switching
- Storage-aware scheduling minimizes loading time
- Auto-scale instances per model (scale to zero when idle)
- Live migration for zero-downtime resource optimization
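To make "storage-aware scheduling" concrete, here is a toy sketch, not the actual ServerlessLLM scheduler (see the OSDI'24 paper for that): among candidate servers, place the model where it can start serving soonest, given where its checkpoint currently lives. The bandwidth numbers are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    has_in_memory: bool        # checkpoint already cached in the host memory pool
    has_on_ssd: bool           # checkpoint on this server's local NVMe SSD
    ssd_gbps: float = 5.0      # placeholder NVMe read bandwidth (GB/s)
    network_gbps: float = 1.0  # placeholder bandwidth for fetching a remote copy

def estimated_load_seconds(s: Server, model_gb: float) -> float:
    if s.has_in_memory:
        return 0.5                      # only the host-to-GPU transfer remains
    if s.has_on_ssd:
        return model_gb / s.ssd_gbps    # load from local SSD
    return model_gb / s.network_gbps    # must fetch the checkpoint first

def place(model_gb: float, servers: list[Server]) -> Server:
    return min(servers, key=lambda s: estimated_load_seconds(s, model_gb))

servers = [
    Server("gpu-0", has_in_memory=False, has_on_ssd=True),
    Server("gpu-1", has_in_memory=True, has_on_ssd=True),
    Server("gpu-2", has_in_memory=False, has_on_ssd=False),
]
print(place(model_gb=16.0, servers=servers).name)  # -> gpu-1
```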
📖 Docs: Deployment Guide
- Integrates LLM serving with serverless LoRA fine-tuning
- Deploys fine-tuned adapters for inference on-demand
- Serves a base model + 100s of LoRA adapters efficiently
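For intuition about "a base model + 100s of LoRA adapters": an adapter is a small set of low-rank weights applied on top of a shared base model. A minimal PEFT sketch of that composition, not how ServerlessLLM manages adapters internally; the adapter path is a hypothetical placeholder.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# One shared base model ...
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# ... plus a lightweight LoRA adapter on top (placeholder path)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
```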
📖 Docs: Fine-Tuning Guide
- Deploy embedding models alongside LLMs
- Provides an OpenAI-compatible /v1/embeddings endpoint
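Since the endpoint follows the OpenAI embeddings API, the standard client works here as well. A minimal sketch, assuming the same base URL as the quick start above; the model name is a placeholder for whichever embedding model you have deployed.

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8343/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",  # placeholder embedding model
    input=["ServerlessLLM loads checkpoints 6-10x faster than SafeTensors."],
)
print(len(resp.data[0].embedding))  # embedding dimensionality
```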
💡 Example: RAG Example
- OpenAI-compatible API (drop-in replacement)
- Docker and Kubernetes deployment
- Multi-node clusters with distributed scheduling
📖 Docs: Deployment Guide | API Reference
- NVIDIA GPUs: Compute capability 7.0+ (V100, A100, H100, RTX 3060+)
- AMD GPUs: ROCm 6.2+ (MI100, MI200 series) - Experimental
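A quick way to check whether an NVIDIA card meets the compute-capability requirement above, assuming a CUDA build of PyTorch is installed:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}")
    print("meets the 7.0+ requirement" if (major, minor) >= (7, 0) else "below the 7.0+ requirement")
else:
    print("No CUDA GPU detected")
```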
More Examples: ./examples/
- Discord: Join our community - Get help, share ideas
- GitHub Issues: Report bugs
- WeChat: QR Code - Chinese language support
- Contributing: See CONTRIBUTING.md
Maintained by 10+ contributors worldwide. Community contributions are welcome!
If you use ServerlessLLM in your research, please cite our OSDI'24 paper:
```bibtex
@inproceedings{fu2024serverlessllm,
  title={ServerlessLLM: Low-Latency Serverless Inference for Large Language Models},
  author={Fu, Yao and Xue, Leyang and Huang, Yeqi and Brabete, Andrei-Octavian and Ustiugov, Dmitrii and Patel, Yuvraj and Mai, Luo},
  booktitle={OSDI'24},
  year={2024}
}
```

Apache 2.0 - See LICENSE
⭐ Star this repo if ServerlessLLM helps you!