ServerlessLLM

Load models 10x faster. Serve 10 models with 1 GPU.


Docs • Quick Start • OSDI'24 Paper


⚡ Performance

ServerlessLLM loads models 6-10x faster than SafeTensors, enabling true serverless deployment where multiple models efficiently share GPU resources.

Model                         Scenario   SafeTensors   ServerlessLLM   Speedup
Qwen/Qwen3-32B                Random     20.6s         3.2s            6.40x
Qwen/Qwen3-32B                Cached     12.5s         1.3s            9.95x
DeepSeek-R1-Distill-Qwen-32B  Random     19.1s         3.2s            5.93x
DeepSeek-R1-Distill-Qwen-32B  Cached     10.2s         1.2s            8.58x
Llama-3.1-8B-Instruct         Random     4.4s          0.7s            6.54x

Results obtained on NVIDIA H100 GPUs with NVMe SSD. "Random" simulates serverless multi-model serving; "Cached" shows repeated loading of the same model.

What is ServerlessLLM?

ServerlessLLM is a fast, low-cost system for deploying multiple AI models on shared GPUs, with three core innovations:

  1. ⚡ Ultra-Fast Checkpoint Loading: Custom storage format with O_DIRECT I/O loads models 6-10x faster than state-of-the-art checkpoint loaders
  2. 🔄 GPU Multiplexing: Multiple models share GPUs with fast switching and intelligent scheduling
  3. 🎯 Unified Inference + Fine-Tuning: Seamlessly integrates LLM serving with LoRA fine-tuning on shared resources

Result: Serve 10 models on 1 GPU, fine-tune on-demand, and serve a base model + 100s of LoRA adapters.


🚀 Quick Start (90 Seconds)

Start ServerlessLLM Cluster

Don't have Docker? Jump to Use the Fast Loader in Your Code for a Docker-free example.

# Download the docker-compose.yml file
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml

# Set model storage location
export MODEL_FOLDER=/path/to/models

# Launch cluster (head node + worker with GPU)
docker compose up -d

# Wait for the cluster to be ready
docker logs -f sllm_head

Deploy a Model

docker exec sllm_head /opt/conda/envs/head/bin/sllm deploy --model Qwen/Qwen3-0.6B --backend transformers

Query the Model

curl http://127.0.0.1:8343/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is ServerlessLLM?"}],
    "temperature": 0.7
  }'

That's it! Your model is now serving requests with an OpenAI-compatible API.
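
Because the API is OpenAI-compatible, you can also call it from Python with the official openai client by pointing base_url at the ServerlessLLM endpoint. This is a minimal sketch: it assumes the quick-start address above and that no API key is enforced.

from openai import OpenAI

# Point the standard OpenAI client at the ServerlessLLM head node
# (address from the quick start above; the api_key value is a placeholder).
client = OpenAI(base_url="http://127.0.0.1:8343/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What is ServerlessLLM?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)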


💡 Use the Fast Loader in Your Code

Use ServerlessLLM Store standalone to speed up torch-based model loading.

Install

pip install serverless-llm-store

Convert a Model

sllm-store save --model Qwen/Qwen3-0.6B --backend transformers

Start the Store Server

# Start the store server first
sllm-store start --storage-path ./models --mem-pool-size 4GB

Load it 6-10x Faster in Your Python Code

import torch
from transformers import AutoTokenizer
from sllm_store.transformers import load_model

# Load model (6-10x faster than from_pretrained!)
model = load_model(
    "Qwen/Qwen3-0.6B",
    device_map="auto",
    torch_dtype=torch.float16
)

# Use as a normal PyTorch/Transformers model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
inputs = tokenizer("What is ServerlessLLM?", return_tensors="pt").to(model.device)
output = model.generate(**inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))

How it works:

  • Custom binary format optimized for sequential reads
  • O_DIRECT I/O bypassing OS page cache
  • Pinned memory pool for DMA-accelerated GPU transfers (see the sketch below)
  • Parallel multi-threaded loading
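
To make the pinned-memory point concrete, here is a minimal PyTorch sketch of staging CPU tensors through page-locked buffers and issuing asynchronous host-to-device copies. It is illustrative only, not the ServerlessLLM implementation.

import torch

def stage_through_pinned_memory(cpu_tensors, device="cuda"):
    # Copy each tensor into a pinned (page-locked) buffer, then issue an
    # asynchronous DMA transfer to the GPU on a dedicated stream.
    stream = torch.cuda.Stream()
    gpu_tensors = []
    with torch.cuda.stream(stream):
        for t in cpu_tensors:
            pinned = t.pin_memory()                                   # page-locked staging buffer
            gpu_tensors.append(pinned.to(device, non_blocking=True))  # async H2D copy
    stream.synchronize()  # wait for all copies to land
    return gpu_tensors

In the real loader, this staging is combined with O_DIRECT reads of the custom binary format and multi-threaded chunking, which is what the store server provides.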

🎯 Key Features

⚡ Ultra-Fast Model Loading

  • 6-10x faster than the SafeTensors checkpoint loader
  • Supports both NVIDIA and AMD GPUs
  • Works with vLLM, Transformers, and custom models

📖 Docs: Fast Loading Guide | ROCm Guide


🔄 GPU Multiplexing

  • Run 10+ models on 1 GPU with fast switching
  • Storage-aware scheduling minimizes loading time
  • Auto-scale instances per model (scale to zero when idle)
  • Live migration for zero-downtime resource optimization

📖 Docs: Deployment Guide


🎯 Unified Inference + LoRA Fine-Tuning

  • Integrates LLM serving with serverless LoRA fine-tuning
  • Deploys fine-tuned adapters for inference on-demand
  • Serves a base model + 100s of LoRA adapters efficiently

📖 Docs: Fine-Tuning Guide


πŸ” Embedding Models for RAG

  • Deploy embedding models alongside LLMs
  • Provides an OpenAI-compatible /v1/embeddings endpoint

💡 Example: RAG Example
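
As a hedged sketch, an embeddings request through the OpenAI-compatible endpoint could look like the following. The address matches the quick start above, and the model name is only a placeholder for whatever embedding model you have deployed.

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8343/v1", api_key="not-needed")

# "BAAI/bge-small-en-v1.5" is a hypothetical example; deploy your own
# embedding model first and use its name here.
embedding = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input="ServerlessLLM loads checkpoints 6-10x faster than SafeTensors.",
)
print(len(embedding.data[0].embedding))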


🚀 Production-Ready

  • OpenAI-compatible API (drop-in replacement)
  • Docker and Kubernetes deployment
  • Multi-node clusters with distributed scheduling

📖 Docs: Deployment Guide | API Reference


💻 Supported Hardware

  • NVIDIA GPUs: Compute capability 7.0+ (V100, A100, H100, RTX 3060+)
  • AMD GPUs: ROCm 6.2+ (MI100, MI200 series) - Experimental

More Examples: ./examples/


🤝 Community

Maintained by 10+ contributors worldwide. Community contributions are welcome!


📄 Citation

If you use ServerlessLLM in your research, please cite our OSDI'24 paper:

@inproceedings{fu2024serverlessllm,
  title={ServerlessLLM: Low-Latency Serverless Inference for Large Language Models},
  author={Fu, Yao and Xue, Leyang and Huang, Yeqi and Brabete, Andrei-Octavian and Ustiugov, Dmitrii and Patel, Yuvraj and Mai, Luo},
  booktitle={OSDI'24},
  year={2024}
}

πŸ“ License

Apache 2.0 - See LICENSE


⭐ Star this repo if ServerlessLLM helps you!