Ultra Memory-Efficient
LLM Inference
Run 32B models on 24GB GPUs. 58.7 tok/s on 7B models. Single .zse file, no network calls.
- 7B cold start: ~9s
- 7B throughput: 58.7 tok/s
- 32B VRAM: 20.9 GB
- 32B throughput: 26.9 tok/s
Get Running in 3 Steps
From zero to serving models in under a minute
Install ZSE
One pip command to get started. No complex dependencies or configurations.
Convert to .zse
Convert any HuggingFace model to optimized .zse format with 11× faster loading.
Serve Your Model
Start the OpenAI-compatible API server with instant cold starts.
Query the API
Use the OpenAI-compatible API with any client library or framework.
Why ZSE?
ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.
bitsandbytes (Standard)
Every time you load a model:
- Download FP16 weights (14GB for 7B model)
- Quantize to INT4 (takes 40+ seconds)
- Finally ready to use
.zse Format (Pre-quantized)
With ZSE, you quantize once and load instantly:
- One-time: `zse quantize` → .zse file
- Every load: read pre-quantized weights (instant)
- Ready in seconds, not minutes
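The trade-off above can be sketched with back-of-the-envelope arithmetic, using the figures quoted in this section (≈40s re-quantization per bitsandbytes load, a one-time quantize plus the measured 9.1s load for .zse; the numbers are illustrative, not a benchmark):

```python
# Back-of-the-envelope cold-start comparison for a 7B model,
# using the illustrative figures quoted above.
BNB_QUANTIZE_S = 40.0   # bitsandbytes re-quantizes on *every* load
ZSE_QUANTIZE_S = 40.0   # one-time cost when creating the .zse file
ZSE_LOAD_S = 9.1        # measured .zse load time (see benchmarks)

def total_seconds(loads: int) -> tuple[float, float]:
    """Total time spent on cold starts over `loads` loads."""
    bnb = loads * BNB_QUANTIZE_S
    zse = ZSE_QUANTIZE_S + loads * ZSE_LOAD_S
    return bnb, zse

bnb, zse = total_seconds(100)
print(f"100 cold starts -> bitsandbytes: {bnb:.0f}s, .zse: {zse:.0f}s")
```

The per-load quantization cost dominates quickly: over 100 cold starts the one-time quantize is amortized away and .zse spends roughly a quarter of the time bitsandbytes does.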
ZSE vs llama.cpp (GGUF)
On Qwen 72B, ZSE loads in 6.5s versus 10.2s for llama.cpp GGUF, about 1.6× faster, while staying in the Python ecosystem.
When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you need fast cold starts with the Python/HuggingFace ecosystem.
Verified Benchmarks
.zse v1.2.0 performance with bnb.matmul_4bit. Tested on NVIDIA H200.
| Model | File Size | Load Time | VRAM | Throughput |
|---|---|---|---|---|
| Qwen 7B | 5.57 GB | 9.1s | 5.9 GB | 58.7 tok/s |
| Qwen 32B | 19.23 GB | 24.1s | 20.9 GB | 26.9 tok/s |
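The table's VRAM numbers can be put in context against FP16 weights. A rough sketch, taking the ~14 GB FP16 figure for 7B quoted earlier and estimating 32B FP16 at ~64 GB (2 bytes per parameter; an assumption, not a measured value):

```python
# Rough VRAM reduction vs. FP16 weights.
# FP16 sizes: 7B from the comparison section; 32B is a 2-bytes/param estimate.
fp16_gb = {"Qwen 7B": 14.0, "Qwen 32B": 64.0}
int4_vram_gb = {"Qwen 7B": 5.9, "Qwen 32B": 20.9}  # from the table above

for model, fp16 in fp16_gb.items():
    saved = 1 - int4_vram_gb[model] / fp16
    print(f"{model}: ~{saved:.0%} less VRAM than FP16")
```

Under these assumptions the 32B model needs roughly a third of its FP16 footprint, which is what lets it fit on a 24GB consumer GPU.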
32B Model Performance (v1.2.0)
Fits on 24GB consumer GPUs (RTX 3090/4090)
Built for Efficiency
Every feature designed for memory efficiency and fast cold starts
58.7 tok/s
Generate at 58.7 tokens/sec on 7B models. Real throughput using bitsandbytes CUDA kernels.
32B in 21GB VRAM
Run 32B models on 24GB consumer GPUs (RTX 3090/4090). True memory efficiency.
9s Cold Start
Load 7B models in 9 seconds. Single .zse file with embedded config and tokenizer.
OpenAI Compatible
Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.
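Because the server speaks the OpenAI wire format, any HTTP client works. A minimal stdlib-only sketch, assuming `zse serve` is running on localhost:8000 with the default model name used in the quickstart:

```python
import json
from urllib import request

# Build the same request as the curl quickstart example.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up:
# with request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

The official `openai` Python SDK should work the same way by pointing its `base_url` at `http://localhost:8000/v1`, since the endpoint paths and payloads match the OpenAI API.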
Perfect For
From local development to production deployments
Serverless Inference
Cold starts measured in seconds, not minutes, make ZSE a strong fit for serverless deployments, where startup time directly costs money.
Local AI Development
Run large models on your laptop. Test and iterate without cloud costs or API rate limits.
Edge Deployment
Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.
Cost Optimization
Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.
Verified Models
Tested and optimized for ZSE. VRAM shown for INT4 quantization.
| Model | Provider | Category | VRAM (INT4) | .zse Ready |
|---|---|---|---|---|
| Qwen 2.5 7B | Alibaba | Chat/Code | 5.9 GB | ✓ |
| Qwen 2.5 32B | Alibaba | Chat/Code | 20.9 GB | ✓ |
| Mistral 7B v0.3 | Mistral AI | Chat | ~6 GB | — |
| DeepSeek Coder 6.7B | DeepSeek | Code | ~5.5 GB | — |
| Llama 3.2 3B | Meta | Chat | ~2.5 GB | — |
| Gemma 2 9B | Google | Reasoning | ~7 GB | — |
| Phi-3 Mini | Microsoft | Reasoning | ~3 GB | — |
| TinyLlama 1.1B | TinyLlama | Testing | ~1 GB | — |
Simple, Powerful API
Start serving models with just a few lines
# Install ZSE
$ pip install zllm-zse
# Convert to optimized .zse format (11× faster loading)
$ zse quantize Qwen/Qwen2.5-7B-Instruct -o ./model.zse
# Serve your model with instant cold starts
$ zse serve ./model.zse --port 8000
# OpenAI-compatible API is ready!
$ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'

Ready to Try ZSE?
Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.