Ultra memory-efficient LLM inference engine with native INT4 CUDA kernels.
Run 32B models on 24GB GPUs. Run 7B models on 8GB GPUs. Fast cold starts, single-file deployment.
Train 7B models on 8GB GPUs. Train 70B models on 48GB GPUs.
from zse.format import load_zse_model
from zse.training import LoRAConfig, add_lora_to_model, save_lora_adapter
# Load INT4 model (uses ~6GB for 7B)
model, tokenizer, info = load_zse_model("model.zse", device="cuda")
# Add LoRA adapters (~1% trainable params)
config = LoRAConfig(rank=16, alpha=32)
model = add_lora_to_model(model, config)
# Train as usual with PyTorch/HuggingFace
# ... your training loop ...
# Save adapter (tiny file, ~25MB for 7B)
save_lora_adapter(model, "my_adapter.safetensors")Install training dependencies: pip install zllm-zse[training]
| Model | File Size | VRAM | Speed | Cold Start | GPU |
|---|---|---|---|---|---|
| Qwen 7B | 5.57 GB | 5.67 GB | 37.2 tok/s | 5.7s | H200 |
| Qwen 14B | 9.95 GB | 10.08 GB | 20.8 tok/s | 10.5s | H200 |
| Qwen 32B | 19.23 GB | 19.47 GB | 10.9 tok/s | 20.4s | H200 |
| Qwen 72B | 41.21 GB | 41.54 GB | 6.3 tok/s | 51.8s | H200 |
| Model | VRAM | Speed | Cold Start |
|---|---|---|---|
| Qwen 7B | 6.57 GB | 45.6 tok/s | 6.0s |
| Qwen 14B | 11.39 GB | 27.6 tok/s | 7.1s |
| Qwen 32B | 22.27 GB | 20.4 tok/s | 20.8s |
| Qwen 72B | 47.05 GB | 16.4 tok/s | 53.0s |
| Model | Custom Kernel | bnb Backend | Savings |
|---|---|---|---|
| 7B | 5.67 GB | 6.57 GB | -0.90 GB (14%) |
| 14B | 10.08 GB | 11.39 GB | -1.31 GB (12%) |
| 32B | 19.47 GB | 22.27 GB | -2.80 GB (13%) |
| 72B | 41.54 GB | 47.05 GB | -5.51 GB (12%) |
| GPU | VRAM | Max Model |
|---|---|---|
| RTX 3070/4070 | 8GB | 7B |
| RTX 3080 | 12GB | 14B |
| RTX 3090/4090 | 24GB | 32B |
| A100-40GB | 40GB | 32B |
| A100-80GB / H200 | 80-141GB | 72B |
- π¦ Single .zse File: Model + tokenizer + config in one file
- π« No Network Calls: Everything embedded, works offline
- β‘ ZSE Custom Kernel: Native INT4 inference with maximum VRAM efficiency
- π§ Memory Efficient: 72B in 41GB, 32B in 19GB, 7B in 5.7GB VRAM
- π Fast Cold Start: 5.7s for 7B, 20s for 32B, 52s for 72B
- π― Dual Backend: Custom kernel (default) or bnb backend (alternate)
- π₯ QLoRA Training: Fine-tune INT4 models with LoRA adapters (NEW in v1.4.0)
pip install zllm-zseRequirements:
- Python 3.11+
- CUDA GPU (8GB+ VRAM recommended)
- bitsandbytes (auto-installed)
# Convert any HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen7b.zse
zse convert Qwen/Qwen2.5-32B-Instruct -o qwen32b.zse
# Or in Python
from zse.format.writer import convert_model
convert_model("Qwen/Qwen2.5-7B-Instruct", "qwen7b.zse", quantization="int4")from zse.format.reader_v2 import load_zse_model
# Load model (auto-detects optimal settings)
model, tokenizer, info = load_zse_model("qwen7b.zse")
# Generate
inputs = tokenizer("Write a poem about AI:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))zse serve qwen7b.zse --port 8000import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
model="qwen7b",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)| Feature | HuggingFace | .zse Format |
|---|---|---|
| Cold start (7B) | 45s | 9s |
| Cold start (32B) | 120s | 24s |
| Network calls on load | Yes | No |
| Files to manage | Many | One |
| Quantization time | Runtime | Pre-done |
Train any model with QLoRA - LoRA adapters on quantized INT4 base models.
# Install training dependencies
pip install zllm-zse[training]from zse.format import load_zse_model, convert_model
from zse.training import (
LoRAConfig,
add_lora_to_model,
save_lora_adapter,
load_lora_adapter
)
import torch
# 1. Convert model to .zse (one-time)
convert_model("meta-llama/Llama-3-8B", "llama8b.zse", quantization="int4")
# 2. Load INT4 model
model, tokenizer, info = load_zse_model("llama8b.zse", device="cuda")
# 3. Add LoRA adapters
config = LoRAConfig(
rank=16, # LoRA rank (higher = more capacity)
alpha=32, # LoRA alpha (scaling factor)
dropout=0.05, # Dropout for regularization
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"] # Which layers
)
model = add_lora_to_model(model, config)
# 4. Train with standard PyTorch
optimizer = torch.optim.AdamW(
[p for p in model.parameters() if p.requires_grad],
lr=2e-4
)
for batch in dataloader:
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
# 5. Save adapter (~25MB for 7B model)
save_lora_adapter(model, "my_adapter.safetensors")
# 6. Load adapter for inference
model, tokenizer, info = load_zse_model("llama8b.zse", lora="my_adapter.safetensors")QLoRA VRAM Usage:
| Model | Base VRAM | + LoRA Training | Trainable Params |
|---|---|---|---|
| 7B | 6 GB | ~8 GB | 12M (0.2%) |
| 14B | 11 GB | ~14 GB | 25M (0.2%) |
| 32B | 20 GB | ~26 GB | 50M (0.2%) |
| 70B | 42 GB | ~52 GB | 100M (0.1%) |
# Auto (default): Detect VRAM, pick optimal strategy
model, tok, info = load_zse_model("qwen7b.zse", cache_weights="auto")
# Force bnb mode (low VRAM, fast inference)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=False)
# Force FP16 cache (max speed, high VRAM)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=True)# Full benchmark
python3 -c "
import time, torch
from zse.format.reader_v2 import load_zse_model
t0 = time.time()
model, tokenizer, info = load_zse_model('qwen7b.zse')
print(f'Load: {time.time()-t0:.1f}s, VRAM: {torch.cuda.memory_allocated()/1e9:.1f}GB')
inputs = tokenizer('Hello', return_tensors='pt').to('cuda')
model.generate(**inputs, max_new_tokens=10) # Warmup
prompt = 'Write a detailed essay about AI.'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()
tokens = out.shape[1] - inputs['input_ids'].shape[1]
print(f'{tokens} tokens in {time.time()-t0:.2f}s = {tokens/(time.time()-t0):.1f} tok/s')
"# Convert model
zse convert <model_id> -o output.zse
# Start server
zse serve <model.zse> --port 8000
# Interactive chat
zse chat <model.zse>
# Show model info
zse info <model.zse>
# Check hardware
zse hardware- Conversion: Quantize HF model to INT4, pack weights, embed tokenizer + config
- Loading: Memory-map .zse file, load INT4 weights directly to GPU
- Inference: ZSE custom kernel (default) or bnb backend for matmul
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β HuggingFace ββββββΆβ .zse File ββββββΆβ GPU Model β
β Model (FP16) β β (INT4 + tok) β β (ZSE Kernel) β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
One-time Single file 12% less VRAM
conversion ~0.5 bytes/param vs bitsandbytes
Run local models with OpenClaw - the 24/7 AI assistant by @steipete.
# Start ZSE server
zse serve <model-name> --port 8000
# Configure OpenClaw to use local ZSE
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=zseOr in OpenClaw's config.yaml:
llm:
provider: openai-compatible
api_base: http://localhost:8000/v1
api_key: zse
model: defaultBenefits: 100% private, zero API costs, works offline, run ANY model.
# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest
# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu
# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latestSee deploy/DEPLOY.md for full deployment guide including Runpod, Vast.ai, Railway, Render, and Kubernetes.
Apache 2.0
- Website: zllm.in
- Company: Zyora Labs
- Email: [email protected]
Made with β€οΈ by Zyora Labs