Ultra Memory-Efficient
LLM Inference

Run 32B models on 24GB GPUs. 58.7 tok/s on 7B. Single .zse file, no network calls.

$ pip install zllm-zse

7B Cold Start: 9.1s
tok/s (7B): 58.7
32B VRAM: 20.9 GB
tok/s (32B): 26.9

Get Running in 4 Steps

From zero to serving models in under a minute

01

Install ZSE

One pip command to get started. No complex dependencies or configurations.

02

Convert to .zse

Convert any HuggingFace model to optimized .zse format with 11× faster loading.

03

Serve Your Model

Start the OpenAI-compatible API server with instant cold starts.

04

Query the API

Use the OpenAI-compatible API with any client library or framework.

$ pip install zllm-zse

Why ZSE?

ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.

bitsandbytes (Standard)

Every time you load a model:

  1. Download FP16 weights (14GB for 7B model)
  2. Quantize to INT4 (takes 40+ seconds)
  3. Finally ready to use
Qwen 7B Load Time: 45.4s

.zse Format (Pre-quantized)

With ZSE, you quantize once, load instantly:

  1. One-time: zse quantize → .zse file
  2. Every load: Read pre-quantized weights (instant)
  3. Ready in seconds, not minutes
Qwen 7B Load Time: 3.9s
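These two load times are also the source of the "11× faster loading" figure quoted in the quick-start steps; a quick arithmetic check using only the numbers above:

```python
# Qwen 7B load times quoted above (seconds)
bnb_load_s = 45.4  # bitsandbytes: re-quantizes FP16 weights on every load
zse_load_s = 3.9   # .zse: reads pre-quantized weights directly

speedup = bnb_load_s / zse_load_s
print(f"{speedup:.1f}x faster loading")  # → 11.6x faster loading
```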

ZSE vs llama.cpp (GGUF)

On Qwen 72B, ZSE loads in 6.5s vs 10.2s for llama.cpp GGUF, about 1.6× faster, while staying in the Python/HuggingFace ecosystem.

When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you need fast cold starts with the Python/HuggingFace ecosystem.

Verified Benchmarks

.zse v1.2.0 performance with bnb.matmul_4bit. Tested on NVIDIA H200.

Model    | File Size | Load Time | VRAM    | Throughput
Qwen 7B  | 5.57 GB   | 9.1s      | 5.9 GB  | 58.7 tok/s
Qwen 32B | 19.23 GB  | 24.1s     | 20.9 GB | 26.9 tok/s

32B Model Performance (v1.2.0)

Fits on 24GB consumer GPUs (RTX 3090/4090)

File Size: 19.23 GB
VRAM Usage: 20.9 GB
Throughput: 26.9 tok/s

Built for Efficiency

Every feature designed for memory efficiency and fast cold starts

58.7 tok/s

Generate at 58.7 tokens/sec on 7B models. Real throughput using bitsandbytes CUDA kernels.


32B in 21GB VRAM

Run 32B models on 24GB consumer GPUs (RTX 3090/4090). True memory efficiency.


9s Cold Start

Load 7B models in 9 seconds. Single .zse file with embedded config and tokenizer.


OpenAI Compatible

Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.

Perfect For

From local development to production deployments

Serverless Inference

Sub-5s cold starts make ZSE perfect for serverless deployments where every millisecond of startup time costs money.

Local AI Development

Run large models on your laptop. Test and iterate without cloud costs or API rate limits.

Edge Deployment

Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.

Cost Optimization

Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.

Verified Models

Tested and optimized for ZSE. VRAM shown for INT4 quantization.

Model               | Provider   | Category  | VRAM (INT4) | .zse Ready
Qwen 2.5 7B         | Alibaba    | Chat/Code | 5.9 GB      | ✓
Qwen 2.5 32B        | Alibaba    | Chat/Code | 20.9 GB     | ✓
Mistral 7B v0.3     | Mistral AI | Chat      | ~6 GB       | ✓
DeepSeek Coder 6.7B | DeepSeek   | Code      | ~5.5 GB     | ✓
Llama 3.2 3B        | Meta       | Chat      | ~2.5 GB     | ✓
Gemma 2 9B          | Google     | Reasoning | ~7 GB       | ✓
Phi-3 Mini          | Microsoft  | Reasoning | ~3 GB       | ✓
TinyLlama 1.1B      | TinyLlama  | Testing   | ~1 GB       | ✓

Simple, Powerful API

Start serving models with just a few lines

# Install ZSE
$ pip install zllm-zse

# Convert to optimized .zse format (11× faster loading)
$ zse quantize Qwen/Qwen2.5-7B-Instruct -o ./model.zse

# Serve your model with instant cold starts
$ zse serve ./model.zse --port 8000

# OpenAI-compatible API is ready!
$ curl localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'
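The same request can be issued from Python with only the standard library. A minimal sketch, assuming the server from the steps above is listening on localhost:8000 and serves a model named "default"; the request is only built here, nothing is sent until you call urlopen:

```python
import json
import urllib.request


def chat_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a request mirroring the curl example above."""
    body = json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = chat_request("Hello!")
# With a running `zse serve` instance, send it like this:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI wire format, any OpenAI-compatible client (the OpenAI SDK, LangChain, etc.) pointed at the same base URL should work the same way.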

Ready to Try ZSE?

Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.

Apache 2.0 Licensed
Open Source
PyPI Published