Ultra Memory-Efficient
LLM Inference
Run 32B models on 24GB GPUs. 58.7 tok/s on 7B models. Single .zse file, no network calls.
- 7B cold start: ~9s
- 7B throughput: 58.7 tok/s
- 32B VRAM: 20.9 GB
- 32B throughput: 26.9 tok/s
Get Running in 3 Steps
From zero to serving models in under a minute
Install ZSE
One pip command to get started. No complex dependencies or configurations.
Convert to .zse
Convert any HuggingFace model to optimized .zse format with 11× faster loading.
Serve Your Model
Start the OpenAI-compatible API server with instant cold starts.
Query the API
Use the OpenAI-compatible API with any client library or framework.
Why ZSE?
ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.
bitsandbytes (Standard)
Every time you load a model:
- Download FP16 weights (14GB for 7B model)
- Quantize to INT4 (takes 40+ seconds)
- Finally ready to use
.zse Format (Pre-quantized)
With ZSE, you quantize once and load instantly:
- One-time: `zse quantize` → .zse file
- Every load: read pre-quantized weights (instant)
- Ready in seconds, not minutes
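The trade-off above can be sketched with back-of-the-envelope arithmetic, using the figures quoted in this section (≈40s re-quantization per bitsandbytes load, a one-time quantize plus the measured 9.1s load for .zse; the numbers are illustrative, not a benchmark):

```python
# Back-of-the-envelope cold-start comparison for a 7B model,
# using the illustrative figures quoted above.
BNB_QUANTIZE_S = 40.0   # bitsandbytes re-quantizes on *every* load
ZSE_QUANTIZE_S = 40.0   # one-time cost when creating the .zse file
ZSE_LOAD_S = 9.1        # measured .zse load time (see benchmarks)

def total_seconds(loads: int) -> tuple[float, float]:
    """Total time spent on cold starts over `loads` loads."""
    bnb = loads * BNB_QUANTIZE_S
    zse = ZSE_QUANTIZE_S + loads * ZSE_LOAD_S
    return bnb, zse

bnb, zse = total_seconds(100)
print(f"100 cold starts -> bitsandbytes: {bnb:.0f}s, .zse: {zse:.0f}s")
```

The per-load quantization cost dominates quickly: over 100 cold starts the one-time quantize is amortized away and .zse spends roughly a quarter of the time bitsandbytes does.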
ZSE vs llama.cpp (GGUF)
On Qwen 72B, ZSE loads in 6.5s versus 10.2s for llama.cpp GGUF, about 1.6× faster, while staying in the Python ecosystem.
When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you need fast cold starts with the Python/HuggingFace ecosystem.
Verified Benchmarks
.zse v1.2.0 performance with bnb.matmul_4bit. Tested on NVIDIA H200.
| Model | File Size | Load Time | VRAM | Throughput |
|---|---|---|---|---|
| Qwen 7B | 5.57 GB | 9.1s | 5.9 GB | 58.7 tok/s |
| Qwen 32B | 19.23 GB | 24.1s | 20.9 GB | 26.9 tok/s |
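The table's VRAM numbers can be put in context against FP16 weights. A rough sketch, taking the ~14 GB FP16 figure for 7B quoted earlier and estimating 32B FP16 at ~64 GB (2 bytes per parameter; an assumption, not a measured value):

```python
# Rough VRAM reduction vs. FP16 weights.
# FP16 sizes: 7B from the comparison section; 32B is a 2-bytes/param estimate.
fp16_gb = {"Qwen 7B": 14.0, "Qwen 32B": 64.0}
int4_vram_gb = {"Qwen 7B": 5.9, "Qwen 32B": 20.9}  # from the table above

for model, fp16 in fp16_gb.items():
    saved = 1 - int4_vram_gb[model] / fp16
    print(f"{model}: ~{saved:.0%} less VRAM than FP16")
```

Under these assumptions the 32B model needs roughly a third of its FP16 footprint, which is what lets it fit on a 24GB consumer GPU.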
32B Model Performance (v1.2.0)
Fits on 24GB consumer GPUs (RTX 3090/4090)
Built for Efficiency
Every feature designed for memory efficiency and fast cold starts
58.7 tok/s
Generate at 58.7 tokens/sec on 7B models. Real throughput using bitsandbytes CUDA kernels.
32B in 21GB VRAM
Run 32B models on 24GB consumer GPUs (RTX 3090/4090). True memory efficiency.
9s Cold Start
Load 7B models in 9 seconds. Single .zse file with embedded config and tokenizer.
OpenAI Compatible
Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.
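Because the server speaks the OpenAI wire format, any HTTP client works. A minimal stdlib-only sketch, assuming `zse serve` is running on localhost:8000 with the default model name used in the quickstart:

```python
import json
from urllib import request

# Build the same request as the curl quickstart example.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up:
# with request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

The official `openai` Python SDK should work the same way by pointing its `base_url` at `http://localhost:8000/v1`, since the endpoint paths and payloads match the OpenAI API.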
Perfect For
From local development to production deployments
Serverless Inference
Cold starts measured in seconds, not minutes, make ZSE a strong fit for serverless deployments, where startup time directly costs money.
Local AI Development
Run large models on your laptop. Test and iterate without cloud costs or API rate limits.
Edge Deployment
Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.
Cost Optimization
Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.
Verified Models
Tested and optimized for ZSE. VRAM shown for INT4 quantization.
| Model | Provider | Category | VRAM (INT4) | .zse Ready |
|---|---|---|---|---|
| Qwen 2.5 7B | Alibaba | Chat/Code | 5.9 GB | ✓ |
| Qwen 2.5 32B | Alibaba | Chat/Code | 20.9 GB | ✓ |
| Mistral 7B v0.3 | Mistral AI | Chat | ~6 GB | — |
| DeepSeek Coder 6.7B | DeepSeek | Code | ~5.5 GB | — |
| Llama 3.2 3B | Meta | Chat | ~2.5 GB | — |
| Gemma 2 9B | Google | Reasoning | ~7 GB | — |
| Phi-3 Mini | Microsoft | Reasoning | ~3 GB | — |
| TinyLlama 1.1B | TinyLlama | Testing | ~1 GB | — |
Simple, Powerful API
Start serving models with just a few lines
# Install ZSE
$ pip install zllm-zse
# Convert to optimized .zse format (11× faster loading)
$ zse quantize Qwen/Qwen2.5-7B-Instruct -o ./model.zse
# Serve your model with instant cold starts
$ zse serve ./model.zse --port 8000
# OpenAI-compatible API is ready!
$ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'

Ready to Try ZSE?
Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.