Quantization should not be exclusive to any particular inference server. oQ produces standard mlx-lm compatible models that work everywhere — oMLX, mlx-lm, and any app that supports MLX safetensors. No custom loader required.
oQ is a data-driven mixed-precision quantization system for Apple Silicon. Instead of assigning bits by fixed rules or tensor type, oQ measures each layer's actual quantization sensitivity through calibration and allocates bits where the data says they matter most.
| Benchmark | Samples | 2-bit mlx-lm | 2-bit oQ | 3-bit mlx-lm | 3-bit oQ | 4-bit mlx-lm | 4-bit oQ |
|---|---|---|---|---|---|---|---|
| MMLU | 300 | 14.0% | 64.0% | 76.3% | 85.0% | 79.7% | 83.3% |
| TruthfulQA | 300 | 17.0% | 80.0% | 81.7% | 86.7% | 87.7% | 88.0% |
| HumanEval | 164 (full) | 0.0% | 78.0% | 84.8% | 86.6% | 87.2% | 85.4% |
| MBPP | 300 | 0.3% | 63.3% | 69.0% | 72.0% | 71.7% | 74.3% |
| Level | Base Bits | Target bpw | Description |
|---|---|---|---|
| oQ2 | 2 | ~2.9 | Extreme compression |
| oQ3 | 3 | ~3.5 | Balanced |
| oQ3.5 | 3 | ~3.8 | Quality balanced |
| oQ4 | 4 | ~4.6 | Recommended |
| oQ5 | 5 | ~5.5 | High quality |
| oQ6 | 6 | ~6.5 | Near-lossless |
| oQ8 | 8 | ~8.6 | Near-lossless |
Base format is affine quantization (group_size=64) for all levels except 8-bit, which uses mxfp8 (group_size=32).
oQ and oQ+ share the same levels. oQ+ adds GPTQ-based weight optimization before quantization.
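As a quick illustration of what the base format means in practice, here is a minimal round trip through MLX affine quantization, assuming `mlx.core`'s `quantize`/`dequantize`; the bits-per-weight arithmetic in the comments is a rough estimate, not oQ's exact accounting:

```python
import mlx.core as mx

# Illustrative only: round-trip one weight matrix through MLX affine quantization.
w = mx.random.normal(shape=(4096, 4096)).astype(mx.float16)

# 4-bit, group_size=64: each group of 64 weights shares one scale and one bias.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)

# Rough bits-per-weight: 4 payload bits plus (scale + bias) amortized over the group,
# roughly 4 + 2 * 16 / 64 = 4.5 bpw for a pure 4-bit tensor; sensitivity boosts push
# the model-wide average toward the ~4.6 bpw oQ4 target.
print(mx.abs(w - w_hat).mean())
```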
1. Load model (full)
2. Measure per-layer sensitivity (relative MSE)
3. Build budget plan (sensitivity-driven bit allocation)
4. GPTQ weight optimization (all quantizable weights)
5. Quantize with mixed-precision predicate
6. Save
The GPTQ step uses Hessian-based error compensation to optimize rounding decisions for every quantizable weight in the model. For MoE models, this includes all routed expert weights (typically 90%+ of total parameters), which are processed using a batched algorithm that handles all experts in a layer simultaneously.
1. Load tensors via mmap
2. Apply model sanitize
3. Measure per-layer sensitivity (temporary model load)
4. Build budget plan
5. Per-tensor quantize + shard flush
6. Save config + tokenizer
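A schematic of that loop, not oQ's actual code: it assumes safetensors' `safe_open` for mmap-backed reads and `mlx.core` for per-tensor quantization and shard writing; the `plan` mapping and the key naming are simplified for illustration.

```python
import mlx.core as mx
from safetensors import safe_open

SHARD_LIMIT = 5 * 1024**3  # flush a shard once ~5 GB of tensors have accumulated

def stream_quantize(path, plan, out_prefix="model"):
    """plan maps tensor name -> bits from the budget plan; illustrative sketch only."""
    shard, shard_bytes, shard_idx = {}, 0, 0
    with safe_open(path, framework="numpy") as f:    # mmap-backed: one tensor in memory at a time
        for name in f.keys():
            w = mx.array(f.get_tensor(name))
            bits = plan.get(name)
            if bits is None:                         # non-quantized floats: cast to bfloat16
                new = {name: w.astype(mx.bfloat16)}
            else:
                wq, scales, biases = mx.quantize(w.astype(mx.float16), group_size=64, bits=bits)
                new = {name: wq, f"{name}.scales": scales, f"{name}.biases": biases}  # key naming simplified
            shard.update(new)
            shard_bytes += sum(v.nbytes for v in new.values())
            if shard_bytes >= SHARD_LIMIT:           # flush at the 5 GB boundary
                mx.save_safetensors(f"{out_prefix}-{shard_idx:05d}.safetensors", shard)
                shard, shard_bytes, shard_idx = {}, 0, shard_idx + 1
    if shard:
        mx.save_safetensors(f"{out_prefix}-{shard_idx:05d}.safetensors", shard)
```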
| Tensor | Treatment |
|---|---|
| lm_head | 8-bit (within budget) |
| MoE router | 8-bit |
| shared_expert_gate | 8-bit |
| Vision encoder | fp16 |
| SSM state params | fp32 |
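Expressed as code, this kind of table becomes a simple override predicate checked before the budget plan applies; the path patterns below are illustrative guesses, not oQ's exact matching rules:

```python
def special_treatment(path: str):
    """Return an override ("8bit", "fp16", "fp32") or None for budget-driven tensors. Illustrative."""
    if path.startswith("lm_head"):
        return "8bit"                 # kept high-precision, counted within the budget
    if ".gate." in path or path.endswith("shared_expert_gate.weight"):
        return "8bit"                 # MoE router / shared-expert gate
    if path.startswith("vision_tower") or path.startswith("visual."):
        return "fp16"                 # vision encoder stays float16
    if ".A_log" in path or ".dt_bias" in path:
        return "fp32"                 # SSM state parameters stay float32
    return None                       # everything else goes through the budget plan
```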
This is the core differentiator of oQ. Instead of fixed tier systems that assign bits by tensor type, oQ runs actual calibration inference through the model and measures where quantization error hurts the most:
sensitivity = MSE(float_output, quantized_output) / mean(float_output²)
Normalizing by output magnitude prevents later layers from appearing artificially sensitive due to residual accumulation.
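In code, the measurement is just a relative MSE between each layer's float and quantized outputs on the calibration batch; a minimal numpy sketch (not the oQ implementation):

```python
import numpy as np

def layer_sensitivity(float_out: np.ndarray, quant_out: np.ndarray) -> float:
    """Relative MSE: quantization error normalized by the layer's output energy."""
    mse = np.mean((float_out - quant_out) ** 2)
    return float(mse / np.mean(float_out ** 2))

# A layer whose output barely changes under quantization scores near 0; a layer
# whose output is badly distorted scores near (or above) 1.
```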
The sensitivity score determines the boost tier:
| Sensitivity Ratio | Boost | Example (oQ4) |
|---|---|---|
| Top (≥50% of max) | base+4 | 4 → 8 bit |
| High (≥20% of max) | base+2 | 4 → 6 bit |
| Moderate (<20%) | base+1 | 4 → 5 bit |
Boosts apply only to non-expert tensors. Routed experts (93-98% of MoE params) stay at base bits — not by rule, but because their byte cost relative to quality gain makes them poor candidates in the budget optimization.
The budget plan ensures total bpw stays within the target and hard cap for each level. The result is that every model gets a different bit allocation tailored to its specific layer sensitivities, rather than a one-size-fits-all profile.
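To make the mechanics concrete, here is a simplified sketch of sensitivity-driven allocation: apply the boost tiers, then trim the least-sensitive boosts until the projected average stays under the level's cap. The helper names and the bpw accounting (which ignores scale/bias overhead) are illustrative, not oQ's actual planner:

```python
def build_budget_plan(sensitivities, base_bits, bpw_cap, tensor_sizes, is_expert):
    """sensitivities: {tensor: relative MSE}; tensor_sizes: {tensor: parameter count}.
    Returns {tensor: assigned bits}. Illustrative sketch."""
    max_s = max(sensitivities.values())
    plan = {}
    for name, s in sensitivities.items():
        if is_expert(name):                       # routed experts stay at base bits
            plan[name] = base_bits
        elif s >= 0.5 * max_s:
            plan[name] = min(base_bits + 4, 8)    # top tier
        elif s >= 0.2 * max_s:
            plan[name] = base_bits + 2            # high tier
        else:
            plan[name] = base_bits + 1            # moderate tier

    def projected_bpw():
        total = sum(tensor_sizes.values())
        return sum(plan[n] * tensor_sizes[n] for n in plan) / total

    # Trim boosts, least-sensitive tensors first, until the projected bpw fits the cap.
    for name in sorted(plan, key=lambda n: sensitivities[n]):
        if projected_bpw() <= bpw_cap:
            break
        plan[name] = max(plan[name] - 1, base_bits)
    return plan
```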
No budget plan. Position-based heuristics only (see the sketch after this list):
- lm_head: 6-bit
- SSM output: 8-bit
- Embedding: base+2
- Sensitive layers (first/last 12.5%): base+1
- Everything else: base
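A rough sketch of these heuristics as a bit-assignment function; the layer-index thresholds and name patterns are illustrative, not oQ's exact rules:

```python
def heuristic_bits(path: str, layer_idx: int, num_layers: int, base_bits: int) -> int:
    """Position-based fallback when no sensitivity data is available. Illustrative."""
    if path.startswith("lm_head"):
        return 6
    if ".ssm." in path and path.endswith("out_proj.weight"):
        return 8                                  # SSM output projection
    if "embed_tokens" in path:
        return base_bits + 2
    # First and last 12.5% of layers are treated as sensitive.
    edge = max(1, num_layers // 8)
    if layer_idx < edge or layer_idx >= num_layers - edge:
        return base_bits + 1
    return base_bits
```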
oQ+ uses an optimized implementation of GPTQ (Frantar et al., arXiv:2210.17323) to improve quantization quality without changing the output format or inference speed.
Standard quantization rounds each weight to the nearest quantization grid point. GPTQ takes a smarter approach: it processes weights column by column, and when rounding one column introduces error, it adjusts the remaining columns to compensate. The adjustment direction is guided by the inverse Hessian of the calibration inputs, which captures how each weight column affects the layer's output.
```
For each column i:
    q[i] = round_to_grid(w[i])
    error = (w[i] - q[i]) / H_inv[i, i]
    w[i+1:] -= error * H_inv[i, i+1:]   # compensate remaining columns
```
The result is the same 4-bit quantized format — identical structure, identical inference speed — but with rounding decisions that minimize actual output error rather than per-element error.
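For readers who want the idea in runnable form, here is a compact numpy sketch that also shows where the Hessian comes from (the calibration activations). It is a didactic simplification that omits the Cholesky factorization, block processing, and per-group quantization grids of the full algorithm:

```python
import numpy as np

def gptq_quantize_layer(w, calib_x, grid, damp=0.01):
    """w: (out_features, in_features); calib_x: (n_samples, in_features); grid: 1-D codebook."""
    # Hessian of the layer's squared output error w.r.t. its weights: H = X^T X (plus damping).
    h = calib_x.T @ calib_x
    h += damp * np.mean(np.diag(h)) * np.eye(h.shape[0])
    h_inv = np.linalg.inv(h)

    w = w.copy()
    q = np.empty_like(w)
    for i in range(w.shape[1]):
        # Round column i to the nearest grid point.
        q[:, i] = grid[np.abs(w[:, i, None] - grid[None, :]).argmin(axis=1)]
        # Fold the rounding error into the not-yet-quantized columns, weighted by H_inv.
        err = (w[:, i] - q[:, i]) / h_inv[i, i]
        w[:, i + 1:] -= np.outer(err, h_inv[i, i + 1:])
    return q
```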
In MoE models, routed experts make up 90%+ of all parameters. Processing them one at a time would take hours. oQ solves this with batched expert GPTQ: all experts in a layer share the same Hessian (since they receive the same input hidden states), so the column-by-column optimization can run on all experts simultaneously as a single batched operation.
For Qwen3.5-35B-A3B (256 experts × 40 layers):
- Per-expert sequential: ~90 minutes
- Batched: ~6 minutes (15x speedup, identical results)
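A numpy sketch of the shared-Hessian batching: every expert in a layer receives the same hidden states, so a single inverse Hessian serves all of them and the column loop runs over a stacked expert weight tensor. Again a didactic simplification, not oQ's code:

```python
import numpy as np

def batched_expert_gptq(w_experts, h_inv, grid):
    """w_experts: (num_experts, out_features, in_features); one shared h_inv per layer."""
    w = w_experts.copy()
    q = np.empty_like(w)
    for i in range(w.shape[-1]):
        col = w[..., i]                                             # (experts, out)
        q[..., i] = grid[np.abs(col[..., None] - grid).argmin(-1)]  # round all experts at once
        err = (col - q[..., i]) / h_inv[i, i]                       # (experts, out)
        w[..., i + 1:] -= err[..., None] * h_inv[i, i + 1:]         # broadcast compensation
    return q
```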
The GPTQ optimization uses the actual target bits assigned by the sensitivity budget plan. If a tensor is boosted to 6-bit, the error compensation optimizes for 6-bit quantization boundaries — not the base 4-bit. This eliminates the mismatch between optimization and final quantization.
Unlike smoothing-based methods that modify normalization weights to redistribute quantization difficulty, oQ's GPTQ implementation only adjusts the rounding of weights that will be quantized. Non-quantized weights (norms, biases) remain untouched, preserving the model's original computation graph.
For large models (70B+), the streaming path processes tensors one at a time via safetensors mmap.
- No full model instantiation.
- Shards flushed at 5 GB boundary.
- Non-quantized float32 weights cast to bfloat16 for inference parity.
- Sensitivity measurement requires temporary model load (peak memory ≈ model size).
Built-in calibration dataset shipped with oQ. No download required.
600 samples across 7 categories, ~726 KB total:
| Category | Samples | Composition |
|---|---|---|
| code | 200 | Python classes, imports, JS snippets (avg 26 lines) |
| en | 150 | Wikipedia + C4 web text + OpenOrca conversations |
| ko | 60 | Wikipedia |
| zh | 50 | Wikipedia |
| ja | 60 | Wikipedia |
| tool_calling | 40 | Function call patterns |
| reasoning | 40 | GSM8K, chain-of-thought |
Code samples include real-world patterns (class definitions, import chains, multi-language) rather than benchmark-only code. Reasoning category covers mathematical and step-by-step inference, which is absent from typical calibration sets.
| Architecture | GPTQ Optimization | Notes |
|---|---|---|
| Qwen3.5 MoE (hybrid attn) | Full (batched experts) | Validated with benchmarks |
| Qwen3.5 dense (hybrid attn) | Full | Same hybrid handling |
| MiniMax-M2.5 MoE | Full | Per-expert dense GPTQ |
| GLM MoE | Full | Fused expert support |
| Step-3.5 MoE | Full | moe.*_proj fused support |
| Nemotron-Cascade MoE | Full | Per-expert dense GPTQ |
| Llama, Mistral, dense models | Full | Standard layer structure |
| VLM models | Full (text) | Vision weights kept fp16 |
All models supported by mlx-lm/mlx-vlm. No architecture restrictions.
oQ's weight optimization is based on the GPTQ algorithm by Frantar et al. The batched expert processing and MoE-aware Hessian sharing are oQ-specific optimizations. Sensitivity-driven budget allocation was inspired by approaches in llm-compressor and GGUF K-quants.