# Model Quantization Techniques

## 📖 Overview

LightX2V supports quantized inference for DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering the numerical precision of model weights and activations.


## 🔧 Quantization Modes

| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
| --- | --- | --- | --- | --- |
| `fp8-vllm` | FP8 channel symmetric | FP8 channel dynamic symmetric | VLLM | H100/H200/H800, RTX 40 series, etc. |
| `int8-vllm` | INT8 channel symmetric | INT8 channel dynamic symmetric | VLLM | A100/A800, RTX 30/40 series, etc. |
| `fp8-sgl` | FP8 channel symmetric | FP8 channel dynamic symmetric | SGL | H100/H200/H800, RTX 40 series, etc. |
| `int8-sgl` | INT8 channel symmetric | INT8 channel dynamic symmetric | SGL | A100/A800, RTX 30/40 series, etc. |
| `fp8-q8f` | FP8 channel symmetric | FP8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| `int8-q8f` | INT8 channel symmetric | INT8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| `int8-torchao` | INT8 channel symmetric | INT8 channel dynamic symmetric | TorchAO | A100/A800, RTX 30/40 series, etc. |
| `int4-g128-marlin` | INT4 group symmetric | FP16 | Marlin | H200/H800/A100/A800, RTX 30/40 series, etc. |
| `fp8-b128-deepgemm` | FP8 block symmetric | FP8 group symmetric | DeepGemm | H100/H200/H800, RTX 40 series, etc. |
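As background on the table's terminology: "channel symmetric" weight quantization stores one static scale per output channel, while "channel dynamic symmetric" activation quantization computes scales on the fly from the live activation values at inference time. The following is a minimal numeric sketch of that math in plain PyTorch, for intuition only; it is not how any of the kernels above are implemented (they fuse this with the matmul):

```python
# Sketch of per-channel symmetric INT8 quantization, illustrative only.
import torch

def quant_weight_per_channel_int8(w: torch.Tensor):
    """Static per-output-channel symmetric INT8 quantization (one scale per row)."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def quant_activation_dynamic_int8(x: torch.Tensor):
    """Dynamic per-row symmetric INT8 quantization, with scales computed
    at inference time from the live activations (hence 'dynamic')."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4, 8)
qw, sw = quant_weight_per_channel_int8(w)
print((qw.float() * sw - w).abs().max())  # small round-trip error
```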


## 🔧 Obtaining Quantized Models

### Method 1: Download Pre-Quantized Models

Download pre-quantized models from LightX2V model repositories:

**DIT Models**

Download pre-quantized DIT models from Wan2.1-Distill-Models:

```bash
# Download DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
    --local-dir ./models \
    --include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```
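After downloading, you can sanity-check that the file really contains FP8 tensors. A minimal sketch using the `safetensors` library; the path below assumes the `--local-dir` from the command above:

```python
# Print the dtype and shape of the first few tensors in the checkpoint.
from safetensors import safe_open

ckpt = "./models/wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
with safe_open(ckpt, framework="pt") as f:
    for name in list(f.keys())[:5]:
        t = f.get_tensor(name)
        print(name, t.dtype, tuple(t.shape))
```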

**Encoder Models**

Download pre-quantized T5 and CLIP models from Encoders-LightX2V:

```bash
# Download T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_t5_umt5-xxl-enc-fp8.pth"

# Download CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```

### Method 2: Self-Quantize Models

For detailed usage of the quantization tools, refer to the Model Conversion Documentation. The sketch below illustrates the underlying idea.
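For intuition only: the core of offline weight quantization is computing a per-channel scale and casting the weight to the low-precision dtype. A minimal sketch of FP8 (E4M3) per-channel conversion with `safetensors` and PyTorch; the file names are hypothetical, and this is not the LightX2V conversion tool, which also writes scales in the exact layout the compute kernels expect:

```python
# Sketch: per-channel symmetric FP8 (E4M3) weight quantization of a
# checkpoint. Illustrative only; file names are hypothetical.
import torch
from safetensors.torch import load_file, save_file

state = load_file("dit_bf16.safetensors")  # hypothetical input checkpoint
out = {}
for name, w in state.items():
    if w.ndim != 2:            # leave non-matmul tensors unquantized
        out[name] = w
        continue
    w = w.float()
    # 448.0 is the largest finite value representable in float8_e4m3fn
    scale = (w.abs().amax(dim=1, keepdim=True) / 448.0).clamp(min=1e-12)
    out[name] = (w / scale).to(torch.float8_e4m3fn)
    out[name + ".scale"] = scale.squeeze(1)

save_file(out, "dit_fp8_e4m3.safetensors")
```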


## 🚀 Using Quantized Models

### DIT Model Quantization

#### Supported Quantization Modes

The DIT quantization scheme (`dit_quant_scheme`) supports: `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, `fp8-b128-deepgemm`.

#### Configuration Example

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model"  // Optional
}
```

💡 **Tip**: When there's only one DIT model in the script's `model_path`, `dit_quantized_ckpt` doesn't need to be specified separately.

### T5 Model Quantization

#### Supported Quantization Modes

The T5 quantization scheme (`t5_quant_scheme`) supports: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`.

#### Configuration Example

```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model"  // Optional
}
```

💡 **Tip**: When a quantized T5 model exists in the script's specified `model_path` (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`), `t5_quantized_ckpt` doesn't need to be specified separately.

### CLIP Model Quantization

#### Supported Quantization Modes

The CLIP quantization scheme (`clip_quant_scheme`) supports: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`.

#### Configuration Example

```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model"  // Optional
}
```

💡 **Tip**: When a quantized CLIP model exists in the script's specified `model_path` (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`), `clip_quantized_ckpt` doesn't need to be specified separately.

## Performance Optimization Strategy

If memory is still insufficient, quantization can be combined with parameter offloading to reduce memory usage further; refer to the Parameter Offload Documentation. A combined configuration sketch follows.
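As an illustration, a combined configuration might look like the sketch below. The quantization keys are the ones introduced above; the two offload keys are assumptions and should be verified against the Parameter Offload Documentation:

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "cpu_offload": true,            // Assumed key; verify against the offload docs
    "offload_granularity": "block"  // Assumed key; verify against the offload docs
}
```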