# Model Quantization Techniques

## 📖 Overview

LightX2V supports quantized inference for DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering the numerical precision of model weights and activations.


## 🔧 Quantization Modes

| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
| --- | --- | --- | --- | --- |
| `fp8-vllm` | FP8 channel symmetric | FP8 channel dynamic symmetric | VLLM | H100/H200/H800, RTX 40 series, etc. |
| `int8-vllm` | INT8 channel symmetric | INT8 channel dynamic symmetric | VLLM | A100/A800, RTX 30/40 series, etc. |
| `fp8-sgl` | FP8 channel symmetric | FP8 channel dynamic symmetric | SGL | H100/H200/H800, RTX 40 series, etc. |
| `int8-sgl` | INT8 channel symmetric | INT8 channel dynamic symmetric | SGL | A100/A800, RTX 30/40 series, etc. |
| `fp8-q8f` | FP8 channel symmetric | FP8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| `int8-q8f` | INT8 channel symmetric | INT8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| `int8-torchao` | INT8 channel symmetric | INT8 channel dynamic symmetric | TorchAO | A100/A800, RTX 30/40 series, etc. |
| `int4-g128-marlin` | INT4 group symmetric | FP16 | Marlin | H200/H800/A100/A800, RTX 30/40 series, etc. |
| `fp8-b128-deepgemm` | FP8 block symmetric | FP8 group symmetric | DeepGemm | H100/H200/H800, RTX 40 series, etc. |
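As background on the table's terminology: "channel symmetric" weight quantization stores one static scale per output channel, while "channel dynamic symmetric" activation quantization computes scales on the fly from the live activation values at inference time. The following is a minimal numeric sketch of that math in plain PyTorch, for intuition only; it is not how any of the kernels above are implemented (they fuse this with the matmul):

```python
# Sketch of per-channel symmetric INT8 quantization, illustrative only.
import torch

def quant_weight_per_channel_int8(w: torch.Tensor):
    """Static per-output-channel symmetric INT8 quantization (one scale per row)."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def quant_activation_dynamic_int8(x: torch.Tensor):
    """Dynamic per-row symmetric INT8 quantization, with scales computed
    at inference time from the live activations (hence 'dynamic')."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4, 8)
qw, sw = quant_weight_per_channel_int8(w)
print((qw.float() * sw - w).abs().max())  # small round-trip error
```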


## 🔧 Obtaining Quantized Models

### Method 1: Download Pre-Quantized Models

Download pre-quantized models from LightX2V model repositories:

**DIT Models**

Download pre-quantized DIT models from Wan2.1-Distill-Models:

```bash
# Download DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
    --local-dir ./models \
    --include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```
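After downloading, you can sanity-check that the file really contains FP8 tensors. A minimal sketch using the `safetensors` library; the path below assumes the `--local-dir` from the command above:

```python
# Print the dtype and shape of the first few tensors in the checkpoint.
from safetensors import safe_open

ckpt = "./models/wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
with safe_open(ckpt, framework="pt") as f:
    for name in list(f.keys())[:5]:
        t = f.get_tensor(name)
        print(name, t.dtype, tuple(t.shape))
```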

**Encoder Models**

Download pre-quantized T5 and CLIP models from Encoders-LightX2V:

```bash
# Download T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_t5_umt5-xxl-enc-fp8.pth"

# Download CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```

### Method 2: Self-Quantize Models

For detailed usage of the quantization tools, refer to the Model Conversion Documentation. The sketch below illustrates the underlying idea.
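For intuition only: the core of offline weight quantization is computing a per-channel scale and casting the weight to the low-precision dtype. A minimal sketch of FP8 (E4M3) per-channel conversion with `safetensors` and PyTorch; the file names are hypothetical, and this is not the LightX2V conversion tool, which also writes scales in the exact layout the compute kernels expect:

```python
# Sketch: per-channel symmetric FP8 (E4M3) weight quantization of a
# checkpoint. Illustrative only; file names are hypothetical.
import torch
from safetensors.torch import load_file, save_file

state = load_file("dit_bf16.safetensors")  # hypothetical input checkpoint
out = {}
for name, w in state.items():
    if w.ndim != 2:            # leave non-matmul tensors unquantized
        out[name] = w
        continue
    w = w.float()
    # 448.0 is the largest finite value representable in float8_e4m3fn
    scale = (w.abs().amax(dim=1, keepdim=True) / 448.0).clamp(min=1e-12)
    out[name] = (w / scale).to(torch.float8_e4m3fn)
    out[name + ".scale"] = scale.squeeze(1)

save_file(out, "dit_fp8_e4m3.safetensors")
```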


## 🚀 Using Quantized Models

### DIT Model Quantization

#### Supported Quantization Modes

The DIT quantization scheme (`dit_quant_scheme`) supports: `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, `fp8-b128-deepgemm`.

#### Configuration Example

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model"  // Optional
}
```

💡 **Tip**: When there's only one DIT model in the script's `model_path`, `dit_quantized_ckpt` doesn't need to be specified separately.

### T5 Model Quantization

#### Supported Quantization Modes

The T5 quantization scheme (`t5_quant_scheme`) supports: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`.

#### Configuration Example

```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model"  // Optional
}
```

💡 **Tip**: When a quantized T5 model exists in the script's specified `model_path` (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`), `t5_quantized_ckpt` doesn't need to be specified separately.

### CLIP Model Quantization

#### Supported Quantization Modes

The CLIP quantization scheme (`clip_quant_scheme`) supports: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`.

#### Configuration Example

```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model"  // Optional
}
```

💡 **Tip**: When a quantized CLIP model exists in the script's specified `model_path` (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`), `clip_quantized_ckpt` doesn't need to be specified separately.

## Performance Optimization Strategy

If memory is still insufficient, quantization can be combined with parameter offloading to reduce memory usage further; refer to the Parameter Offload Documentation. A combined configuration sketch follows.
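As an illustration, a combined configuration might look like the sketch below. The quantization keys are the ones introduced above; the two offload keys are assumptions and should be verified against the Parameter Offload Documentation:

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "cpu_offload": true,            // Assumed key; verify against the offload docs
    "offload_granularity": "block"  // Assumed key; verify against the offload docs
}
```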