This document provides an overview of the LightX2V framework, a lightweight image and video generation inference system designed for efficient deployment of diffusion-based generative models. LightX2V specializes in running inference for text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), image-editing (I2I), and audio-to-video (S2V) tasks across diverse hardware platforms.
Sources: README.md:1-332 README_zh.md:1-332 lightx2v/infer.py:1-151
LightX2V is an inference-only framework that converts text, images, or audio inputs into visual outputs (images or videos). The "X2V" naming reflects this transformation: X (various input modalities) → V (vision/video output). The framework prioritizes two goals: fast inference and lightweight resource usage, so that large diffusion models can be deployed efficiently on diverse hardware.
The framework does not include training capabilities—it exclusively handles inference for pre-trained diffusion models.
Sources: README.md:17-19 README_zh.md:17-19
LightX2V organizes inference into three sequential stages, coordinated by the LightX2VPipeline class: input encoding, diffusion denoising, and VAE decoding.
**Stage 1: Input Encoding.** Input modalities are converted into latent representations that the diffusion model can process:
| Encoder Type | Input | Output | Key Classes |
|---|---|---|---|
| Text Encoder | Text prompts | Contextual embeddings (4096-dim T5 or 1280-dim CLIP) | T5EncoderModel, CLIPModel |
| Image Encoder | Reference images | Visual features (CLIP) + latent frames (VAE) | CLIPModel, WanVAE, Wan2_2_VAE |
| Audio Encoder | Audio waveforms | Audio features (1024-dim Hubert) | SekoAudioEncoderModel, AudioAdapter |
The encoding stage is implemented in runner methods like _run_input_encoder_local_i2v() and _run_input_encoder_local_t2v().
Sources: lightx2v/models/runners/default_runner.py:276-337 lightx2v/models/runners/wan/wan_runner.py:222-279 lightx2v/models/runners/wan/wan_audio_runner.py:405-465
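In simplified form, the I2V encoding stage gathers three conditioning signals for the diffusion model. The sketch below is illustrative; the function and argument names are not the actual method signatures:

```python
def encode_inputs_i2v(text_encoder, image_encoder, vae, prompt, image):
    """Sketch of stage 1 for image-to-video (names are illustrative)."""
    text_embeds = text_encoder(prompt)    # T5/CLIP contextual embeddings
    clip_features = image_encoder(image)  # CLIP visual features for conditioning
    image_latents = vae.encode(image)     # reference frame mapped into latent space
    return text_embeds, clip_features, image_latents
```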
**Stage 2: Diffusion Denoising.** The core computational stage, in which latent noise is iteratively refined into structured latents.
The scheduler controls the denoising process through a predictor-corrector loop with configurable steps (4-50). Key classes:
- WanScheduler: lightx2v/models/schedulers/wan/scheduler.py:13-200
- WanModel: lightx2v/models/networks/wan/model.py:39-513
- WanTransformerInfer: lightx2v/models/networks/wan/infer/transformer_infer.py:17-254

Sources: lightx2v/models/schedulers/wan/scheduler.py:13-200 lightx2v/models/networks/wan/model.py:415-478 lightx2v/models/networks/wan/infer/transformer_infer.py:90-154
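The loop itself reduces to a familiar pattern. The following is a minimal sketch, assuming illustrative set_timesteps/step method names rather than the exact WanScheduler API:

```python
def denoise(scheduler, model, latents, cond, infer_steps=40):
    """Sketch of the scheduler-driven denoising loop (names are illustrative)."""
    scheduler.set_timesteps(infer_steps)      # 4-50 steps depending on config
    for t in scheduler.timesteps:
        noise_pred = model(latents, t, cond)  # transformer predicts noise at step t
        latents = scheduler.step(noise_pred, t, latents)  # corrector update
    return latents                            # structured latents for the VAE
```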
**Stage 3: VAE Decoding.** Latent representations are decoded into pixel-space outputs:
The VAE decoder operates with stride factors (typically 8×8 spatially) to convert 16-channel latents into 3-channel RGB frames. Optional post-processing includes:
- Frame interpolation (RIFEWrapper)
- Video super-resolution (VSRWrapper)

Sources: lightx2v/models/runners/default_runner.py:391-413 lightx2v/models/video_encoders/hf/wan/vae.py:1-200
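The decode arithmetic can be illustrated with a shape-level sketch. Here decode() is an assumed entry point, and the exact temporal upsampling factor depends on the VAE variant:

```python
import torch

def decode_latents(vae, latents: torch.Tensor) -> torch.Tensor:
    """Sketch: 16-channel latents -> 3-channel RGB frames.

    With an 8x8 spatial stride, a (1, 16, 21, 60, 104) latent tensor decodes
    to roughly (1, 3, 81, 480, 832); temporal upsampling depends on the VAE.
    """
    frames = vae.decode(latents)    # decode() is an assumed entry point
    return frames.clamp(-1.0, 1.0)  # typical VAE output range before export
```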
LightX2V uses a Runner pattern where each model architecture has a dedicated runner class that orchestrates the three-stage pipeline. Runners are registered via the RUNNER_REGISTER decorator.
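The registration mechanism follows a standard decorator-registry pattern. The following is a sketch of the idea, not the exact code in lightx2v/utils/registry_factory.py:

```python
class Register(dict):
    """Minimal registry: maps a model_cls string to a runner class."""

    def __call__(self, key):
        def decorator(cls):
            self[key] = cls  # e.g. RUNNER_REGISTER["wan2.1"] = WanRunner
            return cls
        return decorator


RUNNER_REGISTER = Register()


@RUNNER_REGISTER("wan2.1")  # key matched against config["model_cls"]
class WanRunner:
    pass


runner_cls = RUNNER_REGISTER["wan2.1"]  # lookup performed by LightX2VPipeline
```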
Each runner implements the following lifecycle methods:
| Method | Purpose | Example |
|---|---|---|
| init_modules() | Load model components (transformer, encoders, VAE) | lightx2v/models/runners/default_runner.py:71-95 |
| load_transformer() | Instantiate the main diffusion model | lightx2v/models/runners/wan/wan_runner.py:44-58 |
| init_scheduler() | Create scheduler for diffusion sampling | lightx2v/models/runners/wan/wan_runner.py:201-214 |
| run_pipeline() | Execute the full inference pipeline | lightx2v/models/runners/default_runner.py:459-474 |
| run_segment() | Run the transformer for N denoising steps | lightx2v/models/runners/default_runner.py:170-204 |
Runners also handle task-specific input processing. For example, WanAudioRunner implements multi-person audio segmentation and face detection for audio-to-video tasks.
Sources: lightx2v/models/runners/default_runner.py:56-488 lightx2v/models/runners/wan/wan_runner.py:35-200 lightx2v/models/runners/wan/wan_audio_runner.py:278-726 lightx2v/utils/registry_factory.py
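Putting the lifecycle together, a runner's control flow reduces to roughly the following sketch, where run_input_encoder and run_vae_decoder are illustrative names for the encode and decode steps:

```python
def run_pipeline(runner):
    """Illustrative control flow mirroring the lifecycle table above."""
    runner.init_scheduler()                  # configure diffusion sampling
    inputs = runner.run_input_encoder()      # stage 1: encode text/image/audio
    latents = runner.run_segment(inputs)     # stage 2: N denoising steps
    return runner.run_vae_decoder(latents)   # stage 3: decode latents to pixels
```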
LightX2V supports three major model families, each with specialized variants:
| Task | Description | Primary Runners | Key Feature |
|---|---|---|---|
| T2V | Text → Video | WanRunner, HunyuanVideo15Runner | Generates video from text prompts |
| I2V | Image → Video | WanRunner, Wan22MoeRunner | Animates static images |
| S2V | Audio → Video | WanAudioRunner, Wan22AudioRunner | Syncs facial animation to audio (digital humans) |
| T2I | Text → Image | QwenImageRunner | Generates single images |
| I2I | Image → Image | QwenImageRunner | Edits existing images |
Sources: README.md:200-228 lightx2v/infer.py:36-66 lightx2v/models/runners/wan/wan_runner.py:35-40
The LightX2VPipeline class provides the primary user interface for inference:
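A minimal usage sketch is shown below; generate() and several of its argument names are assumptions here, so see README.md:132-189 for the authoritative example:

```python
from lightx2v import LightX2VPipeline

# Construct the pipeline; model_cls selects the runner from RUNNER_REGISTER.
pipe = LightX2VPipeline(
    model_cls="wan2.1",                    # assumed registry key
    model_path="/path/to/Wan2.1-I2V-14B",  # local pre-trained weights
    task="i2v",
)

# Run generation; parameter names follow the configuration table below.
pipe.generate(
    prompt="A cat walks along a sunlit windowsill",
    image_path="cat.png",
    infer_steps=40,
    save_result_path="output.mp4",
)
```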
The pipeline internally:
- Builds the full configuration via set_config()
- Looks up the runner class through RUNNER_REGISTER[model_cls]
- Initializes the runner's modules and executes its three-stage pipeline

Sources: README.md:132-189 lightx2v/infer.py:29-147
LightX2V uses a hierarchical configuration system managed by set_config():
| Category | Example Parameters | Purpose |
|---|---|---|
| Model | model_cls, model_path, num_layers, dim | Define model architecture |
| Task | task, target_video_length, target_height, target_width | Specify generation task |
| Optimization | dit_quantized, dit_quant_scheme, cpu_offload, offload_granularity | Memory and compute optimization |
| Sampling | infer_steps, sample_guide_scale, sample_shift | Control diffusion process |
| Attention | attention_type, self_attn_1_type, cross_attn_1_type | Select attention operators |
| Caching | feature_caching, teacache_thresh | Enable feature reuse |
| Parallel | seq_parallel, cfg_parallel, parallel.seq_p_size | Multi-GPU configuration |
The LockableDict class ensures configuration immutability after initialization to prevent runtime modification bugs.
Sources: lightx2v/utils/set_config.py:14-130 lightx2v/models/runners/default_runner.py:148-165
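A minimal sketch of the idea behind LockableDict (the actual implementation in the codebase may differ):

```python
class LockableDict(dict):
    """Sketch: a dict that rejects writes once locked."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._locked = False

    def lock(self):
        self._locked = True

    def __setitem__(self, key, value):
        if getattr(self, "_locked", False):
            raise RuntimeError(f"Config is locked; cannot modify {key!r}")
        super().__setitem__(key, value)


config = LockableDict(model_cls="wan2.1", infer_steps=40)
config.lock()
# config["infer_steps"] = 50  -> RuntimeError after lock()
```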
LightX2V achieves high performance through a multi-layered optimization stack.
Optimizations are configured via config parameters and applied at initialization:
- Step distillation: distilled checkpoints served by a dedicated runner (wan2.1_distill runner)
- Quantization: dit_quantized=True and dit_quant_scheme (e.g., "fp8-triton")
- CPU offloading: cpu_offload=True and offload_granularity ("model", "block", or "phase")
- Attention operators: attention_type, self_attn_1_type, etc. (e.g., "sage_attn2")
- Feature caching: feature_caching ("Tea", "Mag", "Ada")
- Parallelism: parallel dict with seq_p_size and cfg_p_size

The framework automatically applies compatible optimizations. For example, FP8 quantization with Sage Attention 2 is a common high-performance configuration.
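For instance, a memory-constrained, high-throughput setup might combine several of these switches. The parameter names below come from the configuration table above; the specific combination and values are illustrative:

```python
optimized_config = {
    "dit_quantized": True,
    "dit_quant_scheme": "fp8-triton",  # FP8 weights via Triton kernels
    "attention_type": "sage_attn2",    # Sage Attention 2 operator
    "cpu_offload": True,
    "offload_granularity": "block",    # offload per transformer block
    "feature_caching": "Tea",          # TeaCache-style feature reuse
    "infer_steps": 8,                  # fewer steps when paired with distillation
}
```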
Sources: README.md:246-263 lightx2v/models/networks/wan/model.py:62-95 lightx2v/models/networks/wan/infer/transformer_infer.py:17-72
The lightx2v_platform subsystem enables hardware portability.
The platform layer abstracts hardware-specific operations through registries:
- PLATFORM_DEVICE_REGISTER: device initialization and distributed setup
- ROPE_REGISTER: rotary position embedding implementations
- ATTN_WEIGHT_REGISTER: attention operator implementations

Platform selection occurs via the PLATFORM environment variable (e.g., PLATFORM=mlu for Cambricon). Each backend provides:
- A device management module (torch.cuda equivalent)

Sources: lightx2v_platform/README.md:1-19 lightx2v_platform/README_zh.md:1-20 lightx2v/infer.py:132-134
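Backend selection can be sketched as follows; the registry contents shown are hypothetical stand-ins for the registries named above:

```python
import os

# Hypothetical stand-in for PLATFORM_DEVICE_REGISTER described above.
PLATFORM_DEVICE_REGISTER = {"cuda": object, "mlu": object}

# PLATFORM chooses the backend before any model is loaded,
# e.g. PLATFORM=mlu on Cambricon hardware (per the platform README).
platform = os.environ.get("PLATFORM", "cuda")
device_backend = PLATFORM_DEVICE_REGISTER[platform]  # device init + distributed setup
```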
This end-to-end flow demonstrates how components interact for an image-to-video task: the runner encodes the prompt and reference image (stage 1), the scheduler drives the transformer through N denoising steps (stage 2), and the VAE decodes the final latents into video frames (stage 3).
Sources: lightx2v/models/runners/default_runner.py:459-474 lightx2v/models/runners/default_runner.py:170-204 lightx2v/models/networks/wan/model.py:415-478 lightx2v/models/schedulers/wan/scheduler.py:51-96
For reference, here are the primary file paths for core components:
| Component | Path |
|---|---|
| Main entry point | lightx2v/infer.py |
| Pipeline API | lightx2v/__init__.py (exports LightX2VPipeline) |
| Base runner | lightx2v/models/runners/base_runner.py |
| Default runner | lightx2v/models/runners/default_runner.py |
| WAN runners | lightx2v/models/runners/wan/ |
| HunyuanVideo runners | lightx2v/models/runners/hunyuan_video/ |
| Qwen runners | lightx2v/models/runners/qwen_image/ |
| Model implementations | lightx2v/models/networks/ |
| Schedulers | lightx2v/models/schedulers/ |
| Encoders (text/image/audio) | lightx2v/models/input_encoders/ |
| VAE encoders/decoders | lightx2v/models/video_encoders/ |
| Attention operators | lightx2v/common/ops/attn/ |
| Quantization | lightx2v/common/ops/mm/ |
| Configuration | lightx2v/utils/set_config.py |
| Platform abstraction | lightx2v_platform/ |
Sources: lightx2v/infer.py:1-151 lightx2v/models/runners/default_runner.py:1-489 lightx2v/models/networks/wan/model.py:1-513