This document provides an overview of the LightX2V framework, a lightweight image and video generation inference system designed for efficient deployment of diffusion-based generative models. LightX2V specializes in running inference for text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), image-editing (I2I), and audio-to-video (S2V) tasks across diverse hardware platforms.
Sources: README.md:1-332 README_zh.md:1-332 lightx2v/infer.py:1-151
LightX2V is an inference-only framework that converts text, images, or audio inputs into visual outputs (images or videos). The "X2V" naming reflects this transformation: X (various input modalities) → V (vision/video output). The framework prioritizes two goals: fast inference and lightweight resource usage, so that large diffusion models can be deployed efficiently on diverse hardware.
The framework does not include training capabilities—it exclusively handles inference for pre-trained diffusion models.
Sources: README.md:17-19 README_zh.md:17-19
LightX2V organizes inference into three sequential stages, coordinated by the LightX2VPipeline class: input encoding, diffusion denoising, and VAE decoding.
**Stage 1: Input Encoding.** Input modalities are converted into latent representations that the diffusion model can process:
| Encoder Type | Input | Output | Key Classes |
|---|---|---|---|
| Text Encoder | Text prompts | Contextual embeddings (4096-dim T5 or 1280-dim CLIP) | T5EncoderModel, CLIPModel |
| Image Encoder | Reference images | Visual features (CLIP) + latent frames (VAE) | CLIPModel, WanVAE, Wan2_2_VAE |
| Audio Encoder | Audio waveforms | Audio features (1024-dim Hubert) | SekoAudioEncoderModel, AudioAdapter |
The encoding stage is implemented in runner methods like _run_input_encoder_local_i2v() and _run_input_encoder_local_t2v().
Sources: lightx2v/models/runners/default_runner.py:276-337 lightx2v/models/runners/wan/wan_runner.py:222-279 lightx2v/models/runners/wan/wan_audio_runner.py:405-465
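In simplified form, the I2V encoding stage gathers three conditioning signals for the diffusion model. The sketch below is illustrative; the function and argument names are not the actual method signatures:

```python
def encode_inputs_i2v(text_encoder, image_encoder, vae, prompt, image):
    """Sketch of stage 1 for image-to-video (names are illustrative)."""
    text_embeds = text_encoder(prompt)    # T5/CLIP contextual embeddings
    clip_features = image_encoder(image)  # CLIP visual features for conditioning
    image_latents = vae.encode(image)     # reference frame mapped into latent space
    return text_embeds, clip_features, image_latents
```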
**Stage 2: Diffusion Denoising.** The core computational stage, in which latent noise is iteratively refined into structured latents.
The scheduler controls the denoising process through a predictor-corrector loop with configurable steps (4-50). Key classes:
- WanScheduler: lightx2v/models/schedulers/wan/scheduler.py:13-200
- WanModel: lightx2v/models/networks/wan/model.py:39-513
- WanTransformerInfer: lightx2v/models/networks/wan/infer/transformer_infer.py:17-254

Sources: lightx2v/models/schedulers/wan/scheduler.py:13-200 lightx2v/models/networks/wan/model.py:415-478 lightx2v/models/networks/wan/infer/transformer_infer.py:90-154
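The loop itself reduces to a familiar pattern. The following is a minimal sketch, assuming illustrative set_timesteps/step method names rather than the exact WanScheduler API:

```python
def denoise(scheduler, model, latents, cond, infer_steps=40):
    """Sketch of the scheduler-driven denoising loop (names are illustrative)."""
    scheduler.set_timesteps(infer_steps)      # 4-50 steps depending on config
    for t in scheduler.timesteps:
        noise_pred = model(latents, t, cond)  # transformer predicts noise at step t
        latents = scheduler.step(noise_pred, t, latents)  # corrector update
    return latents                            # structured latents for the VAE
```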
**Stage 3: VAE Decoding.** Latent representations are decoded into pixel-space outputs:
The VAE decoder operates with stride factors (typically 8×8 spatially) to convert 16-channel latents into 3-channel RGB frames. Optional post-processing includes:
- Frame interpolation (RIFEWrapper)
- Video super-resolution (VSRWrapper)

Sources: lightx2v/models/runners/default_runner.py:391-413 lightx2v/models/video_encoders/hf/wan/vae.py:1-200
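The decode arithmetic can be illustrated with a shape-level sketch. Here decode() is an assumed entry point, and the exact temporal upsampling factor depends on the VAE variant:

```python
import torch

def decode_latents(vae, latents: torch.Tensor) -> torch.Tensor:
    """Sketch: 16-channel latents -> 3-channel RGB frames.

    With an 8x8 spatial stride, a (1, 16, 21, 60, 104) latent tensor decodes
    to roughly (1, 3, 81, 480, 832); temporal upsampling depends on the VAE.
    """
    frames = vae.decode(latents)    # decode() is an assumed entry point
    return frames.clamp(-1.0, 1.0)  # typical VAE output range before export
```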
LightX2V uses a Runner pattern where each model architecture has a dedicated runner class that orchestrates the three-stage pipeline. Runners are registered via the RUNNER_REGISTER decorator.
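The registration mechanism follows a standard decorator-registry pattern. The following is a sketch of the idea, not the exact code in lightx2v/utils/registry_factory.py:

```python
class Register(dict):
    """Minimal registry: maps a model_cls string to a runner class."""

    def __call__(self, key):
        def decorator(cls):
            self[key] = cls  # e.g. RUNNER_REGISTER["wan2.1"] = WanRunner
            return cls
        return decorator


RUNNER_REGISTER = Register()


@RUNNER_REGISTER("wan2.1")  # key matched against config["model_cls"]
class WanRunner:
    pass


runner_cls = RUNNER_REGISTER["wan2.1"]  # lookup performed by LightX2VPipeline
```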
Each runner implements the following lifecycle methods:
| Method | Purpose | Example |
|---|---|---|
| init_modules() | Load model components (transformer, encoders, VAE) | lightx2v/models/runners/default_runner.py:71-95 |
| load_transformer() | Instantiate the main diffusion model | lightx2v/models/runners/wan/wan_runner.py:44-58 |
| init_scheduler() | Create scheduler for diffusion sampling | lightx2v/models/runners/wan/wan_runner.py:201-214 |
| run_pipeline() | Execute the full inference pipeline | lightx2v/models/runners/default_runner.py:459-474 |
| run_segment() | Run the transformer for N denoising steps | lightx2v/models/runners/default_runner.py:170-204 |
Runners also handle task-specific input processing. For example, WanAudioRunner implements multi-person audio segmentation and face detection for audio-to-video tasks.
Sources: lightx2v/models/runners/default_runner.py:56-488 lightx2v/models/runners/wan/wan_runner.py:35-200 lightx2v/models/runners/wan/wan_audio_runner.py:278-726 lightx2v/utils/registry_factory.py
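Putting the lifecycle together, a runner's control flow reduces to roughly the following sketch, where run_input_encoder and run_vae_decoder are illustrative names for the encode and decode steps:

```python
def run_pipeline(runner):
    """Illustrative control flow mirroring the lifecycle table above."""
    runner.init_scheduler()                  # configure diffusion sampling
    inputs = runner.run_input_encoder()      # stage 1: encode text/image/audio
    latents = runner.run_segment(inputs)     # stage 2: N denoising steps
    return runner.run_vae_decoder(latents)   # stage 3: decode latents to pixels
```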
LightX2V supports three major model families, each with specialized variants:
| Task | Description | Primary Runners | Key Feature |
|---|---|---|---|
| T2V | Text → Video | WanRunner, HunyuanVideo15Runner | Generates video from text prompts |
| I2V | Image → Video | WanRunner, Wan22MoeRunner | Animates static images |
| S2V | Audio → Video | WanAudioRunner, Wan22AudioRunner | Syncs facial animation to audio (digital humans) |
| T2I | Text → Image | QwenImageRunner | Generates single images |
| I2I | Image → Image | QwenImageRunner | Edits existing images |
Sources: README.md:200-228 lightx2v/infer.py:36-66 lightx2v/models/runners/wan/wan_runner.py:35-40
The LightX2VPipeline class provides the primary user interface for inference:
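A minimal usage sketch is shown below; generate() and several of its argument names are assumptions here, so see README.md:132-189 for the authoritative example:

```python
from lightx2v import LightX2VPipeline

# Construct the pipeline; model_cls selects the runner from RUNNER_REGISTER.
pipe = LightX2VPipeline(
    model_cls="wan2.1",                    # assumed registry key
    model_path="/path/to/Wan2.1-I2V-14B",  # local pre-trained weights
    task="i2v",
)

# Run generation; parameter names follow the configuration table below.
pipe.generate(
    prompt="A cat walks along a sunlit windowsill",
    image_path="cat.png",
    infer_steps=40,
    save_result_path="output.mp4",
)
```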
The pipeline internally:
- Builds the full configuration via set_config()
- Looks up the runner class through RUNNER_REGISTER[model_cls]
- Initializes the runner's modules and executes its three-stage pipeline

Sources: README.md:132-189 lightx2v/infer.py:29-147
LightX2V uses a hierarchical configuration system managed by set_config():
| Category | Example Parameters | Purpose |
|---|---|---|
| Model | model_cls, model_path, num_layers, dim | Define model architecture |
| Task | task, target_video_length, target_height, target_width | Specify generation task |
| Optimization | dit_quantized, dit_quant_scheme, cpu_offload, offload_granularity | Memory and compute optimization |
| Sampling | infer_steps, sample_guide_scale, sample_shift | Control diffusion process |
| Attention | attention_type, self_attn_1_type, cross_attn_1_type | Select attention operators |
| Caching | feature_caching, teacache_thresh | Enable feature reuse |
| Parallel | seq_parallel, cfg_parallel, parallel.seq_p_size | Multi-GPU configuration |
The LockableDict class ensures configuration immutability after initialization to prevent runtime modification bugs.
Sources: lightx2v/utils/set_config.py:14-130 lightx2v/models/runners/default_runner.py:148-165
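A minimal sketch of the idea behind LockableDict (the actual implementation in the codebase may differ):

```python
class LockableDict(dict):
    """Sketch: a dict that rejects writes once locked."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._locked = False

    def lock(self):
        self._locked = True

    def __setitem__(self, key, value):
        if getattr(self, "_locked", False):
            raise RuntimeError(f"Config is locked; cannot modify {key!r}")
        super().__setitem__(key, value)


config = LockableDict(model_cls="wan2.1", infer_steps=40)
config.lock()
# config["infer_steps"] = 50  -> RuntimeError after lock()
```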
LightX2V achieves high performance through a multi-layered optimization stack.
Optimizations are configured via config parameters and applied at initialization:
- Step distillation: distilled checkpoints served by a dedicated runner (wan2.1_distill runner)
- Quantization: dit_quantized=True and dit_quant_scheme (e.g., "fp8-triton")
- CPU offloading: cpu_offload=True and offload_granularity ("model", "block", or "phase")
- Attention operators: attention_type, self_attn_1_type, etc. (e.g., "sage_attn2")
- Feature caching: feature_caching ("Tea", "Mag", "Ada")
- Parallelism: parallel dict with seq_p_size and cfg_p_size

The framework automatically applies compatible optimizations. For example, FP8 quantization with Sage Attention 2 is a common high-performance configuration.
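For instance, a memory-constrained, high-throughput setup might combine several of these switches. The parameter names below come from the configuration table above; the specific combination and values are illustrative:

```python
optimized_config = {
    "dit_quantized": True,
    "dit_quant_scheme": "fp8-triton",  # FP8 weights via Triton kernels
    "attention_type": "sage_attn2",    # Sage Attention 2 operator
    "cpu_offload": True,
    "offload_granularity": "block",    # offload per transformer block
    "feature_caching": "Tea",          # TeaCache-style feature reuse
    "infer_steps": 8,                  # fewer steps when paired with distillation
}
```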
Sources: README.md:246-263 lightx2v/models/networks/wan/model.py:62-95 lightx2v/models/networks/wan/infer/transformer_infer.py:17-72
The lightx2v_platform subsystem enables hardware portability.
The platform layer abstracts hardware-specific operations through registries:
- PLATFORM_DEVICE_REGISTER: device initialization and distributed setup
- ROPE_REGISTER: rotary position embedding implementations
- ATTN_WEIGHT_REGISTER: attention operator implementations

Platform selection occurs via the PLATFORM environment variable (e.g., PLATFORM=mlu for Cambricon). Each backend provides:
- A device management module (torch.cuda equivalent)

Sources: lightx2v_platform/README.md:1-19 lightx2v_platform/README_zh.md:1-20 lightx2v/infer.py:132-134
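Backend selection can be sketched as follows; the registry contents shown are hypothetical stand-ins for the registries named above:

```python
import os

# Hypothetical stand-in for PLATFORM_DEVICE_REGISTER described above.
PLATFORM_DEVICE_REGISTER = {"cuda": object, "mlu": object}

# PLATFORM chooses the backend before any model is loaded,
# e.g. PLATFORM=mlu on Cambricon hardware (per the platform README).
platform = os.environ.get("PLATFORM", "cuda")
device_backend = PLATFORM_DEVICE_REGISTER[platform]  # device init + distributed setup
```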
This end-to-end flow demonstrates how components interact for an image-to-video task: the runner encodes the prompt and reference image (stage 1), the scheduler drives the transformer through N denoising steps (stage 2), and the VAE decodes the final latents into video frames (stage 3).
Sources: lightx2v/models/runners/default_runner.py:459-474 lightx2v/models/runners/default_runner.py:170-204 lightx2v/models/networks/wan/model.py:415-478 lightx2v/models/schedulers/wan/scheduler.py:51-96
For reference, here are the primary file paths for core components:
| Component | Path |
|---|---|
| Main entry point | lightx2v/infer.py |
| Pipeline API | lightx2v/__init__.py (exports LightX2VPipeline) |
| Base runner | lightx2v/models/runners/base_runner.py |
| Default runner | lightx2v/models/runners/default_runner.py |
| WAN runners | lightx2v/models/runners/wan/ |
| HunyuanVideo runners | lightx2v/models/runners/hunyuan_video/ |
| Qwen runners | lightx2v/models/runners/qwen_image/ |
| Model implementations | lightx2v/models/networks/ |
| Schedulers | lightx2v/models/schedulers/ |
| Encoders (text/image/audio) | lightx2v/models/input_encoders/ |
| VAE encoders/decoders | lightx2v/models/video_encoders/ |
| Attention operators | lightx2v/common/ops/attn/ |
| Quantization | lightx2v/common/ops/mm/ |
| Configuration | lightx2v/utils/set_config.py |
| Platform abstraction | lightx2v_platform/ |
Sources: lightx2v/infer.py:1-151 lightx2v/models/runners/default_runner.py:1-489 lightx2v/models/networks/wan/model.py:1-513