Light-X: Generative 4D Video Rendering with Camera and Illumination Control


Tianqi Liu1,2,3   Zhaoxi Chen1   Zihao Huang1,2,3   Shaocong Xu2   Saining Zhang2,4  
Chongjie Ye5   Bohan Li6,7   Zhiguo Cao3   Wei Li1   Hao Zhao4,2*   Ziwei Liu1*

1S-Lab, NTU     2BAAI     3HUST     4AIR, THU     5FNii, CUHKSZ     6SJTU     7EIT (Ningbo)

TL;DR


Light-X is a video generation framework that jointly controls camera trajectory and illumination from monocular videos.



Overview Video




Camera-Illumination Control

Light-X enables joint camera trajectory and illumination control for a monocular video input.



Diverse Camera Trajectories

Light-X supports diverse camera trajectory control, including bullet time, frozen camera, and dolly zoom.



Text-Conditioned Video Relighting

Light-X enables realistic video relighting given a textual relighting prompt.



Various Lighting Conditions

Beyond text prompts, Light-X supports background-image-, reference-image-, and HDR-conditioned video relighting.



Application in Embodied AI




Application in Autonomous Driving




Abstract


Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control. Moreover, our model surpasses prior video relighting methods in text- and background-conditioned settings. Ablation studies further validate the effectiveness of the disentangled formulation and degradation pipeline.
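This page does not spell out the Light-Syn operations, so the snippet below is only a loosely hedged reading of "degradation-based" pair synthesis: the original in-the-wild clip is kept as the supervision target, and a degraded copy, meant to mimic the imperfect cues available at inference time, becomes the conditioning input. The function name and the specific degradations (a global color shift and random occlusion holes) are crude placeholders, not the paper's actual operations, and the inverse-mapping step is not captured here.

```python
import numpy as np

# Crude, runnable stand-in for degradation-based pair synthesis (an assumption
# about Light-Syn, which this page does not detail). The real pipeline likely
# uses relighting and point-cloud re-projection; these numpy ops are placeholders.

def synthesize_pair(clip, rng):
    """clip: (T, H, W, 3) float array in [0, 1]; rng: np.random.Generator."""
    target = clip  # in-the-wild footage serves as the supervision target

    # Placeholder illumination degradation: a random per-channel gain,
    # standing in for relighting an anchor frame under a different prompt.
    gain = rng.uniform(0.6, 1.4, size=(1, 1, 1, 3))
    degraded = np.clip(clip * gain, 0.0, 1.0)

    # Placeholder viewpoint degradation: random holes, standing in for the
    # missing regions produced by re-projecting a partial point cloud.
    T_, H, W, _ = clip.shape
    mask = rng.random((T_, H, W, 1)) > 0.2  # drop ~20% of pixels
    degraded = degraded * mask

    return {"condition": degraded, "visibility_mask": mask}, target


# Example usage with a dummy clip:
# pair, target = synthesize_pair(np.random.rand(8, 64, 64, 3),
#                                np.random.default_rng(0))
```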



Method


Overview of Light-X. Given an input video \( \mathbf{V}^s \), we first relight one frame with IC-Light, conditioned on a lighting text prompt, to obtain a sparse relit video \( \hat{\mathbf{V}}^s \). We then estimate depths to construct a dynamic point cloud \( \mathcal{P} \) from \( \mathbf{V}^s \) and a relit point cloud \( \hat{\mathcal{P}} \) from \( \hat{\mathbf{V}}^s \). Both point clouds are projected along a user-specified camera trajectory, producing geometry-aligned renders and masks \( (\mathbf{V}^p, \mathbf{V}^m) \) and \( (\hat{\mathbf{V}}^p, \hat{\mathbf{V}}^m) \). These six cues, together with illumination tokens extracted via a Q-Former, are fed into DiT blocks for conditional denoising. Finally, a VAE decoder reconstructs a high-fidelity video \( \mathbf{V}^t \) faithful to the target trajectory and illumination.
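The projection step in this pipeline is standard pinhole geometry. As a rough illustration (not the released code), the sketch below lifts a single frame to a point cloud using its depth map and splats it into a new camera pose, producing one frame of the geometry-aligned render \( \mathbf{V}^p \) and its visibility mask \( \mathbf{V}^m \); applying the same operation to the relit frames would give \( \hat{\mathbf{V}}^p \) and \( \hat{\mathbf{V}}^m \). The function name, the assumption that the source camera defines the world frame, and the nearest-pixel splatting are illustrative simplifications.

```python
import numpy as np

# Minimal sketch of the point-cloud projection step: unproject one frame with
# its depth map, transform into a target camera pose, and splat the colors back
# onto the image plane. Returns one frame of the render and its visibility mask.

def reproject(frame, depth, K, T_new):
    """frame: (H, W, 3), depth: (H, W), K: (3, 3) intrinsics,
    T_new: (4, 4) pose mapping source-camera coordinates to the target camera."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Unproject source pixels to 3D points (source camera acts as world frame).
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)       # (3, H*W)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])          # (4, H*W)

    # Transform into the target camera and project with the intrinsics.
    proj = K @ (T_new @ pts_h)[:3]
    z = proj[2]
    uv = np.round(proj[:2] / np.clip(z, 1e-6, None)).astype(int)

    render = np.zeros_like(frame)
    mask = np.zeros((H, W), dtype=bool)
    valid = (z > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)

    # Nearest-pixel splatting (a z-buffer would resolve overlaps; omitted here).
    colors = frame.reshape(-1, 3)
    render[uv[1, valid], uv[0, valid]] = colors[valid]
    mask[uv[1, valid], uv[0, valid]] = True
    return render, mask
```

The mask marks pixels covered by projected points; everything outside it is content the diffusion model must synthesize for the new viewpoint and illumination.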



Comparisons

Light-X outperforms prior baselines in temporal consistency, illumination fidelity, and novel-view content generation.



Citation


@article{liu2025light,
  title={Light-X: Generative 4D Video Rendering with Camera and Illumination Control},
  author={Liu, Tianqi and Chen, Zhaoxi and Huang, Zihao and Xu, Shaocong and Zhang, Saining and Ye, Chongjie and Li, Bohan and Cao, Zhiguo and Li, Wei and Zhao, Hao and others},
  journal={arXiv preprint arXiv:2512.05115},
  year={2025}
}