

VDOT: Efficient Unified Video Creation via
Optimal Transport Distillation

Yutong Wang1     Haiyu Zhang3,2     Tianfan Xue4,2     Yu Qiao2
Yaohui Wang2     Chang Xu1*     Xinyuan Chen2*    
*Corresponding authors  

Teaser video: Character Replacement in just 4 denoising steps


V2V #1: Depth video to video

Input prompt: Inside a grand cathedral with towering Gothic arches: a lone visitor walks slowly down the central nave toward the altar, sunlight streaming through stained-glass windows high above, pews and columns receding symmetrically into deep perspective.

V2V #2: Pose video to video

Input prompt: An elderly man with white hair sits on a park bench practicing tai chi, dressed in a navy blue cotton-linen practice outfit. His movements are slow and fluid, his arms tracing smooth arcs like flowing clouds, as his center of gravity shifts steadily between his legs. His expression is serene, his breathing even, and his palms turn with gentle yet powerful force. The backdrop is a park shrouded in morning mist, with the lake surface glistening faintly and willow branches swaying gently. A medium-shot side view, the camera slowly pans across, capturing the details of the arm's trajectory and the extension of the fingertips. The lighting is soft, showcasing the harmonious blend of movement and stillness in the Eastern rhythm.

V2V #3: Flow video to video

Input prompt: An orange tabby cat darts across a sun-drenched living room with hardwood floors and scattered toys, intensely chasing a tiny red laser dot that zips erratically over a beige carpet—suddenly leaping, skidding, and making sharp 90-degree turns, its fur rippling with each abrupt movement, dust motes glowing in diagonal sunbeams.

V2V #4: Greyscale video to video

Input prompt: An elderly sculptor in a dusty atelier chisels a marble bust by north-facing window light—fine stone dust floating in sunbeams, sharp highlights on emerging cheekbones, deep undercut shadows in eye sockets, rough raw stone contrasting with polished surfaces, tools scattered on wooden workbench.

V2V #5: Scribble video to video

Input prompt: A paper airplane glides in a smooth arc across a sunlit classroom—simple folded-wing shape tumbling gently, passing in front of chalkboard and rows of empty desks, sunlight highlighting its edges.

MV2V #6: Temporal extension

Input prompt: A basketball player in a dimly lit gym takes a deep breath, steps back beyond the three-point line, and launches a high-arcing shot—the ball spinning with backspin as it climbs, hanging momentarily at its apex under flickering fluorescent lights, then descending cleanly through the net with a soft swish, triggering cheers from blurred spectators in the background.

MV2V #7: Video outpainting

Input prompt: A street performer plays violin in a subway station—original video crops only their instrument and hands; the extended view includes tiled walls with ads, commuters walking in both directions, escalators in the background, and vaulted ceilings with fluorescent lighting.

R2V #8: Multi-reference to video

Figure

Input prompt: A man dressed in a vibrant Hawaiian shirt with a colorful floral pattern, sits on a beach lounge chair. On his shoulder, a Pikachu with a small detective hat perches. The man holds an ice cream cone, taking a bite.

Figure

Input prompt: In a garage, a man sits on a chair. He retrieves a small black utility bag from a white desk.

Composite task #9: First frame + pose video

Input prompt: A futuristic cyborg with glowing blue ocular implants and a sleek black exoskeleton walks deliberately through a ruined metropolis overgrown with bioluminescent vines, neon holographic ads flickering on crumbling skyscrapers under a violet twilight sky.

Composite task #10: Reference image + scribble video

Figure

Input prompt: A jellyfish pulses rhythmically in deep blue ocean water, its translucent bell contracting and expanding with each beat, long tentacles trailing behind in slow, undulating waves as bioluminescent plankton flicker around it.

Untrained task #11: Video inpainting

Input prompt: A colossal golden eagle soared through the bustling city sky, its feathers blazing like flames, radiating a warm glow as its wings spread majestically. The eagle held its head high, eyes gleaming, gently flapping its wings to emit a soft radiance.

Input prompt: A person wearing a black helmet, black jacket, beige pants, and white sneakers is riding a black motorcycle on a highway. The rider has a black backpack and is holding the handlebars while riding in the right lane. The background shows green trees and bushes along the side of the road, with a clear blue sky above. The camera angle is from behind the rider, following closely as they ride down the highway. The lighting is bright and sunny, casting shadows on the road. The scene appears to be real-life footage.

Untrained task #12: Swap anything

Figure

Input prompt: The video shows a person riding a horse across a vast grassland. She appears to be engaged in some kind of outdoor activity or performance. The backdrop features spectacular mountain ranges and a cloudy sky, evoking a sense of tranquility and expansiveness. The entire video is shot from a fixed angle, focusing on the rider and her horse.

Untrained task #13: Character animation

Untrained task #14: Character replacement

Untrained task #15: Video try-on

Figure
Figure

Abstract

The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, which aims to generate videos under various conditions, has gained substantial attention. However, existing video creation models either support only a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose VDOT, an efficient unified video creation model. Concretely, we train the model under the distribution matching distillation (DMD) paradigm. Rather than relying on Kullback-Leibler (KL) minimization alone, we additionally employ a novel computational optimal transport (OT) technique to minimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating the zero-forcing and gradient collapse issues that can arise during KL-based distillation in the few-step generation scenario, and thus improves the efficiency and stability of the distillation process. Further, we integrate a discriminator that lets the model perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. We also curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches baselines that use 100 denoising steps.
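
To make the computational OT ingredient more concrete, the following is a minimal sketch of an entropic (Sinkhorn) transport cost between two small point clouds, written with NumPy in the log domain. It is an illustration under stated assumptions, not the VDOT loss: the abstract does not say which OT solver, ground cost, or feature space is used, and the names `sinkhorn_cost`, `eps`, and `n_iters` are hypothetical.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def sinkhorn_cost(x, y, eps=0.1, n_iters=200):
    """Entropic OT between two uniformly weighted point clouds.

    x: array of shape (n, d), y: array of shape (m, d).
    Returns (transport cost, transport plan of shape (n, m)).
    Assumption: squared Euclidean ground cost and uniform weights.
    """
    n, m = x.shape[0], y.shape[0]
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_a = np.full(n, -np.log(n))  # log of uniform source weights
    log_b = np.full(m, -np.log(m))  # log of uniform target weights
    f = np.zeros(n)  # dual potential for the source points
    g = np.zeros(m)  # dual potential for the target points
    for _ in range(n_iters):
        # Alternate the two dual updates (Sinkhorn iterations, log domain).
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)
    log_plan = (f[:, None] + g[None, :] - cost) / eps + log_a[:, None] + log_b[None, :]
    plan = np.exp(log_plan)
    return float(np.sum(plan * cost)), plan


if __name__ == "__main__":
    points = np.linspace(0.0, 1.0, 8)[:, None]
    near_cost, _ = sinkhorn_cost(points, points + 0.5)
    far_cost, _ = sinkhorn_cost(points, points + 3.0)
    print("cost to nearby cloud:", near_cost)
    print("cost to distant cloud:", far_cost)
```

Because the cost is built from pairwise distances, it grows as the two clouds move apart, which is the geometric behaviour the abstract attributes to the OT distance. A training setup would need a differentiable implementation in the training framework; this NumPy version only shows the computation.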

Method

The contributions of this work are two-fold:

  1. We propose VDOT, an efficient unified video creation framework based on optimal-transport distillation. The OT regularizer provides a geometric constraint to distribution matching, improving training stability and efficiency.
  2. We develop a fully automated multi-task data construction pipeline and curate a comprehensive benchmark, UVCBench. Experiments on UVCBench demonstrate that our unified video creator achieves superior performance on both objective metrics and human evaluations while maintaining few-step inference.
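
The geometric constraint mentioned in the first contribution can be illustrated with a toy 1-D comparison, sketched below under loud assumptions: it uses histograms rather than video or score distributions, a floored KL estimate, and the closed-form 1-D Wasserstein-1 distance. None of this is taken from the VDOT training objective, and the helper names are hypothetical.

```python
import numpy as np


def block_histogram(n_bins, start, width):
    """Uniform probability mass on bins [start, start + width)."""
    hist = np.zeros(n_bins)
    hist[start:start + width] = 1.0 / width
    return hist


def kl_divergence(q, p, floor=1e-12):
    """KL(q || p) with probabilities floored to keep the logs finite."""
    q_safe = np.clip(q, floor, None)
    p_safe = np.clip(p, floor, None)
    return float(np.sum(q * (np.log(q_safe) - np.log(p_safe))))


def wasserstein1_1d(q, p, bin_width=1.0):
    """1-D Wasserstein-1 distance: area between the two CDFs."""
    return float(bin_width * np.sum(np.abs(np.cumsum(q) - np.cumsum(p))))


if __name__ == "__main__":
    target = block_histogram(64, 0, 4)
    for shift in (8, 16, 32):
        moved = block_histogram(64, shift, 4)
        print(shift, kl_divergence(moved, target), wasserstein1_1d(moved, target))
```

Once the two supports no longer overlap, the floored KL value is the same for every shift, so it carries no information about how far apart the distributions are, whereas the Wasserstein-1 distance equals the shift. This is only a toy picture of why a transport-based term can keep providing a useful signal where a KL term saturates.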

Figure: VDOT pipeline
Figure: VDOT dataset
