Input prompt: Inside a grand cathedral with towering Gothic arches: a lone visitor walks slowly down the central nave toward the altar, sunlight streaming through stained-glass windows high above, pews and columns receding symmetrically into deep perspective.
Input prompt: An elderly man with white hair sits on a park bench practicing tai chi, dressed in a navy blue cotton-linen practice outfit. His movements are slow and fluid, his arms tracing smooth arcs like flowing clouds, as his center of gravity shifts steadily between his legs. His expression is serene, his breathing even, and his palms turn with gentle yet powerful force. The backdrop is a park shrouded in morning mist, with the lake surface glistening faintly and willow branches swaying gently. A medium-shot side view, the camera slowly pans across, capturing the details of the arm's trajectory and the extension of the fingertips. The lighting is soft, showcasing the harmonious blend of movement and stillness in the Eastern rhythm.
Input prompt: An orange tabby cat darts across a sun-drenched living room with hardwood floors and scattered toys, intensely chasing a tiny red laser dot that zips erratically over a beige carpet—suddenly leaping, skidding, and making sharp 90-degree turns, its fur rippling with each abrupt movement, dust motes glowing in diagonal sunbeams.
Input prompt: An elderly sculptor in a dusty atelier chisels a marble bust by north-facing window light—fine stone dust floating in sunbeams, sharp highlights on emerging cheekbones, deep undercut shadows in eye sockets, rough raw stone contrasting with polished surfaces, tools scattered on wooden workbench.
Input prompt: A paper airplane glides in a smooth arc across a sunlit classroom—simple folded-wing shape tumbling gently, passing in front of chalkboard and rows of empty desks, sunlight highlighting its edges.
Input prompt: A basketball player in a dimly lit gym takes a deep breath, steps back beyond the three-point line, and launches a high-arcing shot—the ball spinning with backspin as it climbs, hanging momentarily at its apex under flickering fluorescent lights, then descending cleanly through the net with a soft swish, triggering cheers from blurred spectators in the background.
Input prompt: A street performer plays violin in a subway station—original video crops only their instrument and hands; the extended view includes tiled walls with ads, commuters walking in both directions, escalators in the background, and vaulted ceilings with fluorescent lighting.
Input prompt: A man dressed in a vibrant Hawaiian shirt with a colorful floral pattern, sits on a beach lounge chair. On his shoulder, a Pikachu with a small detective hat perches. The man holds an ice cream cone, taking a bite.
Input prompt: In a garage, a man sits on a chair. He retrieves a small black utility bag from a white desk.
Input prompt: A futuristic cyborg with glowing blue ocular implants and a sleek black exoskeleton walks deliberately through a ruined metropolis overgrown with bioluminescent vines, neon holographic ads flickering on crumbling skyscrapers under a violet twilight sky.
Input prompt: A jellyfish pulses rhythmically in deep blue ocean water, its translucent bell contracting and expanding with each beat, long tentacles trailing behind in slow, undulating waves as bioluminescent plankton flicker around it.
Input prompt: A colossal golden eagle soared through the bustling city sky, its feathers blazing like flames, radiating a warm glow as its wings spread majestically. The eagle held its head high, eyes gleaming, gently flapping its wings to emit a soft radiance.
Input prompt: A person wearing a black helmet, black jacket, beige pants, and white sneakers is riding a black motorcycle on a highway. The rider has a black backpack and is holding the handlebars while riding in the right lane. The background shows green trees and bushes along the side of the road, with a clear blue sky above. The camera angle is from behind the rider, following closely as they ride down the highway. The lighting is bright and sunny, casting shadows on the road. The scene appears to be real-life footage.
Input prompt: The video shows a person riding a horse across a vast grassland. She appears to be engaged in some kind of outdoor activity or performance. The backdrop features spectacular mountain ranges and a cloudy sky, evoking a sense of tranquility and expansiveness. The entire video is shot from a fixed angle, focusing on the rider and her horse.
The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, which aims to generate videos under various conditions, has gained substantial attention. However, existing video creation models either support only a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Rather than relying on Kullback-Leibler (KL) minimization alone, we additionally employ a novel computational optimal transport (OT) technique to minimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating the zero-forcing and gradient collapse issues that may arise during KL-based distillation in the few-step generation scenario, and thus enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator that lets the model perceive real video data, thereby enhancing the quality of the generated videos. To support the training of unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. We also curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT matches or outperforms baselines that use 100 denoising steps.
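To make the geometric argument concrete, the toy sketch below compares KL divergence with an OT distance on two one-dimensional histograms whose supports do not overlap. It is not the VDOT objective, which applies OT to score distributions inside DMD; it only illustrates, under simplifying assumptions, why an OT distance can remain informative where KL does not. The helper names (`kl_divergence`, `wasserstein_1d`, `point_mass`) are hypothetical, and the exact 1-D Wasserstein-1 distance computed from cumulative sums stands in for a general computational OT solver.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; infinite where p > 0 but q == 0."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def wasserstein_1d(p, q, grid):
    """Exact 1-D Wasserstein-1 distance between histograms on a shared uniform grid."""
    dx = grid[1] - grid[0]
    # In 1-D, W1 equals the integral of the absolute difference of the CDFs.
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx)

def point_mass(grid, center):
    """Histogram with all mass in the bin nearest to `center`."""
    h = np.zeros_like(grid)
    h[np.argmin(np.abs(grid - center))] = 1.0
    return h

grid = np.linspace(0.0, 10.0, 101)
real = point_mass(grid, 2.0)   # stand-in for the "real" distribution
near = point_mass(grid, 3.0)   # "fake" distribution close to the real one
far = point_mass(grid, 8.0)    # "fake" distribution far from the real one

for name, fake in [("near", near), ("far", far)]:
    print(name,
          "KL:", kl_divergence(real, fake),
          "W1:", round(wasserstein_1d(real, fake, grid), 6))
```

In this toy setting KL is infinite for both fake distributions, so it cannot tell the near one from the far one, whereas the OT distance grows with how far the mass has to move.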
The contributions of this work are two-fold: