SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Junsong Chen¹²*, Yuyang Zhao¹*, Jincheng Yu¹*, Ruihang Chu⁴, Junyu Chen¹, Shuai Yang¹, Xianbang Wang³, Yicheng Pan⁴, Daquan Zhou⁵, Huan Ling¹, Haozhe Liu⁶, Hongwei Yi¹, Hao Zhang¹, Muyang Li³, Yukang Chen¹, Han Cai¹, Sanja Fidler¹, Ping Luo², Song Han¹³, Enze Xie¹
¹ NVIDIA, ² HKU, ³ MIT, ⁴ Tsinghua University, ⁵ PKU, ⁶ KAUST
(* indicates equal contribution)


Abstract

We introduce SANA-Video, a small diffusion model that efficiently generates videos up to 720×1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at remarkably fast speed and is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, and long video generation: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that maintains a constant-memory state derived from the cumulative property of linear attention. This cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReels-V2-1.3B) while being 16× faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the generation of a 5-second 720p video from 71s to 29s (a 2.4× speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.

Several Core Design Details for Efficiency

Linear Diffusion Transformers

  • Replace quadratic attention with O(N) linear attention across the DiT for token-heavy video (a minimal sketch follows this list).
  • Video-specific upgrades: 3D RoPE after ReLU for stability/locality; lightweight temporal Mix-FFN for motion.
  • Results: 2× speedup at 480p and 4× at 720p; unified T2I/T2V/I2V with strong text-video alignment.
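
To make the O(N) claim concrete, the following is a minimal sketch of ReLU-kernel linear attention in PyTorch; the function name, tensor layout, and eps are illustrative assumptions, not the released SANA-Video implementation.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention with a ReLU feature map.

    q, k, v: (batch, heads, tokens, dim). By summarizing keys/values into a
    (dim x dim) matrix first, the cost is O(N * D^2) in the token count N,
    versus O(N^2 * D) for vanilla softmax attention.
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)    # key-value summary: (B, H, D, D)
    z = k.sum(dim=2)                              # normalizer: (B, H, D)
    num = torch.einsum("bhnd,bhde->bhne", q, kv)  # (B, H, N, D)
    den = torch.einsum("bhnd,bhd->bhn", q, z).unsqueeze(-1) + eps
    return num / den
```

In SANA-Video, a 3D RoPE is applied after the ReLU feature map and a lightweight temporal Mix-FFN supplies motion modeling; both are omitted here for brevity.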

Constant-Memory KV Cache for Block Linear Attention

  • Causal linear attention keeps only a cumulative key-value state and key sum, yielding fixed VRAM and O(D^2) per-token compute (see the sketch after this list).
  • Adds a causal temporal cache (previous-block last frame) to maintain motion continuity across blocks.
  • Enables minute-long autoregressive generation; monotonic SNR + improved Self-Forcing stabilize long-horizon quality.
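
Below is a minimal sketch of the constant-memory state, assuming a ReLU feature map and a simplified block-wise schedule in which each new block attends only to the cached state of previous blocks; the class and method names are hypothetical, not SANA-Video's released code.

```python
import torch

class BlockLinearAttentionCache:
    """Constant-memory 'KV cache' for causal block linear attention.

    Instead of storing all past keys/values, keep only the running sums
    S = sum_i relu(k_i) v_i^T and z = sum_i relu(k_i). Memory stays at
    O(D^2) per head no matter how many frames have been generated.
    """

    def __init__(self, num_heads, dim, device="cpu"):
        self.kv = torch.zeros(num_heads, dim, dim, device=device)  # cumulative state S
        self.z = torch.zeros(num_heads, dim, device=device)        # cumulative normalizer

    def attend_and_update(self, q, k, v, eps=1e-6):
        # q, k, v: (heads, block_len, dim) for the current video block.
        q, k = torch.relu(q), torch.relu(k)
        # Attend to all previously generated blocks through the cached state.
        # (Intra-block attention is omitted in this sketch, so the very first
        # block would see an empty context.)
        num = torch.einsum("hnd,hde->hne", q, self.kv)
        den = torch.einsum("hnd,hd->hn", q, self.z).unsqueeze(-1) + eps
        out = num / den
        # Fold the current block into the constant-size state for later blocks.
        self.kv += torch.einsum("hnd,hne->hde", k, v)
        self.z += k.sum(dim=1)
        return out
```

Because the cached state has a fixed (heads, dim, dim) shape, VRAM does not grow with video length, which is what makes minute-long block-wise autoregressive generation feasible.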

We illustrate the causal linear attention mechanism in the video below.

World-Model Applications

  • Demonstrated fine-tunes for embodied robotics, autonomous driving, and game simulation (Minecraft), showing strong long-horizon realism and controllability for simulation and data generation.

LongSANA: Real-Time Minute-Length Video Generation

  • SANA-Video + LongLive achieves 27 FPS real-time, minute-length video generation. We first adapt SANA-Video to causal linear attention and causal Mix-FFN via fine-tuning. We then apply the "Training Long, Test Long" and "Prompt Recache" techniques from LongLive to obtain LongSANA, a real-time few-step video generator. We show the causal linear attention and causal Mix-FFN mechanisms in the video below.

Overall Performance

Our SANA-Video models focus on efficient, high-quality video generation with linear attention. We compare SANA-Video with SoTA text-to-video and image-to-video methods in the videos below.

The comprehensive efficiency and performance comparison between SANA-Video and the state of the art is shown in Table 4. We adopt VBench (Zhang et al., 2024) as the performance metric and the generation latency of a 480p, 81-frame video as the efficiency metric. As shown in Table 4, SANA-Video exhibits a remarkable latency of 60 seconds, making it the fastest model in the comparison: 7.2× faster than MAGI-1 and over 4× faster than Step-Video. In terms of quality, SANA-Video achieves a Total Score of 83.71 on text-to-video generation, comparable with the large Open-Sora-2.0 (14B) model and outperforming Wan2.1 (1.3B). In addition, SANA-Video achieves an 88.02 Total Score on image-to-video generation, outperforming the large DiT models Wan2.1 (14B) and HunyuanVideo-I2V (11B). Furthermore, SANA-Video achieves the best Semantic / I2V score across all methods, demonstrating strong vision-text semantic alignment.

We further conduct quantitative evaluation at 720p resolution and on VBench-Long to verify the effectiveness of SANA-Video and LongSANA for high-resolution and long video generation. At 720p, SANA-Video achieves performance competitive with current SOTA methods while taking only 36s to generate a 5s video, much faster than the others. Besides, we compare LongSANA with previous state-of-the-art methods on 30-second video generation in Table 9. LongSANA achieves the best semantic and total scores. In addition, LongSANA is the fastest method, generating videos in real time at 27.5 FPS, demonstrating its efficiency and effectiveness when handling long video sequences.


Citation

@misc{chen2025sana,
  title={SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer},
  author={Chen, Junsong and Zhao, Yuyang and Yu, Jincheng and Chu, Ruihang and Chen, Junyu and Yang, Shuai and Wang, Xianbang and Pan, Yicheng and Zhou, Daquan and Ling, Huan and others},
  year={2025},
  eprint={2509.24695},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.24695},
}


Acknowledgment

We would like to express our heartfelt gratitude to Shuchen Xue from UCAS, Haocheng Xi from UCB, Songlin Yang, Xingyang Li and Wenkun He from MIT for their invaluable insightful discussions on efficient attention designs, as well as Tian Ye from HKUST(GZ) for his expertise on data curation. Their collaborative efforts and constructive discussions have been instrumental in shaping this work.
