[arXiv] &nbsp;&nbsp; [Project Page] &nbsp;&nbsp; [Huggingface]
Yixuan Zhu1, Jiaqi Feng1, Wenzhao Zheng1†, Yuan Gao2, Xin Tao2, Pengfei Wan2, Jie Zhou1, Jiwen Lu1
(† Project leader)
1Tsinghua University, 2Kuaishou Technology.
TL;DR: Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.
Astra is an interactive, action-driven world model that predicts long-horizon future videos across diverse real-world scenarios. Built on an autoregressive diffusion transformer with temporal causal attention, Astra supports streaming prediction while preserving strong temporal coherence. Astra introduces noise-augmented history memory to stabilize long rollouts, an action-aware adapter for precise control signals, and a mixture of action experts to route heterogeneous action modalities. Through these key innovations, Astra delivers consistent, controllable, and high-fidelity video futures for applications such as autonomous driving, robot manipulation, and camera motion.
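For intuition, the sketch below shows the general shape of such an action-conditioned autoregressive denoising rollout with a noise-augmented history memory. Every class, tensor shape, and hyperparameter here is an illustrative placeholder, not the actual Astra implementation in this repository.

```python
# Toy sketch of an action-conditioned autoregressive denoising rollout.
# All modules, shapes, and constants are placeholders for illustration only.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for the diffusion transformer backbone (illustrative only)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, noisy_latent, history, action_embed):
        # A real model would apply temporal causal attention over `history`
        # and inject `action_embed` through an action-aware adapter; here the
        # signals are simply mixed to keep the sketch runnable.
        return self.net(noisy_latent + history.mean(dim=1) + action_embed)

def rollout(denoiser, first_latent, actions, noise_std=0.1, denoise_steps=4):
    """Autoregressively predict one future latent chunk per action input."""
    history = [first_latent]
    for action in actions:
        # Noise-augmented history memory: perturb the cached context so the
        # model stays robust to drift in its own past predictions.
        memory = torch.stack(history, dim=1)
        memory = memory + noise_std * torch.randn_like(memory)
        x = torch.randn_like(first_latent)      # start each chunk from noise
        for _ in range(denoise_steps):          # iterative denoising
            x = denoiser(x, memory, action)
        history.append(x)                       # stream the chunk into memory
    return torch.stack(history[1:], dim=1)

futures = rollout(ToyDenoiser(), torch.randn(1, 16), [torch.randn(1, 16)] * 3)
print(futures.shape)  # torch.Size([1, 3, 16])
```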
[Demo videos (long-horizon rollouts under different action trajectories): zigzag_garden, left_road, moveleft_hall, left_right_indoor, zigzag_gate, zigzag_drone_view]
- [2025.11.17]: Released the project page.
- [2025.12.09]: Released the inference code and model checkpoint.
- [ ] Release dataset preprocessing tools
- [ ] Release full inference pipelines for additional scenarios:
  - Autonomous driving
  - Robotic manipulation
  - Drone navigation / exploration
- [ ] Open-source training scripts:
  - Action-conditioned autoregressive denoising training
  - Multi-scenario joint training pipeline
- [ ] Provide unified evaluation toolkit
Astra is built upon Wan2.1-1.3B, a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:
DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```

Install DiffSynth-Studio:

```bash
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```

- Download the pre-trained Wan2.1 models:

```bash
cd script
python download_wan2.1.py
```

- Download the pre-trained Astra checkpoint:
Please download the Astra checkpoint from Huggingface and place it in `models/Astra/checkpoints`.
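If you prefer a scripted download, the snippet below is a minimal sketch using `huggingface_hub`; the repository ID is a placeholder, so substitute the one linked in the Huggingface badge above.

```python
# Illustrative only: "your-org/Astra" is a placeholder repository ID,
# replace it with the actual Huggingface repo linked at the top of this README.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/Astra",              # hypothetical repo ID
    local_dir="models/Astra/checkpoints",  # location referenced by --dit_path below
)
```

With the checkpoint in place, run the demo inference: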
```bash
python infer_demo.py \
--dit_path ../models/Astra/checkpoints/diffusion_pytorch_model.ckpt \
--wan_model_path ../models/Wan-AI/Wan2.1-T2V-1.3B \
--condition_image ../examples/condition_images/garden_1.png \
--cam_type 4 \
--prompt "A sunlit European street lined with historic buildings and vibrant greenery creates a warm, charming, and inviting atmosphere. The scene shows a picturesque open square paved with red bricks, surrounded by classic narrow townhouses featuring tall windows, gabled roofs, and dark-painted facades. On the right side, a lush arrangement of potted plants and blooming flowers adds rich color and texture to the foreground. A vintage-style streetlamp stands prominently near the center-right, contributing to the timeless character of the street. Mature trees frame the background, their leaves glowing in the warm afternoon sunlight. Bicycles are visible along the edges of the buildings, reinforcing the urban yet leisurely feel. The sky is bright blue with scattered clouds, and soft sun flares enter the frame from the left, enhancing the scene's inviting, peaceful mood." \
--output_path ../examples/output_videos/output_moe_framepack_sliding.mp4
```

This inference can be conducted on a single 24GB GPU, such as the NVIDIA 3090.
To test with your own custom images, you need to prepare the target images and their corresponding text prompts. We recommend that the input images be close to 832×480 (width × height, 16:9), which matches the resolution of the generated video and helps achieve better results. For prompt generation, you can refer to the Prompt Extension section in Wan2.1 for guidance on crafting the captions.
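If your source image has a different resolution, a small Pillow script such as the sketch below can resize it to 832×480 first (the file names are just examples, not part of this repository).

```python
# Optional preprocessing: resize a custom condition image to 832x480 (16:9).
# For inputs that are not 16:9, consider cropping to 16:9 before resizing
# to avoid stretching the content.
from PIL import Image

img = Image.open("my_scene.png").convert("RGB")
img = img.resize((832, 480), Image.LANCZOS)  # (width, height)
img.save("my_scene_832x480.png")
```

Then run inference with your own inputs: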
```bash
python infer_demo.py \
--dit_path path/to/your/dit_ckpt \
--wan_model_path path/to/your/Wan2.1-T2V-1.3B \
--condition_image path/to/your/image \
--cam_type your_cam_type \
--prompt your_prompt \
--output_path path/to/your/output_video
```

We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.
| cam_type | Trajectory |
|---|---|
| 1 | Move Forward (Straight) |
| 2 | Rotate Left In Place |
| 3 | Rotate Right In Place |
| 4 | Move Forward + Rotate Left |
| 5 | Move Forward + Rotate Right |
| 6 | S-shaped Trajectory |
| 7 | Rotate Left → Rotate Right |
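Beyond these presets, you can author custom trajectories. The sketch below is purely hypothetical: it assumes a trajectory is a sequence of per-frame 4×4 camera-to-world matrices, and the helper name, frame count, and pose convention are illustrative; the exact format consumed by the inference scripts may differ.

```python
# Hypothetical illustration of building a custom S-shaped camera trajectory as
# per-frame 4x4 camera-to-world matrices; the actual trajectory format expected
# by infer_demo.py may differ, so treat this only as a sketch.
import numpy as np

def s_shaped_trajectory(num_frames=49, forward_step=0.05, max_yaw_deg=15.0):
    poses = []
    for t in range(num_frames):
        # Yaw oscillates left/right while the camera moves forward along +z.
        yaw = np.deg2rad(max_yaw_deg) * np.sin(2 * np.pi * t / num_frames)
        c, s = np.cos(yaw), np.sin(yaw)
        pose = np.eye(4)
        pose[:3, :3] = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        pose[:3, 3] = [0.0, 0.0, forward_step * t]
        poses.append(pose)
    return np.stack(poses)  # shape: (num_frames, 4, 4)

print(s_shaped_trajectory().shape)  # (49, 4, 4)
```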
To facilitate joint training across large-scale open-source datasets, we implement a preprocessing pipeline designed for maximum training efficiency. This process involves three key steps:
- Video Encoding: Compressing raw videos into latent space using a Video VAE.
- Prompt Encoding: Converting textual descriptions into embeddings via a text encoder.
- Action Extraction: Generating precise action/pose embeddings from the source data.
This preprocessing stage must be completed prior to starting the training loop. You can find the implementation scripts and detailed usage instructions in the ./data directory.
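As a rough illustration of what the pipeline produces per sample, the sketch below uses placeholder encoder objects and an ad-hoc output dictionary; the real scripts in `./data` define the actual interfaces and file formats.

```python
# Conceptual sketch of the three preprocessing steps; the encoder objects and
# the returned dictionary are placeholders, not the actual ./data interfaces.
import torch

def preprocess_sample(video, caption, raw_actions, vae, text_encoder, action_encoder):
    with torch.no_grad():
        latents = vae.encode(video)               # 1) video frames -> VAE latents
        text_emb = text_encoder(caption)          # 2) caption -> text embeddings
        action_emb = action_encoder(raw_actions)  # 3) poses/actions -> action embeddings
    return {"latents": latents, "text_emb": text_emb, "action_emb": action_emb}

# Each sample would then be cached to disk (e.g. with torch.save) so the
# training loop streams precomputed tensors instead of re-encoding videos.
```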
Once the data is preprocessed, you can initiate training on a specific dataset. This stage allows for initial model validation or fine-tuning on a targeted domain.
Execute the training script using the following command:
```bash
python train_single.py
```
- GPU Memory: The training process requires approximately 60 GB of VRAM.
- Recommended Hardware: We recommend using high-end GPUs such as the NVIDIA A100 (80GB) or H100 to ensure stable performance and accommodate memory overhead.
- Cost Optimization: For environments with limited resources, `--max_condition_frames` can be reduced to lower VRAM consumption and computational cost.
Looking ahead, we plan to further enhance Astra in several directions:
- Training with Wan-2.2: Upgrade our model using the latest Wan-2.2 framework to release a more powerful version with improved generation quality.
- 3D Spatial Consistency: Explore techniques to better preserve 3D consistency across frames for more coherent and realistic video generation.
- Long-Term Memory: Incorporate mechanisms for long-term memory, enabling the model to handle extended temporal dependencies and complex action sequences.
These directions aim to push Astra towards more robust and interactive video world modeling.
Feel free to explore these outstanding related works, including but not limited to:
ReCamMaster: ReCamMaster re-captures in-the-wild videos with novel camera trajectories.
GCD: GCD synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
ReCapture: a method for generating new videos with novel camera trajectories from a single user-provided video.
Trajectory Attention: Trajectory Attention facilitates various tasks like camera motion control on images and videos, and video editing.
GS-DiT: GS-DiT provides 4D video control for a single monocular video.
Diffusion as Shader: a versatile video generation control model for various tasks.
TrajectoryCrafter: TrajectoryCrafter achieves high-fidelity novel views generation from casually captured monocular video.
GEN3C: a generative video model with precise Camera Control and temporal 3D Consistency.
Please leave us a star and cite our paper if you find our work helpful.
```bibtex
@article{zhu2025astra,
  title={Astra: General Interactive World Model with Autoregressive Denoising},
  author={Zhu, Yixuan and Feng, Jiaqi and Zheng, Wenzhao and Gao, Yuan and Tao, Xin and Wan, Pengfei and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.08931},
  year={2025}
}
```


