The distillation pipeline converts a post-trained DreamDojo teacher model into a fast, causal student model capable of long-horizon autoregressive generation at 10 FPS. The pipeline consists of three stages:
- Teacher Generation — Generate multi-step denoising targets from the teacher model.
- Warmup — Train the causal student architecture to match the teacher's outputs.
- Self-Forcing — Finetune the student with its own autoregressive predictions to reduce error accumulation.
After distillation, you can run offline inference to generate videos from a dataset of action sequences, or real-time inference for interactive teleoperation.
Generate denoising targets from the teacher model at few-step noise levels. This pre-computes the supervision for warmup training.

    bash launch_teacher_gen.sh

Train the causal student network to match the teacher's denoising outputs. This initializes the student before self-forcing.

    bash launch_warmup.sh

The warmup experiment configs are defined in `cosmos_predict2/_src/predict2/interactive/configs/experiment/exp_action_warmup.py`.
Finetune the student model on its own autoregressive rollouts to improve long-horizon stability. During this stage, the teacher model provides the score for DMD (distribution matching distillation).

    bash launch_self_forcing.sh

The self-forcing experiment configs are defined in `cosmos_predict2/_src/predict2/interactive/configs/experiment/exp_action_self_forcing.py`.
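The three stages depend on each other, so it can help to chain them and stop at the first failure. The following is a minimal sketch, not a script shipped with the repository: `DRY_RUN`, `run_stage`, and `run_pipeline` are illustrative names, and it assumes the three launch scripts sit in the current directory and need no extra arguments. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: run the three distillation stages in order, stopping at the first failure.
# DRY_RUN, run_stage, and run_pipeline are illustrative names (assumptions).

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

STAGES=(
  launch_teacher_gen.sh     # 1. teacher generation
  launch_warmup.sh          # 2. warmup
  launch_self_forcing.sh    # 3. self-forcing
)

# Print the command for one stage and, unless in dry-run mode, execute it.
run_stage() {
  local script="$1"
  echo "==> bash ${script}"
  if [ "${DRY_RUN}" = "1" ]; then
    return 0
  fi
  bash "${script}"
}

# Run every stage in order; abort as soon as one returns a non-zero status.
run_pipeline() {
  local script
  for script in "${STAGES[@]}"; do
    run_stage "${script}" || { echo "stage failed: ${script}" >&2; return 1; }
  done
}

run_pipeline
```

Saved under a name of your choice (for example `run_distillation.sh`), it would be invoked as `DRY_RUN=0 bash run_distillation.sh` to execute the stages for real.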
Generate videos conditioned on pre-recorded action sequences via:

    bash launch_student_inference.sh

Key arguments:
- `--experiment`: Self-forcing experiment config name, which should match the one used during distillation.
- `--ckpt_path`: Path to the distilled checkpoint.
- `--input_json`: Path to a JSON file containing evaluation entries (each entry specifies a video path, actions, and metadata).
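Because offline inference depends on both a checkpoint and an input JSON file, a quick check before launching can catch a wrong path early. The sketch below is an illustration only: `check_inputs` is a hypothetical helper, and the paths in the usage comment are placeholders. It accepts a checkpoint that is either a file or a directory and requires the JSON file to be non-empty; it does not validate the JSON contents.

```shell
# check_inputs CKPT INPUT_JSON
# Hypothetical helper: returns non-zero if the checkpoint path does not exist
# or if the input JSON file is missing or empty.
check_inputs() {
  local ckpt="$1" input_json="$2" status=0
  # -e accepts both a single checkpoint file and a checkpoint directory.
  [ -e "${ckpt}" ] || { echo "checkpoint not found: ${ckpt}" >&2; status=1; }
  # -s requires the file to exist and to have a size greater than zero.
  [ -s "${input_json}" ] || { echo "input JSON missing or empty: ${input_json}" >&2; status=1; }
  return "${status}"
}

# Usage with placeholder paths:
#   check_inputs checkpoints/student.pt eval/entries.json \
#     && bash launch_student_inference.sh
```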
Run the distilled model interactively with live action inputs (e.g., teleoperation):

    bash launch_student_inference_teleop.sh

Key arguments:
- `--ckpt_path`: Path to the distilled checkpoint.
- `--input_frame`: Path to the initial conditioning frame (PNG image).
- `--action_source`: Action input source (`file` for pre-recorded actions, or other sources for live input).
- `--action_file`: Path to a `.npy` file containing actions (when using the `file` source).
- `--max_latent_frames`: Maximum number of latent frames to generate.
- `--fps`: Target generation framerate (default: 10.0).
- `--save_output`: Path to save the generated video (e.g., `output.mp4`).
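As an illustration of how the arguments above fit together, the sketch below assembles a replay run that reads pre-recorded actions from a file. It rests on assumptions: all paths and the `--max_latent_frames` value are placeholders, `teleop_command` is an illustrative name, and it assumes the launch script forwards command-line flags to the underlying entry point. If the flags are set inside the script instead, edit them there. The function only prints the command so it can be reviewed before running.

```shell
# Placeholder values (assumptions); replace with your own paths.
CKPT_PATH="checkpoints/student.pt"
INPUT_FRAME="inputs/first_frame.png"
ACTION_FILE="inputs/actions.npy"

# Keeping flags in an array preserves each argument as a single word,
# even if a path contains spaces.
TELEOP_ARGS=(
  --ckpt_path "${CKPT_PATH}"
  --input_frame "${INPUT_FRAME}"
  --action_source file            # replay pre-recorded actions
  --action_file "${ACTION_FILE}"
  --max_latent_frames 20          # placeholder value
  --fps 10.0                      # documented default
  --save_output output.mp4
)

# Print the full command, shell-quoted, without executing it.
teleop_command() {
  printf 'bash launch_student_inference_teleop.sh'
  printf ' %q' "${TELEOP_ARGS[@]}"
  printf '\n'
}

teleop_command
```

To run the command rather than print it, the same array can be passed directly: `bash launch_student_inference_teleop.sh "${TELEOP_ARGS[@]}"`.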