LaST₀: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model

🌐 Project Page | ✍️ Paper (arXiv) | 🎥 Demo

Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, Zhengping Che, Jian Tang, Pheng-Ann Heng, Shanghang Zhang

🤖LaST₀ is a framework that enables efficient reasoning before acting through a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize. Specifically, we introduce a token-efficient latent CoT space that models future visual dynamics, 3D structural information, and robot proprioceptive states, and further extend these representations across time to enable temporally consistent implicit reasoning trajectories. Furthermore, LaST₀ adopts a dual-system architecture implemented via a Mixture-of-Transformers design, where a reasoning expert conducts low-frequency latent inference and an acting expert generates high-frequency actions conditioned on robotics-oriented latent representations. To facilitate coordination, LaST₀ is trained with heterogeneous operation frequencies, enabling adaptive switching during deployment.
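To make the dual-frequency idea concrete, here is a toy sketch of a loop in which a slow "reasoning" step refreshes a cached latent only every few control steps, while a fast "acting" step runs at every step and conditions on that cached latent. Everything in it is an illustrative assumption: the class `DualSystemLoop`, the `reason_every` parameter, and the arithmetic stand-ins for the two experts are hypothetical and are not the LaST₀ implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DualSystemLoop:
    """Toy slow-reasoning / fast-acting loop (illustrative only)."""

    reason_every: int  # hypothetical: control steps between latent updates
    latent: float = 0.0
    reasoning_calls: int = 0
    actions: List[float] = field(default_factory=list)

    def reason(self, observation: float) -> float:
        # Stand-in for the reasoning expert: returns a scalar "latent".
        self.reasoning_calls += 1
        return 0.5 * observation

    def act(self, observation: float, latent: float) -> float:
        # Stand-in for the acting expert: conditions on the cached latent.
        return observation + latent

    def step(self, t: int, observation: float) -> float:
        # Refresh the latent only on every `reason_every`-th step.
        if t % self.reason_every == 0:
            self.latent = self.reason(observation)
        action = self.act(observation, self.latent)
        self.actions.append(action)
        return action


loop = DualSystemLoop(reason_every=4)
for t in range(8):
    loop.step(t, observation=float(t))
print(loop.reasoning_calls)  # reasoning ran at t=0 and t=4 only
```

In this sketch the acting step is called eight times while the reasoning step is called twice, which is the kind of frequency gap the paragraph above describes.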

✨ News ✨

[2026/03/05] The code of LaST₀ is released, including training and evaluation on various benchmarks. We also release our pretrained and fine-tuned checkpoints! The real-world scripts are coming soon! 🚀
[2026/02/02] A new version of LaST₀ has been posted on arXiv, with more real-world experiments added, including tabletop manipulation, mobile manipulation, and dexterous hand manipulation! 🚀
[2026/01/08] LaST₀ is now live on arXiv! The code is also coming soon. 🚀

📦 Installation

The code is built with Python 3.10, and we recommend using Python 3.10 or later. We require PyTorch >= 2.2.0 and CUDA >= 12.0 (it may run with lower versions, but we have not tested them). We recommend using Miniconda and creating an environment as follows:

cd last0
conda create -n last0 python=3.10
conda activate last0

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"

# Install Flash Attention 2
# =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install "flash-attn==2.5.5" --no-build-isolation
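After installation, a quick sanity check of the version requirements stated above (PyTorch >= 2.2.0, CUDA >= 12.0) can save a failed training launch. The following is a minimal sketch, not part of the repository; the helper names `parse_version`, `meets_minimum`, and `report_environment` are hypothetical.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.6.0+cu124' into (2, 6, 0), ignoring local build suffixes."""
    core = text.split("+")[0]
    parts = [int(p) for p in re.findall(r"\d+", core)[:3]]
    parts += [0] * (3 - len(parts))  # pad '12.4' to (12, 4, 0)
    return tuple(parts)


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically."""
    return parse_version(found) >= parse_version(minimum)


def report_environment() -> None:
    """Print whether the installed PyTorch/CUDA meet the stated minimums."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    print("torch", torch.__version__, "ok:", meets_minimum(torch.__version__, "2.2.0"))
    cuda = torch.version.cuda  # None for CPU-only builds
    print("cuda", cuda, "ok:", cuda is not None and meets_minimum(cuda, "12.0"))
    print("GPU visible:", torch.cuda.is_available())


if __name__ == "__main__":
    report_environment()
```

Run it inside the activated `last0` environment. It only reports what it finds and does not modify the installation.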

If you want to test on LIBERO or RLBench, refer to the corresponding section below for environment setup details.

🧩 Framework

Our code is built on Janus and Mirage and is organized as follows:

config: config files for training
experiments: launch scripts for LIBERO evaluation
scripts: scripts for training and RLBench evaluation
janus: contains the last0 models, including the last0 backbone, flow matching, VLM, and VLA
transformers: modified version of transformers, adding the MoT architecture
utils: contains various utility functions
vla: adapted from OpenVLA's vla structure, including the action tokenizer, etc.

🤗 Model Zoo

We release our pretrained model parameters on Hugging Face as follows:

Robotic Large-Scale Pretrained Checkpoint for Action Expert: Action Chunk = 8, Action Chunk = 16
LIBERO SFT Checkpoints: LIBERO Spatial, LIBERO Object, LIBERO Goal, LIBERO 10
RLBench SFT Checkpoint

💡Getting Started

For quick evaluation, download the released checkpoints and run these scripts:

# LIBERO (action_dim=7; action_chunk=8 for LIBERO_Spatial, action_chunk=16 for other 3 settings)
bash experiments/test_libero.sh

# RLBench (action_dim=7, action_chunk=1)
cd scripts
bash test_rlbench.sh
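The action settings noted in the comments above differ per benchmark, so it can help to keep them in one lookup when wiring up your own evaluation code. The sketch below only restates those values. The dictionary keys, `ActionSpec`, and `flat_action_size` are hypothetical names, and it assumes an action chunk means that many consecutive actions predicted together.

```python
from typing import Dict, NamedTuple


class ActionSpec(NamedTuple):
    action_dim: int
    action_chunk: int


# Keys are hypothetical labels; the values come from the script comments above.
ACTION_SPECS: Dict[str, ActionSpec] = {
    "libero_spatial": ActionSpec(action_dim=7, action_chunk=8),
    "libero_object": ActionSpec(action_dim=7, action_chunk=16),
    "libero_goal": ActionSpec(action_dim=7, action_chunk=16),
    "libero_10": ActionSpec(action_dim=7, action_chunk=16),
    "rlbench": ActionSpec(action_dim=7, action_chunk=1),
}


def flat_action_size(benchmark: str) -> int:
    """Scalar values in one chunk, assuming chunk = consecutive actions."""
    spec = ACTION_SPECS[benchmark]
    return spec.action_dim * spec.action_chunk


print(flat_action_size("libero_spatial"))  # 7 * 8 = 56
```

Make sure the checkpoint you download matches the chunk size of the benchmark you evaluate.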

💾 Data Construction

We provide the processed LIBERO data in .npy format on libero data.

Constructing the latent CoT data is essential for LaST₀, so we provide the preprocessing scripts:

cd utils

# for LIBERO
python gen_libero_json_stats.py

# for RLBench
python gen_rlbench_json_stats.py

The RLBench data includes point clouds, while the LIBERO data does not. You can use these two scripts as references for building your own datasets.

📈 Fully Fine-Tuning

To fully fine-tune the pretrained models, we use the accelerate package.

First, download our pretrained model for the action expert, and set PRETRAIN_ACTION_PATH to the absolute path of your local checkpoint.

Then launch the training script. We use one node with 8 A100 GPUs as an example.

cd scripts
bash train.sh

For the LIBERO benchmark and other datasets without point cloud data, we provide a simplified version that does not use point clouds. In this version, the latent size is reduced to 8.

cd scripts
bash train_wopc.sh

🔍Test in LIBERO

We evaluated LaST₀ on LIBERO, where it achieves state-of-the-art performance. First, install the LIBERO dependencies:

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt

For more details on the evaluation scripts, refer to OpenVLA-OFT. Then run the evaluation script:

# run from the repository root (/path/to/last0)
bash experiments/test_libero.sh

🔍Test in RLBench

We also evaluated LaST₀ on RLBench, which is based on the CoppeliaSim simulator. Follow the steps below to set up the environment for RLBench testing. Thanks to the amazing work LIFT3D.

export COPPELIASIM_ROOT=${HOME}/CoppeliaSim
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT

wget https://downloads.coppeliarobotics.com/V4_1_0/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
mkdir -p $COPPELIASIM_ROOT && tar -xf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz -C $COPPELIASIM_ROOT --strip-components 1
rm -rf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz

cd LIFT3D/third_party/RLBench
pip install -e .
cd ../../..

cd LIFT3D
pip install -e .
cd ..
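RLBench fails in confusing ways when the CoppeliaSim variables are inconsistent, so it may be worth checking the three exports above before launching a test. This is a minimal sketch under that assumption; `check_coppeliasim_env` is a hypothetical helper, not part of the repository, RLBench, or LIFT3D.

```python
import os
from typing import List, Mapping


def check_coppeliasim_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems with the CoppeliaSim-related variables."""
    root = env.get("COPPELIASIM_ROOT", "")
    if not root:
        return ["COPPELIASIM_ROOT is not set"]
    problems: List[str] = []
    # Exact string match, mirroring how the exports above are written.
    lib_paths = env.get("LD_LIBRARY_PATH", "").split(":")
    if root not in lib_paths:
        problems.append("COPPELIASIM_ROOT is missing from LD_LIBRARY_PATH")
    if env.get("QT_QPA_PLATFORM_PLUGIN_PATH") != root:
        problems.append("QT_QPA_PLATFORM_PLUGIN_PATH does not point at COPPELIASIM_ROOT")
    return problems


if __name__ == "__main__":
    found = check_coppeliasim_env(os.environ)
    print("\n".join(found) if found else "CoppeliaSim variables look consistent")
```

The function takes a plain mapping rather than reading `os.environ` directly, so it can also be used to validate a launch configuration before the variables are exported.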

📜️ License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 BibTeX

@misc{liu2026last0latentspatiotemporalchainofthought,
      title={LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model}, 
      author={Zhuoyang Liu and Jiaming Liu and Hao Chen and Jiale Yu and Ziyu Guo and Chengkai Hou and Chenyang Gu and Xiangju Mi and Renrui Zhang and Kun Wu and Zhengping Che and Jian Tang and Pheng-Ann Heng and Shanghang Zhang},
      year={2026},
      eprint={2601.05248},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.05248}, 
}
