Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models
Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao
RSS 2025
This is the official implementation of Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions.
This repository contains code for the two critical components of Manual2Skill:
- Hierarchical Assembly Graph Generation
- Per-step Assembly Pose Estimation
- Clone repository:
git clone https://github.com/owensun2004/Manual2Skill.git
cd Manual2Skill
- Install dependencies:
conda env create -f environment.yml
# for Blender rendering of pre-assembly scene variations
wget https://download.blender.org/release/Blender3.6/blender-3.6.19-linux-x64.tar.xz
tar -xf blender-3.6.19-linux-x64.tar.xz
cd blender-3.6.19-linux-x64
sudo ln -s $(pwd)/blender /usr/local/bin/blender
cd ..
rm blender-3.6.19-linux-x64.tar.xz
Note: If you run into trouble installing pytorch3d, we recommend installing a prebuilt wheel, for example:
pip install --no-index --no-cache-dir pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py310_cu117_pyt201/download.html
More detailed installation guidance for pytorch3d can be found at https://github.com/facebookresearch/pytorch3d/blob/main/INSTALL.md
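After installation, a quick sanity check can save time later. This is a minimal sketch, assuming only that torch and pytorch3d import cleanly inside the conda environment; version numbers will vary with your setup.
```python
# Sanity check (sketch): confirm torch and pytorch3d are importable and report their versions.
import torch
import pytorch3d

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pytorch3d:", pytorch3d.__version__)
```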
- Set your OpenAI API key:
export OPENAI_API_KEY="your-api-key"
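The inference scripts are assumed to read the key from the environment; a minimal sketch to confirm it is visible before launching a run:
```python
# Sketch: check that OPENAI_API_KEY is set in the current environment.
import os

key = os.environ.get("OPENAI_API_KEY")
if not key:
    raise SystemExit("OPENAI_API_KEY is not set; export it in this shell first.")
print(f"Found OpenAI API key ending in ...{key[-4:]}")
```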
This section includes scripts for generating variations of the pre-assembly scenes, running VLM inference on the 102 furniture items to generate hierarchical assembly graphs, and evaluating the VLM-generated results.
cd VLM_assembly_plan_gen
Download and extract the data for the 102 furniture manuals and pre-assembly scenes:
mkdir data
gdown https://drive.google.com/uc?id=1hPesH_zd_NMd842JGaXaUxkviLU2Th4L
unzip data.zip -d ./data
This section explains how to use Blender to generate variations of the pre-assembly scenes:
python scene_gen/generator.py --rand_translate true --rand_rotate true
- --rand_translate true randomly shuffles furniture parts
- --rand_rotate true randomly rotates furniture parts in place
- If neither argument is specified, the original pre-assembly scenes from data/preassembly_scenes will be regenerated
Generated scenes are saved as scene_rot.png and scene_rot_annotated.png under data/preassembly_scenes for each furniture item. Each run produces unique scenes due to randomness.
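A quick way to confirm the renders were written is sketched below; it only assumes the per-item subfolder layout under data/preassembly_scenes described above.
```python
# Sketch: verify that both rendered images exist for each furniture item.
# Assumes one subfolder per furniture item under data/preassembly_scenes.
from pathlib import Path

scenes_root = Path("data/preassembly_scenes")
for item_dir in sorted(p for p in scenes_root.iterdir() if p.is_dir()):
    has_renders = (item_dir / "scene_rot.png").exists() and (item_dir / "scene_rot_annotated.png").exists()
    print(f"{item_dir.name}: {'ok' if has_renders else 'missing renders'}")
```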
This section explains how to generate an assembly graph for each furniture item. Each item uses approximately 15 images and 1700 words, and takes around 1-2 minutes.
Example (default parameters with original pre-assembly scene for all furniture):
python inference/run.py
Outputs will be stored in outputs/[timestamp]/, with each furniture item containing a tree.json that stores the predicted assembly graph as a nested array.
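To get a feel for the output format, the sketch below walks a tree.json. The only documented fact is that the graph is stored as a nested array; treating leaf entries as part identifiers and the per-furniture subfolder name ("0") are assumptions for illustration.
```python
# Sketch: recursively print the nesting structure of a predicted assembly graph.
import json

def print_tree(node, depth=0):
    indent = "  " * depth
    if isinstance(node, list):
        # A nested list is interpreted here as a subassembly of its children.
        print(f"{indent}subassembly with {len(node)} children")
        for child in node:
            print_tree(child, depth + 1)
    else:
        # Non-list entries are assumed to identify individual parts.
        print(f"{indent}part {node}")

# Hypothetical path: replace with a per-furniture folder inside your run's output directory.
with open("outputs/2025_05_01_193302/0/tree.json") as f:
    print_tree(json.load(f))
```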
Custom Parameters Example
python inference/run.py \
--start 0 \
--end 2 \
--temperature 0.1 \
--model o1 \
--prompt_type not_numbered \
--scene_type varied \
--debug
Parameters
- --start and --end: Furniture item indices (0-101), if you want to run inference on a specific subset of furniture
- --temperature: Controls generation randomness (0-1)
- --model: Selects from gpt-4o, gpt-4.5, o1, or o3
- --prompt_type: Selects from numbered or not_numbered manuals, indicating whether a number marking the current assembly step appears next to each input manual image
- --scene_type: Selects either the original pre-assembly scene or the varied pre-assembly scene that you generated with Blender above
- --debug: Enables detailed input/output logging, saved as .txt and .json files under the output folder of the current run
Overall, gpt-4o with a temperature of 0 performs well. The prompt_type and scene_type settings have little effect on performance.
Navigate to the evaluation scripts (assuming you are currently under VLM_assembly_plan_gen):
cd eval
To test the success rates of your generated assembly graphs, pass the name of your inference output folder to --tree_dir. For example, if your most recent inference output folder is named 2025_05_01_193302, you can run:
export PYTHONPATH=$PYTHONPATH:$(pwd)
python manual_generation/test_accuracy.py \
--data_json ../data/main_data.json \
--parts_dir ../data/parts \
--part_features_pkl resources/features_dgcnn_1024_102.pkl \
--tree_dir 2025_05_01_193302
Parameters
- --data_json, --parts_dir, --part_features_pkl: Mandatory arguments
- --tree_dir: Name of the output directory containing the generated assembly graphs to evaluate
  - Examples: your own inference folder mentioned above, ours (paper results), singlestep or geocluster (baselines)
- --difficulty: Evaluate by complexity (easy only evaluates furniture with 2-4 parts, medium covers 5-6 parts, hard covers 7-8 parts, impossible covers 9-19 parts, and all evaluates all furniture with 2-19 parts)
- --debug: Print the predicted assembly graph for each individual furniture item
Sometimes you may encounter json.decoder.JSONDecodeError: Extra data. This stems from the VLM's limitations: it may output an incorrectly formatted JSON file. In that case, set the --debug parameter to see which furniture item has a malformed tree.json file, and edit the file accordingly; a small helper sketch for locating such files follows.
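A minimal sketch for finding malformed files ahead of time; the run folder name is hypothetical and the relative path assumes you are inside the eval directory as above.
```python
# Sketch: scan an inference output folder and report any tree.json that fails to parse.
import json
from pathlib import Path

run_dir = Path("../outputs/2025_05_01_193302")  # replace with your own run folder
for tree_path in sorted(run_dir.rglob("tree.json")):
    try:
        json.loads(tree_path.read_text())
    except json.JSONDecodeError as e:
        print(f"Malformed: {tree_path} -> {e}")
```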
Disclaimer: Because of the complexity of the prompts and the multi-stage VLM querying scheme, the VLM may output different results even for the same inputs. This may result in slightly different success rates compared to the metrics reported in the paper. We expect the success rates to further increase as better VLMs are introduced.
Per-step Assembly Pose Estimation
1. Data Preparation
Download the official PartNet dataset or our processed dataset (recommended) and unzip it under pose_estimation/dataset/raw_data. When this is done, the directory structure should look like the following:
blender-3.6.19-linux-x64
├── blender
└── *
Manual2Skill
├── pose_estimation
│   ├── dataset
│   │   ├── data_generation
│   │   └── raw_data
│   │       ├── Bag_8340
│   │       ├── Bag_8383
│   │       ...
│   │       └── Vase_6498
│   └── pose_estimation
│       └── *
└── *
In pose_estimation/dataset/data_generation, run python generate_data.py to generate the training data (the other Python scripts in this directory are used to drive Blender). This script may take a while to finish because we generate subassembly variants using three strategies (shapes, subassemblies, camera views). After it finishes, the project structure should look like the following (taking the Chair category as an example; a small inspection sketch follows the tree):
Manual2Skill
└── *
partnet_chair_mono
├── Chair_38784
│   ├── raw_data_Chair_38784__0_1_2_3
│   │   ├── 0
│   │   │   ├── camera_pose.txt
│   │   │   ├── data.npz
│   │   │   └── image.png
│   │   ...
│   │   ├── 19
│   │   └── *
│   └── *
└── *
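To spot-check a generated sample, the sketch below loads one of the numbered folders. Only the file names (camera_pose.txt, data.npz, image.png) come from the tree above; the path prefix and the keys inside data.npz are not documented here, so the script just lists them.
```python
# Sketch: inspect one generated training sample.
from pathlib import Path
import numpy as np

# Example path taken from the tree above; adjust the prefix to wherever
# generate_data.py wrote partnet_chair_mono on your machine.
sample_dir = Path("partnet_chair_mono/Chair_38784/raw_data_Chair_38784__0_1_2_3/0")

data = np.load(sample_dir / "data.npz")
print("arrays stored in data.npz:", data.files)  # key names are not documented here
print("camera pose:")
print((sample_dir / "camera_pose.txt").read_text())
```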
2. Train
The training configs are in pose_estimation/configs/base_pose_estimate.py. Most operations are already wrapped, so you can directly run python train_new_no_dist.py to perform end-to-end training and obtain the model weights. To speed up subsequent training runs, we use pickle to serialize the data generated by generate_data.py, so the first run of train_new_no_dist.py may take some extra time.
The model weights are saved in pose_estimation/logs.
In pose_estimation/eval.py, set cfg.train.pretrained_weights to the path of the model weights you want to evaluate, e.g., ./logs/GNNNetwork_partnet_chair_mono_04-19-21-08-36/best.ckpt. Then run python eval.py; the results are saved in pose_estimation/logs.
Part of the code is adapted from IKEA-Manual. We thank the authors for their excellent work!
If you find our work useful, please cite:
@inproceedings{tie2025manual,
title = {Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models},
author = {Tie, Chenrui and Sun, Shengxiang and Zhu, Jinxuan and Liu, Yiwei and Guo, Jingxiang and Hu, Yue and Chen, Haonan and Chen, Junting and Wu, Ruihai and Shao, Lin},
booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
year = {2025}
}