This is the official repository of ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation. For technical details, please refer to our paper at ACM SenSys 2025:
ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal Images
Fangqiang Ding1,*, Yunzhou Zhu2,*, Xiangyu Wen1, Gaowen Liu3, Chris Xiaoxuan Lu4,†
[arXiv] [page] [demo] [data-a] [data-b] [slide]
1University of Edinburgh, 2Georgia Institute of Technology, 3Cisco, 4UCL
*Equal contribution, †Corresponding author
- [2024-02-27] Our preprint paper is available on arXiv.
- [2024-08-26] Our automatic annotation pipeline code is uploaded.
- [2025-02-24] Our paper is accepted to ACM SenSys 2025 (acceptance rate ≈18%)!
- [2025-03-24] Our TherFormer baseline code is uploaded. Stay tuned for updates!
- [2025-04-22] Our ThermoHands oral presentation slides are uploaded. Check the slides for more details.
If you find our work helpful to your research, please consider citing:
@InProceedings{Ding_2025_Sensys,
title={ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal Images},
author={Ding, Fangqiang and Zhu, Yunzhou and Wen, Xiangyu and Liu, Gaowen and Lu, Chris Xiaoxuan},
booktitle={23rd ACM Conference on Embedded Networked Sensor Systems (Sensys)},
year={2025}
}

Designing egocentric 3D hand pose estimation systems that can perform reliably in complex, real-world scenarios is crucial for downstream applications. Previous approaches using RGB or NIR imagery struggle in challenging conditions: RGB methods are susceptible to lighting variations and obstructions like handwear, while NIR techniques can be disrupted by sunlight or interference from other NIR-equipped devices. To address these limitations, we present ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation, demonstrating the potential of thermal imaging to achieve robust performance under these conditions. The benchmark includes a multi-view and multi-spectral dataset collected from 28 subjects performing hand-object and hand-virtual interactions under diverse scenarios, accurately annotated with 3D hand poses through an automated process. We introduce a new baseline method, TherFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TherFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.
Please see more visualizations in our demo video and on our project page.
This video shows an example of 3D hand pose annotations. We show the left (blue) and right (red) hand 3D joints projected onto RGB images. From the same viewpoint, we also visualize the corresponding hand mesh annotation.
This part shows qualitative results for different spectra under the well-illuminated office (main) setting. 3D hand joints are projected onto 2D images for visualization. The ground-truth hand pose is shown in green and the predictions in blue.
Qualitative results for thermal vs. RGB (NIR) under our four auxiliary settings, i.e., the glove, darkness, sun glare and kitchen scenarios. We show the left (blue) and right (red) hand 3D joints projected onto 2D images.
Our full dataset is split into two parts: the main part is collected under the well-illuminated office setting, while the auxiliary part is obtained under our four challenging settings.
- Main part - [Download link]
- Auxiliary part - [Download link]
After downloading, the dataset directory structure should look like this:
${DATASET_ROOT}
|-- egocentric
| |-- subject_01
| | |-- cut_paper
| | | |-- rgb
| | | |-- depth
| | | |-- gt_info
| | | |-- thermal
| | | |-- ir
| | |-- fold_paper
| | |-- ...
| | |-- write_with_pencil
| |-- subject_01_gestures
| | |-- tap
| |-- ...
| |-- subject_02
|-- exocentric
where, for the same subject, we place hand-object interaction actions in the subject_xx folder and hand-virtual interaction actions in the subject_xx_gestures folder. Data captured from the egocentric and exocentric views is stored separately.
If you would like to use our calibration information for further development, please refer to our calibration folder.
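As a quick sanity check after downloading, a minimal sketch like the one below can be used to list the frames of one sequence. It assumes the folder layout shown above; exact folder names and file extensions should be checked against the downloaded data.

```python
import os

# Minimal sketch (not part of the released code): collect per-modality frame paths
# for one action sequence, following the directory layout shown above. Folder names
# and file extensions should be checked against the actual downloaded data.
def list_sequence_frames(dataset_root, subject, action, view="egocentric"):
    seq_dir = os.path.join(dataset_root, view, subject, action)
    frames = {}
    for modality in ("rgb", "depth", "ir", "thermal", "gt_info"):
        mod_dir = os.path.join(seq_dir, modality)
        if os.path.isdir(mod_dir):
            frames[modality] = sorted(
                os.path.join(mod_dir, f) for f in os.listdir(mod_dir)
            )
    return frames

frames = list_sequence_frames("/path/to/ThermoHands", "subject_01", "cut_paper")
print({k: len(v) for k, v in frames.items()})
```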
We tested our code in the following environment:
Ubuntu 20.04
python 3.9
pytorch 2.2.0
CUDA 11.8
Please install a PyTorch version compatible with your device according to the official PyTorch web page. Other packages can be installed via pip:
cd TherFormer
pip install -r requirements.txt
Our code also depends on libyana and DAB-DETR:
pip install git+https://github.com/hassony2/[email protected]
cd models/dab_deformable_detr/ops
python setup.py build install
Following HTT, we use lmdb during training, so data preprocessing is required before training. Please refer to make_lmdb.py as an example.
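For reference, a minimal sketch of such a conversion with the lmdb package is shown below; the key/value layout and map_size here are assumptions for illustration, so please follow make_lmdb.py for the exact format TherFormer expects.

```python
import os
import lmdb

# Illustrative sketch only (see make_lmdb.py for the actual preprocessing script):
# pack the raw image files of one modality folder into an lmdb database, keyed by
# file name. The key/value layout and map_size are assumptions.
def folder_to_lmdb(image_dir, lmdb_path, map_size=1 << 36):
    env = lmdb.open(lmdb_path, map_size=map_size)
    with env.begin(write=True) as txn:
        for fname in sorted(os.listdir(image_dir)):
            with open(os.path.join(image_dir, fname), "rb") as f:
                txn.put(fname.encode("utf-8"), f.read())
    env.close()

folder_to_lmdb("/path/to/subject_01/cut_paper/thermal", "/path/to/cache/thermal.lmdb")
```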
To train TherFormer-V for thermal images:
cd TherFormer
python train_baseline.py --dataset_folder [your_data_folder] --cache_folder [your_workspace_path] --train_dataset thermal
Similarly, to train the model for IR images:
python train_baseline.py --dataset_folder [your_data_folder] --cache_folder [your_workspace_path] --train_dataset thermal_ir --experiment_tag THEFomer_ir
For the non-video version, please set both the "--ntokens_pose" and "--ntokens_action" to 1.
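For example, using the flags named above, the thermal training command then becomes:

python train_baseline.py --dataset_folder [your_data_folder] --cache_folder [your_workspace_path] --train_dataset thermal --ntokens_pose 1 --ntokens_action 1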
Besides our baseline code, we also release our self-developed automatic hand pose annotation tools in thermohands-annotator. You can use this code to generate 3D hand pose annotations for your own data.
Note: our automatic annotation method is designed for two-view settings and demands depth and RGB image capture from each view, as well as camera calibration. However, you can modify our code to fit your own setting if necessary.
First, please organize your captured data into a directory structure like this:
${DATA_ROOT}
|-- calibration
| |-- ego_calib.json
| |-- exo_calib.json
|-- subject_01
| |-- cut_paper
| | |-- egocentric
| | | |-- depth
| | | | |-- 1707151988183.png
| | | | |-- ...
| | | |-- ir
| | | | |-- 1707151988183.png
| | | | |-- ...
| | | |-- rgb
| | | | |-- 1707151988183.png
| | | | |-- ...
| | | |-- thermal
| | | | |-- 1707151988186334431.tiff
| | | | |-- ...
| | |-- exocentric
| | | |-- depth
| | | |-- rgb
| |-- fold_paper
| | |-- egocentric
| | |-- exocentric
| |-- ...
|-- subject_02
|-- ...
where ego_calib.json stores the camera intrinsics and extrinsics for the egocentric platform and exo_calib.json those for the exocentric platform. Please refer to our calibration folder.
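For illustration, the calibration can be loaded with standard JSON tooling and used, e.g., to back-project depth pixels. The key names in this sketch ("thermal", "K") are hypothetical, so please check the files in the calibration folder for the actual schema.

```python
import json
import numpy as np

# Sketch only: the key names below ("thermal", "K") are hypothetical -- inspect
# the files in the calibration folder for the actual schema.
with open("/path/to/calibration/ego_calib.json") as f:
    ego_calib = json.load(f)

K = np.array(ego_calib["thermal"]["K"]).reshape(3, 3)  # 3x3 intrinsic matrix (assumed)

# Standard pinhole back-projection of a depth pixel (u, v) with depth z (metres)
# into the camera frame, as used when lifting depth images to point clouds.
def backproject(u, v, z, K):
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

print(backproject(320.0, 240.0, 0.5, K))
```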
Second, install all libraries used by our tools with the following command (an independent conda environment is recommended):
pip install -r requirements.txt
Also, download the Segment-Anything pretrained model from here and save it as sam/sam_vit_l_0b3195.pth.
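If you want to verify the checkpoint is in the expected place, a quick check like the following should work, assuming the official segment-anything package is installed (model loading inside the annotator may differ):

```python
from segment_anything import SamPredictor, sam_model_registry

# Quick sanity check: load the ViT-L checkpoint from the path expected by the
# annotator. Assumes the official segment-anything package is installed.
sam = sam_model_registry["vit_l"](checkpoint="sam/sam_vit_l_0b3195.pth")
predictor = SamPredictor(sam)
print("SAM ViT-L checkpoint loaded")
```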
Then, run our pipeline code for automatic 3D hand pose annotation by the following command:
cd thermohands-annotator
python pipeline.py \
--subs 01 18 25 \ # replace with your own subject indices
--root_dir /path/to/data \ # where your captured data is placed
--save_dir /path/to/output # where you save the annotation output
Our pipeline contains 7 steps:
- Visualize all captured data and save the images
- Infer point clouds from the egocentric depth images and run KISS-ICP to obtain the odometry
- Annotate markers in the first frames and calculate the transformation between the two views. This step requires a graphical interface for marker labelling. Please label the markers in the same order in both views.
- Infer 2D hand poses and masks, then infer 3D hand poses and hand point clouds
- Optimize the 3D hand pose by fitting MANO model
- Generate 2D mask ground truth based on hand pose ground truth
- Make annotation movies
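For reference, the joint overlays used when making the annotation movies boil down to standard pinhole projection; a self-contained illustration (not the pipeline's own code, with placeholder intrinsics) looks like this:

```python
import numpy as np

# Illustration only (not the pipeline's own code): project 3D hand joints
# (N x 3, camera frame, metres) onto the image plane with a 3x3 intrinsic matrix.
def project_joints(joints_3d, K):
    uv = (K @ joints_3d.T).T        # homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]   # divide by depth to get pixel coordinates

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])     # placeholder intrinsics
joints = np.random.rand(21, 3) + np.array([0.0, 0.0, 0.5])  # 21 joints per hand
print(project_joints(joints, K)[:3])
```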
You can use state_dataset.py to summarize your dataset and th2player.py to visualize the 3D hand meshes. Modify the path parameters in these files to adapt them to your own dataset.
Many thanks to these excellent projects: libyana, DAB-DETR, HTT, KISS-ICP, Segment-Anything and MANO.