The Depth-Consistent Human Modeling (DCHM) framework enhances multiview pedestrian detection by achieving annotation-free 3D human modeling through superpixel-wise Gaussian Splatting, outperforming existing methods in challenging crowded scenes. Note: the code release is in progress.
Check our website for videos and reconstruction results!
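The core idea of superpixel-wise Gaussian Splatting is to initialize one Gaussian per superpixel rather than per pixel. A minimal numpy sketch of that initialization idea (the function, the pinhole back-projection, and all array names are our own illustration, not the repository's API):

```python
import numpy as np

def init_gaussians_from_superpixels(labels, depth, K):
    """For each superpixel, back-project its pixels using the depth map and
    take the 3D centroid as the mean of one Gaussian (illustrative only)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.indices(labels.shape)          # pixel rows (v) and columns (u)
    z = depth
    # Pinhole back-projection of every pixel into camera coordinates
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=-1)
    means = []
    for sp in np.unique(labels):
        means.append(pts[labels == sp].mean(axis=0))  # 3D centroid per superpixel
    return np.stack(means)                   # (num_superpixels, 3) Gaussian means

# Toy example: a 4x4 image split into two superpixels at constant depth 2 m.
labels = np.zeros((4, 4), dtype=int); labels[:, 2:] = 1
depth = np.full((4, 4), 2.0)
K = np.array([[100.0, 0, 2.0], [0, 100.0, 2.0], [0, 0, 1.0]])
means = init_gaussians_from_superpixels(labels, depth, K)
print(means.shape)  # (2, 3)
```

One Gaussian per superpixel keeps the number of primitives small, which is what makes per-frame optimization over many views tractable.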
- Release source code for inference and training.
- Release the spatially consistent pseudo-depth labels.
- Release the model checkpoints.
Follow these steps to set up the DCHM codebase on your system.
```bash
git clone https://github.com/Jiahao-Ma/DCHM-code
cd DCHM-code
conda create -n DCHM python=3.10
conda activate DCHM
pip3 install torch torchvision torchaudio  # use the correct CUDA version for your system
```
Following the Grounded-SAM preparation process, create a `submodules` folder and clone the required libraries into it.
- **Grounded-SAM**
```bash
# Clone Grounded-SAM and install dependencies
mkdir submodules && cd submodules
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
cd Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install --upgrade "diffusers[torch]"
git submodule update --init --recursive
cd grounded-sam-osx && bash install.sh
git clone https://github.com/xinyu1205/recognize-anything.git
pip install -r ./recognize-anything/requirements.txt
pip install -e ./recognize-anything/
# Download the pre-trained models
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```
- **GSplat**
We customize Gaussian Splatting (GSplat) for rendering, with the implementation available in `submodules/customized_gsplat/rendering.py`.
```bash
pip install gsplat
```
- **DepthAnything v2**
```bash
git clone https://github.com/DepthAnything/Depth-Anything-V2
```
- **MvCHM**
We evaluate our consistent representation on supervised multi-view detection using MvCHM.
```bash
git clone https://github.com/Jiahao-Ma/MvCHM.git
```
```bash
# The code and data are coming soon ...
python supervised_cluster.py \
    --root 'path/to/wildtrack_data_gt' \
    --round 2_1 \
    --n-segments 30 \
    --start-with 0 \
    --end-with -1 \
    --fun_type 'Inference'
```

```bash
python unsupervised_cluster.py \
    --root 'path/to/wildtrack_data_gt' \
    --round 2_1 \
    --n-segments 30 \
    --start-with 0 \
    --end-with 10 \
    --min_gs_threshold 10
```
We provide the initial training pipeline below; for the complete iterative process, please refer to `pipeline.sh`.
If you prefer to skip the training process for generating consistent pseudo-depth labels:
⬇️ Download pre-generated labels (Coming soon)
Note: The pseudo-depth labels are currently in preparation and will be available shortly. We recommend checking back later or following our release updates.
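For intuition, "spatially consistent" means that a pixel's depth, back-projected into 3D and transformed into another camera, should agree with that camera's depth prediction at the corresponding point. A hedged numpy sketch of such a cross-view consistency check (the camera convention, threshold, and function name are illustrative, not the repository's implementation):

```python
import numpy as np

def depth_consistent(p_uv, depth_a, K_a, T_ab, depth_b_at, thresh=0.05):
    """Check whether the depth at pixel p_uv in view A agrees with view B.

    T_ab: 4x4 rigid transform taking view-A camera coords to view-B coords.
    depth_b_at: depth that view B predicts for the transformed 3D point.
    """
    u, v = p_uv
    z = depth_a[v, u]
    # Back-project pixel (u, v) with depth z into view-A camera coordinates
    x = (u - K_a[0, 2]) * z / K_a[0, 0]
    y = (v - K_a[1, 2]) * z / K_a[1, 1]
    p_b = T_ab @ np.array([x, y, z, 1.0])
    # Consistent if view B's depth matches the transformed point's depth
    return abs(p_b[2] - depth_b_at) < thresh * z

K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
depth_a = np.full((480, 640), 3.0)
T_ab = np.eye(4); T_ab[0, 3] = 0.5   # 0.5 m sideways baseline
print(depth_consistent((320, 240), depth_a, K, T_ab, 3.0))   # True
print(depth_consistent((320, 240), depth_a, K, T_ab, 2.0))   # False
```

Pixels failing such a check across views are exactly the ones an iterative pipeline would relabel before the next fine-tuning round.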
```bash
# Step 1: Extract video frames
DATA_DIR=/path/to/wildtrack_data_gt
python s01_data_download.py \
    --data-dir ${DATA_DIR} \
    --duration 35 \
    --fps 2 \
    --output-folder Image_subsets
```

```bash
# Step 2: Segment the foreground and background using Grounded-SAM
mkdir submodules && cd submodules
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
cd Grounded-Segment-Anything

# Step 2.1: Generate the foreground (human) mask
TARGET1="people"
python s02_sam2_wgt.py \
    --root ${DATA_DIR} \
    --config submodules/Grounded-Segment-Anything/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
    --grounded_checkpoint submodules/Grounded-Segment-Anything/groundingdino_swint_ogc.pth \
    --sam_checkpoint submodules/Grounded-Segment-Anything/sam_vit_h_4b8939.pth \
    --input_image None \
    --output_dir ${DATA_DIR}/masks/${TARGET1} \
    --box_threshold 0.3 \
    --text_threshold 0.25 \
    --text_prompt ${TARGET1} \
    --device "cuda:0"
python s02_sam2.py \
    --root ${DATA_DIR} \
    --config submodules/Grounded-Segment-Anything/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
    --grounded_checkpoint submodules/Grounded-Segment-Anything/groundingdino_swint_ogc.pth \
    --sam_checkpoint submodules/Grounded-Segment-Anything/sam_vit_h_4b8939.pth \
    --input_image None \
    --output_dir ${DATA_DIR}/masks/${TARGET1} \
    --box_threshold 0.3 \
    --text_threshold 0.25 \
    --text_prompt ${TARGET1} \
    --device "cuda:0"
```
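The `--box_threshold` and `--text_threshold` flags follow Grounding DINO's filtering: a box is kept if its strongest token logit against the text prompt exceeds `box_threshold` (with `text_threshold` applied similarly at the token level when extracting phrases). A simplified numpy illustration of the box-level filtering (the logits array is made up for the example):

```python
import numpy as np

# Per-box similarity logits against the text prompt tokens (3 boxes, 4 tokens).
logits = np.array([
    [0.60, 0.10, 0.05, 0.02],   # confident box, kept
    [0.28, 0.25, 0.20, 0.10],   # best token below 0.3, dropped
    [0.45, 0.40, 0.02, 0.01],   # confident box, kept
])
box_threshold = 0.3

# Keep boxes whose strongest token logit clears the threshold
keep = logits.max(axis=1) > box_threshold
print(keep)  # [ True False  True]
```

Lowering `--box_threshold` recovers more pedestrians in crowded frames at the cost of more false-positive masks.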
```bash
# Step 2.2: Generate the background (ground) mask
TARGET2="ground"
python s02_sam2.py \
    --root ${DATA_DIR} \
    --config submodules/Grounded-Segment-Anything/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
    --grounded_checkpoint submodules/Grounded-Segment-Anything/groundingdino_swint_ogc.pth \
    --sam_checkpoint submodules/Grounded-Segment-Anything/sam_vit_h_4b8939.pth \
    --input_image None \
    --output_dir ${DATA_DIR}/masks/${TARGET2} \
    --box_threshold 0.3 \
    --text_threshold 0.25 \
    --text_prompt ${TARGET2} \
    --device "cuda:0"
```

```bash
# Step 3: Initialization for GS. Generate superpixels for the foreground
python s03_super_pixel.py \
    --root ${DATA_DIR} \
    --n_segments "30, 60"
```

```bash
# --- Iterative Matching Label --- #
# --- First Round --- #
# Step 4: Per-frame training for GS
python s04_gs_us.py --root ${DATA_DIR} --round 1_1 --n-segments 30 --start-with 1 --end-with -1

# Step 5: Fine-tuning mono-depth estimation
python s05_depth_finetune.py \
    --epochs 200 \
    --encoder vits \
    --bs 1 \
    --lr 0.000005 \
    --save-path "output/1_1_w_dc_200" \
    --dataset wildtrack \
    --img-size 518 \
    --min-depth 0.001 \
    --max-depth 40 \
    --pretrained-from checkpoints/depth_anything_v2_vits.pth \
    --port 20596 \
    --data_root ${DATA_DIR} \
    --round 1_1
```
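Metric-depth fine-tuning of monocular models (as in Depth Anything v2's metric-depth setup) typically minimizes a scale-invariant log loss (SiLog, after Eigen et al.). A self-contained numpy version for intuition; this is the standard SiLog formulation, not necessarily the exact loss used in `s05_depth_finetune.py`:

```python
import numpy as np

def silog_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss of Eigen et al.: penalizes log-depth errors
    while partially forgiving a global scale offset (controlled by lam)."""
    d = np.log(pred + eps) - np.log(target + eps)
    return np.sqrt(np.mean(d ** 2) - lam * np.mean(d) ** 2)

target = np.array([1.0, 2.0, 4.0])
print(silog_loss(target, target))        # ~0: perfect prediction
print(silog_loss(2.0 * target, target))  # > 0, but damped relative to lam=0
```

With `lam=1` a uniform scale error would cost nothing; `lam=0.5` keeps some pressure toward the correct metric scale, which matters here because the pseudo-depth labels come from multi-view geometry.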
Note: please refer to `pipeline.sh` for the full iterative training process, which is designed to produce consistent pseudo-depth labels and refine the mono-depth model.
Fine-tuning is required for the supervised method only.
```bash
python supervised_cluster.py \
    --root 'path/to/wildtrack_data_gt' \
    --round 2_1 \
    --n-segments 30 \
    --start-with 0 \
    --end-with -1 \
    --fun_type 'Train'
```
The trained checkpoint of the supervised decoder will be released soon.
```bash
python supervised_evaluate.py \
    --pr_dir_pred "output/exp_sup/pr_dir_pred.txt" \
    --pr_dir_gt "output/exp_sup/pr_dir_gt.txt" \
    --sup_decoder_checkpoint "path/to/checkpoint.pth"
```
The trained GS and depth prediction model checkpoints will be released soon.
```bash
python unsupervised_evaluate.py \
    --root 'path/to/wildtrack_data_gt' \
    --round 1_2 \
    --n-segments 30 \
    --start-with 0 \
    --end-with 10 \
    --min_gs_threshold 10
```
If you find our code or paper useful for your research, please consider citing:
```bibtex
@article{ma2025dchm,
  title={DCHM: Depth-Consistent Human Modeling for Multiview Detection},
  author={Ma, Jiahao and Wang, Tianyu and Liu, Miaomiao and Ahmedt-Aristizabal, David and Nguyen, Chuong},
  journal={arXiv preprint arXiv:2507.14505},
  year={2025}
}
```