Honggyu An1* · Jaewoo Jung1* · Mungyeom Kim1 · Sunghwan Hong2 · Chaehyun Kim1 · Kazumi Fukuda3 · Minkyeong Jeon1 · Jisang Han1 · Takuya Narihira3 · Hyuna Ko1 · Junsu Kim1 · Yuki Mitsufuji3,4† · Seungryong Kim1†
*Co-first author, †Co-corresponding author
We propose a feed-forward framework for learning compact 3D representations from unposed images. Our approach estimates only 2K Gaussians, allocated to meaningful regions, enabling generalizable scene reconstruction and understanding.
- Pretrained weights.
- Preprocessed version of Replica dataset.
- Multi-view novel view synthesis evaluation code.
- Probe3d evaluation code.
Our code is developed with PyTorch 2.5.1, CUDA 12.4, and Python 3.11.
We recommend using conda for installation:
conda create -n c3g python=3.11
conda activate c3g
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
Then, download the VGGT pretrained weights from VGGT. Create a folder named pretrained_weights and save the file as model.pt.
Here is an example:
mkdir -p pretrained_weights
wget https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt?download=true -O ./pretrained_weights/model.pt
For LSeg feature lifting, you should download the LSeg pretrained weights:
gdown 1FTuHY1xPUkM-5gaDtMfgCl3D0gR89WV7 -O ./pretrained_weights/demo_e200.ckpt
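As an optional sanity check (not part of the steps above), you can confirm that the CUDA build of PyTorch is active and that both weight files landed where the later commands expect them:

```bash
# Optional sanity check: verify the PyTorch CUDA build and the downloaded weight files.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
ls -lh pretrained_weights/model.pt pretrained_weights/demo_e200.ckpt
```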
For training and multi-view novel view synthesis evaluation, we use the preprocessed RealEstate10K dataset following pixelSplat and MVSplat.
For 3D scene understanding evaluation, we use ScanNet following LSM, and Replica, for which we follow the preprocessing and evaluation protocol of Feature 3DGS.
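The README defers dataset preprocessing to the pixelSplat/MVSplat, LSM, and Feature 3DGS pipelines, so the layout below is only an illustrative assumption, not a required structure; adjust the dataset paths in the Hydra configs to wherever you actually store the data.

```bash
# Hypothetical layout (assumption, not a requirement) for the preprocessed datasets.
mkdir -p datasets/re10k      # preprocessed RealEstate10K in the pixelSplat / MVSplat format
mkdir -p datasets/scannet    # ScanNet scenes prepared following LSM
mkdir -p datasets/replica    # Replica prepared with the Feature 3DGS protocol
```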
Our pretrained checkpoints are available on Hugging Face.
- gaussian_decoder.ckpt: Gaussian Decoder trained for 2-view input.
- gaussian_decoder_multiview.ckpt: Gaussian Decoder trained for multi-view input.
- feature_decoder_lseg.ckpt: Feature Decoder trained with the LSeg model.
- feature_decoder_dinov3L.ckpt: Feature Decoder trained with the DINOv3-L model.
- feature_decoder_dinov2.ckpt: Feature Decoder trained with the DINOv2-L model.
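For a command-line download, a sketch is shown below; the repository name is a hypothetical placeholder (this README only states that the checkpoints are on Hugging Face), so substitute the actual repo ID and the checkpoint files you need.

```bash
# Sketch only: the repo ID below is a hypothetical placeholder, not the real one.
pip install -U huggingface_hub
HF_REPO_ID="your-org/c3g-checkpoints"   # substitute the actual Hugging Face repo ID
huggingface-cli download "$HF_REPO_ID" gaussian_decoder.ckpt feature_decoder_lseg.ckpt \
  --local-dir ./pretrained_weights
```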
To train the Gaussian Decoder, run the following commands.
To train with 2-view input:
python -m src.main +training=gaussian_head wandb.mode=online wandb.name="wandb_name"
To train the Gaussian Decoder when multi-view is available:
python -m src.main +training=gaussian_head_multiview wandb.mode=online wandb.name="wandb_name"
To train the Gaussian Decoder faster when multi-view is available, you can continue from the 2-view training settings:
python -m src.main +training=gaussian_head wandb.mode=online wandb.name="wandb_name" checkpointing.load="2view_checkpoint" model.decoder.low_pass_filter=0.3
If you do not want to log to wandb, just set wandb.mode=disabled.
To train the Feature Decoder, run the following commands.
Important: Update the CUDA Rasterizer
When you change the VFM model, you must update NUM_SEMANTIC_CHANNELS in the rasterizer's config file (a sketch of this edit follows the list of values below).
File: ./submodules/diff_gaussian_rasterization_w_feature_detach/cuda_rasterizer/config.h
Values:
- 512 for LSeg
- 768 for DINOv2-base
- 1024 for DINOv2-large / DINOv3-large
- 128 for VGGT-tracking
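A minimal sketch for the LSeg case (512 channels) is shown below. It assumes config.h sets the value with a line of the form #define NUM_SEMANTIC_CHANNELS <N>, and that the rasterizer submodule was installed with pip, so it is reinstalled to recompile the CUDA extension; if your setup differs, edit the file by hand and rebuild the submodule however it was originally installed.

```bash
# Sketch, assuming config.h contains a line `#define NUM_SEMANTIC_CHANNELS <N>`.
RASTERIZER=./submodules/diff_gaussian_rasterization_w_feature_detach
sed -i 's/#define NUM_SEMANTIC_CHANNELS .*/#define NUM_SEMANTIC_CHANNELS 512/' \
  "$RASTERIZER/cuda_rasterizer/config.h"   # 512 for LSeg (see the values above)
pip install -e "$RASTERIZER"               # reinstall to recompile the CUDA extension
```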
To train the Feature Decoder with various VFM models (we tested LSeg, DINOv2-base, DINOv2-large, DINOv3-large, and VGGT-tracking):
## for LSeg
python -m src.main +training=feature_head_lseg wandb.mode=online wandb.name="wandb_name" model.encoder.pretrained_weights="2view_checkpoint"
## for DINOv2-base
python -m src.main +training=feature_head_dinov2_B wandb.mode=online wandb.name="wandb_name" model.encoder.pretrained_weights="2view_checkpoint"
## for DINOv2-large
python -m src.main +training=feature_head_dinov2_L wandb.mode=online wandb.name="wandb_name" model.encoder.pretrained_weights="2view_checkpoint"
## for DINOv3-large
python -m src.main +training=feature_head_dinov3_L wandb.mode=online wandb.name="wandb_name" model.encoder.pretrained_weights="2view_checkpoint"
## for VGGT-tracking
python -m src.main +training=feature_head_vggt wandb.mode=online wandb.name="wandb_name" model.encoder.pretrained_weights="2view_checkpoint"
If you do not want to log to wandb, just set wandb.mode=disabled.
This is an example of training the Feature Decoder when multi-view input is available:
## for LSeg
python -m src.main +training=feature_head_lseg_multiview wandb.mode=online wandb.name="wandb_name" model.encoder.pretrained_weights="multiview_checkpoint"
Evaluation code for novel view synthesis on the RealEstate10K dataset when only 2 views are available:
python -m src.main +evaluation=re10k mode=test dataset/view_sampler@dataset.re10k.view_sampler=evaluation dataset.re10k.view_sampler.index_path=assets/evaluation_index_re10k.json test.save_compare=true wandb.mode=online checkpointing.load="checkpoint_path" wandb.name="wandb_name"
Evaluation code for novel view synthesis on the RealEstate10K dataset when multi-view input is available:
python -m src.main +evaluation=re10k_multiview mode=test dataset/view_sampler@dataset.re10k.view_sampler=evaluation dataset.re10k.view_sampler.index_path=assets/evaluation_index_re10k.json test.save_compare=true wandb.mode=online checkpointing.load="checkpoint_path" wandb.name="wandb_name"
Evaluation code for 3D scene understanding on the ScanNet dataset:
python -m src.main +evaluation=scannet wandb.mode=online mode=test test.save_compare=true test.pose_align_steps=1000 checkpointing.load="checkpoint_path" wandb.name="wandb_name"
If you do not want to log to wandb, just set wandb.mode=disabled.
@article{an2025c3g,
title={C3G: Learning Compact 3D Representations with 2K Gaussians},
author={An, Honggyu and Jung, Jaewoo and Kim, Mungyeom and Hong, Sunghwan and Kim, Chaehyun and Fukuda, Kazumi and Jeon, Minkyeong and Han, Jisang and Narihira, Takuya and Ko, Hyuna and others},
journal={arXiv preprint arXiv:2512.04021},
year={2025}
}
We thank the authors of VGGT and NoPoSplat for their excellent work and code, which served as the foundation for this project.