SceneSplat++:

A Large Dataset and Comprehensive
Benchmark for Language Gaussian Splatting

1INSAIT, Sofia University "St. Kliment Ohridski" 2Nanjing University of Aeronautics and Astronautics 3ETH Zürich 4University of Amsterdam
5Johns Hopkins University 6University of Pisa 7University of Trento
indicates equal contribution. indicates the corresponding author.

TL;DR

  • SceneSplat-49K Dataset. We introduce a large-scale 3DGS dataset comprising approximately 49K diverse indoor and outdoor scenes.
  • SceneSplat-Bench. We introduce a comprehensive benchmark for evaluating Language Gaussian Splatting (LGS) methods at scale, with 3.7× more semantic classes and 50.5× more scenes than existing evaluation protocols.
  • Key Insights. We find that generalizable models achieve the strongest performance overall, while among per-scene methods, optimization-free approaches outperform optimization-based ones.
SceneSplat++ teaser

Motivation

Language Gaussian Splatting (LGS) enriches 3D Gaussian Splatting scenes with language features, enabling open-vocabulary 3D scene understanding and interaction. However, current evaluation protocols have major limitations:

  • Small number of test scenes: LGS methods are evaluated on very few test scenes (e.g., 10-20).
  • Tested close to training views: LGS methods are evaluated on views close to those used during training.
  • Only 2D metrics: LGS methods are evaluated only with 2D metrics on rendered views, rather than directly in 3D.

SceneSplat-49K Dataset

We present SceneSplat-49K, a large-scale 3D Gaussian Splatting dataset comprising approximately 49K raw scenes and 46K curated 3DGS scenes aggregated from SceneSplat-7K, DL3DV-10K, HoliCity, Aria Synthetic Environments, and newly collected crowdsourced data. The corpus spans diverse indoor and outdoor environments, from rooms and apartments to streets. To support 3DGS scene understanding, 12K scenes are further enriched with per-primitive vision-language embeddings extracted using state-of-the-art vision-language models.
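The per-primitive vision-language embeddings can be pictured as one feature vector attached to each Gaussian alongside its standard 3DGS parameters. The sketch below is purely illustrative; the field names and dimensions are assumptions for exposition, not the dataset's actual storage format.

```python
# Illustrative layout of a 3DGS scene with per-primitive language features.
# Field names and the embedding dimension D are assumptions, not the
# dataset's actual on-disk format.
from dataclasses import dataclass
import numpy as np

@dataclass
class LanguageGaussianScene:
    means: np.ndarray       # (N, 3) Gaussian centers
    scales: np.ndarray      # (N, 3) per-axis scales
    rotations: np.ndarray   # (N, 4) quaternions
    opacities: np.ndarray   # (N,)  opacity per Gaussian
    colors: np.ndarray      # (N, 3) base RGB (SH degree 0)
    lang_feats: np.ndarray  # (N, D) vision-language embedding per Gaussian

    def __post_init__(self):
        # All per-primitive arrays must share the same leading dimension N.
        n = self.means.shape[0]
        assert all(arr.shape[0] == n for arr in
                   (self.scales, self.rotations, self.opacities,
                    self.colors, self.lang_feats))
```

Storing the language feature per primitive (rather than per rendered view) is what allows querying and evaluation directly in 3D space.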

Dataset statistics
Dataset statistics visualization

Appearance, Geometry, and Scale Statistics of the SceneSplat-49K Dataset. Distributions of photometric (PSNR, SSIM, LPIPS) and geometric (depth ℓ1) reconstruction errors show consistently high-quality renders across scenes, while the wide spread in total Gaussian count and indoor/outdoor scene floor area demonstrates the dataset's diversity. The curves are smoothed from the histogram bucket values, and vertical dotted lines mark the mean of each metric.

SceneSplat-Bench

To address the above-mentioned limitations, we propose the first large-scale benchmark that systematically assesses LGS methods directly in 3D space, evaluating on 1,060 scenes across three indoor datasets and one outdoor dataset. Our benchmark contains 3.7× more semantic classes and 50.5× more scenes than prior protocols. We also group LGS methods into three categories: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches.
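Evaluating directly in 3D space boils down to labeling each Gaussian with the semantic class whose text embedding is most similar to its language feature, then scoring against 3D ground truth. The following is a minimal sketch of that idea; the array names and the use of plain cosine similarity are assumptions for illustration, not the benchmark's exact protocol.

```python
# Sketch of zero-shot 3D semantic segmentation evaluated on Gaussian
# primitives. Inputs are assumed precomputed: per-Gaussian language
# features and one text embedding per semantic class.
import numpy as np

def assign_labels(gauss_feats, text_embeds):
    """Assign each Gaussian the class with the most similar text embedding."""
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = g @ t.T              # (num_gaussians, num_classes) cosine similarity
    return sim.argmax(axis=1)  # predicted class index per Gaussian

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union computed over 3D primitives."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both prediction and GT
            ious.append(inter / union)
    return float(np.mean(ious))
```

Because the metric is computed over primitives rather than rendered views, it is independent of camera placement, which is exactly what the 2D-only protocols cannot guarantee.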

Benchmark overview

Language Gaussian Splatting (LGS) Benchmark Overview. Left: Grouped methods and their properties. Right: Benchmark characteristics. Our proposed SceneSplat-Bench benchmark evaluates the LGS methods at scale, across three indoor datasets and one outdoor dataset.

Experiments & Key Findings

Task 1: Semantic Segmentation

Semantic segmentation table

Zero-Shot 3D Semantic Segmentation Experiments on ScanNet++ and Matterport3D. All methods are evaluated on a 10-scene mini-validation set, with the full set evaluated only for selected methods due to runtime constraints.

Semantic segmentation visualization

Qualitative Results of Zero-Shot 3D Semantic Segmentation. The semantic classes "bicycle" and "kitchen table", which are not labeled in the ground truth, are highlighted.

Task 2: Object Localization

Our benchmark evaluates precise language-driven localization via textual grounding and object-centric reasoning.
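For the bounding-box-based evaluation, a standard way to score a localization query is to count it correct when the predicted 3D box overlaps the ground-truth box above an IoU threshold. The sketch below assumes axis-aligned boxes and a 0.25 threshold, a common convention in 3D grounding; the benchmark's exact protocol may differ.

```python
# Sketch of bounding-box-based localization accuracy for text queries.
# Boxes are axis-aligned, given as (xmin, ymin, zmin, xmax, ymax, zmax);
# the 0.25 IoU threshold is a common 3D-grounding convention, assumed here.
import numpy as np

def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes."""
    lo = np.maximum(a[:3], b[:3])            # intersection lower corner
    hi = np.minimum(a[3:], b[3:])            # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def localization_accuracy(preds, gts, thresh=0.25):
    """Fraction of queries whose predicted box hits ground truth at IoU >= thresh."""
    hits = [box_iou_3d(np.asarray(p, float), np.asarray(g, float)) >= thresh
            for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)
```

The segmentation-based variant replaces the box IoU with an IoU over the predicted versus ground-truth sets of primitives, analogous to the semantic-segmentation metric.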

Localization table

3D Object Localization Experiments on ScanNet and ScanNet++. Accuracy is reported with bounding box–based and segmentation–based evaluation.

Textual query visualization

Text-Based Scene Query. Given the prompt "These are fruits" to different LGS methods, the queried parts are highlighted in red.

Scaling-Up

Scaling the generalizable pipeline on the proposed dataset consistently improves performance. Notably, models trained solely on indoor data transfer effectively to outdoor scenes.

Scaling results table

Impact of training-data scaling on indoor benchmarks and cross-domain generalization to HoliCity. More training data consistently improves indoor performance, and models trained only on indoor data transfer surprisingly well to outdoor scenes.

Scaling to outdoor scenes

Zero-shot predictions of indoor-trained SceneSplat on outdoor scenes. The results highlight the cross-domain capability. The color palette denotes buildings (red), roads (blue), terrain (yellow), and trees (green).

BibTeX


  @inproceedings{ma2025scenesplatpp,
    title     = {SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting},
    author    = {Ma, Mengjiao and Ma, Qi and Li, Yue and Cheng, Jiahuan and Yang, Runyi and Ren, Bin and Popovic, Nikola and Wei, Mingqiang and Sebe, Nicu and Van Gool, Luc and Gevers, Theo and Oswald, Martin R. and Paudel, Danda Pani},
    booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)},
    year      = {2025}
  }