SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

1Yonsei University, 2Carnegie Mellon University, 3UC Merced & Google DeepMind, 4Seoul National University. *Equal contribution
arXiv 2025

Motivation

(a) Distribution of motion embeddings extracted with the feature extractor (HumanML3D) and visualized via PCA. Scene-aware datasets (HUMANISE, TRUMANS) show narrower distributions than the T2M dataset (HumanML3D), indicating lower diversity and semantic coverage. (b) Models trained on T2M datasets capture diverse action semantics but lack scene awareness, penetrating obstacles. (c) Models trained on scene-aware datasets satisfy scene constraints but fail to follow text conditions.





Abstract

Human motion is inherently diverse and semantically rich, and it is also shaped by the surrounding scene. However, existing motion generation approaches fail to generate diverse motion while simultaneously respecting scene constraints, since constructing large-scale datasets with both rich text-motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene–motion and text–motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, which is learnable without text, as a proxy task that bridges the two distinct datasets and thereby injects scene awareness into text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
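For intuition, the sketch below shows how motion inbetweening can serve as a text-free proxy objective: a few keyframes of a clip are kept clean while the rest is diffused, and the model is supervised only on the in-between frames. This is a minimal PyTorch sketch under assumed interfaces; denoiser, q_sample, and sample_keyframe_mask are placeholder names, not the released SceneAdapt code.

```python
# Minimal sketch of an inbetweening proxy objective: keyframes stay observed,
# the rest of the clip is diffused, and only the in-between frames are supervised.
# All names (sample_keyframe_mask, denoiser, q_sample) are illustrative assumptions.
import torch

def sample_keyframe_mask(seq_len: int, batch: int, keep_prob: float = 0.1) -> torch.Tensor:
    """Randomly mark a small subset of frames as observed keyframes."""
    mask = torch.rand(batch, seq_len) < keep_prob
    mask[:, 0] = True   # always keep the first frame
    mask[:, -1] = True  # always keep the last frame
    return mask

def inbetweening_loss(denoiser, q_sample, motion: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """motion: (B, T, D) clean motion; t: (B,) diffusion timesteps."""
    mask = sample_keyframe_mask(motion.shape[1], motion.shape[0]).unsqueeze(-1)  # (B, T, 1)
    noisy = q_sample(motion, t)                      # forward-diffuse the clip
    noisy = torch.where(mask, motion, noisy)         # keyframes remain clean (observed)
    pred = denoiser(noisy, t, keyframe_mask=mask)    # model fills in the gaps, no text needed
    return ((pred - motion) ** 2 * (~mask)).mean()   # supervise only the in-between frames
```

Because this objective needs only motion sequences, it can be trained on scene–motion data without any text annotations, which is what lets it bridge the two disjoint datasets.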





Overview

Method Overview

Stage 0: Pretrain the text-to-motion model (MDM). Stage 1: Insert Context-aware Keyframing (CaKey) layers and train them with a motion inbetweening objective, which requires only motion sequences. Stage 2: Add scene-conditioning (SceneCo) layers and train them with a scene-aware inbetweening objective, using scene–motion pairs. Inference: Use only the base model and the SceneCo layers for scene-aware text-to-motion generation.
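The sketch below illustrates one way such adapter layers could be wired around a frozen MDM-style transformer denoiser: a modulation layer for keyframe context (Stage 1) and a cross-attention layer over scene tokens (Stage 2), with only the newly inserted layers trained at each stage. CaKeyLayer, SceneCoLayer, and set_trainable are illustrative names and not the released implementation.

```python
# A minimal sketch of the staged adaptation, assuming an MDM-style transformer
# denoiser. Layer and argument names are illustrative assumptions that paraphrase
# the overview above, not the released SceneAdapt code.
import torch
import torch.nn as nn

class CaKeyLayer(nn.Module):
    """Stage 1: modulate motion latents with keyframe context (FiLM-style)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, key_ctx: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(key_ctx).chunk(2, dim=-1)
        return x * (1 + scale) + shift  # light-touch modulation preserves the latent manifold

class SceneCoLayer(nn.Module):
    """Stage 2: query local scene geometry via cross-attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init so the base model is unchanged at first

    def forward(self, x: torch.Tensor, scene_tokens: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.attn(query=x, key=scene_tokens, value=scene_tokens)
        return x + self.gate * ctx  # residual injection of local scene context

def set_trainable(base_blocks: nn.Module, cakey: nn.Module, sceneco: nn.Module, stage: int) -> None:
    """Freeze the pretrained text-to-motion blocks; train only the layers of the current stage."""
    for p in base_blocks.parameters():
        p.requires_grad_(False)        # Stage 0 weights stay frozen throughout adaptation
    for p in cakey.parameters():
        p.requires_grad_(stage == 1)   # Stage 1: motion inbetweening objective
    for p in sceneco.parameters():
        p.requires_grad_(stage == 2)   # Stage 2: scene-aware inbetweening objective
    # At inference, the CaKey layers are dropped and only the base model plus
    # SceneCo layers are used for scene-aware text-to-motion generation.
```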

Results


Results - Comparisons

Results - Scene CFG comparisons

Results - Goal-conditioned generation