
DiffusionSfM: Predicting Structure and Motion

via Ray Origin and Endpoint Diffusion

Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani
Carnegie Mellon University
Project page: qitaozhao.github.io/DiffusionSfM
arXiv:2505.05473v1 [cs.CV] 8 May 2025

[Figure 1: Input Images → Ray Origin & Endpoint Diffusion Timesteps → 3D Point Clouds & Recovered Cameras]

Figure 1. DiffusionSfM. Top: Given a set of multi-view images (left), DiffusionSfM represents scene geometry and cameras (right) as pixel-wise ray origins and endpoints in a global frame. It learns a denoising diffusion model to infer these elements directly from multi-view inputs. Unlike traditional Structure-from-Motion (SfM) pipelines, which separate pairwise reasoning and global optimization into two stages, our approach unifies both into a single end-to-end multi-view reasoning framework. Bottom: Example results of inferred scene geometry and cameras for two distinct settings: a real-world outdoor scene (left) and a synthetic indoor scene (right).

Abstract

Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.

1. Introduction

The task of recovering structure (geometry) and motion (cameras) from multi-view images has long been a focus of the computer vision community, with typical pipelines [26] performing pairwise correspondence estimation followed by global optimization. While classical methods relied on hand-designed features, matching, and optimization, there has been a recent shift towards incorporating learning-based alternatives [5, 6, 15, 24]. More recently, the widely influential DUSt3R [37] advocates for predicting pairwise 3D pointmaps (instead of only correspondences), demonstrating that this can yield accurate dense geometry and cameras. In order to reconstruct more than two views, DUSt3R (and its variants [10]) still requires a global optimization reminiscent of classic bundle adjustment. While these methods, both classical and learning-based, have led to impressive improvements in SfM, the overall approach is largely unchanged – learned or geometric pairwise reasoning followed by global optimization. In this work, we seek to develop an alternative approach that directly predicts both structure and motion, unifying pairwise reasoning and global optimization into a single multi-view framework.

We are of course not the first to attempt to find unified alternatives to the two-stage SfM pipeline. In the sparse-view setting where conventional correspondence-based methods struggle, several works employ multi-view architectures to jointly reason across input images. SparsePose [27], RelPose++ [12], and PoseDiffusion [35] all leverage multi-view transformers to estimate camera pose for input images, albeit using differing mechanisms such as regression, energy-based modeling, and denoising diffusion. More recently, RayDiffusion [42] argues for a local raymap parameterization of cameras instead of a global extrinsic matrix and shows that existing patch-based transformers can be easily adapted for this task, yielding significantly more accurate pose predictions. Importantly, such methods predict only camera motion and fail to predict scene structure.

In this work, we present DiffusionSfM, an end-to-end multi-view model that directly infers dense 3D geometry and cameras from multiple input images. Instead of inferring (depth-agnostic) rays per image patch (as in RayDiffusion [42]) or 3D points per pixel (as in DUSt3R [37]), DiffusionSfM effectively combines both to predict ray origins and endpoints per pixel, directly reporting both scene geometry (endpoints) and generalized cameras (rays). These can readily be converted back to traditional cameras [42]. Compared to RayDiffusion, our model directly predicts structure as well as motion, and at a finer scale (pixel-wise vs. patch-wise). Compared to DUSt3R, our model directly predicts motion as well as structure but, even more importantly, does so for N views, eliminating the need for memory-intensive global alignment. To model uncertainty, we train a denoising diffusion model, but find two key challenges that need to be addressed. First, diffusion models require (noisy) ground truth as input for training, but existing real datasets do not have known endpoints for all pixels due to missing depth in multi-view stereo. Second, the 3D coordinates of endpoints can be potentially unbounded, whereas diffusion models require normalized data. We develop mechanisms to overcome these challenges, leveraging additional "GT mask conditioning" as input to inform the model of missing input data and parameterizing 3D points in projective space instead of Euclidean space. We find these strategies allow us to learn accurate predictions for structure and motion.

We train and evaluate DiffusionSfM on real-world and synthetic datasets [22, 25, 43] and find that it can infer accurate geometry and cameras for both object-centric and scene-level images (see Fig. 1). In particular, we find that DiffusionSfM yields more accurate camera estimates compared to prior work across these settings while also modeling the underlying uncertainty via the diffusion process. In summary, we show that DiffusionSfM can serve as a unified multi-view reasoning model for 3D geometry and cameras.

2. Related Work

Structure from Motion. Structure-from-Motion (SfM) systems [26] aim to simultaneously recover geometry and cameras given a set of input images. The typical SfM pipeline extracts pixel correspondences from keypoint matching [2, 17] and performs global bundle adjustment (BA) to optimize sparse 3D points and camera parameters by minimizing reprojection errors. Recently, SfM pipelines have been substantially enhanced by replacing classical subcomponents with learning-based methods, such as neural feature descriptors [6, 8], keypoint matching [16, 24, 31], and bundle adjustment [14, 33].

More recently, an emerging body of research aims to unify the various SfM subcomponents into an end-to-end neural framework. Notably, ACEZero [3] fits a single neural network to input images and learns pixel-aligned 3D coordinates in a self-supervised manner, while FlowMap [28] predicts per-frame cameras and depth maps using off-the-shelf optical flow as supervision. Though ACEZero and FlowMap are promising attempts to revolutionize SfM pipelines, they both register images incrementally and may suffer under large viewpoint changes. DUSt3R [37] directly regresses 3D pointmaps from image pairs and shows strong generalization ability [11, 22]. Building on DUSt3R, MASt3R [10] introduces a feature head, offering pixel-matching capabilities. MASt3R-SfM [7] is a more scalable SfM pipeline based on MASt3R. While these approaches [7, 10, 37] show impressive performance and robustness under sparse views, they are essentially pair-based, requiring sophisticated global alignment procedures to form a consistent estimate for more than two views.

Pose Estimation with Global Reasoning. For the task of sparse-view pose estimation, learning-based methods equipped with global reasoning show favorable robustness where traditional SfM methods [26, 29] fail. This line of research includes energy-based [12, 41], regression-based [27], and diffusion-based pose estimators [35, 42]. Among them, diffusion-based methods show a better ability to handle uncertainty [42]. Closest to our work, RayDiffusion [42] leverages a denoising diffusion model [9, 20] with a patch-aligned (depth-unaware) ray representation to predict generic cameras. Our method goes further and pursues a generic representation for both geometry and cameras in the form of ray origins and endpoints for each pixel. In addition to resulting in a richer output, this joint geometry and pose prediction also yields improvements for pose estimation.
[Figure 2 diagram: Input Images → DINOv2 (Semantic Feats.) and Noisy Ray Origins & Endpoints with GT Mask (Geometric Feats.) → Diffusion Transformer → DPT Decoder → Denoised Origins & Endpoints → 3D Point Clouds & Recovered Cameras, repeated over the denoising steps]

Figure 2. Method. Given sparse multi-view images as input, DiffusionSfM predicts pixel-wise ray origins and endpoints in a global frame
(Sec. 3.1) using a denoising diffusion process (Sec. 3.2). For each image, we compute patch-wise embeddings with DINOv2 [19] and
embed noisy ray origins and endpoints into latents using a single downsampling convolutional layer, ensuring alignment with the spatial
footprint of the image embeddings. We implement a Diffusion Transformer architecture that predicts clean ray origins and endpoints from
noisy samples. A convolutional DPT [21] head outputs full-resolution denoised ray origins and endpoints. To handle incomplete ground
truth (GT) during training, we condition the model on GT masks (Sec. 3.3). At inference, the GT masks are set to all ones, enabling
the model to predict origins and endpoints for all pixels. The predicted ray origins and endpoints can be directly visualized in 3D or
post-processed to recover camera extrinsics, intrinsics, and multi-view consistent depth maps.

3. Method

Given a set of sparse (i.e., 2-8) input images, DiffusionSfM predicts the geometry and cameras of a 3D scene in a global coordinate frame. In Sec. 3.1, we propose to represent 3D scenes as dense pixel-aligned ray origins and endpoints. To predict such scene representations from sparse input images while modeling uncertainty, Sec. 3.2 proposes a denoising diffusion architecture. We then discuss some key practical challenges in training such a model in Sec. 3.3.

3.1. 3D Scenes as Ray Origins and Endpoints

Given an input image I ∈ R^{H×W×3} with a depth map D ∈ R^{H×W}, camera intrinsics K ∈ R^{3×3}, and world-to-camera extrinsics T ∈ R^{4×4} (equivalently, rotation R ∈ SO(3) and translation t ∈ R^3), each 2D image pixel P_ij = [u, v] corresponds to a ray that travels from the camera center c through the pixel's projected position on the image plane, terminating at the object's surface as specified by the depth map D. The endpoint of the ray associated with image pixel P_ij is given by:

    E_ij = T^{-1} h( D_ij · K^{-1} [u, v, 1]^T )    (1)

where h maps the 3D point into homogeneous coordinates. The shared ray origin O_ij for all pixels is equivalent to the camera center c, and can be computed as:

    O_ij = c = h( -R^{-1} t )    (2)

In summary, we associate each image pixel with a ray origin and endpoint S_ij = ⟨O_ij, E_ij⟩ in world coordinates, describing the location of the observing camera and the observed 3D point on the object surface. Given a bundle of ray origins and endpoints, we can easily extract the corresponding camera pose [42].
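To make Eqs. (1) and (2) concrete, the following NumPy sketch computes the pixel-wise ray endpoints and the shared ray origin for a single view under the conventions above (pinhole intrinsics K, world-to-camera extrinsics T). It is an illustrative implementation only; the function and variable names are ours and not taken from the released code.

import numpy as np

def rays_from_depth(depth, K, T):
    """Compute pixel-wise ray endpoints E (H, W, 3) and the shared ray
    origin O (3,) in world coordinates, following Eqs. (1)-(2).

    depth: (H, W) depth map, K: (3, 3) intrinsics,
    T: (4, 4) world-to-camera extrinsics."""
    H, W = depth.shape
    # Pixel grid in homogeneous image coordinates [u, v, 1].
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)

    # Back-project to camera coordinates: D_ij * K^{-1} [u, v, 1]^T.
    cam_pts = depth[..., None] * (pix @ np.linalg.inv(K).T)              # (H, W, 3)

    # Lift to homogeneous coordinates and map to world with T^{-1}.
    cam_h = np.concatenate([cam_pts, np.ones((H, W, 1))], axis=-1)       # (H, W, 4)
    world_h = cam_h @ np.linalg.inv(T).T                                 # (H, W, 4)
    endpoints = world_h[..., :3] / world_h[..., 3:4]

    # Shared ray origin = camera center c = -R^{-1} t (Eq. 2).
    R, t = T[:3, :3], T[:3, 3]
    origin = -np.linalg.inv(R) @ t
    return origin, endpoints
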

Learning Over-Parameterized Representations. We represent 3D scenes and cameras using distributed ray origins O and endpoints E rather than global alternatives such as quaternions and translation vectors. This design is inspired by RayDiffusion [42], and it facilitates the use of the distributed deep features learned by state-of-the-art vision backbones, such as DINOv2 [19], which encode image information in a patch-wise manner. Notably, while ray origins O should ideally be identical across all pixels, we predict them densely alongside ray endpoints E. This encourages ray origins to remain close within the same image, providing implicit regularization during training. Practically, predicting both ray origins and endpoints simultaneously is easy to implement using a single projection head.

3.2. DiffusionSfM

We propose a denoising Diffusion Transformer (DiT) architecture [20] that predicts ray origins and endpoints (Sec. 3.1) via a denoising diffusion process. An overview of DiffusionSfM is given in Fig. 2.

Diffusion Framework. Given pixel-aligned ray origins and endpoints S = stack({S^(n)}_{n=1}^N) associated with a set of N input images, we apply a forward diffusion process [9, 30] that adds time-dependent Gaussian noise to them. Let S_t denote the noisy ray origins and endpoints at timestep t, where S_0 is the clean sample and S_T (at the final diffusion step T) approximates pure Gaussian noise. The forward diffusion process is defined as:

    S_t = sqrt(ᾱ_t) · S_0 + sqrt(1 − ᾱ_t) · ε    (3)

where t ∼ Uniform(0, T], ε ∼ N(0, I), and ᾱ_t follows a pre-defined noise schedule that controls the strength of added noise at each timestep. To perform the reverse diffusion process, which progressively reconstructs the clean sample given noisy observations, we train a diffusion model f_θ that takes S_t as input and optionally incorporates additional conditioning information C. The model is trained using the following loss function (with the "x0" objective):

    L_Diffusion = E_{t, S_0, ε} ‖ S_0 − f_θ(S_t, t, C) ‖²    (4)
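The training step implied by Eqs. (3) and (4) can be sketched as follows in PyTorch. The model f_theta, the cumulative schedule alphas_bar, and the conditioning feats are placeholders for the actual DiT, noise schedule, and DINOv2 features; this is an illustration of the x0 objective rather than the released training code.

import torch

def diffusion_training_step(f_theta, S0, feats, alphas_bar):
    """One x0-objective training step (Eqs. 3-4).

    S0:         clean ray origins & endpoints, shape (B, N, H, W, 6).
    feats:      conditioning image features C (e.g., DINOv2 patch tokens).
    alphas_bar: (T,) cumulative noise schedule."""
    B = S0.shape[0]
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (B,), device=S0.device)            # sample a timestep
    a_bar = alphas_bar[t].view(B, 1, 1, 1, 1)

    eps = torch.randn_like(S0)                                  # Gaussian noise
    St = a_bar.sqrt() * S0 + (1.0 - a_bar).sqrt() * eps         # forward process, Eq. (3)

    S0_pred = f_theta(St, t, feats)                             # predict the clean sample
    loss = ((S0 - S0_pred) ** 2).mean()                         # x0 loss, Eq. (4)
    return loss
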
Architecture. We implement f_θ using a DiT [20] architecture conditioned on deep image features C ∈ R^{N×h×w×c1} from DINOv2 [19], where h and w are the patch resolution and c1 is the embedding dimension. To align pixels to the spatial information learned by DINOv2, we apply a convolutional layer that spatially downsamples the noisy ray origins and endpoints S_t to match the DINOv2 features while increasing their feature dimension:

    F = Conv(S_t) ∈ R^{N×h×w×c2}    (5)

The combined DiT input is constructed by concatenating these two feature sets along the channel dimension: F ⊕ C. Within DiT, patch-wise features attend to others through self-attention [34]. To distinguish between different images and their respective patches, we apply sinusoidal positional encoding [34] based on image and patch indices.

While the DiT operates on low-resolution features, our objective is to produce pixel-aligned dense ray origins and endpoints. To achieve this, we employ a DPT (Dense Prediction Transformer) [21] decoder, which takes intermediate feature maps from both DINOv2 and DiT as inputs. The DPT decoder progressively increases the feature resolution through several convolutional layers. The final ray origins and endpoints are decoded from the DPT output using a single linear layer. During inference, we apply the trained model in the reverse diffusion process to iteratively denoise a randomly initialized Gaussian sample.
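As a sketch of Eq. (5) and the F ⊕ C concatenation, the module below embeds the noisy ray maps onto the DINOv2 patch grid with a single strided convolution and concatenates them with the image features along the channel dimension (the result would then be flattened into patch tokens for the DiT). The channel sizes and patch size are assumptions chosen for illustration and may differ from the actual model.

import torch
import torch.nn as nn

class RayInputEmbedding(nn.Module):
    """Embed noisy ray origins/endpoints to the DINOv2 patch grid and
    concatenate with the image features along the channel dimension."""

    def __init__(self, ray_channels=6, embed_dim=384, patch_size=16):
        super().__init__()
        # Single downsampling conv: pixel grid -> patch grid (Eq. 5).
        self.conv = nn.Conv2d(ray_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, noisy_rays, dino_feats):
        # noisy_rays: (B, 6, H, W); dino_feats: (B, c1, h, w), where (h, w)
        # is the patch resolution of the image features.
        F = self.conv(noisy_rays)                  # (B, c2, h, w)
        return torch.cat([F, dino_feats], dim=1)   # F ⊕ C, fed to the DiT
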
3.3. Practical Training Considerations

Homogeneous Coordinates for Unbounded Geometry. Real-world scenes often exhibit significant variations in scale, both across different scenes and within a single scene. While normalizing by the average distance of 3D points [37] helps address cross-scene variation, within-scene variations remain. For instance, a background building may be much farther away than foreground elements, potentially resulting in extremely large coordinate values when generating ground-truth ray origins and endpoints. However, neural networks (especially diffusion models) tend to train most effectively when working with bounded inputs and outputs (e.g., between -1 and 1). To stabilize training across the large scale variations present in 3D scene datasets, we propose to represent ray origins and endpoints in homogeneous coordinates. Specifically, given any 3D point, we apply a homogeneous transform from R^3 → P^3:

    (x, y, z) → (1/w) (x, y, z, 1)    (6)

where w is an arbitrary scale factor. To encourage bounded coordinates in practice, we choose w such that the homogeneous coordinate is unit-norm:

    w := sqrt(x² + y² + z² + 1)    (7)

Unit normalization allows homogeneous coordinates to serve as a bounded representation for unbounded scene geometry. For example, (x, y, z, 0) is a point at infinity in the direction of (x, y, z). We find that this representation makes large coordinate values more tractable during training.
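A small NumPy sketch of the transform in Eqs. (6) and (7) and of its inverse (illustrative helper names):

import numpy as np

def to_unit_homogeneous(points):
    """Map Euclidean points (..., 3) to unit-norm homogeneous points (..., 4),
    i.e., (x, y, z) -> (x, y, z, 1) / sqrt(x^2 + y^2 + z^2 + 1)  (Eqs. 6-7)."""
    w = np.sqrt((points ** 2).sum(axis=-1, keepdims=True) + 1.0)
    return np.concatenate([points, np.ones_like(w)], axis=-1) / w

def from_homogeneous(points_h, eps=1e-8):
    """Inverse map back to Euclidean space; the last coordinate approaches 0
    for points near infinity, so we clamp it for numerical safety."""
    return points_h[..., :3] / np.clip(points_h[..., 3:4], eps, None)
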

Training with Incomplete Ground Truth. Many real-world datasets (e.g., CO3D [22] and MegaDepth [11]) provide only sparse point clouds, leading to incomplete depth information. This presents a significant challenge in diffusion training, as the ground-truth (GT) depth values that are used to create clean ray endpoints often contain invalid or missing data. It is highly undesirable for these missing ray endpoints to be interpreted as part of the target distribution. Unlike regression models (e.g., DUSt3R [37]), which do not require (noisy) GT data as input and can simply apply masks on the loss for supervision, diffusion models must handle incomplete inputs during training.

To mitigate this issue, we further apply GT masks M ∈ R^{N×H×W} to the DiT inputs, where zero values indicate pixels with invalid depth. During training, we multiply noisy rays with GT masks element-wise, then concatenate along the channel dimension: S_t′ = (M · S_t) ⊕ M. Then, we only compute the diffusion loss in Eq. 4 over unmasked pixels. By implementing these strategies, we encourage the model to focus on regions with valid GT values during training. During inference, however, we would like the diffusion process to estimate ray origins and endpoints at all pixels, so we always use GT masks with values set to one.
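The GT-mask conditioning and the masked loss described above can be sketched as follows (PyTorch-style; the tensor shapes and helper names are assumptions for illustration, not the exact implementation):

import torch

def build_masked_input(St, gt_mask):
    """Zero out noisy rays at pixels with invalid depth and append the mask
    as extra channels: S_t' = (M * S_t) concat M.

    St:      (B, N, 6, H, W) noisy ray origins & endpoints.
    gt_mask: (B, N, 1, H, W), 1 = valid depth, 0 = missing (all ones at inference)."""
    return torch.cat([St * gt_mask, gt_mask], dim=2)

def masked_x0_loss(S0_pred, S0, gt_mask):
    """Diffusion loss (Eq. 4) restricted to pixels with valid ground truth."""
    sq_err = (S0 - S0_pred) ** 2
    denom = gt_mask.expand_as(sq_err).sum().clamp(min=1.0)
    return (sq_err * gt_mask).sum() / denom
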
Sparse-to-Dense Training. In practice, we find that training the entire model from scratch leads to slow convergence and suboptimal performance. To address this, we propose a sparse-to-dense training approach. First, we train a sparse version of the model, where the DPT decoder is removed, and the output ray origins and endpoints have the same spatial resolution as the DINOv2 features. Unlike Eq. 5, no spatial downsampling is required, so this sparse model uses a single linear layer to embed the noisy ray origins and endpoints. Once the sparse model is trained, we initialize the dense model DiT with the learned weights from the sparse model. This two-stage approach significantly improves performance; see the supplementary material for comparisons.

4. Experiments

4.1. Experimental Setup

Datasets. We introduce two model variants, each trained on different datasets. (1) DiffusionSfM-CO3D: Following prior work [12, 42], we train and evaluate our model on the CO3D dataset [22], which consists of turntable video sequences of various object categories. Specifically, we train on 41 object categories and evaluate on both these seen categories and an additional 10 unseen categories to assess generalization. (2) DiffusionSfM: This variant is trained on the datasets used for DUSt3R [37], excluding Waymo [32] due to its excessively sparse depth maps. The included datasets are Habitat [25], CO3D [22], ScanNet++ [40], ArkitScenes [1], Static Scenes 3D [18], MegaDepth [11], and BlendedMVS [39]. We follow the DUSt3R official repository [38] guidelines to extract image pairs, ensuring reasonable overlap between paired images. To construct multi-view data given these pre-computed pairs, we iteratively sample images using an adjacency matrix, maintaining sufficient overlap with the selected set. Beyond CO3D, this model variant is also evaluated on Habitat and RealEstate10k [43].

[Figure 3: Input / GT PCs · DiffusionSfM · RD + MoGe · DUSt3R]

Figure 3. Qualitative Comparison on Camera Pose Accuracy and Predicted Geometry. For each method, we plot the ground-truth
cameras in black and the predicted cameras in other colors. DiffusionSfM demonstrates robust performance even with challenging inputs.
Compared to DUSt3R, which sometimes fails to register images in a consistent manner, DiffusionSfM consistently yields a coherent global
prediction. Additionally, while we observe that DUSt3R can predict highly precise camera rotations, it often struggles with camera centers
(see the backpack example). Input images depicting scenes are out-of-distribution for RayDiffusion, as it is trained on CO3D only.

Baselines and Metrics. To evaluate camera pose accuracy in the sparse-view setup, we compare with RayDiffusion [42] and DUSt3R [37], along with previous methods [12, 23, 35]. DUSt3R is trained initially on a mixture of eight datasets, while most other baselines are trained only on CO3D. For a fair and comprehensive comparison, we re-train DUSt3R on the 41-10 split of CO3D, using the authors' official implementation [38] and hyperparameters (referred to as DUSt3R-CO3D). To evaluate camera predictions, we follow prior work [42] and convert predicted rays back to traditional extrinsic matrices and report two pose accuracy metrics: (1) Camera Rotation Accuracy, which compares the predicted relative camera rotation between images against ground truth, and (2) Camera Center Accuracy, which compares predicted camera centers to the ground truth after a similarity alignment. To evaluate the estimated geometry (i.e., ray endpoints), we report Chamfer Distance (CD). Additionally, to explore whether combining an off-the-shelf sparse-view pose estimation method (RayDiffusion) with a monocular depth estimation model (MoGe [36]) is sufficient to infer 3D scene geometry from multiple images, we introduce a new baseline, RD+MoGe. We include more details in the supplementary. Unless otherwise specified, all evaluations use 2–8 input images.
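For reference, the two camera metrics can be sketched as below: relative-rotation accuracy at 15° and center accuracy at 10% of the scene scale after an optimal similarity (Umeyama) alignment. This reflects our reading of the protocol described above and in the Table 1 caption; in particular the choice of scene scale is an assumption, and this is not the authors' evaluation script.

import numpy as np

def rotation_accuracy(R_pred, R_gt, thresh_deg=15.0):
    """Fraction of pairwise relative rotations within thresh_deg of GT.
    R_pred, R_gt: (N, 3, 3) world-to-camera rotations."""
    N = len(R_pred)
    errs = []
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            rel_pred = R_pred[i] @ R_pred[j].T
            rel_gt = R_gt[i] @ R_gt[j].T
            cos = (np.trace(rel_pred @ rel_gt.T) - 1.0) / 2.0
            errs.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return np.mean(np.array(errs) < thresh_deg)

def center_accuracy(c_pred, c_gt, thresh=0.1):
    """Fraction of camera centers within thresh * scene scale after an optimal
    similarity (Umeyama) alignment of predicted to GT centers; both (N, 3)."""
    mu_p, mu_g = c_pred.mean(0), c_gt.mean(0)
    P, G = c_pred - mu_p, c_gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ P)                       # cross-covariance SVD
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                                          # optimal rotation
    s = (S * np.diag(D)).sum() / (P ** 2).sum()             # optimal scale
    aligned = (s * (R @ P.T)).T + mu_g
    scale = np.linalg.norm(c_gt - mu_g, axis=1).max()       # one choice of scene scale
    return np.mean(np.linalg.norm(aligned - c_gt, axis=1) < thresh * scale)
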
[Figure 4: Input Images · DiffusionSfM · RayDiffusion · DUSt3R]

Figure 4. Additional Qualitative Results on Predicted Camera Poses. DiffusionSfM shows robustness to ambiguous patterns in inputs.

4.2. Evaluation on CO3D

Camera Pose Accuracy. We present the quantitative results in Tab. 1. For camera rotation accuracy, both versions of DiffusionSfM outperform all baselines, except that DUSt3R achieves slightly higher overall accuracy than DiffusionSfM-CO3D on unseen categories – likely due to its training on more data. For camera center accuracy, our approach consistently outperforms all other methods. We hypothesize that these gains stem from our explicit modeling of camera centers through ray origin prediction. The qualitative results in Fig. 4 illustrate that DiffusionSfM produces robust predictions given challenging inputs, whereas DUSt3R produces inaccurate results. We attribute this improvement to our model's probabilistic modeling capability derived from diffusion, as well as its multi-view reasoning abilities, which together effectively handle these challenging scenarios. While we observe that DUSt3R can predict highly precise camera rotations, it often struggles with camera centers (see the backpack example in Fig. 3).
                       Rotation Accuracy (↑, @ 15°)               Center Accuracy (↑, @ 0.1)
# of Images            2     3     4     5     6     7     8      2     3     4     5     6     7     8

Seen Categories
COLMAP [23]           30.7  28.4  26.5  26.8  27.0  28.1  30.6    100   34.5  23.8  18.9  15.6  14.5  15.0
PoseDiffusion [35]    75.7  76.4  76.8  77.4  78.0  78.7  78.8    100   77.5  69.7  65.9  63.7  62.8  61.9
RelPose++ [12]        81.8  82.8  84.1  84.7  84.9  85.3  85.5    100   85.0  78.0  74.2  71.9  70.3  68.8
RayDiffusion [42]     91.8  92.4  92.6  92.9  93.1  93.3  93.3    100   94.2  90.5  87.8  86.2  85.0  84.1
DUSt3R-CO3D [37]      86.7  87.9  88.0  88.2  88.6  88.8  88.9    100   92.0  86.8  83.8  82.0  81.1  80.4
DUSt3R [37]           91.7  92.7  93.3  93.6  93.8  94.0  94.3    100   93.0  85.7  81.9  79.6  77.8  76.8
DiffusionSfM-CO3D     93.4  94.0  94.5  94.8  95.0  95.2  95.1    100   95.9  93.6  92.2  91.2  90.7  90.2
DiffusionSfM          92.6  94.1  94.6  95.0  95.3  95.5  95.5    100   95.6  93.4  92.4  91.7  91.1  90.7

Unseen Categories
COLMAP [23]           34.5  31.8  31.0  31.7  32.7  35.0  38.5    100   36.0  25.5  20.0  17.9  17.6  19.1
PoseDiffusion [35]    63.2  64.2  64.2  65.7  66.2  67.0  67.7    100   63.6  50.5  45.7  43.0  41.2  39.9
RelPose++ [12]        69.8  71.1  71.9  72.8  73.8  74.4  74.9    100   70.6  58.8  53.4  50.4  47.8  46.6
RayDiffusion [42]     83.5  85.6  86.3  86.9  87.2  87.5  88.1    100   87.7  81.1  77.0  74.1  72.4  71.4
DUSt3R-CO3D [37]      79.8  81.5  82.6  82.7  83.0  83.3  83.7    100   83.6  77.2  71.8  70.0  68.1  67.0
DUSt3R [37]           90.8  92.6  93.6  93.6  93.8  93.6  93.4    100   87.9  79.8  74.3  71.7  69.4  67.8
DiffusionSfM-CO3D     90.4  91.2  92.7  93.0  93.1  93.3  93.5    100   91.1  87.7  85.3  83.7  82.7  82.0
DiffusionSfM          91.3  92.8  93.8  94.5  95.0  95.1  95.3    100   92.6  88.4  87.0  86.4  85.1  84.7

Table 1. Camera Rotation and Center Accuracy on CO3D. On the left, we report the proportion of relative camera rotations within 15° of the ground truth. On the right, we report the proportion of camera centers within 10% of the scene scale, relative to the ground truth. To align the predicted camera centers to ground truth, we apply an optimal similarity transform (s, R, t), hence the alignment is perfect at N = 2 but worsens with more images. DiffusionSfM outperforms all other methods for camera center accuracy, and outperforms all methods trained on equivalent data for rotation accuracy.

Predicted Geometry. To evaluate predicted geometry, we compute Chamfer Distance (CD) and show comparisons against baselines in Tab. 2. We compute CD in two setups (with and without foreground object masks), and find that DiffusionSfM-CO3D performs best without foreground masking. In this setup, the predicted ray endpoints corresponding to the background image pixels tend to have larger coordinate values than foreground ones, and therefore dominate CD. This result indicates that our model provides more accurate predictions for complex image backgrounds. In terms of CD with masking, DUSt3R achieves the best result, while our two model variants outperform DUSt3R-CO3D. See Fig. 3 for visualization. We also notice that naively combining RayDiffusion poses with estimated depths from MoGe does not work well, as the inferred depths are potentially inconsistent across different views, thus resulting in degraded performance.

# of Images         2      3      4      5      6      7      8

All scene points
RD*+MoGe [42]     0.059  0.064  0.071  0.062  0.063  0.061  0.061
DUSt3R* [37]      0.036  0.037  0.040  0.040  0.037  0.036  0.039
DUSt3R [37]       0.021  0.023  0.024  0.024  0.025  0.025  0.023
DiffusionSfM*     0.019  0.021  0.022  0.022  0.021  0.021  0.021
DiffusionSfM      0.024  0.024  0.025  0.024  0.025  0.026  0.027

Foreground points only
RD*+MoGe [42]     0.071  0.075  0.068  0.067  0.066  0.064  0.064
DUSt3R* [37]      0.038  0.036  0.036  0.036  0.034  0.033  0.034
DUSt3R [37]       0.023  0.022  0.019  0.020  0.019  0.020  0.020
DiffusionSfM*     0.028  0.025  0.024  0.024  0.024  0.023  0.022
DiffusionSfM      0.028  0.023  0.022  0.022  0.023  0.021  0.020

Table 2. Chamfer Distance (↓) on CO3D Unseen Categories. Top: CD computed on all scene points. Bottom: CD computed on foreground points only. Models marked with "*" are trained on CO3D only, while those without are trained on multiple datasets. Note that top and bottom values are not directly comparable, as each ground-truth point cloud is individually normalized.
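The Chamfer Distance reported in Tab. 2 can be sketched as the symmetric average of nearest-neighbor distances between the predicted and ground-truth point clouds (a generic definition; the exact normalization used for the table may differ):

import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between point sets of shape (P, 3) and (G, 3).
    Brute-force O(P*G); use a KD-tree for large clouds."""
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)   # (P, G) squared distances
    return 0.5 * (np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())
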

4.3. Evaluation on Scene-Level Datasets

Beyond the object-centric CO3D dataset [22], we compare DiffusionSfM against DUSt3R on two scene-level datasets: Habitat (in-distribution) [25] and RealEstate10k (out-of-distribution) [43]. The results are presented in Tab. 3. While DiffusionSfM achieves comparable camera rotation accuracy to DUSt3R, it consistently predicts more accurate camera centers, aligning with our findings on CO3D.

4.4. Inference Speed

Though our method requires iterative diffusion denoising at inference, we can speed this up by performing early stopping. Specifically, we can treat the x0-prediction from early timesteps as our output instead of iterating over all denoising timesteps. Consistent with observations by Zhang et al. [42], this in fact yields more accurate predictions than the final-step diffusion outputs. As a result, we only require 10 denoising diffusion timesteps for inference, taking 1.91 seconds on a single A5000 GPU with 8 input images. In contrast, DUSt3R takes 8.73 seconds to run the complete pairwise inference and global alignment procedure. We provide additional analysis in the supplementary material.
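A sketch of this truncated sampling loop is shown below (DDIM-style pseudocode in PyTorch; the model, schedule, and stopping timestep are placeholders chosen to match the description of 10 denoising steps with the x0-prediction taken at an early timestep):

import torch

@torch.no_grad()
def sample_rays(f_theta, feats, shape, alphas_bar, timesteps, stop_t):
    """Truncated reverse diffusion: run a few DDIM-style steps and return the
    x0-prediction once the current timestep reaches stop_t (early stopping).
    timesteps is a decreasing list, e.g. [99, 98, ..., 90]."""
    St = torch.randn(shape)                                   # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alphas_bar[t]
        S0_pred = f_theta(St, torch.tensor([t]), feats)       # predict clean sample
        if t <= stop_t or i == len(timesteps) - 1:
            return S0_pred                                    # early-stopped output
        t_next = timesteps[i + 1]
        a_next = alphas_bar[t_next]
        eps = (St - a_t.sqrt() * S0_pred) / (1.0 - a_t).sqrt()        # implied noise
        St = a_next.sqrt() * S0_pred + (1.0 - a_next).sqrt() * eps    # DDIM step (eta = 0)
    return S0_pred
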
Habitat (2, 3, 4, 5 views):
DUSt3R [37]        97.0/100   95.0/97.6   94.3/95.0   94.2/93.1
DiffusionSfM       92.7/100   93.9/99.0   94.3/98.6   94.7/98.4

RealEstate10k (2, 4, 6, 8 views):
DUSt3R [37]        98.1/100   97.7/68.7   97.6/57.9   97.7/53.3
DiffusionSfM       97.9/100   97.8/74.9   98.0/67.7   98.0/63.7

Table 3. Camera Rotation and Center Accuracy on Two Scene-Level Datasets. Top: Habitat (2–5 views). Bottom: RealEstate10k (2, 4, 6, 8 views). Each entry reports camera rotation accuracy (left, ↑) and center accuracy (right, ↑). While DiffusionSfM performs on par with DUSt3R in rotation accuracy, it consistently surpasses DUSt3R in center accuracy.

[Figure 5: Input Images · Sample 1 · Sample 2]

Figure 5. Multi-modality of DiffusionSfM. We show two distinct samples from DiffusionSfM, starting from the same input images but with different random noise. Sample 1 explains the input images by putting all flowers on the left side, while Sample 2 places one flower on each side (note the difference in the green camera's viewpoint). DiffusionSfM is able to predict multi-modal geometry distributions when the scene layout is ambiguous in the inputs.

# of Images        2      3      4      5      6      7      8
DiffusionSfM*    0.019  0.021  0.022  0.022  0.021  0.021  0.021
w/o Mask         0.020  0.022  0.023  0.023  0.022  0.022  0.024

Table 4. Ablation Study on GT Mask Conditioning for CO3D Unseen Categories. We assess the effect of replacing GT mask conditioning with depth interpolation in our CO3D variant (DiffusionSfM*), by reporting the CD for predicted geometry. Incorporating mask conditioning to indicate missing data during training improves geometry quality.

4.5. Ablation Study

Homogeneous Coordinates. The use of homogeneous coordinates for ray origins and endpoints is crucial for stable model training. To assess its impact, we replace the proposed homogeneous representation with standard 3D coordinates in R^3. Our experiments show that this variant is difficult to train and fails to converge. Further details on this experiment are provided in the supplementary material.

GT Mask Conditioning. To evaluate the effectiveness of the proposed GT mask conditioning, we train a baseline model without using this strategy. For missing values in the ground-truth depth maps, we use nearest-neighbor interpolation to fill in invalid pixels. This experiment is conducted on the CO3D dataset, with results presented in Tab. 4. The findings show that removing GT mask conditioning consistently degrades predicted geometry across varying numbers of input views, even when the loss is still masked. While interpolation can effectively fill missing depth within object regions, it often introduces substantial noise in the background (e.g., the sky). This noise negatively impacts diffusion model training, as the entire ray origin and endpoint maps are used as input. We anticipate an even larger performance gap on non-object-centric datasets with more outdoor scenes, such as MegaDepth [11].

4.6. Multi-modality from Multiple Sampling

Diffusion models enable the generation of diverse samples from challenging input images. For instance, in Fig. 5, the vase exhibits symmetric patterns, and we present two distinct predicted endpoints from DiffusionSfM, each offering a different interpretation of the flowers in the images. Compared to regression models such as DUSt3R [37], DiffusionSfM is better suited for handling uncertainty – an inherent aspect of our task – where a sparse set of input images can correspond to multiple plausible 3D scene geometries.

5. Discussion

We present DiffusionSfM and demonstrate that it recovers accurate predictions of both cameras and geometry from multi-view inputs. Although our results are promising, several challenges and open questions remain.

Notably, DiffusionSfM employs a pixel-space diffusion model, in contrast to the latent-space models adopted by state-of-the-art T2I generative systems. Operating in pixel space may require greater model capacity, yet our current model remains relatively small – potentially explaining the noisy patterns observed along object boundaries. Learning an expressive latent space for ray origins and endpoints by training a VAE could be a promising direction for future work. In addition, the computational requirement in multi-view transformers scales quadratically with the number of input images: one would require masked attention to deploy systems like ours for a large set of input images.

Despite these challenges, we believe that our work highlights the potential of a unified approach for multi-view geometry tasks. We envision that our approach can be built upon to train a common system across related geometric tasks, such as SfM (input images with unknown origins and endpoints), registration (some images have known origins and endpoints, whereas others don't), mapping (known rays but unknown endpoints), and view synthesis (unknown pixel values for known rays).
Acknowledgements: We thank the members of the Physical Perception Lab at CMU for their valuable discussions, and extend special thanks to Yanbo Xu for his insights on diffusion models.

This work used Bridges-2 [4] at Pittsburgh Supercomputing Center through allocation CIS240166 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. This work was supported by Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number 140D0423C0074. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.

References

[1] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In NeurIPS D&B, 2021.
[2] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In ECCV, 2006.
[3] Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In ECCV, 2024.
[4] Shawn T Brown, Paola Buitrago, Edward Hanna, Sergiu Sanielevici, Robin Scibek, and Nicholas A Nystrom. Bridges-2: A platform for rapidly-evolving and data intensive research. In Practice and Experience in Advanced Research Computing, 2021.
[5] Ruojin Cai, Joseph Tung, Qianqian Wang, Hadar Averbuch-Elor, Bharath Hariharan, and Noah Snavely. Doppelgangers: Learning to disambiguate images of similar structures. In ICCV, 2023.
[6] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR, 2018.
[7] Bardienus Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In 3DV, 2025.
[8] Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. RoMa: Robust Dense Feature Matching. In CVPR, 2024.
[9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[10] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In ECCV, 2024.
[11] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
[12] Amy Lin, Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose++: Recovering 6d poses from sparse-view observations. In 3DV, 2024.
[13] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In WACV, 2024.
[14] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In ICCV, 2021.
[15] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. In ICCV, 2021.
[16] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In ICCV, 2023.
[17] David G Lowe. Distinctive image features from scale-invariant keypoints. In IJCV, 2004.
[18] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[19] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. In TMLR, 2024.
[20] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
[21] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
[22] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021.
[23] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019.
[24] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.
[25] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In ICCV, 2019.
[26] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[27] Samarth Sinha, Jason Y Zhang, Andrea Tagliasacchi, Igor Gilitschenski, and David B Lindell. Sparsepose: Sparse-view camera pose regression and refinement. In CVPR, 2023.
[28] Cameron Smith, David Charatan, Ayush Tewari, and Vincent Sitzmann. Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent. In 3DV, 2025.
[29] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM SIGGRAPH, 2006.
[30] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
[31] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In CVPR, 2021.
[32] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
[33] Chengzhou Tang and Ping Tan. Ba-net: Dense bundle adjustment network. In ICLR, 2019.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[35] Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment. In ICCV, 2023.
[36] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In CVPR, 2025.
[37] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.
[38] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. https://github.com/naver/dust3r.
[39] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[40] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In ICCV, 2023.
[41] Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose: Predicting probabilistic relative rotation for single objects in the wild. In ECCV, 2022.
[42] Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Sparse-view pose estimation via ray diffusion. In ICLR, 2024.
[43] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In ACM SIGGRAPH, 2018.
DiffusionSfM: Predicting Structure and Motion
via Ray Origin and Endpoint Diffusion
Supplementary Material
Overview

The supplementary material includes sections as follows:
• Section A: Implementation details.
• Section B: Additional analysis on integrating RayDiffusion [42] with MoGe [36].
• Section C: More qualitative comparisons of predicted geometry and camera poses against baseline methods.
• Section D: Details and evaluation of the sparse-to-dense training strategy employed in DiffusionSfM.
• Section E: More analysis of the homogeneous representation.
• Section F: Converting predicted ray origins and endpoints into camera poses.

A. Implementation Details

Inference. DiffusionSfM utilizes x0-parameterization to predict clean ray origin and endpoint maps as the model output, employing 100 diffusion denoising timesteps. In Fig. 9, we evaluate the accuracy of the x0-prediction at each timestep with eight input images on CO3D [22] unseen categories. Interestingly, we find that DiffusionSfM achieves its most accurate clean sample predictions at an early timestep (T = 90), rather than at the final denoising step. This observation remains consistent across different numbers of input images (Zhang et al. [42] also have a similar observation that early stopping helps improve performance). To capitalize on this property, we limit inference to 10 denoising steps and use the x0-prediction at T = 90 as the final output, significantly reducing inference time. Moreover, we find that the optimal timestep varies across datasets: T = 85 yields the best results on Habitat [25], while T = 95 performs best on RealEstate10k [43]. All experiments use a zero-terminal-SNR noise schedule [13].

Resolving Ambiguities in Ground Truth. We transform camera poses so the first camera has an identity rotation and is positioned at the world origin. For scale, we unproject the first image in the input views using ground-truth (GT) depth and scale the world coordinates so the "median point" lies at a unit distance from the origin. Our model is trained to conform to this scene configuration.
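Under our reading of this description, the normalization can be sketched as follows in NumPy (the exact definition of the "median point" is an assumption, as are the helper names):

import numpy as np

def normalize_scene(extrinsics, depth0, K0):
    """Express all cameras relative to the first one and rescale so that the
    median unprojected point of the first view lies at unit distance.

    extrinsics: list of (4, 4) world-to-camera matrices; depth0/K0: first view."""
    T0 = extrinsics[0]
    # New world frame = first camera frame: T_i' = T_i @ T0^{-1} (so T_0' = identity).
    new_extrinsics = [T @ np.linalg.inv(T0) for T in extrinsics]

    # Unproject the first image; in the new frame, camera-0 coordinates are
    # world coordinates, so the point distances give the scene scale directly.
    H, W = depth0.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    pts = depth0[..., None] * (pix @ np.linalg.inv(K0).T)
    valid = depth0 > 0
    scale = np.median(np.linalg.norm(pts[valid], axis=-1))

    # Dividing world coordinates (hence camera translations) by the scale puts
    # the "median point" at unit distance from the origin.
    for T in new_extrinsics:
        T[:3, 3] /= scale
    return new_extrinsics, scale
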
B. RD+MoGe Baseline: More Details

To minimize the scale difference between the predicted camera poses from RayDiffusion [42] and the depths from MoGe [36], and to form a single consistent output, we follow these procedures: (1) We match the MoGe depth with the GT depth using a 1D optimal alignment (thus giving this baseline some privileged information). (2) We align the predicted camera centers from RayDiffusion with GT cameras using an optimal similarity transform. (3) Finally, we unproject image pixels using the updated camera parameters and the aligned depths. We find that a naive combination of RayDiffusion and MoGe yields poor Chamfer Distance, even though RayDiffusion estimates relatively accurate focal length. This is because the MoGe depth estimates for different input views are inconsistent with each other. Therefore, to predict consistent 3D geometry from multiple images, the model must learn to reason over the entire set of views, rather than relying on mono-depth predictions from individual images. We also include more visualizations in Fig. 6, where duplicated structures are observed due to significant pose errors or a minor misalignment between views.
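The 1D depth alignment in step (1) could be as simple as a least-squares scale fit between MoGe and GT depths over valid pixels; the sketch below shows one reasonable implementation, not necessarily the exact procedure used:

import numpy as np

def align_depth_scale(pred_depth, gt_depth):
    """Least-squares 1D scale s minimizing || s * pred - gt ||^2 over pixels
    with valid ground-truth depth; returns the rescaled prediction and s."""
    valid = gt_depth > 0
    p, g = pred_depth[valid], gt_depth[valid]
    s = (p * g).sum() / np.maximum((p * p).sum(), 1e-8)
    return s * pred_depth, s
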

C. More Qualitative Comparisons

We include more qualitative comparisons with baselines on the predicted geometry (Fig. 6) and camera poses (Fig. 7).

Discussion. We show that DiffusionSfM can handle challenging input images where objects present highly symmetric patterns (e.g., the tennis ball example in Fig. 6 and the donut example in Fig. 7), while RayDiffusion [42] and DUSt3R [37] fail to predict correct camera poses. Compared to RayDiffusion, our approach leverages the prediction of dense scene geometry (i.e., pixel-aligned ray origins and endpoints) rather than relying on patch-wise "depth-agnostic" rays. We find that predicting dense pixel-aligned outputs improves performance (see Sec. D). When compared to DUSt3R, our model benefits from attending to all input images simultaneously and utilizing a diffusion framework to effectively manage the high uncertainties inherent to this task. Additionally, we observe that DUSt3R often predicts precise camera rotations but struggles with camera centers in many cases (e.g., the keyboard example in Fig. 6). This observation aligns with our quantitative results for camera center evaluation, presented in Tab. 1.

[Figure 6: Input / GT PCs · DiffusionSfM · RD + MoGe · DUSt3R]

Figure 6. More Qualitative Comparisons on Predicted Geometry and Camera Poses. DiffusionSfM shows superior capabilities in
handling challenging samples, e.g., the skateboard and tennis ball. Additionally, while we observe that DUSt3R can predict highly precise
camera rotations, it often struggles with camera centers (see the keyboard example).

D. Sparse-to-Dense Training Details and Evaluation

As outlined in Sec. 3.3, we follow a sparse-to-dense strategy to train our model, as we find that training the high-resolution model (i.e., the dense model) from scratch yields suboptimal performance. We visualize the output of the sparse model and dense model in Fig. 8. In the following, we introduce the details of training DiffusionSfM.

Details. Our model leverages DINOv2-ViTs14 [19] as the feature backbone and takes 224 × 224 images as input. This results in 16 × 16 image patches, each with a patch size of 14. We first train a sparse model that outputs patch-wise (i.e., 16 × 16) ray origins and endpoints. Since the spatial resolution of the GT ray origins and endpoints for the sparse model aligns with the DINOv2 feature map, we use a single linear layer to embed the noisy ray origins and endpoints (without spatial downsampling), rather than a convolutional layer as shown in Eq. 5. We also remove the DPT [21] decoder in our sparse model. Subsequently, we initialize our dense model from the pre-trained sparse model to predict dense (i.e., 256 × 256) ray origins and endpoints.

[Figure 7: Input Images · DiffusionSfM · RayDiffusion · DUSt3R]

Figure 7. More Qualitative Comparisons on Predicted Camera Poses.

                    Rotation Accuracy (↑, @ 15°)               Center Accuracy (↑, @ 0.1)
# of Images         2     3     4     5     6     7     8      2     3     4     5     6     7     8

Seen Categories
Sparse Model       92.5  93.1  93.4  93.6  93.6  93.8  93.9    100   95.4  92.6  90.9  89.6  88.8  88.2
Dense Model (1)    90.3  90.7  90.9  90.8  90.9  91.0  90.9    100   94.9  91.1  89.0  87.1  85.7  84.2
Dense Model (2)    93.4  94.0  94.5  94.8  95.0  95.2  95.1    100   95.9  93.6  92.2  91.2  90.7  90.2

Unseen Categories
Sparse Model       87.0  89.2  90.2  90.7  91.2  91.7  92.1    100   90.9  86.3  83.1  81.0  79.7  79.2
Dense Model (1)    85.9  86.8  87.4  87.8  88.5  88.6  88.8    100   89.1  83.7  79.7  77.7  75.5  74.5
Dense Model (2)    90.4  91.2  92.7  93.0  93.1  93.3  93.5    100   91.1  87.7  85.3  83.7  82.7  82.0

Table 5. Camera Rotation and Center Accuracy on CO3D at Different Training Stages. On the left, we report the proportion of relative camera rotations within 15° of the ground truth. On the right, we report the proportion of camera centers within 10% of the scene scale, relative to the ground truth. To align the predicted camera centers to ground truth, we apply an optimal similarity transform (s, R, t). Hence the alignment is perfect at N = 2 but worsens with more images.

[Figure 8: Input Images · Sparse Model · Dense Model]

Figure 8. Qualitative Comparison of Sparse and Dense Model Outputs. The sparse model predicts the ray origin and endpoint for each image patch, limiting its ability to capture the fine-grained details of the scene.

We copy the DiT [20] weights from the sparse model directly. For the convolutional layer used to embed ray origins and endpoints, we duplicate the linear-layer weights 16 × 16 times (as the patch size of the conv layer is 16) and then divide them by 256 to account for the patch-wise addition. While the DiT in the dense model has learned meaningful representations, the DPT decoder is initialized from scratch. To avoid breaking the learned DiT weights in early training iterations, we freeze its weights while only training the convolutional embedding layer and the DPT decoder for a few iterations. This warm-up model is referred to as Dense Model (1). After that, we train the whole model together, including the DINOv2 encoder as well (which was frozen in the previous stage). During this phase, we apply a lower learning rate (0.1×) to both the DINOv2 encoder and the DiT compared to the DPT decoder. The fully trained model is referred to as Dense Model (2). We compare the performance of DiffusionSfM-CO3D at each stage in Tab. 5.
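The linear-to-convolution weight transfer can be sketched as follows (PyTorch; the layer names and the 6-channel input are assumptions). The key point is tiling the linear weights over the 16 × 16 kernel and dividing by 256, so that on a patch with constant ray values the convolution reproduces the sparse model's linear embedding.

import torch
import torch.nn as nn

def init_conv_from_linear(linear, patch=16, in_ch=6):
    """Initialize the dense model's downsampling conv from the sparse model's
    linear ray embedding: tile the linear weights over a patch x patch kernel
    and divide by patch**2 to account for the patch-wise summation."""
    embed_dim = linear.out_features
    conv = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
    with torch.no_grad():
        w = linear.weight.view(embed_dim, in_ch, 1, 1)                    # (c2, 6, 1, 1)
        conv.weight.copy_(w.expand(embed_dim, in_ch, patch, patch) / patch**2)
        conv.bias.copy_(linear.bias)
    return conv
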
Training Resources. (1) DiffusionSfM-CO3D: We train the sparse model using 4 H100 GPUs with a total batch size of 64 for 400,000 iterations, which takes approximately 2 days. To warm up the dense model, we freeze the DiT weights and train for 50,000 iterations. We then unfreeze the full model and continue training for another 250,000 iterations on 4 H100 GPUs with a batch size of 48, requiring an additional 2 days. (2) DiffusionSfM: This variant is trained with 8 H100 GPUs and a larger batch size. The sparse model is trained for 1,600,000 iterations with a total batch size of 288 (the first 1,080,000 iterations are run with 4 GPUs and a batch size of 64), which takes approximately two weeks. The dense model is trained for 800,000 iterations, including 50,000 warm-up iterations, using a batch size of 96 and taking around 7 days.

E. The Effect of Homogeneous Representation

To underscore the importance of the proposed homogeneous representation for ray origins and endpoints, we train a variant of DiffusionSfM using these components directly in R^3 (i.e., without using homogeneous coordinates). For this model, we employ a scale-invariant loss function, as used in DUSt3R [37]. The training loss curve for this model is shown in Fig. 10. Notably, the model fails to converge, with the training loss remaining persistently high. This failure occurs because our diffusion-based approach assumes input data within a reasonable range, as the Gaussian noise added during training has a fixed standard deviation of 1 (before scaling). Consequently, training scenes with substantial scale differences across components disrupt the model's learning process. In contrast, employing homogeneous coordinates enables the normalization of the input data to a unit norm, which not only stabilizes training and facilitates convergence but also provides an elegant representation of unbounded scene geometry.

F. Converting Ray Origins and Endpoints to Camera Poses

The camera centers for each input image are recovered by averaging the corresponding predicted ray origins. To determine camera rotations and intrinsics, we follow the method proposed by Zhang et al. [42], which involves solving for the optimal homography that aligns the predicted ray directions with those of an identity camera. For additional details, we refer readers to Zhang et al. [42].
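A sketch of the camera-center recovery (averaging the predicted per-pixel ray origins of each image); the rotation and intrinsics step, which solves for an optimal homography against an identity camera's ray directions as in Zhang et al. [42], is only indicated in the comments:

import numpy as np

def recover_camera_centers(origins, masks=None):
    """Average the predicted per-pixel ray origins of each image to obtain its
    camera center. origins: (N, H, W, 3); masks: optional (N, H, W) validity."""
    if masks is None:
        return origins.reshape(origins.shape[0], -1, 3).mean(axis=1)
    centers = []
    for o, m in zip(origins, masks):
        centers.append(o[m > 0].mean(axis=0))
    return np.stack(centers)

# Rotations and intrinsics are then obtained by fitting, per image, the
# homography that best maps the predicted ray directions (endpoints - origins)
# to the ray directions of an identity camera, following Zhang et al. [42];
# we refer to their method for the details of that least-squares solve.
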
[Figure 9: plot of rotation accuracy (R@15°, blue) and center accuracy ([email protected], orange) versus denoising timestep]

Figure 9. Performance of x0 -Prediction on CO3D Unseen Categories across Diffusion Denoising Timesteps (N = 8). The X-axis
represents the diffusion denoising timesteps, with T = 100 indicating predictions starting from Gaussian noise and T = 0 corresponding
to the clean sample. The Y-axis shows the accuracy for camera rotation (blue) and camera center (orange). Notably, DiffusionSfM achieves
peak performance at T = 90. As a result, in inference, we perform only 10 diffusion steps, significantly improving inference speed.

[Figure 10: training loss versus iterations (in thousands) for the variant without the homogeneous representation]
Figure 10. Training Loss Curve for DiffusionSfM without Homogeneous Representation. The X-axis represents training iterations (in
thousands, k), and the Y-axis denotes the loss value. Without incorporating a homogeneous representation for ray origins and endpoints,
the model struggles to train effectively due to significant scale differences across various scene components.
