Boosting Monocular Metric Depth Estimation via Bokeh Rendering

1 S-Lab, Nanyang Technological University
2 Beihang University
ICML 2026
BokehDepth teaser

BokehDepth decouples bokeh synthesis from depth prediction and uses lens-aware defocus as a supervision-free geometric cue to improve the accuracy and physical consistency of monocular depth estimation. Left: conventional pipelines predict depth from a single sharp image and then render bokeh from the resulting noisy depth map. Right: our two-stage framework, in which Stage-1 generates a calibrated bokeh stack from a single image and Stage-2, built on UniDepthV2, fuses defocus cues to produce sharper and more reliable metric depth.

Abstract

Bokeh rendering and depth estimation share a fundamental optical connection, yet existing methods fail to fully exploit this reciprocity. Conventional bokeh pipelines rely heavily on noisy depth maps that inevitably introduce visual artifacts. Conversely, existing monocular depth models typically follow two flawed paradigms. Generative diffusion-based frameworks often lack consistent metric scale. Meanwhile, feed-forward metric depth models frequently fail in textureless or distant regions where defocus blur can provide geometric information. We propose BokehDepth, a two-stage framework that treats synthetic defocus as a supervision-free geometric signal. In the first stage, a physically grounded generative model produces calibrated bokeh stacks from a single sharp input without requiring prior depth input. Subsequently, a lightweight defocus-aware aggregation module integrates these stacks into the encoder of a depth estimation framework. This mechanism allows the model to extract consistent geometric features from the defocus dimension while keeping the decoder architecture unchanged. Experiments demonstrate that BokehDepth achieves superior visual bokeh fidelity compared to depth-dependent rendering baselines and consistently enhances the metric accuracy of state-of-the-art monocular depth models.

Overview

BokehDepth overview

From monocular depth and depth-based bokeh to BokehDepth. (a) Standard monocular depth estimation predicts a depth map from a single RGB image. (b) Classical bokeh rendering takes an image and its depth map as input to synthesize bokeh. (c) BokehDepth first generates a calibrated bokeh stack from the input image, and then uses the induced defocus cues to enhance depth estimation.
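
To make the link between depth and defocus in (b) concrete, the sketch below computes a per-pixel circle of confusion from a depth map with the textbook thin-lens model. It is an illustration only: the thin-lens formula, the function name `circle_of_confusion`, and the default focal length and f-number are assumptions of this example, not the rendering model used by BokehDepth or by any particular baseline.

```python
import numpy as np

def circle_of_confusion(depth_m, focus_m, focal_mm=50.0, f_number=2.0):
    """Thin-lens circle-of-confusion diameter, in millimetres on the sensor.

    depth_m  : scene depth per pixel, in metres (array-like, > 0)
    focus_m  : focus distance, in metres (must exceed the focal length)
    focal_mm : focal length in millimetres (hypothetical default)
    f_number : aperture f-number (hypothetical default)
    """
    f = focal_mm / 1000.0          # focal length in metres
    aperture = f / f_number        # aperture diameter in metres
    depth = np.asarray(depth_m, dtype=np.float64)
    # c = A * f * |d - s| / (d * (s - f)), with s the focus distance
    coc_m = aperture * f * np.abs(depth - focus_m) / (depth * (focus_m - f))
    return coc_m * 1000.0          # metres -> millimetres

# Usage: blur is zero at the focus distance and grows away from it.
depths = np.array([1.0, 2.0, 4.0, 8.0])
coc = circle_of_confusion(depths, focus_m=2.0)
coc_wide = circle_of_confusion(depths, focus_m=2.0, f_number=1.4)
```

Because the blur diameter depends on depth, any error in the depth map propagates directly into the rendered bokeh, which is the weakness of pipeline (b); read in the other direction, observed blur carries information about depth, which is the cue that (c) exploits.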

Method

BokehDepth method

BokehDepth architecture. (a) Stage-1 bokeh generation augments a pretrained image-to-image (I2I) model, such as FLUX-Kontext, with a bokeh cross-attention adapter that takes a scalar bokeh strength K and produces a calibrated multi-strength bokeh stack from a single sharp image. (b) Stage-2 bokeh stack fusion inserts Divided Space Focus (DSF) Attention into a ViT encoder and uses FiLM conditioning to inject the bokeh stack along the defocus axis, then feeds the aggregated layerwise features to an unchanged DPT decoder to predict metric depth.
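
The Stage-2 idea of attending separately over space and over the defocus axis can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the page does not specify the internals of DSF Attention or of the FiLM layer, so the example borrows the factorisation used by divided space-time attention in video transformers, omits learned query/key/value projections, and derives the FiLM scale and shift from the bokeh strength K with hypothetical weight vectors `w_gamma` and `w_beta`. All function names are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention over the second-to-last axis.

    x: (..., T, D). Queries, keys and values are all x itself
    (no learned projections, to keep the sketch short).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., T, T)
    return softmax(scores, axis=-1) @ x                # (..., T, D)

def film(tokens, strengths, w_gamma, w_beta):
    """FiLM: per-channel scale and shift conditioned on bokeh strength K.

    tokens: (F, N, D) with F stack levels, N spatial tokens, D channels.
    strengths: (F,) one K per stack level.
    w_gamma, w_beta: (D,) hypothetical conditioning weights.
    """
    k = strengths[:, None, None]          # (F, 1, 1)
    gamma = 1.0 + k * w_gamma             # (F, 1, D), identity when K = 0
    beta = k * w_beta                     # (F, 1, D)
    return gamma * tokens + beta

def divided_space_focus_attention(tokens):
    """Attend along the focus axis, then along the spatial axis.

    tokens: (F, N, D); returns an array of the same shape.
    """
    # Focus attention: each spatial location attends across the F levels.
    x = np.swapaxes(tokens, 0, 1)         # (N, F, D)
    x = x + self_attention(x)             # residual connection
    x = np.swapaxes(x, 0, 1)              # (F, N, D)
    # Spatial attention: each level attends across its N spatial tokens.
    return x + self_attention(x)

# Usage: a stack of 3 bokeh levels, 16 patch tokens, 8 channels.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 16, 8))
strengths = np.array([0.0, 0.5, 1.0])
conditioned = film(tokens, strengths, rng.standard_normal(8), rng.standard_normal(8))
fused = divided_space_focus_attention(conditioned)
```

Factorising attention this way costs on the order of F² + N² score entries per token group rather than (F·N)² for joint attention over the whole stack, which is one plausible reason to keep the fusion module lightweight.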

Results

BokehDepth results

Qualitative results of BokehDepth with Depth Anything V2 as the base model. From top to bottom: the input image, three representative frames from the Stage-1 bokeh stack, the Stage-2 depth prediction, the error map of BokehDepth, the Depth Anything V2 prediction, the corresponding error map, the ground truth depth, the ΔError map that reports the per-pixel reduction in absolute depth error of BokehDepth over the base model, and the RGB image overlaid with green regions that mark where our method produces notable improvements. BokehDepth lowers depth errors on fine structures, weakly textured walls, and distant background regions, offering more distinct layer separation and steadier metric depth across varied scenes.
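
The ΔError map described above can be reproduced from two predictions and the ground truth. The sketch below follows the caption's definition (base-model absolute error minus BokehDepth absolute error, so positive values mean improvement). The validity rule `gt > 0` and the 0.1 m threshold used for the highlight mask are assumptions of this example; the page does not state how invalid pixels are handled or what counts as a "notable" improvement.

```python
import numpy as np

def delta_error_map(pred_ours, pred_base, gt, valid=None):
    """Per-pixel reduction in absolute depth error relative to the base model.

    Positive values mark pixels where pred_ours is closer to gt than pred_base.
    Pixels outside the valid mask are set to 0.
    """
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = gt > 0                       # assumed convention: 0 = no ground truth
    err_ours = np.abs(np.asarray(pred_ours, dtype=np.float64) - gt)
    err_base = np.abs(np.asarray(pred_base, dtype=np.float64) - gt)
    return np.where(valid, err_base - err_ours, 0.0)

def improvement_mask(delta, threshold=0.1):
    """Boolean mask of pixels improved by more than `threshold` metres (hypothetical cutoff)."""
    return delta > threshold

# Usage on a toy 2x2 depth map (metres); the bottom-left pixel has no ground truth.
gt = np.array([[2.0, 4.0], [0.0, 1.0]])
base = np.array([[2.5, 3.0], [1.0, 1.0]])
ours = np.array([[2.1, 3.8], [5.0, 1.2]])
delta = delta_error_map(ours, base, gt)
mask = improvement_mask(delta)
```

The boolean mask can then be alpha-blended in green over the RGB input to produce an overlay like the last row of the figure.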

BibTeX

@article{zhang2025bokehdepth,
  title={Boosting Monocular Metric Depth Estimation via Bokeh Rendering},
  author={Zhang, Hangwei and Fortes, Armando and Wei, Tianyi and Pan, Xingang},
  journal={arXiv preprint arXiv:2512.12425},
  year={2025}
}