Adversarial Examples are Misaligned in Diffusion Model Manifolds
January 12, 2024
Peter Lorenz, Ricard Durall, Janis Keuper
In recent years, diffusion models (DMs) have drawn significant attention for
their success in approximating data distributions, yielding state-of-the-art
generative results. Nevertheless, the versatility of these models extends
beyond their generative capabilities to encompass various vision applications,
such as image inpainting, segmentation, and adversarial robustness, among others.
This study is dedicated to the investigation of adversarial attacks through the
lens of diffusion models. However, our objective does not involve enhancing the
adversarial robustness of image classifiers. Instead, our focus lies in
utilizing the diffusion model to detect and analyze the anomalies introduced by
these attacks on images. To that end, we systematically examine the alignment
of the distributions of adversarial examples when subjected to the process of
transformation using diffusion models. The efficacy of this approach is
assessed across CIFAR-10 and ImageNet datasets, including varying image sizes
in the latter. The results demonstrate a notable capacity to discriminate
effectively between benign and attacked images, providing compelling evidence
that adversarial instances do not align with the learned manifold of the DMs.
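To make the detection recipe above concrete, here is a minimal Python sketch of the round-trip test: noise an image part-way through the forward process, map it back, and compare reconstruction errors, the intuition being that attacked images sit off the learned manifold and round-trip worse. The noising formula is the standard DDPM one; `dummy_denoise`, the timestep choice, and the toy perturbation are placeholders for a real pretrained diffusion model and attack, not the authors' implementation.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """DDPM forward process: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def dummy_denoise(x_t, t, alpha_bar):
    """Placeholder for the learned reverse process (naive rescaling only)."""
    return x_t / np.sqrt(alpha_bar[t])

def reconstruction_gap(x0, t, alpha_bar, rng):
    """Round-trip error; off-manifold (attacked) images should score higher."""
    x_hat = dummy_denoise(forward_noise(x0, t, alpha_bar, rng), t, alpha_bar)
    return float(np.mean((x0 - x_hat) ** 2))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
benign = rng.random((3, 32, 32))
attacked = benign + 0.03 * np.sign(rng.standard_normal((3, 32, 32)))  # FGSM-like toy perturbation
print(reconstruction_gap(benign, 200, alpha_bar, rng),
      reconstruction_gap(attacked, 200, alpha_bar, rng))
```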
Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo
January 12, 2024
Xunpeng Huang, Difan Zou, Hanze Dong, Yian Ma, Tong Zhang
stat.ML, cs.LG, math.OC, stat.CO
To sample from a general target distribution $p_*\propto e^{-f_*}$ beyond the
isoperimetric condition, Huang et al. (2023) proposed to perform sampling
through reverse diffusion, giving rise to Diffusion-based Monte Carlo (DMC).
Specifically, DMC follows the reverse SDE of a diffusion process that
transforms the target distribution to the standard Gaussian, utilizing a
non-parametric score estimation. However, the original DMC algorithm
encountered high gradient complexity, resulting in an exponential dependency on
the error tolerance $\epsilon$ of the obtained samples. In this paper, we
demonstrate that the high complexity of DMC originates from its redundant
design of score estimation, and propose a more efficient algorithm, called
RS-DMC, based on a novel recursive score estimation method. In particular, we
first divide the entire diffusion process into multiple segments and then
formulate the score estimation step (at any time step) as a series of
interconnected mean estimation and sampling subproblems accordingly, which are
correlated in a recursive manner. Importantly, we show that with a proper
design of the segment decomposition, all sampling subproblems will only need to
tackle a strongly log-concave distribution, which can be solved very
efficiently using Langevin-based samplers with a provably rapid convergence rate.
As a result, we prove that the gradient complexity of RS-DMC only has a
quasi-polynomial dependency on $\epsilon$, significantly improving on the
exponential gradient complexity of Huang et al. (2023). Furthermore, under
commonly used dissipative conditions, our algorithm is provably much faster
than the popular Langevin-based algorithms. Our algorithm design and
theoretical framework illuminate a novel direction for addressing sampling
problems, which could be of broader applicability in the community.
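For context, the non-parametric score estimation that DMC and RS-DMC build on rests on a standard identity for the Ornstein-Uhlenbeck forward process; in our notation (which may differ from the paper's),

$$
p_t(x) \;\propto\; \int p_*(y)\, e^{-\frac{\|x - e^{-t} y\|^2}{2(1 - e^{-2t})}} \,\mathrm{d}y,
\qquad
\nabla \ln p_t(x) \;=\; \mathbb{E}_{y \sim q_{t,x}}\!\left[\frac{e^{-t} y - x}{1 - e^{-2t}}\right],
$$

where $q_{t,x}(y) \propto p_*(y)\, e^{-\|x - e^{-t} y\|^2 / (2(1 - e^{-2t}))}$. Estimating this conditional expectation requires sampling from $q_{t,x}$, and the segment decomposition described above is what keeps such sampling subproblems strongly log-concave.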
Demystifying Variational Diffusion Models
January 11, 2024
Fabio De Sousa Ribeiro, Ben Glocker
Despite the growing popularity of diffusion models, gaining a deep
understanding of the model class remains somewhat elusive for the uninitiated
in non-equilibrium statistical physics. With that in mind, we present what we
believe is a more straightforward introduction to diffusion models using
directed graphical modelling and variational Bayesian principles, which imposes
relatively fewer prerequisites on the average reader. Our exposition
constitutes a comprehensive technical review spanning from foundational
concepts like deep latent variable models to recent advances in continuous-time
diffusion-based modelling, highlighting theoretical connections between model
classes along the way. We provide additional mathematical insights that were
omitted in the seminal works whenever possible to aid in understanding, while
avoiding the introduction of new notation. We envision this article serving as
a useful educational supplement for both researchers and practitioners in the
area, and we welcome feedback and contributions from the community at
https://github.com/biomedia-mira/demystifying-diffusion.
FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation
January 11, 2024
Timur Sattarov, Marco Schreyer, Damian Borth
Realistic synthetic tabular data generation encounters significant challenges
in preserving privacy, especially when dealing with sensitive information in
domains like finance and healthcare. In this paper, we introduce
\textit{Federated Tabular Diffusion} (FedTabDiff) for generating high-fidelity
mixed-type tabular data without centralized access to the original tabular
datasets. Leveraging the strengths of \textit{Denoising Diffusion Probabilistic
Models} (DDPMs), our approach addresses the inherent complexities in tabular
data, such as mixed attribute types and implicit relationships. More
critically, FedTabDiff realizes a decentralized learning scheme that permits
multiple entities to collaboratively train a generative model while respecting
data privacy and locality. We extend DDPMs into the federated setting for
tabular data generation, which includes a synchronous update scheme and
weighted averaging for effective model aggregation. Experimental evaluations on
real-world financial and medical datasets attest to the framework’s capability
to produce synthetic data that maintains high fidelity, utility, privacy, and
coverage.
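As a rough sketch of the synchronous update and weighted averaging mentioned above, the snippet below performs FedAvg-style aggregation of per-client DDPM weights; parameter names and the round structure are our own illustration, not FedTabDiff's actual code.

```python
from typing import Dict, List
import numpy as np

def weighted_average(client_weights: List[Dict[str, np.ndarray]],
                     client_sizes: List[int]) -> Dict[str, np.ndarray]:
    """FedAvg-style aggregation: each parameter tensor is averaged across
    clients, weighted by the size of the client's local tabular dataset."""
    total = float(sum(client_sizes))
    return {name: sum(w[name] * (n / total)
                      for w, n in zip(client_weights, client_sizes))
            for name in client_weights[0]}

# One synchronous round: each client trains its local copy of the DDPM on its
# private data, then the server aggregates and broadcasts the result.
rng = np.random.default_rng(0)
clients = [{"denoiser.weight": rng.standard_normal((4, 4))} for _ in range(3)]
global_weights = weighted_average(clients, client_sizes=[1000, 500, 2000])
print(global_weights["denoiser.weight"].shape)
```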
Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
January 09, 2024
Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang
Recent advances in generative diffusion models have enabled the previously
unfeasible capability of generating 3D assets from a single input image or a
text prompt. In this work, we aim to enhance the quality and functionality of
these models for the task of creating controllable, photorealistic human
avatars. We achieve this by integrating a 3D morphable model into the
state-of-the-art multiview-consistent diffusion approach. We demonstrate that
accurate conditioning of a generative pipeline on the articulated 3D model
enhances the baseline model performance on the task of novel view synthesis
from a single image. More importantly, this integration facilitates a seamless
and accurate incorporation of facial expression and body pose control into the
generation process. To the best of our knowledge, our proposed framework is the
first diffusion model to enable the creation of fully 3D-consistent,
animatable, and photorealistic human avatars from a single image of an unseen
subject; extensive quantitative and qualitative evaluations demonstrate the
advantages of our approach over existing state-of-the-art avatar creation
models on both novel view and novel expression synthesis tasks.
EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
January 09, 2024
Jingyuan Yang, Jiawei Feng, Hui Huang
Recent years have witnessed remarkable progress in the image generation task,
where users can create visually astonishing, high-quality images. However,
existing text-to-image diffusion models are proficient at generating concrete
concepts (e.g., dogs) but encounter challenges with more abstract ones (e.g.,
emotions). Several efforts have been made to modify image emotions with color
and style adjustments, but they face limitations in effectively conveying
emotions when the image content is fixed. In this work, we introduce Emotional
Image Content Generation (EICG), a new task to generate semantically clear and
emotion-faithful images given
emotion categories. Specifically, we propose an emotion space and construct a
mapping network to align it with the powerful Contrastive Language-Image
Pre-training (CLIP) space, providing a concrete interpretation of abstract
emotions. Attribute loss and emotion confidence are further proposed to ensure
the semantic diversity and emotion fidelity of the generated images. Our method
outperforms the state-of-the-art text-to-image approaches both quantitatively
and qualitatively, where we derive three custom metrics, i.e., emotion
accuracy, semantic clarity and semantic diversity. In addition to generation,
our method can help emotion understanding and inspire emotional art design.
Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models
January 09, 2024
Xuewen Liu, Zhikai Li, Junrui Xiao, Qingyi Gu
Diffusion models have achieved great success in image generation tasks
through iterative noise estimation. However, the heavy denoising process and
complex neural networks hinder their low-latency applications in real-world
scenarios. Quantization can effectively reduce model complexity, and
post-training quantization (PTQ), which does not require fine-tuning, is highly
promising in accelerating the denoising process. Unfortunately, we find that
due to the highly dynamic distribution of activations in different denoising
steps, existing PTQ methods for diffusion models suffer from distribution
mismatch issues at both the calibration sample level and the reconstruction
output level, which makes their performance far from satisfactory, especially in low-bit
cases. In this paper, we propose Enhanced Distribution Alignment for
Post-Training Quantization of Diffusion Models (EDA-DM) to address the above
issues. Specifically, at the calibration sample level, we select calibration
samples based on the density and diversity in the latent space, thus
facilitating the alignment of their distribution with the overall samples; and
at the reconstruction output level, we propose Fine-grained Block
Reconstruction, which can align the outputs of the quantized model and the
full-precision model at different network granularity. Extensive experiments
demonstrate that EDA-DM outperforms the existing post-training quantization
frameworks in both unconditional and conditional generation scenarios. At
low-bit precision, the quantized models with our method even outperform the
full-precision models on most datasets.
Stable generative modeling using diffusion maps
January 09, 2024
Georg Gottwald, Fengyi Li, Youssef Marzouk, Sebastian Reich
stat.ML, cs.LG, cs.NA, math.NA, stat.CO
We consider the problem of sampling from an unknown distribution for which
only a sufficiently large number of training samples are available. Such
settings have recently drawn considerable interest in the context of generative
modelling. In this paper, we propose a generative model combining diffusion
maps and Langevin dynamics. Diffusion maps are used to approximate the drift
term from the available training samples, which is then implemented in a
discrete-time Langevin sampler to generate new samples. By setting the kernel
bandwidth to match the time step size used in the unadjusted Langevin
algorithm, our method effectively circumvents any stability issues typically
associated with time-stepping stiff stochastic differential equations. More
precisely, we introduce a novel split-step scheme, ensuring that the generated
samples remain within the convex hull of the training samples. Our framework
can be naturally extended to generate conditional samples. We demonstrate the
performance of our proposed scheme through experiments on synthetic datasets
with increasing dimensions and on a stochastic subgrid-scale parametrization
conditional sampling problem.
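To convey the flavor of the split-step scheme, here is a deliberately loose Python sketch: the state is perturbed by noise whose scale matches the kernel bandwidth, then pulled back to a kernel-weighted convex combination of training samples, so every iterate stays inside their convex hull. This is our simplification under those assumptions, not the paper's exact algorithm (which builds the drift from diffusion maps).

```python
import numpy as np

def kernel_weights(x, samples, eps):
    """Gaussian kernel weights of the current state against the training set."""
    d2 = np.sum((samples - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * eps))
    return w / w.sum()

def split_step_langevin(samples, n_steps=500, eps=0.05, rng=None):
    """Noise step with variance tied to the kernel bandwidth eps, followed by a
    drift step that maps to a convex combination of training samples."""
    rng = np.random.default_rng() if rng is None else rng
    x = samples[rng.integers(len(samples))].copy()
    for _ in range(n_steps):
        x_noisy = x + np.sqrt(eps) * rng.standard_normal(x.shape)
        x = kernel_weights(x_noisy, samples, eps) @ samples
    return x

data = np.random.default_rng(1).standard_normal((200, 2)) @ np.array([[1.0, 0.4], [0.0, 0.8]])
print(split_step_langevin(data))
```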
D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement
January 08, 2024
Danqi Yan, Qing Gao, Yuepeng Qian, Xinxing Chen, Chenglong Fu, Yuquan Leng
Three-dimensional (3D) human pose estimation using a monocular camera has
gained increasing attention due to its ease of implementation and the abundance
of data available from daily life. However, owing to the inherent depth
ambiguity in images, the accuracy of existing monocular camera-based 3D pose
estimation methods remains unsatisfactory, and the estimated 3D poses usually
include considerable noise. By observing the histogram of this noise, we find
that each dimension of the noise follows a certain distribution, which indicates the
possibility for a neural network to learn the mapping between noisy poses and
ground truth poses. In this work, in order to obtain more accurate 3D poses, a
Diffusion-based 3D Pose Refiner (D3PRefiner) is proposed to refine the output
of any existing 3D pose estimator. We first introduce a conditional
multivariate Gaussian distribution to model the distribution of noisy 3D poses,
using paired 2D poses and noisy 3D poses as conditions to achieve greater
accuracy. Additionally, we leverage the architecture of current diffusion
models to convert the distribution of noisy 3D poses into ground truth 3D
poses. To evaluate the effectiveness of the proposed method, two
state-of-the-art sequence-to-sequence 3D pose estimators are used as basic 3D
pose estimation models, and the proposed method is evaluated on different types
of 2D poses and different lengths of the input sequence. Experimental results
demonstrate the proposed architecture can significantly improve the performance
of current sequence-to-sequence 3D pose estimators, with a reduction of at
least 10.3% in the mean per joint position error (MPJPE) and at least 11.0% in
the Procrustes MPJPE (P-MPJPE).
Reflected Schrödinger Bridge for Constrained Generative Modeling
January 06, 2024
Wei Deng, Yu Chen, Nicole Tianjiao Yang, Hengrong Du, Qi Feng, Ricky T. Q. Chen
Diffusion models have become the go-to method for large-scale generative
models in real-world applications. These applications often involve data
distributions confined within bounded domains, typically requiring ad-hoc
thresholding techniques for boundary enforcement. Reflected diffusion models
(Lou23) aim to enhance generalizability by generating the data distribution
through a backward process governed by reflected Brownian motion. However,
reflected diffusion models may not easily adapt to diverse domains without the
derivation of proper diffeomorphic mappings and do not guarantee optimal
transport properties. To overcome these limitations, we introduce the Reflected
Schrödinger Bridge algorithm: an entropy-regularized optimal transport approach
tailored for generating data within diverse bounded domains. We derive elegant
reflected forward-backward stochastic differential equations with Neumann and
Robin boundary conditions, extend divergence-based likelihood training to
bounded domains, and explore natural connections to entropic optimal transport
for the study of approximate linear convergence - a valuable insight for
practical training. Our algorithm yields robust generative modeling in diverse
domains, and its scalability is demonstrated in real-world constrained
generative modeling through standard image benchmarks.
MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond
January 06, 2024
Yupei Lin, Xiaoyu Xian, Yukai Shi, Liang Lin
Recently, text-to-image diffusion models have become a new paradigm in image
processing, including content generation, image restoration and
image-to-image translation. Given a target prompt, Denoising Diffusion
Probabilistic Models (DDPM) are able to generate realistic yet eligible images.
With this appealing property, the image translation task has the potential to
be free from target image samples for supervision. By using a target text
prompt for domain adaptation, the diffusion model is able to implement zero-shot
image-to-image translation advantageously. However, the sampling and inversion
processes of DDPM are stochastic, and thus the inversion process often fails to
reconstruct the input content. Specifically, the displacement effect gradually
accumulates during the diffusion and inversion processes, which leads to the
reconstructed results deviating from the source domain. To make
reconstruction explicit, we propose a prompt redescription strategy to realize
a mirror effect between the source and reconstructed image in the diffusion
model (MirrorDiffusion). More specifically, a prompt redescription mechanism is
investigated to align the text prompts with latent code at each time step of
the Denoising Diffusion Implicit Models (DDIM) inversion to pursue a
structure-preserving reconstruction. With the revised DDIM inversion,
MirrorDiffusion is able to realize accurate zero-shot image translation by
editing optimized text prompts and latent code. Extensive experiments
demonstrate that MirrorDiffusion achieves superior performance over the
state-of-the-art methods on zero-shot image translation benchmarks by clear
margins and practical model stability.
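Since the method hinges on DDIM inversion, here is a compact sketch of one deterministic inversion step in the standard DDIM parameterization; `eps_model` and `cond` are hypothetical stand-ins for the conditional noise predictor and the (re-described) prompt embedding, so this shows the generic mechanism rather than MirrorDiffusion's own code.

```python
import numpy as np

def ddim_inversion_step(x_t, t, t_next, alpha_bar, eps_model, cond):
    """Map x_t to x_{t_next} (t_next > t) deterministically by predicting the
    clean latent and re-noising it at the next level; prompt redescription
    keeps `cond` aligned with the latent at every such step."""
    eps = eps_model(x_t, t, cond)
    a_t, a_next = alpha_bar[t], alpha_bar[t_next]
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    return np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps

# Toy usage with a dummy predictor (a real system would call a UNet here).
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x = np.random.default_rng(0).standard_normal((4, 64, 64))
x = ddim_inversion_step(x, 100, 120, alpha_bar,
                        eps_model=lambda z, t, c: np.zeros_like(z), cond=None)
print(x.shape)
```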
An Event-Oriented Diffusion-Refinement Method for Sparse Events Completion
January 06, 2024
Bo Zhang, Yuqi Han, Jinli Suo, Qionghai Dai
Event cameras or dynamic vision sensors (DVS) record asynchronous response to
brightness changes instead of conventional intensity frames, and feature
ultra-high sensitivity at low bandwidth. The new mechanism demonstrates great
advantages in challenging scenarios with fast motion and large dynamic range.
However, the recorded events might be highly sparse due to either limited
hardware bandwidth or extreme photon starvation in harsh environments. To
unlock the full potential of event cameras, we propose an inventive event
sequence completion approach conforming to the unique characteristics of event
data in both the processing stage and the output form. Specifically, we treat
event streams as 3D event clouds in the spatiotemporal domain, develop a
diffusion-based generative model to generate dense clouds in a coarse-to-fine
manner, and recover exact timestamps to maintain the temporal resolution of raw
data successfully. To validate the effectiveness of our method comprehensively,
we perform extensive experiments on three widely used public datasets with
different spatial resolutions, and additionally collect a novel event dataset
covering diverse scenarios with highly dynamic motions and under harsh
illumination. Besides generating high-quality dense events, our method can
benefit downstream applications such as object classification and intensity
frame reconstruction.
Fair Sampling in Diffusion Models through Switching Mechanism
January 06, 2024
Yujin Choi, Jinseong Park, Hoki Kim, Jaewook Lee, Saeroom Park
Diffusion models have shown their effectiveness in generation tasks by
well-approximating the underlying probability distribution. However, diffusion
models are known to suffer from an amplified inherent bias from the training
data in terms of fairness. While the sampling process of diffusion models can
be controlled by conditional guidance, previous works have attempted to find
empirical guidance to achieve quantitative fairness. To address this
limitation, we propose a fairness-aware sampling method called
\textit{attribute switching} mechanism for diffusion models. Without additional
training, the proposed sampling can obfuscate sensitive attributes in generated
data without relying on classifiers. We mathematically prove and experimentally
demonstrate the effectiveness of the proposed method on two key aspects: (i)
the generation of fair data and (ii) the preservation of the utility of the
generated data.
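A bare-bones sketch of what a switching-based sampler can look like: condition on one attribute for the early (high-noise) steps and switch to the other for the remainder. The `model_step` callable, the attribute labels, and the switch point are assumptions for illustration; the paper derives when such a switch is justified.

```python
def sample_with_attribute_switching(model_step, x_T, timesteps,
                                    attr_a, attr_b, t_switch):
    """Reverse-diffusion loop that conditions on attr_a while t > t_switch and
    on attr_b afterwards, decoupling the sensitive attribute of the final
    sample from the content formed early in sampling."""
    x = x_T
    for t in timesteps:                      # timesteps given in decreasing order
        attr = attr_a if t > t_switch else attr_b
        x = model_step(x, t, attr)           # one conditional reverse step
    return x

# Toy usage with a dummy reverse step standing in for a trained model.
out = sample_with_attribute_switching(
    model_step=lambda x, t, a: 0.99 * x,
    x_T=1.0, timesteps=range(1000, 0, -1),
    attr_a="attribute_0", attr_b="attribute_1", t_switch=600)
print(out)
```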
SAR Despeckling via Regional Denoising Diffusion Probabilistic Model
January 06, 2024
Xuran Hu, Ziqiang Xu, Zhihan Chen, Zhengpeng Feng, Mingzhe Zhu, Ljubisa Stankovic
Speckle noise poses a significant challenge in maintaining the quality of
synthetic aperture radar (SAR) images, so SAR despeckling techniques have drawn
increasing attention. Despite the tremendous advancements of deep learning in
fixed-scale SAR image despeckling, these methods still struggle to deal with
large-scale SAR images. To address this problem, this paper introduces a novel
despeckling approach termed Region Denoising Diffusion Probabilistic Model
(R-DDPM) based on generative models. R-DDPM enables versatile despeckling of
SAR images across various scales, accomplished within a single training
session. Moreover, artifacts in the fused SAR images can be effectively
avoided by using region-guided inverse sampling. Experiments with our proposed
R-DDPM on Sentinel-1 data demonstrate superior performance over existing
methods.
Latte: Latent Diffusion Transformer for Video Generation
January 05, 2024
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao
We propose a novel Latent Diffusion Transformer, namely Latte, for video
generation. Latte first extracts spatio-temporal tokens from input videos and
then adopts a series of Transformer blocks to model video distribution in the
latent space. In order to model a substantial number of tokens extracted from
videos, four efficient variants are introduced from the perspective of
decomposing the spatial and temporal dimensions of input videos. To improve the
quality of generated videos, we determine the best practices of Latte through
rigorous experimental analysis, including video clip patch embedding, model
variants, timestep-class information injection, temporal positional embedding,
and learning strategies. Our comprehensive evaluation demonstrates that Latte
achieves state-of-the-art performance across four standard video generation
datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In
addition, we extend Latte to the text-to-video generation (T2V) task, where it
achieves results comparable to recent T2V models. We strongly believe
that Latte provides valuable insights for future research on incorporating
Transformers into diffusion models for video generation.
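As a concrete (if simplified) picture of the spatio-temporal tokenization step, the snippet below splits a clip into non-overlapping space-time patches and flattens each into a token; the patch sizes and layout are illustrative choices, not Latte's exact embedding.

```python
import numpy as np

def video_to_tokens(video, pt=2, ph=8, pw=8):
    """Split a video of shape (T, H, W, C) into non-overlapping spatio-temporal
    patches of size (pt, ph, pw) and flatten each patch into one token."""
    T, H, W, C = video.shape
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)          # group the patch dims together
    return v.reshape(-1, pt * ph * pw * C)        # (num_tokens, token_dim)

tokens = video_to_tokens(np.random.rand(16, 32, 32, 3))
print(tokens.shape)  # (8 * 4 * 4, 2 * 8 * 8 * 3) = (128, 384)
```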
The Rise of Diffusion Models in Time-Series Forecasting
January 05, 2024
Caspar Meijer, Lydia Y. Chen
This survey delves into the application of diffusion models in time-series
forecasting. Diffusion models are demonstrating state-of-the-art results in
various fields of generative AI. The paper includes comprehensive background
information on diffusion models, detailing their conditioning methods and
reviewing their use in time-series forecasting. The analysis covers 11 specific
time-series implementations, the intuition and theory behind them, the
effectiveness on different datasets, and a comparison among each other. Key
contributions of this work are the thorough exploration of diffusion models’
applications in time-series forecasting and a chronologically ordered overview
of these models. Additionally, the paper offers an insightful discussion on the
current state-of-the-art in this domain and outlines potential future research
directions. This serves as a valuable resource for researchers in AI and
time-series analysis, offering a clear view of the latest advancements and
future potential of diffusion models.
Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors
January 05, 2024
Top Piriyakulkij, Yingheng Wang, Volodymyr Kuleshov
We propose denoising diffusion variational inference (DDVI), an approximate
inference algorithm for latent variable models which relies on diffusion models
as expressive variational posteriors. Our method augments variational
posteriors with auxiliary latents, which yields an expressive class of models
that perform diffusion in latent space by reversing a user-specified noising
process. We fit these models by optimizing a novel lower bound on the marginal
likelihood inspired by the wake-sleep algorithm. Our method is easy to
implement (it fits a regularized extension of the ELBO), is compatible with
black-box variational inference, and outperforms alternative classes of
approximate posteriors based on normalizing flows or adversarial networks. When
applied to deep latent variable models, our method yields the denoising
diffusion VAE (DD-VAE) algorithm. We use this algorithm on a motivating task in
biology – inferring latent ancestry from human genomes – outperforming strong
baselines on the Thousand Genomes dataset.
Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss
January 05, 2024
Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick Von Platen
Stable Diffusion XL (SDXL) has become the best open source text-to-image
model (T2I) for its versatility and top-notch image quality. Efficiently
addressing the computational demands of SDXL models is crucial for wider reach
and applicability. In this work, we introduce two scaled-down variants, Segmind
Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter
UNets, respectively, achieved through progressive removal using layer-level
losses focusing on reducing the model size while preserving generative quality.
We release these models' weights at https://hf.co/Segmind. Our methodology
involves the elimination of residual networks and transformer blocks from the
U-Net structure of SDXL, resulting in significant reductions in parameters and
latency. Our compact models effectively emulate the original SDXL by
capitalizing on transferred knowledge, achieving competitive results against
larger multi-billion parameter SDXL. Our work underscores the efficacy of
knowledge distillation coupled with layer-level losses in reducing model size
while preserving the high-quality generative capabilities of SDXL, thus
facilitating more accessible deployment in resource-constrained environments.
Bring Metric Functions into Diffusion Models
January 04, 2024
Jie An, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo
We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising
Diffusion Probabilistic Model (DDPM) by effectively incorporating additional
metric functions in training. Metric functions such as the LPIPS loss have been
proven highly effective in consistency models derived from score matching.
However, for the diffusion counterparts, the methodology and efficacy of adding
extra metric functions remain unclear. One major challenge is the mismatch
between the noise predicted by a DDPM at each step and the desired clean image
that the metric function works well on. To address this problem, we propose
Cas-DM, a network architecture that cascades two network modules to effectively
apply metric functions to the diffusion model training. The first module,
similar to a standard DDPM, learns to predict the added noise and is unaffected
by the metric function. The second cascaded module learns to predict the clean
image, thereby facilitating the metric function computation. Experiment results
show that the proposed diffusion model backbone enables the effective use of
the LPIPS loss, leading to state-of-the-art image quality (FID, sFID, IS) on
various established benchmarks.
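A rough sketch of the two-branch training objective the abstract describes, with the networks and the metric replaced by hypothetical callables (a real implementation would use a UNet for each module and LPIPS as the metric):

```python
import numpy as np

def cas_dm_losses(x0, t, alpha_bar, eps_net, clean_net, metric_fn, rng):
    """Module 1 (eps_net) predicts the added noise and only sees the MSE term;
    module 2 (clean_net) predicts the clean image so that a metric loss can be
    applied without touching the noise-prediction branch."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_pred = eps_net(x_t, t)
    noise_loss = np.mean((eps_pred - eps) ** 2)        # standard DDPM objective
    x0_pred = clean_net(x_t, eps_pred, t)              # cascaded second module
    metric_loss = metric_fn(x0_pred, x0)               # e.g. LPIPS in the paper
    return noise_loss, metric_loss

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
losses = cas_dm_losses(
    rng.random((3, 32, 32)), 300, alpha_bar,
    eps_net=lambda x, t: np.zeros_like(x),
    clean_net=lambda x, e, t: x,
    metric_fn=lambda a, b: float(np.mean(np.abs(a - b))),
    rng=rng)
print(losses)
```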
Energy based diffusion generator for efficient sampling of Boltzmann distributions
January 04, 2024
Yan Wang, Ling Guo, Hao Wu, Tao Zhou
We introduce a novel sampler called the energy based diffusion generator for
generating samples from arbitrary target distributions. The sampling model
employs a structure similar to a variational autoencoder, utilizing a decoder
to transform latent variables from a simple distribution into random variables
approximating the target distribution, and we design an encoder based on the
diffusion model. Leveraging the powerful modeling capacity of the diffusion
model for complex distributions, we can obtain an accurate variational estimate
of the Kullback-Leibler divergence between the distributions of the generated
samples and the target. Moreover, we propose a decoder based on generalized
Hamiltonian dynamics to further enhance sampling performance. Through empirical
evaluation, we demonstrate the effectiveness of our method across various
complex distribution functions, showcasing its superiority compared to existing
methods.
Improving Diffusion-Based Image Synthesis with Context Prediction
January 04, 2024
Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui
Diffusion models are a new class of generative models, and have dramatically
promoted image generation with unprecedented quality and diversity. Existing
diffusion models mainly try to reconstruct the input image from a corrupted one
with a pixel-wise or feature-wise constraint along spatial axes. However, such
point-based reconstruction may fail to make each predicted pixel/feature fully
preserve its neighborhood context, impairing diffusion-based image synthesis.
As a powerful source of automatic supervisory signal, context has been well
studied for learning representations. Inspired by this, we for the first time
propose ConPreDiff to improve diffusion-based image synthesis with context
prediction. We explicitly reinforce each point to predict its neighborhood
context (i.e., multi-stride features/tokens/pixels) with a context decoder at
the end of the diffusion denoising blocks in the training stage, and remove the decoder
for inference. In this way, each point can better reconstruct itself by
preserving its semantic connections with neighborhood context. This new
paradigm of ConPreDiff can generalize to arbitrary discrete and continuous
diffusion backbones without introducing extra parameters in sampling procedure.
Extensive experiments are conducted on unconditional image generation,
text-to-image generation and image inpainting tasks. Our ConPreDiff
consistently outperforms previous methods and achieves new SOTA text-to-image
generation results on MS-COCO, with a zero-shot FID score of 6.21.
CoMoSVC: Consistency Model-based Singing Voice Conversion
January 03, 2024
Yiwen Lu, Zhen Ye, Wei Xue, Xu Tan, Qifeng Liu, Yike Guo
eess.AS, cs.AI, cs.LG, cs.SD
The diffusion-based Singing Voice Conversion (SVC) methods have achieved
remarkable performance, producing natural audio with high similarity to the
target timbre. However, the iterative sampling process results in slow
inference speed, and acceleration thus becomes crucial. In this paper, we
propose CoMoSVC, a consistency model-based SVC method, which aims to achieve
both high-quality generation and high-speed sampling. A diffusion-based teacher
model is first specially designed for SVC, and a student model is further
distilled under self-consistency properties to achieve one-step sampling.
Experiments on a single NVIDIA RTX 4090 GPU reveal that although CoMoSVC has a
significantly faster inference speed than the state-of-the-art (SOTA)
diffusion-based SVC system, it still achieves comparable or superior conversion
performance based on both subjective and objective metrics. Audio samples and
codes are available at https://comosvc.github.io/.
DiffYOLO: Object Detection for Anti-Noise via YOLO and Diffusion Models
January 03, 2024
Yichen Liu, Huajian Zhang, Daqing Gao
Object detection models, represented by the YOLO series, have been widely used
and have achieved strong results on high-quality datasets, but not all working
conditions are ideal. To address the problem of locating targets in low-quality
data, existing methods either train a new object detection network or require a
large collection of low-quality datasets for training. Instead, in this paper
we propose a framework applied to YOLO models, called DiffYOLO. Specifically,
we extract feature maps from denoising diffusion probabilistic models to
enhance well-trained detectors, which allows us to fine-tune YOLO on
high-quality datasets and test on low-quality datasets. The results show that
this framework not only improves performance on noisy datasets but also
preserves detection quality on high-quality test datasets. We will supplement
more experiments later (with various datasets and network architectures).
DDPM based X-ray Image Synthesizer
January 03, 2024
Praveen Mahaulpatha, Thulana Abeywardane, Tomson George
Access to high-quality datasets in the medical industry limits machine
learning model performance. To address this issue, we propose a Denoising
Diffusion Probabilistic Model (DDPM) combined with a UNet architecture for
X-ray image synthesis. Focusing on pneumonia, our methodology
employs over 3,000 pneumonia X-ray images obtained from Kaggle for training.
Results demonstrate the effectiveness of our approach, as the model
successfully generated realistic images with low Mean Squared Error (MSE). The
synthesized images showed distinct differences from non-pneumonia images,
highlighting the model’s ability to capture key features of positive cases.
Beyond pneumonia, the applications of this synthesizer extend to various
medical conditions, provided an ample dataset is available. The capability to
produce high-quality images can potentially enhance machine learning models’
performance, aiding in more accurate and efficient medical diagnoses. This
innovative DDPM-based X-ray photo synthesizer presents a promising avenue for
addressing the scarcity of positive medical image datasets, paving the way for
improved medical image analysis and diagnosis in the healthcare industry.
S2-DMs: Skip-Step Diffusion Models
January 03, 2024
Yixuan Wang, Shuangyin Li
Diffusion models have emerged as powerful generative tools, rivaling GANs in
sample quality and mirroring the likelihood scores of autoregressive models. A
subset of these models, exemplified by DDIMs, exhibit an inherent asymmetry:
they are trained over $T$ steps but only sample from a subset of $T$ during
generation. This selective sampling approach, though optimized for speed,
inadvertently misses out on vital information from the unsampled steps, leading
to potential compromises in sample quality. To address this issue, we present
S$^{2}$-DMs, a new training method that uses an innovative $L_{skip}$ loss,
meticulously designed to reintegrate the information omitted during
the selective sampling phase. The benefits of this approach are manifold: it
notably enhances sample quality, is exceptionally simple to implement, requires
minimal code modifications, and is flexible enough to be compatible with
various sampling algorithms. On the CIFAR10 dataset, models trained using our
algorithm showed an improvement of 3.27% to 14.06% over models trained with
traditional methods across various sampling algorithms (DDIMs, PNDMs, DEIS) and
different numbers of sampling steps (10, 20, …, 1000). On the CELEBA dataset,
the improvement ranged from 8.97% to 27.08%. Access to the code and additional
resources is provided on GitHub.
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
January 02, 2024
Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
cs.SD, cs.AI, cs.CL, eess.AS
Recent advancements in diffusion models and large language models (LLMs) have
significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning
AIGC application designed to generate audio from natural language prompts, is
attracting increasing attention. However, existing TTA studies often struggle
with generation quality and text-audio alignment, especially for complex
textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I)
diffusion models, we introduce Auffusion, a TTA system adapting T2I model
frameworks to TTA task, by effectively leveraging their inherent generative
strengths and precise cross-modal alignment. Our objective and subjective
evaluations demonstrate that Auffusion surpasses previous TTA approaches using
limited data and computational resources. Furthermore, previous studies in T2I
recognize the significant impact of encoder choice on cross-modal alignment,
such as fine-grained details and object binding, while similar evaluations are
lacking in prior TTA works. Through comprehensive ablation studies and
innovative cross-attention map visualizations, we provide insightful
assessments of text-audio alignment in TTA. Our findings reveal Auffusion’s
superior capability in generating audio that accurately matches textual
descriptions, which is further demonstrated in several related tasks, such as
audio style transfer, inpainting and other manipulations. Our implementation
and demos are available at https://auffusion.github.io.
DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition
January 01, 2024
Parul Gupta, Tuan Nguyen, Abhinav Dhall, Munawar Hayat, Trung Le, Thanh-Toan Do
The task of Visual Relationship Recognition (VRR) aims to identify
relationships between two interacting objects in an image and is particularly
challenging due to the widely-spread and highly imbalanced distribution of
<subject, relation, object> triplets. To overcome the resultant performance
bias in existing VRR approaches, we introduce DiffAugment – a method which
first augments the tail classes in the linguistic space by making use of
WordNet and then utilizes the generative prowess of Diffusion Models to expand
the visual space for minority classes. We propose a novel hardness-aware
component in diffusion which is based upon the hardness of each <S,R,O> triplet
and demonstrate the effectiveness of hardness-aware diffusion in generating
visual embeddings for the tail classes. We also propose a novel subject and
object based seeding strategy for diffusion sampling which improves the
discriminative capability of the generated visual embeddings. Extensive
experimentation on the GQA-LT dataset shows favorable gains in the
subject/object and relation average per-class accuracy using Diffusion
augmented samples.
DiffMorph: Text-less Image Morphing with Diffusion Models
January 01, 2024
Shounak Chatterjee
Text-conditioned image generation models are a prevalent use of AI image
synthesis, yet intuitively controlling output guided by an artist remains
challenging. Current methods require multiple images and textual prompts for
each object to specify them as concepts to generate a single customized image.
On the other hand, our work, \verb|DiffMorph|, introduces a novel approach
that synthesizes images that mix concepts without the use of textual prompts.
Our work integrates a sketch-to-image module to incorporate user sketches as
input. \verb|DiffMorph| takes an initial image with conditioning artist-drawn
sketches to generate a morphed image.
We employ a pre-trained text-to-image diffusion model and fine-tune it to
reconstruct each image faithfully. We seamlessly merge images and concepts from
sketches into a cohesive composition. The image generation capability of our
work is demonstrated through our results and a comparison of these with
prompt-based image generation.
Diffusion Models, Image Super-Resolution And Everything: A Survey
January 01, 2024
Brian B. Moser, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Sebastian Palacio, Andreas Dengel
cs.CV, cs.AI, cs.LG, cs.MM
Diffusion Models (DMs) represent a significant advancement in image
Super-Resolution (SR), aligning technical image quality more closely with human
preferences and expanding SR applications. DMs address critical limitations of
previous methods, enhancing overall realism and details in SR images. However,
DMs suffer from color-shifting issues, and their high computational costs call
for efficient sampling alternatives, underscoring the challenge of balancing
computational efficiency and image quality. This survey gives an overview of
DMs applied to image SR and offers a detailed analysis that underscores the
unique characteristics and methodologies within this domain, distinct from
broader existing reviews in the field. It presents a unified view of DM
fundamentals and explores research directions, including alternative input
domains, conditioning strategies, guidance, corruption spaces, and zero-shot
methods. This survey provides insights into the evolution of image SR with DMs,
addressing current trends, challenges, and future directions in this rapidly
evolving field.
SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity
December 31, 2023
Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra
Score distillation has emerged as one of the most prevalent approaches for
text-to-3D asset synthesis. Essentially, score distillation updates 3D
parameters by lifting and back-propagating scores averaged over different
views. In this paper, we reveal that the gradient estimation in score
distillation is inherent to high variance. Through the lens of variance
reduction, the effectiveness of SDS and VSD can be interpreted as applications
of various control variates to the Monte Carlo estimator of the distilled
score. Motivated by this rethinking and based on Stein’s identity, we propose a
more general solution to reduce variance for score distillation, termed Stein
Score Distillation (SSD). SSD incorporates control variates constructed by
Stein identity, allowing for arbitrary baseline functions. This enables us to
include flexible guidance priors and network architectures to explicitly
optimize for variance reduction. In our experiments, the overall pipeline,
dubbed SteinDreamer, is implemented by instantiating the control variate with a
monocular depth estimator. The results suggest that SSD can effectively reduce
the distillation variance and consistently improve visual quality for both
object- and scene-level generation. Moreover, we demonstrate that SteinDreamer
achieves faster convergence than existing methods due to more stable gradient
updates.
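To make the control-variate viewpoint concrete in one dimension: Stein's identity for the standard normal says $\mathbb{E}[\phi'(x) - x\,\phi(x)] = 0$ for any smooth $\phi$, so such an expression can be subtracted from a Monte Carlo estimator without changing its mean. The toy below illustrates that generic mechanism only, not SSD itself.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

# Quantity of interest: E[x^2] = 1, estimated by plain Monte Carlo.
f = x ** 2

# Stein control variate for the standard normal with phi(x) = x: g = 1 - x^2
# has known mean zero, so it can be used as a baseline.
g = 1.0 - x ** 2

beta = np.cov(f, g)[0, 1] / g.var()       # fitted baseline coefficient
estimate = (f - beta * g).mean()          # same expectation, far lower variance

print(f.mean(), estimate)                 # both close to 1
print(f.var(), (f - beta * g).var())      # variance collapses in this toy case
```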
Taming Mode Collapse in Score Distillation for Text-to-3D Generation
December 31, 2023
Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra
Despite the remarkable performance of score distillation in text-to-3D
generation, such techniques notoriously suffer from view inconsistency issues,
also known as “Janus” artifact, where the generated objects fake each view with
multiple front faces. Although empirically effective methods have approached
this problem via score debiasing or prompt engineering, a more rigorous
perspective to explain and tackle this problem remains elusive. In this paper,
we reveal that the existing score distillation-based text-to-3D generation
frameworks degenerate to maximal likelihood seeking on each view independently
and thus suffer from the mode collapse problem, manifesting as the Janus
artifact in practice. To tame mode collapse, we improve score distillation by
re-establishing an entropy term in the corresponding variational objective,
which is applied to the distribution of rendered images. Maximizing the entropy
encourages diversity among different views in generated 3D assets, thereby
mitigating the Janus problem. Based on this new objective, we derive a new
update rule for 3D score distillation, dubbed Entropic Score Distillation
(ESD). We theoretically reveal that ESD can be simplified and implemented by
just adopting the classifier-free guidance trick upon variational score
distillation. Although embarrassingly straightforward, our extensive
experiments successfully demonstrate that ESD can be an effective treatment for
Janus artifacts in score distillation.
Probing the Limits and Capabilities of Diffusion Models for the Anatomic Editing of Digital Twins
December 30, 2023
Karim Kadry, Shreya Gupta, Farhad R. Nezami, Elazer R. Edelman
Numerical simulations can model the physical processes that govern
cardiovascular device deployment. When such simulations incorporate digital
twins, i.e., computational models of patient-specific anatomy, they can expedite and
de-risk the device design process. Nonetheless, the exclusive use of
patient-specific data constrains the anatomic variability which can be
precisely or fully explored. In this study, we investigate the capacity of
Latent Diffusion Models (LDMs) to edit digital twins to create anatomic
variants, which we term digital siblings. Digital twins and their corresponding
siblings can serve as the basis for comparative simulations, enabling the study
of how subtle anatomic variations impact the simulated deployment of
cardiovascular devices, as well as the augmentation of virtual cohorts for
device assessment. However, while diffusion models have been characterized in
their ability to edit natural images, their capacity to anatomically edit
digital twins has yet to be studied. Using a case example centered on 3D
digital twins of cardiac anatomy, we implement various methods for generating
digital siblings and characterize them through morphological and topological
analyses. We specifically edit digital twins to introduce anatomic variation at
different spatial scales and within localized regions, demonstrating the
existence of bias towards common anatomic features. We further show that such
anatomic bias can be leveraged for virtual cohort augmentation through
selective editing, partially alleviating issues related to dataset imbalance
and lack of diversity. Our experimental framework thus delineates the limits
and capabilities of using latent diffusion models in synthesizing anatomic
variation for in silico trials.
Diffusion Model with Perceptual Loss
December 30, 2023
Shanchuan Lin, Xiao Yang
Diffusion models trained with mean squared error loss tend to generate
unrealistic samples. Current state-of-the-art models rely on classifier-free
guidance to improve sample quality, yet its surprising effectiveness is not
fully understood. In this paper, we show that the effectiveness of
classifier-free guidance partly originates from it being a form of implicit
perceptual guidance. As a result, we can directly incorporate perceptual loss
in diffusion training to improve sample quality. Since the score matching
objective used in diffusion training strongly resembles the denoising
autoencoder objective used in unsupervised training of perceptual networks, the
diffusion model itself is a perceptual network and can be used to generate
meaningful perceptual loss. We propose a novel self-perceptual objective that
results in diffusion models capable of generating more realistic samples. For
conditional generation, our method only improves sample quality without
entanglement with the conditional input and therefore does not sacrifice sample
diversity. Our method can also improve sample quality for unconditional
generation, which was not possible with classifier-free guidance before.
iFusion: Inverting Diffusion for Pose-Free Reconstruction from Sparse Views
December 28, 2023
Chin-Hsuan Wu, Yen-Chun Chen, Bolivar Solarte, Lu Yuan, Min Sun
We present iFusion, a novel 3D object reconstruction framework that requires
only two views with unknown camera poses. While single-view reconstruction
yields visually appealing results, it can deviate significantly from the actual
object, especially on unseen sides. Additional views improve reconstruction
fidelity but necessitate known camera poses. However, assuming the availability
of pose may be unrealistic, and existing pose estimators fail in sparse view
scenarios. To address this, we harness a pre-trained novel view synthesis
diffusion model, which embeds implicit knowledge about the geometry and
appearance of diverse objects. Our strategy unfolds in three steps: (1) We
invert the diffusion model for camera pose estimation instead of synthesizing
novel views. (2) The diffusion model is fine-tuned using provided views and
estimated poses, turning it into a novel view synthesizer tailored to the target
object. (3) Leveraging registered views and the fine-tuned diffusion model, we
reconstruct the 3D object. Experiments demonstrate strong performance in both
pose estimation and novel view synthesis. Moreover, iFusion seamlessly
integrates with various reconstruction methods and enhances them.
DreamGaussian4D: Generative 4D Gaussian Splatting
December 28, 2023
Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, Ziwei Liu
Remarkable progress has been made in 4D content generation recently. However,
existing methods suffer from long optimization time, lack of motion
controllability, and a low level of detail. In this paper, we introduce
DreamGaussian4D, an efficient 4D generation framework that builds on 4D
Gaussian Splatting representation. Our key insight is that the explicit
modeling of spatial transformations in Gaussian Splatting makes it more
suitable for the 4D generation setting compared with implicit representations.
DreamGaussian4D reduces the optimization time from several hours to just a few
minutes, allows flexible control of the generated 3D motion, and produces
animated meshes that can be efficiently rendered in 3D engines.
PolyDiff: Generating 3D Polygonal Meshes with Diffusion Models
December 18, 2023
Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, Matthias Nießner
We introduce PolyDiff, the first diffusion-based approach capable of directly
generating realistic and diverse 3D polygonal meshes. In contrast to methods
that use alternate 3D shape representations (e.g. implicit representations),
our approach is a discrete denoising diffusion probabilistic model that
operates natively on the polygonal mesh data structure. This enables learning
of both the geometric properties of vertices and the topological
characteristics of faces. Specifically, we treat meshes as quantized triangle
soups, progressively corrupted with categorical noise in the forward diffusion
phase. In the reverse diffusion phase, a transformer-based denoising network is
trained to revert the noising process, restoring the original mesh structure.
At inference, new meshes can be generated by applying this denoising network
iteratively, starting with a completely noisy triangle soup. Consequently, our
model is capable of producing high-quality 3D polygonal meshes, ready for
integration into downstream 3D workflows. Our extensive experimental analysis
shows that PolyDiff achieves a significant advantage (avg. FID and JSD
improvement of 18.2 and 5.8 respectively) over current state-of-the-art
methods.
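To illustrate the kind of categorical forward corruption applied to quantized triangle soups, here is a generic D3PM-style uniform-replacement kernel; the tokenization, vocabulary size, and corruption schedule are illustrative assumptions, not necessarily PolyDiff's exact choices.

```python
import numpy as np

def corrupt_tokens(tokens, t, T, num_classes, rng):
    """Uniform categorical forward corruption: each quantized mesh token is
    independently resampled from a uniform distribution with a probability
    that grows with the diffusion time t."""
    p_replace = t / T
    mask = rng.random(tokens.shape) < p_replace
    random_tokens = rng.integers(0, num_classes, size=tokens.shape)
    return np.where(mask, random_tokens, tokens)

rng = np.random.default_rng(0)
tri_soup = rng.integers(0, 256, size=(100, 9))  # 100 triangles, 3 vertices x 3 quantized coords
print(corrupt_tokens(tri_soup, t=500, T=1000, num_classes=256, rng=rng)[:2])
```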
Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model
December 18, 2023
Decheng Liu, Xijun Wang, Chunlei Peng, Nannan Wang, Ruiming Hu, Xinbo Gao
Adversarial attacks involve adding perturbations to the source image to cause
misclassification by the target model, which demonstrates the potential of
attacking face recognition models. Existing adversarial face image generation
methods still can’t achieve satisfactory performance because of low
transferability and high detectability. In this paper, we propose a unified
framework Adv-Diffusion that can generate imperceptible adversarial identity
perturbations in the latent space but not the raw pixel space, which utilizes
strong inpainting capabilities of the latent diffusion model to generate
realistic adversarial images. Specifically, we propose the identity-sensitive
conditioned diffusion generative model to generate semantic perturbations in
the surroundings. The designed adaptive strength-based adversarial perturbation
algorithm can ensure both attack transferability and stealthiness. Extensive
qualitative and quantitative experiments on the public FFHQ and CelebA-HQ
datasets prove the proposed method achieves superior performance compared with
the state-of-the-art methods without an extra generative model training
process. The source code is available at
https://github.com/kopper-xdu/Adv-Diffusion.
DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models
December 18, 2023
Jiachen Zhou, Peizhuo Lv, Yibing Lan, Guozhu Meng, Kai Chen, Hualong Ma
Dataset sanitization is a widely adopted proactive defense against
poisoning-based backdoor attacks, aimed at filtering out and removing poisoned
samples from training datasets. However, existing methods have shown limited
efficacy in countering ever-evolving trigger functions and often lead to
considerable degradation of benign accuracy. In this paper, we propose
DataElixir, a novel sanitization approach tailored to purify poisoned datasets.
We leverage diffusion models to eliminate trigger features and restore benign
features, thereby turning the poisoned samples into benign ones. Specifically,
with multiple iterations of the forward and reverse process, we extract
intermediary images and their predicted labels for each sample in the original
dataset. Then, we identify anomalous samples in terms of the presence of label
transition of the intermediary images, detect the target label by quantifying
distribution discrepancy, select their purified images considering pixel and
feature distance, and determine their ground-truth labels by training a benign
model. Experiments conducted on 9 popular attacks demonstrate that DataElixir
effectively mitigates various complex attacks while exerting minimal impact on
benign accuracy, surpassing the performance of baseline defense methods.
Realistic Human Motion Generation with Cross-Diffusion Models
December 18, 2023
Zeping Ren, Shaoli Huang, Xiu Li
We introduce the Cross Human Motion Diffusion Model (CrossDiff), a novel
approach for generating high-quality human motion based on textual
descriptions. Our method integrates 3D and 2D information using a shared
transformer network within the training of the diffusion model, unifying motion
noise into a single feature space. This enables cross-decoding of features into
both 3D and 2D motion representations, regardless of their original dimension.
The primary advantage of CrossDiff is its cross-diffusion mechanism, which
allows the model to reverse either 2D or 3D noise into clean motion during
training. This capability leverages the complementary information in both
motion representations, capturing intricate human movement details often missed
by models relying solely on 3D information. Consequently, CrossDiff effectively
combines the strengths of both representations to generate more realistic
motion sequences. In our experiments, our model demonstrates competitive
state-of-the-art performance on text-to-motion benchmarks. Moreover, our method
consistently provides enhanced motion generation quality, capturing complex
full-body movement intricacies. Additionally, with a pretrained model, our
approach can use in-the-wild 2D motion data without 3D motion ground
truth during training to generate 3D motion, highlighting its potential for
broader applications and efficient use of available data resources. Project
page: https://wonderno.github.io/CrossDiff-webpage/.
Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models
December 17, 2023
Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk
Knowledge distillation methods have recently shown to be a promising
direction to speedup the synthesis of large-scale diffusion models by requiring
only a few inference steps. While several powerful distillation methods were
recently proposed, the overall quality of student samples is typically lower
compared to the teacher ones, which hinders their practical usage. In this
work, we investigate the relative quality of samples produced by the teacher
text-to-image diffusion model and its distilled student version. As our main
empirical finding, we discover that a noticeable portion of student samples
exhibit superior fidelity compared to the teacher ones, despite the
"approximate" nature of the student. Based on this finding, we propose an
adaptive collaboration between student and teacher diffusion models for
effective text-to-image synthesis. Specifically, the distilled model produces
the initial sample, and then an oracle decides whether it needs further
improvements with a slow teacher model. Extensive experiments demonstrate that
the designed pipeline surpasses state-of-the-art text-to-image alternatives for
various inference budgets in terms of human preference. Furthermore, the
proposed approach can be naturally used in popular applications such as
text-guided image editing and controllable generation.
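At a high level, the adaptive collaboration can be summarized in a few lines; the student, oracle, and teacher below are hypothetical callables standing in for the distilled model, the quality decision rule, and the full teacher described above.

```python
def adaptive_generate(prompt, student, teacher_refine, oracle, threshold):
    """The distilled student proposes a cheap sample; only when the oracle
    deems it insufficient does the slow teacher refine it."""
    x = student(prompt)
    if oracle(prompt, x) < threshold:
        x = teacher_refine(prompt, x)
    return x

# Toy usage with stand-in callables.
img = adaptive_generate(
    "a watercolor fox", student=lambda p: f"draft({p})",
    teacher_refine=lambda p, x: f"refined({x})",
    oracle=lambda p, x: 0.4, threshold=0.5)
print(img)  # refined(draft(a watercolor fox))
```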
VecFusion: Vector Font Generation with Diffusion
December 16, 2023
Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Michael Gharbi, Oliver Wang, Alec Jacobson, Evangelos Kalogerakis
We present VecFusion, a new neural architecture that can generate vector
fonts with varying topological structures and precise control point positions.
Our approach is a cascaded diffusion model which consists of a raster diffusion
model followed by a vector diffusion model. The raster model generates
low-resolution, rasterized fonts with auxiliary control point information,
capturing the global style and shape of the font, while the vector model
synthesizes vector fonts conditioned on the low-resolution raster fonts from
the first stage. To synthesize long and complex curves, our vector diffusion
model uses a transformer architecture and a novel vector representation that
enables the modeling of diverse vector geometry and the precise prediction of
control points. Our experiments show that, in contrast to previous generative
models for vector graphics, our new cascaded vector diffusion model generates
higher quality vector fonts, with complex structures and diverse styles.
Continuous Diffusion for Mixed-Type Tabular Data
December 16, 2023
Markus Mueller, Kathrin Gruber, Dennis Fok
Score-based generative models (or diffusion models for short) have proven
successful across many domains in generating text and image data. However, the
consideration of mixed-type tabular data with this model family has fallen
short so far. Existing research mainly combines different diffusion processes
without explicitly accounting for the feature heterogeneity inherent to tabular
data. In this paper, we combine score matching and score interpolation to
ensure a common type of continuous noise distribution that affects both
continuous and categorical features alike. Further, we investigate the impact
of distinct noise schedules per feature or per data type. We allow for
adaptive, learnable noise schedules to ensure optimally allocated model
capacity and balanced generative capability. Results show that our model
consistently outperforms state-of-the-art benchmark models and that accounting
for heterogeneity within the noise schedule design boosts sample quality.
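The notion of a distinct, learnable noise schedule per feature can be
illustrated with the following sketch. The parameterization (a learnable
per-feature log-SNR that decreases linearly in time) is a hypothetical choice
for illustration and is not claimed to match the authors' design.
```python
import torch
import torch.nn as nn

class PerFeatureNoiseSchedule(nn.Module):
    """Learnable per-feature noise level; a hypothetical parameterization."""
    def __init__(self, num_features):
        super().__init__()
        # One learnable slope and offset per feature for a monotone log-SNR in t.
        self.slope = nn.Parameter(torch.ones(num_features))
        self.offset = nn.Parameter(torch.zeros(num_features))

    def forward(self, t):
        # t in [0, 1]; softplus keeps the slope positive so noise grows with t.
        log_snr = self.offset - nn.functional.softplus(self.slope) * t
        alpha = torch.sigmoid(log_snr).sqrt()    # signal scale per feature
        sigma = torch.sigmoid(-log_snr).sqrt()   # noise std per feature
        return alpha, sigma

# Usage: perturb an encoded mixed-type row with feature-specific noise levels.
schedule = PerFeatureNoiseSchedule(num_features=8)
x0 = torch.randn(4, 8)               # continuous + one-hot columns, already encoded
alpha, sigma = schedule(torch.tensor(0.5))
x_t = alpha * x0 + sigma * torch.randn_like(x0)
```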
Lecture Notes in Probabilistic Diffusion Models
December 16, 2023
Inga Strümke, Helge Langseth
Diffusion models are loosely modelled based on non-equilibrium
thermodynamics, where \textit{diffusion} refers to particles flowing from
high-concentration regions towards low-concentration regions. In statistics,
the meaning is quite similar, namely the process of transforming a complex
distribution $p_{\text{complex}}$ on $\mathbb{R}^d$ to a simple distribution
$p_{\text{prior}}$ on the same domain. This constitutes a Markov chain of
diffusion steps of slowly adding random noise to data, followed by a reverse
diffusion process in which the data is reconstructed from the noise. The
diffusion model learns the data manifold to which the original and thus the
reconstructed data samples belong, by training on a large number of data
points. While the diffusion process pushes a data sample off the data manifold,
the reverse process finds a trajectory back to the data manifold. Diffusion
models have – unlike variational autoencoders and flow models – latent
variables with the same dimensionality as the original data, and they are
currently (at the time of writing, 2023) outperforming other approaches –
including Generative Adversarial Networks (GANs) – to modelling the
distribution of, e.g., natural images.
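As a concrete illustration of the forward and reverse processes sketched in
these notes, the following minimal Python example implements the closed-form
forward noising step and one ancestral reverse step of a DDPM. The linear beta
schedule and the `model` callable are illustrative assumptions, not details
taken from the lecture notes themselves.
```python
import torch

# Linear beta schedule (an illustrative choice, not prescribed by the notes).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def forward_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

def reverse_step(model, x_t, t):
    """One ancestral DDPM step: predict the added noise, then sample x_{t-1}."""
    eps_hat = model(x_t, t)                  # noise prediction network
    a_t, a_bar_t = alphas[t], alpha_bars[t]
    mean = (x_t - betas[t] / (1.0 - a_bar_t).sqrt() * eps_hat) / a_t.sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```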
Image Restoration Through Generalized Ornstein-Uhlenbeck Bridge
December 16, 2023
Conghan Yue, Zhengwei Peng, Junlong Ma, Shiyan Du, Pengxu Wei, Dongyu Zhang
Diffusion models possess powerful generative capabilities enabling the
mapping of noise to data using reverse stochastic differential equations.
However, in image restoration tasks, the focus is on the mapping relationship
from low-quality images to high-quality images. To address this, we introduced
the Generalized Ornstein-Uhlenbeck Bridge (GOUB) model. By leveraging the
natural mean-reverting property of the generalized OU process and further
adjusting the variance of its steady-state distribution through Doob’s
h-transform, we achieve diffusion mappings from point to point with minimal
cost. This allows for end-to-end training, enabling the recovery of
high-quality images from low-quality ones. Additionally, we uncovered the
mathematical essence of several bridge models, showing that they are special
cases of the GOUB, and empirically demonstrated the optimality of our models.
Furthermore, benefiting from our distinctive parameterization mechanism, we
proposed the Mean-ODE model that is better at capturing pixel-level information
and structural perceptions. Experimental results show that both models achieved
state-of-the-art results in various tasks, including inpainting, deraining, and
super-resolution. Code is available at https://github.com/Hammour-steak/GOUB.
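The mean-reverting behavior exploited here can be illustrated with a simple
Euler-Maruyama discretization of an Ornstein-Uhlenbeck process whose long-run
mean is pinned to the low-quality image; the drift and noise coefficients below
are illustrative placeholders rather than the GOUB parameterization.
```python
import torch

def ou_forward_step(x, mean, theta: float = 2.0, sigma: float = 0.5, dt: float = 0.01):
    """One Euler-Maruyama step of dx = theta * (mean - x) dt + sigma dW.
    With `mean` set to the low-quality image, repeated steps pull the
    high-quality image toward it while injecting noise (illustration only)."""
    noise = torch.randn_like(x)
    return x + theta * (mean - x) * dt + sigma * (dt ** 0.5) * noise

hq = torch.rand(1, 3, 64, 64)   # high-quality image (start of the forward process)
lq = torch.rand(1, 3, 64, 64)   # low-quality image (long-run mean of the OU process)
x = hq
for _ in range(100):
    x = ou_forward_step(x, lq)
```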
PhenDiff: Revealing Invisible Phenotypes with Conditional Diffusion Models
December 13, 2023
Anis Bourou, Thomas Boyer, Kévin Daupin, Véronique Dubreuil, Aurélie De Thonel, Valérie Mezger, Auguste Genovesio
Over the last five years, deep generative models have gradually been adopted
for various tasks in biological research. Notably, image-to-image translation
methods showed to be effective in revealing subtle phenotypic cell variations
otherwise invisible to the human eye. Current methods to achieve this goal
mainly rely on Generative Adversarial Networks (GANs). However, these models
are known to suffer from some shortcomings such as training instability and
mode collapse. Furthermore, the lack of robustness to invert a real image into
the latent of a trained GAN prevents flexible editing of real images. In this
work, we propose PhenDiff, an image-to-image translation method based on
conditional diffusion models to identify subtle phenotypes in microscopy
images. We evaluate this approach on biological datasets against previous work
such as CycleGAN. We show that PhenDiff outperforms this baseline in terms of
quality and diversity of the generated images. We then apply this method to
display invisible phenotypic changes triggered by a rare neurodevelopmental
disorder on microscopy images of organoids. Altogether, we demonstrate that
PhenDiff is able to perform high quality biological image-to-image translation
allowing to spot subtle phenotype variations on a real image.
SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space
December 13, 2023
Yunchen Li, Zhou Yu, Gaoqi He, Yunhang Shen, Ke Li, Xing Sun, Shaohui Lin
Symmetric positive definite~(SPD) matrices have shown important value and
applications in statistics and machine learning, such as FMRI analysis and
traffic prediction. Previous works on SPD matrices mostly focus on
discriminative models, making predictions directly on $E(X|y)$, where
$y$ is a vector and $X$ is an SPD matrix. However, these methods are
challenging to handle for large-scale data, as they need to access and process
the whole data. In this paper, inspired by denoising diffusion probabilistic
model~(DDPM), we propose a novel generative model, termed SPD-DDPM, by
introducing Gaussian distribution in the SPD space to estimate $E(X|y)$.
Moreover, our model is able to estimate $p(X)$ unconditionally and flexibly
without giving $y$. On the one hand, the model conditionally learns $p(X|y)$
and utilizes the mean of samples to obtain $E(X|y)$ as a prediction. On the
other hand, the model unconditionally learns the probability distribution of
the data $p(X)$ and generates samples that conform to this distribution.
Furthermore, we propose a new SPD net which is much deeper than the previous
networks and allows for the inclusion of conditional factors. Experiment
results on toy data and real taxi data demonstrate that our models effectively
fit the data distribution both unconditionally and conditionally and provide
accurate predictions.
Concept-centric Personalization with Large-scale Diffusion Priors
December 13, 2023
Pu Cao, Lu Yang, Feng Zhou, Tianrui Huang, Qing Song
Despite large-scale diffusion models being highly capable of generating
diverse open-world content, they still struggle to match the photorealism and
fidelity of concept-specific generators. In this work, we present the task of
customizing large-scale diffusion priors for specific concepts as
concept-centric personalization. Our goal is to generate high-quality
concept-centric images while maintaining the versatile controllability inherent
to open-world models, enabling applications in diverse tasks such as
concept-centric stylization and image translation. To tackle these challenges,
we identify catastrophic forgetting of guidance prediction from diffusion
priors as the fundamental issue. Consequently, we develop a guidance-decoupled
personalization framework specifically designed to address this task. We
propose Generalized Classifier-free Guidance (GCFG) as the foundational theory
for our framework. This approach extends Classifier-free Guidance (CFG) to
accommodate an arbitrary number of guidances, sourced from a variety of
conditions and models. Employing GCFG enables us to separate conditional
guidance into two distinct components: concept guidance for fidelity and
control guidance for controllability. This division makes it feasible to train
a specialized model for concept guidance, while ensuring both control and
unconditional guidance remain intact. We then present a null-text
Concept-centric Diffusion Model as a concept-specific generator to learn
concept guidance without the need for text annotations. Code will be available
at https://github.com/PRIV-Creation/Concept-centric-Personalization.
$\rho$-Diffusion: A diffusion-based density estimation framework for computational physics
December 13, 2023
Maxwell X. Cai, Kin Long Kelvin Lee
In physics, density $\rho(\cdot)$ is a fundamentally important scalar
function to model, since it describes a scalar field or a probability density
function that governs a physical process. Modeling $\rho(\cdot)$ typically
scales poorly with parameter space, however, and quickly becomes prohibitively
difficult and computationally expensive. One promising avenue to bypass this is
to leverage the capabilities of denoising diffusion models often used in
high-fidelity image generation to parameterize $\rho(\cdot)$ from existing
scientific data, from which new samples can be trivially drawn. In this
paper, we propose $\rho$-Diffusion, an implementation of denoising diffusion
probabilistic models for multidimensional density estimation in physics, which
is currently in active development and, from our results, performs well on
physically motivated 2D and 3D density functions. Moreover, we propose a novel
hashing technique that allows $\rho$-Diffusion to be conditioned on arbitrary
numbers of physical parameters of interest.
Clockwork Diffusion: Efficient Generation With Model-Step Distillation
December 13, 2023
Amirhossein Habibian, Amir Ghodrati, Noor Fathima, Guillaume Sautiere, Risheek Garrepalli, Fatih Porikli, Jens Petersen
This work aims to improve the efficiency of text-to-image diffusion models.
While diffusion models use computationally expensive UNet-based denoising
operations in every generation step, we identify that not all operations are
equally relevant for the final output quality. In particular, we observe that
UNet layers operating on high-res feature maps are relatively sensitive to
small perturbations. In contrast, low-res feature maps influence the semantic
layout of the final image and can often be perturbed with no noticeable change
in the output. Based on this observation, we propose Clockwork Diffusion, a
method that periodically reuses computation from preceding denoising steps to
approximate low-res feature maps at one or more subsequent steps. For multiple
baselines, and for both text-to-image generation and image editing, we
demonstrate that Clockwork leads to comparable or improved perceptual scores
with drastically reduced computational complexity. As an example, for Stable
Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and
CLIP change.
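A minimal sketch of the reuse idea: cache the expensive low-resolution branch
of the denoiser and recompute it only every few steps. The split into
`lowres_fn` and `highres_fn` and the clock period are placeholder assumptions,
not the paper's exact implementation.
```python
def clockwork_sampling(x, timesteps, lowres_fn, highres_fn, step_fn, clock=4):
    """Reuse low-resolution features for `clock` consecutive denoising steps.

    lowres_fn(x, t)          -> low-resolution feature maps (expensive)
    highres_fn(x, t, feats)  -> noise prediction using cached features (cheap)
    step_fn(x, eps, t)       -> one solver update (e.g., a DDPM or DPM++ step)
    """
    cached = None
    for i, t in enumerate(timesteps):
        if cached is None or i % clock == 0:
            cached = lowres_fn(x, t)         # recompute low-res features periodically
        eps_hat = highres_fn(x, t, cached)   # high-res path runs at every step
        x = step_fn(x, eps_hat, t)
    return x
```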
Compositional Inversion for Stable Diffusion Models
December 13, 2023
Xulu Zhang, Xiao-Yong Wei, Jinlin Wu, Tianyi Zhang, Zhaoxiang Zhang, Zhen Lei, Qing Li
Inversion methods, such as Textual Inversion, generate personalized images by
incorporating concepts of interest provided by user images. However, existing
methods often suffer from overfitting issues, where the dominant presence of
inverted concepts leads to the absence of other desired concepts. It stems from
the fact that during inversion, the irrelevant semantics in the user images are
also encoded, forcing the inverted concepts to occupy locations far from the
core distribution in the embedding space. To address this issue, we propose a
method that guides the inversion process towards the core distribution for
compositional embeddings. Additionally, we introduce a spatial regularization
approach to balance the attention on the concepts being composed. Our method is
designed as a post-training approach and can be seamlessly integrated with
other inversion methods. Experimental results demonstrate the effectiveness of
our proposed approach in mitigating the overfitting problem and generating more
diverse and balanced compositions of concepts in the synthesized images. The
source code is available at
https://github.com/zhangxulu1996/Compositional-Inversion.
ClusterDDPM: An EM clustering framework with Denoising Diffusion Probabilistic Models
December 13, 2023
Jie Yan, Jing Liu, Zhong-yuan Zhang
Variational autoencoder (VAE) and generative adversarial networks (GAN) have
found widespread applications in clustering and have achieved significant
success. However, the potential of these approaches may be limited due to VAE’s
mediocre generation capability or GAN’s well-known instability during
adversarial training. In contrast, denoising diffusion probabilistic models
(DDPMs) represent a new and promising class of generative models that may
unlock fresh dimensions in clustering. In this study, we introduce an
innovative expectation-maximization (EM) framework for clustering using DDPMs.
In the E-step, we aim to derive a mixture of Gaussian priors for the subsequent
M-step. In the M-step, our focus lies in learning clustering-friendly latent
representations for the data by employing the conditional DDPM and matching the
distribution of latent representations to the mixture of Gaussian priors. We
present a rigorous theoretical analysis of the optimization process in the
M-step, proving that the optimizations are equivalent to maximizing the lower
bound of the Q function within the vanilla EM framework under certain
constraints. Comprehensive experiments validate the advantages of the proposed
framework, showcasing superior performance in clustering, unsupervised
conditional generation and latent representation learning.
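A rough sketch of one EM round as described, assuming latent representations
are already available and using a hypothetical `train_m_step` callable to stand
in for the paper's M-step optimization:
```python
from sklearn.mixture import GaussianMixture

def em_clustering_round(latents, train_m_step, n_clusters: int = 10):
    """One EM round: fit a mixture-of-Gaussians prior on the current latent
    representations (E-step), then hand it to the conditional-DDPM training
    routine that matches the latent distribution to this prior (M-step)."""
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="diag")
    gmm.fit(latents)                               # E-step: mixture-of-Gaussians prior
    responsibilities = gmm.predict_proba(latents)  # soft cluster assignments
    train_m_step(gmm)                              # M-step: update DDPM / encoder
    return gmm, responsibilities
```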
Time Series Diffusion Method: A Denoising Diffusion Probabilistic Model for Vibration Signal Generation
December 13, 2023
Haiming Yi, Lei Hou, Yuhong Jin, Nasser A. Saeed
Diffusion models have demonstrated robust data generation capabilities in
various research fields. In this paper, a Time Series Diffusion Method (TSDM)
is proposed for vibration signal generation, leveraging the foundational
principles of diffusion models. The TSDM uses an improved U-net architecture
with attention block to effectively segment and extract features from
one-dimensional time series data. It operates based on forward diffusion and
reverse denoising processes for time-series generation. Experimental validation
is conducted using single-frequency, multi-frequency datasets, and bearing
fault datasets. The results show that TSDM can accurately generate the
single-frequency and multi-frequency features in the time series and retain the
basic frequency features for the diffusion generation results of the bearing
fault series. Finally, TSDM is applied to the small sample fault diagnosis of
three public bearing fault datasets, and the results show that the small-sample
fault diagnosis accuracy on the three datasets is improved by up to 32.380%,
18.355%, and 9.298%, respectively.
Diffusion Models Enable Zero-Shot Pose Estimation for Lower-Limb Prosthetic Users
December 13, 2023
Tianxun Zhou, Muhammad Nur Shahril Iskandar, Keng-Hwee Chiam
The application of 2D markerless gait analysis has garnered increasing
interest and application within clinical settings. However, its effectiveness
in the realm of lower-limb amputees has remained less than optimal. In
response, this study introduces an innovative zero-shot method employing image
generation diffusion models to achieve markerless pose estimation for
lower-limb prosthetics, presenting a promising solution to gait analysis for
this specific population. Our approach demonstrates an enhancement in detecting
key points on prosthetic limbs over existing methods, and enables clinicians to
gain invaluable insights into the kinematics of lower-limb amputees across the
gait cycle. The outcomes obtained not only serve as a proof-of-concept for the
feasibility of this zero-shot approach but also underscore its potential in
advancing rehabilitation through gait analysis for this unique population.
December 12, 2023
Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li
Diffusion Transformers have recently shown remarkable effectiveness in
generating high-quality 3D point clouds. However, training voxel-based
diffusion models for high-resolution 3D voxels remains prohibitively expensive
due to the cubic complexity of attention operators, which arises from the
additional dimension of voxels. Motivated by the inherent redundancy of 3D
compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer
tailored for efficient 3D point cloud generation, which greatly reduces
training costs. Specifically, we draw inspiration from masked autoencoders to
dynamically operate the denoising process on masked voxelized point clouds. We
also propose a novel voxel-aware masking strategy to adaptively aggregate
background/foreground information from voxelized point clouds. Our method
achieves state-of-the-art performance with an extreme masking ratio of nearly
99%. Moreover, to improve multi-category 3D generation, we introduce
Mixture-of-Experts (MoE) into the 3D diffusion model, so that each category can
learn a distinct diffusion path with different experts, relieving gradient conflict.
Experimental results on the ShapeNet dataset demonstrate that our method
achieves state-of-the-art high-fidelity and diverse 3D point cloud generation
performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage
metrics when generating 128-resolution voxel point clouds, using only 6.5% of
the original training cost.
Equivariant Flow Matching with Hybrid Probability Transport
December 12, 2023
Yuxuan Song, Jingjing Gong, Minkai Xu, Ziyao Cao, Yanyan Lan, Stefano Ermon, Hao Zhou, Wei-Ying Ma
The generation of 3D molecules requires simultaneously deciding the
categorical features~(atom types) and continuous features~(atom coordinates).
Deep generative models, especially Diffusion Models (DMs), have demonstrated
effectiveness in generating feature-rich geometries. However, existing DMs
typically suffer from unstable probability dynamics with inefficient sampling
speed. In this paper, we introduce geometric flow matching, which enjoys the
advantages of both equivariant modeling and stabilized probability dynamics.
More specifically, we propose a hybrid probability path where the coordinates
probability path is regularized by an equivariant optimal transport, and the
information between different modalities is aligned. Experimentally, the
proposed method could consistently achieve better performance on multiple
molecule generation benchmarks with 4.75$\times$ speed up of sampling on
average.
Generating High-Resolution Regional Precipitation Using Conditional Diffusion Model
December 12, 2023
Naufal Shidqi, Chaeyoon Jeong, Sungwon Park, Elke Zeller, Arjun Babu Nellikkattil, Karandeep Singh
cs.LG, cs.AI, physics.ao-ph
Climate downscaling is a crucial technique within climate research, serving
to project low-resolution (LR) climate data to higher resolutions (HR).
Previous research has demonstrated the effectiveness of deep learning for
downscaling tasks. However, most deep learning models for climate downscaling
may not perform optimally for high scaling factors (i.e., 4x, 8x) due to their
limited ability to capture the intricate details required for generating HR
climate data. Furthermore, climate data behaves differently from image data,
necessitating a nuanced approach when employing deep generative models. In
response to these challenges, this paper presents a deep generative model for
downscaling climate data, specifically precipitation on a regional scale. We
employ a denoising diffusion probabilistic model (DDPM) conditioned on multiple
LR climate variables. The proposed model is evaluated using precipitation data
from the Community Earth System Model (CESM) v1.2.2 simulation. Our results
demonstrate significant improvements over existing baselines, underscoring the
effectiveness of the conditional diffusion model in downscaling climate data.
LoRA-Enhanced Distillation on Guided Diffusion Models
December 12, 2023
Pareesa Ameneh Golnari
Diffusion models, such as Stable Diffusion (SD), offer the ability to
generate high-resolution images with diverse features, but they come at a
significant computational and memory cost. In classifier-free guided diffusion
models, prolonged inference times are attributed to the necessity of computing
two separate diffusion models at each denoising step. Recent work has shown
promise in improving inference time through distillation techniques, teaching
the model to perform similar denoising steps with reduced computations.
However, the application of distillation introduces additional memory overhead
to these already resource-intensive diffusion models, making it less practical.
To address these challenges, our research explores a novel approach that
combines Low-Rank Adaptation (LoRA) with model distillation to efficiently
compress diffusion models. This approach not only reduces inference time but
also mitigates memory overhead, and notably decreases memory consumption even
before applying distillation. The results are remarkable, featuring a
significant reduction in inference time due to the distillation process and a
substantial 50% reduction in memory consumption. Our examination of the
generated images underscores that the incorporation of LoRA-enhanced
distillation maintains image quality and alignment with the provided prompts.
In summary, while conventional distillation tends to increase memory
consumption, LoRA-enhanced distillation offers optimization without any
trade-offs or compromises in quality.
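For reference, a minimal LoRA adapter of the kind that could be attached to
the student's linear projections during distillation; the rank and scaling
values are illustrative defaults, not numbers reported in this work.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update W + (alpha/r) * B A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # only the adapter is trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a projection inside the student UNet before distillation training.
proj = nn.Linear(320, 320)
lora_proj = LoRALinear(proj, rank=8)
y = lora_proj(torch.randn(2, 320))
```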
Photorealistic Video Generation with Diffusion Models
December 11, 2023
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
We present W.A.L.T, a transformer-based approach for photorealistic video
generation via diffusion modeling. Our approach has two key design decisions.
First, we use a causal encoder to jointly compress images and videos within a
unified latent space, enabling training and generation across modalities.
Second, for memory and training efficiency, we use a window attention
architecture tailored for joint spatial and spatiotemporal generative modeling.
Taken together, these design decisions enable us to achieve state-of-the-art
performance on established video (UCF-101 and Kinetics-600) and image
(ImageNet) generation benchmarks without using classifier-free guidance.
Finally, we also train a cascade of three models for the task of text-to-video
generation consisting of a base latent video diffusion model, and two video
super-resolution diffusion models to generate videos of $512 \times 896$
resolution at $8$ frames per second.
UpFusion: Novel View Diffusion from Unposed Sparse View Observations
December 11, 2023
Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov, Shubham Tulsiani
We propose UpFusion, a system that can perform novel view synthesis and infer
3D representations for an object given a sparse set of reference images without
corresponding pose information. Current sparse-view 3D inference methods
typically rely on camera poses to geometrically aggregate information from
input views, but are not robust in-the-wild when such information is
unavailable/inaccurate. In contrast, UpFusion sidesteps this requirement by
learning to implicitly leverage the available images as context in a
conditional generative model for synthesizing novel views. We incorporate two
complementary forms of conditioning into diffusion models for leveraging the
input views: a) via inferring query-view aligned features using a scene-level
transformer, b) via intermediate attentional layers that can directly observe
the input image tokens. We show that this mechanism allows generating
high-fidelity novel views while improving the synthesis quality given
additional (unposed) images. We evaluate our approach on the Co3Dv2 and Google
Scanned Objects datasets and demonstrate the benefits of our method over
pose-reliant sparse-view methods as well as single-view methods that cannot
leverage additional views. Finally, we also show that our learned model can
generalize beyond the training categories and even allow reconstruction from
self-captured images of generic objects in-the-wild.
DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection
December 11, 2023
Haoyang He, Jiangning Zhang, Hongxu Chen, Xuhai Chen, Zhishan Li, Xu Chen, Yabiao Wang, Chengjie Wang, Lei Xie
Reconstruction-based approaches have achieved remarkable outcomes in anomaly
detection. The exceptional image reconstruction capabilities of recently
popular diffusion models have sparked research efforts to utilize them for
enhanced reconstruction of anomalous images. Nonetheless, these methods might
face challenges related to the preservation of image categories and pixel-wise
structural integrity in the more practical multi-class setting. To solve the
above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework
for multi-class anomaly detection, which consists of a pixel-space autoencoder,
a latent-space Semantic-Guided (SG) network with a connection to the stable
diffusion’s denoising network, and a feature-space pre-trained feature
extractor. Firstly, the SG network is proposed for reconstructing anomalous
regions while preserving the original image’s semantic information. Secondly,
we introduce a Spatial-aware Feature Fusion (SFF) block to maximize
reconstruction accuracy when dealing with extensively reconstructed areas.
Thirdly, the input and reconstructed images are processed by a pre-trained
feature extractor to generate anomaly maps based on features extracted at
different scales. Experiments on MVTec-AD and VisA datasets demonstrate the
effectiveness of our approach which surpasses the state-of-the-art methods,
e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and
detection, respectively, on the multi-class MVTec-AD dataset. Code will be available
at https://lewandofskee.github.io/projects/diad.
HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models
December 11, 2023
Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, Huaizu Jiang
We address the problem of generating realistic 3D human-object interactions
(HOIs) driven by textual prompts. Instead of a single model, our key insight is
to take a modular design and decompose the complex task into simpler sub-tasks.
We first develop a dual-branch diffusion model (HOI-DM) to generate both human
and object motions conditioning on the input text, and encourage coherent
motions by a cross-attention communication module between the human and object
motion generation branches. We also develop an affordance prediction diffusion
model (APDM) to predict the contacting area between the human and object during
the interactions driven by the textual prompt. The APDM is independent of the
results by the HOI-DM and thus can correct potential errors by the latter.
Moreover, it stochastically generates the contacting points to diversify the
generated motions. Finally, we incorporate the estimated contacting points into
the classifier-guidance to achieve accurate and close contact between humans
and objects. To train and evaluate our approach, we annotate BEHAVE dataset
with text descriptions. Experimental results demonstrate that our approach is
able to produce realistic HOIs with various interactions and different types of
objects.
DiffAIL: Diffusion Adversarial Imitation Learning
December 11, 2023
Bingzheng Wang, Guoqiang Wu, Teng Pang, Yan Zhang, Yilong Yin
Imitation learning aims to solve the problem of defining reward functions in
real-world decision-making tasks. The current popular approach is the
Adversarial Imitation Learning (AIL) framework, which matches expert
state-action occupancy measures to obtain a surrogate reward for forward
reinforcement learning. However, the traditional discriminator is a simple
binary classifier and doesn’t learn an accurate distribution, which may result
in failing to identify expert-level state-action pairs induced by the policy
interacting with the environment. To address this issue, we propose a method
named diffusion adversarial imitation learning (DiffAIL), which introduces the
diffusion model into the AIL framework. Specifically, DiffAIL models the
state-action pairs as unconditional diffusion models and uses diffusion loss as
part of the discriminator’s learning objective, which enables the discriminator
to capture better expert demonstrations and improve generalization.
Experimentally, the results show that our method achieves state-of-the-art
performance and significantly surpasses expert demonstration on two benchmark
tasks, including the standard state-action setting and state-only settings. Our
code is available at https://github.com/ML-Group-SDU/DiffAIL.
The Journey, Not the Destination: How Data Guides Diffusion Models
December 11, 2023
Kristian Georgiev, Joshua Vendrow, Hadi Salman, Sung Min Park, Aleksander Madry
Diffusion models trained on large datasets can synthesize photo-realistic
images of remarkable quality and diversity. However, attributing these images
back to the training data-that is, identifying specific training examples which
caused an image to be generated-remains a challenge. In this paper, we propose
a framework that: (i) provides a formal notion of data attribution in the
context of diffusion models, and (ii) allows us to counterfactually validate
such attributions. Then, we provide a method for computing these attributions
efficiently. Finally, we apply our method to find (and evaluate) such
attributions for denoising diffusion probabilistic models trained on CIFAR-10
and latent diffusion models trained on MS COCO. We provide code at
https://github.com/MadryLab/journey-TRAK .
December 11, 2023
Linjie Fu, Xia Li, Xiuding Cai, Yingkai Wang, Xueyao Wang, Yu Yao, Yali Shen
Radiation therapy serves as an effective and standard method for cancer
treatment. Excellent radiation therapy plans always rely on high-quality dose
distribution maps obtained through repeated trial and error by experienced
experts. However, due to individual differences and complex clinical
situations, even seasoned expert teams may struggle to quickly reach the best
treatment plan every time. Many automatic dose distribution prediction
methods have been proposed recently to accelerate the radiation therapy
planning process and have achieved good results. However, these results suffer
from over-smoothing, with the obtained dose distribution maps lacking
high-frequency details, limiting their clinical application. To address
these limitations, we propose a dose prediction diffusion model based on
SwinTransformer and a projector, SP-DiffDose. To capture the direct correlation
between anatomical structure and dose distribution maps, SP-DiffDose uses a
structural encoder to extract features from anatomical images, then employs a
conditional diffusion process to blend noise and anatomical images at multiple
scales and gradually map them to dose distribution maps. To enhance the dose
prediction distribution for organs at risk, SP-DiffDose utilizes
SwinTransformer in the deeper layers of the network to capture features at
different scales in the image. To learn good representations from the fused
features, SP-DiffDose passes the fused features through a designed projector,
improving dose prediction accuracy. Finally, we evaluate SP-DiffDose on an
internal dataset. The results show that SP-DiffDose outperforms existing
methods on multiple evaluation metrics, demonstrating the superiority and
generalizability of our method.
PCRDiffusion: Diffusion Probabilistic Models for Point Cloud Registration
December 11, 2023
Yue Wu, Yongzhe Yuan, Xiaolong Fan, Xiaoshui Huang, Maoguo Gong, Qiguang Miao
We propose a new framework that formulates point cloud registration as a
denoising diffusion process from noisy transformation to object transformation.
During training stage, object transformation diffuses from ground-truth
transformation to random distribution, and the model learns to reverse this
noising process. In sampling stage, the model refines randomly generated
transformation to the output result in a progressive way. We derive the
variational bound in closed form for training and provide implementations of
the model. Our work provides the following crucial findings: (i) In contrast to
most existing methods, our framework, Diffusion Probabilistic Models for Point
Cloud Registration (PCRDiffusion), does not require repeatedly updating the
source point cloud to refine the predicted transformation. (ii) Point cloud
registration, one of the representative discriminative tasks, can be solved in
a generative way under a unified probabilistic formulation. Finally, we discuss
and provide an outlook on the application of diffusion model in different
scenarios for point cloud registration. Experimental results demonstrate that
our model achieves competitive performance in point cloud registration. In
both correspondence-free and correspondence-based scenarios, PCRDiffusion
achieves performance improvements exceeding 50\%.
CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models
December 11, 2023
Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag
Images produced by text-to-image diffusion models might not always faithfully
represent the semantic intent of the provided text prompt, where the model
might overlook or entirely fail to produce certain objects. Existing solutions
often require custom-tailored functions for each of these problems, leading
to sub-optimal results, especially for complex prompts. Our work introduces a
novel perspective by tackling this challenge in a contrastive context. Our
approach intuitively promotes the segregation of objects in attention maps
while also maintaining that pairs of related attributes are kept close to each
other. We conduct extensive experiments across a wide variety of scenarios,
each involving unique combinations of objects, attributes, and scenes. These
experiments effectively showcase the versatility, efficiency, and flexibility
of our method in working with both latent and pixel-based diffusion models,
including Stable Diffusion and Imagen. Moreover, we publicly share our source
code to facilitate further research.
A Note on the Convergence of Denoising Diffusion Probabilistic Models
December 10, 2023
Sokhna Diarra Mbacke, Omar Rivasplata
Diffusion models are one of the most important families of deep generative
models. In this note, we derive a quantitative upper bound on the Wasserstein
distance between the data-generating distribution and the distribution learned
by a diffusion model. Unlike previous works in this field, our result does not
make assumptions on the learned score function. Moreover, our bound holds for
arbitrary data-generating distributions on bounded instance spaces, even those
without a density w.r.t. the Lebesgue measure, and the upper bound does not
suffer from exponential dependencies. Our main result builds upon the recent
work of Mbacke et al. (2023) and our proofs are elementary.
Diffusion for Natural Image Matting
December 10, 2023
Yihan Hu, Yiheng Lin, Wei Wang, Yao Zhao, Yunchao Wei, Humphrey Shi
We aim to leverage diffusion to address the challenging image matting task.
However, the presence of high computational overhead and the inconsistency of
noise sampling between the training and inference processes pose significant
obstacles to achieving this goal. In this paper, we present DiffMatte, a
solution designed to effectively overcome these challenges. First, DiffMatte
decouples the decoder from the intricately coupled matting network design,
involving only one lightweight decoder in the iterations of the diffusion
process. With such a strategy, DiffMatte mitigates the growth of computational
overhead as the number of samples increases. Second, we employ a self-aligned
training strategy with uniform time intervals, ensuring a consistent noise
sampling between training and inference across the entire time domain. Our
DiffMatte is designed with flexibility in mind and can seamlessly integrate
into various modern matting architectures. Extensive experimental results
demonstrate that DiffMatte not only reaches the state-of-the-art level on the
Composition-1k test set, surpassing the previous best methods by 5% and 15% in
the SAD and MSE metrics respectively, but also shows stronger generalization
ability on other benchmarks.
InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
December 10, 2023
Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap-Peng Tan, Weipeng Hu
Large-scale text-to-image (T2I) diffusion models have showcased incredible
capabilities in generating coherent images based on textual descriptions,
enabling vast applications in content generation. While recent advancements
have introduced control over factors such as object localization, posture, and
image contours, a crucial gap remains in our ability to control the
interactions between objects in the generated content. Well-controlling
interactions in generated images could yield meaningful applications, such as
creating realistic scenes with interacting characters. In this work, we study
the problems of conditioning T2I diffusion models with Human-Object Interaction
(HOI) information, consisting of a triplet label (person, action, object) and
corresponding bounding boxes. We propose a pluggable interaction control model,
called InteractDiffusion, which extends existing pre-trained T2I diffusion models
so that they can be better conditioned on interactions. Specifically, we
tokenize the HOI information and learn their relationships via interaction
embeddings. A conditioning self-attention layer is trained to map HOI tokens to
visual tokens, thereby conditioning the visual tokens better in existing T2I
diffusion models. Our model attains the ability to control the interaction and
location on existing T2I diffusion models, which outperforms existing baselines
by a large margin in HOI detection score, as well as fidelity in FID and KID.
Project page: https://jiuntian.github.io/interactdiffusion.
AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model
December 10, 2023
Teng Hu, Jiangning Zhang, Ran Yi, Yuzhen Du, Xu Chen, Liang Liu, Yabiao Wang, Chengjie Wang
Anomaly inspection plays an important role in industrial manufacture.
Existing anomaly inspection methods are limited in their performance due to
insufficient anomaly data. Although anomaly generation methods have been
proposed to augment the anomaly data, they either suffer from poor generation
authenticity or inaccurate alignment between the generated anomalies and masks.
To address the above problems, we propose AnomalyDiffusion, a novel
diffusion-based few-shot anomaly generation model, which utilizes the strong
prior information of latent diffusion model learned from large-scale dataset to
enhance the generation authenticity under few-shot training data. Firstly, we
propose Spatial Anomaly Embedding, which consists of a learnable anomaly
embedding and a spatial embedding encoded from an anomaly mask, disentangling
the anomaly information into anomaly appearance and location information.
Moreover, to improve the alignment between the generated anomalies and the
anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism.
Based on the disparities between the generated anomaly image and normal sample,
it dynamically guides the model to focus more on the areas with less noticeable
generated anomalies, enabling generation of accurately-matched anomalous
image-mask pairs. Extensive experiments demonstrate that our model
significantly outperforms the state-of-the-art methods in generation
authenticity and diversity, and effectively improves the performance of
downstream anomaly inspection tasks. The code and data are available in
https://github.com/sjtuplayer/anomalydiffusion.
Conditional Stochastic Interpolation for Generative Learning
December 09, 2023
Ding Huang, Jian Huang, Ting Li, Guohao Shen
We propose a conditional stochastic interpolation (CSI) approach to learning
conditional distributions. CSI learns probability flow equations or stochastic
differential equations that transport a reference distribution to the target
conditional distribution. This is achieved by first learning the drift function
and the conditional score function based on conditional stochastic
interpolation, which are then used to construct a deterministic process
governed by an ordinary differential equation or a diffusion process for
conditional sampling. In our proposed CSI model, we incorporate an adaptive
diffusion term to address the instability issues arising during the training
process. We provide explicit forms of the conditional score function and the
drift function in terms of conditional expectations under mild conditions,
which naturally lead to a nonparametric regression approach to estimating
these functions. Furthermore, we establish non-asymptotic error bounds for
learning the target conditional distribution via conditional stochastic
interpolation in terms of KL divergence, taking into account the neural network
approximation error. We illustrate the application of CSI on image generation
using a benchmark image dataset.
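A generic stochastic interpolant of the form $x_t = a(t) x_0 + b(t) x_1 +
\gamma(t) z$ underlies this family of methods; the coefficient choices in the
sketch below are illustrative and not the paper's exact construction.
```python
import torch

def stochastic_interpolant(x_ref, x_target, t):
    """Interpolate a reference sample toward a target sample at time t with an
    additive noise term; the coefficients are an illustrative choice."""
    a = 1.0 - t                     # weight on the reference sample
    b = t                           # weight on the target sample
    gamma = (t * (1.0 - t)).sqrt()  # noise amplitude vanishing at both endpoints
    z = torch.randn_like(x_ref)
    return a * x_ref + b * x_target + gamma * z

x0 = torch.randn(16, 2)             # reference distribution samples
x1 = torch.randn(16, 2) + 3.0       # target (conditional) distribution samples
xt = stochastic_interpolant(x0, x1, t=torch.tensor(0.3))
```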
DPoser: Diffusion Model as Robust 3D Human Pose Prior
December 09, 2023
Junzhe Lu, Jing Lin, Hongkun Dou, Yulun Zhang, Yue Deng, Haoqian Wang
Modeling human pose is a cornerstone in applications from human-robot
interaction to augmented reality, yet crafting a robust human pose prior
remains a challenge due to biomechanical constraints and diverse human
movements. Traditional priors like VAEs and NDFs often fall short in realism
and generalization, especially in extreme conditions such as unseen noisy
poses. To address these issues, we introduce DPoser, a robust and versatile
human pose prior built upon diffusion models. Designed with optimization
frameworks, DPoser seamlessly integrates into various pose-centric
applications, including human mesh recovery, pose completion, and motion
denoising. Specifically, by formulating these tasks as inverse problems, we
employ variational diffusion sampling for efficient solving. Furthermore,
acknowledging the disparity between the articulated poses we focus on and
structured images in previous research, we propose a truncated timestep
scheduling to boost performance on downstream tasks. Our exhaustive experiments
demonstrate DPoser’s superiority over existing state-of-the-art pose priors
across multiple tasks.
Consistency Models for Scalable and Fast Simulation-Based Inference
December 09, 2023
Marvin Schmitt, Valentin Pratz, Ullrich Köthe, Paul-Christian Bürkner, Stefan T Radev
Simulation-based inference (SBI) is constantly in search of more expressive
algorithms for accurately inferring the parameters of complex models from noisy
data. We present consistency models for neural posterior estimation (CMPE), a
new free-form conditional sampler for scalable, fast, and amortized SBI with
generative neural networks. CMPE combines the advantages of normalizing flows
and flow matching methods into a single generative architecture: It essentially
distills a continuous probability flow and enables rapid few-shot inference
with an unconstrained architecture that can be tailored to the structure of the
estimation problem. Our empirical evaluation demonstrates that CMPE not only
outperforms current state-of-the-art algorithms on three hard low-dimensional
problems, but also achieves competitive performance in a high-dimensional
Bayesian denoising experiment and in estimating a computationally demanding
multi-scale model of tumor spheroid growth.
Efficient Quantization Strategies for Latent Diffusion Models
December 09, 2023
Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, Hongbo Zhang
Latent Diffusion Models (LDMs) capture the dynamic evolution of latent
variables over time, blending patterns and multimodality in a generative
system. Despite the proficiency of LDM in various applications, such as
text-to-image generation, facilitated by robust text encoders and a variational
autoencoder, the critical need to deploy large generative models on edge
devices compels a search for more compact yet effective alternatives. Post
Training Quantization (PTQ), a method to compress the operational size of deep
learning models, encounters challenges when applied to LDM due to temporal and
structural complexities. This study proposes a quantization strategy that
efficiently quantizes LDMs, leveraging the Signal-to-Quantization-Noise Ratio (SQNR)
as a pivotal metric for evaluation. By treating the quantization discrepancy as
relative noise and identifying sensitive part(s) of a model, we propose an
efficient quantization approach encompassing both global and local strategies.
The global quantization process mitigates relative quantization noise by
initiating higher-precision quantization on sensitive blocks, while local
treatments address specific challenges in quantization-sensitive and
time-sensitive modules. The outcomes of our experiments reveal that the
implementation of both global and local treatments yields a highly efficient
and effective Post Training Quantization (PTQ) of LDMs.
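The SQNR metric used to locate quantization-sensitive blocks can be computed
as follows; the uniform 8-bit fake quantizer is a simplified stand-in for
whatever PTQ scheme is actually applied.
```python
import torch

def sqnr_db(reference: torch.Tensor, quantized: torch.Tensor) -> float:
    """SQNR in dB between a full-precision output and its quantized counterpart."""
    noise = reference - quantized
    return (10.0 * torch.log10(reference.pow(2).mean() / noise.pow(2).mean())).item()

def fake_quant_uint8(x: torch.Tensor) -> torch.Tensor:
    """Simple uniform 8-bit fake quantization (illustrative, per-tensor, asymmetric)."""
    scale = (x.max() - x.min()) / 255.0
    zero_point = x.min()
    q = torch.round((x - zero_point) / scale).clamp(0, 255)
    return q * scale + zero_point

x = torch.randn(1, 320, 32, 32)          # e.g., an intermediate UNet activation
print(sqnr_db(x, fake_quant_uint8(x)))   # lower SQNR -> block is more sensitive
```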
Cross Domain Generative Augmentation: Domain Generalization with Latent Diffusion Models
December 08, 2023
Sobhan Hemati, Mahdi Beitollahi, Amir Hossein Estiri, Bassel Al Omari, Xi Chen, Guojun Zhang
Despite the huge effort in developing novel regularizers for Domain
Generalization (DG), adding simple data augmentation to the vanilla ERM which
is a practical implementation of the Vicinal Risk Minimization principle (VRM)
\citep{chapelle2000vicinal} outperforms or stays competitive with many of the
proposed regularizers. The VRM reduces the estimation error in ERM by replacing
the point-wise kernel estimates with a more precise estimation of true data
distribution that reduces the gap between data points \textbf{within each
domain}. However, in the DG setting, the estimation error of true data
distribution by ERM is mainly caused by the distribution shift \textbf{between
domains} which cannot be fully addressed by simple data augmentation techniques
within each domain. Inspired by this limitation of VRM, we propose a novel data
augmentation named Cross Domain Generative Augmentation (CDGA) that replaces
the pointwise kernel estimates in ERM with new density estimates in the
\textbf{vicinity of domain pairs} so that the gap between domains is further
reduced. To this end, CDGA, which is built upon latent diffusion models (LDM),
generates synthetic images to fill the gap between all domains and as a result,
reduces the non-iidness. We show that CDGA outperforms SOTA DG methods under
the Domainbed benchmark. To explain the effectiveness of CDGA, we generate more
than 5 Million synthetic images and perform extensive ablation studies
including data scaling laws, distribution visualization, domain shift
quantification, adversarial robustness, and loss landscape analysis.
Membership Inference Attacks on Diffusion Models via Quantile Regression
December 08, 2023
Shuai Tang, Zhiwei Steven Wu, Sergul Aydore, Michael Kearns, Aaron Roth
Recently, diffusion models have become popular tools for image synthesis
because of their high-quality outputs. However, like other large-scale models,
they may leak private information about their training data. Here, we
demonstrate a privacy vulnerability of diffusion models through a
\emph{membership inference (MI) attack}, which aims to identify whether a
target example belongs to the training set when given the trained diffusion
model. Our proposed MI attack learns quantile regression models that predict (a
quantile of) the distribution of reconstruction loss on examples not used in
training. This allows us to define a granular hypothesis test for determining
the membership of a point in the training set, based on thresholding the
reconstruction loss of that point using a custom threshold tailored to the
example. We also provide a simple bootstrap technique that takes a majority
membership prediction over a bag of “weak attackers”, which improves the
accuracy over individual quantile regression models. We show that our attack
outperforms the prior state-of-the-art attack while being substantially less
computationally expensive -- prior attacks required training multiple “shadow
models” with the same architecture as the model under attack, whereas our
attack requires training only much smaller models.
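The decision rule reduces to comparing a point's reconstruction loss against a
per-example threshold predicted by a quantile regressor, optionally aggregated
over a bag of weak attackers by majority vote. A minimal sketch with
hypothetical inputs:
```python
import numpy as np

def membership_votes(recon_loss, predicted_thresholds):
    """Each weak attacker predicts a quantile of the non-member loss distribution;
    a loss below its threshold votes 'member'. Majority vote aggregates the bag."""
    votes = [recon_loss < thr for thr in predicted_thresholds]
    return int(sum(votes) > len(votes) / 2)   # 1 = predicted member, 0 = non-member

# Hypothetical usage: thresholds come from quantile regressors fit on non-member data.
loss_of_target_point = 0.031
thresholds_from_bag = np.array([0.045, 0.038, 0.029, 0.052, 0.041])
print(membership_votes(loss_of_target_point, thresholds_from_bag))
```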
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
December 08, 2023
Yiming Zhao, Zhouhui Lian
Text-to-Image (T2I) generation methods based on diffusion model have garnered
significant attention in the last few years. Although these image synthesis
methods produce visually appealing results, they frequently exhibit spelling
errors when rendering text within the generated images. Such errors manifest as
missing, incorrect or extraneous characters, thereby severely constraining the
performance of text image generation based on diffusion models. To address the
aforementioned issue, this paper proposes a novel approach for text image
generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion
[27]). Our approach involves the design and training of a light-weight
character-level text encoder, which replaces the original CLIP encoder and
provides more robust text embeddings as conditional guidance. Then, we
fine-tune the diffusion model using a large-scale dataset, incorporating local
attention control under the supervision of character-level segmentation maps.
Finally, by employing an inference stage refinement process, we achieve a
notably high sequence accuracy when synthesizing text in arbitrarily given
images. Both qualitative and quantitative results demonstrate the superiority
of our method to the state of the art. Furthermore, we showcase several
potential applications of the proposed UDiffText, including text-centric image
synthesis, scene text editing, etc. Code and model will be available at
https://github.com/ZYM-PKU/UDiffText .
MVDD: Multi-View Depth Diffusion Models
December 08, 2023
Zhen Wang, Qiangeng Xu, Feitong Tan, Menglei Chai, Shichen Liu, Rohit Pandey, Sean Fanello, Achuta Kadambi, Yinda Zhang
Denoising diffusion models have demonstrated outstanding results in 2D image
generation, yet it remains a challenge to replicate its success in 3D shape
generation. In this paper, we propose leveraging multi-view depth, which
represents complex 3D shapes in a 2D data format that is easy to denoise. We
pair this representation with a diffusion model, MVDD, that is capable of
generating high-quality dense point clouds with 20K+ points with fine-grained
details. To enforce 3D consistency in multi-view depth, we introduce an
epipolar line segment attention that conditions the denoising step for a view
on its neighboring views. Additionally, a depth fusion module is incorporated
into diffusion steps to further ensure the alignment of depth maps. When
augmented with surface reconstruction, MVDD can also produce high-quality 3D
meshes. Furthermore, MVDD stands out in other tasks such as depth completion,
and can serve as a 3D prior, significantly boosting many downstream tasks, such
as GAN inversion. State-of-the-art results from extensive experiments
demonstrate MVDD’s excellent ability in 3D shape generation, depth completion,
and its potential as a 3D prior for downstream tasks.
HandDiffuse: Generative Controllers for Two-Hand Interactions via Diffusion Models
December 08, 2023
Pei Lin, Sihang Xu, Hongdi Yang, Yiran Liu, Xin Chen, Jingya Wang, Jingyi Yu, Lan Xu
Existing hand datasets are largely short-range, and their interactions are weak
due to the self-occlusion and self-similarity of hands, so they cannot yet meet
the needs of interacting-hands motion generation. To address the data scarcity,
we propose HandDiffuse12.5M, a novel dataset that consists of temporal
sequences with strong two-hand interactions. HandDiffuse12.5M has the largest
scale and richest interactions among the existing two-hand datasets. We further
present a strong baseline method HandDiffuse for the controllable motion
generation of interacting hands using various controllers. Specifically, we
apply the diffusion model as the backbone and design two motion representations
for different controllers. To reduce artifacts, we also propose Interaction
Loss which explicitly quantifies the dynamic interaction process. Our
HandDiffuse enables various applications with vivid two-hand interactions,
i.e., motion in-betweening and trajectory control. Experiments show that our
method outperforms the state-of-the-art techniques in motion generation and can
also contribute to data augmentation for other datasets. Our dataset,
corresponding codes, and pre-trained models will be disseminated to the
community for future research towards two-hand interaction modeling.
DiffCMR: Fast Cardiac MRI Reconstruction with Diffusion Probabilistic Models
December 08, 2023
Tianqi Xiang, Wenjun Yue, Yiqun Lin, Jiewen Yang, Zhenkun Wang, Xiaomeng Li
Performing magnetic resonance imaging (MRI) reconstruction from under-sampled
k-space data can accelerate the procedure to acquire MRI scans and reduce
patients’ discomfort. The reconstruction problem is usually formulated as a
denoising task that removes the noise in under-sampled MRI image slices.
Although previous GAN-based methods have achieved good performance in image
denoising, they are difficult to train and require careful tuning of
hyperparameters. In this paper, we propose a novel MRI denoising framework
DiffCMR by leveraging conditional denoising diffusion probabilistic models.
Specifically, DiffCMR perceives conditioning signals from the under-sampled MRI
image slice and generates its corresponding fully-sampled MRI image slice.
During inference, we adopt a multi-round ensembling strategy to stabilize the
performance. We validate DiffCMR with cine reconstruction and T1/T2 mapping
tasks on MICCAI 2023 Cardiac MRI Reconstruction Challenge (CMRxRecon) dataset.
Results show that our method achieves state-of-the-art performance, exceeding
previous methods by a significant margin. Code is available at
https://github.com/xmed-lab/DiffCMR.
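The multi-round ensembling used at inference can be sketched as repeated
conditional sampling followed by averaging; the `sample_fn` signature below is
an assumption, not the repository's actual API.
```python
import torch

def ensemble_reconstruction(sample_fn, undersampled_slice, rounds: int = 4):
    """Run the conditional diffusion sampler several times and average the
    outputs to stabilize the reconstruction (a sketch of multi-round ensembling)."""
    outputs = [sample_fn(undersampled_slice) for _ in range(rounds)]
    return torch.stack(outputs, dim=0).mean(dim=0)
```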
MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model
December 08, 2023
Kaiyu Song, Hanjiang Lai
Deep neural networks (DNNs) are vulnerable to adversarial perturbation, where
an imperceptible perturbation is added to the image that can fool the DNNs.
Diffusion-based adversarial purification focuses on using the diffusion model
to generate a clean image against such adversarial attacks. Unfortunately, the
generative process of the diffusion model is also inevitably affected by
adversarial perturbation, since the diffusion model is itself a deep network
whose input carries the adversarial perturbation. In this work, we propose
MimicDiffusion, a new diffusion-based adversarial purification technique that
directly approximates the generative process of the diffusion model with the
clean image as input. Concretely, we analyze the differences between the
guidance terms obtained with the clean image and with the adversarial sample.
After that, we first implement MimicDiffusion based on the Manhattan distance.
Then, we propose two guidance terms to purify the adversarial perturbation and
approximate the clean
diffusion model. Extensive experiments on three image datasets including
CIFAR-10, CIFAR-100, and ImageNet with three classifier backbones including
WideResNet-70-16, WideResNet-28-10, and ResNet50 demonstrate that
MimicDiffusion significantly performs better than the state-of-the-art
baselines. On CIFAR-10, CIFAR-100, and ImageNet, it achieves 92.67\%, 61.35\%,
and 61.53\% average robust accuracy, which are 18.49\%, 13.23\%, and 17.64\%
higher, respectively. The code is available in the supplementary material.
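The following sketch illustrates the general idea of distance-based guidance during the reverse process, using a single Manhattan (L1) guidance term that pulls the predicted clean image toward the observed input; the paper's actual two guidance terms and schedule differ, and the noise-prediction interface is assumed.

```python
# Illustrative L1-guided reverse step (not the authors' implementation).
import torch

def guided_step(eps_model, x_t, t, x_obs, betas, alphas, alpha_bar, scale=1.0):
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        eps = eps_model(x_in, t)
        # Predicted clean image from the current noisy sample.
        x0_hat = (x_in - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])
        # Manhattan-distance guidance toward the observed (possibly attacked) image.
        grad = torch.autograd.grad((x0_hat - x_obs).abs().sum(), x_in)[0]
    mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps.detach()) / torch.sqrt(alphas[t])
    mean = mean - scale * grad
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise
```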
Diffence: Fencing Membership Privacy With Diffusion Models
December 07, 2023
Yuefeng Peng, Ali Naseh, Amir Houmansadr
Deep learning models, while achieving remarkable performance across various
tasks, are vulnerable to membership inference attacks, wherein adversaries identify
if a specific data point was part of a model’s training set. This
susceptibility raises substantial privacy concerns, especially when models are
trained on sensitive datasets. Current defense methods often struggle to
provide robust protection without hurting model utility, and they often require
retraining the model or using extra data. In this work, we introduce a novel
defense framework against membership attacks by leveraging generative models.
The key intuition of our defense is to remove the differences between member
and non-member inputs, which can be used to perform membership attacks, by
re-generating input samples before feeding them to the target model. Therefore,
our defense works \emph{pre-inference}, which is unlike prior defenses that are
either training-time (modifying the model) or post-inference (modifying the
model’s output).
A unique feature of our defense is that it works on input samples only,
without modifying the training or inference phase of the target model.
Therefore, it can be cascaded with other defense mechanisms as we demonstrate
through experiments. Through extensive experimentation, we show that our
approach can serve as a robust plug-n-play defense mechanism, enhancing
membership privacy without compromising model utility in both baseline and
defended settings. For example, our method enhanced the effectiveness of recent
state-of-the-art defenses, reducing attack accuracy by an average of 5.7\% to
12.4\% across three datasets, without any impact on the model’s accuracy. By
integrating our method with prior defenses, we achieve new state-of-the-art
performance in the privacy-utility trade-off.
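A minimal sketch of the pre-inference idea: partially diffuse the incoming sample and denoise it back with a pre-trained diffusion model before it ever reaches the target classifier. The model interfaces, the re-generation depth t_star, and the schedule are assumptions.

```python
# Sketch: re-generate inputs with a diffusion model before classification.
import torch

@torch.no_grad()
def regenerate(eps_model, x, t_star, betas, alphas, alpha_bar):
    # Partially diffuse the input, then run the reverse chain back to t = 0.
    x_t = torch.sqrt(alpha_bar[t_star]) * x + torch.sqrt(1 - alpha_bar[t_star]) * torch.randn_like(x)
    for t in reversed(range(t_star + 1)):
        eps = eps_model(x_t, t)
        mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + torch.sqrt(betas[t]) * noise
    return x_t

@torch.no_grad()
def defended_predict(classifier, eps_model, x, t_star, betas, alphas, alpha_bar):
    # The target model itself is untouched; only its input is re-generated.
    return classifier(regenerate(eps_model, x, t_star, betas, alphas, alpha_bar))
```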
NeuSD: Surface Completion with Multi-View Text-to-Image Diffusion
December 07, 2023
Savva Ignatyev, Daniil Selikhanovych, Oleg Voynov, Yiqun Wang, Peter Wonka, Stamatios Lefkimmiatis, Evgeny Burnaev
We present a novel method for 3D surface reconstruction from multiple images
where only a part of the object of interest is captured. Our approach builds on
two recent developments: surface reconstruction using neural radiance fields
for the reconstruction of the visible parts of the surface, and guidance of
pre-trained 2D diffusion models in the form of Score Distillation Sampling
(SDS) to complete the shape in unobserved regions in a plausible manner. We
introduce three components. First, we suggest employing normal maps as a pure
geometric representation for SDS instead of color renderings, which are
entangled with appearance information. Second, we introduce the freezing of
the SDS noise during training which results in more coherent gradients and
better convergence. Third, we propose Multi-View SDS as a way to condition the
generation of the non-observable part of the surface without fine-tuning or
making changes to the underlying 2D Stable Diffusion model. We evaluate our
approach on the BlendedMVS dataset demonstrating significant qualitative and
quantitative improvements over competing methods.
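To make the frozen-SDS-noise component concrete, the sketch below computes an SDS-style gradient on a rendered normal map while reusing one fixed noise sample across iterations. The UNet interface, weighting, and timestep handling are assumptions.

```python
# Hedged SDS sketch with frozen noise on a rendered normal map.
import torch

def sds_grad_frozen(unet, normal_map, text_emb, alpha_bar, frozen_eps, t):
    # `frozen_eps` is drawn once at the start of optimization and reused
    # at every iteration, as the abstract describes.
    x_t = torch.sqrt(alpha_bar[t]) * normal_map + torch.sqrt(1 - alpha_bar[t]) * frozen_eps
    with torch.no_grad():
        eps_pred = unet(x_t, t, text_emb)
    w = 1 - alpha_bar[t]
    # SDS gradient w.r.t. the rendered normal map; it is backpropagated
    # through the renderer and geometry, not through the UNet.
    return w * (eps_pred - frozen_eps)
```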
Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication
December 06, 2023
Ali Naseh, Jaechul Roh, Amir Houmansadr
Diffusion-based models, such as the Stable Diffusion model, have
revolutionized text-to-image synthesis with their ability to produce
high-quality, high-resolution images. These advancements have prompted
significant progress in image generation and editing tasks. However, these
models also raise concerns due to their tendency to memorize and potentially
replicate exact training samples, posing privacy risks and enabling adversarial
attacks. Duplication in training datasets is recognized as a major factor
contributing to memorization, and various forms of memorization have been
studied so far. This paper focuses on two distinct and underexplored types of
duplication that lead to replication during inference in diffusion-based
models, particularly in the Stable Diffusion model. We delve into these
lesser-studied duplication phenomena and their implications through two case
studies, aiming to contribute to the safer and more responsible use of
generative models in various applications.
WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on
December 06, 2023
Xujie Zhang, Xiu Li, Michael Kampffmeyer, Xin Dong, Zhenyu Xie, Feida Zhu, Haoye Dong, Xiaodan Liang
Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image
onto a target person. While existing methods focus on warping the garment to
fit the body pose, they often overlook the synthesis quality around the
garment-skin boundary and realistic effects like wrinkles and shadows on the
warped garments. These limitations greatly reduce the realism of the generated
results and hinder the practical application of VITON techniques. Leveraging
the notable success of diffusion-based models in cross-modal image synthesis,
some recent diffusion-based methods have ventured to tackle this issue.
However, they tend to either consume a significant amount of training resources
or struggle to achieve realistic try-on effects and retain garment details. For
efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the
warping-based and diffusion-based paradigms via a novel informative and local
garment feature attention mechanism. Specifically, WarpDiffusion incorporates
local texture attention to reduce resource consumption and uses a novel
auto-mask module that effectively retains only the critical areas of the warped
garment while disregarding unrealistic or erroneous portions. Notably,
WarpDiffusion can be integrated as a plug-and-play component into existing
VITON methodologies, elevating their synthesis quality. Extensive experiments
on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the
superiority of WarpDiffusion, surpassing state-of-the-art methods both
qualitatively and quantitatively.
TokenCompose: Grounding Diffusion with Token-level Supervision
December 06, 2023
Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu
We present TokenCompose, a Latent Diffusion Model for text-to-image
generation that achieves enhanced consistency between user-specified text
prompts and model-generated images. Despite its tremendous success, the
standard denoising process in the Latent Diffusion Model takes text prompts as
conditions only, without an explicit constraint on the consistency between the
text prompts and the image contents, leading to unsatisfactory results when
composing multiple object categories. TokenCompose aims to improve
multi-category instance composition by introducing the token-wise consistency
terms between the image content and object segmentation maps in the finetuning
stage. TokenCompose can be applied directly to the existing training pipeline
of text-conditioned diffusion models without extra human labeling information.
By finetuning Stable Diffusion, the model exhibits significant improvements in
multi-category instance composition and enhanced photorealism for its generated
images.
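One way to picture a token-wise consistency term is a loss that concentrates each object token's cross-attention mass inside that object's segmentation mask. The sketch below is a generic grounding loss under that assumption, not the paper's exact objective.

```python
# Hedged sketch of a token-level grounding loss.
import torch

def token_grounding_loss(attn_maps, masks, eps=1e-6):
    # attn_maps: (N, H, W) cross-attention maps for N object tokens
    # masks:     (N, H, W) binary segmentation masks for the same objects
    attn = attn_maps / (attn_maps.sum(dim=(1, 2), keepdim=True) + eps)
    inside = (attn * masks).sum(dim=(1, 2))
    # Maximize the attention mass that falls inside each object's mask.
    return (1.0 - inside).mean()
```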
DiffusionSat: A Generative Foundation Model for Satellite Imagery
December 06, 2023
Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lobell, Stefano Ermon
Diffusion models have achieved state-of-the-art results on many modalities
including images, speech, and video. However, existing models are not tailored
to support remote sensing data, which is widely used in important applications
including environmental monitoring and crop-yield prediction. Satellite images
are significantly different from natural images: they can be multi-spectral and
irregularly sampled across time, and existing diffusion models trained on
images from the Web do not support them. Furthermore, remote sensing data is
inherently spatio-temporal, requiring conditional generation tasks not
supported by traditional methods based on captions or images. In this paper, we
present DiffusionSat, to date the largest generative foundation model trained
on a collection of publicly available large, high-resolution remote sensing
datasets. As text-based captions are sparsely available for satellite images,
we incorporate the associated metadata such as geolocation as conditioning
information. Our method produces realistic samples and can be used to solve
multiple generative tasks including temporal generation, superresolution given
multi-spectral inputs and in-painting. Our method outperforms previous
state-of-the-art methods for satellite image generation and is the first
large-scale $\textit{generative}$ foundation model for satellite imagery.
Context Diffusion: In-Context Aware Image Generation
December 06, 2023
Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan, Vignesh Ramanathan, Filip Radenovic
We propose Context Diffusion, a diffusion-based framework that enables image
generation models to learn from visual examples presented in context. Recent
work tackles such in-context learning for image generation, where a query image
is provided alongside context examples and text prompts. However, the quality
and fidelity of the generated images deteriorate when the prompt is not
present, demonstrating that these models are unable to truly learn from the
visual context. To address this, we propose a novel framework that separates
the encoding of the visual context from the preservation of the query image’s
structure. As a result, the model can learn from the visual context and text
prompts together, as well as from either one alone. Furthermore, we enable our
model to
handle few-shot settings, to effectively address diverse in-context learning
scenarios. Our experiments and user study demonstrate that Context Diffusion
excels in both in-domain and out-of-domain tasks, resulting in an overall
enhancement in image quality and fidelity compared to counterpart models.
FoodFusion: A Latent Diffusion Model for Realistic Food Image Generation
December 06, 2023
Olivia Markham, Yuhao Chen, Chi-en Amy Tai, Alexander Wong
Current state-of-the-art image generation models such as Latent Diffusion
Models (LDMs) have demonstrated the capacity to produce visually striking
food-related images. However, these generated images often exhibit an artistic
or surreal quality that diverges from the authenticity of real-world food
representations. This inadequacy renders them impractical for applications
requiring realistic food imagery, such as training models for image-based
dietary assessment. To address these limitations, we introduce FoodFusion, a
Latent Diffusion model engineered specifically for the faithful synthesis of
realistic food images from textual descriptions. The development of the
FoodFusion model involves harnessing an extensive array of open-source food
datasets, resulting in over 300,000 curated image-caption pairs. Additionally,
we propose and employ two distinct data cleaning methodologies to ensure that
the resulting image-text pairs maintain both realism and accuracy. The
FoodFusion model, thus trained, demonstrates a remarkable ability to generate
food images that exhibit a significant improvement in terms of both realism and
diversity over the publicly available image generation models. We openly share
the dataset and fine-tuned models to support advancements in this critical
field of food image synthesis at https://bit.ly/genai4good.
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
December 06, 2023
Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, Jun Zhu
In text-to-speech (TTS) synthesis, diffusion models have achieved promising
generation quality. However, because of the pre-defined data-to-noise diffusion
process, their prior distribution is restricted to a noisy representation,
which provides little information about the generation target. In this work, we
present a novel TTS system, Bridge-TTS, making the first attempt to substitute
the noisy Gaussian prior in established diffusion-based TTS methods with a
clean and deterministic one, which provides strong structural information of
the target. Specifically, we leverage the latent representation obtained from
text input as our prior, and build a fully tractable Schrodinger bridge between
it and the ground-truth mel-spectrogram, leading to a data-to-data process.
Moreover, the tractability and flexibility of our formulation allow us to
empirically study the design spaces such as noise schedules, as well as to
develop stochastic and deterministic samplers. Experimental results on the
LJ-Speech dataset illustrate the effectiveness of our method in terms of both
synthesis quality and sampling efficiency, significantly outperforming our
diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast
TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/
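As a loose illustration of a data-to-data process, the sketch below samples from a Brownian-bridge-style marginal that interpolates between the deterministic text-latent prior and the target mel-spectrogram. The paper's tractable Schrodinger bridge is more general; treat this purely as a schematic.

```python
# Schematic bridge marginal between a text latent and a mel-spectrogram.
import torch

def bridge_marginal(mel, text_latent, t, sigma=1.0):
    # t in (0, 1): t -> 0 approaches the clean mel target,
    # t -> 1 approaches the deterministic text-latent prior.
    mean = (1.0 - t) * mel + t * text_latent
    std = sigma * torch.sqrt(torch.tensor(t * (1.0 - t)))
    return mean + std * torch.randn_like(mel)
```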
Diffused Task-Agnostic Milestone Planner
December 06, 2023
Mineui Hong, Minjae Kang, Songhwai Oh
Addressing decision-making problems using sequence modeling to predict future
trajectories shows promising results in recent years. In this paper, we take a
step further to leverage the sequence predictive method in wider areas such as
long-term planning, vision-based control, and multi-task decision-making. To
this end, we propose a method to utilize a diffusion-based generative sequence
model to plan a series of milestones in a latent space and to have an agent
follow the milestones to accomplish a given task. The proposed method can learn
control-relevant, low-dimensional latent representations of milestones, which
makes it possible to efficiently perform long-term planning and vision-based
control. Furthermore, our approach exploits generation flexibility of the
diffusion model, which makes it possible to plan diverse trajectories for
multi-task decision-making. We demonstrate the proposed method across offline
reinforcement learning (RL) benchmarks and a visual manipulation environment.
The results show that our approach outperforms offline RL methods in solving
long-horizon, sparse-reward tasks and multi-task problems, while also achieving
the state-of-the-art performance on the most challenging vision-based
manipulation benchmark.
DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction
December 06, 2023
Yanlong Li, Chamara Madarasingha, Kanchana Thilakarathna
Point cloud streaming is becoming increasingly popular, evolving into the norm
for interactive service delivery and the future Metaverse. However, the
substantial volume of data associated with point clouds presents numerous
challenges, particularly in terms of high bandwidth consumption and large
storage capacity. Despite various solutions proposed thus far, with a focus on
point cloud compression, upsampling, and completion, these
reconstruction-related methods continue to fall short in delivering high
fidelity point cloud output. As a solution, we propose DiffPMAE, an effective
point cloud reconstruction architecture. Inspired by self-supervised learning
concepts, we combine Masked Auto-Encoding and Diffusion Model mechanisms to
remotely reconstruct point cloud data. By the nature of this
reconstruction process, DiffPMAE can be extended to many related downstream
tasks including point cloud compression, upsampling and completion. Leveraging
ShapeNet-55 and ModelNet datasets with over 60,000 objects, we validate that
DiffPMAE exceeds many state-of-the-art methods in terms of
auto-encoding and downstream tasks considered.
ReconFusion: 3D Reconstruction with Diffusion Priors
December 05, 2023
Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, Aleksander Holynski
3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at
rendering photorealistic novel views of complex scenes. However, recovering a
high-quality NeRF typically requires tens to hundreds of input images,
resulting in a time-consuming capture process. We present ReconFusion to
reconstruct real-world scenes using only a few photos. Our approach leverages a
diffusion prior for novel view synthesis, trained on synthetic and multiview
datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel
camera poses beyond those captured by the set of input images. Our method
synthesizes realistic geometry and texture in underconstrained regions while
preserving the appearance of observed regions. We perform an extensive
evaluation across various real-world datasets, including forward-facing and
360-degree scenes, demonstrating significant performance improvements over
previous few-view NeRF reconstruction approaches.
DiffusionPCR: Diffusion Models for Robust Multi-Step Point Cloud Registration
December 05, 2023
Zhi Chen, Yufan Ren, Tong Zhang, Zheng Dang, Wenbing Tao, Sabine Süsstrunk, Mathieu Salzmann
Point Cloud Registration (PCR) estimates the relative rigid transformation
between two point clouds. We propose formulating PCR as a denoising diffusion
probabilistic process, mapping noisy transformations to the ground truth.
However, using diffusion models for PCR has nontrivial challenges, such as
adapting a generative model to a discriminative task and leveraging the
estimated nonlinear transformation from the previous step. Instead of training
a diffusion model to directly map pure noise to ground truth, we map the
predictions of an off-the-shelf PCR model to ground truth. The predictions of
off-the-shelf models are often imperfect, especially in challenging cases where
the two point clouds have low overlap, and thus could be seen as noisy
versions of the real rigid transformation. In addition, we transform the
rotation matrix into a spherical linear space for interpolation between samples
in the forward process, and convert rigid transformations into auxiliary
information to implicitly exploit last-step estimations in the reverse process.
As a result, conditioned on time step, the denoising model adapts to the
increasing accuracy across steps and refines registrations. Our extensive
experiments showcase the effectiveness of our DiffusionPCR, yielding
state-of-the-art registration recall rates (95.3%/81.6%) on 3DMatch and
3DLoMatch. The code will be made public upon publication.
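The spherical linear space for rotations can be pictured with a plain Slerp between the ground-truth rotation and a noisy estimate, as sketched below; the rest of the DiffusionPCR pipeline is omitted.

```python
# Slerp-based rotation interpolation (illustrative).
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_rotation(R_gt, R_noisy, s):
    # s in [0, 1]: 0 returns the ground-truth rotation, 1 the noisy estimate.
    key_rots = Rotation.from_matrix(np.stack([R_gt, R_noisy]))
    return Slerp([0.0, 1.0], key_rots)([s])[0].as_matrix()

R_gt = np.eye(3)
R_noisy = Rotation.from_euler("z", 40, degrees=True).as_matrix()
print(interpolate_rotation(R_gt, R_noisy, 0.5))  # roughly a 20-degree rotation about z
```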
Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection
December 05, 2023
Cheng-Ju Ho, Chen-Hsuan Tai, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai
Semi-supervised object detection is crucial for 3D scene understanding,
efficiently addressing the limitation of acquiring large-scale 3D bounding box
annotations. Existing methods typically employ a teacher-student framework with
pseudo-labeling to leverage unlabeled point clouds. However, producing reliable
pseudo-labels in a diverse 3D space remains challenging. In this work, we
propose Diffusion-SS3D, a new perspective of enhancing the quality of
pseudo-labels via the diffusion model for semi-supervised 3D object detection.
Specifically, we add noise to produce corrupted 3D object size and class
label distributions, and then utilize the diffusion model as a denoising
process to obtain bounding box outputs. Moreover, we integrate the diffusion
model into the teacher-student framework, so that the denoised bounding boxes
can be used to improve pseudo-label generation, as well as the entire
semi-supervised learning process. We conduct experiments on the ScanNet and SUN
RGB-D benchmark datasets to demonstrate that our approach achieves
state-of-the-art performance against existing methods. We also present
extensive analysis to understand how our diffusion model design affects
performance in semi-supervised learning.
Deterministic Guidance Diffusion Model for Probabilistic Weather Forecasting
December 05, 2023
Donggeun Yoon, Minseok Seo, Doyi Kim, Yeji Choi, Donghyeon Cho
Weather forecasting requires not only accuracy but also the ability to
perform probabilistic prediction. However, deterministic weather forecasting
methods do not support probabilistic predictions, and conversely, probabilistic
models tend to be less accurate. To address these challenges, in this paper, we
introduce the \textbf{\textit{D}}eterministic \textbf{\textit{G}}uidance
\textbf{\textit{D}}iffusion \textbf{\textit{M}}odel (DGDM) for probabilistic
weather forecasting, integrating benefits of both deterministic and
probabilistic approaches. During the forward process, both the deterministic
and probabilistic models are trained end-to-end. In the reverse process,
weather forecasting leverages the predicted result from the deterministic
model, using it as an intermediate starting point for the probabilistic model. By
fusing deterministic models with probabilistic models in this manner, DGDM is
capable of providing accurate forecasts while also offering probabilistic
predictions. To evaluate DGDM, we assess it on the global weather forecasting
dataset (WeatherBench) and the common video frame prediction benchmark (Moving
MNIST). We also introduce and evaluate the Pacific Northwest Windstorm
(PNW)-Typhoon weather satellite dataset to verify the effectiveness of DGDM in
high-resolution regional forecasting. As a result of our experiments, DGDM
achieves state-of-the-art results not only in global forecasting but also in
regional forecasting. The code is available at:
\url{https://github.com/DongGeun-Yoon/DGDM}.
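A minimal sketch of starting the reverse process from the deterministic forecast rather than from pure noise: diffuse the deterministic prediction to an intermediate step and denoise repeatedly to obtain an ensemble. Interfaces and the schedule are assumptions.

```python
# Sketch: deterministic forecast as the intermediate starting point.
import torch

@torch.no_grad()
def probabilistic_forecast(det_model, eps_model, past_frames, t_start,
                           betas, alphas, alpha_bar, n_samples=8):
    det = det_model(past_frames)                      # deterministic forecast
    samples = []
    for _ in range(n_samples):
        x = torch.sqrt(alpha_bar[t_start]) * det + \
            torch.sqrt(1 - alpha_bar[t_start]) * torch.randn_like(det)
        for t in reversed(range(t_start + 1)):
            eps = eps_model(x, t, past_frames)
            mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[t]) * noise
        samples.append(x)
    return torch.stack(samples)                       # ensemble of forecasts
```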
BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
December 05, 2023
Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang
Diffusion models have made tremendous progress in text-driven image and video
generation. Now text-to-image foundation models are widely applied to various
downstream image synthesis tasks, such as controllable image generation and
image editing, while downstream video synthesis tasks are less explored for
several reasons. First, it requires huge memory and compute overhead to train a
video generation foundation model. Even with video foundation models,
additional costly training is still required for downstream video synthesis
tasks. Second, although some works extend image diffusion models into videos in
a training-free manner, temporal consistency cannot be well preserved. Finally,
these adaptation methods are specifically designed for one task and fail to
generalize to different downstream video synthesis tasks. To mitigate these
issues, we propose a training-free general-purpose video synthesis framework,
coined as BIVDiff, via bridging specific image diffusion models and general
text-to-video foundation diffusion models. Specifically, we first use an image
diffusion model (like ControlNet, Instruct Pix2Pix) for frame-wise video
generation, then perform Mixed Inversion on the generated video, and finally
input the inverted latents into the video diffusion model for temporal
smoothing. Decoupling image and video models enables flexible image model
selection for different purposes, which endows the framework with strong task
generalization and high efficiency. To validate the effectiveness and general
use of BIVDiff, we perform a wide range of video generation tasks, including
controllable video generation, video editing, video inpainting, and outpainting.
Our project page is available at https://bivdiff.github.io.
Analyzing and Improving the Training Dynamics of Diffusion Models
December 05, 2023
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, Samuli Laine
cs.CV, cs.AI, cs.LG, cs.NE, stat.ML
Diffusion models currently dominate the field of data-driven image synthesis
with their unparalleled scaling to large datasets. In this paper, we identify
and rectify several causes for uneven and ineffective training in the popular
ADM diffusion model architecture, without altering its high-level structure.
Observing uncontrolled magnitude changes and imbalances in both the network
activations and weights over the course of training, we redesign the network
layers to preserve activation, weight, and update magnitudes on expectation. We
find that systematic application of this philosophy eliminates the observed
drifts and imbalances, resulting in considerably better networks at equal
computational complexity. Our modifications improve the previous record FID of
2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic
sampling.
As an independent contribution, we present a method for setting the
exponential moving average (EMA) parameters post-hoc, i.e., after completing
the training run. This allows precise tuning of EMA length without the cost of
performing several training runs, and reveals its surprising interactions with
network architecture, training time, and guidance.
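As an informal illustration of preserving magnitudes "on expectation", the layer below normalizes each weight row to unit norm in the forward pass, so unit-variance inputs keep unit variance at the output. This is a generic construction in the same spirit, not the paper's layer definitions.

```python
# Hedged sketch of a magnitude-preserving linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        # Unit-norm rows: for roughly unit-variance, independent inputs, each
        # output has unit variance in expectation, regardless of weight growth.
        w = self.weight / self.weight.norm(dim=1, keepdim=True).clamp_min(1e-8)
        return F.linear(x, w)
```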
Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler
December 05, 2023
Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May
Diffusion models are a new class of generative models that have recently been
applied to speech enhancement successfully. Previous works have demonstrated
their superior performance in mismatched conditions compared to
state-of-the-art discriminative models. However, this was investigated with a single
database for training and another one for testing, which makes the results
highly dependent on the particular databases. Moreover, recent developments
from the image generation literature remain largely unexplored for speech
enhancement. These include several design aspects of diffusion models, such as
the noise schedule or the reverse sampler. In this work, we systematically
assess the generalization performance of a diffusion-based speech enhancement
model by using multiple speech, noise and binaural room impulse response (BRIR)
databases to simulate mismatched acoustic conditions. We also experiment with a
noise schedule and a sampler that have not been applied to speech enhancement
before. We show that the proposed system substantially benefits from using
multiple databases for training, and achieves superior performance compared to
state-of-the-art discriminative models in both matched and mismatched
conditions. We also show that a Heun-based sampler achieves superior
performance at a smaller computational cost compared to a sampler commonly used
for speech enhancement.
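For reference, a Heun-based sampler is a second-order solver of the probability-flow ODE: an Euler predictor step followed by a corrector that averages the derivatives at both ends. The sketch below assumes a denoiser interface in the Karras et al. formulation and an illustrative sigma schedule.

```python
# Minimal Heun (second-order) sampler sketch.
import torch

@torch.no_grad()
def heun_sample(denoiser, x, sigmas):
    # sigmas: decreasing noise levels, ending at 0.
    for i in range(len(sigmas) - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        d = (x - denoiser(x, s)) / s                   # ODE derivative at s
        x_euler = x + (s_next - s) * d                 # Euler predictor
        if s_next > 0:
            d_next = (x_euler - denoiser(x_euler, s_next)) / s_next
            x = x + (s_next - s) * 0.5 * (d + d_next)  # Heun corrector
        else:
            x = x_euler
    return x
```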
Stable Diffusion Exposed: Gender Bias from Prompt to Image
December 05, 2023
Yankun Wu, Yuta Nakashima, Noa Garcia
Recent studies have highlighted biases in generative models, shedding light
on their predisposition towards gender-based stereotypes and imbalances. This
paper contributes to this growing body of research by introducing an evaluation
protocol designed to automatically analyze the impact of gender indicators on
Stable Diffusion images. Leveraging insights from prior work, we explore how
gender indicators not only affect gender presentation but also the
representation of objects and layouts within the generated images. Our findings
include the existence of differences in the depiction of objects, such as
instruments tailored for specific genders, and shifts in overall layouts. We
also reveal that neutral prompts tend to produce images more aligned with
masculine prompts than their feminine counterparts, providing valuable insights
into the nuanced gender biases inherent in Stable Diffusion.
Diffusion Noise Feature: Accurate and Fast Generated Image Detection
December 05, 2023
Yichi Zhang, Xiaogang Xu
Generative models have reached an advanced stage where they can produce
remarkably realistic images. However, this remarkable generative capability
also introduces the risk of disseminating false or misleading information.
Notably, existing image detectors for generated images encounter challenges
such as low accuracy and limited generalization. This paper addresses the
issue by seeking a representation with strong generalization capabilities
to enhance the detection of generated images. Our investigation has revealed
that real and generated images display distinct latent Gaussian representations
when subjected to an inverse diffusion process within a pre-trained diffusion
model. Exploiting this disparity, we can amplify subtle artifacts in generated
images. Building upon this insight, we introduce a novel image representation
known as Diffusion Noise Feature (DNF). DNF is an ensemble representation that
estimates the noise generated during the inverse diffusion process. A simple
classifier, e.g., ResNet, trained on DNF achieves high accuracy, robustness,
and generalization capabilities for detecting generated images, even from
previously unseen classes or models. We conducted experiments using a widely
recognized and standard dataset, achieving state-of-the-art detection
performance.
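The sketch below shows one plausible way to build a DNF-style feature: run a few deterministic DDIM inversion steps with a pre-trained noise predictor and aggregate the predicted noise maps, which can then be fed to a small classifier. The interface and the simple mean aggregation are assumptions.

```python
# Hedged sketch of a diffusion-noise feature via DDIM inversion.
import torch

@torch.no_grad()
def dnf_feature(eps_model, image, alpha_bar, n_steps=10):
    # alpha_bar must cover at least n_steps + 1 timesteps.
    x = image
    noise_maps = []
    for t in range(n_steps):
        eps = eps_model(x, t)
        noise_maps.append(eps)
        # Deterministic DDIM inversion step t -> t + 1.
        x0_hat = (x - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])
        x = torch.sqrt(alpha_bar[t + 1]) * x0_hat + torch.sqrt(1 - alpha_bar[t + 1]) * eps
    return torch.stack(noise_maps).mean(dim=0)  # feature for a simple classifier
```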
GeNIe: Generative Hard Negative Images Through Diffusion
December 05, 2023
Soroush Abbasi Koohpayegani, Anuj Singh, K L Navaneet, Hadi Jamali-Rad, Hamed Pirsiavash
Data augmentation is crucial in training deep models, preventing them from
overfitting to limited data. Common data augmentation methods are effective,
but recent advancements in generative AI, such as diffusion models for image
generation, enable more sophisticated augmentation techniques that produce data
resembling natural images. We recognize that augmented samples closer to the
ideal decision boundary of a classifier are particularly effective and
efficient in guiding the learning process. We introduce GeNIe which leverages a
diffusion model conditioned on a text prompt to merge contrasting data points
(an image from the source category and a text prompt from the target category)
to generate challenging samples for the target category. Inspired by recent
image editing methods, we limit the number of diffusion iterations and the
amount of noise. This ensures that the generated image retains low-level and
contextual features from the source image, potentially conflicting with the
target category. Our extensive experiments, in few-shot and also long-tail
distribution settings, demonstrate the effectiveness of our novel augmentation
method, especially benefiting categories with a limited number of examples.
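A GeNIe-style hard negative can be sketched with an off-the-shelf image-to-image pipeline: start from a source-category image and denoise for only a few steps (small strength) under a target-category prompt, so that source context survives while the semantics shift. The model name, file names, and hyperparameters below are placeholders, not the paper's settings.

```python
# Illustrative use of an img2img pipeline for a hard negative.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("cat.jpg").convert("RGB").resize((512, 512))  # source category
hard_negative = pipe(
    prompt="a photo of a dog",   # target category
    image=source,
    strength=0.35,               # few diffusion steps keep low-level source context
    guidance_scale=7.5,
).images[0]
hard_negative.save("hard_negative_dog.png")
```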
Kernel Diffusion: An Alternate Approach to Blind Deconvolution
December 04, 2023
Yash Sanghvi, Yiheng Chi, Stanley H. Chan
Blind deconvolution problems are severely ill-posed because neither the
underlying signal nor the forward operator is known exactly.
Conventionally, these problems are solved by alternating between estimation of
the image and kernel while keeping the other fixed. In this paper, we show that
this framework is flawed because of its tendency to get trapped in local minima
and, instead, suggest the use of a kernel estimation strategy with a non-blind
solver. This framework is employed by a diffusion method which is trained to
sample the blur kernel from the conditional distribution with guidance from a
pre-trained non-blind solver. The proposed diffusion method leads to
state-of-the-art results on both synthetic and real blur datasets.
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
December 04, 2023
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler
Monocular depth estimation is a fundamental computer vision task. Recovering
3D depth from a single image is geometrically ill-posed and requires scene
understanding, so it is not surprising that the rise of deep learning has led
to a breakthrough. The impressive progress of monocular depth estimators has
mirrored the growth in model capacity, from relatively modest CNNs to large
Transformer architectures. Still, monocular depth estimators tend to struggle
when presented with images with unfamiliar content and layout, since their
knowledge of the visual world is restricted by the data seen during training,
and challenged by zero-shot generalization to new domains. This motivates us to
explore whether the extensive priors captured in recent generative diffusion
models can enable better, more generalizable depth estimation. We introduce
Marigold, a method for affine-invariant monocular depth estimation that is
derived from Stable Diffusion and retains its rich prior knowledge. The
estimator can be fine-tuned in a couple of days on a single GPU using only
synthetic training data. It delivers state-of-the-art performance across a wide
range of datasets, including over 20% performance gains in specific cases.
Project page: https://marigoldmonodepth.github.io.
DiffiT: Diffusion Vision Transformers for Image Generation
December 04, 2023
Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat
Diffusion models with their powerful expressivity and high sample quality
have enabled many new applications and use-cases in various domains. For sample
generation, these models rely on a denoising neural network that generates
images by iterative denoising. Yet, the role of the denoising network
architecture is not well studied, with most efforts relying on convolutional
residual U-Nets.
In this paper, we study the effectiveness of vision transformers in
diffusion-based generative learning. Specifically, we propose a new model,
denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid
hierarchical architecture with a U-shaped encoder and decoder. We introduce a
novel time-dependent self-attention module that allows attention layers to
adapt their behavior at different stages of the denoising process in an
efficient manner. We also introduce latent DiffiT, which consists of a
transformer model with the proposed self-attention layers, for high-resolution image
generation. Our results show that DiffiT is surprisingly effective in
generating high-fidelity images, and it achieves state-of-the-art (SOTA)
benchmarks on a variety of class-conditional and unconditional synthesis tasks.
In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on
ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
Stochastic Optimal Control Matching
December 04, 2023
Carles Domingo-Enrich, Jiequn Han, Brandon Amos, Joan Bruna, Ricky T. Q. Chen
math.OC, cs.LG, cs.NA, math.NA, math.PR, stat.ML
Stochastic optimal control, which has the goal of driving the behavior of
noisy systems, is broadly applicable in science, engineering and artificial
intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a
novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal
control that stems from the same philosophy as the conditional score matching
loss for diffusion models. That is, the control is learned via a least squares
problem by trying to fit a matching vector field. The training loss, which is
closely connected to the cross-entropy loss, is optimized with respect to both
the control function and a family of reparameterization matrices which appear
in the matching vector field. The optimization with respect to the
reparameterization matrices aims at minimizing the variance of the matching
vector field. Experimentally, our algorithm achieves lower error than all the
existing IDO techniques for stochastic optimal control for three out of four
control problems, in some cases by an order of magnitude. The key idea
underlying SOCM is the path-wise reparameterization trick, a novel technique
that is of independent interest, e.g., for generative modeling. Code at
https://github.com/facebookresearch/SOC-matching
Conditional Variational Diffusion Models
December 04, 2023
Gabriel della Maggiora, Luis Alberto Croquevielle, Nikita Desphande, Harry Horsley, Thomas Heinis, Artur Yakimovich
cs.CV, cs.AI, cs.LG, stat.ML, I.2.6
Inverse problems aim to determine parameters from observations, a crucial
task in engineering and science. Lately, generative models, especially
diffusion models, have gained popularity in this area for their ability to
produce realistic solutions and their good mathematical properties. Despite
their success, an important drawback of diffusion models is their sensitivity
to the choice of variance schedule, which controls the dynamics of the
diffusion process. Fine-tuning this schedule for specific applications is
crucial but time-consuming and does not guarantee an optimal result. We propose a
novel approach for learning the schedule as part of the training process. Our
method supports probabilistic conditioning on data, provides high-quality
solutions, and is flexible, proving able to adapt to different applications
with minimum overhead. This approach is tested in two unrelated inverse
problems: super-resolution microscopy and quantitative phase imaging, yielding
comparable or superior results to previous methods and fine-tuned diffusion
models. We conclude that fine-tuning the schedule by experimentation should be
avoided because it can be learned during training in a stable way that yields
better results.
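One generic way to learn a schedule during training is to parameterize it with a small monotone network (positive weights), trained jointly with the denoiser; the sketch below shows such a construction with fixed endpoints. It is an assumption-laden illustration, not the paper's parameterization.

```python
# Hedged sketch of a learnable, monotone noise schedule gamma(t).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneSchedule(nn.Module):
    def __init__(self, hidden=64, gamma_min=-10.0, gamma_max=10.0):
        super().__init__()
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, 1)
        self.gamma_min, self.gamma_max = gamma_min, gamma_max

    def _raw(self, t):
        # Softplus keeps the weights positive, so the map is nondecreasing in t.
        h = torch.sigmoid(F.linear(t, F.softplus(self.l1.weight), self.l1.bias))
        return F.linear(h, F.softplus(self.l2.weight), self.l2.bias)

    def forward(self, t):
        # t in [0, 1]; rescale so gamma(0) = gamma_min and gamma(1) = gamma_max.
        r0, r1 = self._raw(torch.zeros_like(t)), self._raw(torch.ones_like(t))
        frac = (self._raw(t) - r0) / (r1 - r0 + 1e-8)
        return self.gamma_min + (self.gamma_max - self.gamma_min) * frac
```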
Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation
December 04, 2023
Joshua Niemeijer, Manuel Schwonberg, Jan-Aike Termöhlen, Nico M. Schmidt, Tim Fingscheidt
When models, e.g., for semantic segmentation, are applied to images that are
vastly different from training data, the performance will drop significantly.
Domain adaptation methods try to overcome this issue, but need samples from the
target domain. However, this might not always be feasible for various reasons
and therefore domain generalization methods are useful as they do not require
any target data. We present a new diffusion-based domain extension (DIDEX)
method and employ a diffusion model to generate a pseudo-target domain with
diverse text prompts. In contrast to existing methods, this allows controlling
the style and content of the generated images and introducing high
diversity. In a second step, we train a generalizing model by adapting towards
this pseudo-target domain. We outperform previous approaches by a large margin
across various datasets and architectures without using any real data. For the
generalization from GTA5, we improve state-of-the-art mIoU performance by 3.8%
absolute on average and for SYNTHIA by 11.8% absolute, marking a big step for
the generalization performance on these benchmarks. Code is available at
https://github.com/JNiemeijer/DIDEX
Fully Spiking Denoising Diffusion Implicit Models
December 04, 2023
Ryo Watanabe, Yusuke Mukuta, Tatsuya Harada
Spiking neural networks (SNNs) have garnered considerable attention owing to
their ability to run on neuromorphic devices with super-high speeds and
remarkable energy efficiencies. SNNs can be used in place of conventional
neural networks in time- and energy-consuming applications. However, research on
generative models within SNNs remains limited, despite their advantages. In
particular, diffusion models are a powerful class of generative models, whose
image generation quality surpasses that of other generative models, such as
GANs. However, diffusion models are characterized by high computational costs
and long inference times owing to their iterative denoising feature. Therefore,
we propose a novel approach, the fully spiking denoising diffusion implicit
model (FSDDIM), to construct a diffusion model within SNNs and leverage the high speed
and low energy consumption features of SNNs via synaptic current learning
(SCL). SCL bridges the gap between diffusion models, which use a neural network
to estimate real-valued parameters of a predefined probability distribution,
and SNNs, which output binary spike trains. SCL enables us to complete the
entire generative process of diffusion models exclusively using SNNs. We
demonstrate that the proposed method outperforms the state-of-the-art fully
spiking generative model.
ResEnsemble-DDPM: Residual Denoising Diffusion Probabilistic Models for Ensemble Learning
December 04, 2023
Shi Zhenning, Dong Changsheng, Xie Xueshuo, Pan Bin, He Along, Li Tao
Nowadays, denoising diffusion probabilistic models have been adapted for many
image segmentation tasks. However, existing end-to-end models have already
demonstrated remarkable capabilities. Rather than using denoising diffusion
probabilistic models alone, integrating the abilities of both denoising
diffusion probabilistic models and existing end-to-end models can better
improve the performance of image segmentation. Based on this, we implicitly
introduce a residual term into the diffusion process and propose
ResEnsemble-DDPM, which seamlessly integrates the diffusion model and the
end-to-end model through ensemble learning. The output distributions of these
two models are strictly symmetric with respect to the ground truth
distribution, allowing us to integrate the two models by reducing the residual
term. Experimental results demonstrate that our ResEnsemble-DDPM can further
improve the capabilities of existing models. Furthermore, its ensemble learning
strategy can be generalized to other downstream tasks in image generation
while remaining highly competitive.
Diffusion Posterior Sampling for Nonlinear CT Reconstruction
December 03, 2023
Shudong Li, Matthew Tivnan, Yuan Shen, J. Webster Stayman
physics.med-ph, cs.CV, eess.IV, physics.comp-ph, J.3; I.4.4; I.4.5
Diffusion models have been demonstrated as powerful deep learning tools for
image generation in CT reconstruction and restoration. Recently, diffusion
posterior sampling, where a score-based diffusion prior is combined with a
likelihood model, has been used to produce high quality CT images given
low-quality measurements. This technique is attractive since it permits a
one-time, unsupervised training of a CT prior, which can then be incorporated
with an arbitrary data model. However, current methods only rely on a linear
model of x-ray CT physics to reconstruct or restore images. While it is common
to linearize the transmission tomography reconstruction problem, this is an
approximation to the true and inherently nonlinear forward model. We propose a
new method that solves the inverse problem of nonlinear CT image reconstruction
via diffusion posterior sampling. We implement a traditional unconditional
diffusion model by training a prior score function estimator, and apply Bayes
rule to combine this prior with a measurement likelihood score function derived
from the nonlinear physical model to arrive at a posterior score function that
can be used to sample the reverse-time diffusion process. This plug-and-play
method allows incorporation of a diffusion-based prior with generalized
nonlinear CT image reconstruction into multiple CT system designs with
different forward models, without the need for any additional training. We
develop the algorithm that performs this reconstruction, including an
ordered-subsets variant for accelerated processing, and demonstrate the
technique in both fully sampled low dose data and sparse-view geometries using
a single unsupervised training of the prior.
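The plug-and-play combination of a learned prior with a nonlinear measurement model can be sketched as adding an autograd-computed likelihood score (here a Beer-Lambert mean with Gaussian noise) to the prior score at each reverse step. The system matrix, noise model, and score interface are illustrative; evaluating the likelihood at the noisy iterate is a further simplification.

```python
# Hedged sketch of a posterior score with a nonlinear CT forward model.
import torch

def posterior_score(prior_score_model, x, t, y, A, I0=1e5, sigma_y=1.0):
    # Likelihood score: gradient of log N(y; I0 * exp(-A x), sigma_y^2) w.r.t. x.
    x_in = x.detach().requires_grad_(True)
    y_bar = I0 * torch.exp(-(A @ x_in))          # nonlinear mean measurement
    log_lik = -0.5 * ((y - y_bar) ** 2).sum() / sigma_y ** 2
    lik_score = torch.autograd.grad(log_lik, x_in)[0]
    # Bayes' rule in score form: posterior score = prior score + likelihood score.
    return prior_score_model(x, t) + lik_score
```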