PiGW: A Plug-in Generative Watermarking Framework

Rui Ma¹, Mengxi Guo², Yuming Li¹, Hengyuan Zhang¹, Cong Ma³, Yuan Li¹, Xiaodong Xie¹, and Shanghang Zhang¹

¹ Peking University
² Bytedance Inc.
³ SenseAuto Research

arXiv:2403.12053v1 [Link] 4 Jan 2024

Abstract

Integrating watermarks into generative images is critical for protecting intellectual property and enhancing artificial intelligence security. This paper proposes Plug-in Generative Watermarking (PiGW) as a general framework for integrating watermarks into generative images. PiGW embeds watermark information into the initial noise with an adaptive frequency spectrum mask using a learnable watermark embedding network. Furthermore, we introduce a warmup strategy that gradually increases the timesteps to optimize training stability. Extensive experiments demonstrate that PiGW embeds the watermark into the generated image with negligible quality loss while achieving true invisibility and high resistance to noise attacks. Moreover, PiGW can serve as a plugin for various commonly used generative structures and multimodal generative content types. Finally, we demonstrate that PiGW can also be utilized for detecting generated images, which contributes to promoting secure AI development. The project code will be made available on GitHub.

Figure 1. Difference between post-hoc watermarking and generative watermarking. Traditional watermarks involve embedding the watermark into an existing image through post-hoc modification. However, generative watermarks incorporate the watermark along with Gaussian noise into the generator, resulting in the watermark becoming an integral part of the generated content.

1. Introduction

The latest developments in generative models have made it easier for people to create images that are increasingly indistinguishable from their natural counterparts. In particular, some text-to-image generative models, such as DALL·E 2 [39] and Stable Diffusion [36, 40], have become prevalent creative tools among both professionals and non-professionals. With the exponential growth of Artificial Intelligence Generated Content (AIGC) expected in the future, there arises an urgent need for robust copyright protection tools.

Watermarking is commonly used for protecting the copyright of images. Most watermarking works [21, 29, 31, 54] aim to make minimal post-hoc modifications to pre-existing images. However, such methods often struggle to balance invisibility and robustness under strong noise attacks. Given the growing use of text-to-image generative models for creating images, generative watermarking has been proposed [51]. As shown in Figure 1, generative watermarks differ from traditional post-hoc watermarks, as they can be embedded during image generation, resulting in the watermark information becoming an intrinsic part of the image and achieving true invisibility. Recent advances in generative watermarking have shown progress in two directions. The first employs fine-tuning of the generators so that they output watermarked images directly [9, 10, 32, 35], while the second incorporates a watermark embedding module as a plugin into existing generative models [51]. Although the fine-tuning approach is effective, it can incur significant costs, making the plugin approach more user-friendly and practical. Tree-Ring watermarking [51] is a seminal plugin generative watermarking method. It imprints a manually designed watermark onto the initial noise vector during image generation. However, the handcrafted design of the watermark limits its robustness against specific types of noise, such as cropping and Gaussian noise. Furthermore, this method is not easily transferable because it is tailored to the diffusion model.

In this paper, we present Plug-in Generative Watermarking (PiGW) as a framework for watermarking generative images. Specifically, PiGW uses a private-key encoder to imprint watermarks onto an initial noise vector, making the watermark an integral part of the generated image instead of a post-hoc modification. For generative models that require multiple operations, we propose two strategies: an adaptive frequency spectrum mask that prioritizes retaining more watermark information in less vulnerable areas, and a warm-up training strategy that stabilizes the training of the watermark module by gradually increasing the timesteps of the generative model. As an easy-to-use plugin, PiGW can seamlessly integrate watermarks into generated images with true invisibility and high resistance to noise attacks. Moreover, as far as we know, PiGW is the first general watermarking framework that can be applied to the most commonly used generative structures, including diffusion models [36, 39, 40], GANs [11], and VAEs [25], as well as multimodal generative content types such as audio [28] and 3D models [22]. Finally, as PiGW can embed information into images while preserving generation quality, it can be applied to the generated-image detection task, promoting the secure development of artificial intelligence.

Our contributions can be summarized as follows:
(1) We propose Plug-in Generative Watermarking as a unified watermarking framework for image generation. It learns to embed watermarks into images during the generation process and achieves true invisibility and strong robustness against various types of noise.
(2) We extend our framework to accommodate more general generative scenarios, demonstrating its scalability. It is compatible with the most commonly used image-generative models and extensible to other multimodal generation tasks.
(3) Our framework can be applied to detecting generated images, which can be beneficial in promoting the secure development of generative model techniques.

2. Related work

2.1. Post-hoc watermarking

Watermarking in digital content, particularly images, is important for copyright protection. Traditional methods, including algorithms based on singular value decomposition [33, 48, 49], moment-based watermarking algorithms [18, 19], and transform-domain watermarking algorithms [1, 13], were commonly used. However, with the development of deep learning, the first CNN-based watermarking work, HiDDeN [54], was introduced. HiDDeN utilizes an end-to-end approach to train the watermark encoder and decoder simultaneously. Subsequently, many deep learning watermarking methods have been proposed, including two-stage algorithms [29], compression-resistant algorithms [21], and invertible watermarking algorithms [8, 31]. These post-hoc watermarking methods can resist most noise attacks, but they are still ineffective against strong noise attacks and inevitably require modifications to the image.

2.2. Generative watermarking

With the advancement of generative models, many photorealistic generative models such as VAE [25], GAN [11], diffusion models [7, 15], DALL-E [38], and Imagen [43] are widely used. The increasing prevalence of generated images poses new challenges for copyright protection. Two primary watermarking approaches prevail: fine-tuned generators and plug-in methods. Several watermarking methods using fine-tuned generators have been proposed, including the supervised GAN approach [9], the CycleGAN technique [26], the watermark diffusion process [35], the pre-set prompt approach [30], co-development of a watermark generator and detector [32], and integration of undetectable watermarks at the latent decoder level of Latent Diffusion Models [10]. The drawbacks of these methods are evident, as fine-tuning generators is time-intensive and not easily transferable. RoSteALS [2] and Tree-Ring watermarks [51] are two existing works in plugin generative watermarking. RoSteALS employs pre-trained autoencoders for steganography by embedding secret information in image latent codes, which can negatively impact image quality. Tree-Ring watermarks, on the other hand, create manually designed watermarks on noise vectors during image generation for true invisibility, but face limitations in resilience against particular types of noise. Moreover, they have low transferability due to the manual watermark design and the dependency on specific models.

3. Method

As shown in Figure 2, the PiGW framework mainly consists of four modules. The Embedding module embeds the watermark into the initial noise x_T. The Generation module is responsible for generating the corresponding image. The Attack module simulates potential tampering attacks on the watermarked image. The Authentication module determines whether the public key (image) matches the private key.
Figure 2. Framework diagram of PiGW: the embedding module combines the masked watermark with the initial noise z in the frequency domain to yield z_wm. The generation module either denoises z_wm or decodes it to obtain the watermarked image I_wm. The attack module simulates edits and tampering on the image, resulting in I'_wm. The authentication module derives the final outcome by assessing the pairing of the public and private keys.

3.1. Embedding Module

A uniformly distributed random key ∈ R^{b×l} is generated as the initial step in the watermarking process, where b and l are the dimensions of the key vector. The generated key is input into an encoding network designed to encode the key into a form more suitable for watermark embedding. The encoding function E can be formulated as:

E_key = σ(W_e · key + b_e)

where E_key is the encoded key, W_e is the weight matrix, b_e is the bias vector, and σ denotes the activation function. The encoded key E_key is then passed through an adaptive frequency mask M_adp, which selectively alters the frequency components for an optimal embedding:

z_key = S · M_adp ⊙ E_key

where ⊙ represents element-wise multiplication, z_key is the frequency-masked encoded key, and S ∈ R^{n×m} is a strength factor.

Simultaneously, a Gaussian noise Noise ∈ R^{n×m} is generated and transformed into the frequency domain using the Fast Fourier Transform (FFT), yielding F_Noise. The transformed noise F_Noise is then subjected to amplitude and phase modifications, formulated as:

F_Noise = A(FFT2(Noise)) · e^{iφ(FFT2(Noise))}

where A(·) and φ(·) represent the amplitude and phase adjustment functions, respectively.

The processed noise F_Noise is subsequently fused with the latent representation of the key, z_key, through concatenation and the inverse Fourier transform (IFFT):

z_wm = IFFT2(F_Noise ⊕ z_key)

yielding the watermark embedding vector z_wm, which incorporates both the encoded key and the adjusted noise.

3.2. Generation Module

The watermark generation module leverages z_wm as the stochastic input to a generative model G, processed with a text prompt C to synthesize the watermarked image I_wm:

I_wm = G(z_wm, C, θ_G)

where θ_G denotes the set of parameters of the generative model G.

3.3. Attack Module

The Attack module evaluates the robustness of the watermark through a series of simulated perturbations. An attack transformation function T, parameterized by θ_T, applies various manipulations to I_wm:

I'_wm = T(I_wm, θ_T)

resulting in the perturbed image I'_wm.
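To make the embedding pipeline of Sec. 3.1 concrete, the following is a minimal PyTorch sketch of the frequency-domain fusion. It assumes a simple linear key encoder and a learnable mask tensor, and it simplifies the concatenation-based fusion to an additive update of the spectrum; the names (KeyEncoder, adaptive_mask, strength) are illustrative and not taken from the released code.

```python
import torch
import torch.nn as nn

class KeyEncoder(nn.Module):
    """Projects a binary key to a 2-D map matching the latent noise shape (illustrative)."""
    def __init__(self, key_bits: int, h: int, w: int):
        super().__init__()
        self.proj = nn.Linear(key_bits, h * w)
        self.h, self.w = h, w

    def forward(self, key: torch.Tensor) -> torch.Tensor:
        # E_key = sigma(W_e * key + b_e), reshaped to a spatial map
        return torch.tanh(self.proj(key)).view(-1, 1, self.h, self.w)

def embed_watermark(noise: torch.Tensor, key: torch.Tensor,
                    encoder: KeyEncoder, adaptive_mask: torch.Tensor,
                    strength: float = 1.0) -> torch.Tensor:
    """Fuse the masked key with the initial noise in the Fourier domain (z -> z_wm)."""
    e_key = encoder(key)                                   # encoded key
    z_key = strength * adaptive_mask * e_key               # z_key = S * M_adp ⊙ E_key
    f_noise = torch.fft.fftshift(torch.fft.fft2(noise))    # centered spectrum of the noise
    f_wm = f_noise + z_key                                 # simplified additive fusion
    z_wm = torch.fft.ifft2(torch.fft.ifftshift(f_wm)).real
    return z_wm

# usage: a 4x64x64 latent (Stable-Diffusion-like) and a 30-bit key
noise = torch.randn(1, 4, 64, 64)
key = torch.randint(0, 2, (1, 30)).float()
encoder = KeyEncoder(30, 64, 64)
mask = torch.ones(1, 1, 64, 64)   # in the real model this mask is learnable (Sec. 3.1)
z_wm = embed_watermark(noise, key, encoder, mask)
```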
Figure 3. Initialization of the watermark frequency-adaptive masks. (a) The mask form for two-dimensional noise x_T (e.g., Stable Diffusion), depicting the low-frequency (brighter) and high-frequency (darker) components; the watermark is embedded into the low-frequency region of x_T. (b) Numerical visualization of (a). (c) The mask form for one-dimensional noise x_T (e.g., StyleGAN-v3). The initial noise masks all employ a linear gradient masking approach.

3.4. Authentication Module

The attacked image I'_wm first undergoes a Fast Fourier Transform (FFT), which separates the image into amplitude and phase components:

Rep_{I'_wm} = A[FFT2(I'_wm)] · e^{iφ[FFT2(I'_wm)]}

where A and φ represent the amplitude and phase adjustment functions, respectively. Rep_{I'_wm} is then concatenated with I'_wm and used as the input of the private-key decoder to obtain the latent z'_wm:

z'_wm = D_{pri_key}([Rep_{I'_wm}, I'_wm])

Simultaneously, the private key is input into the encoding network and passed through an adaptive frequency mask M_adp, similar to the process in the embedding module, which can be formulated as:

M_key = M_adp ⊙ E_key

where ⊙ represents element-wise multiplication and M_key is the frequency-masked encoded key.

Finally, the latent z'_wm obtained from I'_wm and the masked key M_key are fed into the fusion layer F, composed of several convolutional layers, which computes an intermediate representation that is eventually fed into the discriminator model D to authenticate the presence of the private key. The fusion layer and the discriminator model can be represented as follows:

Ŷ = D(F(z'_wm, M_key))

Here, F symbolizes the fusion layers, D represents the discriminator model, and Ŷ is the conclusion about the presence of the private key within the image.

3.5. Warmup Strategy

We utilize the output of the diffusion model's UNet to differentiate between watermarked and non-watermarked images. This warm-up strategy is employed to initialize the watermark embedding encoder, mitigating the convergence challenges in the image space caused by introducing the VAE into the diffusion model:

pred = UNet(Embed(x_T, z_wm), timestep = 1)

3.6. Training Objectives

Watermark Robustness. Our approach formulates watermark detection as a binary classification task, using the Binary Cross-Entropy (BCE) loss to measure the discrepancy between the predicted values and the ground-truth labels of the input images:

L_bce(θ) = −(1/N) Σ_{n=1}^{N} [ y^(n) log ŷ_θ^(n) + (1 − y^(n)) log(1 − ŷ_θ^(n)) ]

where ŷ_θ^(n) denotes the model's prediction for the n-th input image, constrained to the range 0 to 1, and y^(n) represents the ground-truth label, taking a binary value of either 0 or 1.

Image Fidelity. As Stable Diffusion assumes that the initial noise x_T adheres to a Gaussian distribution, we impose a regularization term on the distribution of the watermarked initial latent to keep it close to a Gaussian distribution. We choose the Jensen-Shannon (JS) divergence [11] as the distribution metric and minimize the difference between the watermarked noise distribution f_θ#(p_key, p(x_T)) and the initial noise distribution p(x_T):

L_distr(θ) = JS(f_θ#(p_key, p(x_T)), p(x_T))

Total Loss. Finally, the overall training objective can be defined as:

L_total = λ1 L_bce(θ) + λ2 L_distr(θ)

where λ1 and λ2 are coefficients for balancing the different loss terms. In our experiments, λ1 is typically set to 1 and λ2 to 1e-5.
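As a concrete reading of these objectives, the sketch below combines the BCE detection term with a regularizer that keeps the watermarked latent close to a standard Gaussian. Note that the paper uses a JS divergence between noise distributions; here it is replaced by a closed-form Gaussian KL term as a differentiable stand-in, which is an assumption of this sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_kl_to_standard_normal(z: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, 1) ) of the empirical latent statistics.
    Used here as a differentiable stand-in for the JS distribution term L_distr."""
    mu = z.mean()
    var = z.var()
    return 0.5 * (var + mu ** 2 - 1.0 - var.log())

def pigw_loss(pred: torch.Tensor, label: torch.Tensor, z_wm: torch.Tensor,
              lambda_bce: float = 1.0, lambda_distr: float = 1e-5) -> torch.Tensor:
    """L_total = lambda1 * L_bce + lambda2 * L_distr, following Sec. 3.6."""
    l_bce = F.binary_cross_entropy(pred, label)      # watermark robustness term
    l_distr = gaussian_kl_to_standard_normal(z_wm)   # keeps z_wm close to N(0, I)
    return lambda_bce * l_bce + lambda_distr * l_distr

# usage with dummy tensors
pred = torch.sigmoid(torch.randn(8, 1))
label = torch.randint(0, 2, (8, 1)).float()
z_wm = torch.randn(8, 4, 64, 64) * 1.05 + 0.02
loss = pigw_loss(pred, label, z_wm)
```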
4. Experiments

The image size was uniformly set to 512×512, and the watermark was fixed at 30 bits (comprising 0s and 1s). For the text-to-audio and text-to-3D experiments (see Table 4), default parameters from Diffusers¹ were utilized. In the generative content detection experiments, a fixed watermark was used for training (randomly selected but fixed after selection).

¹ [Link]
"A white cat with a red collar is eating
"A man riding a motorcycle through the rain." "a giraffe is standing next to a zebra outside"
from a green bowl."

Figure 4. Visualization of images: without watermark and with watermark.

Table 1. Method comparison results. T@1%F represents TPR@1%FPR. We evaluate watermark accuracy in both clean and adversarial
settings. Adversarial here refers to average performance over a battery of image manipulations.

Model Method AUC/T@1%F (Clean) AUC/T@1%F (Adversarial) FID ↓ CLIP Score ↑


SD-v2 DwtDct [5] 0.974 / 0.624 0.574 / 0.092 25.10(-0.19) 0.362(-0.001)
CLIP Score=0.363 DwtDctSvd [5] 1.000 / 1.000 0.702 / 0.262 25.01(-0.28) 0.359(-0.004)
RivaGan [53] 0.999 / 0.999 0.854 / 0.448 24.51 (-0.78) 0.361(-0.002)
FID=25.29 Tree-Ring [51] 1.000 / 1.000 0.975 / 0.694 25.93(+0.64) 0.364(+0.001)
Our 1.000 / 1.000 0.999 / 0.988 24.76(-0.53) 0.365(+0.002)

4.1. Metric and Dataset evaluation.


To valuate the efficacy of watermarked images generated by AUC and TPR@1%FPR Generating 1000 water-
our encoder model and our decoder model’s ability to detect marked and 1000 clear images for each experimental
the watermark, we used the following metrics and dataset: method to calculate the Area Under the Curve (AUC)
FID Utilizing the Frechet Inception Distance (FID) met- and True Positive Rate at a 1% False Positive Rate
ric [14] to assess the similarity between the distributions of (TPR@1%FPR).
generated watermarked images and real images. For FID Dataset The MS-COCO-2017 [27] training dataset is
calculation, using 5000 generated watermarked images. used as the image dataset for computing the FID in the ex-
CLIP Score Employing the CLIP score [37] to assess the periments. Randomly selecting 10,000 captions from the
alignment between the text and watermarked images, mea- MS-COCO-2017 training dataset serves as input prompts
sured by OpenCLIP-ViT/G [4]. Using 1000 images during for the text-to-Image model when training the watermark-
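For reference, the sketch below shows how AUC and TPR@1%FPR can be computed from decoder scores with scikit-learn; the thresholding logic is standard, and the score arrays are placeholder data rather than results from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_tpr_at_fpr(scores_wm: np.ndarray, scores_clean: np.ndarray,
                       target_fpr: float = 0.01):
    """AUC and TPR at a fixed FPR for watermarked-vs-clean detection scores."""
    y_true = np.concatenate([np.ones_like(scores_wm), np.zeros_like(scores_clean)])
    y_score = np.concatenate([scores_wm, scores_clean])
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # TPR of the best operating point whose FPR does not exceed the target
    tpr_at = tpr[fpr <= target_fpr].max() if np.any(fpr <= target_fpr) else 0.0
    return auc, tpr_at

# usage: 1000 decoder scores for watermarked and clean images (placeholder data)
rng = np.random.default_rng(0)
scores_wm = rng.normal(2.0, 1.0, 1000)
scores_clean = rng.normal(0.0, 1.0, 1000)
auc, tpr1 = auc_and_tpr_at_fpr(scores_wm, scores_clean, 0.01)
print(f"AUC={auc:.3f}, TPR@1%FPR={tpr1:.3f}")
```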
Table 2. Results for different image generation models. T@1%F represents TPR@1%FPR. We evaluate watermark accuracy in both clean and adversarial settings. Adversarial here refers to average performance over a battery of image manipulations. PiGW-base represents the watermark model during training, PiGW* denotes direct transfer of PiGW-base for testing, and PiGW** indicates fine-tuning on top of PiGW-base. Other models labeled PiGW-{·} are adapted to the corresponding generation model.

Text-to-Image Model | Watermark Model | Clean AUC↑ | Clean TPR@1%FPR↑ | Adversarial AUC↑ | Adversarial TPR@1%FPR↑
SD V-1-4 [40] | PiGW-base | 1.000 | 1.000 | 1.000 | 1.000
SD V-1-5 [40] | PiGW* | 1.000 | 1.000 | 0.999 | 0.998
SD V-2-1 [40] | PiGW* | 1.000 | 1.000 | 0.999 | 0.987
SD LoRA [17] [repo] | PiGW* | 1.000 | 1.000 | 0.999 | 0.994
SD DreamBooth [41] [repo] | PiGW* | 1.000 | 1.000 | 0.996 | 0.991
SD super-resolution [repo] | PiGW** | 1.000 | 1.000 | 0.983 | 0.977
VAE [25] [repo] | PiGW-vae | 1.000 | 1.000 | 0.999 | 0.984
SD-XL [36] [repo] | PiGW-xl | 1.000 | 1.000 | 0.962 | 0.914
StyleGAN3 [24] [repo] | PiGW-gan | 1.000 | 1.000 | 0.971 | 0.959
DALLE-2-torch [39] [repo] | PiGW-dal | 0.977 | 0.942 | 0.910 | 0.836

4.2. Model Version

To illustrate the generality of our watermarking framework, we conducted experiments on various generators capable of harnessing latent features.

Stable diffusion. We validate the watermarking effects produced by our approach across a series of Stable Diffusion models, such as SD V-1-4, SD V-1-5, and SD V-2-1 [40].

Extensions of stable diffusion. We conducted experiments on the LoRA [17] plugin and the DreamBooth [41] fine-tuning based Stable Diffusion model. We also conducted corresponding experiments on the recently popular SDXL [36] model and an image super-resolution² model.

Previous generative models. We selected representative methods from previously classical theories of image generation, such as VAE [25], StyleGAN3 [24], and DALLE-2 [39]³, to validate the generality of our method.

Audio and 3D. We aspire to go beyond the boundaries of image generation and aim at cross-modal universality as our ultimate goal. We successfully implemented our approach in text-to-audio using AudioLDM-v2 [28] and in text-to-3D using Shap-e [22].

² https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler
³ [Link]

4.3. Noise Attack Setting

To evaluate the robustness of our watermark, we employ six common data augmentations as attacks against the watermark to test its robustness under extreme image processing scenarios. These attacks consist of 25% JPEG compression, 75% random cropping and scaling, color jitter with a brightness factor uniformly sampled from the range 0 to 6, Gaussian blur with an 8×8 filter size, and the most common Gaussian noise with σ = 0.1. For the image attack module, we configured two experiment settings: the clean setting, which involves no attacks on clean images, and the adversarial setting, which combines the attack methods to train a watermark encoder-decoder model capable of simultaneously resisting various types of noise attacks. Please refer to the supplementary materials for other experimental details.
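The following is a minimal sketch of the perturbations listed above as an image-attack pipeline; the parameter values mirror Sec. 4.3, while the function names, the PIL-based JPEG round-trip, and the odd-sized blur kernel are illustrative choices rather than the authors' exact implementation.

```python
import io
import random

import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int = 25) -> Image.Image:
    """JPEG round-trip through an in-memory buffer."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def crop_and_rescale(img: Image.Image, ratio: float = 0.75) -> Image.Image:
    """Random crop keeping `ratio` of each side, then rescale to the original size."""
    w, h = img.size
    cw, ch = int(w * ratio), int(h * ratio)
    left, top = random.randint(0, w - cw), random.randint(0, h - ch)
    return img.crop((left, top, left + cw, top + ch)).resize((w, h))

def attack(img: Image.Image, sigma: float = 0.1) -> torch.Tensor:
    """One possible composition of the perturbations from Sec. 4.3."""
    img = jpeg_compress(img, quality=25)
    img = crop_and_rescale(img, ratio=0.75)
    img = T.ColorJitter(brightness=(0.0, 6.0))(img)
    # torchvision requires an odd kernel, so 9 approximates the 8x8 blur described above
    img = T.GaussianBlur(kernel_size=9)(img)
    x = TF.to_tensor(img)                      # [0, 1] float tensor
    x = x + sigma * torch.randn_like(x)        # additive Gaussian noise
    return x.clamp(0.0, 1.0)

# usage on a 512x512 image
img = Image.new("RGB", (512, 512), color=(128, 128, 128))
attacked = attack(img)
print(attacked.shape)  # torch.Size([3, 512, 512])
```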
4.4. Watermark Robustness

We present an overall comparison of watermark robustness in Table 1, robustness results across different generative models in Table 2, and specific experiments demonstrating resistance to various types of noise in Table 3. The experimental results indicate the strong robustness of our method.

Table 3. AUC under each attack for different watermarking methods; our method achieves the best value in almost every case. Cr.&Sc. refers to random cropping and rescaling.

Method | Clean | JPEG | Cr.&Sc. | Gauss. Blurring | Gauss. Noise | Color Jitter | Average
DwtDct [5] | 0.974 | 0.492 | 0.640 | 0.503 | 0.293 | 0.519 | 0.574
DwtDctSvd [5] | 1.000 | 0.753 | 0.511 | 0.979 | 0.706 | 0.517 | 0.702
RivaGan [53] | 0.999 | 0.981 | 0.999 | 0.974 | 0.888 | 0.963 | 0.854
Tree-Ring [51] | 1.000 | 0.999 | 0.961 | 0.999 | 0.944 | 0.983 | 0.975
Ours | 1.000 | 1.000 | 0.980 | 1.000 | 1.000 | 1.000 | 0.996

Noise Intensity. In Figure 5, we conducted comparative experiments with the latest state-of-the-art method, Tree-Ring. Both our method and the comparative approach exhibit strong capabilities against Gaussian blur. However, the performance of Tree-Ring, a manually designed algorithm, diminishes significantly with increased Gaussian noise intensity and with higher levels of crop & scaling or color adjustments. In contrast, our learning-based method demonstrates better adaptability to varying noise intensities. Additionally, our method's TPR@1%FPR notably outperforms the comparative approach. This also indicates the significant advantage of our method when noise parameters fluctuate, aligning well with the variability of attack noise in practical applications of watermarking methods.

Figure 5. Watermark detectability of our model under four image perturbation techniques: Gaussian blurring, color jitter, cropping and scaling, and Gaussian noise. Across all perturbation types, our model, trained using a mini-batch strategy [20], consistently outperforms the existing Tree-Ring algorithm in both AUC and TPR@1%FPR.

Compression. Additionally, in Figure 6, we conducted tests using multiple compression algorithms and found that even at extremely low compression rates (e.g., Q=5 in JPEG), the watermark exhibits remarkable robustness. Unlike post-processing methods that modify pixel values to embed watermarks, our generative approach correlates highly with image content.

Figure 6. Robustness experiments of the watermark against compression algorithms. The parameters used for each compression method are as follows: 5 for JPEG, 5 for JPEG2000, 5 for WebP, and 0.5 for MBT2018.

4.5. Image Quality

Evaluating Factor. We provide the changes in FID values of generative models after watermark embedding in Table 1. By comparing the FID and CLIP score values of generated images before and after watermark embedding, we observed that embedding a watermark of appropriate strength in the low-frequency region of the generative model's initial noise x_T minimally affects the image quality. This also suggests that the generative model is less sensitive to changes in the low-frequency region of the initial noise x_T, indicating a closer relationship between image diversity and quality and the high-frequency information of x_T.

Noise Distribution. Additionally, in Figure 7, we illustrate the distribution changes of the initial noise x_T before and after watermark embedding. The experimental results demonstrate that after embedding the watermark, x_T still adheres to a standard Gaussian distribution, hence minimally affecting the quality of the generated images. Furthermore, in Figure 4, we visualize the images before and after watermark embedding, with more visualizations available in the supplementary materials.

Figure 7. Histograms comparing the statistical distribution of samples with and without watermarking. The blue histograms at the top represent the data distribution of a single sample, while the black histograms at the bottom represent the aggregate data distribution of 128 samples. The histograms on the left depict the distribution without watermarking, whereas those on the right include watermarking.

4.6. Ablation Study

Multimodal. As depicted in Table 4, owing to the plugin-based nature of our method, which embeds watermarks solely in the initial noise x_T, we have successfully extended the watermark framework to text-to-audio and text-to-3D tasks. The experimental results indicate significant potential for our method in watermark applications across other modalities.

Table 4. Audio and 3D content watermarking. We validated our method on text-to-audio and text-to-3D models based on diffusion models, demonstrating significant potential in both domains.

Model | Task | AUC | TPR@1%FPR
AudioLDM [repo] | text-to-audio | 0.980 | 0.859
Shap-e [repo] | text-to-3D | 0.924 | 0.882

Generative Content Detection. As shown in Table 5, we applied our watermarking framework to generative content detection. In the experiments, we embedded a fixed watermark to serve as an identifier for generative content. During generative image detection, we could easily distinguish the authenticity of images by assessing the match between the public and private keys. Unlike the other methods compared in the experiments, we embed a distinctive marker in the generated images, yet this marker minimally impacts image quality.

Table 5. Generative image detection. We present ACC/AP (%) in the table.

Method | LDM | SD-v1 | SD-v2
Patchfor [3] | 97.3/100 | 63.2/71.2 | 97.2/100
F3Net [3] | 92.5/97.8 | 86.1/95.3 | 83.6/91.7
DIRE [50] | 100/100 | 99.7/100 | 100/100
Ours | 100/100 | 100/100 | 100/100
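Operationally, the key-matching detection just described reduces to thresholding the authentication-module score, as in the small sketch below; the decoder callable, the key embedding, and the threshold value are placeholders, and the threshold would in practice be calibrated on a held-out set (e.g., at 1% FPR).

```python
import torch

@torch.no_grad()
def is_generated(image: torch.Tensor, decoder, key_embedding: torch.Tensor,
                 threshold: float = 0.5) -> bool:
    """Flag an image as AI-generated if the private-key decoder confirms the fixed watermark."""
    score = decoder(image.unsqueeze(0), key_embedding).sigmoid().item()
    return score > threshold

# usage with a dummy decoder standing in for the trained authentication module
dummy_decoder = lambda img, key: torch.randn(1)
verdict = is_generated(torch.rand(3, 512, 512), dummy_decoder, torch.rand(1, 30))
print("generated" if verdict else "real")
```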
and the diffusion model’s Timestep. The first row represents re-
sults when the model is trained only at timestep=1. The second
Table 6. Adaptability of Watermark Frameworks and Sampling
row illustrates the final model performance after employing an in-
Schemes.
creasing timestep strategy.

Diffusion Sampling Schemes


Method
DDIM LMSD PNDM Diffusion model Sampler In Table 6, we compared our
Tree-Ring [51] ✓ × × method with Tree-Ring in terms of their support for differ-
Our ✓ ✓ ✓ ent sampling techniques. Tree-Ring’s watermark decoder
heavily relies on DDIM inversion, necessitating specific de-
coder adaptations for different sampling strategies, which
trated the distribution changes of the initial noise, xT , be- also limits its applicability.
fore and after watermark embedding. The experimental re- Diffusion model Timestep As depicted in Figure 8,
sults demonstrate that after embedding the watermark, xT training the model solely at timestep=1 leads to watermark
still adheres to a standard Gaussian distribution, hence min- information loss as the timestep increases. However, em-
imally affecting the quality of the generated images. Fur- ploying the increasing timestep strategy allows the model
thermore, in Figure 4, we visualized the images before and to gradually learn the watermark embedding and extraction
after watermark embedding, with more visualizations avail- methods across multiple denoising UNet modules, start-
able in the supplementary materials. ing from easier to more complex, as shown in the sec-
ond row of Figure 8. Once the watermark embedding
4.6. Ablation Study model converges on a denoising network with 50 timesteps,
Multimodal As depicted in Table 4, owing to the plugin- it exhibits backward compatibility, maintaining robustness
based nature of our method—embedding watermarks solely across other timesteps.
in the initial noise, xT —we’ve successfully extended the
watermark framework to text-to-audio and text-to-3D tasks. 5. Conclusion
Experimental results indicate significant potential for our
We introduced a plugin-based watermarking framework
method in watermark applications across other modalities.
that seamlessly integrates into existing generative models.
Generative Content Detection As shown in Table 5, we
By injecting watermarks into the initial noise of generative
attempted our watermarking framework for generative con-
models, the watermark becomes part of the image genera-
tent detection. In the experiments, we embedded a fixed wa-
tion process, achieving true invisibility and high robustness.
termark to serve as an identifier for generative content. Dur-
Experiments conducted across various modalities and ap-
ing generative image detection, we could easily distinguish
plication scenarios demonstrate the method’s flexibility and
the authenticity of images by assessing the match between
potential.
public and private keys. Unlike other methods compared in
the experiments, we embedded a distinctive marker in the
generated images, yet this marker minimally impacted the
quality of the images.
References

[1] Reem A Alotaibi and Lamiaa A Elrefaei. Text-image watermarking based on integer wavelet transform (IWT) and discrete cosine transform (DCT). Applied Computing and Informatics, 15(2):191-202, 2019.
[2] Tu Bui, Shruti Agarwal, Ning Yu, and John Collomosse. RoSteALS: Robust steganography using autoencoder latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 933-942, 2023.
[3] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? Understanding properties that generalize. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXVI, pages 103-120. Springer, 2020.
[4] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818-2829, 2023.
[5] Ingemar Cox, Matthew Miller, Jeffrey Bloom, Jessica Fridrich, and Ton Kalker. Digital Watermarking and Steganography. Morgan Kaufmann, 2007.
[6] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53-65, 2018.
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.
[8] Han Fang, Yupeng Qiu, Kejiang Chen, Jiyi Zhang, Weiming Zhang, and Ee-Chien Chang. Flow-based robust watermarking with invertible noise layer for black-box distortions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5054-5061, 2023.
[9] Jianwei Fei, Zhihua Xia, Benedetta Tondi, and Mauro Barni. Supervised GAN watermarking for intellectual property protection. In 2022 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1-6. IEEE, 2022.
[10] Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, and Teddy Furon. The stable signature: Rooting watermarks in latent diffusion models. arXiv preprint arXiv:2303.15435, 2023.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139-144, 2020.
[13] Mohamed Hamidi, Mohamed El Haziti, Hocine Cherifi, and Mohammed El Hassouni. Hybrid blind robust image watermarking technique based on DFT-DCT and Arnold transform. Multimedia Tools and Applications, 77:27181-27214, 2018.
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[16] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249-2281, 2022.
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[18] Hai-tao Hu, Ya-dong Zhang, Chao Shao, and Quan Ju. Orthogonal moments based on exponent functions: Exponent-Fourier moments. Pattern Recognition, 47(8):2596-2606, 2014.
[19] Ming-Kuei Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 8(2):179-187, 1962.
[20] Zhaoyang Jia, Han Fang, and Weiming Zhang. MBRS: Enhancing robustness of DNN-based watermarking by mini-batch of real and simulated JPEG compression. In Proceedings of the 29th ACM International Conference on Multimedia, pages 41-49, 2021.
[21] Zhaoyang Jia, Han Fang, and Weiming Zhang. MBRS: Enhancing robustness of DNN-based watermarking by mini-batch of real and simulated JPEG compression. In Proceedings of the 29th ACM International Conference on Multimedia, pages 41-49, 2021.
[22] Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.
[23] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124-10134, 2023.
[24] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019.
[25] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[26] Dongdong Lin, Benedetta Tondi, Bin Li, and Mauro Barni. CycleGANWM: A CycleGAN watermarking method for ownership verification. arXiv preprint arXiv:2211.13737, 2022.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740-755. Springer, 2014.
[28] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023.
[29] Yang Liu, Mengxi Guo, Jian Zhang, Yuesheng Zhu, and Xiaodong Xie. A novel two-stage separable deep learning framework for practical blind watermarking. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1509-1517, 2019.
[30] Yugeng Liu, Zheng Li, Michael Backes, Yun Shen, and Yang Zhang. Watermarking diffusion model. arXiv preprint arXiv:2305.12502, 2023.
[31] Rui Ma, Mengxi Guo, Yi Hou, Fan Yang, Yuan Li, Huizhu Jia, and Xiaodong Xie. Towards blind watermarking: Combining invertible and non-invertible mechanisms. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1532-1542, 2022.
[32] Yihan Ma, Zhengyu Zhao, Xinlei He, Zheng Li, Michael Backes, and Yang Zhang. Generative watermarking against unauthorized subject-driven image synthesis. arXiv preprint arXiv:2306.07754, 2023.
[33] Rajesh Mehta, Navin Rajpal, and Virendra P Vishwakarma. LWT-QR decomposition based robust and efficient image watermarking scheme using Lagrangian SVR. Multimedia Tools and Applications, 75:4129-4150, 2016.
[34] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[35] Sen Peng, Yufei Chen, Cong Wang, and Xiaohua Jia. Protecting the intellectual property of diffusion models by the watermark diffusion process. arXiv preprint arXiv:2306.03436, 2023.
[36] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.
[38] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821-8831. PMLR, 2021.
[39] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[40] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022.
[41] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500-22510, 2023.
[42] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1-10, 2022.
[43] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479-36494, 2022.
[44] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713-4726, 2022.
[45] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265. PMLR, 2015.
[46] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[47] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438-12448, 2020.
[48] Abdallah Soualmi, Adel Alti, and Lamri Laouamer. Schur and DCT decomposition based medical images watermarking. In 2018 Sixth International Conference on Enterprise Systems (ES), pages 204-210. IEEE, 2018.
[49] Qingtang Su, Yugang Niu, Hailin Zou, Yongsheng Zhao, and Tao Yao. A blind double color image watermarking algorithm based on QR decomposition. Multimedia Tools and Applications, 72:987-1009, 2014.
[50] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. DIRE for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023.
[51] Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein. Tree-ring watermarks: Fingerprints for diffusion images that are invisible and robust. arXiv preprint arXiv:2305.20030, 2023.
[52] Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16293-16303, 2022.
[53] Kevin Alex Zhang, Lei Xu, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Robust invisible video watermarking with attention. arXiv preprint arXiv:1909.01285, 2019.
[54] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. HiDDeN: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 657-672, 2018.
Appendix

Contents

1. Experimental Settings
2. Algorithm Framework
3. Fundamental Methods
   3.1. Generative models
        3.1.1 Diffusion models
        3.1.2 Generative adversarial networks
        3.1.3 Text-to-audio
        3.1.4 Text-to-3D
   3.2. FFT and IFFT
4. Comparison and Analysis of Experiments
   4.1. Frequency Masks Comparisons
   4.2. Robustness Comparison between Spatial and Frequency Domains
5. Visualization of Watermarking Images
   5.1. Conditional Generation Models and Unconditional Generation Models
   5.2. Comparison of Audio Spectrograms with Watermarks
   5.3. Comparison of Watermarked 3D Images
6. Ablation Study of Loss Functions
7. Complexity and Time Overhead
8. Supplementary materials folder
   8.1. Sample Folder
   8.2. PiGW Demo Video

1. Experimental Settings

The experiments utilized the COCO2017 dataset [27]. During the training phase, 10,000 captions from the training set were employed as input conditions for the generative model. During the testing phase, 1,000 captions from the test set were used as the corresponding condition inputs. Training was carried out on a single 80GB A100. The warm-up phase comprised a single epoch of training. During the training phase of the algorithm framework, 20 epochs were conducted without noise; in the presence of noise, the number of training epochs was appropriately increased based on the noise intensity. The Adam optimizer was employed, with an initial learning rate of 1e-4, gradually decaying with the number of epochs down to 1e-6.

2. Algorithm Framework

Algorithm 1: Algorithm Framework of PiGW.
Input:
  x_T ∼ N(0, I)
  C: the conditional prompts
  key: the watermark's generation key, a predefined-length sequence of randomly generated binary digits
Output:
  Label: classification labels indicating whether input images contain watermarks or not
Stage 1: Warmup
  1: P_key = M_adp ⊗ Proj(key)
  2: F_Noise = Concat(A(F(x_T)), φ(F(x_T)))
  3: z_wm = F^{-1}(AP2C(s · P_key + F_Noise))
     (We denote the transition from x_T to z_wm as z_wm = Encoder_wm(x_T; θ1).)
  4: pred = UNet(Concat(x_T, z_wm), C, timestep = 1)
  5: Optim: arg min_{θ1} ‖pred, label‖ + ‖z_wm, x_T‖
Stage 2: Training
  1: for timestep = 1, ..., 50 do
  2:   z_wm = Encoder_wm(x_T; θ1)
  3:   (Image, Image_wm) = Generator(x_T, z_wm, C)
  4:   pred = Decoder_wm(Image, Image_wm, P_key; θ2)
  5:   Optim: arg min_{θ1,θ2} ‖pred, label‖ + ‖z_wm, x_T‖
  6: end for

Algorithm 1 delineates the comprehensive training procedure, divided into two phases: warmup and training. Here, we primarily introduce the processing steps conducted on Stable Diffusion. In the warmup stage, we opt for the U-Net's output to train the encoder, which integrates watermark information into x_T. The warm-up phase aims to mitigate potential convergence challenges when training both the encoder and decoder simultaneously, especially as the introduction of U-Net modules for denoising may significantly impact the robustness of the watermark, considering that the watermark itself is random information. The training stage is dedicated to training the decoder and refining the encoder, leveraging the VAE output to distinguish between watermarked and unmarked images.
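Read as code, the two-stage procedure of Algorithm 1 roughly corresponds to the schematic PyTorch loop below. The encoder, UNet, generator, and decoder objects are placeholders for the modules described above, and the loss call reuses the combined objective from Sec. 3.6; none of the identifiers come from the released implementation.

```python
import torch

def train_pigw(encoder_wm, unet, generator, decoder_wm, pigw_loss,
               data_loader, opt_stage1, opt_stage2, max_timesteps: int = 50):
    """Schematic two-stage training loop following Algorithm 1 (warmup, then training)."""
    # Stage 1: warmup — only the embedding encoder is optimized, using the
    # UNet output at timestep=1 as a cheap watermark/no-watermark signal.
    for prompts, labels in data_loader:
        x_T = torch.randn(labels.shape[0], 4, 64, 64)
        z_wm = encoder_wm(x_T)                                   # Encoder_wm(x_T; theta1)
        pred = unet(torch.cat([x_T, z_wm], dim=1), prompts, timestep=1)
        loss = pigw_loss(pred, labels, z_wm)
        opt_stage1.zero_grad(); loss.backward(); opt_stage1.step()

    # Stage 2: training — timesteps grow progressively from 1 to 50 so the
    # decoder learns under increasingly long denoising chains.
    for timestep in range(1, max_timesteps + 1):
        for prompts, labels in data_loader:
            x_T = torch.randn(labels.shape[0], 4, 64, 64)
            z_wm = encoder_wm(x_T)
            image, image_wm = generator(x_T, z_wm, prompts, num_steps=timestep)
            pred = decoder_wm(image, image_wm)
            loss = pigw_loss(pred, labels, z_wm)
            opt_stage2.zero_grad(); loss.backward(); opt_stage2.step()
```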
Here, Proj(key) projects the watermark's generation key to the high-dimensional watermark embedding used to imprint the watermark information onto the initial Gaussian noise in the Fourier space. M_adp is an adaptive mask which can selectively choose the appropriate frequency components to achieve the optimal watermarking effect. AP2C denotes the conversion from the phase-amplitude representation to the complex-number form. s is a strength factor controlling the intensity of the watermark information added to x_T.

3. Fundamental Methods

Here we give a brief introduction to diffusion models, DDIM sampling [7, 15, 47], and GANs.

3.1. Generative models

3.1.1 Diffusion models.

Diffusion models [45, 46] are a family of generative models that produce samples from a learned data distribution through an iterative denoising process. A diffusion model consists of a forward diffusion process, where a datum (typically an image) is progressively noised, and a reverse diffusion process, in which noise is transformed back into a sample from the target distribution. The forward diffusion process consists of T noising steps that create a fixed forward Markov chain x_1, ..., x_T [15]. Given the data distribution x_0 ∼ q(x), the Markov transition q(x_t | x_{t−1}) can be defined as a Gaussian distribution, specifically q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I) for t ∈ {1, ..., T}, where β_t ∈ (0, 1) is the scheduled variance at step t. The closed form for this transition can be derived by Bayes' rule and the Markov property. Specifically, one can express the conditional probabilities q(x_t | x_0) and q(x_{t−1} | x_t, x_0) as:

q(x_t | x_0) = N(x_t; √(ᾱ_t) x_0, (1 − ᾱ_t) I), t = 1, ..., T,   (1)

q(x_{t−1} | x_t, x_0) = N(x_{t−1}; μ̃_t(x_t, x_0), β̃_t I), t = 1, ..., T,   (2)

with α_t = 1 − β_t, ᾱ_t = ∏_{s=1}^{t} α_s, β̃_t = β_t (1 − ᾱ_{t−1}) / (1 − ᾱ_t),   (3)

μ̃_t(x_t, x_0) = [√(ᾱ_{t−1}) β_t / (1 − ᾱ_t)] x_0 + [√(α_t) (1 − ᾱ_{t−1}) / (1 − ᾱ_t)] x_t.   (4)

For the reverse diffusion process, DDPMs generate the Markov chain x_1, ..., x_T with a prior distribution p(x_T) = N(x_T; 0, I) and Gaussian transitions

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t)), t = T, ..., 1.

The learnable parameters θ are trained to guarantee that the generated reverse process is close to the forward process.

DDIM sampling. Deterministic DDIM sampling [47] can be used to accelerate the reverse diffusion process, mapping from a latent vector x_T ∼ N(0, I) to an image x_0 ∈ q(x) in a small number of denoising steps:

x_{t−1} = √(ᾱ_{t−1}/ᾱ_t) x_t + √(ᾱ_{t−1}) (√(1/ᾱ_{t−1} − 1) − √(1/ᾱ_t − 1)) · ε_θ(x_t, t, C),   (5)

where C = ψ(P) denotes the conditional embeddings, such as class labels, text, or low-resolution images [7, 16, 34, 39, 42, 44, 52]. For each denoising step, a learned noise predictor estimates the noise ε_θ(x_t, t, C) that was added to x_0 to obtain x_t at step t. From the forward process, we can derive the estimation of x_0 as:

x̂_0^t = (x_t − √(1 − ᾱ_t) ε_θ(x_t, t, C)) / √(ᾱ_t).   (6)

Then, using the estimate x̂_0^t, we find x_{t−1}:

x_{t−1} = √(ᾱ_{t−1}) x̂_0^t + √(1 − ᾱ_{t−1}) ε_θ(x_t, t, C).   (7)

The recursive denoising process from x_T to x_0 can be denoted as x_0 = D_θ(x_T).
Markov chain x1 , ..., xT with a prior distribution p(xT ) = zero-shot style transfer and audio inpainting.
3.1.2 Generative adversarial networks.

GANs [6, 12, 23] comprise two key components: a generator and a discriminator. The generator is responsible for synthesizing images from random noise inputs. During training, the generator adjusts its noise inputs based on input text conditions to produce visually consistent images that correspond to the provided semantic context. In this adversarial process, the discriminator competes with the generator to distinguish between generated and real images, thus guiding the generator's improvement in image generation.

3.1.3 Text-to-audio

AudioLDM [28] is proposed for text-to-audio generation by integrating contrastive language-audio pretraining and latent diffusion models. AudioLDM has three main advantages: high audio quality, high efficiency, and enabling text-guided audio manipulations such as style transfer and super-resolution without fine-tuning. Additionally, it can conduct zero-shot style transfer and audio inpainting.

3.1.4 Text-to-3D

Jun and Nichol [22] introduce Shap-e, a text-to-3D method which is a conditional generative model designed for intricate 3D implicit representations. This two-stage approach initially employs a transformer-based encoder to generate INR parameters for 3D assets. Following this, a diffusion model is trained on the encoder's outputs. Shap-e has the capability to directly generate the parameters of implicit functions, thus enabling the rendering of both textured meshes and neural radiance fields. Trained on a large dataset of paired 3D and text data, Shap-e can quickly generate complex and varied 3D assets in just seconds.

3.2. FFT and IFFT.

Diffusion models convert an array of Gaussian noise into a real image. Our method imprints watermarks onto the initial Gaussian noise in the Fourier space to obtain combined signals. The combined signals, after going through the IFFT, are fed into the generator G to generate the watermarked images. To streamline the depiction, we employ the symbols F and F^{-1} to represent these two procedures within Algorithm 1.

4. Comparison and Analysis of Experiments

4.1. Frequency Masks Comparisons.

Table 7 presents the robustness effects of different watermark mask patterns. The experimental results demonstrate that, under identical settings, the use of an adaptive mask leads to enhanced watermark robustness, offering more flexibility for optimization. As illustrated in Figure 9, four distinct watermark mask forms are depicted. It is evident that the linear mask and the learning-based mask exhibit the closest similarity. Hence, in our experiments, we employed the linear mask as the initial state for the adaptive mask.

4.2. Robustness Comparison between Spatial and Frequency Domains.

The experimental results in Table 8 provide a comparison between watermark embedding in the spatial domain and in the frequency domain. The experiments indicate that directly embedding the watermark into z in the spatial domain significantly affects the generated image quality (as indicated by the increase in FID). By embedding the watermark in the frequency domain of z, combined with an adaptive mask, we largely preserve the high-frequency components of z, which are crucial for the quality of the generated images.

5. Visualization of Watermarking Images

5.1. Conditional Generation Models and Unconditional Generation Models

The experimental results after applying the PiGW method are depicted in Figure 10, showcasing the outcomes of both conditional and unconditional generative models.

5.2. Comparison of Audio Spectrograms with Watermarks

The spectrogram comparison of the audio after applying the PiGW method in the AudioLDM model is illustrated in Figure 12.

5.3. Comparison of Watermarked 3D Images

The transformation of generated samples in the text-to-3D model Shap-e after applying the PiGW method is depicted in Figure 13.

6. Ablation Study of Loss Functions

The ablation study results for the watermark robustness loss and the distribution loss of the Gaussian noise are presented in Table 9. It can be observed that L_bce significantly enhances the robustness of the watermark, while L_distr is crucial for the quality of the generated images.

7. Complexity and Time Overhead

The evaluation results for the complexity and overhead of the PiGW method are presented in Table 10. The experimental results demonstrate that our method has minimal impact on the parameters, FLOPs, and time overhead of the original generative model, which is highly favorable for existing generative models and the text-to-image ecosystem.

8. Supplementary materials folder

8.1. Sample Folder

Partial samples of the experimental results for text-to-3D and text-to-audio are provided.

8.2. PiGW Demo Video

A demo video is provided to assist in introducing this work.
Table 7. Watermark mask patterns and watermark robustness. Experimental setup: w/o attack, 10 training epochs, Stable Diffusion v2-1, timestep = 25.

Watermark Mask Pattern | AUC ↑ | TPR@1%FPR ↑
Gaussian Mask | 0.949 | 0.930
Ladder Mask | 0.946 | 0.901
Linear Mask | 0.942 | 0.904
Adaptive Mask (learning based) | 1.000 | 1.000

Figure 9. Visualization of the watermark mask patterns: Linear Mask, Gaussian Mask, Ladder Mask, and Adaptive Mask.
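Since the adaptive mask is initialized from a linear low-frequency gradient (Sec. 4.1 above and Figure 3 in the main text), a minimal sketch of such an initialization is shown below; the exact gradient shape and normalization are assumptions of this sketch.

```python
import torch

def linear_lowfreq_mask(h: int, w: int) -> torch.Tensor:
    """Linear gradient mask over the centered (fftshifted) spectrum.

    Values are 1 at the spectrum center (low frequencies) and fall off
    linearly toward 0 at the corners (high frequencies), so the watermark
    is concentrated in the low-frequency region.
    """
    ys = torch.linspace(-1.0, 1.0, h).view(h, 1).expand(h, w)
    xs = torch.linspace(-1.0, 1.0, w).view(1, w).expand(h, w)
    radius = torch.sqrt(xs ** 2 + ys ** 2) / (2.0 ** 0.5)  # 0 at center, 1 at corners
    return (1.0 - radius).clamp(0.0, 1.0)

# usage: initialize the learnable adaptive mask from the linear gradient
mask_init = linear_lowfreq_mask(64, 64)
adaptive_mask = torch.nn.Parameter(mask_init.clone())
print(adaptive_mask.shape, float(adaptive_mask.max()), float(adaptive_mask.min()))
```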

Table 8. Comparison of embedding watermarks in the spatial and frequency domains. Experiments demonstrate that directly adding the watermark (z_wm) in the spatial domain of z (the initial Gaussian noise) significantly affects the image quality of generative models. Experimental setup: w/o attack, 10 training epochs, Stable Diffusion v2-1, timestep = 25, FID computed on 1000 pictures from COCO2017-val.

Watermark Embedding Method | AUC ↑ | FID ↓
Spatial Domain (w/o mask) | 1.000 | 73.21
Frequency Domain (with adaptive mask) | 1.000 | 31.57
Table 9. Ablation study on the loss functions. Separate ablation experiments were conducted on L_bce, which constrains watermark robustness, and L_distr, which constrains the quality of the generated images. This experiment was conducted without a noise attack.

L_bce | L_distr | AUC ↑ | FID ↓
× | × | 0.4391 | 53.0
✓ | × | 1.000 | 76.2
× | ✓ | 0.4822 | 24.63
✓ | ✓ | 1.000 | 24.76

Table 10. Complexity and time overhead of the PiGW watermark plugin. For Encoder-w/o-wm, a randomly generated Gaussian vector is directly used as the initial noise input to the UNet. For Encoder-with-wm, the watermark message is first projected into an embedding, which is then amalgamated with the initial noise. Denoising is then executed with the DDIM scheduler. The inference time is based on Stable Diffusion v2.1 with 50 inference steps. The results are averaged over one hundred runs with a batch size of 1 on a single A100 80G GPU.

Module | FLOPs (M) | Parameters (K) | Inference Time (s)
Encoder-w/o-wm (z) | 0 | 0 | 3.099
Encoder-with-wm (z → z_wm) | 1.33 | 8.9 | 3.110 (+0.355%)
Figure 10. Visualization of watermarked images. The upper images depict the experimental outcomes using Stable Diffusion v2-1, presenting images with and without watermarks (using the same prompt); this experiment relies on a conditional generation model. The lower images showcase results from StyleGAN3, which operates as an unconditional generation model.
Figure 11. Visualization of watermarked images.

Figure 12. Audio spectrogram comparison using AudioLDM [28]. The top figure shows the spectrogram of the audio without any watermark, while the bottom figure shows the spectrogram of the audio after the addition of the watermark.

Figure 13. Comparison of watermarked 3D images at two rendering times. This experiment is based on the open-source model Shap-e [repo].
