PiGW: A Plug-In Generative Watermarking Framework
Rui Ma1, Mengxi Guo2, Yuming Li1, Hengyuan Zhang1, Cong Ma3, Yuan Li1, Xiaodong Xie1, and Shanghang Zhang1
1 Peking University   2 Bytedance Inc.   3 SenseAuto Research
... intelligence security. This paper proposes Plug-in Generative Watermarking (PiGW) as a general framework for integrating watermarks into generative images. PiGW embeds watermark information into the initial noise with an adaptive frequency spectrum mask using a learnable watermark embedding network. Furthermore, we introduce a warmup strategy by gradually increasing timesteps to optimize training stability. Extensive experiments demonstrate that PiGW ...

[Figure 1: post-processing watermarking vs. plug-in generative watermarking; panel labels include Image, Attacked Image, Watermark, Post-processing, Transfer and Edit, Gaussian noise, Plug-in, and Audio.]
1. Introduction
The latest developments in generative models have made it easier for people to create images that are increasingly indistinguishable from their natural counterparts. In particular, text-to-image generative models such as DALL·E 2 [39] and Stable Diffusion [36, 40] have become prevalent creative tools among both professionals and non-professionals. With the exponential growth of Artificial Intelligence Generated Content (AIGC) expected in the future, there arises an urgent need for robust copyright protection tools.

Watermarking is commonly used for protecting the copyright of images. Most watermarking works [21, 29, 31, 54] aim to make minimal post-hoc modifications to pre-existing images. However, such methods often struggle to balance invisibility and robustness under strong noise attacks. Given the growing use of text-to-image generative models for creating images, generative watermarking has been proposed [51]. As shown in Figure 1, generative watermarks differ from traditional post-hoc watermarks in that they can be embedded during image generation, so the watermark information becomes an intrinsic part of the image and achieves true invisibility. Recent advances in generative watermarking have shown progress in two directions. The first fine-tunes the generator so that it outputs watermarked images directly [9, 10, 32, 35], while the second incorporates a watermark embedding module as a plugin into existing generative models [51]. Although the fine-tuning approach is effective, it can incur significant costs, making the plugin approach more user-friendly and practical. Tree-Ring watermarking [51] is a seminal plugin generative watermarking method. It imprints a manually designed watermark onto the initial noise vector during image generation. However, the handcrafted design of the watermark limits its robustness against specific types of noise, such as cropping and Gaussian noise. Furthermore, the method is not easily transferable because it is tailored to the diffusion model.

In this paper, we present Plug-in Generative Watermarking (PiGW) as a framework for watermarking generative images. Specifically, PiGW uses a private-key encoder to imprint watermarks onto an initial noise vector, making the watermark an integral part of the generated image rather than a post-hoc modification. For generative models that require multiple operations, we propose two strategies: an adaptive frequency spectrum mask that prioritizes retaining more watermark information in less vulnerable areas, and a warm-up training strategy that stabilizes the training of the watermark module by gradually increasing the timesteps of the generative model. As an easy-to-use plugin, PiGW can seamlessly integrate watermarks into generated images with true invisibility and high resistance to noise attacks. Moreover, to the best of our knowledge, PiGW is the first general watermarking framework that can be applied to the most commonly used generative structures, including diffusion models [36, 39, 40], GANs [11], and VAEs [25], as well as multimodal generative content types such as audio [28] and 3D models [22]. Finally, since PiGW can embed information into images while preserving generation quality, it can also be applied to the generated-image detection task, promoting the secure development of artificial intelligence.

Our contributions can be summarized as follows:
(1) We propose Plug-in Generative Watermarking as a unified watermarking framework for image generation. It learns to embed watermarks into images during the generation process and achieves true invisibility and strong robustness against various types of noise.
(2) We extend our framework to accommodate more general generative scenarios, demonstrating its scalability. It is compatible with the most commonly used image-generative models and extensible to other multimodal generation tasks.
(3) Our framework can be applied to detecting generated images, which can be beneficial in promoting the secure development of generative model techniques.

2. Related work

2.1. Post-hoc watermarking

Watermarking in digital content, particularly images, is important for copyright protection. Traditionally, methods including algorithms based on singular value decomposition [33, 48, 49], moment-based watermarking algorithms [18, 19], and transform-domain watermarking algorithms [1, 13] were commonly used. With the development of deep learning, the first CNN-based watermarking work, HiDDeN [54], was introduced. HiDDeN uses an end-to-end approach to train the watermark encoder and decoder simultaneously. Subsequently, many deep learning watermarking methods have been proposed, including two-stage algorithms [29], compression-resistant algorithms [21], and invertible watermarking algorithms [8, 31]. These post-hoc watermarking methods can resist most noise attacks, but they remain ineffective against strong noise attacks and inevitably require modifications to the image.

2.2. Generative watermarking

With the advancement of generative models, many photorealistic generative models such as VAE [25], GAN [11], diffusion models [7, 15], DALL-E [38], and Imagen [43] are widely used. The increasing prevalence of generated images poses new challenges for copyright protection. Two primary watermarking approaches prevail: fine-tuned-generator methods and plug-in methods. Several watermarking methods using fine-tuned generators have been proposed, including the supervised GAN approach [9], the CycleGAN technique [26], the watermark diffusion process [35], the pre-set prompt approach [30], co-development of a watermark generator and detector [32], and integration of undetectable watermarks at the latent-decoder level of Latent Diffusion Models [10]. The drawback of these methods is evident: fine-tuning generators is time-intensive and not easily transferable. RoSteALS [2] and Tree-Ring watermarking [51] are two existing works in plugin generative watermarking. RoSteALS employs pre-trained autoencoders for steganography by embedding secret information in image latent codes, which can negatively impact image quality. Tree-Ring watermarking, on the other hand, creates artificially designed watermarks on noise vectors during image generation to achieve true invisibility, but it faces limitations in resilience against particular types of noise. Moreover, it has low transferability due to the manual watermark design and its dependency on specific models.

3. Method

As shown in Figure 2, the PiGW framework mainly consists of four modules. The Embedding module embeds the watermark into the initial noise x_T. The Generation module is responsible for generating the corresponding image. The Attack module simulates potential tampering attacks on the watermarked image. The Authentication module determines whether the public key (the image) matches the private key.
Figure 2. Framework diagram of PiGW: The embedding module combines the masked watermark with the initial noise z in the frequency domain to yield z_wm. The generation module either denoises z_wm or decodes it to obtain the watermarked image I_wm. The attack module simulates edits and tampering on the image, resulting in I'_wm. The authentication module derives the final outcome by assessing the pairing of the public and private keys.
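To make the interaction of the four modules concrete, the following is a minimal PyTorch-style sketch of one pass through the framework. The callables (embedder, generator, attack, authenticator) and the latent shape are hypothetical placeholders, not the released implementation.

```python
import torch

def pigw_pipeline(key, embedder, generator, attack, authenticator):
    """One pass through the four PiGW modules (hypothetical interfaces).

    key:           (B, l) binary private key
    embedder:      Embedding module: imprints the encoded key onto initial noise x_T
    generator:     frozen generative model (diffusion sampler / GAN / VAE decoder)
    attack:        Attack module: differentiable image manipulations (JPEG, crop, noise, ...)
    authenticator: Authentication module: predicts whether image and private key match
    """
    x_T = torch.randn(key.shape[0], 4, 64, 64)   # initial Gaussian noise (SD-like latent shape assumed)
    z_wm = embedder(key, x_T)                     # 1) Embedding
    img_wm = generator(z_wm)                      # 2) Generation
    img_attacked = attack(img_wm)                 # 3) Attack (simulated tampering)
    logits = authenticator(img_attacked, key)     # 4) Authentication (public key vs. private key)
    return img_wm, logits
```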
3.1. Embedded Module

A uniformly distributed random key ∈ R^{b×l} is generated as the initial step in the watermarking process, where b and l denote the dimensions of the key vector. The generated key is input into an encoding network designed to encode the key into a form more suitable for watermark embedding. The encoding function E can be formulated as:

E_key = σ(W_e · key + b_e)

Simultaneously, the private key (key) is input into the encoding network and passed through an adaptive frequency mask M_adp, similar to the process in the embedding module, which can be formulated as:

M_key = M_adp ⊙ E_key

where ⊙ represents element-wise multiplication, and M_key is the frequency-masked encoded key.

Finally, the latent z'_wm obtained from I'_wm, together with M_key, is fed into the fusion layer F, composed of several convolutional layers, which computes an intermediate representation that is eventually fed to the discriminator model D to authenticate the presence of the private key. The fusion layer and the discriminator model can be mathematically represented as follows:

Ŷ = D(F(z'_wm, M_key))

Here, F denotes the fusion layers, D denotes the discriminator model, and Ŷ is the conclusion about the presence of the private key within the image.
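A minimal PyTorch sketch of the key encoder, adaptive frequency mask, and authentication head described above; the layer sizes and module names (KeyEncoder, AuthHead) are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class KeyEncoder(nn.Module):
    """Encodes a binary private key and applies a learnable (adaptive) frequency mask."""
    def __init__(self, key_bits=30, latent_shape=(4, 64, 64)):
        super().__init__()
        c, h, w = latent_shape
        self.encode = nn.Sequential(nn.Linear(key_bits, c * h * w), nn.Sigmoid())  # E_key = σ(W_e·key + b_e)
        self.mask = nn.Parameter(torch.ones(c, h, w))                              # adaptive frequency mask M_adp
        self.latent_shape = latent_shape

    def forward(self, key):                                  # key: (B, key_bits)
        e_key = self.encode(key).view(-1, *self.latent_shape)
        return self.mask * e_key                             # M_key = M_adp ⊙ E_key

class AuthHead(nn.Module):
    """Fusion layer F plus discriminator D: predicts whether the private key is present."""
    def __init__(self, channels=4):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.disc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1))

    def forward(self, z_wm_attacked, m_key):                 # Ŷ = D(F(z'_wm, M_key))
        return self.disc(self.fuse(torch.cat([z_wm_attacked, m_key], dim=1)))
```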
where A(Noise) and φ(Noise) represent the amplitude and phase adjustment functions, respectively.

The processed noise F_Noise is subsequently fused with the latent representation of the key, z_key, through concatenation and an inverse Fourier transform (IFFT):

z_wm = IFFT2(F_Noise ⊕ M_es)

yielding the watermark embedding vector z_wm, which incorporates both the encoded key and the adjusted noise.

Distribution Loss. The distribution loss constrains the watermarked noise to remain consistent with the Gaussian prior p(x_T) via a Jensen-Shannon divergence:

L_distr(θ) = JS(f_θ#(p_key, p(x_T)), p(x_T))

Total Loss. Finally, the overall training objective can be defined as:

L_total = λ_1 L_bce(θ) + λ_2 L_distr(θ)

where λ_1 and λ_2 are coefficients for balancing the different loss terms. In our experiments, λ_1 is typically set to 1, while λ_2 is set to 1e-5.
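As a sketch, the two objectives above can be combined as follows. The BCE term operates on the authentication logits against the key-presence label, and the JS-divergence term is approximated here by a simple moment-matching surrogate that keeps the watermarked noise close to a standard Gaussian; this surrogate is only one possible instantiation of L_distr, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pigw_loss(auth_logits, key_present, z_wm, lambda_bce=1.0, lambda_distr=1e-5):
    """Weighted training objective L_total = λ1·L_bce + λ2·L_distr (sketch).

    auth_logits: discriminator output Ŷ for (image, key) pairs, shape (B, 1)
    key_present: 1 if the private key was embedded, else 0, shape (B, 1)
    z_wm:        watermarked initial noise; L_distr pulls it toward N(0, I)
    """
    # Watermark robustness loss: binary cross-entropy on key presence.
    l_bce = F.binary_cross_entropy_with_logits(auth_logits, key_present.float())

    # Distribution loss: moment-matching surrogate for the JS term.
    l_distr = z_wm.mean() ** 2 + (z_wm.var() - 1.0) ** 2

    return lambda_bce * l_bce + lambda_distr * l_distr
```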
4. Experiments

The image size was uniformly set to 512x512, and the watermark was fixed at 30 bits (comprising 0 or 1). For the text-to-audio and text-to-3D experiments (see Table 4), default parameters from Diffusers1 were utilized. In the generative content detection experiments, a fixed watermark was used for training (randomly selected but fixed after selection).

1 [Link]
"A white cat with a red collar is eating
"A man riding a motorcycle through the rain." "a giraffe is standing next to a zebra outside"
from a green bowl."
Table 1. Method comparison results. T@1%F represents TPR@1%FPR. We evaluate watermark accuracy in both clean and adversarial settings. Adversarial here refers to average performance over a battery of image manipulations.

Text-to-Image Model        | Watermark Model | Clean AUC↑ | Clean T@1%F↑ | Adversarial AUC↑ | Adversarial T@1%F↑
SD V-1-4 [40]              | PiGW-base       | 1.000      | 1.000        | 1.000            | 1.000
SD V-1-5 [40]              | PiGW*           | 1.000      | 1.000        | 0.999            | 0.998
SD V-2-1 [40]              | PiGW*           | 1.000      | 1.000        | 0.999            | 0.987
SD LoRA [17] [repo]        | PiGW*           | 1.000      | 1.000        | 0.999            | 0.994
SD DreamBooth [41] [repo]  | PiGW*           | 1.000      | 1.000        | 0.996            | 0.991
SD super-resolution [repo] | PiGW**          | 1.000      | 1.000        | 0.983            | 0.977
VAE [25] [repo]            | PiGW-vae        | 1.000      | 1.000        | 0.999            | 0.984
SD-XL [36] [repo]          | PiGW-xl         | 1.000      | 1.000        | 0.962            | 0.914
StyleGAN3 [24] [repo]      | PiGW-gan        | 1.000      | 1.000        | 0.971            | 0.959
DALLE-2-torch [39] [repo]  | PiGW-dal        | 0.977      | 0.942        | 0.910            | 0.836
Method         | Clean | JPEG  | Cr.& Sc. | Gauss. Blurring | Gauss. Noise | Color Jitter | Average
DwtDct [5]     | 0.974 | 0.492 | 0.640    | 0.503           | 0.293        | 0.519        | 0.574
DwtDctSvd [5]  | 1.000 | 0.753 | 0.511    | 0.979           | 0.706        | 0.517        | 0.702
RivaGan [53]   | 0.999 | 0.981 | 0.999    | 0.974           | 0.888        | 0.963        | 0.854
Tree-Ring [51] | 1.000 | 0.999 | 0.961    | 0.999           | 0.944        | 0.983        | 0.975
Ours           | 1.000 | 1.000 | 0.980    | 1.000           | 1.000        | 1.000        | 0.996
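Both tables and Figure 5 report AUC and TPR@1%FPR for watermark detection. As a reference, these metrics can be computed from detector scores as in the following sketch, using scikit-learn's ROC utilities; the score arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def detection_metrics(scores_watermarked, scores_clean):
    """AUC and TPR at 1% FPR for a watermark detector.

    scores_watermarked: detector scores on watermarked images (positives)
    scores_clean:       detector scores on non-watermarked images (negatives)
    """
    y_true = np.concatenate([np.ones_like(scores_watermarked), np.zeros_like(scores_clean)])
    y_score = np.concatenate([scores_watermarked, scores_clean])

    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    tpr_at_1pct_fpr = np.interp(0.01, fpr, tpr)   # TPR@1%FPR
    return auc, tpr_at_1pct_fpr
```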
Figure 5. Results of watermark detectability of our model under four different image perturbation techniques, namely Gaussian blurring,
color jitter, cropping and scaling, and Gaussian noise. Across all perturbation types, our model, trained using a mini-batch strategy [20],
consistently outperforms the existing ’Tree-Ring’ algorithm in both AUC and TPR@1%FPR.
Table 5. Generative image detection. We present ACC/AP (%) in the table.

[Figure panels: (c) TPR@1%FPR and (d) AUC for GeWm (timestep progressive growth).]
3. Fundamental Methods

Here we give a brief introduction to diffusion models, DDIM sampling [7, 15, 47], and GANs. Diffusion models [45, 46] are a family of generative models that produce samples from a learned data distribution through an iterative denoising process.

3.1. Generative models

3.1.1 Diffusion models.

A diffusion model consists of a forward diffusion process, in which a datum (typically an image) is progressively noised, and a reverse diffusion process, in which noise is transformed back into a sample from the target distribution. The forward diffusion process applies T noising steps to create a fixed forward Markov chain x_1, ..., x_T [15]. Given the data distribution x_0 ∼ q(x), the Markov transition q(x_t | x_{t-1}) can be defined as a Gaussian distribution, specifically q(x_t | x_{t-1}) = N(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I) for t ∈ {1, ..., T}, where β_t ∈ (0, 1) is the scheduled variance at step t. The closed form of this transition can be derived from Bayes' rule and the Markov property. Specifically, one can express the conditional probabilities q(x_t | x_0) and q(x_{t-1} | x_t, x_0) as:

q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I), \quad t = 1, \dots, T,   (1)

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I), \quad t = 1, \dots, T,   (2)

where \alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t,   (3)

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t.   (4)

For the reverse diffusion process, DDPMs generate the reverse Markov chain starting from a prior distribution p(x_T) = N(x_T; 0, I).
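The closed form in Eq. (1) is what makes training practical: x_t can be sampled directly from x_0. A minimal PyTorch sketch, assuming a simple linear β schedule, is:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # β_t schedule (assumed linear here)
alphas = 1.0 - betas                            # α_t = 1 − β_t
alpha_bars = torch.cumprod(alphas, dim=0)       # ᾱ_t = ∏_{s<=t} α_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I), as in Eq. (1)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```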
For DDIM sampling, a single deterministic update step can be written as:

x_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, x_t + \sqrt{\alpha_{t-1}} \left( \sqrt{\frac{1}{\alpha_{t-1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1} \right) \cdot \epsilon_\theta(x_t, t, C),   (5)

where C = ψ(P) denotes the conditional embeddings, such as class labels, text, or low-resolution images [7, 16, 34, 39, 42, 44, 52].

For each denoising step, a learned noise predictor estimates the noise ε_θ(x_t, t, C) that was added to x_0 to obtain x_t at step t. From Eq. (1), we can derive the estimate of x_0 as:

\hat{x}_0^t = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t, C)}{\sqrt{\bar{\alpha}_t}}.   (6)

Then, the estimate \hat{x}_0^t is used to find x_{t-1}:

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0^t + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t, C).   (7)

The recursive denoising process from x_T to x_0 can be denoted as x_0 = D_θ(x_T).
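A sketch of Eqs. (6)-(7) as a single deterministic DDIM update, reusing the alpha_bars schedule defined above; eps_model stands in for the learned noise predictor ε_θ.

```python
import torch

@torch.no_grad()
def ddim_step(x_t, t, t_prev, eps_model, cond=None):
    """One deterministic DDIM update x_t -> x_{t_prev} following Eqs. (6)-(7)."""
    eps = eps_model(x_t, t, cond)                       # ε_θ(x_t, t, C)
    a_bar_t = alpha_bars[t].view(-1, 1, 1, 1)
    a_bar_prev = alpha_bars[t_prev].view(-1, 1, 1, 1)

    # Eq. (6): estimate of the clean sample x̂_0^t
    x0_hat = (x_t - (1.0 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()

    # Eq. (7): move to the previous timestep
    return a_bar_prev.sqrt() * x0_hat + (1.0 - a_bar_prev).sqrt() * eps
```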
3.1.2 Generative adversarial networks.

GANs [6, 12, 23] comprise two key components: a generator and a discriminator. The generator is responsible for synthesizing images from random noise inputs. During training, the generator conditions on the input text to produce visually consistent images that correspond to the provided semantic context. In this adversarial process, the discriminator competes with the generator to distinguish between generated and real images, thus guiding the generator's improvement in image generation.

3.1.3 Text-to-audio

AudioLDM [28] is proposed for text-to-audio generation by integrating contrastive language-audio pretraining and latent diffusion models. AudioLDM has three main advantages: high audio quality, high efficiency, and text-guided audio manipulations such as style transfer and super-resolution without fine-tuning. Additionally, it can conduct zero-shot style transfer and audio inpainting.
3.1.4 Text-to-3D

Jun and Nichol [22] introduce Shap-E, a text-to-3D method that is a conditional generative model designed for intricate 3D implicit representations. This two-stage approach first employs a transformer-based encoder to generate INR parameters for 3D assets. A diffusion model is then trained on the encoder's outputs. Shap-E can directly generate the parameters of implicit functions, enabling the rendering of both textured meshes and neural radiance fields. Trained on a large dataset of paired 3D and text data, Shap-E can quickly generate complex and varied 3D assets in just seconds.

3.2. FFT and IFFT.

Diffusion models convert an array of Gaussian noise into a real image. Our method imprints watermarks onto the initial Gaussian noise in the Fourier space to obtain combined signals. The combined signals, after going through the IFFT, are fed into the generator G to generate the watermarked images. To streamline the notation, we employ the symbols F and F^{-1} to represent these two procedures within Algorithm Framework 1.
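A minimal sketch of the F / F^{-1} operations on the initial noise, assuming the watermark is added to the centered Fourier spectrum of the latent noise; the shapes and the masked_watermark input are illustrative rather than the paper's exact procedure.

```python
import torch

def imprint_in_fourier(z, masked_watermark):
    """Imprint a (masked) watermark onto initial Gaussian noise in Fourier space.

    z:                 initial noise, e.g. (B, 4, 64, 64)
    masked_watermark:  tensor of the same shape, already scaled by the
                       adaptive frequency mask
    Returns the spatial-domain watermarked noise z_wm = F^{-1}(F(z) + watermark).
    """
    z_freq = torch.fft.fftshift(torch.fft.fft2(z), dim=(-2, -1))          # F: centered spectrum
    z_freq_wm = z_freq + masked_watermark                                  # combine signals in frequency domain
    z_wm = torch.fft.ifft2(torch.fft.ifftshift(z_freq_wm, dim=(-2, -1)))  # F^{-1}
    return z_wm.real
```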
4. Comparison and Analysis of Experiments

4.1. Frequency Masks Comparisons.

Table 7 presents the robustness effects of different watermark mask patterns. Experimental results demonstrate that, under identical settings, the use of an Adaptive Mask leads to enhanced watermark robustness, offering more flexibility for optimization.

As illustrated in Figure 9, four distinct watermark mask forms are depicted. The linear mask and the learning-based mask are evidently the most similar. Hence, in our experiments, we employed the linear mask as the initial state for the Adaptive Mask.

4.2. Robustness Comparison between Spatial and Frequency Domains.

The experimental results in Table 8 compare watermark embedding in the spatial domain and in the frequency domain. The experiments indicate that directly embedding the watermark into z in the spatial domain significantly degrades the generated image quality (as indicated by the increase in FID). By embedding the watermark in the frequency domain of z, combined with an adaptive mask, we largely preserve the high-frequency components of z, which are crucial for the quality of the generated images.

5. Visualization of Watermarking Images

5.1. Conditional Generation Models and Unconditional Generation Models

The experimental results after applying the PiGW method are depicted in Figure 10, showcasing the outcomes of both conditional and unconditional generative models.

5.2. Comparison of Audio Spectrograms with Watermarks

The spectrogram comparison of the audio after applying the PiGW method in the AudioLDM model is illustrated in Figure 12.

5.3. Comparison of Watermarked 3D Images

The transformation of generated samples in the text-to-3D model Shap-E after applying the PiGW method is depicted in Figure 13.

6. Ablation Study of Loss Functions

The ablation study results for the watermark robustness loss and the distribution loss of the Gaussian noise are presented in Table 9. It can be observed that L_bce significantly enhances the robustness of the watermark, while L_distr is crucial for the quality of the generated images.

7. Complexity and Time Overhead

The evaluation results on the complexity and overhead of the PiGW method are presented in Table 10. The experimental results demonstrate that our method has minimal impact on the parameters, FLOPs, and time overhead of the original generative model, which is highly favorable for existing generative models and the text-to-image ecosystem.

8. Supplementary materials folder

8.1. Sample Folder

Partial samples of experimental results for text-to-3D and text-to-audio are provided.

8.2. PiGW Demo Video

A demo video is provided to assist in introducing this work.
Table 7. Relevant Results of Watermark Mask Patterns and Watermark Robustness.
Table 8. Comparison of embedding watermarks in the spatial and frequency domains. Experiments demonstrate that directly adding the watermark (z_wm) in the spatial domain of z (the initial Gaussian noise) significantly affects the image quality of generative models.
Table 10. Testing the complexity and time overhead of the PiGW watermark plugin. For Encoder-w/o-wm, a randomly generated Gaussian vector is used directly as the initial noise input to the UNet. Conversely, for Encoder-with-wm, the watermark message is first projected into an embedding, which is then combined with the initial noise. Denoising is subsequently executed with the DDIM scheduler. The inference time is based on Stable Diffusion v2.1 with 50 inference steps. The results are measured one hundred times with a batch size of 1 and averaged on a single A100 80G GPU.
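As a reference for how such a timing comparison could be scripted with the diffusers library (the model ID and measurement loop are assumptions, not the authors' benchmarking code):

```python
import time
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

def mean_latency(prompt, runs=100, steps=50):
    """Average wall-clock time of 50-step DDIM sampling at batch size 1."""
    times = []
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        pipe(prompt, num_inference_steps=steps)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```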
Figure 10. Visualization of watermarked images. The upper images depict the experimental outcomes using Stable Diffusion v2-1, presenting images with and without watermarks (note: the same prompt is used); this experiment relies on a conditional generation model. The lower images showcase results from StyleGAN3, an unconditional generation model.
Figure 13. Comparison of watermarked 3D images. This experiment is based on the open-source model Shap-E [repo].