ACV - Notes - Final

The document provides an overview of various generative models including AutoEncoders, Variational AutoEncoders, Generative Adversarial Networks (GANs), Conditional GANs, and Diffusion Models. It discusses their architectures, objectives, and limitations, emphasizing the differences in how they generate and reconstruct data. Key concepts such as latent space representation, loss functions, and training challenges are also explored.


1 AutoEncoder

1.1 Overview
AutoEncoders (AEs) are a class of latent-variable models. The term latent refers to the model's ability to compactly represent the input in an encoded latent space, referred to as z. As a result, latent variables are often called the true explanatory factors of the input distribution. AutoEncoders have an encoder module, which learns to encode the input into a compact representation, z, and a decoder module, which learns to decode the compact representation back into the input. This allows us to reduce feature dimensionality, but it does not allow for novelty, since we end up reconstructing the same image.

2 Variational AutoEncoder

2.1 Overview
Variational AutoEncoders (VAEs) allow for more variability in the latent space, and as a result have a limited threshold for novelty. VAEs introduce a mean (µ) and standard-deviation (σ) vector from which the model samples to construct the encoded representation, z. The µ and σ are learned during training by having them model a prior distribution (typically a multi-variate standard gaussian distribution, N(0, I), where µ is 0 and the covariance is an identity matrix).

2.2 Goal
The goal of VAEs is to learn a latent space which is (1) continuous: points that are close in latent space also result in similar content after decoding, and (2) complete: sampling from the latent space always gives meaningful content after decoding.

2.3 Explanation
A multi-variate (d-dimensional) standard gaussian distribution is inherently independent, because its probability density function can be factorized into a product of d distributions, and the factorization into a product of marginal PDFs indicates the independence of the distribution. The I matrix has 1s along its diagonal and 0s on the off-diagonals, where the 1 at row j represents zj's σ.

VAEs use the Evidence Lower Bound (ELBO) and try to maximize it. The lower bound is a proxy objective that is less than or equal to log pθ(x); we can't directly maximize pθ(x) itself because the actual marginal likelihood is calculated using an integral, which is difficult to compute in high-dimensional spaces, has no closed-form solution, gives no direct gradient signal, etc. The ELBO is represented as: L(θ, ϕ; x) = E_{qϕ(z|x)}[log pθ(x|z)] − DKL(qϕ(z|x) || p(z)), where qϕ(z|x) is the distribution learned by the encoder, using parameters ϕ, and pθ(x|z) is the distribution learned by the decoder to reconstruct the original input, x, from the latent representation, z. Since log pθ(x) ≥ L(θ, ϕ; x), we can treat L(θ, ϕ; x) as the ELBO. The first term in the ELBO is referred to as the reconstruction loss, while the second term is called the regularization term.

The reconstruction loss measures how well the decoder, pθ(x|z), can reconstruct the original input, x, from a sample, z, drawn from the latent space modeled by the encoder's distribution, qϕ(z|x). Maximizing this term encourages a good generative model that can produce samples similar to the input. In practice, it is typically implemented as negative Mean-Squared Error (MSE) for continuous data, or negative Cross-Entropy for binary data.

The regularization term uses the KL Divergence (DKL) as a measure of similarity between the learned distribution, qϕ(z|x), and the prior distribution, p(z). The objective of the regularization term is to reduce the dissimilarity between qϕ(z|x) and p(z), as a result of which the elements of the vector z will have some degree of un-correlation or disentanglement. To control this level of un-correlation, we can introduce a β parameter that acts as a weight for the regularization term; increasing it results in higher levels of independence between the latent variables, but this can come at a cost, where the model just learns p(z) (qϕ(z|x) collapses to the prior) instead of something close to, but not exactly, p(z).

During gradient computation, since sampling is a stochastic process and as a result is not differentiable, the stochasticity is isolated by using the reparameterization trick. The reparameterization trick introduces a random sample, ϵ, from a fixed standard normal/gaussian distribution, and z is then computed as z = µ + σ ⊙ ϵ, where ⊙ represents element-wise multiplication. Since ϵ is drawn from a standard gaussian, it has µ = 0 and σ = 1, so the element-wise scaling keeps the same features, just scaled and shifted by the learned µ and σ.
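As a concrete illustration, a minimal PyTorch-style sketch of the reparameterization trick and the two ELBO terms (not from the original notes; the encoder outputs mu and log_var and the decoder output x_hat are assumed to be computed elsewhere):

import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I); keeps sampling differentiable w.r.t. mu and sigma
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x_hat, x, mu, log_var, beta=1.0):
    # Reconstruction term (negative log-likelihood), here MSE for continuous data
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Analytical KL( N(mu, sigma^2) || N(0, I) ) for a diagonal gaussian posterior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # beta > 1 weights the regularization term more heavily (beta-VAE style)
    return recon + beta * kl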
3 GAN

3.1 Overview
Generative Adversarial Networks (GANs) are also a class of latent-variable models, but instead of recreating the input using just an information-bottleneck layer as the latent representation, GANs use two-fold objectives: (1) the generator learns to generate examples that are as similar to the input as possible, and (2) the discriminator learns to tell the fake examples apart from the real ones. GANs don't explicitly model the density ("distribution"); instead they just sample to generate new instances. GANs sample from a simple distribution, such as noise (e.g. gaussian noise, z ∼ N(0, 1)), and then learn a transformation that maps z to the target data distribution: z → (Generator) → p(x). We refer to the samples generated this way as fake samples. Since GANs map points in the noise space, zd, to points in the learned target data space, pd(x), we are able to interpolate and traverse in the noise space, which directly results in traversal in the target data space. To refine the generation of fake samples, GANs introduce an adversarial behavior through discriminators. The discriminator tries to identify real data from fakes created by the generator, in turn training the generator to produce more realistic-looking fake examples. The global optimum here is that the generator, G, reproduces the input data distribution.

3.2 Explanation
The loss function for GANs is:
arg min_G max_D E_{z,x}[log D(G(z)) + log(1 − D(x))],
where D(x) is the probability that an example is fake, and G(z) is the example generated from noise z. D(G(z)) is the probability that the generated example is categorized as fake, and 1 − D(x) is the probability that a real example is categorized as real. We want to maximize the discriminator's objective and minimize the generator's objective, which is to minimize the probability that a sample generated by G is classified as fake. After the training process, G can be used to generate new data instances. GANs do not have an information-bottleneck layer to produce a latent representation (called z in AEs and VAEs), but it is possible to infer such a representation by using the activations of the layer before the final layer of the discriminator as a feature representation, since the discriminator must have learned a useful representation to be able to discriminate the data.
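A minimal training-loop sketch of this objective (a hedged PyTorch-style sketch with assumed generator and discriminator modules; note that, following these notes' convention, D outputs the probability that a sample is fake):

import torch

def gan_step(G, D, opt_G, opt_D, x_real, z_dim=100):
    # Discriminator step: maximize log D(G(z)) + log(1 - D(x)), i.e. minimize its negative.
    z = torch.randn(x_real.size(0), z_dim)
    x_fake = G(z).detach()
    d_loss = -(torch.log(D(x_fake)).mean() + torch.log(1 - D(x_real)).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: minimize log D(G(z)), i.e. make fakes look real to D.
    z = torch.randn(x_real.size(0), z_dim)
    g_loss = torch.log(D(G(z))).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()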
3.3 Limitations
GANs, similar to VAEs, have several issues that can hinder their results: (1) Mode/Mean Collapse is when the generator, G, instead of exploring and learning to produce samples across all modes (clusters/categories), gets stuck generating samples from only a limited subset of these modes, which leads to a lack of diversity and repetitive outputs. (2) Out-Of-Distribution Generation is when the generator is tasked with producing an instance not present in its training distribution, such as generating an image of a squirrel when it was only trained on shapes. (3) Unstable Training is why GANs are notoriously difficult to train: there is no definite convergence criterion and both G and D keep trying to optimize their objectives, so there is no clear way to know when to stop.

Mode Collapse: G's job is to fool D. If G discovers a specific type of sample that the current D struggles to identify as fake, then G has a strong incentive to produce more and more of that type of sample; G essentially finds a shortcut to lowering its error. Now, since G is producing a narrow set of samples, D's job also gets easier, and it learns to quickly discriminate these specific fake samples, but if D becomes too good, it might stop providing useful feedback to G. This leads to a cycle of oscillation, where G learns to generate a specific mode → D learns to classify it as fake → G realizes it is no longer working and switches to another mode → D identifies it as fake → then G switches to another mode, and so on. This can also result in vanishing gradients as D becomes near-optimal and starts providing less and less useful feedback (smaller and smaller gradients) to G.

To prevent mode collapse, multiple strategies can be employed: (1) Change the loss function: the original GAN uses the Jensen-Shannon (JS) Divergence, which can become saturated if there is very little overlap between pdata and pg, and as a result the gradients with respect to the generator's parameters become too small (vanish); using losses that penalize or provide meaningful gradients instead, like the Wasserstein GAN, can avoid this issue. (2) Mini-batch Discriminator: another problem with the discriminator is that it processes individual samples, not how diverse a batch of generated samples is; to solve this, a layer can be added to the discriminator that computes the similarity of samples within a mini-batch, and if the generated samples in the batch are too similar, the discriminator can penalize them. (3) Feature Matching (see the sketch below): instead of making the generator directly fool the discriminator, the generator's objective is modified to match statistics of the features extracted by an intermediate layer of the discriminator for real and fake samples, which encourages the generator to produce samples with similar statistical properties to the real data, not just single samples that fool the discriminator. (4) Progressive Growing of GANs: training starts with very low-resolution images and progressively adds layers to both the generator and the discriminator as training progresses, gradually increasing the resolution; as a result the training process becomes more stable, because the networks learn the overall structure of the image first at low resolution before being challenged with fine details, reducing the chance of the generator getting stuck on small local optima and encouraging it to learn a broader distribution.
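As an illustration of strategy (3), a hedged sketch of a feature-matching generator loss (the intermediate-feature extractor D.features is an assumed hook, not something defined in these notes):

import torch

def feature_matching_loss(D, x_real, x_fake):
    # Match first-order statistics of an intermediate discriminator layer
    # instead of directly maximizing D's confusion on individual samples.
    f_real = D.features(x_real).mean(dim=0)   # assumed: intermediate activations, averaged over the batch
    f_fake = D.features(x_fake).mean(dim=0)
    return torch.sum((f_real.detach() - f_fake) ** 2)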
4 Conditional GAN

4.1 Overview
Conditional Generative Adversarial Networks (C-GANs) allow more control over the class of the generated sample. C-GANs introduce a one-hot encoded class vector, y, which is appended to the noise, z, in the case of the generator, and to the input, x, in the case of the discriminator. The GAN is then trained to jointly learn the transformations to the distribution and classify the classes. The loss function is modified to incorporate this change:
arg min_G max_D E_{z,x}[log D(G(z|y)) + log(1 − D(x|y))],
where D(G(z|y)) is the probability of a generated sample being fake, given that it belongs to class y, and D(x|y) is the probability of an input sample being fake, given that it belongs to class y. The input layers of both G and D are enlarged to accept the concatenated input. In the inference phase, a class label, yi, can be chosen, and the generator will generate a point from that class.
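A minimal sketch of the conditioning described above (shapes and variable names are assumptions; the label is one-hot encoded and concatenated to both the generator and discriminator inputs):

import torch
import torch.nn.functional as F

def conditional_inputs(z, x, labels, num_classes):
    # One-hot encode the class label y and append it to the noise z (generator input)
    # and to the flattened data x (discriminator input).
    y = F.one_hot(labels, num_classes).float()
    gen_in = torch.cat([z, y], dim=1)                          # fed to G
    disc_in = torch.cat([x.flatten(start_dim=1), y], dim=1)    # fed to D
    return gen_in, disc_in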
5 Diffusion Models

5.1 Overview
Diffusion Models use a probabilistic distribution (typically gaussian) to sample and add noise to an input image iteratively, at each time step, and then learn a function to reverse the noise back to the original image. The process of adding noise at each time step is referred to as the forward diffusion process, and the process of recreating the image from noise by reversing the noise at each time step, starting from the last time step, is referred to as the reverse diffusion process. The idea here is to have a fixed method (no parameters) to add noise to the image in an iterative manner (because we need the partially noisy image at each time step for our reverse diffusion process to learn from and compare against), and then to model a function that maps a noisy image at a time step, t, to a less-noisy version of that image at time step t − 1.

5.2 Iterative Diffusion
The reason for the iterative forward/reverse process stems mainly from the idea of diffusion, the physical process of noise addition that happens gradually over time, but the iterative formulation also has other, better, reasons. During the reverse process, the model learns to invert the noising process step by step, going t → t − 1, making the learning problem more tractable (better computational feasibility, a simpler objective function, less prone to vanishing/exploding gradients, decomposition into smaller sub-problems), as learning to remove a small amount of noise at each time step is a much simpler problem than removing all the noise in one go.

5.3 Forward Diffusion Process
The forward process of Denoising Diffusion Probabilistic Models (DDPM) is a fixed, predefined Markov chain that gradually adds Gaussian noise to the data. It transforms x0 into a pure noise distribution, xT, over T discrete time steps.

The joint probability of the entire noisy sequence (1 : T), given the initial x0, is a product of all the individual conditional probabilities: q(x1:T|x0) = ∏_{t=1}^{T} q(xt|xt−1). Each step, q(xt|xt−1), is defined as a Gaussian distribution: q(xt|xt−1) = N(xt; √(1 − βt) xt−1, βt I), where 0 < βt < 1 is a variance schedule at time step t. The values of βt follow a linear schedule, from β1 = 10^−4 to βT = 0.02. √(1 − βt) is the scaling factor applied to the previous data, xt−1; as βt is small, this factor is close to 1, so most of the previous signal is preserved.

It is possible to directly sample xt from x0 without the iterative process, by using the reparameterization trick: xt = √(1 − βt) xt−1 + √βt ϵt−1, where ϵt−1 is sampled from standard gaussian noise at time step t − 1, and √βt is the standard deviation (σ = √σ²). We then define αt = 1 − βt and ᾱt = ∏_{i=1}^{t} αi, and substitute this back into the equation to get xt = √αt xt−1 + √(1 − αt) ϵt−1; we can then simply substitute xt−1 following the same equation, but with time step t − 1, and will eventually get: xt = √ᾱt x0 + √(1 − ᾱt) ϵ. As t increases, ᾱt will decrease, causing the mean term to diminish (µ → 0) and the variance term to dominate (σ → 1), causing xT to eventually approximate a pure Gaussian noise distribution. The direct sampling also makes the process of generating a sample xt simpler, making the training process more efficient.
5.4 Reverse Diffusion Process
The reverse process models a Markov chain, starting from xT ∼ N(0, I) (pure noise) and gradually denoising to x0. The goal is to learn the conditional probabilities pθ(xt−1|xt) for t = T, T − 1, ..., 1: pθ(x0:T−1|xT) = ∏_{t=1}^{T} pθ(xt−1|xt), and pθ(xt−1|xt) = N(xt−1; µθ(xt, t), Σθ(xt, t)), where Σθ is replaced by the pre-defined variance schedule, βt, and not learned.

We need to predict the noise in the reverse process. To do so, we would like to minimize −log(pθ(x0)), the negative log-likelihood, but it is intractable, so instead we replace it with a variational lower bound: −log(pθ(x0)) ≤ −log(pθ(x0)) + DKL(q(x1:T|x0) || pθ(x1:T|x0)). To be able to analytically compute the lower bound, we rewrite the KL term:

DKL(q(x1:T|x0) || pθ(x1:T|x0)) = log( q(x1:T|x0) / pθ(x1:T|x0) )    (rewrite as a log ratio - 1)

with pθ(x1:T|x0) = pθ(x0|x1:T) pθ(x1:T) / pθ(x0)    (Bayes on the denominator)
                 = pθ(x0, x1:T) / pθ(x0)    (combine into a joint)
                 = pθ(x0:T) / pθ(x0),

so, substituting back into (1):

DKL(q(x1:T|x0) || pθ(x1:T|x0)) = log( q(x1:T|x0) / ( pθ(x0:T) / pθ(x0) ) )
                               = log( q(x1:T|x0) / pθ(x0:T) ) + log(pθ(x0)).

Now, log(pθ(x0)) cancels out the −log(pθ(x0)), leaving only log( q(x1:T|x0) / pθ(x0:T) ) from the original LVLB.

log( q(x1:T|x0) / pθ(x0:T) ) = log( ∏_{t=1}^{T} q(xt|xt−1) / ( p(xT) ∏_{t=1}^{T} pθ(xt−1|xt) ) )
  = −log(p(xT)) + Σ_{t=1}^{T} log( q(xt|xt−1) / pθ(xt−1|xt) )    (log rules)
  = −log(p(xT)) + Σ_{t=2}^{T} log( q(xt|xt−1) / pθ(xt−1|xt) ) + log( q(x1|x0) / pθ(x0|x1) )    (separate t = 1)
  = −log(p(xT)) + Σ_{t=2}^{T} log( ( q(xt−1|xt, x0) / pθ(xt−1|xt) ) · ( q(xt|x0) / q(xt−1|x0) ) ) + log( q(x1|x0) / pθ(x0|x1) )    (Bayes + conditioning on x0)

The conditioning on x0 is crucial, as it allows the model to know what it should eventually denoise its input to. Also, the true posterior q(xt−1|xt) is intractable, since it requires integrating over all possible x0 values, which is computationally infeasible; however, when conditioned on x0, q(xt−1|xt, x0) is tractable.

The final ELBO is: Eq[log p(xT)] + Σ_{t=1}^{T} Eq[log( pθ(xt−1|xt) / q(xt|xt−1) )], where Eq[log p(xT)] validates the model's calibration to the true noise at the last diffusion step, ensuring that the reverse process starts accurately, and Σ_{t=1}^{T} Eq[log( pθ(xt−1|xt) / q(xt|xt−1) )] acts as a regularizer by comparing the model's backward predictions to the known forward diffusion, guiding the model to correct any discrepancies.

Eventually, we will end up with Lt = ||ϵ − ϵθ(xt, t)||², where ϵ is the true noise and ϵθ is the learned noise predictor. The final loss function, Lsimple, will be Lsimple = E_{t,x0,ϵ}[||ϵ − ϵθ(√ᾱt x0 + √(1 − ᾱt) ϵ, t)||²], where xt = √ᾱt x0 + √(1 − ᾱt) ϵ.
5.4.1 KL Divergence Between Two Gaussians
(The figure with the closed-form expression from the original notes is not reproduced here.)
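Since the figure is not reproduced, the standard closed-form result it refers to can be stated for univariate Gaussians (a well-known identity, not copied from the notes):

\[
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right)
= \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
\]

In the DDPM setting both Gaussians share the same fixed variance, so the KL term reduces to a scaled squared difference of their means, which is how the ||ϵ − ϵθ(xt, t)||² objective above arises.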
5.4.2 Algorithms
(The algorithm listings from the original notes are not reproduced here.)
*For t ≤ 1, we can use ϵ = 0 for more efficiency.
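A hedged Python sketch of what such training and sampling loops typically look like, written from the equations above (reusing T, betas, alphas, alpha_bar and q_sample from the forward-process sketch; eps_model is an assumed noise-prediction network):

import torch

def ddpm_training_step(eps_model, x0):
    # L_simple: predict the noise that was added at a random time step t
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    xt = q_sample(x0, t, eps)                       # sqrt(ab) * x0 + sqrt(1 - ab) * eps
    return ((eps - eps_model(xt, t)) ** 2).mean()   # ||eps - eps_theta(xt, t)||^2

@torch.no_grad()
def ddpm_sample(eps_model, shape):
    # Reverse process: start from pure noise x_T ~ N(0, I) and denoise step by step
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)   # eps = 0 at the last step
        a, ab, b = alphas[t], alpha_bar[t], betas[t]
        t_batch = torch.full((shape[0],), t)
        mean = (x - (b / (1 - ab).sqrt()) * eps_model(x, t_batch)) / a.sqrt()
        x = mean + b.sqrt() * z
    return x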
6 Transformer

6.1 Overview
The Transformer was introduced as a way to build a scalable and parallelizable model architecture that could be directly used for many downstream tasks. RNNs, LSTMs and other sequence models are inherently sequential, and as a result have no notion of parallelism, so training them was time-consuming and inefficient. Transformers move away from the sequential aspect and instead rely on attention and an encoder/decoder architecture with positional embeddings (for the order/sequence interpretation) to realize a highly scalable and easily parallelizable architecture.

6.2 Architecture
(The architecture diagram from the original notes is not reproduced here.)

In the encoder module, the input embeddings and positional embeddings are added and passed to the attention heads; each attention head uses self-attention across the embeddings, learning a different aspect of the language, which leads to a broader representation being learned. The result is then layer-normalized for training efficiency and added to the input embeddings via a skip connection. It is then finally passed on to the Feed-Forward Network (FFN), and the same layer-normalization and skip-connection steps are repeated before the result is passed to the decoder module. The decoder module takes the output embeddings (from the previous outputs, in an autoregressive manner), adds positional embeddings to them, and runs self-attention over them; the result is then combined with the encoder outputs through cross-attention, and the result of that is passed on to an FFN to output softmax probabilities.

Sinusoidal functions are used for positional embeddings because they don't have any learnable parameters, they can adjust to any sequence length, and sin(a + b) = sin(a)cos(b) + cos(a)sin(b) provides a mathematical intuition for their usage and calculation. They are also easily differentiable and continuous (although that property is only really relevant in the case of learnable embeddings). Their values lie between (−1, 1). They are calculated as: PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)).
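A hedged sketch of computing these sinusoidal positional embeddings (array shapes and an even d_model are assumptions):

import torch

def sinusoidal_positional_embeddings(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len).unsqueeze(1).float()      # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()               # even embedding dimensions
    angle = pos / torch.pow(10000.0, i / d_model)         # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                             # added to the token embeddings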
6.3 Explanation
We learn WK, WV and WQ as our key, value and query matrices, using them to compute attention weights: Attention(Q, K, V) = softmax(Q · K^T / √dk) · V. The dot product of Q and K tells us which keys attend best to our query; we then scale it and apply a softmax to get a probability vector, which we use to weight V, and the result is added to our original embeddings, letting the model know which tokens to attend to more. To avoid attending to subsequent tokens, we can mask those tokens by setting their scores to −∞ initially, which after the softmax becomes 0.
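A minimal sketch of scaled dot-product attention with the optional causal mask described above (shapes are assumptions: Q, K, V of shape (batch, seq_len, d_k)):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, causal=False):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # (batch, seq, seq)
    if causal:
        # Mask out subsequent tokens with -inf so they become 0 after the softmax
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                     # probability vector over the keys
    return weights @ V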
7 BERT

7.1 Motivation
• There are two existing strategies for applying pre-trained language representations, feature-based and fine-tuning; techniques based on the former use task-specific architectures that include the pre-trained representations as additional features (ELMo), while techniques based on the latter use an autoregressive approach with minimal task-specific parameters.
• Unidirectional language models (Left-To-Right, where every token attends only to previous tokens, as in the original transformer decoder, or Right-To-Left) restrict the choice of architectures that can be used for pre-training.
• Such representations are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
• The goal is to learn a bidirectional representation and to reduce task-specific architectures by allowing fine-tuning on pre-trained representations.

7.2 Analysis
• The LTR model performs worse than the MLM model on all tasks.
• Demonstrates convincingly that scaling to extreme model sizes also leads to large improvements on very small-scale tasks, provided that the model has been sufficiently pre-trained.
• Hypothesizes that when the model is fine-tuned directly on downstream tasks, and uses only a very small number of randomly initialized additional parameters, the task-specific models can benefit from the larger, more expressive, pre-trained representations, even when the downstream task data is very small.
• The feature-based approach also has advantages:
  – Not all tasks can be easily represented by a transformer encoder architecture, and these therefore require a task-specific model architecture to be added.
  – There are major computational benefits to pre-computing an expensive representation of the training data once, and then running many experiments with cheaper models on top of this representation.

7.3 Overview
Bidirectional Encoder Representations from Transformers (BERT) differs from ELMo (Peters et al.) and GPT (Radford et al.) by using an AutoEncoding approach rather than an Autoregressive approach. BERT uses Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives during the pre-training phase to learn better language representations that transfer to a multitude of downstream tasks with minimal task-specific additions to the architecture. MLM selects 15% of the tokens from an input sequence and (1) masks them 80% of the time by replacing them with a [MASK] token, so that the model learns to predict the token by understanding the surrounding context, (2) replaces them 10% of the time with a random word, so that the mismatch between pre-training (using [MASK]) and fine-tuning (not using [MASK]) can be mitigated, and (3) keeps the same token 10% of the time (a sketch of this masking scheme is given after this section). NSP is used to model the relationship between two sentences, which is not directly captured by language modeling. When choosing sentences A and B, 50% of the time B is a random sentence (labeled NotNext) and 50% of the time it is the actual next sentence (labeled IsNext). The context/class vector, C, is used for NSP. BERT's input representation is able to model a single sentence or a pair of sentences, for tasks like Sentiment Analysis or Sentence Entailment. The first token of every sequence is always a special classification token, [CLS]. The final hidden state of this token, C ∈ R^H, is used as the aggregate sequence representation for classification tasks and NSP. [SEP] is used as a separator/delimiter token between the pair of sentences. For output, token representations are fed into an output layer for token-level tasks, and C is fed into an output layer for classification/sentence-level tasks. Typically, sentence-pair tasks, such as machine translation, use Cross-Attention, but since BERT's input representation arranges the pairs in a single sequence, the Self-Attention mechanism from the transformer-based approach also functions similarly to bidirectional Cross-Attention.
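A hedged sketch of the 80/10/10 MLM masking scheme described above (token-id conventions such as mask_id, vocab_size and the -100 ignore label are assumptions for illustration):

import torch

def mlm_mask(token_ids, mask_id, vocab_size, select_prob=0.15):
    # Select 15% of tokens as prediction targets.
    inputs = token_ids.clone()
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < select_prob
    labels[~selected] = -100                        # ignore non-selected positions in the loss

    r = torch.rand(token_ids.shape)
    mask_80 = selected & (r < 0.8)                  # 80%: replace with [MASK]
    rand_10 = selected & (r >= 0.8) & (r < 0.9)     # 10%: replace with a random token
    # remaining 10%: keep the original token unchanged
    inputs[mask_80] = mask_id
    inputs[rand_10] = torch.randint(vocab_size, (int(rand_10.sum()),))
    return inputs, labels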
8 GPT

8.1 Motivation
• It is unclear what type of optimization objectives are most effective at learning text representations useful for transfer.
• There is no consensus on the most effective way to transfer these learned representations to the target task.
• The goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks.

8.2 Analysis
• Transferring embeddings improves performance, with each additional transferred layer increasing the score, indicating that each layer in the pre-trained model contains useful functionality for solving target tasks.
• The authors hypothesize that the underlying generative model learns to perform many of the tasks it is evaluated on in order to improve its language modeling capability, and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. Heuristic analyses suggest that generative pre-training supports the learning of a wide variety of task-relevant functionality, and that the LSTM exhibits higher variance in zero-shot performance, suggesting that the inductive bias of the Transformer architecture assists in transfer.

8.3 Overview
The Generative Pre-trained Transformer (GPT) relies on a Left-To-Right autoregressive approach to language modeling. It first pre-trains a language model on a diverse corpus of unlabeled text, then follows it with discriminative fine-tuning on each specific task. First, given a corpus of tokens, U, GPT uses a standard language modeling objective to maximize the likelihood: L1(U) = Σi log P(ui | ui−k, ..., ui−1; ϕ), where k is the size of the context window and P is the conditional probability modeled using the network's parameters, ϕ, which are trained using Stochastic Gradient Descent. A multi-layer Transformer decoder is used for the language model. Similar to the transformer, h0 = U We + Wp, where We and Wp are the embedding and position matrices, respectively, hl = transformer_block(hl−1) ∀ l ∈ [1, n], and finally P(u) = softmax(hn We^T). After the unsupervised pre-training phase, a labeled dataset, C, is passed through the pre-trained model to obtain the final transformer block's activation at the m-th token of the sequence. The result is then fed into an added linear output layer with parameters Wy to predict y. GPT uses a traversal-style input representation, which converts structured inputs into an ordered sequence. The input sequence begins with a <s> (start) token and ends with an <e> (end) token. Each sub-sequence is separated by a delimiter, $, similar to BERT's input representation style.
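A hedged sketch of the autoregressive objective and the added classification head (the model returning next-token logits, the hidden_states accessor and linear_head are assumptions, not GPT's actual API):

import torch
import torch.nn.functional as F

def gpt_lm_loss(model, token_ids):
    # Maximize sum_i log P(u_i | u_{i-k}, ..., u_{i-1}); equivalently, minimize the
    # cross-entropy of next-token prediction under a causal (left-to-right) decoder.
    logits = model(token_ids[:, :-1])                 # assumed: (batch, seq-1, vocab)
    targets = token_ids[:, 1:]                        # next tokens
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def gpt_finetune_logits(model, linear_head, token_ids):
    # Fine-tuning: feed the final transformer block's activation at the last token
    # into an added linear output layer W_y to predict the label y.
    h = model.hidden_states(token_ids)                # assumed: (batch, seq, d_model)
    return linear_head(h[:, -1])                      # logits over the task labels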
9 CLIP

9.1 Motivation
State of The Art (SOTA) computer vision systems are trained to predict a fixed set of predetermined object categories. This restricts their generality and usability. Learning directly from raw text about images can be leveraged instead. The simple task of predicting which image goes with which caption is an efficient way to learn SOTA image representations. Another motivation is to study zero-shot transfer as a way of measuring the task-learning capabilities of machine-learning systems.

9.2 Overview
Results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. Scalable pre-training methods which learn directly from web text act as the same kind of breakthrough that pre-training on large corpora was for BERT & GPT. The authors create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch is an efficient method of learning from natural language supervision. At the core is the idea of learning perception from the supervision contained in natural language. This approach couldn't be researched previously since we did not have the tools to learn deep contextual representations. Training efficiency was used as the metric to optimize during pre-training.

9.3 Pre-Training
During pre-training, CLIP prepares a batch (32,768 pairs in the paper) of N image-text pairs. N positive pairs and N(N − 1) negative pairs are constructed. The task is to predict which caption goes with which image in the batch.

CLIP employs 2 separate encoders: (1) an image encoder (ResNet or ViT) and (2) a text encoder (Transformer). Both encoders map their respective inputs into a fixed-dimensional embedding space. These embeddings are normalized to lie on a unit hypersphere.

Since previous research showed that contrastive objectives learn better representations than their equivalent predictive objectives, and that generative models of images can learn high-quality image representations but require over an order of magnitude more compute than contrastive models, the authors used ConVIRT as the basis for CLIP.

To understand the contrastive loss, let Ii be the embedding of the i-th image and Tj be the embedding of the j-th text from a batch. The similarity between an image and a text embedding is calculated using cosine similarity (sim(x, y)) and is scaled by a learnable temperature parameter, T. Now, the probability of a correct pair is given by:
p(Tj|Ii) = exp(sim(Ii, Tj)/T) / Σ_{k=1}^{N} exp(sim(Ii, Tk)/T) for an image Ii, and
p(Ii|Tj) = exp(sim(Ii, Tj)/T) / Σ_{k=1}^{N} exp(sim(Ik, Tj)/T) for a text Tj.
The overall contrastive loss (a symmetric cross-entropy loss based on InfoNCE) is the sum of negative log-likelihoods of the correct image-text pairs from both perspectives: L = −Σ_{i=1}^{N} [log p(Ti|Ii) + log p(Ii|Ti)]. The objective is to minimize this loss, effectively pulling the embeddings of positive pairs closer and pushing negative pairs apart.
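A hedged sketch of this symmetric contrastive loss, assuming the image and text embeddings have already been produced by the two encoders (the fixed temperature value and batch averaging are simplifications of the learnable temperature and summed loss described above):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize the embeddings onto the unit hypersphere.
    I = F.normalize(image_emb, dim=-1)
    Tx = F.normalize(text_emb, dim=-1)
    # Cosine-similarity logits for all N x N image-text pairs, scaled by the temperature.
    logits = (I @ Tx.t()) / temperature
    targets = torch.arange(I.size(0))                   # the i-th image matches the i-th text
    loss_i2t = F.cross_entropy(logits, targets)         # -log p(T_i | I_i)
    loss_t2i = F.cross_entropy(logits.t(), targets)     # -log p(I_i | T_i)
    return loss_i2t + loss_t2i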
9.4 Text Encoder as a Linear Classifier
Instead of learning weights Wk for each class, CLIP generates the weights from text descriptions of the classes. First, the classes are described via prompt templates, such as "a photo of a {class}"; then each description is encoded via the text encoder and the resulting embeddings, tk, are used as class descriptions. The image embedding, i, for a particular image is obtained via the image encoder. Then a similarity score, using cosine similarity, is calculated for each class embedding, tk, against the image embedding, i. Finally, a softmax over the similarity scores is calculated to get a probability distribution over the classes: P(classk|image) = exp(sim(i, tk)/T) / Σj exp(sim(i, tj)/T). So instead of training a linear layer on top, we simply reuse the similarity scores, with the text embeddings acting as weights for each class.
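A hedged sketch of this zero-shot classification procedure (encode_image and encode_text stand in for CLIP's encoders; the prompt template and temperature are assumptions):

import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text, temperature=0.07):
    # Build one "classifier weight" per class from a prompt template.
    prompts = [f"a photo of a {name}" for name in class_names]
    t = F.normalize(encode_text(prompts), dim=-1)       # (num_classes, d): text embeddings as weights
    i = F.normalize(encode_image(image), dim=-1)        # (1, d): image embedding
    # Cosine similarities act as logits; the softmax gives P(class_k | image).
    probs = F.softmax((i @ t.t()) / temperature, dim=-1)
    return probs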
9.5 Strengths
Learning from natural language supervision has several potential strengths over other training methods: (1) it is much easier to scale natural language supervision, compared to crowd-sourced labeling for image classification datasets, (2) methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet, and (3) the learned representation is connected to language, which enables flexible zero-shot transfer.

9.6 Limitations
(1) On datasets with training splits, the performance of CLIP is on average competitive with simple supervised models. (2) Around a 1000x increase in compute would be required for the zero-shot capabilities of CLIP to reach overall SOTA performance. (3) CLIP struggles with more abstract and systematic tasks, such as counting the number of objects. (4) CLIP performs poorly on out-of-distribution tasks, such as classifying MNIST digits, which were not part of the pre-training corpus; as a result CLIP performs worse than a simple logistic regression classifier there. CLIP circumvents this problem of brittle generalization by making the naive assumption that its pre-training corpus is so large and varied that all data will effectively be in-distribution. (5) Due to the unfiltered nature of the corpus, there are many biases that CLIP learns.

10 Vision Transformer

(The architecture figure from the original notes is not reproduced here.)

10.1 Overview
Vision Transformers (ViT) apply the idea of transformers directly to images by treating an image as a sequence of patches and feeding them directly to the transformer architecture. Transformers inherently lack inductive biases, such as translation equivariance and locality, and as a result do not generalize well when trained on insufficient amounts of data, but ViT shows that training on a sufficiently large-scale corpus trumps the inductive bias. The first layer of the ViT linearly projects flattened patches into a lower-dimensional space, then a learned position embedding is added (the model learns to encode distance within the image in the similarity of position embeddings: closer patches have more similar position embeddings, and for larger grids of patches the structure is akin to a sinusoidal one). Self-attention allows ViT to integrate information across the entire image, even in the lowest layers. The average attention distance is used to evaluate how far self-attention propagates information, and is akin to the receptive field size in CNNs. Some attention heads attend to most of the image already in the lowest layers, showing that information is indeed globally integrated. Other heads have small attention distances in low layers, similar to early convolutional layers. The attention distance increases with network depth.

10.2 Methodology
2D images of size 224x224 are split into H/P × W/P patches, where H is the height, W is the width and P is the patch size; so if P = 16, then 224/16 = 14, resulting in a 14x14 grid of patches, each with 16x16x3 = 768 dimensions when flattened, which is conveniently the same as the transformer's embedding dimension. A linear projection with learnable parameters is used to project the embeddings into the correct dimension, D, of the transformer if needed. A [CLS] token is prepended at the start, whose output serves as the image representation, y. A classification head is attached to the [CLS] token's output: an MLP at pre-training and a single linear layer at fine-tuning. The MLP layers are local and translationally equivariant, while the self-attention layers are global.
1D learnable positional embeddings are used, but analysis shows that these 1D positional embeddings learn 2D-aware context as well. 2D interpolation of the pre-trained position embeddings is performed for sequence lengths larger than the original embedding size, to handle arbitrary sequence lengths.

The 2D structure, as in CNNs, is only used when cutting the image into patches, and when adjusting the position embeddings during fine-tuning for differing resolution sizes.
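A hedged sketch of the patch-embedding front end described in 10.2 (the patch size, channel count and dimensions are the commonly used values assumed above, and the module layout is an illustration rather than the exact ViT implementation):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2             # 14 x 14 = 196 patches
        # Flatten each P x P x C patch (16*16*3 = 768 values) and project it to d_model.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))    # prepended [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))
        self.patch_size = patch_size

    def forward(self, x):                                            # x: (B, C, H, W)
        B, p = x.size(0), self.patch_size
        # Cut the image into non-overlapping P x P patches and flatten each one.
        patches = x.unfold(2, p, p).unfold(3, p, p)                  # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        tokens = self.proj(patches)                                  # (B, N, d_model)
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)                     # prepend [CLS]
        return tokens + self.pos_embed                               # add learned 1D position embeddings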
