FaceNet for Face Recognition
• FaceNet is a facial recognition system
developed by Florian Schroff, Dmitry
Kalenichenko and James Philbin, a group of
researchers at Google.
• The system was first presented at the IEEE
Conference on Computer Vision and Pattern
Recognition (CVPR) in 2015.
• FaceNet does not introduce a
completely new set of algorithms or complex
mathematical calculations to perform face
recognition tasks.
• The concept is rather simple.
FaceNet for Face Recognition
• All the images of faces are first
represented in a Euclidean space.
• We then calculate the similarity
between faces by computing the respective
distances.
• Consider this: if we have an image,
Image1, of Mr. X, then all the images or faces of
Mr. X will be closer to Image1 than to the
images of any other person.
FaceNet for Face Recognition
• Imagine you have three images: one of the
same person (Mr. X) and two of different
people. Triplet loss makes sure that the image
of Mr. X is closer to another image of him than
to any image of a different person.
• This helps FaceNet
learn to distinguish
between faces
accurately by pulling
similar faces closer
together in a 'face space'.
FaceNet for Face Recognition
• It is a pre-trained Deep Convolutional
Neural Network Model which performs facial
recognition using only 128 bytes per face.
• FaceNet is a Deep Learning Model that
learns a mapping from face images to a compact
Euclidean vector space where distances directly
correspond to a measure of face similarity.
• This model is used to encode face images
into a vector of numerical values.
• The FaceNet model aims to generate
highly discriminative embeddings for faces,
enabling accurate face identification and
verification.
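As an illustrative sketch of how such embeddings enable verification, the snippet below compares 128-element vectors by squared Euclidean distance. The random vectors, the `verify` helper, and the threshold value are placeholders for illustration, not part of FaceNet itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    # FaceNet embeddings are L2-normalized (they live on the unit hypersphere)
    return v / np.linalg.norm(v)

# Placeholder 128-d embeddings; a real system would compute these
# with a FaceNet forward pass on aligned face crops.
anchor    = l2_normalize(rng.normal(size=128))
same      = l2_normalize(anchor + 0.05 * rng.normal(size=128))  # same person
different = l2_normalize(rng.normal(size=128))                  # another person

def verify(e1, e2, threshold=1.0):
    # Faces "match" when the squared Euclidean distance is below a threshold
    return np.sum((e1 - e2) ** 2) < threshold

print(verify(anchor, same))       # small distance -> True
print(verify(anchor, different))  # near-orthogonal vectors -> False
```

Random unit vectors in 128 dimensions are nearly orthogonal (squared distance close to 2), which is why a mid-range threshold cleanly separates the two cases here.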
FaceNet Model Architecture
• FaceNet consists of a batch input layer and a
Deep CNN (DCNN) followed by L2 normalization,
which produces the face embedding. Finally, the triplet
loss is calculated on these embeddings during training.
FaceNet Model Architecture - DCNN
• The network starts with a batch input layer of the
images.
• And then it is followed by a deep CNN
architecture.
DCNN:
• The network utilizes an architecture like ZFNet or
Inception network.
• FaceNet implements 1×1 convolutions to
decrease the number of parameters.
• FaceNet takes an image of the person's face as
input, extracts high-quality features from the
face, and outputs a 128-element vector. In machine
learning, this vector is called an embedding (feature
vector).
• The output of these Deep Learning models is an
embedding of the input image.
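To see why 1×1 convolutions reduce parameters, compare a direct 5×5 convolution with an Inception-style 1×1 bottleneck. The channel sizes below are illustrative, not FaceNet's actual layer dimensions:

```python
# Direct 5x5 convolution: 256 input channels -> 256 output channels
direct = 5 * 5 * 256 * 256  # 1,638,400 weights (biases ignored)

# Bottleneck: a 1x1 convolution reduces 256 -> 64 channels,
# then a 5x5 convolution maps 64 -> 256 channels
bottleneck = (1 * 1 * 256 * 64) + (5 * 5 * 64 * 256)  # 425,984 weights

print(direct, bottleneck)  # the bottleneck needs roughly 3.8x fewer weights
```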
FaceNet for Face Recognition
• FaceNet is a state-of-the-art face recognition,
verification and clustering neural network.
• It is a 22-layer deep neural network that
directly trains its output to be a 128-
dimensional embedding.
• The loss function used at the last layer is
called triplet loss.
FaceNet Model Architecture - Triplet Loss
• The created embeddings are then fed to
calculate the loss.
• There is one vital concept implemented in
FaceNet – the Triplet Loss function.
• Anchor (A): This is the image of Mr. X that we're
comparing against others.
• Positive (P): All other images of Mr. X, which
should be closer to the Anchor.
• Negative (N): An image of a different person,
which should be farther away from the Anchor.
• Hence, as per the triplet loss, we want the distance
between the embeddings of the anchor image and the
positive image to be smaller than the distance
between the embeddings of the anchor image and the
negative image.
FaceNet Model Architecture - Triplet Loss
• The loss function aims to make the
squared distance between the embeddings of
similar images small, whereas the squared
distance between the embeddings of different
images is large.
• In other words, the squared distance
between the respective embeddings will decide
the similarity between the faces.
FaceNet Model Architecture - Triplet Loss
For every triplet, the triplet loss enforces the
constraint
\|f(x_i^a) - f(x_i^p)\|_2^2 + α < \|f(x_i^a) - f(x_i^n)\|_2^2
where
the anchor image is x_i^a,
the positive image is x_i^p,
the negative image is x_i^n (so x_i is an image), and
α is a margin that is enforced between positive
and negative pairs. It is the threshold we set,
and it signifies the minimum required gap between the
respective anchor-positive and anchor-negative distances.
FaceNet Model Architecture - Triplet Loss
The feature representations are obtained by
passing the images through the neural network
and extracting their respective feature vectors
f(x_i^a) and f(x_i^p).
The term \|f(x_i^a) - f(x_i^p)\|_2^2 is the squared
Euclidean distance (L2 norm) between the feature
representations of the anchor image x_i^a and the
positive image x_i^p.
FaceNet Model Architecture - Triplet Loss
argmax: This function returns the argument (in
this case, x_i^p) that maximizes the value of its
input expression.
So, argmax over x_i^p of \|f(x_i^a) - f(x_i^p)\|_2^2
finds the positive image x_i^p that maximizes the
Euclidean distance between its feature representation
and that of the anchor image.
This equation helps us find a positive image
that is as dissimilar as possible to the anchor
image (the "hard positive"). By selecting such positive
images, we encourage the network to learn to differentiate
even the hardest examples of the same person.
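The argmax selection (and its argmin counterpart for negatives) can be sketched with numpy. The 2-d embeddings here are toy values standing in for real 128-dimensional FaceNet embeddings:

```python
import numpy as np

anchor = np.array([0.0, 0.0])
positives = np.array([[0.1, 0.0], [0.9, 0.2], [0.3, 0.1]])  # same person
negatives = np.array([[1.5, 1.5], [0.4, 0.3], [2.0, 0.1]])  # other people

# Squared Euclidean distances from each candidate to the anchor
d_pos = np.sum((positives - anchor) ** 2, axis=1)
d_neg = np.sum((negatives - anchor) ** 2, axis=1)

hard_positive = positives[np.argmax(d_pos)]  # farthest same-person image
hard_negative = negatives[np.argmin(d_neg)]  # closest other-person image

print(hard_positive)  # [0.9 0.2]
print(hard_negative)  # [0.4 0.3]
```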
FaceNet Model Architecture - Triplet Loss
• T is the set of all the possible triplets in the
training set and has cardinality N, i.e. T contains all the
possible combinations of three elements from the
training set, and there are N such combinations.
• Mathematically, the triplet loss can be represented as
in Equation 2. It is the loss which we wish to minimize.
• In the preceding equations, the embedding of an
image is represented by f(x) such that f(x) ∈ ℝ^d.
• It embeds an image x into a d-dimensional
Euclidean space. f(x_i) is the embedding of an image
in the form of a vector of size 128 (here d = 128).
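Equation 2 can be reconstructed from the definitions above (using the hinge notation [z]_+ = max(z, 0); each term matches the constraint stated earlier):

```latex
L \;=\; \sum_{i=1}^{N} \Big[\, \|f(x_i^a) - f(x_i^p)\|_2^2 \;-\; \|f(x_i^a) - f(x_i^n)\|_2^2 \;+\; \alpha \,\Big]_{+}
```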
FaceNet Model Architecture - Triplet Loss
• Mathematically, triplet loss can be represented as:
Triplet Loss=max(0,d(A,P)−d(A,N)+α)
Here:
• d(A,P) is the distance between the Anchor and the
Positive images (usually measured using Euclidean
distance or cosine distance).
• d(A,N) is the distance between the Anchor and the
Negative images.
• α is a margin, a small positive constant that
enforces a minimum gap between the anchor-positive
and anchor-negative distances.
• If d(A,P) + α ≤ d(A,N), the loss is 0.
• Otherwise, we adjust the network to make A and P
closer and A and N farther apart.
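The formula above translates directly into a few lines of numpy. This is a sketch with toy 2-d embeddings; the squared Euclidean distance is used for d, and α = 0.2 is an assumed example value:

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    # Squared Euclidean distances anchor-positive and anchor-negative
    d_ap = np.sum((a - p) ** 2)
    d_an = np.sum((a - n) ** 2)
    # max(0, d(A,P) - d(A,N) + alpha)
    return max(0.0, d_ap - d_an + alpha)

a = np.array([0.0, 0.0])
p = np.array([0.3, 0.0])  # same person: close to the anchor
n = np.array([1.0, 0.0])  # different person: far from the anchor

print(triplet_loss(a, p, n))  # 0.0 -> constraint already satisfied
print(triplet_loss(a, p, p))  # 0.2 -> negative as close as positive,
                              #        so the loss equals the margin
```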
FaceNet Model Architecture - Triplet Loss
• For a given anchor image x_i^a, we want to find the
positive image x_i^p that maximizes their distance. This
is the most dissimilar same-person image (the hard
positive), per the argmax selection given earlier.
• Similarly, for the same anchor image x_i^a, we want
to find the negative image x_i^n that minimizes their
distance. In other words, the negative image that looks
most similar to the anchor (the hard negative).
FaceNet for Face Recognition
• During training, it is ensured that positives and
negatives are chosen as per the maximum and
minimum functions given earlier, computed on the mini-batch.
SGD (Stochastic Gradient Descent) with AdaGrad is
used for training.
• The two networks which have been used are
shown in the following (ZF-Net and Inception).
• There are 140 million parameters in ZF-Net and
7.5 million parameters for Inception.
• The model performed very well with 95.12%
accuracy with standard error of 0.39 using the first 100
frames.
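The AdaGrad part of the optimizer accumulates squared gradients per parameter and shrinks each parameter's step as its accumulator grows. A minimal numpy sketch of that update rule (illustrative learning rate and gradient values, not FaceNet's actual training code):

```python
import numpy as np

def adagrad_step(params, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate squared gradients, then scale the step per parameter
    accum += grad ** 2
    params -= lr * grad / (np.sqrt(accum) + eps)
    return params, accum

params = np.array([1.0, -2.0])
accum = np.zeros_like(params)

# Two steps with the same gradient: the second step is smaller,
# because the accumulated squared gradient has grown.
for _ in range(2):
    params, accum = adagrad_step(params, np.array([0.5, -0.5]), accum)

print(params)
```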
FaceNet for Face Recognition
• The usage of embeddings is the prime
difference between FaceNet and other
methodologies.
• FaceNet is a novel solution as it directly
learns an embedding into the Euclidean space
for face verification.
• The model is robust to variations in the pose,
lighting, occlusion, and age
of the faces.