Thesis
[LIS]
Co-supervised by:
Mme. GUIS Vincente, Research Engineer, LIS
JURY:
Mme. GODIN Christelle, Research Director, CEA, Examiner
M. CHERUBINI Andrea, Professor, LIRMM, Reviewer
M. JOLY Philippe, Professor, IRIT, Reviewer
Abstract
Deep Learning applied to computer vision has been shown to extract many kinds of semantic information. From classification to localization, or pixel-level semantic segmentation, these new algorithms have improved the state of the art across many tasks and domains. The company I have been working with provides video streaming platforms for many customers. One of them wants to compete with other actors who have been investing in deep learning to improve their user experience. We aim at extracting semantic information that was not accessible before, in order to make better personalized suggestions, emphasize high-quality content, and propose new content browsing and exploration features. As such, in this work, we explore tasks such as face identification, activity recognition and recommender systems, with an emphasis on latency and the ability to deploy at scale. Our contributions build on three datasets developed from our industrial content. The first contribution is a study on data augmentation and pretrained models to train a classifier on an activity dataset from our data domain. Our second contribution is a survey on learning classifiers in the presence of label noise. The next contributions revolve around face recognition. We propose a new loss function, the Threshold-Softmax, aiming to learn from negative samples, that is, faces whose identity is only known not to be one of the other classes. We then revert from metric learning to standard classifiers and explore four loss functions to further exploit negative learning, using a dataset of faces labeled with their identity, of people famous in our customer's domain. We also contribute a face swapping model based on the Vector-Quantized Variational Auto-Encoder (VQVAE), along with a new algorithm to improve vector quantization. Finally, we use the browsing history of premium users to learn a recommender system based on metadata, aiming to mitigate the cold start problem for both users and items.
Résumé
Le Deep Learning appliqué à la vision par ordinateur s’est révélé capable d’extraire de nombreux types
d’informations sémantiques. De la classification à la localisation, ou à la segmentation sémantique au
niveau du pixel, ces nouveaux algorithmes ont amélioré l’état de l’art de nombreuses tâches et de nom-
breux domaines. L’entreprise dans laquelle je travaille fournit des plates-formes de streaming vidéo
à de nombreux clients. L’un d’entre eux souhaite concurrencer d’autres acteurs qui ont investi dans
l’apprentissage profond afin d’améliorer leur expérience utilisateur. Notre objectif est d’extraire des in-
formations sémantiques qui n’étaient pas accessibles auparavant afin de faire de meilleures suggestions
personnalisées, de mettre l’accent sur le contenu de haute qualité et de proposer de nouvelles fonction-
nalités de navigation et d’exploration du contenu. Ainsi, dans ce travail, nous explorons des tâches
telles que l’identification de visage, la reconnaissance d’activité et les systèmes de recommandation en
mettant l’accent sur la latence et la capacité de déploiement à grande échelle. Nos contributions ont
été réalisées en développant trois jeux de données à partir de notre contenu industriel. La première est
une étude sur l’augmentation des données et les modèles pré-entraînés pour entraîner un classificateur
à partir d’un ensemble de données d’activité pour notre domaine de données. Notre deuxième contribu-
tion est une étude sur l’apprentissage de classifieurs en présence de bruit d’étiquettes. Les contributions
suivantes portent sur la reconnaissance des visages. Nous proposons une nouvelle fonction de perte,
le Threshold-Softmax, visant à apprendre à partir d’échantillons négatifs, c’est-à-dire des visages dont
l’identité n’est pas celle d’une des autres classes. Nous revenons de l’apprentissage métrique aux clas-
sificateurs standards et explorons quatre fonctions de perte pour exploiter davantage l’apprentissage
négatif, en utilisant un jeu de données de visages étiquetés avec leur identité, de personnes célèbres
dans le domaine de notre client. Nous proposons également un modèle d’échange de visages basé sur
le Vector-Quantized Variational Auto-Encoder (VQVAE), ainsi qu’un nouvel algorithme pour améliorer
l’algorithme de quantification vectorielle. Enfin, nous utilisons l’historique de navigation des utilisa-
teurs premium afin d’apprendre un système de recommandation basé sur les métadonnées, visant à
atténuer le problème du démarrage à froid pour les utilisateurs et les vidéos.
Remerciements
Table of Contents
Abstract ii
Remerciements iv
1 Introduction 1
2 Industrial Context 3
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Problems and motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.4 Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Machine Learning 10
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 What is Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 ⇒Practical Example: A classifier for HActions . . . . . . . . . . . . . . . . . . . . . 21
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Face Recognition 34
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Standard systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.3 Metric learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4 ⇒Contribution: Face Recognition with Threshold-Softmax . . . . . . . . . . . . . . . 38
7 Recommender System 84
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Recommender systems are hard to build . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3 Types of recommender systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4 Baseline algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.5 ⇒Project: Hexaglobe’s RecSys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9 Conclusion 116
List of Tables
4.1 Approaches according to annotations in the dataset. Notes: TIMIT is a speech to text
dataset, ”NLP” is a set of natural language processing datasets (Twitter, IMDB and
Stanford Sentiment Treebank), ”face rec” denotes classical face recognition datasets
(LFW, CALFW, AgeDB, CFP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Accuracy on face verification for LFW and FGLFW image pairs for various loss func-
tions. Best rejection angular threshold selected for each method. . . . . . . . . . . . . 40
5.2 Various metrics for each model, rejection threshold selected at maximal total accuracy.
Best and second best results are highlighted . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Various metrics for each model, rejection threshold selected at maximal F1. Best and
second best results are highlighted . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 F1 AUC for the models evaluated. Value for ArcFace is normalized for comparability
(0.26 to 0.34) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
List of Figures
5.1 Figure 17 from Wang and Deng [165]. The comparison of different training protocols and evaluation tasks in FR. In terms of training protocol, FR can be classified into subject-dependent or subject-independent settings according to whether testing identities appear in the training set. In terms of testing tasks, FR can be classified into face verification, closed-set face identification, and open-set face identification. . . . . . . . . . 36
6.1 G is a generator that turns a person identifier yi and a latent variable zi into an image. 55
6.2 Conditional probability graph of an autoregressive model. Each pixel depends on the
previous ones, iteratively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 A PixelCNN sampling a pixel value for the current pixel from its surrounding context.
White pixels are still undetermined; grey pixels have already been sampled. Shown in
red is the softmax output describing the probability distribution of the current pixel
values conditioned on the context window. Image from Kolesnikov and Lampert [83] . 56
6.4 Conditional probability graph of a Latent Variable Model (LVM). The whole image x
is sampled at once from a lower dimensional encoding z. . . . . . . . . . . . . . . . . 57
6.5 Training a latent variable model for colorization. There are multiple possible coloriza-
tions for a single greyscale input. A latent extractor h extracts the information solving
the ambiguity between those multiple answers ; an information bottleneck prevents the
latent extractor from encoding all of the target and short-circuiting the task. The col-
orizer f resolves ambiguous cases using the latent. . . . . . . . . . . . . . . . . . . . 60
6.6 Figure from [155] developing the quantization process. . . . . . . . . . . . . . . . . . 61
6.7 top: Training a Vector-Quantized Variational AutoEncoder (VAE) (VQ-VAE) stage 1:
a quantized encoder and decoder are trained in an autoencoding fashion. bottom left:
Training a VQ-VAE stage 2: the encoder is frozen and an autoregressive prior is learnt
on the extracted latents. bottom right: Sampling from a VQ-VAE: We generate a latent
variable from the prior model and decode it to a full picture . . . . . . . . . . . . . . . 61
6.8 Training a standard GAN. top left: G is kept frozen, we teach D to classify a fake
sample as a fake image with a Binary Cross-Entropy (BCE) loss BCE(D(xf , 0)). top
right: D is taught to classify a real sample with BCE(D(xr , 1)). bottom: we train G
to produce images that are classified as true by D with BCE(D(xf , 1)), D is kept frozen. 63
6.9 Interpreting D as a trainable loss giving low values to real samples and high values to
fake samples. G learns to minimize the loss D represents. Gradients of fake samples
represented as white arrows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.10 An example of GAN training collapse. The generated samples suddenly cease converging towards realistic samples, and the GAN never escapes this degenerate state. Image source: https://www.mathworks.com/help/deeplearning/ug/monitor-gan-training-progress-and-identify-common-failure-modes.html . . . 65
6.11 Effect of regularizers. Top: D is trained without a regularizer. The loss landscape might be noisy and hard to optimize against. There are strong peaks and valleys because of the unregulated Lipschitzness. Bottom: D is trained with R1 or Wasserstein GAN with Gradient Penalty (WGAN-GP) regularizers, smoothing the surface around real data points or just controlling D's Lipschitzness. The gradients are more predictive of the correct optimization direction, the loss is easier to optimize against, and the peaks and valleys are smoother than in the unregulated version. Note: these surfaces are just for illustrative purposes and are not visualizations of actual loss surfaces. . . . . . . . . 67
6.12 A cGAN. The discriminator and generator are both conditioned on y. . . . . . . . . . . 68
6.13 Examples of image translation from the original pix2pix paper [72]. x is a real image,
y a label, and G(y, z) a fake sample produced by the generator. . . . . . . . . . . . . . 69
6.14 In BigGAN, G generates samples from features and E generates features from samples.
Both pairs are discriminated, forcing G and E to reciprocate each other. Figure from [38]. 70
6.15 In the InfoGAN, the generator is fed with a random noise z and random categorical
and continuous random codes c. The discriminator pushes the generator towards real
samples. Q tries to guess c and G cooperates, ideally leading to G utilizing c in an
interpretable way so that Q can identify them back in the generated samples. Figure
from [97]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.16 The CycleGAN architecture. Image from https://towardsdatascience.com/image-to-image-translation-using-cyclegan-model-d58cfff04755 71
7.1 Long tail. A few popular items (the head) get a significantly higher number of views than the majority (the tail). Exploiting only head items results in neglecting most items. The Y axis is clamped at 2k views but the highest video count is 8k. . . . . . . . . . . 87
7.2 In collaborative filtering, we aim to guess the ratings one user would give to an item given the ratings similar users gave. Would she like Shrek because she liked The Dark Knight like user 1, or dislike Shrek because she liked Memento like user 2? Picture from https://developers.google.com/machine-learning/recommendation/collaborative/basics . . . . . . . . . . . . . . 88
7.3 This imaginary app store has 3 apps: a science app, a robot game, and a dentist appointment finder. Those apps and John's interests are annotated by a set of tags shown above the table. Based on John's past interests, the first item, the science app, seems to be a good recommendation: John's and this app's feature vectors share the greatest similarity. 89
7.4 The ratings matrix is decomposed as the inner product of user latent factors and movie latent factors, discovered during learning. They can be inspected to find semantically meaningful features. Image source: https://developers.google.com/machine-learning/recommendation/collaborative/basics . . . . . 90
7.5 Overview of the model. We sample a user, randomly sample a video from the watch
history, and cut the history at its watch timestamp t. The user is embedded by a user
network while videos are encoded with a video network. The dot product of their
embedding is computed and fed to a softmax + negative log likelihood loss, trained to
predict the next video watched. The x denotes the dot product / matrix multiply operation. 92
7.6 Illustrating word2vec training. A linear model trains word embeddings either by pre-
dicting the center word of a context window, or the context words of a context window
from the center words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.7 We train and evaluate our recommender system in a contrastive way. Batches of 256
pairs of histories and their next viewed video are loaded; the encoders learn to embed
them so that the dot product of the real pair is greater than the ones of the other possible
pairs formed in the batch. In other words, the encoders are learned so that ui ·vi > ui ·vj
with i ≠ j. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.8 Experiments on video encoder for a fixed user encoder. We aim to understand how
features contribute to the classification information and build a model from this infor-
mation. Test Top1 accuracy is indicated. . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.9 Experiments on user encoder for a fixed video encoder. We aim to understand how
features contribute to the classification information and build a model from this infor-
mation. Test Top1 accuracy is indicated. Unless indicated otherwise, the history length
H is set to 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.1 Visualization of torchelie.hyper hyperparameter search. The user can select hyperparameters to sample (and how to sample them), and target metrics. Once run, the results appear in this visualization. In this case, we highlighted via the interface the three runs with the best resulting accuracy. . . . . . . . . . . . . . . . . . . . . . . . . 104
8.2 The ClassificationInspector allows seeing the performance of the classifier live. It reports the samples that provide the best, worst, and most confused answers from the classifier. The bar below the images is green when the prediction is correct, red otherwise; the width reflects the confidence score of the prediction. This allows eyeballing the dataset, the strengths and weaknesses of the model, and building intuition. . . 105
8.3 Live confusion matrix provided automatically when the number of classes is not too big
to make it unreadable (less than 25 classes). . . . . . . . . . . . . . . . . . . . . . . . 106
8.4 Gradient of the loss wrt the input on the current batch. The per-pixel norm of the
gradient weighs each pixel’s intensity. This helps figuring out what the model looks at
in the picture in order to make its predictions . . . . . . . . . . . . . . . . . . . . . . 107
B.1 Additional plots for the ArcFace model (Section 5.5.5) . . . . . . . . . . . . . . . . . 135
B.2 Additional plots for the CE model (Section 5.5.5) . . . . . . . . . . . . . . . . . . . . 136
B.3 Additional plots for the DCE model (Section ??) . . . . . . . . . . . . . . . . . . . . 136
B.4 Additional plots for the ZLog model (Section 5.5.5) . . . . . . . . . . . . . . . . . . . 137
C.1 Incremental improvement process from ResNet to ConvNext. Figure extracted from
Liu et al. [102]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
1 Introduction
This thesis summarizes four years of work and research in collaboration with Hexaglobe for my PhD.
This took place at the Université de Toulon, in the Laboratoire Informatique et Système (LIS). The section I work in focuses on solving problems with automated statistical approaches, commonly called Machine Learning. Machine Learning uses algorithms that are able to learn patterns from data in order to make predictions on new data. In the last decade, neural networks, a class of those algorithms, gained a lot of traction. Researchers managed to stack many layers of neural networks, an approach now dubbed Deep Learning. Computer Vision treats images in order to understand or process them in various ways. Tremendous progress was achieved in Computer Vision thanks to new Deep Learning techniques, which will be the main focus of this work.
Hexaglobe is a company providing video distribution platforms to many customers. The work
plan was to dedicate the research and innovation efforts to a single customer willing to invest in order
to lead its market. This customer has huge quantities of data, the possibility to label datasets, and can provide hardware, making it a convenient deep learning research environment. As such, there is a strong emphasis on applied research, as the problems treated are motivated by industrial challenges. The modern computer vision developments, which started around 2015, were seen as an opportunity to modernize the underlying software of the video platform, enriching user experience through semantic analysis of the content.
The customer and Hexaglobe were motivated by the numerous press articles enthusing about computer vision's astonishingly fast progress in the deep learning era. They wanted to investigate how useful deep learning could be for a video streaming platform. Could new computer vision algorithms
extract semantic information from pixels, useful for enhancing the user experience? Can deep learn-
ing outperform the recommender system currently in place, based on manual heuristics and popularity
scores, using semantic information instead? Can we recognize and annotate persons famous in our
domain, at scale (both in number of samples to label and identities to recognize)? Can deep learning be
used to create and exploit metadata for video content recommendation?
A two-step work plan was made: first, extract face recognition metadata from videos, then build a recommendation engine using them and other available features and metadata. Chapter 2 will give more context about Hexaglobe and the customer's datasets. Then, Chapter 3 will outline how modern computer vision algorithms work: define the main components and exemplify them with an image classification project. Chapter 4 acknowledges that our face recognition training dataset has noisy label issues and investigates state-of-the-art methods for detecting and mitigating them. Chapter 5 deals with face recognition itself. We will see how one can leverage modern generative models with the intent to reinforce face recognition models in Chapter 6. Chapter 7 lays out how we are building our customer's recommender system. Finally, Chapter 8 will present the code framework developed to support both research and industrial work.
This document also highlights contributions:
1. a survey on label noise in the context of deep image classification (Chapter 4), that was con-
tributed in SANCHEZ et al. [138];
2. a novel loss function for metric learning applied to face recognition, Threshold-Softmax (Section
5.4);
3. a system for using and detecting distractors in the context of open-set face recognition (Section
5.5);
4. improvements over the original Vector Quantized information bottleneck (Section 6.11), one being contributed in Łańcucki et al. [90] and another described in Chapter 6;
5. a face swapping model based on the Vector-Quantized Variational Auto-Encoder (Chapter 6);
6. a study of various design options for our customer's recommender system (Section 7.5);
7. a novel framework for deep learning work (Chapter 8), publicly available at https://
github.com/Vermeille/Torchelie.
2 Industrial Context
2.1 Introduction
In order to understand the work that has been done during this PhD, it is preferable to first contextualize
it. Hexaglobe will be presented first, along with their problems and motivations. Then, Section 2.3 will
introduce the datasets developed internally in order to approach the problems that we aim to solve.
Hexaglobe provides all types of companies in the modern media landscape with technologies and
professional services covering the entire process from video ingest to delivery.
Customers are numerous and diverse, from TV channels to radio stations and Video On Demand (VOD)
producers. Hexaglobe takes care of the whole video life cycle: uploading, storage, metadata extraction
and management, encoding, referencing, searching, and serving.
2.3 Data
Throughout the thesis, three datasets have been created and continue to be developed as an ongoing process. They need to be refined, completed, and changed in order to fit the ever-growing industrial needs. They are used for training and evaluating models before deployment.
1. HActions is an image dataset for activity recognition. It has been tailored to 12 popular activities in our domain;
2. HFaces is a face recognition dataset of people famous in our customer's domain, labeled with their identity;
3. HHistory is a dataset of premium users' browsing history, used to build a recommender system.
Due to the confidential nature of the datasets, the nature of the data and example samples cannot be
revealed.
2.3.1 HActions
I manually labeled this first dataset myself. It contains natural images of people engaged in what are assumed to be the 12 most popular activities in the website's videos, labeled A to M, plus an extra class that represents any other activity and another one for title screens / text screens showing no humans. The distribution of data samples is shown in Figure 2.1 and a few samples in Figure 2.2.
In many activity recognition tasks, the environment can be extremely informative. For instance,
karting, canyoning, sailing, driving, climbing, all take place in different environments and there are
activity clues scattered all over the picture. However, in our situation, those activities are decided
by body position rather than environment. In some sense, sleeping, singing, dining, watching TV or
playing games all take place in a domestic environment, and a classifier would have very few clues in
the environment to sort them out. Special care would be needed to make sure the classifier does not
overfit on spurious background elements, for instance.
Since training and inference on full videos would require a lot of compute, we instead assume that still pictures taken from the videos contain enough information for the classification task. 300 evenly spaced frames are extracted throughout videos that last more than one minute. That way the computation budget remains fixed and controlled. For comparison's sake, if we were to use all the video frames with a temporally aware model, 300 frames would represent a video of only 10 s. Even if a video-aware model performed much better, the associated costs would be too high, and reducing the frame rate in training and inference would need another pass of video transcoding, which is already expensive.
Instead of randomly splitting the pictures among train / validation / test sets, we randomly split the videos among those sets. Not doing so would create validation and test sets containing examples very similar to the ones in the train set, as nearby frames look alike. We aim to have a model that generalizes to videos outside the training set, not to frames of a closed set of videos, thus the validation and test sets must contain frames of unseen videos.
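To make the split concrete, here is a minimal sketch of a video-level split in Python (the helper and field names are hypothetical, not the actual pipeline code):

```python
import random
from collections import defaultdict

def split_by_video(frames, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split frame records into train/val/test so that all frames of a given
    video end up in the same split, avoiding near-duplicate leakage."""
    by_video = defaultdict(list)
    for frame in frames:                 # frame = {"video_id": ..., "path": ..., "label": ...}
        by_video[frame["video_id"]].append(frame)

    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    n = len(video_ids)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    splits = {
        "train": video_ids[:n_train],
        "val": video_ids[n_train:n_train + n_val],
        "test": video_ids[n_train + n_val:],
    }
    return {name: [f for vid in ids for f in by_video[vid]]
            for name, ids in splits.items()}
```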
2.3.2 HFaces
Automatic face recognition would bring very valuable metadata to our content. In our business, identifying celebrities in our domain is one of the core features that the website's visitors would find valuable. The dataset contains 8938 identities, but is growing every day as we wish to recognize more people. The distribution of the number of pictures per identity is shown in Figure 2.3 and a few samples are
Figure 2.1: Distribution of data samples per class in HActions (sample counts and percentages per class letter).
Figure 2.2: A few samples from HActions, with class label ”none”
Figure 2.3: Number of examples per identity in the photos subset of HFaces (sorted). Most identities have
between 10 and 100 samples. Each bar represents a different class, sorted from least to most populated.
Figure 2.4: Samples from HFaces. Top row: extracted from pictures. Bottom row: extracted from videos.
shown in Figure 2.4. The dataset has been built with balance in mind. After collecting a few very famous identities with web scraping, the labelers were tasked to manually collect 100 pictures per identity when possible.
Our face dataset is divided into two data sources: faces coming from promotional pictures and
videos. Faces coming from pictures are easier to collect but are prone to domain mismatch with videos
as they are cleaner: the faces are usually not occluded, there is no motion blur, lighting is good and
people tend to smile. In videos, none of this might hold true. We collected many pictures from photos and a smaller set from videos in order to evaluate and eventually mitigate the domain mismatch between the two.
The dataset has been complemented with MS1Mv2 [54] to provide the so-called "distractors", developed in Section 5.5.
2.3.3 HHistory
Finally, there is a dataset of premium users' browsing history. This dataset provides historical data for building a recommender system, analyzing trends, studying the semantic proximity of items, etc. It could be useful in various ways: videos frequently watched together can serve as a contrastive / metric learning dataset to learn the visual features that make them similar, or to learn about tag proximity, etc. This dataset can serve many purposes, and the only measurement really lacking is the length of watch time per video, which would help assess the interest of the user.
Figure 2.5: Number of views per video (sorted). Each video is a thin vertical bar.
Premium content is only a subset of the whole content. The dataset has 135,984 users (content uploaders), 3,673 channels and 139,711 videos. The distribution of views is shown in Figure 2.5. It contains various user information, much of it optional:
1. user ID
2. username
3. country / region
4. gender
7. favorited videos
9. channel subscriptions
Each video is described by metadata, much of it also optional:
1. video ID
2. main category
4. people appearing in the video (from the face recognition model or the previous system)
8. upload timestamp
Finally, each channel has popularity scores per category and region.
2.4 Machines
For this work, Hexaglobe provides me with three machines, hosted by our cloud computing provider.
1. gpu1 has 2 NVIDIA GTX 1080 Ti and is used mainly for production and inference.
2. gpu2 has 4 NVIDIA RTX 2080 and is used for my main experiments. It allows me to iterate
quickly during development by using large batch sizes.
3. gpu3 has 2 NVIDIA RTX 2080. This machine is used for runs that I don’t mind waiting for or
side experiments.
The university also provided several clusters that I could use when working on public datasets.
2.5 Conclusion
In this chapter, we presented the company that supported the thesis and its industrial needs: it provides video platforms to customers. In order to improve its products, it would like to explore what Deep Learning has to offer for extracting metadata. Those metadata would be used to enrich searchability, sorting, user experience and recommendation. It then appears coherent to first focus on
extracting activity and face recognition to collect the most important features, then build a recommender
system using them.
We are developing three datasets, one for each task: HActions for activity recognition, HFaces for
face recognition, and HHistory for the recommender system.
In the next chapters, we will be exploiting those datasets in order to develop and improve models.
3 Machine Learning
3.1 Introduction
As introductory material, this chapter presents what Machine Learning is and how it works. This
chapter will illustrate the notions through the lens of activity recognition in images. However, the
foundations laid out here are not limited to this use case and are indeed more general. The explanations
will alternate between the specific use cases, to ease the intuitive understanding, and the more abstract
concepts in order to make the generality of the techniques clear.
At the end of this chapter we will illustrate the foundations laid here with a practical example. This
will allow us to introduce the basics of model training, evaluation, data augmentation, and fine-tuning
a pretrained model.
3.3.2 Neurons
Figure 3.1: A (very simplified) biological neuron and an artificial neuron (Perceptron)
Before understanding neural networks, let's focus on a single artificial neuron. Artificial neurons are loosely inspired by biological neurons. They model a neuron taking some input electrical signals, transmitting them through weaker or stronger dendrites (connections), summing them in the soma, and firing through the axon if the total amount is above a certain threshold (cf. Figure 3.1).
In computer science, this is approximated by a linear combination w (a weighted sum) of the input x, followed by a threshold b and an activation function σ. An artificial neuron is then σ(wᵀx + b), with σ being any non-linear function. Training a neuron means finding the weights w, b that work best for solving a given problem. This is sometimes still called a Perceptron, in reference to the original paper by Rosenblatt [134].
While today we often choose the Rectified Linear Unit (ReLU) σ(x) = max(0, x) for its good numerical and computational properties [116], researchers initially used a sigmoid, tanh or step function. Elaborating further on activation functions is beyond the scope of this manuscript; it remains an active area of research with new propositions, though none of them has outperformed the performance / computational cost tradeoff of ReLUs.
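To make this concrete, here is a single artificial neuron σ(wᵀx + b) with a ReLU activation, written as a toy NumPy sketch (an illustration, not code from the actual projects):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b):
    """A single artificial neuron: weighted sum of the inputs plus a bias,
    passed through a non-linear activation."""
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input signals
w = np.array([0.8, 0.1, -0.4])   # connection weights (dendrites)
b = 0.2                          # bias / threshold
print(neuron(x, w, b))           # non-zero only if w.x + b > 0
```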
Figure 3.2: Overview of the training loop: input data is preprocessed (normalization, cropping, resizing, etc.), the neural network produces a prediction, the loss function compares it to the ground truth and outputs an error score, and the optimizer uses the gradients to update the network.

Training is iterative: at each step, we sample from the dataset an input x and its correct output y. We compute ŷ = f (x), the output of the network. Then, a differentiable predefined
function L(ŷ, y) computes the distance between the generated output and the expected correct output.
By differentiating this distance wrt to the parameters θ, we get the direction in which to move θ to
reduce the error. θ is moved a bit (a step of size α) in this direction and the iterative process continues,
and f learns to provide outputs closer and closer to the target output. This process is called Gradient
Descent. In Deep Learning, however, Stochastic Gradient Descent (SGD) is used. The stochasticity
comes from the fact that we compute the gradient on a random subset of the dataset (a batch or mini-
batch), each iteration. This helps regularizing the model (thus providing greater generalization), and
descending on the full dataset is often not practically doable anyway since they are usually too big to
fit in memory along with the gradient information. Each iteration thus fundamentally computes

θ := θ − α∇θ L(f (x), y)
However, as we will later see, this update equation can be made more sophisticated in order to
reduce the number of steps needed to converge and / or to improve generalization. These equations are
called optimizers.
This process is summarized in Figure 3.2.
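The loop of Figure 3.2 maps almost line for line onto code. Below is a minimal PyTorch sketch of one stochastic gradient descent step; the model and data are placeholders, not the thesis experiments:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder f(x; θ)
loss_fn = nn.CrossEntropyLoss()                              # L(ŷ, y)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)      # step size α

def train_step(x, y):
    optimizer.zero_grad()
    y_hat = model(x)          # prediction ŷ = f(x)
    loss = loss_fn(y_hat, y)  # error score
    loss.backward()           # gradients of the loss wrt θ
    optimizer.step()          # θ := θ - α ∇θ L
    return loss.item()

# one stochastic step on a random mini-batch
x = torch.randn(128, 1, 28, 28)
y = torch.randint(0, 10, (128,))
print(train_step(x, y))
```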
Figure 3.3: Example of data augmentation. The original image is transformed to artificially generate new training examples. In this case, AutoAugment is used; it combines multiple transformations and its settings depend on the dataset and the task. Figure from torchvision's documentation. Each row shows the set of augmentations used for those datasets on a sample image.
Tremendous progress has been made thanks to augmentation strategies [178, 29, 180, 36]. Examples
from AutoAugment from Cubuk et al. [28] are shown in Figure 3.3.
For instance, [36] randomly erases rectangular parts of the image in order to encourage robustness, by forcing models to rely on several, less salient features.
In this work, we will mainly use TrivialAugment [115]. It is a recent and very simple augmen-
tation algorithm from which we can learn that despite a lot of complicated methods to automatically
find augmentation policies (AutoAugment [28], RandAugment [29], ...), the simplest method performs
comparably or better.
TrivialAugment defines an augmentation as ”a function mapping an image x and a discrete strength
parameter m to an augmented image”. It uses a collection of predefined classic image transformations
(color, contrast, rotation, ...). It randomly selects one of those operations per sample, and randomly
samples its strength parameter m. ”The strength parameter is not used by all augmentations, but most
use it to define how strongly to distort the image.”
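In practice, plugging TrivialAugment into a pipeline is a one-line change. A sketch with torchvision, which ships a TrivialAugmentWide transform in recent versions (assumed available here):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.TrivialAugmentWide(),   # one random op and one random strength per sample
    transforms.ToTensor(),
])
# the evaluation pipeline deliberately omits augmentation
eval_transform = transforms.ToTensor()
```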
3.3.6 Architecture
What those layers of neurons compute and how they are arranged defines an architecture or topology.
When two or more Perceptron layers are stacked, it is called a Multi-Layer Perceptron (MLP).
However, Perceptrons do not perform well on natural image data. They are too general and
Figure 3.4: AlexNet architecture, built from convolutions (CONV), pooling operations (POOL), and linear layers
(FC). Figure from Krizhevsky et al. [88]
somehow too powerful for computer vision: a Perceptron treats each and every input as a separate variable, but pixel values are not independent variables. First, they are spatially correlated, as the world is compositional and exhibits many invariances in translation, scale, and orientation.
Perceptrons have to learn to perform the same operations everywhere on the picture, and, in practice
they don’t. Replacing Perceptrons with convolutions with learnable weights gave birth to Convolutional
Neural Networks (CNNs).
A convolutional layer is a Perceptron in disguise: it is applied repeatedly and identically on small spatial patches of the input. The size of that input patch is called the kernel size. We compute many different convolutions on the same input (this is the width of the convolutional layer), just as a Perceptron computes multiple different linear combinations of its input; each resulting spatial map is called a feature map or channel. Finally, the convolution kernel can skip some input positions (strided convolutions) in order to downsample the input resolution.
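The vocabulary above maps directly onto the arguments of a convolutional layer. A small PyTorch illustration (arbitrary sizes chosen for the example):

```python
import torch
import torch.nn as nn

# 64 feature maps (layer width), 3x3 kernel, stride 2 to downsample
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 224, 224)    # a batch of one RGB image
y = conv(x)
print(y.shape)                     # torch.Size([1, 64, 112, 112]): 64 channels, halved resolution
```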
Using convolutions instead of Perceptrons builds into the model what are called inductive biases: they perform local operations (addressing compositionality) and perform them the same way everywhere on the picture (addressing translational invariance). CNNs rose to fame in 2012 when they beat their competitors in the ImageNet classification challenge by a large margin and created the deep learning hype we know today [88].
The convnet that made deep learning attractive by winning ImageNet is AlexNet [88] (see Figure 3.4). It contains 5 conv layers, 2 max pooling layers, and 3 linear layers (also called Fully Connected layers). Max pooling layers are meant to reduce the spatial size in order to reduce both memory and computation consumption, and to increase the working region of each convolution. As the image is processed through the convnet, it produces layers of activations that get abstracted into neural representations. Deep representations have a low spatial resolution but rich semantics [92, 174].
The design of convnets raises many questions: how many channels? what kernel shape? what
stride? Should we pool? It is hard to answer those, so, when designing GoogLeNet / Inception [146]
(Cf Figure 3.5), the Google team stacked many layers of complex building blocks. Each of those blocks
is composed of parallel paths taking different design decisions. At train time, the net can learn to use
each of those paths to its best.
However, those questions might not be that important after all. The VGG net [141] (cf. Figure 3.6) uses only 3x3 convolutions, and doubles the number of channels after each pooling operation. Its simplicity encouraged researchers to invest time into simple designs with better fundamental principles rather than complex models.
He et al. [59] observed that VGG nets could not be made very deep (no more than 20 layers).
They hypothesized that the gradients quickly lose their supervision quality going back through the
layers, failing to update the first layers. From this, they chose to design residual networks, where convs
compute additive residual transformations. The gradients can then flow unchanged along the identity path and keep their informative quality. The Residual Network (ResNet, cf. Figure 3.7) can go at least up to 1000 layers and still learn, despite reaching diminishing returns after 150 layers. The 50-layer variant is the most widely used because of its compute cost / performance tradeoff. ResNets were such an improvement that most of the following architectures integrated residual connections.
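A basic residual block can be sketched in PyTorch as follows; this is a simplified illustration (same number of input and output channels, no downsampling), not the exact ResNet block:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = x + F(x): the convolutions only learn a residual correction."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(x + residual)   # identity path + residual path

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # spatial shape and channels preserved
```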
Note: Perceptrons are making their great comeback in image processing [153], mostly through
Transformers (Perceptrons with an attention mechanism) [158, 40]. However, it does not invalidate
what was said previously: while their performance scales better wrt big data quantity than State Of The
Art (SotA) CNNs, they perform worse than CNNs on smaller training sets (ImageNet-1k being con-
sidered ”small” in this context). Indeed, despite limiting convnets at scale, the convolutional inductive
biases embody some knowledge about natural images. Transformers based architectures need some
more data to discover those invariances and close the gap, but are able to surpass CNNs with even more
data. There are works trying to suggest inductive biases to vision transformers that the network can
un-learn if needed, in order to make them perform on par or better than convnets in lower data regimes
[32, 31] and always benefit from their scalability.
Figure 3.6: VGG network. Grey: 3x3 convolution layers; red: pooling layers; blue: linear (or 1x1 convs) layers;
green: softmax. Each activation is described as Height × Width × Channels.
SGD is slow and prone to underfitting as it has no mechanism to escape local minima.
For weights θ, gradient of the loss ∇J, and a positive hyperparameter learning rate α, SGD computes:

(3.2) θ := θ − α∇J

SGD with Momentum (SGDM) accumulates a velocity v, weighted by a momentum coefficient β, and moves the weights along it:

(3.3) v := βv + ∇J
      θ := θ − αv
Another notable variant is Nesterov momentum. Here, the gradient evaluation is performed after the momentum step of the parameters, contrary to SGDM.
Adam
Another family of optimizers, the adaptive optimizers, made popular by Adaptive Moment Estimation (Adam), is said to be less sensitive to hyperparameters and especially to the learning rate. It works by scaling the learning rate of each weight by the rolling variance of its gradient in recent history. It introduces new hyperparameters, β1 and β2, respectively defaulted to 0.9 and 0.999. It also considers t, the index of the current iteration, and ε, a small constant for numerical stability.

(3.4) m := β1 m + (1 − β1)∇J
      v := β2 v + (1 − β2)(∇J)²
      m̂ := m/(1 − β1^t)
      v̂ := v/(1 − β2^t)
      θ := θ − α m̂/(√v̂ + ε)
Figure 3.7: In ResNets, an identity path is added every two convolutions, so that the gradient can flow up to the
first layers untouched. Figure from [59]
Despite its advantages in convergence speed and hyperparameter tuning, Adam has not fully taken precedence over SGD, as it tends to converge to results only close to SGDM's performance.
The literature has explored many variants, including AdamW, RAdam, AdaMax, AMSGrad, AdaBelief, etc.
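Both optimizer families are available off the shelf; constructing them in PyTorch with the hyperparameters discussed above could look like this (the model is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

# SGD with (optionally Nesterov) momentum
sgdm = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Adam with its default moment coefficients β1 and β2, and stability constant ε
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```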
Cross-Entropy
Cross-entropy measures the difference between two discrete probability distributions pa and pb for a random variable A with realizations a.

(3.5) H(pa, pb) = − Σ_a pa(a) log pb(a)
We can use it as a loss function. This implies that ŷ = f (x), our neural network, must be interpreted
as the conditional probability distribution pf (ŷ|x) and the target outputs as a probability distribution
p(y|x). Minimizing the cross-entropy (that is, using it as a loss function) reaches its optimum when the
two distributions are identical.
In order to train a classifier, we consider the special case with the target distribution p(y|x) defined
as a categorical distribution over the possible classes. This distribution assigns a probability mass of 1
for the correct class, 0 otherwise. Minimizing the cross-entropy in this situation is strictly equivalent to
maximizing the predicted probability of the target class, or minimizing the negative log likelihood of
the target class. The latter formulation is preferred.
A neural network does not produce calibrated probability distributions on its own, but unconstrained
scalars. They are often called logits, as unnormalized log parameters of the categorical distribution,
interpreted as the output of the logit function. Those logits zi can be normalized into the categorical distribution parameters o with the softmax function.
(3.6) oi = exp(zi τ) / Σ_j exp(zj τ)
Where τ is an optional temperature parameter that controls the sharpness / entropy of the distribu-
tion.
While not being totally accurate, some people call the cross-entropy loss the softmax loss.
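A small NumPy sketch of Equations 3.5 and 3.6, computing a softmax over logits and the cross-entropy against a one-hot target (illustrative only; subtracting the maximum logit is a standard stability trick that does not change the result):

```python
import numpy as np

def softmax(z, tau=1.0):
    """Turn unnormalized logits z into categorical probabilities (Eq. 3.6)."""
    e = np.exp(tau * (z - z.max()))
    return e / e.sum()

def cross_entropy(p_target, p_pred):
    """H(p_target, p_pred) = -sum_a p_target(a) log p_pred(a) (Eq. 3.5)."""
    return -np.sum(p_target * np.log(p_pred))

logits = np.array([2.0, -1.0, 0.5])
target = np.array([1.0, 0.0, 0.0])   # one-hot: the correct class is class 0
probs = softmax(logits)
print(cross_entropy(target, probs))  # equals -log(probs[0]): the negative log likelihood
```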
Mean-Squared Error
When trying to predict continuous values, one’s default loss function choice is the L2 distance, or
Mean-Square Error (MSE). It penalizes the prediction more as the difference to the target grows (Figure
3.8). When there is ambiguity, optimizing the MSE will result in predicting the mean value over the possible targets.
Figure 3.8: Gradient field of the L2 loss function over a R2 plane. The black line shows the minimization
trajectory from the black dot. We observe that this loss penalizes each variable proportionally to its value.
L1 Loss
The L1 loss might be used as well. L2 makes sure that predictions do not diverge too much from the target value; L1 rather encourages exact predictions by penalizing any amount of divergence equally (Figure 3.9). If there is ambiguity, optimizing the L1 distance will result in predicting the median over the possible target values.
Figure 3.9: Gradient field of the L1 loss function over a R2 plane. The black line shows the minimization
trajectory from the black dot. We observe that this loss penalizes equally each variable, encouraging sparsity.
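The mean-versus-median behaviour can be checked numerically: for a set of ambiguous targets, the MSE is minimized by their mean while the L1 loss is minimized by their median. A toy NumPy check (not thesis code):

```python
import numpy as np

targets = np.array([0.0, 0.0, 10.0])      # ambiguous ground truths for the same input
candidates = np.linspace(-5, 15, 2001)    # candidate scalar predictions

mse = ((candidates[:, None] - targets) ** 2).mean(axis=1)
l1 = np.abs(candidates[:, None] - targets).mean(axis=1)

print(candidates[mse.argmin()])   # ≈ 3.33, the mean of the targets
print(candidates[l1.argmin()])    # 0.0, their median
```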
Note: Numerical Considerations When working with probabilities, computers might run into issues.
Probabilities pi are often multiplied together and can become small, to the point where there could be
representation issues with standard IEEE754 32 bits floats. For this reason, when possible, we instead
manipulate log probabilities, taking advantage of this property:
(3.9) ∏_i pi = exp( Σ_i log pi )
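A quick numerical check of Equation 3.9 and of why log probabilities are preferred: multiplying many small probabilities underflows in float32, while summing their logarithms stays representable:

```python
import numpy as np

p = np.full(1000, 0.01, dtype=np.float32)   # 1000 probabilities of 1%

print(np.prod(p))            # 0.0: the product underflows in float32
print(np.sum(np.log(p)))     # ≈ -4605.17: the log of the product, still representable
```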
MNIST
MNIST [91] is a dataset of 28x28 greyscale images representing handwritten digits labeled with the ground truth digit. It contains 60k training pictures and 10k testing pictures. Classifying those digits is such a simple task that an SVM reaches 0.8% error. In deep learning, MNIST is used as a sanity check for debugging or to introduce novel ideas, for instance with artificially introduced label noise.
CIFAR-10/100
CIFAR-10 [87] contains 50k training 32x32 color images of 10 classes (airplane, automobile, bird, cat,
deer, dog, frog, horse, ship, truck). There are 10k test images. This dataset is often used for developing
new strategies. Optimizers, architectures, augmentations, etc. are often tested and calibrated on CIFAR-10 before being battle-tested on ImageNet.
CIFAR-100 is similar but extended to 100 classes, 600 images each. It is not as commonly used as
CIFAR-10.
ImageNet-1K / ILSVRC12
ImageNet (sometimes called ILSVRC for Image Large-Scale Visual Recognition Challenge) [136, 33]
is a large scale database of natural images crawled from the web, divided into 1k classes. It contains about 1M training pictures (roughly 1k per class) and 50k testing images.
Since 2012, ImageNet has been the gold standard dataset for evaluation and comparison of clas-
sifiers. It is also commonly used to extract knowledge from natural images in order to build fea-
ture extractors or backbones for assembling models together [85]. An expanded version of ImageNet,
ImageNet-21k has been released. Models able to ingest lots of data are usually pretrained on it, before
being tested on ImageNet-1k.
The idea behind using a pretrained model is to first train it on a big and generic dataset such as ImageNet, so that the model would learn generic patterns, filters, shapes or objects that would be useful for other datasets or other tasks as well.
The parameters and architecture of the net are then kept untouched (or "frozen") except for the last layer(s), which are trained on the new task. The frozen layers are used as a fixed feature extractor. This way, we only learn a small model from semantically rich features, instead of a bigger model from raw pixels. This allows reusing the natural image knowledge extracted from ImageNet. It helps prevent overfitting on meaningless spurious patterns in the case of small training sets, and reuse knowledge for a different task. In addition to the performance advantage, this is much faster than training the whole network, as the features can be pre-computed only once.
After these last layer(s) have been retrained, it sometimes helps to fine-tune all the parameters of the model by iterating a little more on the dataset, this time training the whole network with a very small learning rate. This allows the model to extract a few more useful or specific features that were not learned during the pretraining, while trying not to overfit. When possible, it helps to know how the base network has been trained, as regularizers can improve ImageNet accuracy but reduce transferability [85, 86].
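A sketch of this procedure in PyTorch with a torchvision ResNet; the weights API follows recent torchvision versions and the 14-class head matches HActions, but this is an illustration rather than the exact project code:

```python
import torch.nn as nn
from torchvision import models

# older torchvision versions use models.resnet18(pretrained=True) instead
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# freeze the pretrained backbone: it becomes a fixed feature extractor
for param in model.parameters():
    param.requires_grad = False

# replace the last layer with a freshly initialized head for our 14 classes
model.fc = nn.Linear(model.fc.in_features, 14)

# later, for fine-tuning: unfreeze everything and train with a very small learning rate
# for param in model.parameters():
#     param.requires_grad = True
```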
3.4.3 Experiments
In order to demonstrate what we have laid down so far, we run a quick set of typical experiments with a ResNet-18. All experiments are conducted with SGDM; weight decay (regularizing the square of the weights) is set to 1e-3, momentum to 0.9, and batch size to 128, split across 4 GPUs. The learning rate is searched in {0.3, 0.1, 0.01, 0.001, 0.0001}. We train for 40 epochs. The learning rate (lr) decays linearly from its initial value to 0 at the end of the training, as [95] shows this to be a sensible choice in common scenarios. The initial data augmentation includes random horizontal flipping, and the pictures are resized to 128x227, which is unusual but keeps the 16:9 aspect ratio of the frames. We use a standard cross-entropy loss. A run takes about 1h.
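The configuration described above maps to a few lines of PyTorch. A hedged sketch (the actual experiments are run through Torchelie and differ in details):

```python
import torch
from torchvision import models

model = models.resnet18(num_classes=14)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-3)

epochs = 40
# learning rate decays linearly from its initial value towards 0 over training
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1 - epoch / epochs)

for epoch in range(epochs):
    # ... one pass over the training set with batch size 128 ...
    scheduler.step()
```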
We wish to verify and exemplify the benefits of a pretrained model and data augmentation, known
to be among the top strategies to improve a natural images classifier.
The results, summarized in Table 3.1, show the power of pretraining and data augmentation, allow-
ing performance boost with fast experiments.
Config-A trains from scratch. The best learning rate found is 0.1. It reaches an accuracy of 37.82%.
Config-B verifies that weights pretrained on ImageNet actually boost the accuracy. All the Batch-
Norm layers are kept in inference mode during training and the running statistics are not updated. The
learning rate for all the layers but the last is set to 0 during the first 4 epochs and divided by 100 for
the following iterations. The best learning rate found is 0.01. This significantly boosts the accuracy by
15%, reaching 52.44%.
Config-C: Config-B is found to perform significantly better than Config-A. Config-C adds random horizontal flipping of input images. Adding this transformation brings the random initialization (C1) to 43.87%, and Config-B to 55.67% (C2).
Config-D adds inception-style random cropping to Config-C. It crops a random area from 80% to
100% of the image and resizes it to the input size.
3.5 Conclusion
We demonstrated how the building blocks explained in this chapter (Figure 3.2) can be assembled in order to train and utilize a learning algorithm. Pre-trained weights for the most common classification models are readily available, and the data augmentation strategies are based on simple image manipulations. We showed how one can leverage those to quick-start a project with little compute and data.
We trained a standard modern deep image classifier, leveraging battle-tested techniques. Yet, the accuracy we obtained, while usable for our use case, is somewhat below what can be expected from ImageNet-capable models. We are facing problems that arise with "real world" usage: low data availability, labeling cost, ambiguity in samples and class semantics, etc. For instance, while ImageNet is a dataset of natural images, it has some peculiarities that might not be considered "real world", notably a data collection protocol that biased the data (keywords from search engines), or the fact that each class is an object centered in the image and occupying the majority of its surface.
In the next chapters, we apply the same standard classifier to the problem of face recognition, but quickly observe the same phenomenon: some data-dependent problems arise and need mitigation. There are some specific types of label errors that need to be fixed, the face extraction pipeline sometimes extracts non-face images, and the data bears its own set of hard features (very narrow demographics, extreme facial expressions, varying makeup, etc). The next chapter investigates dealing with label noise so as to mitigate the most damaging aspect of our training set: erroneous training samples.
4.1 Introduction
Deep Learning systems have shown tremendous accuracy in image classification, at the cost of big,
manually labeled, image datasets. Collecting such amounts of data can lead to labelling errors in the
training set. Indexing multimedia content for retrieval, classification or recommendation can involve
tagging or classification based on multiple criteria. In our case, we train face recognition systems for actor identification with a closed set of identities, while being exposed to a significant number of distractors (actors unknown to our database). Face classifiers are known to be sensitive to label noise.
We review recent works on how to manage noisy annotations when training deep learning classifiers,
independently from our interest in face recognition.
Our client wishes to extract as much metadata as possible from their content. For the video content
we host, identifying the actors is very valuable. Indeed, those videos have recurring actors that are worth
identifying, among many unknown people that should be ignored. Users get to search for the content
from the same actors, or similar actors. In order to extract these data, we build a face recognition system.
We collect a face recognition dataset of our celebrities. First, we selected the 50 most popular celebrities on the platform and scraped pictures automatically from the internet, quickly verifying them manually, resulting in 1k+ pictures per identity. In a second phase, we collect data for all the lesser known actors we wish to recognize. As those second-phase actors are less known, it is harder to find data for them, making automatic web scraping unreliable, and we shift to a strategy that is both more reliable and less time consuming: human annotators are tasked to manually download between 10 and 100 pictures per identity, by descending order of popularity.
Bootstrapping and managing a large-scale dataset for face recognition requires either a lot of manual
collection and labelling or scraping data from internet. Either way, the data is complex and the process
is prone to error, and, when analyzing the data, we observe some recurrent error types:
• some people might be lookalikes and end up mixed up by human annotators or web resources (e.g. Vin Diesel and Dwayne Johnson);
• some might share a similar name and get scraped together, either by an automatic process or a
human annotator (like two women named Alexa);
• some others might appear frequently together and collecting one would probably get pictures of
the others as well (like Eric Judor and Ramzy Bédia). It might also be hard to collect pictures
where each person of interest is alone, and we might also end up collecting people sharing the
shot.
For this reason, it became important to detect and mitigate label noise. I wrote a survey in order to
learn the state of the art in detecting and handling label noise and its impact on image classification in
general, keeping face recognition in mind.
This section is a contributed review paper published in ICME2020 [138].
1) The most general model, Noise Not At Random (NNAR), considers, for a sample x having a true label y of class c, a complex corruption model for ŷ, depending on both y and x, P(ŷ = c|y = c′, x).

(4.1) P(ŷ = c|x) = Σ_{c′∈C} P(ŷ = c|y = c′, x) P(y = c′|x)
2) Noise At Random (NAR) assumes that label noise is independent from the sample content and occurs randomly for a given label. Label noise can be modeled by a confusion matrix C ∈ R^{|C|×|C|} that maps each true label to label observation probabilities. It implies that some classes y = c′ may be more likely to be corrupted into ŷ = c. It also allows the distribution of resulting noisy labels not to be uniform, for instance in naturally ambiguous classes. In other words, some pairs of labels may be more likely to be switched than others.
3) The least general model, called Noise Completely at Random (NCAR), assumes that each erroneous label is equally likely and that the probability of an error is the same among all classes. For an error probability E, it corresponds to a confusion matrix with P(E = 0) on the diagonal and P(E = 1)/(|C| − 1) elsewhere. The probability of observing a label ŷ of class c among the set of all classes C is

P(ŷ = c|x) = P(E = 0) P(y = c|x) + P(E = 1)/(|C| − 1) · (1 − P(y = c|x))
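To make the NCAR definition concrete, here is a small NumPy illustration that builds such a confusion matrix for 4 classes and a 20% error rate, then samples noisy labels from it (a toy example, not code from the survey):

```python
import numpy as np

n_classes, p_error = 4, 0.2
rng = np.random.default_rng(0)

# NCAR: the correct label is kept with probability 1 - p_error,
# otherwise it is flipped uniformly to one of the other classes
C = np.full((n_classes, n_classes), p_error / (n_classes - 1))
np.fill_diagonal(C, 1 - p_error)

true_labels = rng.integers(0, n_classes, size=10)
noisy_labels = np.array([rng.choice(n_classes, p=C[y]) for y in true_labels])
print(true_labels, noisy_labels, sep="\n")
```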
A general-purpose dataset could have a hundred classes that can easily be discriminated if they are all different enough. When the dataset is simple, true label correction can be provided without prohibitive costs. When it is not, a reviewer can sometimes provide a boolean verification saying whether the label is correct or not, which might be easier than recovering the true labels.
A dataset can then provide (1) no annotations, (2) corrected labels or (3) verified labels for a subset of its labels.
undecidable, or out-of-domain samples. MNIST [91] can be employed under the same protocols, with a reduced set of classes of handwritten digits, each composed of 1000 images.
Clothing1M [171] contains 14 classes of clothes for 1 million images. The images, fetched from
the web, contain approximately 40% of erroneous labels. The training set contains 50k images with
25k manually corrected labels, the validation set has 14k images and the test set contains 10k samples.
This scenario fits our low annotation complexity situation where labels can be corrected without too
much difficulty, but the size of the dataset makes a full verification prohibitive.
Food101-N [93] has 101 classes of food pictures for 310k images fetched from the internet. About
80% of the labels are correct and 55k labels have a human provided verification tag in the training set.
This dataset rather describes the high annotation complexity scenario where the labels are too numerous
and semantically close for an untrained human annotator to correct them. However, verifying a subset
of them is feasible.
Finally, WebVision [96] was scraped from Google and Flickr into a dataset mimicking ILSVRC-
2012 [33] (1k classes, 1.2M training samples), but twice as big. It contains the same categories, and
images were downloaded from text search. Web metadata such as caption, tags and description were
kept but the training set is left completely uncurated. A cleaned test set of 50k images is provided.
WebVision-v2 extends to 5k classes and 16M training images.
When working on image data, all the papers used classical modern architectures such as ResNet [59],
Inception [146] or VGG [141].
4.5 Approaches
4.5.1 Prediction re-weighting
Given a softmax classifier f (xi ) for a sample xi , prediction re-weighting mostly implies estimating the
confusion matrix C in order to learn CT f (xi ) in a supervised fashion with the noisy labels. Doing so
will propagate the labels’ confusion in the supervising signal to integrate the uncertainty about label
errors. The main difference between the approaches lies in the way C is estimated.
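As a minimal sketch of this family of approaches (the classifier f, the way C is estimated, and all names below are assumptions, not the exact procedures of the cited papers), the supervising signal can be corrected by propagating the confusion through the softmax output:

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_labels, C):
    """Cross-entropy on noisy labels after propagating the label confusion.

    logits:       (batch, num_classes) raw outputs of the classifier f(x)
    noisy_labels: (batch,) observed, possibly wrong, labels
    C:            (num_classes, num_classes) estimated confusion matrix,
                  C[true, observed] = P(observed label | true label)
    """
    probs = F.softmax(logits, dim=1)            # P(y = c | x)
    noisy_probs = probs @ C                     # corresponds to C^T f(x): P(y_hat = c | x)
    return F.nll_loss(torch.log(noisy_probs + 1e-8), noisy_labels)
```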
In Noisy Label Neural Networks (NLNN) [10], noisy labels are assumed to come from a real
distribution observed through a noisy channel. The algorithm performs an iterative Expectation Max-
imization algorithm. In the Expectation step, correct labels yi are guessed through CT f (xi ) while in
the Maximization step, C is estimated from the confusion matrix between guessed labels ỹi and dataset
labels ŷi . Finally, f (xi ) is trained on guessed labels ỹi . The process is repeated until convergence.
Taking a more direct approach, (Xiao et al, 2015) [171] estimate C by manually correcting
the labels of a subset of the training set. Then, a secondary neural network g(xi ) is defined, giving
to each sample a probability P (z1,i , z2,i , z3,i |xi ) of being either (z1 ) noise free, that is ŷi = yi , (z2 )
victim of completely random noise (NCAR), ie P (ŷi |yi ) = (U − I)yi such that the matrix U is uniform
and all rows of U − I sum to 1, or (z3 ) confusing label noise (NAR), P (ŷi |yi ) = CT ŷi . Finally, f (xi )
is trained on the noisy labels so as to minimize LCE (z1i f (xi ) + z2i (U − I)f (xi ) + z3i CT f (xi ), ŷi ) with
LCE the cross entropy loss function.
(Hendrycks et al, 2018) [62] first train a model on the dataset with noisy labels. This model is then
tested on a corrected subset and its prediction errors are used to build the confusion matrix C. Finally
f (xi ) is trained on the corrected subset and CT f (xi ) is trained on the noisy subset.
close to 0, the example has almost no impact on training. αi values larger than 1 emphasize examples.
If αi is exactly 0, then it is analogous to removing the sample from the dataset.
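A minimal sketch of this per-sample re-weighting, where the weights αi are assumed to come from whichever estimation strategy is in use (a noise model, loss ranking, a clean validation set, ...):

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, labels, alpha):
    """Per-sample weighted cross-entropy.

    alpha: (batch,) non-negative weights; 0 removes a sample, values > 1 emphasize it.
    """
    per_sample = F.cross_entropy(logits, labels, reduction="none")   # (batch,)
    return (alpha * per_sample).mean()
```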
Co-mining [167] investigates face recognition, where correcting labels is impractical for a large
number of identities, and which is most likely a situation of open-set noise. Two neural nets f1 and f2 are given
the same batch. For each net, the losses l1i = L(f1 (xi ), ŷi ) and l2i = L(f2 (xi ), ŷi ) are computed
for each sample and sorted. The samples with the highest loss for both nets are considered noisy and
are ignored. The samples s1,i and s2,i that have been kept by f1 and f2 are considered clean and
informative: both nets agreed. Finally, the samples kept by only one net are considered valuable to the
other. Backpropagation is then applied, with clean faces weighted to have more impact, valuable faces
swapped in order to learn f1 with s2,i and f2 with s1,i , and low quality samples are discarded.
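The routing logic can be sketched as follows, assuming a plain cross-entropy instead of the margin-based face loss used in the paper, and a fixed fraction of high-loss samples treated as noisy (both are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

def co_mining_step(f1, f2, x, y, noisy_frac=0.2, clean_weight=2.0):
    """One co-mining-style step: drop high-loss samples, swap disagreements."""
    l1 = F.cross_entropy(f1(x), y, reduction="none")
    l2 = F.cross_entropy(f2(x), y, reduction="none")

    k = int(len(y) * noisy_frac)
    noisy1 = set(torch.topk(l1, k).indices.tolist())     # highest losses for f1
    noisy2 = set(torch.topk(l2, k).indices.tolist())
    kept1 = set(range(len(y))) - noisy1
    kept2 = set(range(len(y))) - noisy2

    clean = kept1 & kept2      # both nets agree: clean and informative
    only1 = kept1 - kept2      # kept by f1 only: valuable to f2
    only2 = kept2 - kept1      # kept by f2 only: valuable to f1

    def total(loss, idx, w=1.0):
        return w * loss[list(idx)].sum() if idx else loss.sum() * 0.0

    # clean samples are weighted up; "valuable" samples are swapped between nets
    loss_f1 = (total(l1, clean, clean_weight) + total(l1, only2)) / len(y)
    loss_f2 = (total(l2, clean, clean_weight) + total(l2, only1)) / len(y)
    return loss_f1, loss_f2
```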
CurriculumNet [53] trains a model on the whole dataset. The deep features of each sample are
extracted, and from the Euclidean distances between feature vectors, a matrix is built. Densities are
estimated, 3 clusters per class are found with k-means, and ordered from the most to least populated.
Those three clusters are used for training a classifier with a curriculum, starting from the first with
weight 1, then the second and third, both weighted 0.5.
Iterative learning [168] chooses to operate iteratively rather than in two phases like Curriculum-
Net. The deep representations are analyzed throughout the training with a probabilistic variant of Local
Outlier Factor [17] for estimating the densities. Local outliers are deemed noisy. The unclean samples'
importance is reduced according to their probability of being noisy. A contrastive loss working on pairs
of images is added to the cross entropy. It minimizes the Euclidean distance between the representations
of samples considered correct and of the same class, and maximizes the Euclidean distance between
clean samples of different classes or between clean and unclean samples. The whole process is repeated until
model convergence.
We can also employ meta-learning by framing the choice of the αi as values that will yield a model
better at classifying unseen examples after a gradient step. (Ren et al, 2018) [131] performs a meta
gradient step on L = αi LCE (f (xi ), ŷi ), then evaluates the new model on a clean set. The clean loss
is backpropagated back through L, for which the gradient η gives the contribution of each sample to
the performance of the model on the clean set after the meta step. By setting αi = max(0, ηi ), the
samples that impacted the model negatively are discarded, and the positive samples get an importance
proportional to the improvement they bring.
CleanNet [93] learns what it means for a sample to come from a given class distribution, utilizing
a correct / incorrect tag provided by human annotators. A pretrained model extracts deep features of
the whole dataset. Then, they run a per-class K-Means, and find the images with features closest to the
centroids as a set vc of reference images for that class c. A deep model g(vc ) encodes the set into a single
prototype. A third deep model h(xi ) encodes the query image xi in a prototype. We learn to maximize
wci = cos(g(vc ), h(xi )) if xi has a correct class c, and to minimize it otherwise. This relevance score is
used to weigh the importance of that sample when training a classifier with max(0, wŷi )LCE (f (xi ), ŷi ).
Instead of getting a consistent wrong information from an erroneous label, NLNL [80] (not to be
confused with NLNN) samples a label ỹi ≠ ŷi and uses negative learning, a negative cross-entropy
version that minimizes the probability of ỹi for xi . As the number of classes grows, the more likely the
sampled label ỹi is to be indeed different from yi , mitigating noise, despite being less informative. Then
only samples with a label confidence above 1/|C| are kept and used negatively in a second phase called
Selective Negative Learning (SelNL). Finally, examples with confidence over a high threshold (0.5 in
the paper) are used for positive fine-tuning with a classical cross entropy and their label ŷi .
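A minimal sketch of the negative learning objective used in the first phase, with a complementary label sampled uniformly among the other classes (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def negative_learning_loss(logits, noisy_labels):
    """Minimize the probability of a randomly sampled complementary label."""
    num_classes = logits.size(1)
    # draw a label different from the (possibly wrong) given one
    offset = torch.randint(1, num_classes, noisy_labels.shape, device=logits.device)
    complementary = (noisy_labels + offset) % num_classes

    probs = F.softmax(logits, dim=1)
    p_comp = probs.gather(1, complementary.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - p_comp + 1e-8).mean()
```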
4.5.3 Unlabeling
Iterative Noise Filtering [117]: A model is trained on the noisy dataset. An exponential moving
average estimate of this model is then used to analyze the dataset. Samples classified correctly are
considered clean, while the label is removed for those classified incorrectly. The model is further trained
with both a supervised and unsupervised objective for labeled and unlabeled samples. The samples
with labels are used with a cross entropy loss. For each unlabeled sample, we maximize maxc f (xi )c
in order to reinforce the model’s prediction, while maximizing the entropy of the predictions over the
whole batch to avoid degenerate solutions. After each epoch, the dataset’s labels are evaluated again
according to the average model.
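The unsupervised part of the objective can be sketched as below: each unlabeled sample is pushed towards a confident prediction while the batch-level class distribution is kept spread out (the balance weight is an assumption):

```python
import torch
import torch.nn.functional as F

def unlabeled_objective(logits, batch_entropy_weight=1.0):
    """Loss for samples whose label was removed.

    - maximize max_c p(c|x) per sample (reinforce the model's own prediction)
    - maximize the entropy of the mean prediction over the batch
      (avoid collapsing every unlabeled sample onto the same class)
    """
    probs = F.softmax(logits, dim=1)
    confidence = probs.max(dim=1).values.mean()                          # to maximize
    mean_probs = probs.mean(dim=0)
    batch_entropy = -(mean_probs * torch.log(mean_probs + 1e-8)).sum()   # to maximize
    return -(confidence + batch_entropy_weight * batch_entropy)
```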
4.6 Discussion
Those approaches cover a wide variety of use cases, depending on the dataset: whether it has verified or
corrected labels or not, and the estimated proportion of noisy labels. They all have different robustness
properties: some might perform well at low noise ratios but deteriorate quickly, while others might have
a slightly lower optimal accuracy but not deteriorate as much at high noise ratios.
Re-weighting predictions performs better on flipped labels than on uniform noise, as shown in
the experiments on CIFAR-10 in Hendrycks et al, 2018 [62]. As noise becomes close to a uniform
noise, the entropy of the confusion matrix C increases, labels provide more diffused information, and
prediction re-weighting is less informative. CIFAR-10 being limited to 10 classes, NLNN [10] is shown
to scale with a greater number of classes on TIMIT.
Noisy samples re-weighting scales well: CurriculumNet [53] scales in number of samples and
classes as the experiments on WebVision show, Co-Mining [167] is able to scale to face recognition
datasets and open-set noise at the expense of training two models, CleanNet generalizes its noisy sam-
ples detection by manually verifying a few classes.
However, NLNL [80] may not scale as the number of classes grows: despite having negative labels
that are less likely to be wrong, they also become less informative.
We can expect unlabeling techniques to grow as semi-supervised and unsupervised methods
get better, since any of those can be used once a sample has had its label removed. One could envision
utilizing algorithms such as MixMatch [14] or Unsupervised Data Augmentation [172] on unlabeled
samples.
Similarly, the label fixing strategies could benefit from unsupervised representation learning to learn
prototypes that make it easier to discriminate between hard samples and incorrect samples. Deep self-learning
[55] is shown to scale on Clothing1M and Food-101N. It would be expected that those approaches
become less accurate as the number of classes grows or as the classes get more ambiguous. Some prior
knowledge or assumptions about the classes could be used explicitly by the model. Iterative Noise
Filtering [117] in its entropy loss assumes that all the classes are balanced in the dataset and in each
batch.
4.7 Conclusion
We explored the situation where a deep classifier has to be learnt on data with label noise, that is, con-
taining erroneous target labels. We explored the literature and showed that the approaches can be sorted
into four main categories: reweighting predictions using a noise model, reweighting the importance of
training samples based on their assessed probability of having a wrong label, unlabeling the suspicious
samples and using them with unsupervised training, or fixing the suspicious labels with new guesses.
Training a deep classifier using a noisy labeled dataset is not a single problem but a family of prob-
lems, instantiated by the data itself, noise properties, and provided manual annotations if any. As new types
of problems and solutions reveal themselves to academic and industrial deep learning practitioners,
agreeing on a single metric and on a more thorough, standardized set of tests might be needed.
This way, it will be easier to answer questions about the use of domain knowledge, generality, tradeoffs,
strengths and weaknesses of noisy-label training techniques depending on the use case.
In the face recognition system that we are building, label noise has varying causes: persons with
similar names; confusion with lookalikes; related persons that appear together; erroneous faces detected
on signs or posters in the picture; errors from the face detector that are not faces; and random noise. All
those situations represent label noise with different characteristics and properties that must be handled
with those algorithms. We believe those issues are more general than this scenario and find an echo in
the broader multimedia tagging and indexing domain.
From this study we mainly retain that samples with label errors produce higher losses than correctly
labeled ones. The next chapter will explore how to leverage this in order to manually curate our training
set. Furthermore, as we will show, a face recognition dataset contains pictures of many identities, but
can also include pictures that do not belong to any of the known identities. This situation is reminis-
cent of open-set label errors (when the correct label does not belong to any of the known labels) or
unlabeled samples.
Table 4.1: Approaches according to the annotations available in the dataset. Each approach is characterized by the
strategy it uses (reweight predictions, reweight samples, similarity to prototypes, unlabel samples, fix labels), the
noise setting it targets (NCAR/NAR/NNAR, closed- or open-set, synthetic or raw labels), the annotations it requires
(no correction, verified labels, corrected labels), and the benchmarks it was evaluated on (CIFAR-10 / MNIST,
Clothing1M, Food-101N, WebVision). The approaches compared are NLNN [10], Xiao et al, 2015 [171], Hendrycks
et al, 2018 [62], Ren et al, 2018 [131], Iterative learning [168], CurriculumNet [53], Co-Mining [167], CleanNet [93],
NLNL [80], Deep Self-Learning [55] and Iterative Noise Filtering [117]. Notes: TIMIT is a speech to text dataset, ”NLP”
is a set of natural language processing datasets (Twitter, IMDB and Stanford Sentiment Treebank), ”face rec”
denotes classical face recognition datasets (LFW, CALFW, AgeDB, CFP)
5 Face Recognition
5.1 Introduction
As previously mentioned, a face recognition system embedded on the customer’s platform would extract
metadata that could be interesting for our recommendation engine and valuable for the user experience.
We will first highlight how our use case is different from the general case, then present how face
recognition is usually tackled, and finally explore our system currently in production. We will empha-
size our contributions: a new metric-learning loss function, the threshold-softmax loss, and a study of
several methods for exploiting unlabeled faces during training.
In our industrial setting, we wish to identify some people of interest (”VIPs”) among many unknown
people in videos with unconstrained facial pose, illumination, expression and occlusion. The set of
identities of interest is known ahead of time, but many of the input pictures, if not most, show unknown
people that must be rejected by the system. This particular setting makes this an instance of a subject-
dependent open-set protocol, which we observe to be an understudied case, not even considered in
Wang and Deng [165] (Figure 5.1).
We believe this setting is particularly common in industrial contexts in which we are in-
terested in some people, like celebrities, for whom we can acquire datasets if needed, among many
unknown test-time distractors. One such example would be reidentifying famous YouTubers in fan
compilations, clips or video reuses. Another would be to reidentify famous actors in movies among
extras.
We call distractors faces whose identity is unknown to the system. It is expected that the system
rejects those faces and is aware that they are unknown. The presence or absence of distractors is what
differentiates open-set and closed-set settings. Depending on the data and task, the ratio of distractors
might be much greater or much smaller than that of VIPs. For us, the distractors dominate.
Hexaglobe’s Face Recognition system has some specificities compared to the algorithms laid out in the
literature:
1. Contrary to most works, this one is focused on identifying a known set of identities among distractors.
Most works instead deal with verifying if a query face is the same person as a key face, for
instance, verifying if the person at the customs is the same as the one on the passport being
presented.
2. Most works on classifiers focus on having a correct best guess, but this system does not have to
make a prediction for every input. The model rejecting some inputs because of uncertainty is a
better outcome than a failed prediction. Knowing that we don’t know is crucial here.
3. Our work is deployed at scale where it analyzes hundreds of user uploaded videos per day. Those
videos present real world faces in the wild, with occlusions, variable lighting conditions, image
resolution, quality and scale, not properly aligned, sometimes with extreme facial poses, making
this data more challenging than most datasets currently used in publications [54].
1. As a public video platform using the system for tagging videos with identities for search improve-
ments, we are only interested in tagging the most popular people. These people can be decided
ahead of time and our problem becomes a semi-open set face recognition problem: the set of
identities to recognize is known ahead of time but there are unknown distractors to reject.
2. Not all errors are equal. It is more harmful to add a wrong tag than to miss one. In our case,
precision is more important than recall. This makes it possible, even needed, to let the model
express its lack of confidence and not act on uncertain predictions.
3. As we are interested in overall video tagging and not per frame tagging, per frame errors are not
harmful if they can be smoothed out.
Besides, considering the volume of existing videos (approx 8M) and the number of new uploads
per day, processing a video should not take more than 5 minutes. In order to guarantee this processing
time, we extract 300 frames evenly spaced throughout the video.
Finally, since our system is meant to have a quickly growing set of persons of interest, we favor fast
iterations, both in model training time and time needed to add a subject to the set of known persons.
Figure 5.1: Figure 17 from Wang and Deng [165]. The comparison of different training protocols and evaluation
tasks in FR. In terms of training protocol, FR can be classified into subject-dependent or subject-independent
settings according to whether testing identities appear in the training set. In terms of testing tasks, FR can be classified
into face verification, closed-set face identification, and open-set face identification.
Figure 5.2: Training of a classification-based metric learning algorithm. The algorithm is trained to classify faces
into different identities, with a margin in the softmax and constraints on the representation.
discriminative features that generalize well (Megaface [79], VGGFace2 [21], MS1M [54], ImdbFace
[162], IJB-A,B,C [82, 169, 106]).
Those representations are then extracted from reference images and compared independently with
the input image’s representation (Figure 5.3).
When tasked to identify, the input image (the “probe”) is encoded in a feature vector and compared
to all the encoded reference picture vectors (the “gallery”), aiming for minimum distance on the correct
identity. This strategy has some shortcomings outlined in [82]: it is unclear how to aggregate various
feature vectors from several images of the same identity, and current systems are not precise enough
to reject all the images in the gallery for an unknown probe identity when the gallery has many pictures,
triggering false detections. For testing, LFW [71] is a common test set as well.
Figure 5.3: Test time usage of feature vectors. Two images’ representations are compared under a distance metric.
A distance under a predefined threshold indicates the same identity.
the same person and dissimilar vectors for different people. The similarity measure is often Euclidean or
angular.
where Ai and Aj are two different pictures of a person A, and Bk is a picture of a different person
B.
While this works, it requires a lot of time to converge.
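Assuming a standard triplet formulation over the three pictures above (the margin value and the Euclidean distance are illustrative choices, not necessarily the exact variant discussed here), such a metric learning objective looks like:

```python
import torch
import torch.nn.functional as F

def triplet_loss(emb_ai, emb_aj, emb_bk, margin=0.2):
    """Pull two pictures of person A together, push a picture of person B away.

    emb_ai, emb_aj: embeddings f(A_i), f(A_j) of the same identity
    emb_bk:         embedding f(B_k) of a different identity
    """
    d_pos = F.pairwise_distance(emb_ai, emb_aj)   # should become small
    d_neg = F.pairwise_distance(emb_ai, emb_bk)   # should become large
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```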
5.3.2 Softmax
Posing the problem as a classification task allows us to leverage the efficiency of the cross-entropy loss
and the discriminative capability of neural networks. In order to frame f as a classifier, we need to
define
Here, W is a learnable k × d matrix where k is the number of identities to classify and d is the
dimension of the embedding row vector produced by f . The ith row of W can be interpreted as the
prototypical embedding for class i.
At inference time, W is discarded and a distance function is used to compare the embeddings. It is
hoped that f was trained on enough identities in order to make a rich, semantically meaningful vector
space. During training, as the embeddings were discriminated with a linear classifier, a parameter-free
distance function should be able to discriminate embeddings from people not seen during training.
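A minimal sketch of this setup, with a linear head W over the embedding during training and a parameter-free cosine comparison at test time (the backbone, dimensions and threshold are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingClassifier(nn.Module):
    def __init__(self, backbone, embed_dim, num_identities):
        super().__init__()
        self.backbone = backbone                                     # f: image -> (batch, embed_dim)
        self.W = nn.Linear(embed_dim, num_identities, bias=False)    # rows act as class prototypes

    def forward(self, x):
        return self.W(self.backbone(x))      # logits, used only during training

    @torch.no_grad()
    def same_identity(self, x1, x2, threshold=0.5):
        """Test-time verification: W is discarded, embeddings are compared directly."""
        e1 = F.normalize(self.backbone(x1), dim=1)
        e2 = F.normalize(self.backbone(x2), dim=1)
        return (e1 * e2).sum(dim=1) > threshold    # cosine similarity above a threshold
```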
5.3.3 L2-softmax
The softmax has some notable drawbacks: it does not enforce positive pairs to remain close together and
negative pairs far apart; it is biased by the imbalance of the training distribution; and uncertain samples
produce low-confidence decisions that are poorly penalized.
Ranjan et al. [129] (Figure 5.4a) propose to fix all of those by enforcing an L2 constraint on both
the output vector of f and each row of W . Instead of maximizing the softmax output with maximum
inner product between the correct row of W and f (x), this aims to maximize (minimize) the cosine
similarity between the correct (incorrect) row of W and f (x). If we accept the notation ⟨n⟩ for a matrix
n with each row vector normalized to unit length, the L2-softmax is

(5.3)    L(x, y) = − log ( exp(s cos θy ) / ∑i exp(s cos θi ) )
where the scalar s can be either interpreted as a radius or the temperature of the softmax. The higher
the s, the more the softmax will resemble a hard argmax. As cos is bounded in [-1; 1], it is necessary to
control the sharpness of the softmax with a temperature in order to control the gradient magnitude. θ is
the vector of the angles between f (x) and each row of W .
Note that in the equation above, cos is applied independently to each row of W and f (x), also
making θ a vector of size k representing the angles between f (x) and each row of W .
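A sketch of this scaled cosine softmax, with cos θ computed between the normalized embedding and the row-normalized W (the scale value is an illustrative default):

```python
import torch
import torch.nn.functional as F

def l2_softmax_loss(embeddings, W, labels, s=30.0):
    """Scaled cosine softmax: logits are s * cos(theta) between f(x) and the rows of W."""
    emb_n = F.normalize(embeddings, dim=1)    # unit-length embeddings
    W_n = F.normalize(W, dim=1)               # unit-length class prototypes
    cos_theta = emb_n @ W_n.t()               # (batch, k) cosine similarities
    return F.cross_entropy(s * cos_theta, labels)
```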
5.3.4 ArcFace
ArcFace [34] (Figure 5.4b) builds on this by enforcing a margin between classes. The L2-softmax (Eq.
5.3) emits a valid high probability as soon as f (x) has its smallest angle with the prototype of the
correct class in W . The authors deemed this insufficient and wanted to add another guarantee: small
perturbations to f (x) due to input distribution shifts or hard samples should not be predicted as another
class. As such, they add a margin hyperparameter m to the equation to repel the embedding by m. By
adding m radians to the angle of the correct class, the angular distance to the other classes is forced to be
greater than the margin:

L(x, y) = − log ( exp(s cos(θy + m)) / ( exp(s cos(θy + m)) + ∑i≠y exp(s cos θi ) ) )
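As a sketch, the additive angular margin only changes the logit of the target class before the same scaled softmax (the clamp is a numerical-safety detail, and s and m are illustrative defaults):

```python
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, W, labels, s=30.0, m=0.5):
    """Add a margin of m radians to the angle of the correct class only."""
    cos_theta = F.normalize(embeddings, dim=1) @ F.normalize(W, dim=1).t()
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    one_hot = F.one_hot(labels, num_classes=W.size(0)).float()
    logits = s * torch.cos(theta + m * one_hot)    # margin applied to the target angle
    return F.cross_entropy(logits, labels)
```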
Figure 5.4: Comparison of the angular softmax, ArcFace and the proposed Threshold-Softmax. In ArcFace, the
margin (in blue) is fixed but the width of the arcs of each class can be arbitrarily wide (or narrow), since there is
no constraint on them. In threshold-softmax, there are no enforced margins but the decision boundaries have a
fixed width. An artificial class is predicted outside of those trusted cones. Figure derived from [163].
I achieved this by concatenating an artificial entry cos(m) to θ. We say this entry has class Ω. If
all the angles in θ are greater than m, then this artificial entry is maxed out by the softmax function,
leading to a high loss until the correct angle is finally smaller than m, with m being a hyperparameter. The
loss function can be expressed as:
Figure 5.4 highlights the difference with standard angular softmax, ArcFace, and Threshold-
Softmax.
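The following is only a sketch of how the artificial Ω entry could be implemented from the description above, assuming scaled cosine logits and that negative samples are simply labeled with the Ω index; it is not the exact implementation used for the experiments:

```python
import math
import torch
import torch.nn.functional as F

def threshold_softmax_loss(embeddings, W, labels, s=30.0, m=0.5):
    """Append an artificial logit at cos(m); its class index is Omega = k.

    Known identities carry a label in [0, k-1]; negative samples (unknown
    people) can be labeled k so that they are pushed outside every trusted cone.
    """
    cos_theta = F.normalize(embeddings, dim=1) @ F.normalize(W, dim=1).t()   # (batch, k)
    omega = torch.full((cos_theta.size(0), 1), math.cos(m), device=cos_theta.device)
    logits = s * torch.cat([cos_theta, omega], dim=1)                        # (batch, k + 1)
    return F.cross_entropy(logits, labels)
```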
5.4.3 Evaluation
In the subject-independent scenario, the LFW [71] test set is commonly used to assess the quality
of face representation. It consists of 6k image pairs, half being faces of the same identity, half being of
different ones, formed with 13233 source images from 5749 people. The model has to decide whether
a pair of pictures belongs to the same identity or not.
State-of-the-art accuracy on this set surpassed 99% which drove the creation of newer and harder
sets.
Figure 5.5: Threshold-softmax with negative samples: crosses are negative samples. We do not know their
identities, we just know they do not belong to any of the known identities. Threshold-Softmax naturally uses
those samples by placing their identities outside of the known classes decision boundaries, ie, predicting class Ω.
Table 5.1: Accuracy on face verification for LFW and FGLFW image pairs for various loss functions. Best
rejection angular threshold selected for each method.
FGLFW [35] reuses pictures from LFW but selects harder pairs. DeepFace [148] has an accuracy
of 92.87% on LFW but 78.78% on FGLFW.
The Megaface challenge [79] extends this beyond pairs. They propose an input ”probe” picture and
a gallery of ”candidate” pictures. The model has to identify which picture in the gallery is of the same
person as the probe.
IJB-A,B,C propose a collection of challenges including verification and identification in both pic-
tures and videos.
Figure 5.6: Accuracy on LFW and FGLFW according to the hyper parameter threshold value m for Threshold-
Softmax.
It is interesting to note that ArcFace’s margin is not mutually exclusive with the cones of trust of
Threshold-Softmax and the two could be combined. This is left as future work.
5.4.5 Conclusion
We proposed the Threshold-Softmax loss function that is able to use negative samples that are cheaper
to collect. The Threshold-Softmax proposes to learn face embeddings fitting a cone with an absolute
maximum angle, rather than imposing angular margins between classes. Negative samples are forced
into the negative space: outside of the regions allocated for the positive classes. We experimented with this
loss on MS1Mv2 and compared it to the state-of-the-art ArcFace. The Threshold-Softmax is competitive
with, though not always superior to, ArcFace, but presents the ability to learn from unlabeled negative
samples (unknown people not belonging to any positive class), halving the error rate in our tests on
LFW and FGLFW. Those cones are not mutually exclusive with ArcFace’s margins and future works
could include adding margins to the Threshold-Softmax.
1. a comparison of ArcFace and cross-entropy classifiers in the context of closed-set face recogni-
tion (Sections 5.5.1, 5.5.2, 5.5.5)
Figure 5.7: Rank 1 identification performance for contestants on Megaface’s Facescrub challenge under various
quantities of distractors. Most models see their performance degrade quickly even with only 100 distractors.
Figure from [79].
2. a search for a strategy to manually deal with label noise, curate and expand a face recognition
dataset for closed-set scenarios (Sections 5.5.3, 5.5.4)
3. a set of losses and their evaluations allowing to leverage unlabeled negative faces and make face
recognition system more robust to distractors (Section 5.5.5)
Calibration
Models trained with a cross entropy loss should exhibit a useful property: calibration. A model is said to
be calibrated when the predicted probability aligns with the actual correctness rate of the prediction. For
instance, labels predicted with probability 0.3 indeed have a 30% chance of being correct: the predicted
probability is equal to the true probability. Calibration is extremely important in some systems: we
could ignore predictions with confidence lower than 0.95 if we would not tolerate more than 5% of errors,
or a medical system could perform automatic labeling on high confidence predictions and require a
human expert label on lower scores.
Unfortunately, modern neural networks are poorly calibrated [52]. They have a high capacity and
end up overfitting on the training loss, predicting only high confidences. Guo et al. [52] proposes to fix
this by learning a temperature parameter before the softmax on a separate set. In our tests, this approach
was both extremely simple and efficient, for a negligible computation cost and a low code complexity.
Figure 5.12 shows calibration plots.
Computing a calibration plot is easy: one can bin predicted probabilities on the validation set, and
evaluate the actual correctness ratio in each bin. i.e.: For samples predicted with confidence 0.9-0.95,
the predicted label should be correct 90-95% of the time.
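A sketch of that computation, binning confidences and measuring the empirical accuracy in each bin (the number of bins is arbitrary):

```python
import numpy as np

def calibration_bins(confidences, correct, n_bins=10):
    """Reliability diagram data: per-bin mean confidence vs. empirical accuracy.

    confidences: (N,) max softmax probability of each prediction
    correct:     (N,) 1 if the predicted label was right, else 0
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean(), mask.sum()))
    return rows   # a calibrated model has mean confidence ~ accuracy in every bin
```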
Rejection
Hendrycks and Gimpel [61] claim that Out of Distribution (OoD) samples have a lower softmax prob-
ability than in-distribution samples. This is something we have not observed to be true or sufficient
to reliably reject distractors. Moreover, uncalibrated models make thresholding the rejection confidence
score hard. The predicted probabilities lose their meaning and become detached from the semantics
they represent, just reflecting the overfitting level of the model. The best threshold value might then
change between training runs and its value is not interpretable.
Figure 5.8: Each class (abbreviated to the first letter of the person’s name) is placed on this grid depending on
its precision and recall scores. Top right is best, bottom left is worst. We aim to find strategies that help moving
each point right and up. (Produced by the DCE model, Section 5.5.5)
2. Some VIPs seem to cause too much confusion and trigger too many false positives: rework the
data for those identities (looking for errors, low quality or confusing samples) in order to identify
harmful samples
3. Some VIPs are not detected in tests (false negatives) : add samples. Those identities have insuf-
ficient data to generalize correctly.
Which one of those actions to take is crucial in order to control the growth of the model and its
performance.
Principles
Diagnosing the current state of the dataset in order to choose what to do next happens to be easy thanks
to simple metrics. By computing precision / recall by class we get a view of the model performance.
Figure 5.9: A, B, and C are three fictitious identities for illustration purposes. We compute some metrics by
selecting various meaningful subsets from the confusion matrix: distractor accuracy (blue rectangle), identifica-
tion accuracy (orange rectangle), kept accuracy (violet rectangle), and total accuracy (green rectangle). For each
subset, the metric is computed as the sum of the green cells it contains divided by the sum of all cells it contains.
In order to easily visualize which classes need work, we plot each class on a grid, seen in Figure 5.8. We
empirically hypothesize that classes with low precision suffer from noisy training data and/or labels,
while classes with low recall indicate a lack of training data.
These assumptions bring insight into the model’s state, are easy to implement, interpret and use but
bear the drawback that some additional effort must be made in order to build a meaningful test set for
each identity. As we observe strong outliers on the precision/recall plots while building the dataset, we
hypothesize that the performance across identities is not correlated (or not enough), and therefore we
believe we cannot use a random subset of identities to estimate the overall performance.
In our situation, we accept trading recall for precision as false positives damage the user experience
with erroneous suggestions or search results. With false negatives, no labeling is performed, and the
user browsing is not impacted.
Results
In order to investigate these hypotheses, we select a class which performs well in both precision and
recall and run three experiments (with the DCE model, Section 5.5.5):
This data point seems to suggest that recall indeed correlates with the amount of data for a class.
Unfortunately, precision does not negatively correlate with label noise. The classifier seems to overfit
the noise rather quickly and learn a multimodal class. The question of detecting noisy classes remains
unanswered by this approach.
5.5.5 Experiments
We devise a set of experiments in order to evaluate the progress done on the task of face recognition
on HFaces. We emphasize the industrial context: this tool is to be used in order to label videos on the
customer’s platform. As such, it is better not to annotate a video than to predict a wrong label. Having
the product owner or production engineers able to set a false acceptance rate in production would
be a very interesting feature, as it would decouple model training from production settings.
We train our model on a subset of the 105 most popular VIPs among the 8k identities in our dataset;
the remaining ones are set as distractors. The model, a standard ResNet-18 pretrained on ImageNet, is
trained for 40k iterations with a batch size of 1024, Stochastic Gradient Descent, a linearly decaying
learning rate starting at 0.1 and a weight decay of 5e-4. The test set contains 119k pictures (resized to
96x96), with 24% of them being distractors.
For each experiment, we measure:
Distractor accuracy accuracy on the distractor set: among all distractors, how many have been pre-
dicted as such, ie, the true rejection rate (See Figure 5.9, blue rectangle; a computation sketch for these subset accuracies follows this list).
Identification accuracy the recognition accuracy among non-distractors (See Figure 5.9, orange rect-
angle).
Total accuracy accuracy on all test samples (See Figure 5.9, green rectangle).
Kept accuracy accuracy of kept (non rejected) predictions. We consider that we reject predictions
classified as distractors (See Figure 5.9, violet rectangle). This allows us to evaluate how many
false identification will be performed on the platform, and choose our precision / recall tradeoff.
True Positive Rate (TPR)@95, TPR@99 When techniques give scores with predictions and the
model performs reasonably well, wrong predictions are given a lower score and correct predic-
tions a higher score. This allows us to trade precision for recall, as we can find a threshold value
which rejects predictions with low scores until a chosen true positive ratio. Thus, we also com-
pute the identification and total accuracy for 95% and 99% true positives, giving us an estimate
of the recall for both test sets at those true positive rate goals.
F1 Area Under Curve (AUC) We also compute the area under the F1 curve as a function of the thresh-
old, as it captures the sensitivity of the model to the thresholding value. A high value (close to 1)
indicates that the model reaches its maximum F1 value regardless of the threshold, whereas a value
of 0.5 indicates that there is a high threshold sensitivity, a bad mean F1 score, or both, which are
all unsatisfactory.
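As announced above, here is a sketch of how the four subset accuracies can be read off a confusion matrix; the layout (VIP classes first, distractor class last) is an assumption matching Figure 5.9:

```python
import numpy as np

def subset_accuracies(cm):
    """cm[i, j]: number of samples of true class i predicted as class j.

    Classes 0..k-1 are VIPs; class k (the last one) is the distractor / rejected class.
    """
    k = cm.shape[0] - 1
    total_acc = np.trace(cm) / cm.sum()                              # green rectangle
    distractor_acc = cm[k, k] / cm[k, :].sum()                       # blue: true rejection rate
    identification_acc = np.trace(cm[:k, :]) / cm[:k, :].sum()       # orange: accuracy on VIPs
    kept = cm[:, :k]                                                 # predictions not rejected
    kept_acc = np.trace(kept[:k, :]) / kept.sum()                    # violet
    return distractor_acc, identification_acc, kept_acc, total_acc
```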
All these metrics are shown for the tested models in Figure 5.11 and Figure 5.12.
The total accuracy highly depends on the ratio of distractors in the test set, which is, in this situation,
arbitrary. While it remains an interesting metric, the mean F1 more closely captures our expectations: if
all the classes are given equal importance, what is the best precision / recall we can reach? Generalizing
this question to all possible thresholds (all possible VIP / distractor decision boundaries) gives us the F1
AUC which will be our metric of choice given that the model has a satisfying calibration curve. If this
condition is not met, the precision / recall tradeoff (and thus F1 score) can’t be set based on production
needs. A model able to control this tradeoff by thresholding the predicted scores is said to be amenable
to thresholding.
We compare several models: ArcFace, and cross-entropy classifiers. We train a simple classifier
(CE) and also explore several techniques for exploiting distractors from the training set, and reject-
ing them at inference: an extra class for distractors (DCE), maximizing entropy on distractors (ME),
Table 5.2: Various metrics for each model, rejection threshold selected at maximal total accuracy. Best and
second best results are highlighted
Table 5.3: Various metrics for each model, rejection threshold selected at maximal F1. Best and second best
results are highlighted
minimizing logits on distractors (ZLog). We report, in Table 5.2, the various metrics at the maximum
accuracy and at maximum mean-F1 in Table 5.3. The F1 AUC is reported in Table 5.4. As we shall see,
all models tested here exhibit satisfying calibration. Therefore, we present their performance when se-
lecting a confidence threshold giving 99% and 95% of true positives in Figure 5.10 as an additional
informative signal.
ArcFace
We first verify our claim that metric learning is not robust in our setting. We reuse the pretrained Arc-
Face model published by https://github.com/foamliu/InsightFace-v2. The important
things to note are:
• this model has been trained for a subject-independent protocol, that is, it has not been trained on the
people it was meant to recognize;
• this model has several data biases: our distribution includes more females than males, and very few
people above the age of 50; its training set is broader than this narrower distribution of ours. The model
Model                      F1 AUC
ArcFace                    0.34*
Cross-Entropy (CE)         0.64
CE with Distractors        0.59
CE + Zero-Logits           0.68
CE + Maximum-Entropy       0.72
Table 5.4: F1 AUC for the models evaluated. Value for ArcFace is normalized for comparability (0.26 to 0.34)
Figure 5.10: Identification accuracy and total accuracy for a true acceptance rate of 95% (top) and 99% (bot-
tom). CE: Cross-Entropy, DCE: Cross-Entropy+Distractors, ME: Cross-Entropy+MaximumEntropy, zlog=Zero-
Logits.
Figure 5.11: Various metrics for the a) ArcFace b) CE c) DCE d) ZLog e) ME model as predictions are set as
distractors under various threshold values. The black bar traverses all plots at the best total accuracy, the dotted
bar is located at maximum F1. As we sweep over the threshold values and reject more samples as distractors, we
look at the variations on the metrics.
Figure 5.12: Calibration plots for the a) CE b) DCE c) ZLog d) ME models. See Section 5.5.2 for more details
about calibration.
• Metric learning makes it hard to know when we don’t know, thus rejecting distractors is not easy;
• ArcFace protocol is about computing the cosine similarity between the representation of the input
image and a reference image. For identification, we randomly choose one reference image per
class. The reference image could instead have been selected with respect to a validation
set for minor gains. We made sure that the different metrics do not dramatically change between
each run.
• We set a distractor threshold. If all reference images have a cosine similarity lower than this
threshold, we reject this input as a distractor. We compute our metrics sweeping over threshold
values.
Figure 5.11a shows different metrics and plots on this experiment. We see that the rejection thresh-
old maximizing total accuracy is about 0.35. The first plot shows that all distractors are concentrated
between a cosine similarity of 0.25 and 0.45, which unfortunately overlaps with the VIPs’ range, between 0.25
and 0.6: there is no clear boundary separating the two, and rejecting distractors implies rejecting correctly
identified VIPs too. The kept accuracy shows that detections above 0.46 are all correct, and there is no
point in setting a higher threshold value.
The plots show a best accuracy value of 52.03% for a distractor accuracy of 78%, an identification
accuracy of 44% and F1 score of 0.40.
Cross-Entropy (CE)
We now compare the advanced ArcFace loss with a standard softmax+cross entropy loss, trained on the
persons of interest only. This model has no built-in way of signaling distractors. We are going to assume
that the distractors produce predictions with lower confidence, and threshold them on low confidence
scores. H(·, ·) is the cross entropy function. In equations, we denote by x+ and y+ the non-distractor
training examples (positive examples, persons of interest) and their associated labels, and by x− the set
of distractors in the training set. A learned softmax classifier is denoted f (·), and its logit predictions
log f (·). The loss function is defined by L = H(f (x+ ), y+ ).
the low logits they produce. We propose penalizing their squared logits, further encouraging them towards
zero, while classifying only the persons of interest with a cross-entropy loss. The loss function is defined by:
We find that this brings some improvements in identification accuracy and total accuracy, bringing
the F1 score at maximum accuracy to 0.66 and the best F1 to 0.75. The F1 and kept plots of Figure
5.11d show that this model is amenable to thresholding, and its F1 AUC of 0.68 compares favorably
to the CE model’s (0.64). This model gets either the best or second best scores at both max accuracy and
max F1, and performs better at fixed true positive rate.
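The two distractor regularizations compared here can be sketched as follows; the weighting coefficient and the exact reductions are assumptions, not the precise formulations used in the experiments:

```python
import torch
import torch.nn.functional as F

def distractor_aware_loss(model, x_pos, y_pos, x_neg, mode="zlog", lam=1.0):
    """Cross-entropy on persons of interest plus a penalty on distractors.

    mode="zlog": push the distractors' logits towards zero (squared penalty).
    mode="me":   maximize the entropy of the distractors' softmax output.
    """
    loss = F.cross_entropy(model(x_pos), y_pos)
    logits_neg = model(x_neg)
    if mode == "zlog":
        loss = loss + lam * (logits_neg ** 2).mean()
    elif mode == "me":
        log_p = F.log_softmax(logits_neg, dim=1)
        entropy = -(log_p.exp() * log_p).sum(dim=1).mean()
        loss = loss - lam * entropy      # subtracting maximizes the entropy
    return loss
```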
However, contrary to Vaze et al. [159], we do not observe logits to be more informative than soft-
max probabilities. We tried thresholding distractors on logits in each model, every time obtaining
equivalent or lower results. For this reason, we do not report those results, and threshold only on
softmax confidence scores.
We evaluate this model at maximum accuracy, and get a best identification accuracy of 78.93%, total
accuracy of 81.43%, and mean F1 of 0.72 shown in Table 5.2. Thresholded for maximal F1, this model
also obtains the best F1 score and is still highly competitive with ZLog in Table 5.3. It gets the best
F1 AUC (Table 5.4) by a significant margin, since its F1 curve is strictly above ZLog’s. Figure 5.11e
shows that the model is amenable to thresholding, from the increasing F1 plot and the kept accuracy /
distractor accuracy curves strictly better than ZLog’s. At 95% and 99% TPR (Figure 5.10) this model
dominates all others. Finally, its calibration plot in Figure 5.12e displays a close to perfect calibration
for confidence values above 0.6, which is the range of values that will be of interest for a high precision
deployment scenario.
Discussion
Overall, the ME approach proves to be the best way to leverage distractors in order to build our industrial
system, as can be seen on the TPR plots (Figure 5.10) and our F1 AUC metric. It reaches the best total
and identification accuracy and F1, while being amenable to thresholding. The ZLog technique scores
second but is inferior in every way.
The DCE method attracts our attention as well, scoring the highest distractor accuracy and offering a
nice semantic interpretation. However, its identification and total accuracy, as well as its F1, are the
lowest among the cross-entropy classifiers, showing that it might just favor detecting distractors rather
than recognizing VIPs. We conclude that this strategy overemphasizes classifying samples as distractors,
decreasing identification recall too strongly. Its distractor/kept curves dominating ME’s, as well as the
TPR plots, further indicate that this model overemphasizes precision over recall. All things considered, the
ME model is a better compromise: we will have a slightly higher error ratio (that can be negotiated with
thresholding) but a much better recall, hence predicting many more correct labels.
We highlight how observing that OoD samples naturally score lower softmax probabilities led us
to the ME loss that encourages this behavior, giving us the best results we obtained in these experiments.
Indeed, we can observe on Figures 5.11b and 5.11e that the metrics follow the same trends with similar
shapes, but ME strictly improves on them.
5.6 Conclusion
We showed that off-the-shelf face recognition classifiers, trained with ArcFace, were not satisfying for
our industrial scenario. We first explored remaining in the Metric Learning realm and proposed the
Threshold-Softmax loss function that is able to use negative samples that are cheaper to collect. The
Threshold-Softmax proposes to learn face embeddings fitting a cone with an absolute maximum angle,
rather than imposing angular margins between classes. Negative samples are forced into the negative
space: outside of the regions allocated for the positive classes. We experimented with this loss on MS1Mv2
and compared it to the state-of-the-art ArcFace. The Threshold-Softmax is competitive with, though not always
superior to, ArcFace, but presents the ability to learn from unlabeled negative samples (unknown people
not belonging to any positive class), halving the error rate in our tests on LFW and FGLFW. However,
the Threshold-Softmax remains a technique for subject-independent face verification and identification.
We moved away from metric learning and went back to cross-entropy classifiers as our system
only has to recognize identities known ahead of time, in a subject-dependent, open-set fashion. After
showing the inefficacy of ArcFace in our situation, we explored various ways of making the classifier robust to
distractors, unknown people that the system must learn to discriminate. We explored various techniques
to reject distractors at inference time and use distractors at training time, and found maximizing
the entropy for distractors to be the best performing strategy we tried, ahead of regularizing the logits
or adding a Distractor class. We further showed that all models were quite satisfyingly calibrated and
amenable to thresholding, enabling production engineers to decide on the precision/recall tradeoff that is
best for the product. The ME model is used in production today at Hexaglobe.
We also searched for a principled way to refine and extend the training set, leveraging precision/recall
plots. While our hypothesis that recall correlates with the amount of data for a class looks promising,
we still do not know how to manipulate precision, and our hypothesis that noise decreases precision has
been disproved.
The final system is currently used in production, labeling 15k videos a day, and the extracted labels
are used as planned to enrich the user experience.
6.1 Introduction
Deep Learning has proven to be successful at generating natural images. Antoniou et al. [6] see in this
ability an opportunity to improve datasets by generating more data and show performance improve-
ments in classifiers when using generated data as a supplement to the training data.
Using generative models in the context of face recognition is appealing. Many pictures per identity
are needed in order to teach a classifier that it should be invariant to lighting, pose, makeup, haircuts,
etc. However, as we grow the number of identities that the system has to recognize, there is a risk that
the classifier does not learn the invariants for identities with fewer variations in training data. In other
words, we fear that the classifier learns useful features only for the identities with many diverse pictures
and overfits the identities with little training data.
Generating data gives us the opportunity to create the diversity of pose, lighting, etc. for the identi-
ties with the least diverse training data.
In this chapter we lay out a review of the different techniques of generative models we explored
before settling on one. We will explore GANs and VAEs, and more specifically the VQ-VAE for which
we will present our contribution: an expiration process for the codebook in order to improve its training
dynamics and performance. We then introduce our chosen system for data augmentation in the context
of face recognition.
about yi , such that a powerful generative model G could hold G(yi , pi ) = xi (see figure 6.1).
Those faces can be considered samples of an underlying ”face photo” manifold with dimensions
describing semantic variations such as lighting, pose or identity. We would like to learn a generator
G(yi , z), z ∼ pz (z). G learns to interpret z as a pi and decode it as a pose / illumination / etc vector that
does not include any identity information. Ideally, pz (z) is a probability distribution that is easy to sample
from (eg standard normal). We could then reenact any identity y by sampling z vectors at will.
Figure 6.1: G is a generator that turns a person identifier yi and a latent variable zi into an image.
(6.1)    p(X = x) = ∏_{i=1}^{H} ∏_{j=1}^{W} p(X_{i,j} = x_{i,j})
With such a simple model, p(Xi,j ) is left as an arbitrarily sophisticated or simple distribution of our
choice, such as a categorical distribution over discretized pixel values (with parameters θi,j )
or Gaussian distributions over values (with mean parameters µi,j and standard deviation parameters
σi,j )
Once the individual pixel probabilities parameters (ie θi,j or µi,j , σi,j ) have been estimated from
data, one could sample a value for each pixel and get an image.
However, this modeling is trivial and would produce pictures that look nothing like real images
because it considers each pixel as independent and does not take into account patterns and spatial
correlations.
(6.4)    p(X) = ∏_{i=0}^{|Ψ|} p(X_{Ψi} | x_{Ψi−1}, …, x_{Ψi−k})
This is the approach taken by PixelCNN [83], PixelRNN [156], PixelCNN++ [137] and PixelSNAIL
[24]. At inference time, we sample pixels one by one, each requiring a model forward pass. This
exhibits the major drawback of auto-regressive models for image synthesis: they require H × W forward
passes, making them extremely slow and computationally intensive. Moreover, as the images grow bigger,
not only are more forward passes needed, but bigger models are needed as well in order to grow their
receptive fields and context windows accordingly. A single pass of PixelCNN is shown in figure 6.3.
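A sketch of why sampling is so costly: one forward pass per pixel, conditioned on everything generated so far (the categorical output over 256 values and the model interface are assumptions):

```python
import torch

@torch.no_grad()
def autoregressive_sample(model, height, width, device="cpu"):
    """Sample an image pixel by pixel; requires height * width forward passes."""
    img = torch.zeros(1, 1, height, width, dtype=torch.long, device=device)
    for i in range(height):
        for j in range(width):
            logits = model(img)                              # (1, 256, H, W)
            probs = torch.softmax(logits[0, :, i, j], dim=0)
            img[0, 0, i, j] = torch.multinomial(probs, 1).item()
    return img
```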
Figure 6.2: Conditional probability graph of an autoregressive model. Each pixel depends on the previous ones,
iteratively.
Figure 6.3: A PixelCNN sampling a value for the current pixel from its surrounding context. White pixels
are still undetermined; grey pixels have already been sampled. Shown in red is the softmax output describing the
probability distribution of the current pixel values conditioned on the context window. Image from Kolesnikov
and Lampert [83]
Modern auto-regressive models such as the VQ-VAE [155] or Vector-Quantized GAN (VQ-GAN)
[42] try working around this complexity by only sampling small pictures or small representations,
leaving the actual high quality rendering to another method, such as convolutional upsamplers, convo-
lutional decoders, or GANs. The VQ-VAE will be described in section 6.5.4.
Figure 6.4: Conditional probability graph of a Latent Variable Model (LVM). The whole image x is sampled at
once from a lower dimensional encoding z.
Instead of having a long chain of random variable dependencies (ie the previous components), we can
assume that there is a lower dimensional explanatory random variable z ∼ pz (z), and that a powerful
function could decode from it all the components at once (compare figures 6.2 and 6.4). For images,
this embodies the idea that the pixels of an image can be reduced to a much denser piece of semantic
information such as ”a child sitting on a bench and eating ice cream in a park” or a low dimensional
feature vector. We can thus model the data probability density as the probability of a data point x
decoded by all possible codes z:
(6.5)    p(x) = ∫ p(x|z) pz (z) dz = ∫ pz (z) ∏_{i=1}^{H} ∏_{j=1}^{W} p(x_{i,j} | z) dz
(6.6)    − log p(x) = − log ∫ p(x|z) pz (z) dz
In our case, this code z is considered unknown and has to be discovered by the training procedure
as well. We often choose z to be a continuous feature vector and pz (z) to be a standard Gaussian as it
is easy to sample from. p(X|Z), called ”decoder”, generates the data components from the code. It
usually is a neural network suited for the data type.
The integral inside Eq 6.6 can be rewritten as an expectation: − log p(x) = − log Ez∼pz (z) [p(x|z)].
The expectation outside the log is unfortunate: the log of the expected value would need many
samples in order to be accurate, and all those probability multiplications would be numerically unstable.
Thankfully, Jensen’s inequality gives us a useful bound: f (E[x]) ≤ E[f (x)] for any convex
function f . So, the log can be moved inside the expectation at the cost of optimizing a bound.
Learning this model with Maximum Likelihood Estimation for large datasets is impractical as we
would first need to sample a z, find a sample in the dataset that is best explained by the decoder for that
z, and perform the MLE step.
The VAE [81] proposes to solve this with more neural networks. The simple fix is to train an
”encoder” q that learns which z = q(x) explains x the best. We can then take a training sample, encode
it to a z, decode it back, and optimize for reconstruction and penalize the log probability of z according
to the prior pz (z) as well.
This, however, would not make a good generative model as there is no incentive for the encoder
to cover the whole volume of pz (z). Instead of encoding z = q(x) as a deterministic mapping, q(z|x)
can be turned into probability density parameters from which we can sample z. We can thus write
z ∼ q(z|x), and instead of penalizing the log probability of z under pz (z), we penalize the Kullback-Leibler di-
vergence DKL (q(z|x)||pz (z)). The KL divergence measures the dissimilarity between two probability
distributions. With q being pushed to resemble the prior, we hope to enforce full utilization of the prior
probability space.
Formally, if we use this surrogate distribution q(z|x) to ease smart sampling from pz (z), we are
doing importance sampling, and get Ez∼pz (z) [p(x|z)] = Ez∼q(z|x) [(pz (z)/q(z|x)) p(x|z)]. Taking the log and
applying Jensen’s inequality, we obtain

(6.10)    − log p(x) ≤ −ELBO(x) = −Ez∼q(z|x) [log p(x|z)] + DKL (q(z|x)||pz (z))
We usually interpret this loss as two terms that must be minimized: a reconstruction term and a prior
divergence term. We aim to learn an encoder that produces representations whose distribution is similar
to an isotropic Gaussian, and a decoder that is able to decode any sample from the N (0, I) prior into a
realistic data sample. While helpful, this intuitive explanation is the source of some misconceptions
that are beyond the scope of this document.
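A compact sketch of the two terms of Eq. 6.10 with a Gaussian encoder and the usual reparameterization trick (the encoder/decoder architectures and the L2 reconstruction are placeholders):

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    """Negative ELBO: reconstruction term + KL divergence to the N(0, I) prior."""
    mu, log_var = encoder(x)                                       # parameters of q(z|x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)       # reparameterized sample
    recon = decoder(z)
    recon_loss = F.mse_loss(recon, x, reduction="sum")             # -log p(x|z) up to constants
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) # D_KL(q(z|x) || N(0, I))
    return recon_loss + kl
```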
6.5.2 Limits
It is to be noted that the shortcomings of the VAE are well known:
1. The KL term and the sampling operation prevent the decoder from having an accurate latent variable
to decode. Thus, the produced samples are notoriously blurry.
2. The reconstruction term and divergence term balance in counter-intuitive ways. The ultimate
VAE goal is not to learn a meaningful latent vector but to assign the correct probability density
to the data distribution. When possible, the encoder ignores the input sample, produces exactly
the prior distribution (turning the KL term to zero), and decodes samples at random. This is to be
expected, especially when powerful decoders are used.
Benefits as regularization
Alemi et al. [4] inspect how models with an information bottleneck generalize. It happens that those
models are less prone to overfitting and adversarial attacks, and generalize better overall.
6.5.4 VQ-VAE
The VQ-VAE considers that a discrete latent variable could be used in place of a continuous one.
Architecture
The VQ-VAE [155] approaches the VAE from another perspective. The authors propose to train an auto-encoder
and then, in a second stage, fit an auto-regressive model on the latent representation as a prior to sample
from. In order to both ease the job of the prior network and control the amount of information that can
be transmitted, the latent is encoded as discrete tokens.
Figure 6.5: Training a latent variable model for colorization. There are multiple possible colorizations for a single
greyscale input. A latent extractor h extracts the information solving the ambiguity between those multiple an-
swers ; an information bottleneck prevents the latent extractor from encoding all of the target and short-circuiting
the task. The colorizer f resolves ambiguous cases using the latent.
For image data, the prior network usually is a PixelCNN or a variation of it. The approach is
summed up in figure 6.7 and Figure 6.6. A CNN auto-encoder with a VQ bottleneck is trained, then
a prior model able to model discrete sequences (PixelCNN, LSTM, Transformer, etc) fits the latent
distribution generated by the bottleneck.
From a VAE perspective, the encoder defines a (deterministic) one-hot categorical distribution and, by
defining our prior as a uniform categorical distribution, we obtain a KL divergence that is constant and equal
to log K, K being the size of the codebook (better explained in the next section). This dispenses us
from computing this term at all and it can be removed from the loss.
Figure 6.7: top: Training a VQ-VAE stage 1: a quantized encoder and decoder are trained in an autoencoding
fashion. bottom left: Training a VQ-VAE stage 2: the encoder is frozen and an autoregressive prior is learnt
on the extracted latents. bottom right: Sampling from a VQ-VAE: We generate a latent variable from the prior
model and decode it to a full picture
1. A straight-through estimator is used: it is assumed that the gradients from the upper layers, computed from the quantized codes, are good approximations of the gradients of the pre-quantized, continuous values.
2. In order to keep this approximation relevant and learn the codebook, we move (in an L2 sense) each prototype towards the center of mass of the continuous vectors that were assigned to it. The prototypes thus follow the input values.
3. Finally, to reinforce the approximation and strengthen training dynamics, we add a "commitment" term encouraging the pre-quantized values to get closer (in an L2 sense) to, and aggregate around, their assigned quantized prototype. The strength of this term is controlled by a parameter β, which defaults to 0.25 and is rarely changed.
The final VQ-VAE loss is

$\mathcal{L} = \|x - D(z_q)\|_2^2 + \beta\,\|z_e(x) - \mathrm{sg}[z_q]\|_2^2$

where $z_e(x)$ is the continuous encoder output, $z_q$ its assigned prototype fed to the decoder $D$, and $\mathrm{sg}[\cdot]$ the stop-gradient operator; the codebook itself is updated with the center-of-mass rule of point 2.
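A minimal sketch of these three mechanisms, expressed with stop-gradients (detach) on flattened latents; the nearest-neighbour lookup, the shapes, and the hard center-of-mass update (instead of a moving average) are simplifying assumptions rather than the exact implementation.

    import torch
    import torch.nn.functional as F

    def vq_step(x, z_e, codebook, decoder, beta=0.25):
        """One VQ-VAE training step on flattened latents.

        z_e: (N, D) continuous encoder outputs.
        codebook: (K, D) prototypes, a plain tensor updated manually below.
        """
        # Nearest prototype assignment (L2)
        idx = torch.cdist(z_e, codebook).argmin(dim=1)
        z_q = codebook[idx]

        # 1. Straight-through estimator: the decoder sees z_q, but gradients
        #    flow back to z_e as if quantization were the identity.
        z_st = z_e + (z_q - z_e).detach()
        recon = F.mse_loss(decoder(z_st), x)

        # 3. Commitment term: keep encoder outputs close to their prototype.
        commit = F.mse_loss(z_e, z_q.detach())
        loss = recon + beta * commit

        # 2. Codebook update: move each used prototype towards the center of
        #    mass of the vectors assigned to it (no gradient involved).
        with torch.no_grad():
            for k in idx.unique():
                codebook[k] = z_e[idx == k].mean(dim=0)
        return loss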
Benefits
As the latent variables usually have a lower dimensionality than the data points, it is faster to train and
sample latents from an autoregressive model, than to train and sample from an autoregressive model on
the data points directly.
The generated samples are also of much greater quality than those of a standard VAE. First, the prior distribution is much more complex, hence much more expressive. Second, the component-by-component conditional sampling, instead of sampling the whole latent at once like a standard VAE does, allows for a much more precise latent.
As an information bottleneck
When designing a quantization layer in a neural network, we can choose how many codebooks and how many quantized values per codebook we want. This allows setting a very accurate and hard limit on the maximum amount of information that can be transmitted. For instance, with 8 codebooks of 32 codepoints each, the bottleneck can carry at most $8 \log_2 32 = 8 \times 5 = 40$ bits of information.
Figure 6.8: Training a standard GAN. top left: G is kept frozen, we teach D to classify a fake sample as a fake image with a BCE loss BCE(D(x_f), 0). top right: D is taught to classify a real sample with BCE(D(x_r), 1). bottom: we train G to produce images that are classified as real by D with BCE(D(x_f), 1), D is kept frozen.
Figure 6.9: Interpreting D as a trainable loss giving low values to real samples and high values to fake samples.
G learns to minimize the loss D represents. Gradients of fake samples represented as white arrows.
Alternatively, they can be viewed as a simple but rich idea: training a neural network as a loss function, modeling the manifold of the data distribution. This loss-neural-network learns to give high logits to samples coming from the real data distribution and low logits to samples produced by a generator network. The generator network is trained to maximize the discriminator's output, and convergence is reached when it perfectly mimics the data distribution, making a flat logit surface. As we shall see, correctly shaping this energy surface is of crucial importance and can make GANs simple to work with or very difficult to train. Figure 6.9 shows a generator learning to reduce the loss modeled by D.
Formal Definition
An unconditional generator learns a mapping G(z) from a distribution z ∼ p_z(z) that is easy to sample from to the data distribution x ∼ p_data(x) [49]. We call the distribution produced by G p_fake. We aim for p_fake = p_data and often choose p_z(z) to be a standard Gaussian distribution.
It is often stated that G and D play a min-max game on the value function V. V was initially defined as a binary cross-entropy loss on D. However, instead of ascending V's gradient for G, which would be vanishingly small when D makes confident predictions, G is trained by gradient descent with reversed targets (referred to as the Non-Saturating GAN (NSGAN)).
(6.14) $\qquad \min_G \max_D V(G, D) = \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]$
Goodfellow et al. [49] proved in the seminal paper that the optimal G for an optimal D mimics the
data distribution perfectly and that the system minimizes the Jensen-Shannon (JS) divergence between
pfake and pdata .
(6.15) $\qquad \mathrm{JS}(p_{fake}, p_{data}) = \frac{1}{2} D_{KL}(p_{fake}\,\|\,Q) + \frac{1}{2} D_{KL}(p_{data}\,\|\,Q), \qquad Q = \frac{1}{2}(p_{data} + p_{fake})$
The global optimum of the JS divergence is the Nash equilibrium reached when $p_{fake} = p_{data}$, in the case of a generator and a discriminator of unlimited capacity and unlimited training data.
This game can converge to various outcomes:
• G is overpowered by D: G tries to satisfy D but cannot, and the samples are of poor quality;
• G and D are both able to generate and learn the data distribution, the optimization process does
not diverge, and G produces a distribution close to pdata .
Alternatively and more classically, one can view D as a classifier modeling p(real|x); training G then amounts to maximizing p(real|G(z)), using D as a differentiable loss. Several alternatives were proposed, such as a regression or a hinge loss for D instead of a BCE loss. Viewed as energy-based models, all those alternatives are similar, as they train D to model a loss surface that G optimizes against.
6.6.2 Failures
GANs were long said to suffer from numerous problems:
• Sensitivity to architecture: G and D had to be symmetrical so that one would not overpower the other, and they had to be carefully tuned.
• Training collapse: one of the two networks can collapse and end the convergence, producing unrealistic samples (Figure 6.10).
• Rotational dynamics: mode collapse can be rotational as well, meaning that G moves from mode to mode as training progresses.
Most of those difficulties are now mitigated thanks to gradient penalties introduced by WGAN-GP [51] and later improved into various regularizers such as R1 [108] or R0 [151]. They all bear the same idea: control the Lipschitzness of D, that is, its smoothness, to prevent strong gradients and give G an easy and stable descent on the loss surface.
Figure 6.10: An example of GAN training collapse. The generated samples suddenly cease converging towards realistic samples, and the GAN never escapes this degenerate state. Image source: https://www.mathworks.com/help/deeplearning/ug/monitor-gan-training-progress-and-identify-common-failure-modes.html
The Wasserstein GAN [7] proposes to minimize the Wasserstein distance between two distributions pa and pb instead, noted W(pa, pb). Also called "Earth-Mover Distance", it represents the optimal cost of transporting the probability mass to transform one distribution into the other.
This requires complex transportation algorithms to solve in low dimensionality and becomes in-
tractable in high dimensions. Instead, Arjovsky et al. [7] devise a variational approach using the
Kantorovich-Rubinstein duality [160]:

$W(p_a, p_b) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x\sim p_a}[f(x)] - \mathbb{E}_{x\sim p_b}[f(x)]$

That is, for a function f that has a maximum Lipschitzness of 1 and gives the highest (lowest) possible scores to the samples from pa (pb), the Wasserstein distance between the two distributions is the difference of the average scores over each distribution.
Lipschitzness
The Lipschitzness of a function f is the maximum L2-norm of its gradient. We say that f is K-Lipschitz if its Lipschitzness is equal to or less than K.
In the GAN setting, D plays the role of f and must give high scores to real samples and low scores to fake samples, which is a trivial task for today's neural networks. However, the way to enforce the Lipschitz constraint is not trivial.
WGAN [7] proposes as a first rough solution to clip the weights of D to small absolute values.
Thus, the value function optimized is:
(6.18) $\qquad \min_G \max_D V(G, D) = \mathbb{E}_{x\sim p_{data}(x)}[D(x)] - \mathbb{E}_{z\sim p_z(z)}[D(G(z))]$
WGAN-GP
WGAN-GP [51] approximates the Lipschitzness by measuring the L2 gradient norm of D on a linear path from real samples to fake samples.
From this, they devise the 1-GP regularizer: $R_{1\text{-}GP}(D) = (\mathrm{Lip}(D) - 1)^2$. It encourages D to be 1-Lipschitz.
R1 regularizer
Mescheder et al. [108] show that R1-GP brings rotational dynamics that slow down or totally hinder convergence. The system oscillates around the convergence point as the gradients do not effectively point towards it but spiral around it. They present the R1 regularizer, which flattens the surface around real data points, effectively turning them into attractive points. Figure 6.11 illustrates a regularized versus an unregularized loss landscape.
This strategy was successful enough to be used in, and contribute to the success of, StyleGAN [76]. However, this regularizer does not enforce anything about D's Lipschitzness and diverges from the Wasserstein GAN framework.
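For illustration, here are minimal sketches of the two penalties, assuming a discriminator D that returns one scalar per 4D image sample; practical details such as lazy regularization or multi-scale discriminators are omitted.

    import torch

    def wgan_gp(D, real, fake):
        # Gradient norm measured on a random linear interpolation between
        # real and fake samples, pushed towards 1 (the 1-GP regularizer).
        a = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        x = (a * real + (1 - a) * fake).requires_grad_(True)
        grad, = torch.autograd.grad(D(x).sum(), x, create_graph=True)
        return ((grad.flatten(1).norm(dim=1) - 1) ** 2).mean()

    def r1(D, real):
        # Gradient norm measured on real samples only, pushed towards 0,
        # flattening the loss surface around the data points.
        real = real.detach().requires_grad_(True)
        grad, = torch.autograd.grad(D(real).sum(), real, create_graph=True)
        return grad.flatten(1).pow(2).sum(dim=1).mean()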
Figure 6.11: Effect of regularizers. Top: D is trained without a regularizer. The loss landscape might be noisy and hard to optimize against; there are strong peaks and valleys because of the unregulated Lipschitzness. Bottom: D is trained with the R1 or WGAN-GP regularizers, smoothing the surface around real data points or just controlling D's Lipschitzness. The gradients are more predictive of the correct optimization direction, the loss is easier to optimize against, and the peaks and valleys are smoother than in the unregulated version. Note: these surfaces are for illustrative purposes only and are not visualizations of actual loss surfaces.
Beyond Wasserstein
Despite paving the way towards reliable GAN convergence, the WGAN is not the pinnacle of training algorithms. As shown in Lucic et al. [104], no loss for D can ensure proper convergence on its own. Qin et al. [125] show that any loss works given a good Lipschitz regularization, as those loss functions are then constrained in a linear regime anyway. This explains why Mescheder et al. [108], despite not being rooted in the WGAN, show better theoretical and empirical convergence than WGAN-GP's 1-GP regularizer. Thanh-Tung et al. [151] push this idea further for greater generalization by flattening the path from real samples to fake samples. These works continue investigating further regularizations with success.
Image Synthesis
Advances specific to image synthesis were mostly brought in the form of architectural refinements in
G, starting from the Deep Convolutional GAN [126], residual GANs introduced with SNGAN [111],
progressively grown GAN [75], or with multiscale noise inputs and adaptive scaling [122, 76, 78] as
seen in fast style transfer [47].
Figure 6.12: A cGAN. The discriminator and generator are both conditioned on y.
Pix2Pix
Isola et al. [72] conditioned the generation on images and were highly successful at supervised image translation, that is, transforming pictures from one domain to another. Back then, gradient penalties were unknown and Lipschitzness was not a concern, so Pix2Pix and its evolution Pix2PixHD [166] had to bake in several stabilization techniques and convergence helpers.
First, Pix2Pix restricts the discriminator's receptive field so that it sees only patches of the image and produces a real/fake signal per patch. They name this approach PatchGAN and the resulting network a Patch Discriminator.
Figure 6.13: Examples of image translation from the original pix2pix paper [72]. x is a real image, y a label, and
G(y, z) a fake sample produced by the generator.
1. this forces the discriminator to learn more about textures and patches rather than discriminating
on global coherence;
2. it simulates a bigger training set since each patch is seen independently, fighting against discrim-
inator overfitting.
Then, they add an L1 pixel loss to the adversarial loss, in order to handle low-frequency content and image regions larger than a patch. This also helps guide the generator, stabilizes training, and leads to better results than the unregularized discriminator alone would. The generator is based on a U-Net architecture with skip connections.
Figure 6.13 shows examples from the original paper.
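A sketch of the resulting generator objective, combining the adversarial term with the λ-weighted L1 pixel loss (λ = 100 in the paper); the conditional patch discriminator D is assumed to take the input image and an output image and to return a map of patch logits.

    import torch
    import torch.nn.functional as F

    def pix2pix_g_loss(G, D, x, y, lam=100.0):
        """x: input image, y: target image. D outputs a map of patch logits."""
        fake = G(x)
        patch_logits = D(x, fake)
        # Adversarial term: every patch of the fake should be classified as real (1)
        adv = F.binary_cross_entropy_with_logits(
            patch_logits, torch.ones_like(patch_logits))
        # L1 pixel loss handles low frequencies and regions larger than a patch
        pixel = F.l1_loss(fake, y)
        return adv + lam * pixel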
Pix2PixHD also illustrates the ongoing research on generator architectures and uses a ResNet-based generator instead. Besides, they change the NSGAN loss to a Least-Squares GAN (LSGAN) loss, which replaces the BCE targets with MSE targets; they found this loss to be more stable. They also replace the L1 pixel loss with a feature matching loss, matching deep features of the discriminators under an L1 constraint. They use 3 patch discriminators working on images resized to different sizes, among other more subtle differences.
BiGAN
Unconditional GANs exhibit a semantic interpretation of their latent variable z. BiGAN [38] proposes to jointly learn a generator G : z ↦ x and an encoder E : x ↦ z using a conditional discriminator D that learns to discriminate D(z, G(z)) against D(E(x), x) (Figure 6.14). They prove that D can be fooled only if G = E^{-1}. They then use E as a feature extractor.
Figure 6.14: In BiGAN, G generates samples from features and E generates features from samples. Both pairs are discriminated, forcing G and E to reciprocate each other. Figure from [38].
Figure 6.15: In the InfoGAN, the generator is fed with random noise z and random categorical and continuous codes c. The discriminator pushes the generator towards real samples. Q tries to guess c and G cooperates, ideally leading to G utilizing c in an interpretable way so that Q can identify them back in the generated samples. Figure from [97].
6.8.1 InfoGAN
One such example is the InfoGAN [23], aiming to learn a controllable generator with disentangled input features. They learn a generator G(z, c) with z ∼ p_z(z) the latent variable distribution and c a set of categorical and continuous codes that the auxiliary network Q must recover from the generated samples.
6.8.2 CycleGAN
Zhu et al. [184] wanted to take Pix2Pix one step further and perform image translation when pairs are not available. We have real data points from distribution $x_{a,real} \sim A$ and real data points from distribution $x_{b,real} \sim B$. We wish to learn a generator $G_{A\to B}: x_{a,real} \mapsto x_{b,fake}$. They propose to learn two generators, $G_{A\to B}$ and $G_{B\to A}$, each performing distribution matching against its own discriminator, respectively $D_B$ and $D_A$. This alone would ensure that both generators produce realistic samples of their target distribution. However, we also want the output to bear some similarity with the input. This additional constraint is added as a cycle loss aiming for $x_a = G_{B\to A}(G_{A\to B}(x_a))$ and $x_b = G_{A\to B}(G_{B\to A}(x_b))$, and is modeled with an L2 pixel-wise similarity constraint. See Figure 6.16.
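A sketch of the cycle-consistency term alone (the adversarial terms against D_A and D_B are handled as in any GAN); following the text above it uses an L2 pixel-wise constraint, whereas the original paper uses an L1 norm.

    import torch.nn.functional as F

    def cycle_loss(G_ab, G_ba, x_a, x_b):
        # Translate to the other domain and back; the reconstruction must
        # match the original input. The original paper uses an L1 norm here;
        # we follow the text above and use an L2 pixel-wise constraint.
        loss_a = F.mse_loss(G_ba(G_ab(x_a)), x_a)
        loss_b = F.mse_loss(G_ab(G_ba(x_b)), x_b)
        return loss_a + loss_b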
While being a breakthrough, CycleGAN exhibits two major flaws. First, the L2 pixel-wise similarity constraint prevents the GAN from performing geometry-heavy changes. Second, the cycle loss prevents any transformation that would lose information. For example, CycleGAN cannot be used correctly for sunglasses removal, as removing the sunglasses in a convincing way would make it impossible to recreate the exact same glasses to complete the cycle. In those situations, CycleGAN adds artifacts in order to be able to complete the cycle.
6.10 Evaluation
Evaluating GANs is difficult and must account for two key elements: image quality and distribution matching. For unconditional GANs, the Kernel Inception Distance (KID) [15], the Fréchet Inception Distance (FID) [64], and the slightly obsolete Inception Score (IS) [152] are used to evaluate both elements at once. These elements can also be evaluated separately with metrics such as the Precision / Recall developed by Kynkäänniemi et al. [89]. Unfortunately, it is still quite unclear how to evaluate a conditional GAN, especially in the unpaired setting.
6.10.1 FID
The Fréchet Inception Distance gained a lot of traction to evaluate image GANs. It works by fitting
a multivariate normal distribution on the output vectors of an Inception-V3, encoding the real and
generated images, and computing the Fréchet distance [41] between both. For two gaussians X and Y ,
the Fréchet distance is expressed as
$F(X, Y) = \|\mu_X - \mu_Y\|^2 + \mathrm{tr}\left(\Sigma_X + \Sigma_Y - 2\sqrt{\Sigma_X \Sigma_Y}\right)$
where µ and Σ are the mean and co-variance of the subscripted Gaussian. This captures both the realism
of the generated images and the coverage of the modes and variance of the real distribution.
While the FID is widely used, it has some drawbacks. It is biased and as such limited for small
datasets, is not easily interpretable, and is meant for evaluation of unconditional GANs.
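As an illustration, here is a minimal sketch of the Fréchet distance between two sets of features, assuming the Inception-V3 embeddings of real and generated images have already been extracted.

    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real, feats_fake):
        # feats_*: (N, d) arrays of Inception features
        mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_f = np.cov(feats_fake, rowvar=False)
        # Matrix square root of the covariance product
        covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
        covmean = covmean.real  # discard tiny imaginary numerical noise
        diff = mu_r - mu_f
        return diff @ diff + np.trace(cov_r + cov_f - 2 * covmean)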
6.10.2 KID
The Kernel Inception Distance measures the discrepancy between two distributions of samples. Contrarily to the FID, it does not assume a parametric form and has a different mathematical expression that makes it unbiased. It uses the polynomial kernel $k(x, y) = \left(\frac{1}{d} x^T y + 1\right)^3$, where d is the number of dimensions of x and y, which are the Inception representations of the images. Not having to compute this metric on 50k samples, as is traditionally done for the FID, makes it more suitable for smaller datasets. It has recently been used in pair with the FID for comparing methods.
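A sketch of an unbiased MMD estimate with this polynomial kernel, again assuming precomputed Inception features; the block-wise averaging over random subsets done in practice is omitted.

    import numpy as np

    def polynomial_kernel(x, y):
        d = x.shape[1]
        return (x @ y.T / d + 1.0) ** 3

    def kid(feats_real, feats_fake):
        # Unbiased MMD^2 estimate with the polynomial kernel used by the KID
        k_rr = polynomial_kernel(feats_real, feats_real)
        k_ff = polynomial_kernel(feats_fake, feats_fake)
        k_rf = polynomial_kernel(feats_real, feats_fake)
        n, m = len(feats_real), len(feats_fake)
        # Exclude the diagonal of the within-set kernel matrices
        term_rr = (k_rr.sum() - np.trace(k_rr)) / (n * (n - 1))
        term_ff = (k_ff.sum() - np.trace(k_ff)) / (m * (m - 1))
        return term_rr + term_ff - 2 * k_rf.mean()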
We now propose, in this section, a simpler and lightweight algorithm that even allows choosing the entropy of the quantized vectors' usage.
Figure 6.17: Precision/Recall estimation: white dots are 2D representations of generated samples and black dots are representations of real samples. The top figure shows the fake manifold estimation (in blue); the ratio of black dots inside the manifold gives the recall, i.e., the ratio of the real dataset covered by the generator. Below, we show the manifold of the real dataset. The ratio of white dots inside the blue zone is the precision, i.e., the ratio of generated samples that look like real samples. The spheres are drawn from each point of the manifold to its k-th nearest neighbor in order to estimate it. For these visualizations we set k = 2.
6.11.1 Principles
It quickly became clear that the quantized vectors in the codebook are updated only via weight decay, or receive gradients only if input vectors are quantized to them. If initialized improperly, a significant part of the codebook might never be used and is simply lost. Some codes might be lost during training as well if the updates of the codebook and of the input vectors get out of synchronization.
Figure 6.18: Vector Quantization illustration. Black points are codebook prototypes. They divide the space into Voronoi cells. White points are input vectors, quantized to the prototype of the Voronoi cell they fall in. (1) shows the commitment loss as white arrows, bringing the input vectors closer to the prototype they have been assigned to. The prototype in cell (2) is not used in this iteration; its unused age is incremented. When, as in cell (3), a prototype has not been used for too long (more iterations than limit), it is resampled to a random input vector and its age is reset to 0.
To overcome this issue and ensure full usage of the codebook, and thus of the available bandwidth, Łańcucki et al. [90] chose to periodically resample the codebook based on k-means centroids of the previous input vectors. Others have proposed continuous relaxation [144] or soft assignments [135].
Instead, we propose a simpler algorithm. Prototypes that have not been used for more than limit iterations are said to expire and are resampled to a random vector of the current batch. This directly sets a lower bound on the entropy of the assignments: a lower expiration limit pushes towards uniform assignments, while a higher limit allows for stronger imbalance and preferences. Figure 6.18 illustrates this resampling operation, and Listing 6.1 is a simplified PyTorch implementation. We call age the number of iterations since a code was last used or resampled.
Our approach is implemented in Torchélie (Chapter 8) and supports distributed training.
6.11.2 Experiments
We run several experiments in order to demonstrate that our VQ with expiration has better training dynamics than the vanilla version of van den Oord et al. [155]. That is, we expect our experiments to converge faster, show greater performance, and/or exhibit better utilization of the codebook.
Following the ideas of van den Oord et al. [155], we auto-encode 128×128 Imagenette images [68].
Settings.
The encoder and decoder are fully convolutional. Each encoder layer LN contains a 3x3 convolution with N output channels, a batchnorm, and a ReLU. MaxPool is noted M. The full encoder architecture is L64-M-L128-M-L256-L256-M-L512-L512-M. The decoder is L512-U-L512-U-L256-U-L128-U-L64-L3 with U the bilinear upsampling operation; the last layer does not use batchnorm and replaces the ReLU with a sigmoid activation. Between the encoder and decoder, the activations are quantized.
import torch
import torch.nn as nn


class VQ(nn.Module):
    """
    Quantization layer from *Neural Discrete Representation Learning*,
    extended with code expiration: codepoints that have not been used for
    more than `limit` iterations are resampled to a random input vector.
    Args:
        latent_dim (int): number of features along which to quantize
        num_tokens (int): number of tokens in the codebook
        limit (int): maximum number of iterations before unused codepoints
            get resampled.
    """
    def __init__(self, latent_dim: int, num_tokens: int, limit: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_tokens, latent_dim))
        # age[i]: iterations since codepoint i was last used or resampled
        self.register_buffer("age", torch.full((num_tokens,), limit))
        self.limit = limit

    def forward(self, x: torch.Tensor):
        # x: (N, latent_dim) pre-quantization vectors
        if self.training:
            # Expired codepoints are resampled to random vectors of the batch
            expired = self.age >= self.limit
            if expired.any():
                idx = torch.randint(0, x.shape[0], (int(expired.sum()),),
                                    device=x.device)
                self.codebook.data[expired] = x[idx].detach()
                self.age[expired] = 0

        # Assign each input vector to its nearest prototype (L2)
        used_indices = torch.cdist(x, self.codebook).argmin(dim=1)
        quantized = self.codebook[used_indices]

        if self.training:
            # Every codepoint ages by one iteration; used codes are reset
            self.age += 1
            self.age[used_indices] = 0

        # Straight-through estimator so that gradients reach the encoder
        return x + (quantized - x).detach()

Listing 6.1: VQ with expiration (simplified PyTorch implementation)
Figure 6.19: Histogram of the age (time since last use) of each VQ layer codepoint after 20k training iterations. left: Without the expiration process, the optimization is harder and the network fails to use the whole codebook to optimize the loss. Many codes remained unused for at least 2k iterations, presumably dead. right: Expiring and resampling codes allows for exhaustive use of the codebook and a controllable entropy. Even though the maximum age is set to 250 iterations, the codebook has a much lower age on average.
This encodes input images into a spatial map of 8x8 codes. The number of available codes varies
through experiments.
Metrics.
We evaluate the proposed algorithm under various codebook sizes, in both training and testing. Perplexity is used to estimate the codebook usage, as well as the age of the different codes. Perplexity is defined as $PP(p) = 2^{H(p)}$ (H(p) being the entropy function), where a perplexity of k indicates an entropy similar to that of a k-way discrete uniform distribution. Finally, the influence on the test loss is considered as well; the test set contains 512 pictures.
Figure 6.20: Experiment comparing test loss (left) and codebook usage Perplexity (PPL) (right) with a ReLU
layer before quantization. Expiration VQ achieves lower loss and the perplexity scales correctly.
Figure 6.21: Experiment comparing test loss (left) and codebook usage PPL (right) for a codebook of 32 code
points. Horizontal axis: training iterations, blue: VQ with expiration, orange: VQ without expiration.
For small codebooks. A deeper experiment for 32 code points shows in Figure 6.21 that, with a
relatively small codebook, more training without expiration manages to gradually recover full usage
of the codebook, although much more slowly than its expiration counterpart. In fact, training twice as long does not suffice to reach the same loss.
For big codebooks. Taking this experiment to more code points yields a different result: most of the
codes remain unused, and, contrarily to the previous results, are not ”recovered” thanks to more training
iterations. In this situation, the expiration strategy becomes necessary in order to control the effective
bottleneck size and avoid wasting unused parameters.
Follow up. We hypothesize that those results are amplified for dimensions greater than 8 because of
the curse of dimensionality, but this is still to be verified experimentally.
6.11.3 Conclusion
We proposed a simple and lightweight algorithm that allows setting a lower bound on the entropy of the codebook usage in VQ-VAEs. Codes that have not been used for more training iterations than a set threshold are resampled, preventing dead codes that receive no updates. Experimental evidence suggests that this strategy yields improvements over the baseline that grow with the size of the codebook. Our results show no scenario in which it is notably inferior, so it can safely be used as a default.
6.12.2 Methods
Architecture
The system we propose is a latent variable model in the form of an encoder-bottleneck that extracts
latent variables from a picture, and sends them to the decoder along with an identity embedding. We
aim to reconstruct the input image under an L2 pixel-wise loss and a VGG loss (also called perceptual loss, using a VGG16). The VGG loss is the L2 difference between the deep features of a pretrained VGG network, extracted after every ReLU.
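A minimal sketch of such a perceptual loss with torchvision's VGG16, accumulating the L2 difference of activations after every ReLU; input normalization and layer weighting are omitted.

    import torch
    import torch.nn.functional as F
    from torchvision.models import vgg16

    class VGGLoss(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # Frozen pretrained feature extractor
            self.features = vgg16(pretrained=True).features.eval()
            for p in self.features.parameters():
                p.requires_grad_(False)

        def forward(self, x, target):
            loss = x.new_zeros(())
            for layer in self.features:
                x, target = layer(x), layer(target)
                if isinstance(layer, torch.nn.ReLU):
                    # Compare deep features after every ReLU
                    loss = loss + F.mse_loss(x, target)
            return loss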
We choose simple convolutional encoders and decoders following a VGG style. The encoder (a VGG11 with BN, up to the linear layers) uses a quantization bottleneck, as we have seen it produces crisper pictures. The decoder is the same as the one used for the experiments in Section 6.11.
Figure 6.22 shows our proposed approach.
We emphasize the importance of image quality since the produced samples are to be used as training
data.
Figure 6.22: Our proposed controllable face generator. A target face is encoded to a latent and decoded back to the original picture together with the identity label. An information bottleneck in the encoder discourages the latent variable from containing any information about the person's identity, thus not leaking any identity-specific geometry, and makes it encode parameters not recoverable from the identity alone: lighting, pose, makeup, etc. At inference time, one can use any latent from any sample, or sample a latent from a prior distribution, to reenact anyone's face.
Training
We train the model on aligned faces from Hexaglobe with RAdamW. The algorithm is implemented
with Torchélie.
At inference time, we reuse latent variables from the training batch but randomly shuffle the identity vectors within the batch. Not only is this simple, it also aligns with our goal of performing face swaps.
1. While not ideal, we score the quality of the face swap by the ability of a face classifier to recover the newly sampled identity.
2. We compute the KID [15] (Section 6.10.2) as our image quality metric. This is only an indicative number, as the KID is a distribution matching metric. While the precision outlined in Section 6.10.3 would have been a better fit for image quality, it was not yet implemented in Torchélie. We argue that, since we exchange identities within a batch instead of sampling them uniformly at random, the swapped distribution does not diverge much from the ground truth distribution, making the KID a usable image quality metric in this situation.
• We tried sampling latents from the Gaussian prior, but the distribution of the latent variations was too complex to fill the whole Gaussian volume, resulting in many "holes" in the prior space. Sampling latents from the prior led to results only vaguely resembling faces, far worse than reusing extracted latents.
• We found the L2 pixel loss insufficient and too harsh to generate meaningful images. This loss considers all pixels equal in the image, which is not true. Some pixels of the face, like face contours, bear more semantics than others, like background pixels. A VGG loss captures this pixel importance and emphasizes those pixel structures, while relaxing the need to reconstruct the target picture in a pixel-perfect fashion. The VGG loss compares image semantics rather than pixel intensities.
1. learn an identity encoder network that predicts an identity embedding from a face picture;
2. extract the pose latents from a specific picture then optimize the identity embedding under a VGG
or pixel reconstruction loss;
Figure 6.23: Four batches of non-curated samples. Rows 1, 3, 5, 7 are reconstructed samples. The identity is randomly swapped in rows 2, 4, 6, 8 while the latent vector is kept untouched.
6.13 Conclusion
In this chapter we explored various ideas on how to augment a face recognition dataset with invariants. We hypothesize that enriching our dataset with a face swap tool would force a face recognition model to truly exploit facial geometry features, and would reduce the risk of overfitting on backgrounds or makeup, which would be transferred by the swapping. With this goal in mind, we presented
various generative models: autoregressive models, VAEs with a focus on the VQ-VAE, and GANs. We
presented various problems and their proposed solutions in the literature, such as Spectral Normaliza-
tion or R1 regularizer.
We then presented conditional and controlled modelling allowing control over the generated sam-
ples, which is necessary in our situation as we want to generate pictures of a specific person from
another picture of someone else. We presented various algorithms: cGANs, Pix2Pix, Pix2PixHD, Bi-
GAN, CVAE, InfoGAN, CycleGAN and CUT.
We contributed an improved VQVAE. Codes that have not been used for more training iterations than a set threshold are resampled, preventing dead codes that receive no updates. Experimental evidence suggests that this strategy yields improvements over the baseline that grow with the size of the codebook. Our algorithm shows no scenario in which it is notably inferior, so it can safely be used as a default.
From this VQVAE we proposed a face swap model. Our generator shows a 92% face swap success rate on our tests. The images exhibit good quality. They show good facial transfer while keeping other features untouched, as expected. The impact of this augmentation strategy on training our face recognition model is still to be evaluated and left as future work. Inverting this model, taking inspiration from GAN inversion, is another step to take in order to further explore this model and its usability.
Now that we have extracted metadata from the videos with our activity recognition and face recog-
nition models, we will build our recommender system.
7 Recommender System
7.1 Introduction
Now that we have learnt how to extract features from videos, these features will be fed to a recommender system. In this chapter, we will explore notable ways of building recommender systems, then dive into how we built ours. This chapter features several experiments showcasing the importance of the different features and augmentation strategies.
When browsing YouTube, for instance, the landing page proposes some content. This content is tailored and suggested according to the visitor. Suggesting videos at random, or just the most recent ones, would make for a terrible user experience, as most of them would not be relevant to the user's interests.
In order to extract the maximum value from the available videos, relevant videos should be favored and presented individually to each user, giving each of them a tailored YouTube experience. Selecting the videos that are a good fit for a given user is the role of the recommender system.
7.1.1 Lexicon
More formally, a recommender system setting involves several entities.
The items are the objects of interest of the application domain. For YouTube, these are videos, for
Amazon these are products and for Spotify these are songs.
The platform is the application. It can be Spotify, Facebook, YouTube, etc. The platform hosts items
to recommend. They have their own set of goals, usually maximizing revenue.
Content creators are the ones creating new items on the aforementioned platforms. Musicians for
Spotify, user profiles for Facebook, shops for Amazon. Some platforms, such as Last.fm (music rec-
ommendation) or MovieLens (movie recommendation) have no content creator. Instead, they have a
catalog of items provided by the platform. The content creators have their own incentive for using that
platform: they make revenue based on views, they sell items, they get exposure, etc. If the platform
does not care enough about them, they might stop creating content and the platform loses its value.
Users are the ones exploring the items and interacting with them. They are YouTube’s viewers,
Spotify’s listeners, Amazon’s shoppers, etc. Users find value in the platform for various reasons: it
hosts items they value, it allows discovering new items they like, etc. If the platform does not have their
interest at heart, they might leave, decreasing revenue to both the platform and content creators.
The recommender system is a key element of the platform that has to solve a tripartite equation:
maximizing its own business goals while maximizing the incentive for content creators to remain on
this platform and to create more valuable content, and ensuring that users will get in contact with it. In
other words, the recommender system has to find a mapping from content to users that is optimal for
the three actors.
7.1.2 At Hexaglobe
Hexaglobe provides platforms for its clients. One major way a user interacts with content is by searching. In order to provide a good search engine, the platform has to extract features from behavioral patterns or from the content itself (description text, computer vision, etc.), or to ask content creators for metadata about the item. Face recognition, for the client I am working with, provides metadata that is valuable for indexing items, and this is complemented by other tags.
However, in some cases, we want to be able to suggest the videos the user wants even before they search for them. This is typical of landing pages, where we want to suggest videos the user will enjoy. This is the role of a recommender system: it recommends content to a given user, sometimes based on their profile info and/or browsing history.
I have been tasked with taking a first shot at a recommender system for this client.
Recommendation bubble
Recommender systems may tend not to balance the exploitation of safe or known items with the exploration of other items well enough. This locks users in a fairly limited subset of items, limits their ability to discover content, and can even give the false impression that there is no other content.
You just bought a shelf, and now Amazon wants you to buy every other shelf. You listen to rock music and you have never been recommended a single rap song. Those are recommendation bubbles: you are recommended only items that are similar to things you already know and like, without any exploration or novelty. Sometimes, this is expected and a good thing: if you hate some type of music, it is okay not to be exposed to it. Sometimes, you are missing out: you are not being recommended this great movie just because your past watch sessions incline you towards movies of lesser popularity that match the features you like a bit more. Finally, it can also be plainly detrimental: a flat earther whose recommendations include more and more flat earth fake news and not a single scientifically valid and informative video, or political points of view that keep you from being exposed to opposing opinions while recommending more and more extreme content [118, 121].
Recommendation bubbles must at least be considered when building a recommender system, along with the decision whether or not to fight them by adding some randomness or exploration bias to the system.
Cold start
Finally, what should be done with a new user or item? A new user might not have enough implicit or
explicit feedback or information for the system to gauge their interests. A new item has not been seen yet, and it is hard to know whom it might appeal to or whether it will ever be popular. This is known
as the cold start problem.
Fairness
Related to the cold start problem is fairness. Fairness might not matter for Netflix but might be of primary importance when the content is provided by users, as on Amazon Marketplace or YouTube. Taking fairness into account means making sure that the system is not overly biased towards old and/or popular items, so that newcomers or smaller artists can still get recommended. Failing to account for fairness makes the system useless for long tail items and narrows the recommendations to the most popular items. Users might benefit from more diversity, and content creators benefit from getting involved in the platform. [2] inspects popularity bias, and [22] explores long tail items in music recommendation.
The long tail items are the ones, usually the majority, that get a much lower popularity than the most popular items. For some platforms, the value lies in exploiting the long tail items through personalization, which ensures that a valuable majority of items does not remain dormant. Figure 7.1 shows the popularity distribution of the client's items. Failing to recommend long tail items would make the platform a bad fit for content creators and just a waste of hard drive space.
Figure 7.1: Long tail. A few popular items (the head) get a significantly higher number of views than the majority (the tail). Exploiting only head items results in neglecting most items. The Y axis is clamped at 2k views but the most viewed video reaches 8k.
Collaborative filtering
We can leverage information about a collection of user preferences. It is assumed that if a user ui shares opinions about some content with user uj, then they should also share a similar opinion about another item that ui has not rated yet.
This method is often presented as "users who watched this video also watched" or "users like you also liked". Figure 7.2 displays a user rating matrix. The unknown value is predicted from the ratings of
other users.
Figure 7.2: In collaborative filtering, we aim to guess the rating one user would give to an item given the ratings similar users gave. Would she like Shrek because she liked The Dark Knight like user 1, or dislike Shrek because she liked Memento like user 2? Picture from https://developers.google.com/machine-learning/recommendation/collaborative/basics
Content-based filtering
In content-based filtering, a representation of items is extracted from available features and metadata. An interest feature vector is extracted from a user's watch history and ratings, and the database of content is queried with it. A baseline algorithm uses Term Frequency–Inverse Document Frequency (TF-IDF) [128] for featurization and a dot product for determining content relevance. Figure 7.3 shows a user's interests featurized in the same feature space as the items. Features could include keywords left in comments or the main geographical region of users, and metadata might be categories and tags provided by the uploader. We recommend an item based on the user's similarity with candidate content. Contrarily to collaborative filtering, note that no other user is considered in making a decision. Platforms present content-based filtering as "items matching your interests" or "similar to your recent history".
Figure 7.3: This imaginary app store has 3 apps: a science app, a robot game, and a dentist appointment finder. These apps and John's interests are annotated with a set of tags shown above the table. Based on John's past interests, the first item, the science app, seems to be a good recommendation: John's and this app's feature vectors share the greatest similarity.
Here, we will use the cosine distance as the distance d(·, ·). In the case of a regression task, the labels of the k most similar items are averaged.
• Collaborative filtering uses the rating row $r_{i,\cdot}$ (column $r_{\cdot,j}$) to represent a user (item).
• Content-based filtering represents users and items by a set of features extracted from item metadata or user profiles.
Figure 7.4: The ratings matrix is decomposed as the inner product of user latent factors and movies la-
tent factors, discovered during learning. They can be inspected to find semantically meaningful features.
Image source: https://developers.google.com/machine-learning/recommendation/
collaborative/basics
The factorization is learnt by minimizing

$\sum_{(i,j)\,\text{observed}} (r_{i,j} - A_i \cdot B_j)^2 + \lambda \left(\|A\|_F^2 + \|B\|_F^2\right)$

where λ is the hyperparameter controlling the regularization strength. The unknown values can then be predicted as $\hat{r}_{i,j} = A_i \cdot B_j$.
The greater k, the more latent factors can be learnt about users and items, and the smaller the training error, but the overfitting risk increases. There are various ways to increase this model's complexity to account for various biases or features.
Those embeddings can be analyzed to find semantically meaningful latent dimensions, like movie
genre or target age group (Figure 2 of [84]). Illustrative examples are shown in figure 7.4.
Pure MF methods need a full retrain for each new user, new content, or new interaction, and suffer from a cold start problem: each user and content is treated independently.
minimizing

$\min_W \|r - rW\|_F^2 + \beta \|W\|_1 + \lambda \|W\|_F^2 \quad \text{subject to} \quad \mathrm{diag}(W) = 0$

where β and λ are the hyperparameters regularizing, respectively, the L1 norm aiming for a sparse W and the L2 norm preventing overfitting. The diagonal of W is forced to zero so that items cannot recommend themselves, which would be a trivial solution. Given that both r and W are very sparse, computing $\hat{r} = rW$ can be done very fast with sparse-aware math libraries.
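For illustration, a sketch of this prediction step with SciPy's sparse matrices; the toy sizes and the random W are placeholders, not the actual learned model.

    import numpy as np
    from scipy import sparse

    # Hypothetical toy sizes: 3 users, 4 items
    r = sparse.csr_matrix(np.array([[1, 0, 0, 1],
                                    [0, 1, 0, 0],
                                    [1, 0, 1, 0]], dtype=np.float32))
    W = sparse.random(4, 4, density=0.25, dtype=np.float32).tolil()
    W.setdiag(0)          # items cannot recommend themselves
    W = W.tocsr()

    # Predicted scores for every (user, item) pair: r_hat = r @ W
    r_hat = r @ W
    top_items = r_hat.toarray().argsort(axis=1)[:, ::-1]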
7.5.2 Constraints
The system has some constraints to operate under in order to be useful in our production environment:
Latency
The system must return a recommendation response in under 30 ms.
Figure 7.5: Overview of the model. We sample a user, randomly sample a video from the watch history, and cut
the history at its watch timestamp t. The user is embedded by a user network while videos are encoded with a
video network. The dot product of their embedding is computed and fed to a softmax + negative log likelihood
loss, trained to predict the next video watched. The x denotes the dot product / matrix multiply operation.
Fairness
Even if it is secondary for the customer's business, which is dominated by big content creators, they want to encourage indie content creators. If it is not detrimental to recommendations, fairness is a desirable property to have.
User network
The user network is a 2-layer MLP with 1024 hidden units and 64 output units. All categorical variables have an associated learnt embedding using PyTorch's nn.Embedding layer, which effectively transforms a discrete value into a dense, learnable, continuous vector in latent space. The embedding of the n-th value of a categorical variable is $W^T \mathrm{one\_hot}(n)$, where W is a learnable matrix (basically extracting the n-th row of W). When a categorical variable can take multiple values at the same time (like video tags), the embeddings of all tags are averaged, thanks to PyTorch's nn.EmbeddingBag.
It encodes various features:
• Profile features: country and gender. Categorical variables like those use learnt embedding
before going into the MLP.
• History features: for the H previously seen, liked, favorited, and disliked videos: their tags, their category, their main actors, their uploader's ID. The embeddings of all videos are concatenated. Sequences shorter than H are padded with zero embeddings. All the videos share the same embedding layers; there is no reason the tags from the last seen video should be encoded differently than those of the second to last. However, there is a reason to keep each video encoding separate: the network can then estimate whether the user's history is diversified or not, and know which videos were previously seen, so that it can willingly decide whether to recommend them again, etc.
• Request features: for now, the category the user is browsing the website in. It was crucial to make sure that the system would not recommend videos from another category, as this displeases users.
The dimension of the nn.Embedding vectors is equal to $\min(512, \mathrm{round}(1.6\, n^{0.56}))$ with n the number of possible values for this variable (following the FastAI library's empirical rule of thumb). We set tag embeddings to have 64 dimensions.
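A sketch of how such categorical features might be embedded; the vocabulary sizes and feature names are illustrative, not those of the production model.

    import torch
    import torch.nn as nn

    def emb_dim(n_values: int) -> int:
        # FastAI-style rule of thumb for the embedding size of a
        # categorical variable with n_values possible values
        return min(512, round(1.6 * n_values ** 0.56))

    n_countries, n_tags = 150, 20000
    country_emb = nn.Embedding(n_countries, emb_dim(n_countries))
    tag_emb = nn.EmbeddingBag(n_tags, 64, mode="mean")  # tags of a video, averaged

    country = torch.tensor([42])                  # one user, categorical value
    tags = torch.tensor([[3, 17, 256, 1024, 8]])  # the video's tag ids
    user_features = torch.cat([country_emb(country), tag_emb(tags)], dim=1)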
Video network
The video network is similar to the user network. The input features are:
• Popularity features: the popularity scores, summed over all categories.
• Metadata features: the category, the video tags, the main actors, the uploader’s ID.
• File features: the video’s encoding quality.
• Optional: visual features: convolutional features extracted for some images from the video.
The embedding layers for tags, actors, categories, and uploader ID are shared among both networks.
Figure 7.6: Illustrating word2vec training. A linear model trains word embeddings either by predicting the center
word of a context window, or the context words of a context window from the center words.
Tags of a video, contrarily to words in a sentence, have no order. Deciding on a "center word" makes no sense, nor does a "context window". A user's watch history, however, provides a natural ordering. We thus consider all tags of a video to be center tags, and use the tags of the previous and next videos as context. The target word is a randomly picked tag of the center video; tags from the center and neighbouring videos are considered context as well. This already biases the embeddings towards recommendation. We use a context window of 5, that is, we use the 2 videos before and after the center video as context.
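A sketch of how (center, context) tag pairs could be generated from a watch history under this scheme; the exact sampling used in our pipeline may differ.

    import random

    def tag_pairs(history, window=2):
        """history: list of videos, each video being a list of its tags.
        Yields (center_tag, context_tag) training pairs for Word2Vec."""
        for i, video in enumerate(history):
            # Context tags come from the center video itself and from the
            # `window` previous and next videos in the watch history.
            lo, hi = max(0, i - window), min(len(history), i + window + 1)
            context = [t for v in history[lo:hi] for t in v]
            center = random.choice(video)
            for ctx in context:
                if ctx != center:
                    yield center, ctx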
We use the same strategy for words in the titles, also sampling context words, when applicable,
from their multilingual translations. Stop words (pronouns, determiners, etc) that I am able to identify
(French, English, Spanish) are removed.
Inspecting the embeddings with a technique such as Uniform Manifold Approximation and Projection (UMAP) [107] gives great insight about our tags. For instance, anime tags are effectively clustered together, as are various groups of tags for different genres, scene types, or famous actors. From this alone, computing the Word2Vec embeddings has value. One could envision running a k-means algorithm on those embeddings to create invisible meta-tags that help even more with indexing, merging different tags that share the same meaning.
Those embeddings are used, frozen, in the recommendation model.
• RandLang: Only one language is selected at random when there are multilingual data for the
title
• DropPop: the popularity info is randomly dropped. This encourages fairness as well as the
system learns not to recommend only videos with high popularity scores.
Figure 7.7: We train and evaluate our recommender system in a contrastive way. Batches of 256 pairs of histories and their next viewed video are loaded; the encoders learn to embed them so that the dot product of the real pair is greater than those of the other possible pairs formed in the batch. In other words, the encoders are learned so that $u_i \cdot v_i > u_i \cdot v_j$ for $i \neq j$.
• DropUploader: Sometimes drops the uploader ID. The system learns not to rely only on popular
uploaders.
Figure 7.8: Experiments on video encoder for a fixed user encoder. We aim to understand how features contribute
to the classification information and build a model from this information. Test Top1 accuracy is indicated.
Reusing the other examples in the batch as negative targets prevents manually inserting user-video interaction features such as: has the user already seen this video? How much older than the request is that video? This, however, is not a bad thing: for latency reasons, it is better to precompute the embeddings of the videos, which is not possible if they depend on the user and/or the request.
For testing, we use a setting identical to training. The only difference is that the testing set consists of an isolated set of users that have not been seen in training. However, the set of videos is shared between training and testing, as it would be extremely hard to find a user split that also splits the seen videos into two disjoint sets.
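A sketch of this in-batch contrastive training step, assuming user_net and video_net already map the raw features described above to 64-dimensional embeddings.

    import torch
    import torch.nn.functional as F

    def contrastive_step(user_net, video_net, user_feats, video_feats):
        u = user_net(user_feats)    # (B, 64) user embeddings
        v = video_net(video_feats)  # (B, 64) embeddings of the next watched video
        # (B, B) matrix of dot products; the diagonal holds the true pairs
        logits = u @ v.t()
        targets = torch.arange(len(u), device=u.device)
        # Softmax + negative log likelihood: the true pair must beat the
        # other B-1 videos of the batch, used as negatives
        return F.cross_entropy(logits, targets)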
1. A featureless experiment as a sanity check: the expected accuracy is 1/256 = 0.39%, and the experiment indeed gives 0.39%, passing the test.
2. Only the category of the next requested item and of the candidate videos, giving 1.18%. This is
not surprising as there are only 3 main categories, which are notably unbalanced. It is important
to recommend items within the correct category. This is a typical case where false positives are to
be avoided. Despite not being very helpful in a Top1 test accuracy sense, this feature is of crucial
importance.
3. Only the video encoding quality of the videos, giving 0.8%. The video quality does not seem to
explain much of the user behavior.
4. Only the popularity scores, giving 5.86%. This is quite surprising and would indicate that the
dataset might not be enormously biased towards popular items. This hypothesis could not be
verified, as the scores were computed some time ago and the code is not available anymore.
Thus, the popularity scores may have been processed in an unhelpful way.
5. Only the uploader’s ID, giving 20.25%. It seems that this gives a lot of valuable information about
the content, suitable for recommendation. A projection and visualization of the learnt uploader
ID embeddings would help understand the semantic space they are organized into, and help get
insight into the information extracted from the uploader ID.
7. Only 5 tags per video, giving 12.27%. There are substantial gains adding more tags.
8. Only 10 tags per video, giving 15.09%. Adding more tags still helps.
9. Only 15 tags per video, giving 19.37%. Adding even more tags still helps.
10. Only 20 tags per video, giving 20.80%. This yields diminishing returns, 15 looks good for tests.
11. Only 10 tags per video, with tags pretrained with Word2Vec, giving 17.92%. This is a notable improvement over the random initialization. Does unfreezing the embeddings help?
12. Only 10 tags per video, with tags pretrained and trainable, giving 19.22%. Unfreezing the tag embeddings helps. Since 15 tags gave the best performance so far, we try 15 pretrained and trainable tags.
13. Only 15 tags per video, with tags pretrained and trainable, giving 22.74%. The gains are still important.
14. Uploader ID and 10 tags per video, with tags pretrained and trainable, giving 24.63%. The uploader ID and the tags do not carry totally redundant information.
15. Uploader ID and 15 tags per video, with tags pretrained and trainable, giving 24.98%. The uploader ID seems to carry enough information that more tags yield diminishing returns.
16. Uploader ID, 15 pretrained and trainable tags, category, and quality, giving 25.45%. This value serves as an upper bound when all info is available.
We now have a pretty clear view of the importance of candidate video features. We settle on 15 pretrained and trainable tags, no popularity scores (favoring unbiased information), the uploader ID, the main category, and the quality.
User encoder
Now, we run feature experiments on the user encoder, summed up in Figure 7.9.
1. We remove all features from users, giving 0.3905%. The sanity check passes: without features, we get a 1/256 chance of being right.
Figure 7.9: Experiments on the user encoder for a fixed video encoder. We aim to understand how features contribute to the classification information and build a model from this information. Test Top1 accuracy is indicated. Unless indicated otherwise, the history length H is set to 2.
2. Only profile geolocation features, giving 1.84%. This is surprising as location correlates with language, which can be inferred, to some extent, from tags. We know that language does not constitute a major feature of our videos, but the accuracy gain still looks quite low.
3. Only requested category gives 1.20%. This matches the previous results with main category only
in candidate videos.
4. Only user profile’s gender feature gives 0.6%. Unsurprisingly, gender is a low indicator of user
preference.
5. Only Uploader ID from history gives 20.33%.
6. Only 15 video tags from history gives 21.44%. This jumps accuracy close to its maximum ob-
served value.
7. 15 video tags from history, video categories, request category, gives 22.77%.
8. 15 Tags, category from history, uploader IDs from history, and request category gives 24.32%.
9. Enabling all user features and growing the history from 2 to 5 elements increases the accuracy from 25.45% to 25.59%. This is not worth the increased computation.
10. Using all user features, only 15 tags in the candidate video encoder, but only viewed videos in history, gives 22.54%. Only dislikes yields 3.51%, only likes 4.58%, views and likes 22.40%, views and dislikes 22.37%, and all of them 22.74%. We see that, contrarily to intuition, likes and dislikes carry little information compared to views. Moreover, this information is not only redundant but maybe slightly misleading, suggesting not to use likes and dislikes in the model.
11. Using only views, categories, and the uploader ID, we reach 24.04%.
12. Using all data but gender (i.e. tags, categories, geolocation, uploader ID, request category), with only past views from history, we reach 24.75%. We consider that adding geolocation, even if it does not bring a tremendous improvement, is very cheap, and we assume that it favorably biases the suggestions for a new user with only localization information.
13. 25.05% with views, likes, and dislikes, confirming the improvement is marginal.
We settle on a user encoder that uses all data but gender (i.e. tags, categories, geolocation, uploader ID, request category), with only past views from history, reaching 24.75%.
Transforms
We now evaluate our transforms.
1. We activate DropTags with probability 0.25, effectively randomly dropping a quarter of the tags
during training. It gives 24.78%, making no difference. Raising to 0.5 probability yields 23.89%,
impairing the training. DropTags is better kept off.
2. We activate DropUploader with probability 0.25 gives 24.63%. Probability 0.5 gives 24.37%.
We designed those transforms inspired by CutOut [36] (Section 3.3.5). The results suggest that randomly removing some input features during training does not make the model more robust as expected, but impairs learning instead.
As is sometimes the case with data augmentation, we experiment with less regularization and a longer training schedule. We double the training time and reduce the weight decay from 0.1 to 0.01.
1. Doubling the training iterations brings our baseline to 25.11%. Adding DropTags with probability
0.25 yields 25.35%.
2. Reducing the weight decay brings our baseline to 24.91%. Adding DropTags with probability
0.25 yields 24.05%.
It seems that DropTags starts to be moderately useful with more training iterations. Time and effort would, however, be better spent on feature engineering, as suggested by all the previous experiments.
Further experiments
We finally want to evaluate the contribution of face recognition to the recommender system. Adding the detected identities as features brings the accuracy to 26.84%, demonstrating a noticeable effect on recommending content. The identity embeddings could be pretrained with Word2Vec, in a similar fashion to the tag embeddings, maybe yielding even greater improvements.
7.5.7 Discussion
Starting from Covington et al. [26], we designed our own recommender system, guided by experimental results. We found that tags are the most important features, followed by the uploader ID, as each uploader has its own type of content. We expected our model to overfit and devised some data augmentation techniques; the experiments showed that overfitting was not a problem, but DropTags still managed to be helpful. We also showed that pretraining embeddings with Word2Vec noticeably helps (approximately +4% going from random to pretrained tag embeddings). The next steps to be taken for maximum reward are probably pretraining the identity and uploader embeddings as well. This could be done by predicting the associated tag distribution or their co-occurrences in users' watch histories. The practical use of this model now has to be tested in real-world conditions, and A/B testing [176] is needed in order to compare it to the current recommender system and verify its superiority in actually understanding users' preferences.
8 Project: Torchélie
8.1 Introduction
Back in Montréal, working on JSALT2019 in order to publish [25], I wanted to gather all the deep learning related code I had written so far for my thesis. The initial motivation for this project was clear: while deep learning code is often short to write, it is also extremely easy to get wrong, and hard to diagnose and debug. In standard software engineering, most mistakes end up crashing the application or raising exceptions, making them obvious to uncover. In deep learning, they silently damage the results, often in very subtle ways.
Besides, many code bases would share common patterns: training and testing loops, alternating
training for G and D in GANs, measuring accuracy, layers or blocks, etc. In most public code, the
training code is often made of raw for-loop instrumented with many ifs, and as the need to monitor new
quantities grow, the code gets harder to maintain and error prone. As we try new incremental ideas, the
ifs switches grow out of hand, readability suffers, and more bugs arise.
Torchélie is a software engineering effort to tackle this problem and provide tools to build both
experiments and production-ready code that is fast to write and easy to maintain, from battle-tested
building blocks. It is based on PyTorch, which it extends.
Torchélie is a twofold contribution:
1. first, a library and toolbox for PyTorch, providing many utilities. We aim to mimic PyTorch's
style as closely as possible, hoping to make it a seamless experience;
2. second, a set of design principles that can be followed even outside of Torchélie in order to
ease iterative development.
8.2 Overview
The first contribution Torchélie brings is a Python package based on PyTorch that contains multiple
tools extending PyTorch horizontally:
torchelie.datasets implements new datasets, new dataset transforms, and dataset utilities. It
contains utilities like FastImageFolder, which caches PyTorch's ImageFolder file list, making
loading big datasets much faster; PairedDatasets, which samples pairs from two datasets at a time,
useful for tasks like image translation or augmentations like Mixup; MixupDataset, which samples
from a dataset and uses Mixup to augment the sampled image and its classification target; and Subset,
which allows using only a random (but reproducible) subset of a dataset, quantified either as a fraction
or a number of samples. It also provides dataset loaders and downloaders such as MS1M, which is able
to load MS1M despite its file format being encoded for MXNet, Pix2PixDataset, which loads from NVidia's
servers the datasets used for Pix2Pix [72], Imagenette, and Imagewoof.
torchelie.loss contains many losses and regularizers, notably for GANs (the Hinge loss for BigGAN
[19], the R1 regularizer from [108], the R0 regularizer from [150], the WGAN loss from [51]), face recognition
(AdaCos [182], ArcFace [34]) and style transfer. It implements the BitemperedLoss from
Amid et al. [5], which is said to be more robust to label noise, DeepDreamLoss from Mordvintsev et al.
[113], the style transfer loss from Gatys et al. [46], and FocalLoss from Lin et al. [98]. It contains a normalized
VGG network for computing perceptual losses [39] with equal layer importance as suggested
in [46]. We extend PyTorch's cross-entropy with continuous cross entropy, allowing for non
one-hot target distributions, and smoothed cross entropy for cross-entropy with label smoothing
[147].
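To give a flavor of the kind of extension involved, a continuous cross-entropy accepting arbitrary target distributions, and label smoothing expressed through it, can be sketched in a few lines of plain PyTorch. This is a sketch of the idea, not necessarily Torchélie's exact implementation or signatures.

```python
import torch
import torch.nn.functional as F

def continuous_cross_entropy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against an arbitrary (non one-hot) target distribution."""
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def smoothed_cross_entropy(logits: torch.Tensor, labels: torch.Tensor,
                           eps: float = 0.1) -> torch.Tensor:
    """Cross-entropy with label smoothing, expressed with the continuous version."""
    n_classes = logits.shape[1]
    # Spread eps uniformly over the wrong classes, keep 1 - eps on the true class.
    target = torch.full_like(logits, eps / (n_classes - 1))
    target.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return continuous_cross_entropy(logits, target)
```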
torchelie.utils contains many utility functions, some for weight initialization, accessing layers
or weights by name, computing things like Gram matrices or linear interpolations, or helping with distributed
training.
torchelie.nn is one of the biggest parts of Torchélie and contains layers and blocks from various
papers and architectures. Listing them all would be tedious and not very informative. We should
mention that it contains the VQ as a layer transparently handling backpropagation, layers for StyleGAN2
[78], PixelCNN(++) [83, 137], blocks for ResNets and ResNet-likes [59, 173], SE blocks [70], and
some more utilities. An original addition is ModuleGraph, which allows describing a model as a
computational graph.
Figure 8.1: Visualization of the torchelie.hyper hyperparameter search. The user can select hyperparameters
to sample (and how to sample them), and target metrics. Once run, the results appear in this visualization. In this
case, we highlighted via the interface the three runs with the best resulting accuracy.
pretrained on other datasets than ImageNet, by embedding, in the near future, models pretrained on face
recognition and face detection tasks. We will talk more about model design in Section 8.3.1.
Torchélie also contains utilities that are not within PyTorch's scope:
torchelie.nn.utils belongs to the design-philosophy side of the contribution and contains utilities aimed at
editing models, in order to insert, remove, or replace layers, activations, etc.
torchelie.data_learning provides tools to infer data via backpropagation, used for instance
in style transfer or feature visualization. It provides learnable images, in pixel space or Fourier space,
with or without an uncorrelated color space.
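The simplest of these parameterizations, a learnable image in pixel space, amounts to making the pixels free parameters optimized by gradient descent. The following is a minimal sketch under that assumption; the class name, shape, and sigmoid squashing are illustrative choices, and the Fourier or decorrelated variants only change the parameterization.

```python
import torch
import torch.nn as nn

class PixelImage(nn.Module):
    """An image whose pixels are free parameters, optimized by backpropagation."""

    def __init__(self, shape=(1, 3, 224, 224)):
        super().__init__()
        self.pixels = nn.Parameter(torch.randn(shape) * 0.01)

    def forward(self) -> torch.Tensor:
        # Squash to [0, 1] so the result is always a valid image.
        return torch.sigmoid(self.pixels)

# Typical use with a frozen model: build the image, then minimize a loss of model(img())
# with any optimizer over img.parameters(), e.g. for feature visualization or style transfer.
```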
torchelie.recipes: a model is useless without training and inference code. Recipes are
Torchélie's way of building programs utilizing models. We provide ready-to-use recipes such as CUT
[123], DeepDream [113], feature visualization for CNNs [120], vanilla GAN training [49], the image
manipulation algorithms from Ulyanov et al. [154], neural style transfer [46], Pix2Pix [72], StyleGAN2
[78], and standard cross-entropy classification. Most of those recipes provide a command-line interface
to complement their Python API.
Figure 8.2: The ClassificationInspector allows seeing the performance of the classifier live. It reports
the samples for which the classifier gives the best, worst, and most confused answers. The bar below
the images is green when the prediction is correct, red otherwise; its width reflects the confidence score of the
prediction. This allows eyeballing the dataset and the strengths and weaknesses of the model, and helps build intuition.
1. Independent As much as possible, features must be independent, and one must be able to use
exactly the features they need without difficulty. For instance, one must be able to take a layer or a
model and use it like any other PyTorch component in their PyTorch project. One should be able to
train a vanilla PyTorch model with a training loop from Torchélie without any adaptation.
Figure 8.3: Live confusion matrix, provided automatically when the number of classes is small enough to remain
readable (fewer than 25 classes).
2. Compositionality Features are built by composing small and independent building blocks. All
blocks must do one thing and do it well. Bigger components must be designed by combining
smaller components while keeping the logic simple. The user should be able to easily write their own
components and easily compose them into the set of features they are looking for. For instance, it must
be easy to write a new type of layer and use it in the provided models with minimal effort; to
write a new metric and integrate it in the training loop; to write new training logic; etc. This
is, for instance, why our implementation of the VQVAE embeds its gradient estimator in its
backward function (a sketch of such a layer is given after this list): using this layer is then a purely local
decision; there is no need to alter code somewhere else, like in the loss function, to make it usable.
3. Build standard and modify This delta approach is the new take Torchélie proposes on developing
libraries. Trying to build models, training recipes, and layers with options for customization
inside the object constructors leads to two kinds of major issues. First, despite the best efforts,
since it is very hard to know what the developer will want to control, it is unlikely that our parameters
would address every implementation detail. Second, the code either lacks customizability
or becomes unmaintainable as one adds switches and parameters. Moreover, each layer option
must be ported to the model's options as well, and the parameters and switches grow totally out
of hand. The same happens in training code, as one might want to change a loss function,
add one, etc. Instead, we should provide simple ways to build standard components, and
give editing features. Instead of building customized components, it is hypothesized that thinking in terms of
deltas from the standard blocks scales better: the construction code remains simple and straightforward
as it builds just one standard thing, and the editing functions remain external and self-contained.
Figure 8.4: Gradient of the loss w.r.t. the input on the current batch. The per-pixel norm of the gradient weighs each
pixel's intensity. This helps figuring out what the model looks at in the picture in order to make its predictions.
Training code should follow the same guidelines. For instance, complex losses such as the
Pix2PixHD loss should be simple but composable, so that a user can easily add, remove, change
or replace one term.
A deep learning practitioner mainly works in incremental changes, evolving from a standard
algorithm and evaluating the performance of each change; let us keep these deltas expressed as deltas in the code
as well.
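Coming back to the VQ example from the Compositionality principle, embedding the gradient estimator in the layer's backward pass can be sketched as follows. This is a simplified straight-through estimator for illustration, not Torchélie's exact implementation; in particular, the codebook is assumed to be trained by a separate codebook/commitment loss and receives no gradient through the quantization itself.

```python
import torch
import torch.nn as nn

class _QuantizeST(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z, codebook):
        # Replace each latent vector by its nearest codebook entry.
        idx = torch.cdist(z, codebook).argmin(dim=1)
        return codebook[idx]

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through: act as the identity for the encoder's gradient;
        # the codebook gets no gradient here (it is trained by its own loss).
        return grad_out, None

class VQ(nn.Module):
    def __init__(self, n_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_codes, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return _QuantizeST.apply(z, self.codebook)
```

Because the estimator lives inside the layer, swapping a continuous bottleneck for a VQ one does not require touching the training loop or the loss code.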
8.3.1 Models
Torchélie provides many architectures that are yet to be pretrained, among which many ResNet variants and
GAN architectures such as Pix2Pix or StyleGAN2.
It then becomes straightforward to write code in the Torchélie style. The model or block is implemented in
its most basic, canonical form. Derivative models are implemented in terms of transforms of those
basic blocks.
Let us illustrate this first with a simple example, without using Torchélie. One wants to experiment
with SE-ResNets [70]. In simple terms, the SE-ResNet is a ResNet which adds a Squeeze-and-Excite
block in every residual layer.
A typical way to implement this, shown in Listing 8.1, would be to make invasive changes to the ResNet
definition. Implementing more and more variants would bloat the code and make it harder and harder
to maintain as the code grows with switches.
Instead, we repeat that the SE-ResNet is a ResNet which adds a Squeeze-and-Excite block in every
residual layer. This is exactly how it should be expressed in code.
class ResNet(nn.Module):
    def __init__(self, arch, use_se=False):
        # create input stem

        for n_blocks in arch:
            for i in range(n_blocks):
                self.add_module(ResBlock(..., use_se=use_se))
        # create end of trunk

class ResBlock(nn.Module):
    def __init__(self, ..., use_se=False):
        # create branch and skip convs if any

        if use_se:
            self.branch.add_module(SE(...))

se_resnet = ResNet(..., use_se=True)
Listing 8.1: Typical SE-ResNet implementation. Note how use_se has to be passed down. Details are left out
for clarity and brevity.
resnet = ResNet(...)

for m in resnet.modules():
    if isinstance(m, ResBlock):
        m.branch.append(SE(...))
Listing 8.2: Delta implementation of SE-ResNet. There is no need for intrusive changes.
The resulting code, shown in Listing 8.2, is shorter, more readable, local, and non-invasive. There is no risk
of inserting bugs into vanilla ResNets that were functioning perfectly before our intervention.
This is not cherry-picking; examples abound:
• a Wide ResNet [179] is "a ResNet with wider layers". We can take a ResNet, iterate on its residual
blocks and replace the layers with wider ones.
• ResNet-v2 [60] combines "a ResNet with three 3x3 convs in the input stem", "a ResNet where the stride-2
convolution is the 2nd one in the branch rather than the first one", and "a ResNet that uses
average pooling rather than strides in the identity path".
• PoolFormer [177] is a Vision Transformer [40] "using average pooling instead of self-attention"
[158].
• ReZero [8] "adds a learnable scalar multiplier, initialized to 0, at the end of every residual
branch".
• etc.
Those are deltas that are easily expressed in code and that exhibit the intent more explicitly. Such
deltas are fundamental for research and iterative development. Having them expressed as external
transforms rather than as intrusive instrumentation is very important. To some extent, it allows
modifying an architecture provided by a library without needing to alter its code. It ensures that the
new delta the user is about to try does not introduce bugs into previous models and does not inter-
fere with any other part of the code. It scales trivially as incremental changes can be layered in the
exact same way, or explored in parallel, without introducing uncontrollably growing complexity with
switches.
The role of Torchélie is to ease delta programming. This is first done by providing carefully designed
interfaces:
2. Layers bear a semantic name, so that they can be easily and robustly indexed.
3. Layers that form a semantic unit are grouped, so that models are not a flat sequence of fundamental
operations, but a (fully editable) hierarchy exhibiting design choices; for instance, Conv-BN-ReLU
are grouped together in a torchelie.nn.ConvBlock.
insert_after(base, key, new, new_name) that inserts new with name new_name into
model base after the module indexed by key;
make_leaky(m) that changes the ReLUs in m into LeakyReLUs and accordingly adapts the initialization
of the layer before it (if applicable);
edit_model(model, f) that recursively transforms every module m of model to f(m), etc.
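To give an idea of how lightweight such editing helpers can be, a recursive model editor in the spirit of edit_model can be sketched in plain PyTorch. This is a minimal sketch of the concept, not the actual Torchélie code.

```python
import torch.nn as nn

def edit_model(model: nn.Module, f) -> nn.Module:
    """Recursively replace every child module m of model by f(m)."""
    for name, child in model.named_children():
        # Transform the child first, then recurse into the (possibly new) module.
        setattr(model, name, edit_model(f(child), f))
    return model

# Example: turn every ReLU of a model into a LeakyReLU.
# edit_model(model, lambda m: nn.LeakyReLU(0.2) if isinstance(m, nn.ReLU) else m)
```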
8.3.2 Algorithms
Those design guidelines apply to algorithm implementation as well. This is not as straightforward as
for models, since we are already used to models being described as data structures. For code, we need a
carefully designed code architecture.
Let us study examples of how that would work out on concrete cases.
• StyleGAN2 [78] (besides architectural changes) ”adds a Perceptual Path Length regularizer and
a R1 regularizer to a standard NSGAN”.
• StyleGAN2-ADA [77] ”inserts differentiable data augmentation before feeding real and fake
images to the discriminator”.
• ResNet-RS [12] proposes to improve ResNets with "architecture improvements" (dealt with by model
delta programming) and "better regularization methods: adding model exponential averaging
[124] (Polyak averaging), label smoothing [147], RandAugment [29] and a lower weight decay".
• Similarly, ResNet-SB proposes to "replace cross-entropy with BCE, add Mixup [180] and CutMix
[178], use the LAMB optimizer [175] with a bigger batch size, and a few other hyperparameter
differences".
• Pix2Pix [72] ”adds a L1 per pixel loss to the adversarial loss of a standard cGAN [110]”.
• Pix2PixHD [166] ”removes the L1 pixel loss but adds a feature matching loss and changes the
NSGAN loss to a LSGAN loss”.
• etc
Again, all those papers are defined as deltas from another paper or standard procedure. And even when
they propose significant changes (like ResNet-SB), they arrived at those lists of changes through incremental
trials, building deltas one on top of another, often made explicit through ablation studies. ConvNeXt
[102] is a very demonstrative example showing how research advances by stacking deltas (cf. Figure
C.1).
The reasoning is the same as for handling model variants: as one adds options and wants to experiment,
the code grows out of hand, and all those intrusive changes risk introducing regression bugs.
Listing 8.3 shows how timm [170], a famous library providing vision models and their training code,
grew an enormous list of parameters by not following a delta approach. Developing new ideas outside
of what those parameters permit is not possible without intrusive intervention inside timm's code.
However, humility is needed, as timm spawned many works and supports a much larger number of
projects than Torchélie does.
Torchélie proposes tooling to describe code as data structures and ease delta programming
of algorithms. Our currently proposed tool is the Algorithm class. It is an improved sequence of
named functions. Those functions are called in order when executing the Algorithm. The names allow
easily manipulating the sequence in order to replace or remove existing functions, or insert new ones.
That way, algorithms can be described as atomic steps that can be incrementally manipulated to grow
more sophisticated or to fork variants.
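To make the idea concrete, a minimal Algorithm class consistent with the usage in the listings below could look like the following. This is a sketch of the concept only, not Torchélie's actual implementation; the internal names are assumptions.

```python
class Algorithm:
    """An ordered sequence of named steps sharing a mutable environment dict."""

    def __init__(self):
        self._steps = []  # list of (name, function) pairs

    def _index(self, name: str) -> int:
        return [n for n, _ in self._steps].index(name)

    def add_step(self, name: str):
        def register(f):
            self._steps.append((name, f))
            return f
        return register

    def insert_after(self, anchor: str, name: str):
        def register(f):
            self._steps.insert(self._index(anchor) + 1, (name, f))
            return f
        return register

    def insert_before(self, anchor: str, name: str):
        def register(f):
            self._steps.insert(self._index(anchor), (name, f))
            return f
        return register

    def override_step(self, name: str):
        def register(f):
            self._steps[self._index(name)] = (name, f)
            return f
        return register

    def remove_step(self, name: str):
        del self._steps[self._index(name)]

    def __call__(self, *args) -> dict:
        # Run all steps in order; each step may update the shared env and
        # return a dict of metrics to report.
        env, metrics = {}, {}
        for name, f in self._steps:
            out = f(env, *args)
            if out:
                metrics.update(out)
        return metrics
```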
Let us illustrate how a standard conditional GAN algorithm can be manipulated to create new derivative
algorithms, namely WGAN-GP, Pix2Pix and Pix2PixHD, with small deltas.
class ConcatConditionalGANLoss:
    def __init__(self, G: nn.Module, D: nn.Module) -> None:
        self.G = G
        self.D = D
        self.gan_loss = tch.loss.gan.standard

        G_alg = Algorithm()

        @G_alg.add_step('fake')
        def G_fake_pass(env, real, target):
            # Generate some fakes

        @G_alg.add_step('adversarial')
        def G_adv_pass(env, real, target):
            # compute the discriminator loss

        @G_alg.add_step('backward')
        def backward(env, real, target):
            # backward the discriminator loss

        self.G_alg = G_alg

        ...  # D's pass to be implemented here
Listing 8.4: The generator pass for a ConcatConditionalGAN. The implementation details are removed in order
to focus on the design principles. The discriminator’s pass is in Listing 8.5
A conditional GAN concatenates the source and target images for the discriminator. The generator
(G) pass is made in two steps: generate some fakes and compute the loss; then backpropagate to the
generator. Listing 8.4 illustrates our delta programming approach using Algorithm for the generator.
We can see that a conditional GAN loss is defined as a pass for G and a pass for the discriminator (D). G's
pass includes a 'fake' step generating fake samples, an 'adversarial' step discriminating those fakes,
and a 'backward' step running the backpropagation. The same goes for D.
Listing 8.5 illustrates the steps involved in training the discriminator: 'gen_fakes' generates
fake samples from G, 'fake_adversarial' computes the discriminator's output on them,
'fake_backward' backpropagates the fake loss, 'real_adversarial' computes the discriminator's
output on real samples and 'real_backward' backpropagates the loss for them. These
sequences of operations are now manipulable as data through the G_alg and D_alg members.
A first variant, a conditional WGAN, could use the WGAN-GP regularizer [51]. This can be
achieved by adding a gradient penalty computation before performing the backpropagation on D, as
shown in Listing 8.6. We first make sure that the GAN's Algorithm is using a standard BCE loss,
we define our gradient penalty, and we insert a new step, 'D_gp', right before the
'real_adversarial' step. The actual implementation of the gradient penalty is totally irrelevant
to the point made here: our cWGAN-GP is grown by adding a simple term, a simple delta, to our
vanilla cGAN.
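As a rough sketch of that kind of delta (not the thesis' Listing 8.6: the class name, the gp_gain parameter, and the penalty itself, a simplified one-sided R1-like term on real pairs rather than the interpolation-based WGAN-GP penalty, are assumptions made for illustration):

```python
import torch
import torch.nn as nn

class CondGANWithGP(ConcatConditionalGANLoss):
    def __init__(self, G: nn.Module, D: nn.Module, gp_gain: float = 10.0) -> None:
        super().__init__(G, D)

        @self.D_alg.insert_before('real_adversarial', 'D_gp')
        def D_gp(env, real, target):
            # Penalize the squared gradient norm of D on real pairs.
            tgt = target.detach().requires_grad_(True)
            score = self.D(torch.cat([real, tgt], dim=1)).sum()
            grad, = torch.autograd.grad(score, tgt, create_graph=True)
            gp = gp_gain * grad.flatten(1).pow(2).sum(dim=1).mean()
            env['loss'] = env.get('loss', 0.0) + gp
            return {'gradient_penalty': gp.item()}
```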
We can go further and implement Pix2Pix's loss instead of WGAN-GP. Pix2Pix extends the standard
conditional GAN with an L1 loss between the generated picture and the actual target picture. Inheritance
can achieve this and is another valid way of implementing our goal. Listing 8.7 exhibits this implementation,
adding the L1 term, the 'l1' step, after 'real_adversarial' in our vanilla cGAN.
Pix2PixHD can be implemented as a derivative work that mostly adds a feature matching
loss to the standard conditional GAN setting and switches to the Least-Squares GAN loss, which the
authors find more stable.
class ConcatConditionalGANLoss:
    def __init__(self, G: nn.Module, D: nn.Module) -> None:
        ...  # G's pass to be implemented here

        D_alg = Algorithm()

        @D_alg.add_step('gen_fakes')
        def gen_fakes(env, real, target):
            # Generate fake images, store them in env['fake_image']

        @D_alg.add_step('fake_adversarial')
        def D_fake(env, real, target):
            # Compute the discriminator loss for fake samples.

        @D_alg.add_step('fake_backward')
        def D_backward(env, real, target):
            # Backpropagate the fake loss. Return some metrics

        @D_alg.add_step('real_adversarial')
        def D_real(env, real, target):
            # Compute the discriminator loss for real samples.

        @D_alg.add_step('real_backward')
        def D_backward(env, real, target):
            # Backpropagate the real loss. Return some metrics

        self.D_alg = D_alg

    def G_step(self, real: torch.Tensor, target: torch.Tensor) -> dict:
        return self.G_alg(real, target)

    def D_step(self, real: torch.Tensor, target: torch.Tensor) -> dict:
        return self.D_alg(real, target)
Listing 8.5: The discriminator for a ConcatConditionalGAN. The implementation details are removed in order to
focus on the design principles. The generator pass is in Listing 8.4
class Pix2PixLoss(ConcatConditionalGANLoss):
    def __init__(self, G: nn.Module, D: nn.Module, l1_gain: float) -> None:
        super().__init__(G, D)
        self.l1_gain = l1_gain

        @self.G_alg.insert_after('real_adversarial', 'l1')
        def G_l1_pass(env, real, target):
            loss = self.l1_gain * F.l1_loss(env['fake_image'], target)
            env['loss'] += loss
            return {'l1_loss': loss.item()}
Listing 8.7: Pix2Pix from ConcatConditionalGAN
class Pix2PixHDLoss(ConcatConditionalGANLoss):

    def __init__(self, G: nn.Module, D: nn.Module, l1_gain: float):
        super().__init__(G, D)
        self.gan_loss = tch.loss.gan.ls
        self.l1_gain = l1_gain

        D_with_acts = tnn.WithSavedActivations(D.module)

        @self.G_alg.insert_before('fake_adversarial', 'extract_features')
        def G_features(env, real, target):
            # Generate fake samples, compute D's activations and its final
            # predicted probability

        @self.G_alg.insert_after('extract_features', 'feature_matching')
        def G_featmatch(env, real, target):
            # Compute the feature matching loss

        @self.G_alg.override_step('fake_adversarial')
        def G_adv(env, real, target):
            # add the actual adversarial loss
Listing 8.8: Pix2PixHD
Listing 8.8 implements Pix2PixHD's loss by adding 'extract_features' and
'feature_matching' before 'fake_adversarial', which extract deep features from the
discriminator and compute the deep feature matching loss. It finally replaces 'fake_adversarial' in
order to compute the discriminator loss from those features instead of running the default step's full
forward pass on D, saving compute.
We believe this programming paradigm highlights the incremental nature of these works and makes
them easily manipulable. Experiments can easily be conducted if one wishes to remove, replace, or add
elements to losses or more complex algorithms.
Torchélie's Algorithm class is certainly not a silver bullet: our implementation has flaws and
certainly has cases that it does not handle in a comfortable and elegant way. We are nonetheless
confident that this delta programming is an important paradigm for research and deep learning work in
general.
8.4 Discussion
8.4.1 The future of Torchélie
It is first important to remember that Torchélie is a tool tailored to and built from my needs, but it could be useful
to the community thanks to its MIT license. Having a bigger user base and more regular contributors
would be very nice, but that would make modifying the library to adapt to my needs harder and slower: I
would now have to worry about API stability, breaking changes, etc. This is envisioned, but for now
I consider some parts of Torchélie still unstable and in alpha, often modified with design changes that
would invalidate work based on them. The main cause of instability is the ongoing exploration for better
tools to help delta programming for algorithms. In fact, the algorithms and recipes implemented before
the rise of the delta paradigm in Torchélie will be rewritten, and their current API will be broken. Delta
programming for models is already stable and fully functional.
The library also needs a new scope definition. Initially, I saw Torchélie as "everything computer
vision", which is now unsustainable. While it was doable at its birth to follow along the major
publications in order to implement a wide range of architectures, this is not doable anymore. Torchélie
needs to focus on its raison d'être, which is research for applied industrial problems and their constraints:
models trainable in a small amount of time on commodity hardware, with small or moderately sized
datasets (fine-tuning is another topic). This means that Torchélie only needs to implement models, tools
and algorithms that passed the hard test of time, instead of the latest trend; applied to Vision Transformers
[40], for instance, this means implementing only the milestone models that resisted industrial
testing and time, and that are commonly used in benchmarks. It certainly means lagging behind, but it also
means providing healthy defaults.
This project has no end, but the near-future to-do list already contains a lot. First and foremost,
Torchélie has focused a lot on convolutional architectures, which has become less and less relevant:
ResNets work well, and every other architecture I tested and implemented did not satisfy me more than
ResNets. Some were just slower, and others did not deliver the expected gains. That said, another
component that has been overlooked in Torchélie is in great need of implementation: the library is not
up to date with modern training settings and needs to implement those in convenient ways. Besides, as
Python is a quite volatile language due to its dynamic typing, annotating type hints and checking them
with mypy [1] is an ongoing work.
1. timm provides a very large quantity of models, most of which are also pretrained on ImageNet.
However, its scope remains focused on training image classifiers.
2. Fastai is closer in scope to Torchélie. Fastai makes some different design choices that make it
feel like learning an entirely novel library. Fastai mostly sits on top of PyTorch and brings its own
design principles, rather than extending PyTorch horizontally.
3. Finally, Ignite and PyTorch-Lightning are sources of inspiration and close to what I envision.
Torchélie tries to be more task-oriented at the cost of flexibility, and to fit my needs better.
In the long run, the broad scope of Torchélie does not seem sustainable; the library should either narrow
its scope (as discussed above) or integrate at least parts of the aforementioned libraries in order
to remain relevant with a sustainable workload. The field is advancing at an increasing pace, and more
means are needed today to keep up with it than when the project started.
8.5 Conclusion
In this chapter, we presented Torchélie, a PyTorch library gathering code (layers, losses, algorithms,
training loops, etc.) under a consistent and coherent Python package. We presented the double
contribution that Torchélie makes: 1) a toolbox for the deep learning practitioner (especially in the domain
of computer vision), with a brief presentation of its content, and 2) a programming paradigm that we
call delta programming, encouraging code and models designed as data structures for external and
incremental modifications, illustrated with a GAN example.
We discussed the work that remains to be done on Torchélie in order to keep it relevant in the near
future and more easily maintainable while maximizing its usefulness. I expressed my decision to limit
the scope of new additions to models and algorithms that pass the test of time and to invest in them
rather than chasing every new development in the literature, which is undoable and brings little value
in the not-so-uncommon case where a publication has no influential follow-up in my line of work. This
would also help finding Torchélie's place in a rapidly growing ecosystem, maybe reusing and delegating
some of its components to other packages in the PyTorch landscape, like Albumentations [20] for image
augmentation.
The main thing that could benefit Torchélie is having a community, which would alleviate the weight
of maintaining the code.
All in all, Torchélie is an invaluable tool for both my academic and industrial work. There is a
virtuous two-way feedback loop at play: Torchélie helps my work, and its use allows me to figure
out the tools and designs needed in the library. This symbiotic flow also allows proof-testing the code
and ensuring its correctness and performance.
9 Conclusion
During this thesis, we explored how Deep Learning applied to Computer Vision could bring value to a
video platform. We started with a simple idea: extract semantic information from video content in order
to improve user experience (through content categorization and searchability, for instance) and content
suggestion and personalization. This naturally led to two main parts: extracting information from
videos, and creating a recommender system. We settled on the information that would be interesting:
activity classification, which could help categorize the content and choose a fitting thumbnail when showing
search results, and face recognition for similar purposes. In a second phase, we would analyze the
challenges presented by a recommender system in our domain and propose a model.
The systems presented here are used in production or about to be, bringing value as an improved user
experience. This allows us to conclude on several points. First and foremost, deep learning is able to
deliver on our data type and domain, which corroborates the large number of testimonies of businesses
positively impacted by the usage of deep learning.
Semantic activity information could be extracted and leveraged, as seen in Section 3.4. We could
analyze the videos fairly easily and reliably enough for the way we want to exploit them. We were
able to develop a model, evaluating several options including randomly initialized models, pretrained
models and data augmentation. We showed that data augmentation and pretrained models were able to
significantly increase our accuracy, up to a point where the model could be good enough for some non-critical
labeling, improving the user experience.
Negative training samples were leveraged with standard softmax classifiers so that
our face recognition classifier becomes more robust to unknown people (Chapter 5). We first set the
face recognition problem as a standard metric learning problem and showed that it does not scale for our
use case. Since the people we want to recognize are known at training time, we simplified our problem
down to a standard classifier in need of an out-of-domain rejection ability. We compared several loss
functions in order to implement this rejection capability: standard cross-entropy, cross-entropy with
an additional out-of-domain class, cross-entropy augmented with a logit regularizer on distractors, and
cross-entropy augmented with a maximum-entropy objective for distractors. Each model improved
on the previous one, the maximum-entropy loss being the best we tried. We also showed that all
those models exhibit satisfying calibration, and that the rejection threshold can be set by the production
engineers in order to specify the tolerated error rate by trading recall for precision.
Recommender systems were our subject of study in Chapter 7, where we have no user ratings but user
browsing histories instead. We compared about fifty models trained on different feature subsets. It was
shown that providing tag embeddings pretrained with Word2Vec leads to significant gains, making them one
of the most important features, followed by learnable uploader embeddings. This model needs to be
A/B tested against the current system in production, which is based on manual heuristics.
Torchélie is a framework based on PyTorch that was introduced in Chapter 8, underpinning all experiments
and industrial developments done during the thesis. We presented how its design, based on
deltas and patches, diverges from the current famous frameworks. We implemented various contributions
in it, including the proposed VQ layers. The pertinence of those design principles was supported by
examples exhibiting how a standard ResNet can be patched to produce its variants, and how a conditional
GAN algorithm can be gradually patched to produce the Pix2Pix algorithm, then Pix2PixHD.
Those achievements were accomplished thanks to three datasets that we put together, described in
Section 2.3. HActions is composed of frames extracted from videos, sorted into 13 classes representing
various activities. HFaces is our face recognition dataset, containing 8938 identities with an average of
50 pictures per identity. We ran our experiments on the 105 most popular identities, using the rest as
distractors that have to be rejected by the classifier. Finally, we leveraged HHistory, the browsing history
of our premium users, for our recommendation engine. It contains several metadata, 3673 uploaders,
140k videos and 136k users. We searched for ways to grow those datasets in a principled way in Section
5.5.4.
Besides those industry-focused goals, we explored the metric learning framework and proposed the
Threshold-Softmax, a new loss function able to learn from negative examples (Section 5.4). The
Threshold-Softmax proposes to learn face embeddings fitting a cone with an absolute maximum angle,
rather than imposing angular margins between classes. Negative samples are forced into the negative
space: outside of the regions allocated for the positive classes. We experimented with this loss on MS1Mv2
and compared it to the SotA ArcFace. The Threshold-Softmax is competitive with but not always superior
to ArcFace, yet it presents the ability to learn from unlabeled negative samples (unknown people not
belonging to any positive class), halving the error rate in our tests on LFW and FGLFW.
During the ongoing exploration of generative models based on the VQ-VAE, we proposed to improve
the efficiency and control of the quantization layer thanks to our expiring codebook (Section
6.11). An expiration mechanism is added to the codebook: when a code has not been used for more than
a fixed number of training iterations, it is resampled to an input data point. This threshold is a hyperparameter
that allows controlling the entropy of the assignments. Experiments showed that this algorithm
leads to better training dynamics, consistently outperforming the original VQ algorithm and training
faster.
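As a reminder of the mechanism, a minimal sketch of such an expiration step follows; the function name, tensor shapes and the max_age default are illustrative assumptions, not the exact implementation of Section 6.11.

```python
import torch

@torch.no_grad()
def expire_codes(codebook: torch.Tensor, age: torch.Tensor,
                 batch_z: torch.Tensor, max_age: int = 200) -> None:
    """Resample codebook entries unused for more than `max_age` iterations.

    codebook: (K, D) code vectors        age: (K,) iterations since last assignment
    batch_z:  (N, D) encoder outputs from the current batch
    """
    expired = age > max_age
    n = int(expired.sum())
    if n > 0:
        # Re-initialize each expired code to a randomly picked encoder output.
        picks = batch_z[torch.randint(0, batch_z.shape[0], (n,))]
        codebook[expired] = picks
        age[expired] = 0
```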
We used this to build a face swap model (Section 6.12) based on a VQ bottleneck. A face picture is
encoded, constrained with a VQ bottleneck, and decoded back. We also provide a learnable embedding
of the identity label to the decoder. In order to produce an accurate reconstruction despite the bottleneck,
the model has to learn as much as possible about facial geometry and store that knowledge in the
embedding. The bottleneck can be used to transport the remaining information: colors, lighting, pose,
etc. We showed that this model produces crisp pictures, and that the face swap is successful, actually
fooling a face recognition system.
Some future work remains to be done and questions remain to be answered. We mainly need to put all
those systems together and test whether the metadata extracted from videos are able to improve the
recommender system. The impact on the face recognition system of the samples generated by the
face swap algorithm is still to be evaluated. This VQ-VAE system should be compared to a GAN-based
model, as GANs are famous for the great image quality they produce. The recommender system has to
be evaluated in production and A/B tested to assess its real-world performance. Torchélie needs to find
a community to maintain it and keep it up to date, as the needs for deep learning grow too much for
a single developer. Most importantly, Torchélie must now get up to date with modern training recipes
(involving Mixup, CutMix, TrivialAugment, etc.) while mitigating the cost they incur, often requiring
longer training schedules or smaller batch sizes, if we want to keep our extreme applicability focus.
Bibliography
[2] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, Bamshad Mobasher, and Edward Malt-
house. User-centered evaluation of popularity bias in recommender systems. In Proceedings
of the 29th ACM Conference on User Modeling, Adaptation and Personalization, UMAP ’21,
page 119–129, New York, NY, USA, 2021. Association for Computing Machinery. ISBN
9781450383660. doi: 10.1145/3450613.3456821. URL https://doi.org/10.1145/
3450613.3456821.
[3] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol (Paul) Natsev, George Toderici, Bal-
akrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video
classification benchmark. In arXiv:1609.08675, 2016. URL https://arxiv.org/pdf/
1609.08675v1.pdf.
[4] Alexander Amir Alemi, Ian S. Fischer, Joshua V. Dillon, and K. Murphy. Deep variational
information bottleneck. ArXiv, abs/1612.00410, 2017.
[5] Ehsan Amid, Manfred K. Warmuth, Rohan Anil, and Tomer Koren. Robust bi-tempered logistic
loss based on bregman divergences. ArXiv, abs/1906.03361, 2019.
[6] Anthreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adver-
sarial networks, 2018. URL https://openreview.net/forum?id=S1Auv-WRZ.
[7] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial net-
works. In International conference on machine learning, pages 214–223. PMLR, 2017.
[8] Thomas Bachlechner, Bodhisattwa Prasad Majumder, Henry Mao, Gary Cottrell, and Julian
McAuley. Rezero is all you need: Fast convergence at large depth. In Uncertainty in Artifi-
cial Intelligence, pages 1352–1361. PMLR, 2021.
[9] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Transactions on Information theory, 39(3):930–945, 1993.
[10] Alan Joseph Bekker and Jacob Goldberger. Training deep neural-networks based on unreli-
able labels. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 2682–2686, 2016.
[11] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with rein-
forcement learning. In International Conference on Machine Learning, pages 459–468. PMLR,
2017.
[12] Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin,
Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies.
arXiv preprint arXiv:2103.07579, 2021.
[13] J. Bennett and S. Lanning. The netflix prize. In Proceedings of the KDD Cup Workshop 2007,
pages 3–6, New York, August 2007. ACM. URL http://www.cs.uic.edu/˜liub/
KDD-cup-2007/NetflixPrize-description.pdf.
[14] David Berthelot, Nicholas Carlini, Ian G Goodfellow, Nicolas Papernot, Avital Oliver, and Colin
Raffel. Mixmatch: A holistic approach to semi-supervised learning. ArXiv, abs/1905.02249,
2019.
[15] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying
mmd gans. In International Conference on Learning Representations, 2018.
[16] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[17] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. Lof: Identifying
density-based local outliers. In SIGMOD Conference, 2000.
[18] Andrew Brock, Theodore Lim, James Millar Ritchie, and Nicholas J Weston. Neural photo
editing with introspective adversarial networks. In 5th International Conference on Learning
Representations 2017, 2017.
[19] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity
natural image synthesis. In International Conference on Learning Representations, 2019. URL
https://openreview.net/forum?id=B1xsqj09Fm.
[20] Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail
Druzhinin, and Alexandr A. Kalinin. Albumentations: Fast and flexible image augmenta-
tions. Information, 11(2), 2020. ISSN 2078-2489. doi: 10.3390/info11020125. URL
https://www.mdpi.com/2078-2489/11/2/125.
[21] Q. Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. Vggface2: A dataset for
recognising faces across pose and age. 2018 13th IEEE International Conference on Automatic
Face & Gesture Recognition (FG 2018), pages 67–74, 2018.
[22] Òscar Celma Herrada et al. Music recommendation and discovery in the long tail. Universitat
Pompeu Fabra, 2009.
[23] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:
Interpretable representation learning by information maximizing generative adversarial nets. In
Proceedings of the 30th International Conference on Neural Information Processing Systems,
pages 2180–2188, 2016.
[24] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved
autoregressive generative model. In International Conference on Machine Learning, pages 864–
872. PMLR, 2018.
[25] Jan Chorowski, Nanxin Chen, Ricard Marxer, Hans Dolfing, Adrian Łańcucki, Guillaume
Sanchez, Tanel Alumäe, and Antoine Laurent. Unsupervised neural segmentation and clus-
tering for unit discovery in sequential data. In NeurIPS 2019 workshop-Perception as generative
reasoning-Structure, Causality, Probability, 2019.
[26] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommen-
dations. In Proceedings of the 10th ACM conference on recommender systems, pages 191–198,
2016.
[27] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial
network. IEEE transactions on neural networks and learning systems, 30(7):1967–1974, 2018.
[28] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:
Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 113–123, 2019.
[29] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical auto-
mated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
[30] Maurizio Ferrari Dacrema, Simone Boglio, P. Cremonesi, and D. Jannach. A troubling anal-
ysis of reproducibility and progress in recommender systems research. ACM Transactions on
Information Systems (TOIS), 39:1 – 49, 2021.
[31] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and
attention for all data sizes. arXiv preprint arXiv:2106.04803, 2021.
[32] Stéphane d’Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, and Levent
Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. CoRR,
abs/2103.10697, 2021. URL https://arxiv.org/abs/2103.10697.
[33] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. Ieee, 2009.
[34] Jiankang Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face
recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
pages 4685–4694, 2019.
[35] Weihong Deng, Jiani Hu, Nanhai Zhang, Binghui Chen, and Jun Guo. Fine-grained face verifi-
cation: Fglfw database, baselines, and human-dcmn partnership. Pattern Recognition, 66:63–73,
2017.
[36] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural net-
works with cutout. arXiv preprint arXiv:1708.04552, 2017.
[37] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. Advances
in Neural Information Processing Systems, 32:10542–10552, 2019.
[38] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv
preprint arXiv:1605.09782, 2016.
[39] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics
based on deep networks. Advances in neural information processing systems, 29:658–666, 2016.
[40] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
An image is worth 16x16 words: Transformers for image recognition at scale. In International
Conference on Learning Representations, 2020.
[41] DC Dowson and BV Landau. The fréchet distance between multivariate normal distributions.
Journal of multivariate analysis, 12(3):450–455, 1982.
[42] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution
image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 12873–12883, 2021.
[43] V. Fomin, J. Anmol, S. Desroziers, J. Kriss, and A. Tejani. High-level library to help with training
neural networks in pytorch. https://github.com/pytorch/ignite, 2020.
[44] Benoı̂t Frénay and Michel Verleysen. Classification in the presence of label noise: A survey.
IEEE Transactions on Neural Networks and Learning Systems, 25:845–869, 2014.
[45] Simon Funk. Netflix update: Try this at home, 2006.
[46] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolu-
tional neural networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 2414–2423, 2016.
[47] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens.
Exploring the structure of a real-time, arbitrary neural artistic stylization network. Proceed-
ings of the British Machine Vision Conference 2017, 2017. doi: 10.5244/c.31.114. URL
http://dx.doi.org/10.5244/C.31.114.
[48] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. Autogan: Neural architecture
search for generative adversarial networks. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pages 3224–3234, 2019.
[49] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural informa-
tion processing systems, 27, 2014.
[50] Shanyan Guan, Ying Tai, Bingbing Ni, Feida Zhu, Feiyue Huang, and Xiaokang Yang. Collabo-
rative learning for faster stylegan embedding. arXiv preprint arXiv:2007.01758, 2020.
[51] Ishaan Gulrajani, Faruk Ahmed, Martı́n Arjovsky, Vincent Dumoulin, and Aaron C. Courville.
Improved training of wasserstein gans. In NIPS, 2017.
[52] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural
networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
[53] Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott,
and Dinglong Huang. Curriculumnet: Weakly supervised learning from large-scale web images.
ArXiv, abs/1808.01097, 2018.
[54] Yandong Guo, Lei Zhang, Yuxiao Hu, X. He, and Jianfeng Gao. Ms-celeb-1m: A dataset and
benchmark for large-scale face recognition. In ECCV, 2016.
[55] Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep self-learning from noisy labels. ArXiv,
abs/1908.02160, 2019.
[56] Erik Härkönen, Aaron Hertzman, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering
interpretable gan controls. In IEEE Conference on Neural Information Processing Systems;,
2020.
[57] F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM
Trans. Interact. Intell. Syst., 5(4), December 2015. ISSN 2160-6455. doi: 10.1145/2827872.
URL https://doi.org/10.1145/2827872.
[58] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European conference on computer vision, pages 630–645. Springer, 2016.
[59] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 770–778, 2016. doi: 10.1109/CVPR.2016.90.
[60] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks
for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.
[61] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution
examples in neural networks. Proceedings of International Conference on Learning Representa-
tions, 2017.
[62] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train
deep networks on labels corrupted by severe noise. In NeurIPS, 2018.
[63] A. Hermans, Lucas Beyer, and B. Leibe. In defense of the triplet loss for person re-identification.
ArXiv, abs/1703.07737, 2017.
[64] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems, 30, 2017.
[65] E. Hoffer and Nir Ailon. Deep metric learning using triplet network. In SIMBAD, 2015.
[66] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
[69] Bo-Yang Hsueh, Wei Li, and I-Chen Wu. Stochastic gradient descent with hyperbolic-tangent
decay on classification. In 2019 IEEE Winter Conference on Applications of Computer Vision
(WACV), pages 435–442. IEEE, 2019.
[70] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 7132–7141, 2018.
[71] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the
wild: A database for studying face recognition in unconstrained environments. Technical Report
07-49, University of Massachusetts, Amherst, October 2007.
[72] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with
conditional adversarial networks. CVPR, 2017.
[73] Dietmar Jannach and Gediminas Adomavicius. Recommendations with a purpose. In Proceed-
ings of the 10th ACM conference on recommender systems, pages 7–10, 2016.
[74] Animesh Karnewar and Oliver Wang. Msg-gan: Multi-scale gradients for generative adversar-
ial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 7799–7808, 2020.
[75] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for
improved quality, stability, and variation. In International Conference on Learning Representa-
tions, 2018.
[76] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative
adversarial networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Jun 2019. doi: 10.1109/cvpr.2019.00453. URL http://dx.doi.org/10.1109/
CVPR.2019.00453.
[77] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila.
Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020.
[78] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. An-
alyzing and improving the image quality of stylegan. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), Jun 2020. doi: 10.1109/cvpr42600.2020.00813. URL
http://dx.doi.org/10.1109/cvpr42600.2020.00813.
[79] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface
benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4873–4882, 2016.
[80] Youngdong Kim, Junho Yim, Juseung Yun, and Junmo Kim. Nlnl: Negative learning for noisy
labels. ArXiv, abs/1908.07387, 2019.
[81] Diederik P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114,
2014.
[82] B. Klare, B. Klein, Emma Taborsky, Austin Blanton, J. Cheney, K. Allen, P. Grother, Alan
Mah, M. Burge, and Anil K. Jain. Pushing the frontiers of unconstrained face detection and
recognition: Iarpa janus benchmark a. 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1931–1939, 2015.
[83] Alexander Kolesnikov and Christoph H. Lampert. Pixelcnn models with auxiliary variables for
natural image modeling. In ICML, 2017.
[84] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recom-
mender systems. Computer, 42(8):30–37, August 2009. ISSN 0018-9162. doi: 10.1109/MC.
2009.263. URL https://doi.org/10.1109/MC.2009.263.
[85] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better?
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
2661–2671, 2019.
[86] Simon Kornblith, Honglak Lee, Ting Chen, and Mohammad Norouzi. What’s in a loss function
for image classification? arXiv preprint arXiv:2010.16402, 2020.
[87] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, CI-
FAR, 2009. URL https://www.cs.toronto.edu/˜kriz/cifar.html.
[88] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25:1097–
1105, 2012.
[89] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved
precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019.
[90] Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans JGA
Dolfing, Sameer Khurana, Tanel Alumäe, and Antoine Laurent. Robust training of vector quan-
tized bottleneck models. In 2020 International Joint Conference on Neural Networks (IJCNN),
pages 1–7. IEEE, 2020.
[91] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs
[Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[92] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In Proceedings of
the 26th annual international conference on machine learning, pages 609–616, 2009.
[93] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. Cleannet: Transfer learning for
scalable image classifier training with label noise. 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 5447–5456, 2017.
[94] Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based multimedia
information retrieval: State of the art and challenges. ACM Trans. Multimedia Comput. Commun.
Appl., 2(1):1–19, February 2006. ISSN 1551-6857. doi: 10.1145/1126004.1126005. URL
https://doi.org/10.1145/1126004.1126005.
[95] Mengtian Li, Ersin Yumer, and Deva Ramanan. Budgeted training: Rethinking deep neural
network training under resource constraints. In International Conference on Learning Represen-
tations, 2019.
[96] Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. Webvision database: Visual
learning and understanding from web data. ArXiv, abs/1708.02862, 2017.
[97] Xiaoqiang Li, Liangbo Chen, Lu Wang, Pin Wu, and Weiqin Tong. Scgan: Disentangled repre-
sentation learning by adding similarity constraint on generative adversarial nets. IEEE Access,
PP:1–1, 09 2018. doi: 10.1109/ACCESS.2018.2872695.
[98] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense
object detection. In Proceedings of the IEEE international conference on computer vision, pages
2980–2988, 2017.
[99] Kanglin Liu, Guoping Qiu, Wenming Tang, and Fei Zhou. Spectral regularization for combating
mode collapse in gans. 2019 IEEE/CVF International Conference on Computer Vision (ICCV),
Oct 2019. doi: 10.1109/iccv.2019.00648. URL http://dx.doi.org/10.1109/ICCV.
2019.00648.
[100] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and
Jiawei Han. On the variance of the adaptive learning rate and beyond. In Proceedings of the
Eighth International Conference on Learning Representations (ICLR 2020), April 2020.
[101] Weiyang Liu, Y. Wen, Zhiding Yu, Ming Li, B. Raj, and Le Song. Sphereface: Deep hyper-
sphere embedding for face recognition. 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 6738–6746, 2017.
[102] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining
Xie. A convnet for the 2020s. arXiv preprint arXiv:2201.03545, 2022.
[103] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the
wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[104] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created
equal? A large-scale study. In NeurIPS, 2018.
[105] Malte Ludewig, Noemi Mauro, Sara Latifi, and Dietmar Jannach. Performance comparison of
neural and non-neural approaches to session-based recommendation. In Proceedings of the 13th
ACM conference on recommender systems, pages 462–466, 2019.
[106] Brianna Maze, J. Adams, J. A. Duncan, Nathan D. Kalka, T. Miller, Charles Otto, Anil K. Jain,
W. T. Niggel, J. Anderson, J. Cheney, and P. Grother. IARPA Janus Benchmark-C: Face dataset
and protocol. 2018 International Conference on Biometrics (ICB), pages 158–165, 2018.
[107] L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection
for Dimension Reduction. ArXiv e-prints, February 2018.
[108] Lars M. Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do
actually converge? In ICML, 2018.
[109] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in neural information
processing systems, pages 3111–3119, 2013.
[110] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784, 2014.
[111] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization
for generative adversarial networks. In International Conference on Learning Representations,
2018. URL https://openreview.net/forum?id=B1QRgziT-.
[112] Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of
linear regions of deep neural networks. arXiv preprint arXiv:1402.1869, 2014.
[113] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going
deeper into neural networks, 2015. URL https://ai.googleblog.com/2015/06/
inceptionism-going-deeper-into-neural.html.
[114] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. DeepDream, 2015. URL https://www.tensorflow.org/tutorials/generative/deepdream.
[115] Samuel G. Müller and Frank Hutter. TrivialAugment: Tuning-free yet state-of-the-art data aug-
mentation, 2021.
[116] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines.
In ICML, 2010.
[117] Duc Tam Nguyen, Thi-Phuong-Nhung Ngo, Zhongyu Lou, Michael Klar, Laura Beggel,
and Thomas Brox. Robust learning under label noise with iterative noise-filtering. ArXiv,
abs/1906.00216, 2019.
[118] Tien T Nguyen, Pik-Mai Hui, F Maxwell Harper, Loren Terveen, and Joseph A Konstan. Ex-
ploring the filter bubble: the effect of using recommender systems on content diversity. In
Proceedings of the 23rd international conference on World wide web, pages 677–686, 2014.
[119] Xia Ning and George Karypis. SLIM: Sparse linear methods for top-N recommender systems. In
2011 IEEE 11th International Conference on Data Mining, pages 497–506. IEEE, 2011.
[120] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017.
doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
[121] Eli Pariser. The Filter Bubble: What the Internet Is Hiding from You. The Penguin Group, 2011.
ISBN 1594203008.
[122] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with
spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2019.
[123] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for un-
paired image-to-image translation. In European Conference on Computer Vision, 2020.
[124] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging.
SIAM journal on control and optimization, 30(4):838–855, 1992.
[125] Yipeng Qin, Niloy Mitra, and Peter Wonka. How does Lipschitz regularization influence
GAN training? Lecture Notes in Computer Science, pages 310–326, 2020. ISSN 1611-3349.
doi: 10.1007/978-3-030-58517-4_19. URL http://dx.doi.org/10.1007/978-3-030-58517-4_19.
[126] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2016.
[127] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[128] Juan Ramos. Using tf-idf to determine word relevance in document queries, 1999.
[129] Rajeev Ranjan, Carlos D. Castillo, and Rama Chellappa. L2-constrained softmax loss for discrimi-
native face verification. ArXiv, abs/1703.09507, 2017.
[130] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet clas-
sifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–
5400. PMLR, 2019.
[131] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples
for robust deep learning. ArXiv, abs/1803.09050, 2018.
[132] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. Neural collaborative filtering vs.
matrix factorization revisited. In Fourteenth ACM Conference on Recommender Systems, pages
240–248, 2020.
[133] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-
based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
[134] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organi-
zation in the brain. Psychological Review, 65:386–408, 1958.
[135] Aurko Roy, Ashish Vaswani, Niki Parmar, and Arvind Neelakantan. Towards a better under-
standing of vector quantized autoencoders. ArXiv, 2018.
[136] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-
Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. URL https://
image-net.org/.
[137] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: A PixelCNN
implementation with discretized logistic mixture likelihood and other modifications. In ICLR,
2017.
[138] Guillaume Sanchez, Vincente Guis, Ricard Marxer, and Frederic Bouchara. Deep
learning classification with noisy labels. In 2020 IEEE International Conference on Multimedia
& Expo Workshops (ICMEW), pages 1–6, 2020. doi: 10.1109/ICMEW46912.2020.9105992.
[139] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of GANs for
semantic face editing. In CVPR, 2020.
[140] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. InterFaceGAN: Interpreting the disen-
tangled face representation learned by GANs. IEEE transactions on pattern analysis and machine
intelligence, 2020.
[141] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale im-
age recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on
Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.
[142] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE winter
conference on applications of computer vision (WACV), pages 464–472. IEEE, 2017.
[143] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using
deep conditional generative models. Advances in neural information processing systems, 28:
3483–3491, 2015.
[144] Casper Kaae Sønderby, Ben Poole, and Andriy Mnih. Continuous relaxation training of discrete
latent variable image models. In Bayesian Deep Learning workshop, NIPS, volume 201, 2017.
[145] Harald Steck. Embarrassingly shallow autoencoders for sparse data. In The World Wide Web
Conference, pages 3251–3257, 2019.
[146] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9,
2015.
[147] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethink-
ing the inception architecture for computer vision. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2818–2826, 2016.
[148] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap
to human-level performance in face verification. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1701–1708, 2014.
[149] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural
networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[150] Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. Improving generalization and stability
of generative adversarial networks. In International Conference on Learning Representations,
2018.
[151] Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. Improving generalization and stability
of generative adversarial networks. In International Conference on Learning Representations,
2019. URL https://openreview.net/forum?id=ByxPYjC5KQ.
[152] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models.
In International Conference on Learning Representations, Apr 2016. URL http://arxiv.
org/abs/1511.01844.
[153] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas
Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Doso-
vitskiy. MLP-Mixer: An all-MLP architecture for vision. CoRR, abs/2105.01601, 2021. URL
https://arxiv.org/abs/2105.01601.
[154] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018.
[155] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation
learning. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, pages 6309–6318, 2017.
[156] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[157] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local fea-
ture descriptors with triplets and shallow convolutional neural networks. In Richard C. Wilson,
Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision
Conference (BMVC), pages 119.1–119.11. BMVA Press, September 2016. ISBN 1-901725-59-6.
doi: 10.5244/C.30.119. URL https://dx.doi.org/10.5244/C.30.119.
[158] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998–6008, 2017.
[159] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good
closed-set classifier is all you need. arXiv preprint arXiv:2110.06207, 2021.
[160] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
[161] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang
Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2017.
[162] Fei Wang, L. Chen, Cheng Li, Shiyao Huang, Yanjie Chen, Chen Qian, and Chen Change Loy.
The devil of face recognition is in the noise. In ECCV, 2018.
[163] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verifi-
cation. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[164] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Zhifeng Li, Dihong Gong, Jingchao Zhou, and Wenyu
Liu. CosFace: Large margin cosine loss for deep face recognition. 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[165] Mei Wang and Weihong Deng. Deep face recognition: A survey. Neurocomputing, 429:215–244,
Mar 2021. ISSN 0925-2312. doi: 10.1016/j.neucom.2020.10.081. URL http://dx.doi.
org/10.1016/j.neucom.2020.10.081.
[166] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro.
High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[167] Xiaobo Wang, Shuo Wang, Jun Wang, Hailin Shi, and Tao Mei. Co-mining: Deep face recogni-
tion with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 9358–9367, 2019.
[168] Yisen Wang, Weiyang Liu, Xingjun Ma, James Bailey, Hongyuan Zha, Le Song, and Shu-Tao
Xia. Iterative learning with open-set noisy labels. 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 8688–8696, 2018.
[169] Cameron Whitelam, Emma Taborsky, Austin Blanton, Brianna Maze, J. Adams, T. Miller,
Nathan D. Kalka, Anil K. Jain, J. A. Duncan, K. Allen, J. Cheney, and P. Grother. IARPA Janus
Benchmark-B face dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW), pages 592–600, 2017.
[171] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy
labeled data for image classification. 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 2691–2699, 2015.
[172] Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised
data augmentation. ArXiv, abs/1904.12848, 2019.
[173] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1492–1500, 2017.
[174] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural
networks through deep visualization. In Deep Learning Workshop, International Conference on
Machine Learning (ICML), 2015.
[175] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan
Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep
learning: Training BERT in 76 minutes. In International Conference on Learning Representations,
2020. URL https://openreview.net/forum?id=Syx4wnEtvH.
[176] Scott WH Young. Improving library user experience with a/b testing: Principles and process.
Weave: Journal of Library User Experience, 1(1), 2014.
[177] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng,
and Shuicheng Yan. MetaFormer is actually what you need for vision. arXiv preprint
arXiv:2111.11418, 2021.
[178] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon
Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032,
2019.
[179] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision
Conference 2016. British Machine Vision Association, 2016.
[180] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond
empirical risk minimization. In International Conference on Learning Representations, 2018.
URL https://openreview.net/forum?id=r1Ddp1-Rb.
[181] Michael R. Zhang, James Lucas, Geoffrey Hinton, and Jimmy Ba. Lookahead Optimizer: K Steps
Forward, 1 Step Back. Curran Associates Inc., Red Hook, NY, USA, 2019.
[182] Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. AdaCos: Adaptively scaling
cosine logits for effectively learning deep face representations. 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 10815–10824, 2019.
[183] Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural
networks with noisy labels. In NeurIPS, 2018.
[184] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image trans-
lation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE In-
ternational Conference on, 2017.
[185] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon
Papademetris, and James Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in
observed gradients. Advances in Neural Information Processing Systems, 33, 2020.
Unless otherwise specified in the text, these are the default notations and conventions used throughout
the manuscript.
D Discriminator
G Generator
IS Inception Score
JS Jensen-Shannon
lr learning rate
PPL Perplexity
VQ Vector-Quantized
Figure B.1: Additional plots for the ArcFace model (Section 5.5.5).
Figure B.3: Additional plots for the DCE model (Section 5.5.5).
Figure B.4: Additional plots for the ZLog model (Section 5.5.5).
C ConvNeXt experiments
Figure C.1: Incremental improvement process from ResNet to ConvNeXt. Figure extracted from Liu et al. [102].