Deep Bayesian Active Learning with Image Data
Yarin Gal 1 2 Riashat Islam 1 Zoubin Ghahramani 1
1 University of Cambridge, UK. 2 The Alan Turing Institute, UK. Correspondence to: Yarin Gal <[email protected]>.

Abstract

Even though active learning forms an important pillar of machine learning, deep learning tools are not prevalent within it. Deep learning poses several difficulties when used in an active learning setting. First, active learning (AL) methods generally rely on being able to learn and update models from small amounts of data. Recent advances in deep learning, on the other hand, are notorious for their dependence on large amounts of data. Second, many AL acquisition functions rely on model uncertainty, yet deep learning methods rarely represent such model uncertainty. In this paper we combine recent advances in Bayesian deep learning into the active learning framework in a practical way. We develop an active learning framework for high dimensional data, a task which has been extremely challenging so far, with very sparse existing literature. Taking advantage of specialised models such as Bayesian convolutional neural networks, we demonstrate our active learning techniques with image data, obtaining a significant improvement on existing active learning approaches. We demonstrate this on both the MNIST dataset, as well as for skin cancer diagnosis from lesion images (ISIC2016 task).

1. Introduction

A big challenge in many machine learning applications is obtaining labelled data. This can be a long, laborious, and costly process, often making the deployment of ML systems uneconomical. A framework where a system could learn from small amounts of data, and choose by itself what data it would like the user to label, would make machine learning much more widely applicable. Such frameworks for learning are referred to as active learning (Cohn et al., 1996) (also known as "experiment design" in the statistics literature), and have been used successfully in fields such as medical diagnosis, microbiology, and manufacturing (Tong, 2001). In active learning, a model is trained on a small amount of data (the initial training set), and an acquisition function (often based on the model's uncertainty) decides which data points to ask an external oracle for a label. The acquisition function selects one or more points from a pool of unlabelled data points, with the pool points lying outside of the training set. An oracle (often a human expert) labels the selected data points, these are added to the training set, and a new model is trained on the updated training set. This process is then repeated, with the training set increasing in size over time. The advantage of such systems is that they often result in dramatic reductions in the amount of labelling required to train an ML system (and therefore in cost and time).

Even though existing techniques for active learning have proven themselves useful in a variety of tasks, a major remaining challenge in active learning is its lack of scalability to high-dimensional data (Tong, 2001). This data often appears in image form, with a physician classifying MRI scans to diagnose Alzheimer's for example (Marcus et al., 2010), or an expert clinician diagnosing skin cancer from dermoscopic lesion images. To perform active learning, a model has to be able to learn from small amounts of data and represent its uncertainty over unseen data. This severely restricts the class of models that can be used within the active learning framework. As a result most approaches to active learning have focused on low dimensional problems (Tong, 2001; Hernandez-Lobato & Adams, 2015), with only a handful of exceptions (Zhu et al., 2003; Holub et al., 2008; Joshi et al., 2009) relying on kernel or graph-based approaches to handle high-dimensional data.

In recent years, with the increased availability of data in some domains, attention within the machine learning community has shifted from small data problems to big data problems (Sundermeyer et al., 2012; Krizhevsky et al., 2012; Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014). And with the increased interest in big data problems, new tools were developed and existing tools were refined for handling high dimensional data within such regimes. Deep learning, and convolutional neural networks (CNNs) (Rumelhart et al., 1985; LeCun et al., 1989) in particular, are an example of such tools. Originally developed in 1989 to parse handwritten zip codes, these tools have flourished and were adapted to a point where a CNN is able to beat a human on object recognition tasks (given enough training data) (He et al., 2015).
New techniques such as dropout (Hinton et al., 2012; Srivastava et al., 2014) are used extensively to regularise these huge models, which often contain millions of parameters (Jozefowicz et al., 2016). But even though active learning forms an important pillar of machine learning, deep learning tools are not prevalent within it. Deep learning poses several difficulties when used in an active learning setting. First, we have to be able to handle small amounts of data. Recent advances in deep learning, on the other hand, are notorious for their dependence on large amounts of data (Krizhevsky et al., 2012). Second, many AL acquisition functions rely on model uncertainty. But in deep learning we rarely represent such model uncertainty.

Relying on Bayesian approaches to deep learning, in this paper we combine recent advances in Bayesian deep learning into the active learning framework in a practical way. We develop an active learning framework for high dimensional data, a task which has been extremely challenging so far, with very sparse existing literature from the past 15 years (Zhu et al., 2003; Li & Guo, 2013; Holub et al., 2008; Joshi et al., 2009). Taking advantage of specialised models such as Bayesian convolutional neural networks (BCNNs) (Gal & Ghahramani, 2016a;b), we demonstrate our active learning techniques with image data. Using a small model, our system is able to achieve 5% test error on MNIST with only 295 labelled images without relying on unlabelled data (in comparison, 835 labelled images are needed to achieve 5% test error using random sampling – requiring an expert to label more than twice as many images to achieve the same accuracy), and achieves 1.64% test error with 1000 labelled images. This is in comparison to the 2.40% test error of DGN (Kingma et al., 2014) or the 1.53% test error of the Ladder Network Γ-model (Rasmus et al., 2015), both semi-supervised learning techniques which additionally use the entire unlabelled training set. Finally, we study a real-world application by diagnosing melanoma (skin cancer) from a small number of lesion images, by fine-tuning the VGG16 convolutional neural network (Simonyan & Zisserman, 2015) on the ISIC 2016 dataset (Gutman et al., 2016).

2. Related Research

Past attempts at active learning of image data have concentrated on kernel based methods. Using ideas from previous research in active learning of low dimensional data (Tong, 2001), Joshi et al. (2009) used "margin-based uncertainty" and extracted probabilistic outputs from support vector machines (SVMs) (Cortes & Vapnik, 1995). They used linear, polynomial, and Radial Basis Function (RBF) kernels on the raw images, picking the kernel that gave the best classification accuracy. Analogously to SVM approaches, Li & Guo (2013) used Gaussian processes (GPs) with RBF kernels to get model uncertainty. However, Li & Guo (2013) fed low dimensional features (such as SIFT features) to their RBF kernel. Lastly, making use of unlabelled data as well, Zhu et al. (2003) acquire points using a Gaussian random field model, evaluating an RBF kernel over raw images. We compare to this last technique and explain it in more detail below.

Other related work includes semi-supervised learning of image data (Weston et al., 2012; Kingma et al., 2014; Rasmus et al., 2015). In semi-supervised learning a model is given a fixed set of labelled data, and a fixed set of unlabelled data. The model can use the unlabelled data to learn about the distribution of the inputs, in the hopes that this information will aid in learning from the small labelled set as well. Although the learning paradigm is fairly different from active learning, this research forms the closest modern literature to active learning of image data. We will compare to these techniques below as well, in section 5.4.

3. Bayesian Convolutional Neural Networks

In this paper we concentrate on high dimensional image data, and need a model able to represent prediction uncertainty on such data. Existing approaches such as (Zhu et al., 2003; Li & Guo, 2013; Joshi et al., 2009) rely on kernel methods, and feed image pairs through linear, polynomial, and RBF kernels to capture image similarity as an input to an SVM for example. In contrast, we rely on specialised models for image data, and in particular on convolutional neural networks (CNNs) (Rumelhart et al., 1985; LeCun et al., 1989). Unlike the kernels above, which cannot capture spatial information in the input image, CNNs are designed to use this spatial information, and have been used successfully to achieve state-of-the-art results (Krizhevsky et al., 2012). To perform active learning with image data we make use of the Bayesian equivalent of CNNs, proposed in (Gal & Ghahramani, 2016a)¹. These Bayesian CNNs are CNNs with prior probability distributions placed over a set of model parameters $\omega = \{W_1, \ldots, W_L\}$:

$$\omega \sim p(\omega),$$

with for example a standard Gaussian prior $p(\omega)$. We further define a likelihood model

$$p(y = c \mid x, \omega) = \text{softmax}(f^{\omega}(x))$$

for the case of classification, or a Gaussian likelihood for the case of regression, with $f^{\omega}(x)$ the model output (with parameters $\omega$).

¹ As far as we are aware, there are no other tools in the current literature that offer model uncertainty in specialised models for image data which perform as well as CNNs.
To perform approximate inference in the Bayesian CNN model we make use of stochastic regularisation techniques such as dropout (Hinton et al., 2012; Srivastava et al., 2014), originally used to regularise these models. As shown in (Gal & Ghahramani, 2016b; Gal, 2016), dropout and various other stochastic regularisation techniques can be used to perform practical approximate inference in complex deep models. Inference is done by training a model with dropout before every weight layer, and by performing dropout at test time as well to sample from the approximate posterior (stochastic forward passes, referred to as MC dropout). More formally, this approach is equivalent to performing approximate variational inference, where we find a distribution $q_\theta^*(\omega)$ in a tractable family which minimises the Kullback-Leibler (KL) divergence to the true model posterior $p(\omega \mid D_{\text{train}})$ given a training set $D_{\text{train}}$. Dropout can be interpreted as a variational Bayesian approximation, where the approximating distribution is a mixture of two Gaussians with small variances and the mean of one of the Gaussians fixed at zero. The uncertainty in the weights induces prediction uncertainty by marginalising over the approximate posterior using Monte Carlo integration:

$$p(y = c \mid x, D_{\text{train}}) = \int p(y = c \mid x, \omega) \, p(\omega \mid D_{\text{train}}) \, d\omega \approx \int p(y = c \mid x, \omega) \, q_\theta^*(\omega) \, d\omega \approx \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\omega}_t)$$

with $\hat{\omega}_t \sim q_\theta^*(\omega)$, where $q_\theta(\omega)$ is the Dropout distribution (Gal, 2016).
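To make this concrete, the following is a minimal sketch of MC dropout prediction in Keras (the framework used for the experiments below). The helper name mc_dropout_predict, the default T, and the learning-phase mechanism (which assumes a TF1-era Keras backend) are our illustrative choices, not the paper's released code.

```python
# A minimal MC dropout sketch: dropout is kept active at test time by
# fixing the Keras learning phase to 1, so each forward pass samples
# fresh dropout masks, i.e. a new set of weights w_t ~ q(w).
import numpy as np
from keras import backend as K

def mc_dropout_predict(model, x, T=100):
    """Approximate p(y=c|x, D_train) by averaging T stochastic passes."""
    f = K.function([model.input, K.learning_phase()], [model.output])
    probs = np.stack([f([x, 1])[0] for _ in range(T)])  # shape (T, N, C)
    return probs.mean(axis=0), probs  # predictive mean and the raw samples
```

Averaging the T stochastic softmax outputs gives the Monte Carlo approximation above; the raw samples are kept because several of the acquisition functions in the next section need them.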
Bayesian CNNs work well with small amounts of data (Gal & Ghahramani, 2016a), and possess uncertainty information that can be used with existing acquisition functions (Gal, 2016). Such acquisition functions for the case of classification are discussed next.

4. Acquisition Functions and their Approximations

Given a model $M$, pool data $D_{\text{pool}}$, and inputs $x \in D_{\text{pool}}$, an acquisition function $a(x, M)$ is a function of $x$ that the AL system uses to decide where to query next:

$$x^* = \operatorname*{argmax}_{x \in D_{\text{pool}}} a(x, M).$$

We next explore various acquisition functions appropriate for our image data setting, and develop tractable approximations to use with our Bayesian CNNs. In tasks involving regression we often use the predictive variance, or a quantity derived from it, as our acquisition function (although we still need to be careful to query from informative areas rather than querying noise). For example, we might look for images with high predictive variance and choose those to ask an expert to label – in the hope that these will decrease model uncertainty. However, many tasks involving image data are often phrased as classification problems. For classification, several acquisition functions are available:

1. Choose pool points that maximise the predictive entropy (Max Entropy, (Shannon, 1948)):
$$H[y \mid x, D_{\text{train}}] := -\sum_c p(y = c \mid x, D_{\text{train}}) \log p(y = c \mid x, D_{\text{train}}).$$

2. Choose pool points that are expected to maximise the information gained about the model parameters, i.e. maximise the mutual information between predictions and model posterior (BALD, (Houlsby et al., 2011)):
$$I[y, \omega \mid x, D_{\text{train}}] = H[y \mid x, D_{\text{train}}] - E_{p(\omega \mid D_{\text{train}})} \big[ H[y \mid x, \omega] \big]$$
with $\omega$ the model parameters (here $H[y \mid x, \omega]$ is the entropy of $y$ given model weights $\omega$). Points that maximise this acquisition function are points on which the model is uncertain on average, but for which there exist model parameters that produce disagreeing predictions with high certainty. This is equivalent to points with high variance in the input to the softmax layer (the logits) – thus each stochastic forward pass through the model would have the highest probability assigned to a different class.

3. Maximise the Variation Ratios (Freeman, 1965):
$$\text{variation-ratio}[x] := 1 - \max_y p(y \mid x, D_{\text{train}}).$$
Like Max Entropy, Variation Ratios measures lack of confidence.

4. Maximise the mean standard deviation (Mean STD) (Kampffmeyer et al., 2016; Kendall et al., 2015):
$$\sigma_c = \sqrt{E_{q(\omega)}\big[p(y = c \mid x, \omega)^2\big] - E_{q(\omega)}\big[p(y = c \mid x, \omega)\big]^2}$$
$$\sigma(x) = \frac{1}{C} \sum_c \sigma_c,$$
averaged over all $C$ classes $x$ can take. Compared to the above acquisition functions, this is more of an ad-hoc technique used in recent literature.

5. Random acquisition (baseline): $a(x) = \text{unif}()$, with $\text{unif}()$ a function returning a draw from a uniform distribution over the interval $[0, 1]$. Using this acquisition function is equivalent to choosing a point uniformly at random from the pool.

These acquisition functions and their properties are discussed in more detail in (Gal, 2016, pp. 48–52).
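As a concrete reference, here are minimal numpy sketches of Max Entropy and Variation Ratios, the two functions above that depend only on the predictive distribution itself. The array convention (probs holding one approximate predictive probability vector per pool point) is our assumption.

```python
# Illustrative sketches; `probs` has shape (N_pool, C), with one
# (approximate) predictive distribution p(y=c|x, D_train) per pool point.
import numpy as np

def max_entropy(probs, eps=1e-10):
    # H[y|x, D_train] = -sum_c p_c log p_c, computed per pool point.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def variation_ratio(probs):
    # variation-ratio[x] = 1 - max_y p(y|x, D_train).
    return 1.0 - probs.max(axis=-1)
```

BALD and Mean STD additionally need the individual stochastic forward passes rather than only their average; a tractable estimator for BALD is derived next.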
We can approximate each of these acquisition functions using our approximate distribution $q_\theta^*(\omega)$. For BALD, for example, we can write the acquisition function as follows:

$$I[y, \omega \mid x, D_{\text{train}}] := H[y \mid x, D_{\text{train}}] - E_{p(\omega \mid D_{\text{train}})} \big[ H[y \mid x, \omega] \big]$$
$$= -\sum_c p(y = c \mid x, D_{\text{train}}) \log p(y = c \mid x, D_{\text{train}}) + E_{p(\omega \mid D_{\text{train}})} \Big[ \sum_c p(y = c \mid x, \omega) \log p(y = c \mid x, \omega) \Big],$$

with $c$ the possible classes $y$ can take. $I[y, \omega \mid x, D_{\text{train}}]$ can be approximated in our setting using the identity $p(y = c \mid x, D_{\text{train}}) = \int p(y = c \mid x, \omega) \, p(\omega \mid D_{\text{train}}) \, d\omega$:

$$I[y, \omega \mid x, D_{\text{train}}] = -\sum_c \int p(y = c \mid x, \omega) \, p(\omega \mid D_{\text{train}}) \, d\omega \cdot \log \int p(y = c \mid x, \omega) \, p(\omega \mid D_{\text{train}}) \, d\omega + E_{p(\omega \mid D_{\text{train}})} \Big[ \sum_c p(y = c \mid x, \omega) \log p(y = c \mid x, \omega) \Big].$$

Swapping the posterior $p(\omega \mid D_{\text{train}})$ with our approximate posterior $q_\theta^*(\omega)$, and through MC sampling, we then have:

$$\approx -\sum_c \int p(y = c \mid x, \omega) \, q_\theta^*(\omega) \, d\omega \cdot \log \int p(y = c \mid x, \omega) \, q_\theta^*(\omega) \, d\omega + E_{q_\theta^*(\omega)} \Big[ \sum_c p(y = c \mid x, \omega) \log p(y = c \mid x, \omega) \Big]$$
$$\approx -\sum_c \Big( \frac{1}{T} \sum_t \hat{p}_c^t \Big) \log \Big( \frac{1}{T} \sum_t \hat{p}_c^t \Big) + \frac{1}{T} \sum_{c,t} \hat{p}_c^t \log \hat{p}_c^t =: \hat{I}[y, \omega \mid x, D_{\text{train}}],$$

defining our approximation, with $\hat{p}_c^t$ the probability of input $x$ with model parameters $\hat{\omega}_t \sim q_\theta^*(\omega)$ to take class $c$:

$$\hat{p}^t = [\hat{p}_1^t, \ldots, \hat{p}_C^t] = \text{softmax}(f^{\hat{\omega}_t}(x)).$$

We then have

$$\hat{I}[y, \omega \mid x, D_{\text{train}}] \xrightarrow{T \to \infty} H[y \mid x, q_\theta^*] - E_{q_\theta^*(\omega)} \big[ H[y \mid x, \omega] \big] \approx I[y, \omega \mid x, D_{\text{train}}],$$

resulting in a computationally tractable estimator approximating the BALD acquisition function. The other acquisition functions can be approximated similarly.
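The estimator $\hat{I}$ above, and similarly the Mean STD acquisition, translate directly into code over the $T$ stochastic forward passes. Below is a minimal sketch under the same assumed (T, N_pool, C) array convention used earlier.

```python
# Illustrative sketches of the MC estimators; `probs` has shape
# (T, N_pool, C), with probs[t] the softmax output under sampled weights
# w_t ~ q(w) (e.g. the raw samples returned by mc_dropout_predict above).
import numpy as np

def bald(probs, eps=1e-10):
    # I-hat = H[mean_t p_t] - mean_t H[p_t]: entropy of the averaged
    # prediction minus the average entropy of individual predictions.
    mean_p = probs.mean(axis=0)
    entropy_of_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    return entropy_of_mean - mean_entropy

def mean_std(probs):
    # sigma(x) = (1/C) sum_c std over t of p(y=c|x, w_t).
    return probs.std(axis=0).mean(axis=-1)
```

Both functions score every pool point at once, so an acquisition step reduces to a top-k selection over the returned array.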
In the next section we will experiment with these acquisition functions and assess them empirically. These will be compared to the baseline acquisition function which uniformly acquires new data points from the pool set at random, and to various other techniques for active learning of image data and semi-supervised learning. This is followed by a real-world case study of cancer diagnosis.

5. Active Learning with Bayesian Convolutional Neural Networks

We study the proposed technique for active learning of image data. We compare the various acquisition functions relying on Bayesian CNN uncertainty on a simple image classification benchmark. We then study the importance of model uncertainty by evaluating the same acquisition functions with a deterministic CNN. This is followed by a comparison to a current technique for active learning with image data, which relies on SVMs. We follow with a comparison to the closest modern models to our active learning with image data – semi-supervised techniques with image data. These semi-supervised techniques have access to much more data (the unlabelled data) than our active learning models, yet we still perform comparably to them. Finally, we demonstrate the proposed methodology with a real world application of skin cancer diagnosis from a small number of lesion images, relying on fine-tuning of a large CNN model.

5.1. Comparison of various acquisition functions

We next study all acquisition functions above with our Bayesian CNN trained on the MNIST dataset (LeCun & Cortes, 1998). All acquisition functions are assessed with the same model structure: convolution-relu-convolution-relu-max pooling-dropout-dense-relu-dropout-dense-softmax, with 32 convolution kernels, 4x4 kernel size, 2x2 pooling, a dense layer with 128 units, and dropout probabilities 0.25 and 0.5 (following the example Keras MNIST CNN implementation (fchollet, 2015)); a sketch of this model is given below.
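The following Keras sketch matches the architecture described above. The kernel counts, kernel size, pooling, dense width, and dropout rates follow the text; the optimiser and loss are our assumptions rather than the confirmed training configuration.

```python
# A sketch of the Bayesian CNN used in these experiments (Keras 2 API).
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    Conv2D(32, (4, 4), activation='relu', input_shape=(28, 28, 1)),
    Conv2D(32, (4, 4), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),                    # first dropout layer
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),                     # second dropout layer
    Dense(10, activation='softmax'),  # ten MNIST classes
])
model.compile(loss='categorical_crossentropy', optimizer='adam')  # assumed
```

At test time the same network is used both deterministically (standard predictions, as in §5.2 below) and as a Bayesian CNN (MC dropout predictions, as in §3), depending on whether the dropout layers are kept active.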
All models are trained on the MNIST dataset with a (random but balanced) initial training set of 20 data points, and a validation set of 100 points on which we optimise the weight decay (this is a realistic validation set size, in comparison to the standard validation set size of 5K used in similar applications such as semi-supervised learning on MNIST). We further use the standard test set of 10K points, and the rest of the points are used as a pool set. The test error of each model and each acquisition function was assessed after each acquisition, using the dropout approximation at test time. To decide which data points to acquire, though, we used MC dropout following the derivations above. We repeated the acquisition process 100 times, each time acquiring the 10 points that maximised the acquisition function over the pool set (a sketch of this loop is given below). Each experiment was repeated three times and the results averaged (the standard deviation for the three repetitions is shown below)².

² The code for these experiments is available at http://mlg.eng.cam.ac.uk/yarin/publications.html#Gal2016Active.
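The acquisition procedure described above can be sketched as follows. train_model (retraining from scratch on the current labelled set) is a placeholder for the reader's own training routine; mc_dropout_predict and bald are the illustrative sketches from §3 and §4.

```python
# A sketch of the active learning loop: retrain, score the pool with an
# acquisition function, query the oracle for the best points, repeat.
import numpy as np

def active_learning_loop(train_model, x_train, y_train, x_pool, y_pool,
                         n_acquisitions=100, n_queries=10, T=100):
    for _ in range(n_acquisitions):
        model = train_model(x_train, y_train)     # retrain to convergence
        mean_p, probs = mc_dropout_predict(model, x_pool, T)
        scores = bald(probs)  # or max_entropy(mean_p), variation_ratio(mean_p)
        idx = np.argsort(scores)[-n_queries:]     # the 10 highest scorers

        # The oracle's labels for the selected points join the training
        # set, and the points are removed from the pool.
        x_train = np.concatenate([x_train, x_pool[idx]])
        y_train = np.concatenate([y_train, y_pool[idx]])
        x_pool = np.delete(x_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx, axis=0)
    return x_train, y_train
```

Note that in a real deployment y_pool would not be available up front; here the pool labels stand in for the oracle, as is standard in active learning benchmarks.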
Figure 1. MNIST test accuracy as a function of the number of acquired images from the pool set (up to 1000 images, using validation set size 100, and averaged over 3 repetitions). Four acquisition functions (BALD, Variation Ratios, Max Entropy, and Mean STD) are evaluated and compared to a Random acquisition function.

We compared the acquisition functions BALD, Variation Ratios, Max Entropy, Mean STD, and the baseline Random. We found Random and Mean STD to under-perform compared to BALD, Variation Ratios, and Max Entropy (figure 1). The Variation Ratios acquisition function seems to obtain slightly better accuracy faster than BALD and Max Entropy. It is interesting that Mean STD seems to perform similarly to Random – which samples points at random from the pool set.

Lastly, in table 1 we give the number of acquired images needed to get to test errors of 5% and 10%. As can be seen, BALD, Variation Ratios, and Max Entropy attain a small test error with far fewer acquisitions than Mean STD and Random. This table demonstrates the importance of data efficiency – an expert using the Variation Ratios model, for example, would have to label less than half the number of images she would have had to label had she acquired new images at random.

% error    BALD    Var Ratios    Max Entropy    Mean STD    Random
10%         145           120            165         230       255
5%          335           295            355         695       835

Table 1. Number of acquired images needed to reach 10% and 5% test error on MNIST.

Figure 2. Test accuracy as a function of the number of acquired images for various acquisition functions, using both a Bayesian CNN (red) and a deterministic CNN (blue). Panels: (a) BALD, (b) Var Ratios, (c) Max Entropy.

5.2. Importance of model uncertainty

We assess the importance of model uncertainty in our Bayesian CNN by evaluating three of the acquisition functions (BALD, Variation Ratios, and Max Entropy) with a deterministic CNN. Much like the Bayesian CNN, the deterministic CNN produces a probability vector which can be used with the acquisition functions of §4 (formally, by setting $q_\theta^*(\omega) = \delta(\omega - \theta)$ to be a point mass at the location of the model parameters $\theta$). Such deterministic models can capture aleatoric uncertainty – the noise in the data – but cannot capture epistemic uncertainty – the uncertainty over the parameters of the CNN, which we try to minimise during active learning. The models in this experiment still use dropout, but for regularisation only (i.e. we do not perform MC dropout at test time).

A comparison of the Bayesian models to the deterministic models for the BALD, Variation Ratios, and Max Entropy acquisition functions is given in fig. 2. The Bayesian models, propagating uncertainty throughout the model, attain higher accuracy early on, and converge to a higher accuracy overall. This demonstrates that the uncertainty propagated throughout the Bayesian models has a significant effect on the models' measure of their confidence.

5.3. Comparison to current active learning techniques with image data

We next compare to a method from the sparse existing literature on active learning with image data, concentrating on (Zhu et al., 2003), which relies on a kernel method and further leverages the unlabelled images (this will be discussed in more detail in the next section). Zhu et al. (2003) evaluate an RBF kernel over the raw images to get a similarity graph which can be used to share information about the unlabelled data. Active learning is then performed by greedily selecting unlabelled images to be labelled, such that an estimate of the expected classification error is minimised. This will be referred to as MBR.
MBR was formulated for the binary classification case, hence we compared MBR to the acquisition functions BALD, Variation Ratios, Max Entropy, and Random on a binary classification task (two digits from the MNIST dataset). Classification accuracy is shown in fig. 3. Note that even a random acquisition function, when coupled with a CNN (a specialised model for image data), outperforms MBR, which relies on an RBF kernel. We further experimented with a CNN version of MBR where we replaced the RBF kernel with a CNN. It is interesting to note that this did not give improved results.

Figure 3. MNIST test accuracy (two digit classification) as a function of the number of acquired images, compared to a current technique for active learning of image data: MBR (Zhu et al., 2003).

5.4. Comparison to semi-supervised learning

We continue with a comparison to the closest models (in modern literature) to our active learning with image data: semi-supervised learning with image data. In semi-supervised learning a model is given a fixed set of labelled data, and a fixed set of unlabelled data. The model can use the unlabelled dataset to learn about the distribution of the inputs, in the hopes that this information will aid in learning the mapping to the outputs as well. Several semi-supervised models for image data have been suggested in recent years (Weston et al., 2012; Kingma et al., 2014; Rasmus et al., 2015), models which have set benchmarks on MNIST given a small number of labelled images (1000 random images). These models make further use of a (very) large unlabelled set of 49K images, and a large validation set of 5K-10K labelled images to tune model hyper-parameters and model structure (Rasmus et al., 2015). These models have access to much more data than our active learning models, but we still compare to them as they are the most relevant models in the field given the constraint of small amounts of labelled data.

Test error for our active learning models with various acquisition functions (after the acquisition of 1000 training points), as well as for the semi-supervised models, is given in table 2. In this experiment, to be comparable to the other techniques, we use a validation set of 5K points. Our model attains similar performance to that of the semi-supervised models (although note that we use a fairly small model compared to (Rasmus et al., 2015) for example). Rasmus et al. (2015)'s ladder network (full) attains error 0.84% with 1000 labelled images and 59,000 unlabelled images. However, (Rasmus et al., 2015)'s Γ-model architecture is more directly comparable to ours. The Γ-model attains 1.53% error, compared to the 1.64% error of our Var Ratios acquisition function, which relies on no additional unlabelled data.

Technique                                            Test error
Semi-supervised:
  Semi-sup. Embedding (Weston et al., 2012)               5.73%
  Transductive SVM (Weston et al., 2012)                  5.38%
  MTC (Rifai et al., 2011)                                3.64%
  Pseudo-label (Lee, 2013)                                3.46%
  AtlasRBF (Pitelis et al., 2014)                         3.68%
  DGN (Kingma et al., 2014)                               2.40%
  Virtual Adversarial (Miyato et al., 2015)               1.32%
  Ladder Network (Γ-model) (Rasmus et al., 2015)          1.53%
  Ladder Network (full) (Rasmus et al., 2015)             0.84%
Active learning with various acquisitions:
  Random                                                  4.66%
  BALD                                                    1.80%
  Max Entropy                                             1.74%
  Var Ratios                                              1.64%

Table 2. Test error on MNIST with 1000 labelled training samples, compared to semi-supervised techniques. Active learning has access to only the 1000 acquired images; semi-supervised learning further has access to the remaining images with no labels. Following existing research we use a large validation set of size 5000.
Figure 4. Skin cancer (melanoma) example lesions from the ISIC 2016 melanoma diagnosis dataset. The two lesions on the left are benign (non-cancerous), while the two lesions on the right are malignant (cancerous).

5.5. Cancer diagnosis from lesion image data

We finish by assessing the proposed technique on a real world test case. We experiment with melanoma (skin cancer) diagnosis from dermoscopic lesion images. In this task we are given image data of skin segments, of both malignant (cancerous) as well as benign (non-cancerous) lesions. Our task is to classify the images as malignant or benign (an example is shown in fig. 4). The data used is the ISIC Archive (Gutman et al., 2016). This dataset was collected in order to provide a "large public repository of expertly annotated high quality skin images" to provide clinical support in the identification of skin cancer, and to develop algorithms for skin cancer diagnosis. Specifically, we use the training data of the "ISBI 2016: Skin Lesion Analysis Towards Melanoma Detection – Part 3B: Segmented Lesion Classification" task. The data contains 900 dermoscopic lesion images in JPEG format with EXIF tags removed. Malignancy diagnosis for these lesions was obtained from expert consensus and pathology report information. The data contains lesion segmentation as well, which we did not use.

For our model we replicate the model of (Agarwal et al., 2016). This model achieved second place in the "Part 3B: Segmented Lesion Classification" task, with its code open-sourced. The model relies on data augmentation of the positive examples (flipping the lesions vertically and horizontally), and fine-tunes the VGG16 CNN model (Simonyan & Zisserman, 2015) (i.e. optimises a pre-trained model with a small learning rate). The VGG16 model was pre-trained on ImageNet (Deng et al., 2009). The top layer of the model (1000 logits) was removed and replaced with a 2 dimensional output (for our classification task of malignant/benign). Preceding the last layer are two fully connected layers of size 4096, each one followed by a dropout layer with dropout probability 0.5. This architecture seems to provide good uncertainty estimates, as observed before (Kendall et al., 2015; Gal & Ghahramani, 2016a).
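A sketch of this fine-tuning setup in Keras is given below. The two 4096-unit dense layers, the 0.5 dropout, and the 2-way softmax follow the text; the input size, optimiser, and learning rate are illustrative assumptions rather than the exact configuration of (Agarwal et al., 2016).

```python
# A sketch of the fine-tuned VGG16 model described above (Keras 2 API).
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense, Dropout
from keras.optimizers import SGD

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))   # ImageNet pre-trained weights

x = Flatten()(base.output)
x = Dense(4096, activation='relu')(x)
x = Dropout(0.5)(x)                        # dropout after each dense layer
x = Dense(4096, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(2, activation='softmax')(x)    # malignant / benign

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer=SGD(lr=1e-4),      # small learning rate: fine-tuning
              loss='categorical_crossentropy')
```

The two dropout layers here are what the MC dropout estimates of §3 sample from when logging test performance and computing acquisition scores below.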
The data is unbalanced, containing 727 negative (benign) examples and 173 positive (malignant) examples (20% positive examples). Since the dataset is so small, to assess model performance reliably we have to set aside a large balanced test set. We randomly partition the data, and set aside 100 negative and 100 positive examples. All our experiments are performed on two different random splits – since even a test set size of 200 gives very different accuracy with different random splits. Note that on each such random split we repeat our experiments three times and average the results with respect to the fixed test set.

We experiment with active learning using the following procedure. We begin by creating an initial training set of 80 negative examples and 20 positive examples from our training data, as well as a pool set from the remaining data. With each experiment repetition (out of the three experiment repetitions w.r.t. the fixed test split) the pool is shuffled anew. The positive examples in the current training set are augmented following the original training procedure, and a model is trained on the augmented training set for 100 epochs until convergence. We use batch size 8 and weight decay set to $(1 - p) l^2 / N$, where $N$ is the number of training points, $p = 0.5$ is the dropout probability, and the length-scale squared $l^2$ is set to 0.5. An acquisition function is then used to select the 100 most informative images from the pool set. These points are removed from the pool set and added to the (non-augmented) training set, where we use the original expert-provided labels for these points. The process is repeated until all pool points have been exhausted, where at each acquisition step we reset the model to its original pre-trained weights (as we also did in the previous section's experiments). This reset is done in order to avoid local optima, and to avoid confusing model performance improvement with an improvement resulting from simply using longer (cumulative) optimisation time.

After each acquisition the test performance of the model is logged using MC dropout with 20 samples. We further keep track of the number of positive examples acquired after each acquisition. Model performance is assessed using area-under-the-curve (AUC), as this seems to be the most informative of all the metrics used by Gutman et al. (2016). We experimented with the average precision metric suggested by Gutman et al. (2016) as well, but managed to get results improving over the competition winner by simply predicting all points as "benign". This might be because of the data imbalance. AUC, on the other hand, takes into account all possible decision thresholds for classifying an image as malignant.
We assessed two acquisition functions: a uniform baseline, and BALD. Even though Variation Ratios performs well on MNIST above, the function fails with the melanoma data, since most malignant images are given only a slightly higher probability of being malignant compared to the probability of benign images being malignant. As a result all pool points are given an identical Variation Ratios acquisition value.

Figure 5. AUC (left) as well as the number of acquired positive examples (right) for both the BALD acquisition function and the uniform acquisition function, on the ISIC 2016 melanoma diagnosis dataset. Two random test splits are assessed (top and bottom), and on each test set the experiment was repeated three times with different random seeds (shown: mean with standard error). Panels: (a) AUC as a function of acquisition step, first test split; (b) number of positive examples as a function of acquisition step, first test split; (c) AUC as a function of acquisition step, second test split; (d) number of positive examples as a function of acquisition step, second test split.

Experiment results are given in fig. 5, where results are reported on both test splits (top and bottom), and where with each split the experiment is repeated three times and performance results are averaged on that fixed split. For each test split we report the mean with standard error. AUC is reported for each split (left), and the number of acquired positive examples is reported as well (right) for each acquisition step. BALD achieves better AUC faster than uniform, and acquires more positive examples at each acquisition step than uniform (i.e. BALD finds positive examples informative and adds these to the training set, whereas uniform simply selects positive examples from the pool set based on their frequency).

Note how the AUC range varies wildly between the two different test splits, but how AUC is similar for both acquisition functions on each fixed test set before the initial acquisition (when both uniform and BALD models are trained on the same initial training set). This demonstrates the difficulties of handling small data: each test split gives radically different results, and in this case, even though each acquisition function experiment has a relatively small standard error, averaging the AUC of the acquisition functions over the different test splits would artificially increase the standard error. Lastly, it is interesting to experiment with a model trained over the entire pool set, i.e. with the settings of the second place winner in the ISIC2016 task. For the first test split this model attains AUC 0.71 ± 0.003, whereas with the second test split it attains AUC 0.75 ± 0.01. For both test splits this AUC is worse than BALD's converged AUC after 4 acquisition steps. This might be because BALD avoided selecting noisy points – nearby images for which there exist multiple noisy labels of different classes. Such points have large aleatoric uncertainty – uncertainty which cannot be explained away – rather than large epistemic uncertainty – the uncertainty which BALD captures in order to explain it away, i.e. reduce it.
6. Future Research

We presented a new approach for active learning of image data, relying on recent advances at the intersection of Bayesian modelling and deep learning, and demonstrated a real-world application in medical diagnosis. We assessed the performance of the techniques by resetting the models after each acquisition, and training them again to convergence. This was done to isolate the effects of our acquisition functions, which came at a cost of prolonged training times (20 hours for each melanoma experiment, for example). We showed that even with this long running time, our technique still reduces the number of required expert labels, and thus reduces the costs of such a system. This running time can be reduced further by not resetting the system – with the potential price of falling into local optima. We leave this problem for future research.

References

Agarwal, Mohit, Damaraju, Nandita, and Chaieb, Sahbi. DL8803. https://github.com/NanditaDamaraju/DL8803, 2016.

Cohn, David A, Ghahramani, Zoubin, and Jordan, Michael I. Active learning with statistical models. Journal of Artificial Intelligence Research, 1996.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

fchollet. Keras. https://github.com/fchollet/keras, 2015.

Freeman, Linton G. Elementary Applied Statistics, 1965.

Gal, Yarin. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

Gal, Yarin and Ghahramani, Zoubin. Bayesian convolutional neural networks with Bernoulli approximate variational inference. ICLR workshop track, 2016a.

Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. ICML, 2016b.

Gutman, David, Codella, Noel CF, Celebi, Emre, Helba, Brian, Marchetti, Michael, Mishra, Nabin, and Halpern, Allan. Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC). arXiv preprint arXiv:1605.01397, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Hernandez-Lobato, Jose Miguel and Adams, Ryan. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1861–1869, 2015.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Holub, Alex, Perona, Pietro, and Burl, Michael C. Entropy-based active learning for object recognition. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW'08. IEEE Computer Society Conference on, pp. 1–8. IEEE, 2008.

Houlsby, Neil, Huszár, Ferenc, Ghahramani, Zoubin, and Lengyel, Máté. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.

Joshi, Ajay J, Porikli, Fatih, and Papanikolopoulos, Nikolaos. Multi-class active learning for image classification. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 2372–2379. IEEE, 2009.

Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Kalchbrenner, Nal and Blunsom, Phil. Recurrent continuous translation models. In EMNLP, 2013.

Kampffmeyer, Michael, Salberg, Arnt-Borre, and Jenssen, Robert. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.

Kendall, Alex, Badrinarayanan, Vijay, and Cipolla, Roberto. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.

Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Yann and Cortes, Corinna. The MNIST database of handwritten digits, 1998.

LeCun, Yann, Boser, Bernhard, Denker, John S, Henderson, Donnie, Howard, Richard E, Hubbard, Wayne, and Jackel, Lawrence D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Lee, Dong-Hyun. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, 2013.

Li, Xin and Guo, Yuhong. Adaptive active learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 859–866, 2013.

Marcus, Daniel S, Fotenos, Anthony F, Csernansky, John G, Morris, John C, and Buckner, Randy L. Open access series of imaging studies: longitudinal MRI data in nondemented and demented older adults. Journal of Cognitive Neuroscience, 22(12):2677–2684, 2010.

Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Nakae, Ken, and Ishii, Shin. Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677, 2015.

Pitelis, Nikolaos, Russell, Chris, and Agapito, Lourdes. Semi-supervised learning using an unsupervised atlas. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 565–580. Springer, 2014.

Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and Raiko, Tapani. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.

Rifai, Salah, Dauphin, Yann N, Vincent, Pascal, Bengio, Yoshua, and Muller, Xavier. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pp. 2294–2302, 2011.

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.

Shannon, Claude Elwood. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.

Sundermeyer, Martin, Schlüter, Ralf, and Ney, Hermann. LSTM neural networks for language modeling. In INTERSPEECH, 2012.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc VV. Sequence to sequence learning with neural networks. In NIPS, 2014.

Tong, Simon. Active Learning: Theory and Applications. PhD thesis, 2001. AAI3028187.

Weston, Jason, Ratle, Frédéric, Mobahi, Hossein, and Collobert, Ronan. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Springer, 2012.

Zhu, X, Lafferty, J, and Ghahramani, Z. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data, pp. 58–65. ICML, 2003.