Machine Learning II BAI702
Unit 3
Chapter 13
Decision by Committee: Ensemble Learning
Ensemble learning is a machine learning technique where multiple models (called base
learners or weak learners) are combined to produce a stronger and more accurate model.
Instead of relying on a single model, ensembles reduce errors, improve generalization, and
increase robustness.
Figure 13.1 shows the basic idea of ensemble learning, as these methods are collectively called.
Given a relatively simple binary classification problem and some learner that puts an ellipse
around a subset of the data, combining the ellipses can provide a considerably more complex
decision boundary.
Interestingly, ensemble methods do very well when there is very little data as well as when
there is too much. We used cross validation when there was not enough data to go around, and
trained lots of neural networks on different subsets of the data. Then we threw away most of
them. With an ensemble method we keep them all, and combine their results in some way. One
very simple way to combine the results is to use majority voting — if it’s good enough for
electing governments, it's good enough for machine learning.
13.1 BOOSTING
Boosting trains models sequentially, where each new model focuses on correcting the
mistakes of the previous ones.
Prediction: Weighted majority voting (classification) or weighted sum (regression).
Popular algorithms:
o AdaBoost (Adaptive Boosting)
o Gradient Boosting Machines (GBM)
o XGBoost (Extreme Gradient Boosting)
o LightGBM
o CatBoost
Advantage: Reduces both bias and variance, often achieves state-of-the-art
performance.
Disadvantage: More prone to overfitting than bagging if not regularized.
The algorithm was first described in the mid-1990s by Freund and Schapire, and while it has
had many variations derived from it, the principal algorithm is still one of the most widely
used. The algorithm was proposed as an improvement on the original 1990 boosting algorithm,
which was rather data hungry. In that algorithm, the training set was split into three. A classifier
was trained on the first third, and then tested on the second third. All of the data that was
misclassified during that testing was used to form a new dataset, along with an equally sized
random selection of the data that was correctly classified. A second classifier was trained on
this new dataset, and then both of the classifiers were tested on the final third of the dataset. If
they both produced the same output, then that datapoint was ignored, otherwise the datapoint
was added to yet another new dataset, which formed the training set for a third classifier.
13.1.1 AdaBoost
The innovation that AdaBoost (which stands for adaptive boosting) uses is to give weights to
each datapoint according to how difficult previous classifiers have found it to classify correctly. These
weights are given to the classifier as part of the input when it is trained. The AdaBoost
algorithm is conceptually very simple. At each iteration a new classifier is trained on the
training set, with the weights that are applied to the training set for each datapoint being
modified at each iteration according to how successfully that datapoint has been classified in
the past. The weights are initially all set to the same value, 1/N, where N is the number of
datapoints in the training set. Then, at each iteration, the error (ϵ) is computed as the sum of
the weights of the misclassified points, and the weights for incorrect examples are updated by
being multiplied by α = (1 − ϵ)/ϵ. Weights for correct examples are left alone, and then the
whole set is normalised so that it sums to 1 (which is effectively a reduction in the importance
of the correctly classified datapoints). Training terminates after a set number of iterations, or
when either all of the datapoints are classified correctly, or one point contains more than half
of the available weight. Figure 13.2 shows the effect of weighting incorrectly classified
examples as training proceeds, with the size of each datapoint being a measure of its
importance.
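To make these update rules concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps as the weak learners. The stump-fitting routine, the stopping checks, and the use of log(alpha) as the voting weight are illustrative choices rather than code from the text.

import numpy as np

def fit_stump(X, t, w):
    # Find the single-feature threshold split with the smallest weighted error.
    best = (np.inf, 0, 0.0, 1)               # (error, feature, threshold, polarity)
    for f in range(X.shape[1]):
        for thresh in np.unique(X[:, f]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, f] - thresh) >= 0, 1, -1)
                err = np.sum(w[pred != t])
                if err < best[0]:
                    best = (err, f, thresh, polarity)
    return best

def adaboost(X, t, n_rounds=20):
    # AdaBoost for labels t in {-1, +1}, following the weighting scheme described above.
    N = X.shape[0]
    w = np.ones(N) / N                        # all weights start at 1/N
    ensemble = []
    for _ in range(n_rounds):
        err, f, thresh, polarity = fit_stump(X, t, w)
        if err == 0 or err >= 0.5:            # perfect, or no better than chance: stop
            break
        alpha = (1 - err) / err               # alpha = (1 - epsilon)/epsilon, as in the text
        pred = np.where(polarity * (X[:, f] - thresh) >= 0, 1, -1)
        w[pred != t] *= alpha                 # increase the weights of the mistakes
        w /= w.sum()                          # renormalise so the weights sum to 1
        ensemble.append((np.log(alpha), f, thresh, polarity))
    return ensemble

def adaboost_predict(ensemble, X):
    # Weighted vote of the stumps, using log(alpha) as each stump's voting weight.
    total = np.zeros(X.shape[0])
    for log_alpha, f, thresh, polarity in ensemble:
        total += log_alpha * np.where(polarity * (X[:, f] - thresh) >= 0, 1, -1)
    return np.sign(total)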
Most of the work of the algorithm is done by the classification algorithm, which is given new
weights at each iteration. In this respect, boosting is not quite a stand-alone algorithm: the
classifiers need to consider the weights when they perform their classifications. It is not always
obvious how to do this for a particular classifier, but we have seen methods of doing it for a
few classifiers.
As a very simple example showing how boosting works, a classifier was created that can only
separate the data by fitting a single horizontal or vertical line, choosing which orientation to
fit at each iteration at random. A two-dimensional dataset was created with data
in the top right-hand corner being in one class, and the rest in another, plus a couple of the
datapoints were randomly mislabelled to simulate noise. Clearly, this dataset cannot be
separated by a single horizontal or vertical decision boundary. However, Figure 13.3 shows the
output of the classifier on an independent test set, where the algorithm gets only one datapoint
wrong, and that is one that is coincidentally close to one of the ‘noisy’ datapoints in the training
data.
Figure 13.4 shows the training data, the error curve on both the training and testing sets, and
the first few iterations of the classifier, which can only put in one horizontal or vertical
classification line. Clearly, such impressive results require some explanation and
understanding. The key to this understanding is to compute the loss function, which is simply
the measure of the error that is applied (we have been using a sum-of-squares loss function for
many algorithms in the book). The loss function for AdaBoost is the exponential loss which, writing the targets as t_n = ±1 and the weighted sum of hypotheses as f(x) = Σ_k α_k h_k(x), has the form G = Σ_n exp(−t_n f(x_n)).
Deriving the rest of the algorithm from here requires substituting in for the hypotheses h and
then solving for α, which produces the full algorithm. Interestingly, this is not the way that
AdaBoost was created; this understanding of why it works so well came later. It is possible to
choose other loss functions, and providing that they are differentiable they will provide useful
boosting-like algorithms, which are collectively known as arcing algorithms (for adaptive
reweighting and combining). AdaBoost can be modified to perform regression rather than
classification.
13.1.2 Stumping
There is a very extreme form of boosting that is applied to trees. It goes by the descriptive name
of stumping. The stump of a tree is the tiny piece that is left over when you chop off the rest,
and the same is true here: stumping consists of simply taking the root of the tree and using that
as the decision maker. So for each classifier you use the very first question that makes up the
root of the tree, and that is it. Often, this does only a little better than chance on the whole dataset, but by
using the weights to sort out when that classifier should be used, and to what extent, as opposed
to the other ones, the overall output of stumping can be very successful.
13.2 BAGGING
The simplest method of combining classifiers is known as bagging, which stands for bootstrap
aggregating, the statistical description of the method. This is fine if you know what a bootstrap
is, but fairly useless if you don’t. A bootstrap sample is a sample taken from the original dataset
with replacement, so that we may get some data several times and others not at all. The
bootstrap sample is the same size as the original, and lots and lots of these samples are taken:
B of them, where B is at least 50, and could even be in the thousands.
Bootstrap sampling seems like a very strange thing to do. We’ve taken a perfectly good dataset,
mucked it up by sampling from it, which might be good if we had made a smaller dataset (since
it would be faster), but we still ended up with a dataset the same size. Worse, we’ve done it lots
of times. Surely this is just a way to burn up computer time without gaining anything. The
benefit of it is that we will get lots of learners that perform slightly differently, which is exactly
what we want for an ensemble method. A NumPy implementation is shown next, and then we
will look at a simple example:
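A minimal sketch of such an implementation follows, using scikit-learn decision trees as the base learners (an illustrative substitution for the book's own tree code) and assuming integer class labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, t, n_trees=50):
    # Train one tree per bootstrap sample (N draws with replacement).
    N = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = np.random.randint(0, N, N)      # bootstrap sample, same size as the data
        tree = DecisionTreeClassifier()
        tree.fit(X[idx], t[idx])
        trees.append(tree)
    return trees

def bagging_predict(trees, X):
    # Majority vote over the ensemble, one column of votes per datapoint.
    votes = np.array([tree.predict(X) for tree in trees])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])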
13.2.1 Subagging (Subsample Aggregating)
The method of subagging wins the prize for the oddest-sounding word. It is a combination of
'subsample' and 'bagging', and it is the fairly obvious idea that you don't need to produce
samples that are the same size as the original data. It is a variant of bagging: instead of using
bootstrap samples (with replacement), subagging draws subsamples without replacement from
the training data. A short sketch in NumPy is given after the steps below.
Steps in Subagging
1. Take your dataset of size N.
2. Randomly choose a subsample of size m < N (without replacement).
3. Train a base learner (like a decision tree) on this subsample.
4. Repeat the process many times to create multiple base learners.
5. Aggregate their predictions:
o Classification → majority voting
o Regression → average predictions
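A minimal sketch of these steps, in the same style as the bagging code above; the default subsample size m = N/2 is just a common illustrative choice:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def subagging(X, t, n_trees=50, m=None):
    # Like bagging, but each learner sees a subsample drawn without replacement.
    N = X.shape[0]
    if m is None:
        m = N // 2                            # a common choice is half the dataset
    trees = []
    for _ in range(n_trees):
        idx = np.random.permutation(N)[:m]    # m distinct indices: no replacement
        tree = DecisionTreeClassifier()
        tree.fit(X[idx], t[idx])
        trees.append(tree)
    return trees                              # aggregate with bagging_predict as before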
13.3 RANDOM FORESTS
If there is one method in machine learning that has grown in popularity over the last few years,
then it is the idea of random forests. The idea is largely that if one tree is good, then many trees
(a forest) should be better, provided that there is enough variety between them. The most
interesting thing about a random forest is the ways that it creates randomness from a standard
dataset.
The first of the methods that it uses is the one that we have just seen: bagging. If we wish to
create a forest then we can make the trees different by training them on slightly different data,
so we take bootstrap samples from the dataset for each tree. The second source of randomness
is in the features: at each node, a random subset of the features is given to the tree, and it can
only pick from that subset rather than from the whole
set. As well as increasing the randomness in the training of each tree, it also speeds up the
training, since there are fewer features to search over at each stage. Of course, it does introduce
a new parameter (how many features to consider), but the random forest does not seem to be
very sensitive to this parameter; in practice, a subset size that is the square root of the number
of features seems to be common.
Once the set of trees is trained, the output of the forest is the majority vote for classification,
as with the other committee methods that we have seen, or the mean response for regression.
And those are pretty much the main features needed for creating a random forest. The algorithm
is given next before we see some results of using the random forest.
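A sketch of the algorithm, combining the two ingredients just described (a bootstrap sample per tree, and a random subset of the features at each split); the per-node feature sampling is handled here by scikit-learn's max_features option, defaulting to the square-root rule mentioned above, and the function names are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, t, n_trees=50, m='sqrt', max_depth=None):
    # Each tree: a bootstrap sample, and only m features considered at each split.
    N = X.shape[0]
    forest = []
    for _ in range(n_trees):
        idx = np.random.randint(0, N, N)                   # bootstrap sample
        tree = DecisionTreeClassifier(max_features=m,      # random feature subset per node
                                      max_depth=max_depth)
        tree.fit(X[idx], t[idx])
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    # Majority vote for classification (take the mean of the predictions for regression).
    votes = np.array([tree.predict(X) for tree in forest])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])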
13.3.1 Comparison with Boosting
There are some obvious similarities to boosting, but it is the differences that are most telling.
The most general thing is that boosting is exhaustive, in that it searches over the whole set of
features at each stage, and each stage depends on the previous one. This means that boosting
has to run sequentially, and the individual steps can be expensive to run. By way of contrast,
the parallelism of the random forest and the fact that it only searches over a fairly small set of
features at each stage speed the algorithm up a lot. Since the algorithm only searches a small
subset of the features at each stage, it cannot be expected to be as good as boosting for the same
number of trees. However, since the trees are cheaper to train, we can make more of them in
the same computational time, and often the results are amazingly good even on very large and
complicated datasets.
The implementation of this is very easy: we modify the decision tree to take an extra parameter,
which is m, the number of features that should be used in the selection set at each stage. We
will look at an example of using it shortly as a comparison to boosting.
As a brief example of using the random forest, we start by demonstrating that the random forest
gets the correct results on the Party example that has been used in both this and the previous
chapters, based on 10 trees, each trained on 7 samples, and with just two levels allowed in each
tree:
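The Party data itself is not listed here, so the call below only illustrates how a forest with those settings might be run, using scikit-learn's RandomForestClassifier and small placeholder data standing in for the real dataset:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the Party example
rng = np.random.default_rng(0)
X_party = rng.random((20, 6))
t_party = (X_party[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=10,   # 10 trees
                                max_samples=7,     # each tree trained on 7 samples
                                max_depth=2,       # two levels allowed in each tree
                                bootstrap=True)
forest.fit(X_party, t_party)
print(forest.score(X_party, t_party))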
13.4 DIFFERENT WAYS TO COMBINE CLASSIFIERS
Bagging puts most of its effort into ensuring that the different classifiers see different data,
since they see different samples of the data. This is different than boosting, where the data stays
the same, but the importance of each datapoint changes for the different classifiers, since they
each get different weights according to how well the previous classifiers have performed. Just
as important for an ensemble method, though, is how it combines the outputs of the different
classifiers.
Both boosting and bagging take a vote from amongst the classifiers, although they do it in
different ways: boosting takes a weighted vote, while bagging simply takes the majority vote.
There are other alternatives to these methods, as well. In fact, even majority voting is not
necessarily simple. Some classification systems will only produce an output where all the
classifiers agree, or more than half of them agree, whereas others simply take the most common
output, which is what we usually mean by majority voting.
The idea of not always producing an output is to ensure that the ensemble does not produce
outputs that are contentious, because they are probably difficult datapoints. If the number of
classifiers is odd and the classifiers are each independent of each other, then majority voting
will return the correct label if more than half of the classifiers agree. Assuming that each
individual classifier has a success rate of p, the probability that an ensemble of T such classifiers
gets the correct answer is given by the binomial sum P(correct) = Σ_{k > T/2} (T choose k) p^k (1 − p)^(T−k),
which is larger than p whenever p > 0.5 and approaches 1 as T grows.
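As a quick numerical check of this formula, the snippet below evaluates the sum for a hypothetical ensemble of T = 5 independent classifiers that are each correct 70% of the time:

from math import comb

def majority_vote_accuracy(T, p):
    # Probability that more than half of T independent classifiers are correct.
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

print(majority_vote_accuracy(5, 0.7))   # about 0.837, better than any single classifier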
Chapter 14
Unsupervised Learning
14.1 THE K-MEANS ALGORITHM
If you have ever watched a group of tourists with a couple of tour guides who hold umbrellas
up so that everybody can see them and follow them, then you have seen a dynamic version of
the k-means algorithm. Our version is simpler, because the data (playing the part of the tourists)
does not move, only the tour guides. Suppose that we want to divide our input data into k
categories, where we know the value of k (for example, we have a set of medical test results
from lots of people for three diseases, and we want to see how well the tests identify the three
diseases). We allocate k cluster centres to our input space, and we would like to position these
centres so that there is one cluster centre in the middle of each cluster. However, we don’t know
where the clusters are, let alone where their ‘middle’ is, so we need an algorithm that will find
them.
Learning algorithms generally try to minimise some sort of error, so we need to think of an
error criterion that describes this aim. The idea of the ‘middle’ is the first thing that we need to
think about. How do we define the middle of a set of points? There are actually two things that
we need to define: A distance measure in order to talk about distances between points, we need
some way to measure distances.
The mean average Once we have a distance measure, we can compute the central point of a
set of datapoints, which is the mean average (if you aren’t convinced, think what the mean of
two numbers is, it is the point halfway along the line between them). Actually, this is only true
in Euclidean space, which is the one you are used to, where everything is nice and flat.
Everything becomes a lot trickier if we have to think about curved spaces; when we have to
worry about curvature, the Euclidean distance metric isn’t the right one, and there are at least
two different definitions of the mean. So we aren’t going to worry about any of these things,
and we’ll assume that space is flat. This is what statisticians do all the time.
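To make the procedure concrete, here is a minimal NumPy sketch of k-means under those flat-space assumptions (Euclidean distance and the ordinary mean); the initialisation from random datapoints and the fixed iteration count are illustrative choices:

import numpy as np

def kmeans(data, k, n_iterations=100):
    # Start the centres at k randomly chosen datapoints.
    centres = data[np.random.choice(data.shape[0], k, replace=False)].astype(float)
    for _ in range(n_iterations):
        # Squared Euclidean distance from every point to every centre.
        distances = ((data[:, np.newaxis, :] - centres[np.newaxis, :, :]) ** 2).sum(axis=2)
        nearest = distances.argmin(axis=1)
        # Move each centre to the mean of the points assigned to it.
        for c in range(k):
            if np.any(nearest == c):           # leave a centre alone if it owns no points
                centres[c] = data[nearest == c].mean(axis=0)
    return centres, nearest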
To see how this works in practice, Figures 14.1 and 14.2 show some data and some different
ways to cluster that data computed by the k-means algorithm. It should be clear that the
algorithm is susceptible to local minima: depending upon where the centres are initially
positioned in the space, you can get very different solutions, and many of them look very
unlikely to our eyes. Figure 14.2 shows examples of what happens when you choose the
number of centres wrongly. There are certainly cases where we don’t know in advance how
many clusters we will see in the data, but the k-means algorithm doesn’t deal with this at all
well. By running the algorithm with lots of different values of k, we can see which values give
us the best solution. Of course, we need to be careful with this. If we still just measure the sum-
of-squares error between each datapoint and its nearest cluster centre, then when we set k to be
equal to the number of datapoints, we can position one centre on every datapoint, and the sum-
of-squares error will be zero.
14.1.1 Dealing with Noise
There are lots of reasons for performing clustering, but one of the more common ones is to deal
with noisy data readings. These might be slightly corrupted, or occasionally just plain wrong.
If we can choose the clusters correctly, then we have effectively removed the noise, because
we replace each noisy datapoint by the cluster centre. Unfortunately, the mean average, which
is central to the k-means algorithm, is very susceptible to outliers, i.e., very noisy
measurements. One way to avoid the problem is to replace the mean average with the median,
which is what is known as a robust statistic, meaning that it is not affected by outliers (the mean
of (1,2,1,2,100) is 21.2, while the median is 2). The only change that is needed to the algorithm
is to replace the computation of the mean with the computation of the median. This is
computationally more expensive, as we’ve discussed previously, but it does remove noise
effectively.
14.1.2 The k-Means Neural Network
The k-means algorithm clearly works, despite
its problems with noise and the difficulty with choosing the number of clusters. Interestingly,
while it might seem a long way from neural networks, it isn’t. If we think about the cluster
centres that we optimise the positions of as locations in weight space, then we could position
neurons in those places and use neural network training. The computation that happened in the
k-means algorithm was that each input decided which cluster centre it was closest to by
calculating the distance to all of the centres.
So, we can implement the k-means algorithm using a set of neurons. We will use just one layer
of neurons, together with some input nodes, and no bias node. The first layer will be the inputs,
which don’t do any computation, as usual, and the second layer will be a layer of competitive
neurons, that is, neurons that ‘compete’ to fire, with only one of them actually succeeding. Only
one cluster centre can represent a particular input vector, and so we will choose the neuron with
the highest activation h to be the one that fires.
We will choose k neurons (for hopefully obvious reasons) and fully connect the inputs to the
neurons, as usual. There is a picture of this network in Figure 14.3. We will use neurons with a
linear transfer function, computing the activation of each neuron as simply the product of its
weights and the inputs: h_i = Σ_j w_ij x_j, that is, the dot product of that neuron's weight vector and the input vector.
Providing that the inputs are normalised so that their absolute size is the same (a point that
we’ll come back to in Section 14.1.3), this effectively measures the distance between the input
vector and the cluster centre represented by that neuron, with larger numbers (higher
activations) meaning that the two points are closer together. So the winning neuron is the one
that is closest to the current input. The question is how can we then change the position of that
neuron in weight space, that is, how do we update its weights? In the k-means algorithm that
was described earlier it was easy: we just set the cluster centre to be the mean of all the
datapoints that were assigned to that centre. However, when we do neural network training, we
often feed in just one input vector at a time and change the weights.
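A minimal sketch of this on-line, winner-takes-all version is given below. The rule of moving only the winning neuron towards the current input by a small amount, Δw = η(x − w), is the standard competitive-learning update and is assumed here, as is the learning rate value; the weights are kept normalised, for the reasons discussed in the next section.

import numpy as np

def train_competitive(data, k, eta=0.1, n_epochs=10):
    # One layer of k competitive neurons; winner = highest activation (dot product).
    weights = np.random.rand(k, data.shape[1])
    weights /= np.linalg.norm(weights, axis=1, keepdims=True)      # unit-length weights
    for _ in range(n_epochs):
        for x in np.random.permutation(data):                      # one input at a time
            activations = weights @ x                              # linear transfer function
            winner = np.argmax(activations)
            weights[winner] += eta * (x - weights[winner])         # move winner towards input
            weights[winner] /= np.linalg.norm(weights[winner])     # keep it on the unit sphere
    return weights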
14.1.3 Normalisation
Suppose that the weights of all the neurons are small (maybe less than 1) except for those to
one particular neuron. We’ll make those weights be 10 for the example. If an input vector with
values (0.2,0.2,−0.1) is presented, and it happens to be an exact match for one of the neurons,
then the activation of that neuron will be 0.2×0.2+0.2×0.2+−0.1×−0.1 = 0.09.
The other neurons are not perfect matches, so their activations should all be less. However,
consider the neuron with large weights. Its activation will be 10×0.2+10×0.2+10×−0.1 = 3, and
so it will be the winner. Thus, we can only compare activations if we know that the weights for
all of the neurons are the same size. We do this by insisting that the weight vector is normalised
so that the distance between the vector and the origin (the point (0,0,...0)) is one. This means
that all of the neurons are positioned on the unit hypersphere, which we met when we talked about the curse
of dimensionality: it is the set of all points that are distance one from the origin, so it is a circle
in 2D, a sphere in 3D (as shown in Figure 14.4), and a hypersphere in higher dimensions.
Computing this normalisation in NumPy takes a little bit of care because we are normalising
the total Euclidean distance from the origin, and the sum and division are row-wise rather than
column-wise, which means that the matrix has to be transposed before and after the division:
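A short sketch of that computation, assuming the weights are stored with one neuron per row and one column per input dimension (so the norms are row-wise sums):

import numpy as np

weights = np.random.rand(5, 3)                      # hypothetical weights: 5 neurons, 3 inputs

# Euclidean length of each neuron's weight vector (a row-wise sum).
normalisers = np.sqrt(np.sum(weights**2, axis=1))

# Broadcasting divides along the last axis, so transpose, divide, and transpose back.
weights = np.transpose(np.transpose(weights) / normalisers)

print(np.sqrt(np.sum(weights**2, axis=1)))          # every row now has unit length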
14.1.6 Using Competitive Learning for Clustering
Deciding which cluster any datapoint belongs to is now an easy task: we present it to the trained
algorithm and see which node is activated. If we don’t have any target data, then the problem is
finished. However, for many problems we might want to interpret the best matching cluster as
a class label (alternatively, a set of cluster centres could all correspond to one class). This is
fine, since if we have target data we can match the output classes to the targets, provided that
we are a bit careful: there is no reason why the order of the nodes in the network should match
the order in the data, since the algorithm knows nothing about that order. For that reason, when
assigning class labels to the outputs, you need to check which numbers match up carefully, or
the results will look a lot worse than they actually are.
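One simple way to do that check, assuming integer target labels are available, is to give each cluster the majority label of the points assigned to it; this helper is an illustrative sketch rather than code from the text:

import numpy as np

def match_clusters_to_classes(cluster_ids, targets, k):
    # Assign each of the k clusters the most common target label among its members.
    mapping = np.zeros(k, dtype=int)
    for c in range(k):
        members = targets[cluster_ids == c]
        if members.size > 0:
            mapping[c] = np.bincount(members).argmax()
    return mapping                     # relabelled predictions are then mapping[cluster_ids]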
There is an alternative solution to this problem of assigning labels, and it is one that we have
seen before: the radial basis function (RBF) network. The k-means part positions the RBFs in the input space, so that they represent the
input data well. A Perceptron is then used on top of this in order to provide the match to the
outputs in the supervised learning part of the network. Since this is now supervised learning, it
ensures that the output categories match the target data classes. It also means that you can use
lots of clusters in the k-means network without having to work out which datapoints belong to
which cluster, since the Perceptron will do this for you.