Ensemble Methods
• As an alternative to increasing the performance
of a single model, it is possible to combine
several models to form a powerful team.
• Since each model brings a unique bias to a learning task, it may readily learn one subset of examples but have trouble with another.
• Therefore, by intelligently using the talents of
several diverse team members, it is possible to
create a strong team of multiple weak learners.
• This technique of combining and managing the predictions of multiple models falls into a wider set of meta-learning methods, a broad term for techniques that involve learning how to learn.
• This includes even simple algorithms that gradually improve performance by iterating over design decisions.
• In the context of ensembles, meta-learning pertains to modeling the relationship between the predictions of several models and the desired outcome.
Understanding ensembles
• The meta-learning approach that utilizes this principle of creating a varied team of experts is known as an ensemble.
• First, input training data is used to build a
number of models.
• The allocation function dictates how much of
the training data each model receives.
• The allocation function can increase diversity by artificially varying the input data to bias the resulting learners, even if they are of the same type.
• For instance, it might use bootstrap sampling to construct unique training datasets or pass on a different subset of features or examples to each model.
• On the other hand, if the ensemble already
includes a diverse set of algorithms—such as a
neural network, a decision tree, and a k-NN
classifier—the allocation function might pass
the data on to each algorithm relatively
unchanged.
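• A minimal sketch of an allocation function of the first kind, where bootstrap sampling gives each of several identical decision trees a slightly different view of the data (the dataset and the choice of five trees are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_breast_cancer   # illustrative dataset
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

models = []
for _ in range(5):
    # Allocation: each tree sees a bootstrap sample (rows drawn with replacement)
    rows = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[rows], y[rows]))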
• After the models are constructed, they can be used
to generate a set of predictions, which must be
managed in some way.
• The combination function governs how
disagreements among the predictions are
reconciled.
• For example, the ensemble might use a majority
vote to determine the final prediction, or it could
use a more complex strategy such as weighting each
model's votes based on its prior performance.
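• A sketch of these two combination strategies, assuming the models' class predictions and their prior accuracies are already available as arrays (the numbers are made up for illustration):

import numpy as np

# Hypothetical predictions from three models for five examples (classes 0/1)
preds = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 1, 1, 1]])
accuracy = np.array([0.90, 0.75, 0.60])   # assumed prior performance

# Majority vote: the class predicted by most of the models wins
majority = (preds.mean(axis=0) > 0.5).astype(int)

# Weighted vote: each model's vote counts in proportion to its accuracy
weighted = (accuracy @ preds / accuracy.sum() > 0.5).astype(int)

print(majority, weighted)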
• Some ensembles even utilize another model to learn a combination function from various combinations of predictions.
• For example, suppose that when M1 and M2 both
vote yes, the actual class value is usually no.
• In this case, the ensemble could learn to ignore the
vote of M1 and M2 when they agree.
• This process of using the predictions of several
models to train a final arbiter model is known as
stacking.
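• A minimal stacking sketch using scikit-learn's StackingClassifier; the base models and the logistic-regression arbiter are an illustrative choice, not a prescribed recipe:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models whose predictions become the features for the arbiter model
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))
print(stack.fit(X_train, y_train).score(X_test, y_test))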
• One of the benefits of using ensembles is that
they may allow one to spend less time in
pursuit of a single best model. Instead, one
can train a number of reasonably strong
candidates and combine them.
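• For instance, a few reasonable candidates can be combined directly with scikit-learn's VotingClassifier; the particular learners below are an arbitrary illustrative team:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three diverse, reasonably strong candidates combined by majority vote
team = VotingClassifier([
    ("lr", make_pipeline(StandardScaler(), LogisticRegression())),
    ("tree", DecisionTreeClassifier(max_depth=4)),
    ("nb", GaussianNB()),
])
print(cross_val_score(team, X, y, cv=5).mean())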
Ensemble Methods
• Bagging
• Boosting
• Random forest
Bagging
• One of the first ensemble methods to gain
widespread acceptance used a technique called
bootstrap aggregating or bagging for short.
• Bagging generates a number of training datasets by
bootstrap sampling the original training data. These
datasets are then used to generate a set of models
using a single learning algorithm.
• The models' predictions are combined using voting
(for classification) or averaging (for numeric
prediction).
• It can perform quite well as long as it is used
with relatively unstable learners, that is, those
generating models that tend to change
substantially when the input data changes
only slightly.
• For this reason, bagging is often used with
decision trees, which have the tendency to
vary dramatically given minor changes in the
input data.
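• A bagging sketch using scikit-learn's BaggingClassifier with decision trees as the unstable base learner; the dataset and the choice of 25 trees are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 25 trees, each fit to a different bootstrap sample of the training data;
# their class predictions are combined by voting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())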
Boosting
• Another common ensemble-based method is
called boosting because it boosts the
performance of weak learners to attain the
performance of stronger learners.
• Similar to bagging, boosting uses ensembles of
models trained on resampled data and a vote
to determine the final prediction. There are
two key distinctions.
• First, the resampled datasets in boosting are
constructed specifically to generate
complementary learners.
• Second, rather than giving each learner an
equal vote, boosting gives each learner's vote
a weight based on its past performance.
• Models that perform better have greater
influence over the ensemble's final prediction.
• Boosting results in performance that is often better than, and certainly no worse than, that of the best model in the ensemble.
• Since the models in the ensemble are built to be
complementary, it is possible to increase ensemble
performance to an arbitrary threshold simply by adding
additional classifiers to the group, assuming that each
classifier performs better than random chance.
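• As a rough illustration of this claim, and under the strong added assumption that the classifiers' errors are independent, the accuracy of a majority vote of n classifiers that are each right 60% of the time grows quickly with n:

from scipy.stats import binom

# Each classifier alone is right 60% of the time; errors assumed independent
p = 0.6
for n in (1, 5, 25, 101):
    # The majority vote is correct when more than half of the n votes are correct
    print(n, 1 - binom.cdf(n // 2, n, p))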
• Given the obvious utility of this finding, boosting is thought to be one of the most significant discoveries in machine learning.
AdaBoost or adaptive boosting
• The algorithm is based on the idea of
generating weak learners that iteratively learn
a larger portion of the difficult-to-classify
examples by paying more attention (that is,
giving more weight) to frequently misclassified
examples.
• Beginning from an unweighted dataset, the first
classifier attempts to model the outcome.
• Examples that the classifier predicted correctly will
be less likely to appear in the training dataset for
the following classifier, and conversely, the
difficult-to-classify examples will appear more
frequently.
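• A sketch of this reweighting step; the update rule below follows the standard AdaBoost formula (not stated on the slide), with illustrative weights and outcomes:

import numpy as np

def reweight(weights, correct, error):
    # The learner's vote weight grows as its weighted error shrinks
    alpha = 0.5 * np.log((1 - error) / error)
    # Misclassified examples are up-weighted so the next learner focuses on them
    weights = weights * np.exp(np.where(correct, -alpha, alpha))
    return weights / weights.sum()

w = np.full(6, 1 / 6)                                   # unweighted start
correct = np.array([True, True, False, True, False, True])
print(reweight(w, correct, error=2 / 6))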
• As additional rounds of weak learners are added,
they are trained on data with successively more
difficult examples.
• The process continues until the desired overall
error rate is reached or performance no longer
improves.
• At that point, each classifier's vote is weighted
according to its accuracy on the training data
on which it was built.
• Though boosting principles can be applied to
nearly any type of model, the principles are
most commonly used with decision trees.
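• A compact AdaBoost sketch using scikit-learn's AdaBoostClassifier with single-split decision trees ("stumps") as the weak learners; the dataset and the 50 rounds are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 50 rounds of boosting; each round re-weights the examples the previous
# stumps got wrong, and the final prediction is a performance-weighted vote
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())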