ENSEMBLE METHODS: INCREASING THE ACCURACY
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
Popular ensemble methods
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble: combining a set of heterogeneous classifiers (see the sketch below)
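As a small illustration of the last point, here is a minimal sketch (not from the slides) that combines heterogeneous classifiers with a majority vote using scikit-learn; the synthetic dataset and the choice of base learners are assumptions made for the example.

```python
# Minimal sketch: a heterogeneous ensemble M* combined by majority (hard) vote.
# Dataset and base learners are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different (heterogeneous) base classifiers, combined by hard voting
M_star = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("nb", GaussianNB()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
)
M_star.fit(X_train, y_train)
print("ensemble accuracy:", M_star.score(X_test, y_test))
```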
BAGGING: BOOTSTRAP AGGREGATION
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
A classifier model Mi is learned from each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
For noisy data: not considerably worse, and more robust
Proven to give improved accuracy in prediction (a bagging sketch follows below)
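A minimal sketch of the bagging procedure described above, assuming scikit-learn decision trees as the base learner; the dataset and the number of models k are illustrative choices.

```python
# Minimal bagging sketch: bootstrap sampling + majority vote.
# Base learner and dataset are illustrative assumptions, not from the slides.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
k = 25  # number of bagged models M1..Mk

models = []
for i in range(k):
    # Sample d tuples with replacement from D (a bootstrap sample Di)
    idx = rng.integers(0, len(X), size=len(X))
    Mi = DecisionTreeClassifier().fit(X[idx], y[idx])
    models.append(Mi)

def bagged_predict(x):
    # Each Mi votes; M* returns the most frequent class
    votes = [int(Mi.predict(x.reshape(1, -1))[0]) for Mi in models]
    return Counter(votes).most_common(1)[0][0]

print("prediction for first tuple:", bagged_predict(X[0]))
```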
BOOSTING
Analogy: Consult several doctors and combine their weighted diagnoses, where each weight is assigned based on the accuracy of that doctor's previous diagnoses
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated so that the subsequent classifier, Mi+1, pays more attention to the training tuples that were misclassified by Mi
The final model M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended to numeric prediction
Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data (a library-level sketch follows below)
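Before the detailed algorithm on the next slide, here is a minimal sketch of boosting using scikit-learn's AdaBoostClassifier; the dataset and the number of rounds are illustrative assumptions.

```python
# Minimal boosting sketch using scikit-learn's AdaBoost implementation.
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# k = 50 weak classifiers learned iteratively; each round reweights
# the training tuples that the previous classifier misclassified.
M_star = AdaBoostClassifier(n_estimators=50, random_state=1)
M_star.fit(X_train, y_train)
print("boosted accuracy:", M_star.score(X_test, y_test))
```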
ADABOOST (FREUND AND SCHAPIRE, 1997)
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
Initially, all the weights of tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i,
Tuples from D are sampled (with replacement) to form a training set Di of the same size
Each tuple’s chance of being selected is based on its weight
A classification model Mi is derived from Di
Its error rate is calculated using Di as a test set
If a tuple is misclassified, its weight is increased; otherwise, it is decreased
Error rate: err(Xj) is the misclassification error of tuple Xj (1 if Xj is misclassified, 0 otherwise). The error rate of classifier Mi is the sum of the weights of the misclassified tuples:
error(Mi) = Σ_{j=1..d} wj · err(Xj)
The weight of classifier Mi's vote is log( (1 − error(Mi)) / error(Mi) ) (see the sketch below)
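A minimal sketch of these rounds, assuming decision stumps as the base classifiers and binary labels in {-1, +1}. It follows the weight-update and vote-weight formulas above, but simplifies the resampling step by passing sample weights directly to the learner, which is a common implementation shortcut, not something stated on the slide.

```python
# Minimal AdaBoost sketch for binary labels y in {-1, +1}.
# Uses sample weights directly instead of weighted resampling (an assumption).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=2)
y = np.where(y == 1, 1, -1)

d = len(X)
w = np.full(d, 1.0 / d)                          # initial tuple weights: 1/d
classifiers, alphas = [], []

for i in range(20):                              # k = 20 rounds
    Mi = DecisionTreeClassifier(max_depth=1)     # decision stump
    Mi.fit(X, y, sample_weight=w)
    pred = Mi.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)    # sum of weights of misclassified tuples
    if err >= 0.5 or err == 0:                   # skip degenerate rounds
        break
    alpha = np.log((1 - err) / err)              # weight of Mi's vote
    # Increase weights of misclassified tuples, then renormalize so the
    # correctly classified tuples end up with relatively lower weight
    w *= np.exp(alpha * (pred != y))
    w /= w.sum()
    classifiers.append(Mi)
    alphas.append(alpha)

def adaboost_predict(X_new):
    # Weighted vote of all classifiers
    votes = sum(a * Mi.predict(X_new) for a, Mi in zip(alphas, classifiers))
    return np.sign(votes)

print("training accuracy:", np.mean(adaboost_predict(X) == y))
```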
RANDOM FOREST (BREIMAN, 2001)
Random Forest:
Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split
During classification, each tree votes and the most popular class is returned
Two methods to construct a random forest:
Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node; the CART methodology is used to grow the trees to maximum size
Forest-RC (random linear combinations): creates new attributes (or features) that are linear combinations of the existing attributes (reduces the correlation between individual classifiers)
Comparable in accuracy to AdaBoost, but more robust to errors and outliers
Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting (see the sketch below)
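A minimal Forest-RI-style sketch using scikit-learn, where max_features plays the role of F, the number of attributes considered at each split; the dataset, F, and the forest size are illustrative assumptions.

```python
# Minimal random forest sketch (Forest-RI style): each tree considers a
# random subset of F attributes at every split. Values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

F = 4  # number of attributes considered as split candidates at each node
forest = RandomForestClassifier(n_estimators=100, max_features=F, random_state=3)
forest.fit(X_train, y_train)

# Each tree votes; the most popular class is returned
print("forest accuracy:", forest.score(X_test, y_test))
```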
CLASSIFICATION OF CLASS-IMBALANCED DATA SETS
Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud detection, oil spills, faults, etc.
Traditional methods assume a balanced class distribution and equal error costs, so they are not suitable for class-imbalanced data
Typical methods for imbalanced data in two-class classification (see the sketch below):
Oversampling: re-sample data from the positive class
Undersampling: randomly eliminate tuples from the negative class
Threshold-moving: move the decision threshold, t, so that rare-class tuples are easier to classify, leaving less chance of costly false negative errors
Ensemble techniques: combine multiple classifiers, as introduced above
The class-imbalance problem remains difficult for multiclass tasks
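A minimal sketch of two of these methods, oversampling the rare class and threshold-moving, on a synthetic imbalanced dataset; the class ratio, the threshold value t, and the base classifier are illustrative assumptions.

```python
# Minimal sketch: oversampling the rare (positive) class and threshold-moving.
# Class ratio, threshold t, and base classifier are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 5% positive (rare) class, 95% negative class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=4)

# Oversampling: duplicate positive-class tuples until classes are balanced
pos = np.where(y_train == 1)[0]
neg = np.where(y_train == 0)[0]
rng = np.random.default_rng(4)
pos_over = rng.choice(pos, size=len(neg), replace=True)
X_bal = np.vstack([X_train[neg], X_train[pos_over]])
y_bal = np.concatenate([y_train[neg], y_train[pos_over]])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Threshold-moving: lower the decision threshold t so the rare class
# is predicted more readily than with the default t = 0.5
t = 0.3
y_pred = (clf.predict_proba(X_test)[:, 1] >= t).astype(int)
print("predicted positives:", y_pred.sum(), "of", len(y_test))
```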