Random Forest
Contents
• What is Random Forest?
• Ensemble Methods - Bagging
• How does Random Forest work?
• Hyper-Parameters in Random Forest
• Parameter Tuning - Cross-Validation & GridSearchCV
• Building RF in Scikit-learn
• Pros and Cons
What is Random Forest?
• Random Forest is a supervised learning algorithm capable of performing both regression and classification tasks.
• As the name suggests, the Random Forest algorithm creates a forest of decision trees.
Ensemble method
• Ensemble methods use multiple learning algorithms to obtain better predictions.
• Several different models are trained, and their predictions are aggregated to improve stability and predictive power.
• For this, we need a number of models (learners) whose predictive power is only slightly better than random chance. Such learners are called weak learners.
• We combine such weak learners to make one strong learner, as sketched below.
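• A minimal sketch of this idea on a toy dataset (the dataset and settings are assumptions for illustration): several decision stumps (depth-1 trees), each only slightly better than chance, are combined into one stronger learner by majority vote.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Weak learners: decision stumps, diversified by random feature subsets
stumps = [(f"stump_{i}", DecisionTreeClassifier(max_depth=1, max_features="sqrt", random_state=i))
          for i in range(5)]

# Strong learner: majority vote over the weak learners
ensemble = VotingClassifier(estimators=stumps, voting="hard")
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))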
Bagging
• The idea behind bagging is to combine the results of multiple models (for instance, several decision trees) to get a generalized result.
• Bagging uses a sampling technique called bootstrapping.
• Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement; a minimal sketch follows below.
• Bagging (Bootstrap Aggregating) uses these subsets (bags) to get a fair idea of the distribution of the complete set.
• The subsets created for bagging may be the same size as or smaller than the original set.
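• A minimal bootstrapping sketch (the data and bag size are assumptions for illustration): each bag is drawn from the original observations with replacement.

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)           # original dataset of 10 observations
n_bags, bag_size = 3, 10       # here each bag has the same size as the original set

for b in range(n_bags):
    idx = rng.integers(0, len(data), size=bag_size)   # sample indices with replacement
    print(f"bag {b}:", data[idx])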
Bagging
• Multiple subsets are created from the original dataset, selecting
observations with replacement.
Bagging
• A base model (weak model) is created on
each of these subsets.
• The models run in parallel and are
independent of each other.
• The final predictions are determined by combining the predictions from all the models, as in the sketch below.
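• A hedged sketch of bagging in scikit-learn (assuming scikit-learn 1.2+, where the base model argument is named estimator; older versions call it base_estimator): each decision tree is fit on a bootstrap sample, the trees are built in parallel, and their predictions are combined by voting.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # base (weak) model
    n_estimators=10,                      # number of bags / models
    bootstrap=True,                       # sample observations with replacement
    n_jobs=-1,                            # models are independent, so fit them in parallel
    random_state=0,
)
bagging.fit(X, y)
print(bagging.predict(X[:5]))             # combined (voted) predictions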
How does Random Forest work?
• RF consists of multiple decision trees which act as base learners.
• Each decision tree is given a random subset of samples from the dataset (hence the name "random").
• The RF algorithm uses an ensemble method – Bagging (Bootstrap Aggregating).
• Random Forest then trains each base learner (i.e. decision tree) on a different sample of the data, and the sampling of data points happens with replacement; a minimal fit/predict sketch follows below.
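• A minimal fit/predict sketch on a toy dataset (the dataset is an assumption for illustration): a Random Forest is essentially bagging of decision trees, with extra randomness in the features considered at each split.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)       # each tree is trained on a bootstrap sample of the training data
print("Test accuracy:", rf.score(X_test, y_test))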
Example
• Consider a training dataset: [X1, X2, X3, … X10, Y].
• Random Forest will create decision trees, each trained on a bootstrapped subset of this dataset.
Hyper-Parameters in Random Forest
• Optimization of RF depends on a few built-in parameters; an illustrative instantiation follows below.
• n_estimators – the number of decision trees the algorithm builds. As the number of trees increases, performance improves and the predictions become more stable, but computation slows down.
• max_features – the maximum number of features considered when splitting a node.
• n_jobs – the number of jobs to run in parallel. If n_jobs=1, one processor is used; if n_jobs=-1, the number of jobs is set to the number of available cores.
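• An illustrative instantiation of the three parameters above (the specific values are assumptions, not recommendations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of decision trees in the forest
    max_features="sqrt",   # features considered when splitting a node
    n_jobs=-1,             # use all available cores to build trees in parallel
    random_state=0,
)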
Hyper-Parameters in Random Forest
• max_depth is the maximum depth of each tree. The deeper the tree, the more splits it has and the more information it captures about the data.
• criterion is the function used to measure the quality of a split. Supported criteria are “gini” for Gini impurity and “entropy” for information gain. A small comparison sketch follows below.
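• A sketch comparing the two criteria with a capped tree depth (the dataset and values are assumptions for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for criterion in ("gini", "entropy"):
    rf = RandomForestClassifier(max_depth=5, criterion=criterion, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5)      # 5-fold CV score for each criterion
    print(criterion, round(scores.mean(), 3))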
Cross-Validation (CV)
• Cross-validation is a statistical method used to estimate the performance of
machine learning models.
• It is a resampling procedure used to evaluate machine learning models on a
limited data sample.
• The most common method is K-Fold CV.
• Normally, we split the data into train and test sets.
• In K-Fold CV, the training data is further split into K subsets, called folds.
Cross-Validation (CV)
• We then iteratively fit the model K times, each time training on K-1 of the folds and evaluating on the Kth fold (called the validation data).
• For example, suppose the training data is split into 5 folds (K = 5).
• 1st iteration – train on the first four folds and evaluate on the fifth.
• 2nd iteration – train on the first, second, third, and fifth folds and evaluate on the fourth.
• And so on for the remaining folds.
• At the end of training, we average the performance on each of the folds to arrive at the final validation metrics for the model; a minimal sketch follows below.
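• A minimal 5-fold CV sketch (the dataset is an assumption for illustration): fit the model K times, each time on K-1 folds, evaluate on the held-out fold, and average the fold scores.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X[train_idx], y[train_idx])                  # train on K-1 folds
    scores.append(rf.score(X[val_idx], y[val_idx]))     # evaluate on the held-out fold

print("Mean CV score:", round(np.mean(scores), 3))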
Cross-Validation (CV)
• 5-Fold Cross-Validation
• For hyperparameter tuning, we perform many iterations of the entire K-Fold CV process, each time using different model settings, as in the sketch below.
• If we have 10 sets of hyperparameters and are using 5-Fold CV, that represents 50 training loops.
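• A sketch of this idea (the candidate settings below are assumptions for illustration): each of the 10 settings gets its own 5-fold CV, giving 50 model fits in total.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 2 x 5 = 10 candidate hyperparameter settings
candidates = [{"n_estimators": n, "max_depth": d}
              for n in (50, 100) for d in (3, 5, 7, 9, None)]

for params in candidates:                         # 10 settings x 5 folds = 50 fits
    rf = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(params, round(score, 3))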
GridSearchCV
• Grid search is used to find the hyperparameters of a model that result in the most ‘accurate’ predictions.
• To implement grid search, import the GridSearchCV class from the sklearn.model_selection module.
• The first step is to create a dictionary of all the parameters and their corresponding candidate values that you want to test for best performance, as in the sketch below.
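• A hedged GridSearchCV sketch (the parameter grid and dataset are examples, not a recommended search space):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {                       # dictionary of parameters and candidate values
    "n_estimators": [100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, None],
}

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)
print("Best CV score:", round(grid.best_score_, 3))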
Pros & Cons
Pros:
• Random Forest reduces overfitting, because averaging over many trees smooths out the noise a single decision tree would fit.
• The same Random Forest algorithm can be used for both classification and regression tasks.
• Random Forest can be used to identify the most important features in the training dataset, which helps with feature engineering (see the sketch after this list).
Cons:
• Random Forest is difficult to interpret: because the results of many trees are averaged, it is hard to figure out why the forest makes the predictions it does.
• Random Forest takes longer to build and is computationally expensive compared with a single decision tree.
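• A small sketch of the feature-importance point in the pros above (the dataset is an assumption for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

# Rank features by impurity-based importance and show the top 5
ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")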