Ensemble Models and Random Forests
Venkat Reddy
Chapter 7 in the book
Contents
•Introduction
•Ensemble Learning
•How ensemble learning works
•Bagging
•Building models using Bagging
•Random Forest algorithm
•Random Forest model building
The Wisdom of Crowds
•“One should not expend energy trying to identify an expert within a group but instead rely on the group’s collective wisdom; however, make sure that opinions are independent and that some knowledge of the truth resides with some group members” – Surowiecki
•So instead of trying to build one great model, it is better to build several independent, moderately good models and take their average as the final prediction.
The Wisdom of Crowds
Problem statement: What is the estimated monthly expense of a family in our city?
One eminent professor builds a single model vs. 100 assistant professors each build a model.
[Diagram] The single model M1 gives one prediction: $6,500. The 100 models M1, M2, M3, ..., M100 are averaged to give one prediction: $7,200.
What is Ensemble Learning
• Imagine a classification problem where the target has two classes, +1 and -1.
• Imagine that we built the best possible decision tree; it has 91% accuracy.
• Let x be a new data point, and suppose our decision tree predicts it to be +1. Is there a way we can do better than 91% using the same data?
• Let's build 3 more models on the same data and see whether we can improve the performance.
[Diagram] A decision tree built on the data has 90% accuracy and predicts +1 for the new data point x.
What is Ensemble Learning
•We have four models on the same dataset, and each of them has a different accuracy. But unfortunately there seems to be no real improvement in the accuracy.
[Diagram] Accuracies of the individual models built on the same data: Model1 90%, Model2 89%, Model3 88%, Model4 89%, Model100 89%.
What is Ensemble Learning
•What about the prediction for the data point x? (A small code sketch follows the diagram below.)
•The combined voting model seems to have a lower error than each of the individual models.
•This is the actual philosophy of ensemble learning.
[Diagram] Predictions for x: Model1 -1, Model2 -1, Model3 +1, Model4 -1, Model100 -1. The majority vote is -1.
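A minimal sketch of this voting idea in Python (the list `models` and the point `x` are illustrative placeholders, not objects from the lab dataset): each fitted classifier votes on the class of x, and the ensemble returns the most frequent vote.

import numpy as np

def majority_vote(models, x):
    # Each fitted model casts one vote (+1 or -1) for the point x
    votes = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    # The ensemble prediction is the most frequent vote
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]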
Ensemble Models
• Ensemble models obtain better predictions by using multiple models on the same dataset.
• It is not always possible to find a single best-fit model for our data; an ensemble model combines multiple models (M1, M2, M3, ..., Mk) to come up with one consolidated model.
• Ensemble models work on the principle that multiple moderately accurate models can give us a highly accurate model.
• Understandably, building and evaluating ensemble models is computationally expensive.
• Building one really good model is the usual statistical approach; building many models and averaging the results is the philosophy of ensemble learning.
[Diagram] Data is used to build models M1, M2, M3, ..., Mk, which are merged into a combined model.
Bagging
•Take multiple bootstrap samples from the population and build classifiers on each of the samples. For prediction, take the mean or mode of all the individual model predictions.
•Bagging has two major parts: 1) bootstrap sampling, 2) aggregation of learners.
•Bagging = Bootstrap Aggregating
•In bagging we combine many unstable models to produce a stable model. Hence the predictions will be very reliable (less variance in the final model).
Bootstrapping
• We have training data of size N.
• Draw a random sample of size N with replacement. This gives a new dataset; it might have repeated observations, and some observations might not appear even once.
• We are selecting records one at a time, returning each selected record back to the population, giving it a chance to be selected again.
• Create B such new datasets. These are called bootstrap datasets (a short code sketch follows this slide).
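A minimal sketch of drawing bootstrap datasets with NumPy; the toy data of size N = 100 and the choice B = 5 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)    # toy training data of size N = 100
B = 5                    # number of bootstrap datasets to create

# Each bootstrap dataset draws N records with replacement from the original data
bootstrap_sets = [rng.choice(data, size=len(data), replace=True) for _ in range(B)]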
The Bagging Algorithm
•Start with the training dataset D.
•Draw k bootstrap sample sets B1, B2, B3, ..., Bk from dataset D.
•For each bootstrap sample i, build a classifier model Mi.
•We will have a total of k classifiers M1, M2, ..., Mk.
•Vote over the k classifier outputs for the final classification output, and take the average for regression output (a short code sketch follows this slide).
[Diagram] Data D is split into bootstrap samples B1, B2, B3, ..., Bk; a model M1, M2, M3, ..., Mk is built on each, and the models are combined into the bagged model.
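A minimal sketch of the bagging algorithm with scikit-learn's BaggingClassifier (n_estimators=50 is an illustrative choice of k; the default base learner is a decision tree). The variables X_train, y_train, car_test and features are assumed to be defined as in the lab code later in this chapter.

from sklearn.ensemble import BaggingClassifier

# k = 50 decision trees, each fitted on its own bootstrap sample; predictions are aggregated by voting
bag = BaggingClassifier(n_estimators=50)
bag.fit(X_train, y_train)
bag_predict = bag.predict(car_test[features])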
Why Bagging works
• We are selecting records one at a time, returning each selected record back to the population, giving it a chance to be selected again.
• Note that the variance of the consolidated prediction is reduced if we have independent samples. That way we can reduce the unavoidable errors made by a single model.
• In a given bootstrap sample, some observations have a chance to be selected multiple times and some observations might not be selected at all.
• There is a proven result that a bootstrap sample contains only about 63% of the overall population; the remaining 37% is not present (the probability that a given record is never picked in N draws with replacement is (1 - 1/N)^N, which approaches 1/e ≈ 0.37; a quick check follows this slide).
• So the data used in each of these models is not exactly the same. This makes our learning models independent, which helps our predictors have uncorrelated errors.
• Finally, the errors from the individual models cancel out and give us a better ensemble model with higher accuracy.
• Bagging is really useful when there is a lot of variance in our data.
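A quick empirical check of the 63% figure (a minimal sketch; N = 10000 is an arbitrary illustrative size).

import numpy as np

rng = np.random.default_rng(0)
N = 10000
sample = rng.integers(0, N, size=N)            # one bootstrap sample of record indices
unique_fraction = len(np.unique(sample)) / N   # fraction of distinct records that made it into the sample
print(unique_fraction)                         # roughly 0.632, so roughly 37% of records are left out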
Random Forest
•Random forest is a specific case of the bagging methodology: bagging on decision trees gives a random forest.
•Just as many trees form a forest, many decision tree models together form a Random Forest model.
Random Forest
•In random forest we induce two types of randomness:
 • First, we take bootstrap samples of the population and build a decision tree on each of the samples.
 • Second, while building the individual trees on the bootstrap samples, we take a random subset of the features.
•Random forests are very stable; they are as good as neural networks and SVMs, and sometimes better.
Random Forest algorithm
• Start with the training dataset D, which has t features.
• Draw k bootstrap sample sets B1, B2, B3, ..., Bk from dataset D.
• For each bootstrap sample i, build a decision tree model Di using only p randomly chosen features (where p << t).
• Each tree has maximal strength: the trees are fully grown and not pruned.
• We will have a total of k decision trees D1, D2, ..., Dk. Each of these trees is built on relatively different training data and a different set of features.
• Vote over the tree outputs for the final classification output, and take the average for regression output (a short code sketch follows this slide).
[Diagram] Data D is split into bootstrap samples B1, B2, B3, ..., Bk, each restricted to p features; a tree D1, D2, D3, ..., Dk is built on each, and the trees are combined into the RF model.
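A minimal sketch of this algorithm with scikit-learn's RandomForestClassifier; here n_estimators plays the role of k and max_features plays the role of p, and the parameter values are illustrative. X_train, y_train, car_test and features are assumed to be defined as in the lab code later in this chapter.

from sklearn.ensemble import RandomForestClassifier

# k = 100 fully grown trees (no max_depth, so no pruning), each trained on a bootstrap sample
# and considering only a random subset of about sqrt(t) features at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
rf.fit(X_train, y_train)
rf_predict = rf.predict(car_test[features])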
The Random Factors in Random Forest
•We need to note the most important aspect of random forest, i.e., inducing randomness into the bagging of trees. There are two major sources of randomness:
 • Randomness in data: bootstrapping makes sure that the data in any two samples is somewhat different.
 • Randomness in features: while building the decision trees on bootstrapped samples, we consider only a random subset of the features.
Why to induce the randomness?
•The major trick of ensemble models is the independence of the individual models.
•If we take the same data and build the same model 100 times, we will not see any improvement.
•To make all our decision trees independent, we take independent sample sets and independent feature sets.
•As a rule of thumb, we can take p to be the square root of the number of features when t is very large, else p = t/3 (see the small example after this slide).
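A tiny illustration of this rule of thumb (t = 20 is an illustrative feature count, matching the example on the next slide).

import math

t = 20                      # total number of features
p_sqrt = int(math.sqrt(t))  # square-root rule gives p = 4
p_third = t // 3            # t/3 rule gives p = 6
print(p_sqrt, p_third)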
Why Random Forest Works
•For training data with 20 features, we build 100 decision trees with 5 features each, instead of a single large decision tree.
•The individual trees may be weak classifiers.
•It is like building weak classifiers on subsets of the data. The grouping of a large set of random trees generally produces accurate models.
Why Random Forest Works
• In this example we have three simple classifiers, m1, m2 and m3, each shown as a line in the diagram.
• m1 classifies anything above its line as +1 and below as -1.
• m2 classifies all the points above its line as -1 and below as +1.
• m3 classifies everything to the left of its line as -1 and to the right as +1.
• Each of these models has a fair amount of misclassification error.
• All three weak models together make a strong model.
Car accidents IOT
LAB: Random Forest
•https://www.kaggle.com/c/stayalert
•Dataset: /Car Accidents IOT/Train.csv
•Build a decision tree model to predict the fatality of an accident.
•Build a decision tree model on the training data.
•On the test data, calculate the classification error and accuracy.
•Build a random forest model on the training data.
•On the test data, calculate the classification error and accuracy.
•What is the improvement of the Random Forest model when compared
with the single tree?
Code: Random Forest
# Import data and build a decision tree
# car_train and car_test are assumed to be pandas DataFrames created from Train.csv (e.g. via a train/test split)
from sklearn import tree

features=list(car_train.columns[1:22])
X_train=car_train[features]
y_train=car_train['Fatal']

### building Decision tree on the training data ###
clf = tree.DecisionTreeClassifier()
clf.fit(X_train,y_train)

### predicting on test data ###
tree_predict=clf.predict(car_test[features])
from sklearn.metrics import confusion_matrix  # for using confusion matrix
cm1 = confusion_matrix(car_test[['Fatal']],tree_predict)
print(cm1)
Code: Random Forest
### predicting on test data ###
tree_predict=clf.predict(car_test[features])
from sklearn.metrics import confusion_matrix  # for using confusion matrix
cm1 = confusion_matrix(car_test[['Fatal']],tree_predict)
print(cm1)

### from confusion matrix calculate accuracy ###
total1=cm1.sum()
accuracy_tree=(cm1[0,0]+cm1[1,1])/total1
accuracy_tree
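Equivalently, the test-set accuracy can be read off directly with scikit-learn's accuracy_score; this is a small sketch reusing the variables from the code above, and it should give the same number as the confusion-matrix calculation.

from sklearn.metrics import accuracy_score

accuracy_tree = accuracy_score(car_test['Fatal'], tree_predict)  # same value as (cm1[0,0]+cm1[1,1])/total1
print(accuracy_tree)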
Code: Random Forest
# Random forest model and its accuracy
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(n_estimators=10, max_features=5, max_depth=11)
forest.fit(X_train,y_train)
predict_y_test=forest.predict(car_test[features])
actual_y_test=car_test['Fatal']

### check the accuracy on test data ###
from sklearn.metrics import confusion_matrix  # for using confusion matrix
cm2 = confusion_matrix(actual_y_test,predict_y_test)
print(cm2)

### from confusion matrix calculate accuracy ###
total2=cm2.sum()
accuracy_forest=(cm2[0,0]+cm2[1,1])/total2
accuracy_forest
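To answer the lab question on improvement, the two accuracies computed above can be compared directly (a small sketch reusing accuracy_tree and accuracy_forest from the earlier code slides).

print("Decision tree accuracy:", accuracy_tree)
print("Random forest accuracy:", accuracy_forest)
print("Improvement:", accuracy_forest - accuracy_tree)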
Thank you