Random Forest Algorithm in Machine Learning

Random Forest is a supervised machine learning algorithm used for both classification and regression tasks. It was developed by Leo Breiman and Adele Cutler. It is an ensemble learning method that builds multiple decision trees during training and combines their predictions to improve accuracy and reduce overfitting. Accuracy generally improves as more decision trees are added, although the gains level off once the forest is large enough.

In this lesson, we will learn:

  • How Random Forest Works
  • Features of Random Forest
  • Advantages of Random Forest
  • Disadvantages of Random Forest
  • Applications of Random Forest
  • Example 1: Predicting Loan Approval (Classification)
  • Example 2: Predicting House Prices (Regression)

How Random Forest Works

Let us see how Random Forest works (a minimal from-scratch sketch follows the list):

  1. Bootstrap Aggregation (Bagging): Randomly selects subsets of the training data (with replacement) to train multiple decision trees.
  2. Feature Randomness: For each split in a decision tree, only a random subset of features is considered, reducing correlation between trees.
  3. Voting (Classification) / Averaging (Regression):
    • For classification, the final prediction is based on majority voting.
    • For regression, the final prediction is the average of all tree predictions.
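
To make these three ideas concrete, here is a minimal from-scratch sketch. It is illustrative only, not how you would use the algorithm in practice (scikit-learn's RandomForestClassifier, used in the examples below, does all of this internally); the function names fit_forest and predict_forest are our own, and DecisionTreeClassifier is borrowed as the base learner:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=10, seed=42):
    """Train n_trees decision trees, each on a bootstrap sample of (X, y).
    X and y are assumed to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement (bagging)
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Majority vote across trees (assumes non-negative integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

For regression, the only change is to replace the majority vote with the mean of the per-tree predictions.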

Features of Random Forest

Here are some features of Random Forest:

  • Ensemble Method: Combines multiple decision trees for better performance.
  • Handles Overfitting: Due to averaging/voting, it reduces variance compared to a single decision tree.
  • Works with Missing Values: The original algorithm includes strategies for handling missing data, though support varies by implementation (some libraries require imputing missing values first).
  • Feature Importance: Provides a measure of feature importance.
  • Scalability: Works well with large datasets.
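
A convenient by-product of bagging is out-of-bag (OOB) evaluation: each tree leaves out roughly a third of the rows when it samples with replacement, and those held-out rows act as a free validation set. A brief sketch with scikit-learn (here X and y stand for your training features and labels):

from sklearn.ensemble import RandomForestClassifier

# oob_score=True evaluates each tree on the rows left out of its bootstrap sample
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X, y)
print("Out-of-bag accuracy:", model.oob_score_)  # estimate of generalization accuracy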

Advantages of Random Forest

The following are the advantages of Random Forest:

  • High Accuracy: Often outperforms single decision trees.
  • Robust to Overfitting: Due to averaging multiple trees.
  • Handles Non-Linearity: Works well with complex datasets.
  • Feature Importance: Helps in feature selection.
  • Works with Both Numerical & Categorical Data.

Disadvantages of Random Forest

The following are the disadvantages of Random Forest:

  • Computationally Expensive: Slower than single decision trees.
  • Less Interpretability: Harder to visualize than a single decision tree.
  • Memory Intensive: Requires more storage for multiple trees.
  • Can Overfit on Noisy Data: If trees are too deep.
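
The last point is usually mitigated by capping how far each tree can grow. A brief sketch of the relevant scikit-learn parameters (the values shown are arbitrary examples, not tuned recommendations):

from sklearn.ensemble import RandomForestClassifier

# Limiting tree depth and leaf size reduces variance on noisy data
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,          # cap how deep each tree can grow
    min_samples_leaf=3,   # require at least 3 samples in every leaf
    random_state=42
)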

Applications of Random Forest

The following are the applications of Random Forest:

  • Banking: Fraud detection, loan approval.
  • Healthcare: Disease prediction, medical diagnosis.
  • E-commerce: Customer segmentation, recommendation systems.
  • Stock Market: Price prediction, risk assessment.
  • Remote Sensing: Land cover classification.

Example 1: Predicting Loan Approval (Classification)

Problem Statement

A bank wants to predict whether a customer’s loan application should be approved based on features like income, credit score, employment status, and loan amount.

Steps:

  1. Data Collection:
    • Features (Independent Variables): Income, Credit Score, Employment Status, Loan Amount
    • Target (Dependent Variable): Approved (Yes/No)
  2. Training the Random Forest:
    • Multiple decision trees are trained on random subsets of data.
    • Each tree predicts independently.
  3. Prediction:
    • New applicant: Income = $80k, Credit Score = 700, Employed = Yes, Loan Amount = $50k
    • Trees predict: [Yes, No, Yes, Yes, Yes]
    • Final Prediction: Yes (Majority Voting)

Here are the steps with the code snippets:

Step 1: Import Required Libraries

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

Step 2: Sample Data

data = {
    'Income': [50, 80, 30, 90, 60],
    'Credit_Score': [600, 700, 500, 800, 650],
    'Employed': [1, 1, 0, 1, 1],
    'Loan_Amount': [20, 50, 10, 60, 30],
    'Approved': [0, 1, 0, 1, 1]  # 1 = approved, 0 = rejected
}
df = pd.DataFrame(data)

Step 3: Features and Target

X = df.drop('Approved', axis=1)
y = df['Approved']

Step 4: Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 5: Train Random Forest

model = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees; fixed seed for reproducibility
model.fit(X_train, y_train)

Step 6: Make a Prediction (using a DataFrame to preserve feature names)

new_applicant = pd.DataFrame({
    'Income': [80],
    'Credit_Score': [700],
    'Employed': [1],
    'Loan_Amount': [50]
})

prediction = model.predict(new_applicant)
print("Loan Approved?", "Yes" if prediction[0] == 1 else "No")

Output

Random Forest Classification Output

Example 2: Predicting House Prices (Regression)

Problem Statement

A real estate company wants to predict house prices based on features like:

  • Size (sq. ft.)
  • Bedrooms
  • Location (1=Urban, 0=Rural)
  • Age of House (years)

Steps:

  1. Data Collection:
    • Features (Independent Variables): Size, Bedrooms, Location, Age
    • Target (Dependent Variable): Price ($)
  2. Training the Random Forest Regressor:
    • Multiple decision trees predict house prices independently.
    • Final prediction = Average of all tree predictions.
  3. Prediction:
    • New house: Size = 1800, Bedrooms = 3, Location = 1 (Urban), Age = 5
    • Individual tree predictions: Each of the 100 trees (n_estimators=100) makes its own prediction for the house price, so the model generates 100 candidate prices, one per tree (see the per-tree inspection sketch after Step 6 below). Here is a simulated example of what the first 10 trees might predict:
      Tree 1: $372,000
      Tree 2: $385,000  
      Tree 3: $379,000
      Tree 4: $366,000
      Tree 5: $381,000
      Tree 6: $377,000
      Tree 7: $390,000
      Tree 8: $374,000
      Tree 9: $368,000
      Tree 10: $382,000
      ...
      Tree 100: $380,000
    • Final Prediction: The Random Forest takes the average of all 100 predictions:
      ($372k + $385k + $379k + … + $380k) / 100 = $378,500

Here are the steps with the code snippets:

Step 1: Import Required Libraries

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np

Step 2: Sample Data

data = {
    'Size': [1200, 1500, 1800, 2000, 2400, 1600, 1900, 2100, 2300, 2500],
    'Bedrooms': [2, 3, 3, 4, 4, 2, 3, 4, 3, 4],
    'Location': [0, 1, 1, 0, 1, 0, 1, 0, 1, 1],  # 0 = Rural, 1 = Urban
    'Age': [10, 5, 8, 15, 3, 7, 9, 12, 4, 2],
    'Price': [250000, 320000, 350000, 400000, 450000, 310000, 370000, 410000, 440000, 470000]
}
df = pd.DataFrame(data)

Step 3: Features & Target

X = df.drop('Price', axis=1)
y = df['Price']

Step 4: Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 5: Train Random Forest Regressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

Step 6: Make predictions (preserving feature names)

new_house = pd.DataFrame({
    'Size': [1800],
    'Bedrooms': [3],
    'Location': [1],
    'Age': [5]
})

predicted_price = model.predict(new_house)
print(f"Predicted House Price: ${predicted_price[0]:,.2f}")

Step 7: Evaluate Model

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:,.2f}, R² Score: {r2:.2f}")

Step 8: Feature Importance

importances = model.feature_importances_
features = X.columns
print("\nFeature Importances:")
for feature, importance in zip(features, importances):
    print(f"{feature}: {importance:.2f}")

Output

Random Forest Regression
