January 3, 2025
Given a dataset credit.csv with imbalanced class distributions and a high-dimensional feature space,
discuss the challenges and considerations in using decision trees for classification. Propose strategies
for mitigating the impact of class imbalance and for feature selection to improve model robustness and
generalisation performance. You are also provided with metadata; please read it before starting the
implementation.
[1]: import pandas as pd
# Load the dataset
url = "https://itv-contentbucket.s3.ap-south-1.amazonaws.com/Exams/ML/
↪Decision+Tree/credit.csv"
data = pd.read_csv(url)
# Display the first few rows of the dataset
print(data.head())
# Display summary statistics
print(data.describe())
# Display information about the dataset
print(data.info())
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 … V21 V22 V23 V24 V25 \
0 0.098698 0.363787 … -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 … -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 … 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 … -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 … -0.009431 0.798278 -0.137458 0.141267 -0.206010
V26 V27 V28 Amount Class
0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
Time V1 V2 V3 V4 \
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 94813.859575 1.168375e-15 3.416908e-16 -1.379537e-15 2.074095e-15
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00 1.415869e+00
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -1.984653e-02
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00 7.433413e-01
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00 1.687534e+01
V5 V6 V7 V8 V9 \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 9.604066e-16 1.487313e-15 -5.556467e-16 1.213481e-16 -2.406331e-15
std 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00 1.098632e+00
min -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01
25% -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01
50% -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -5.142873e-02
75% 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01 5.971390e-01
max 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01 1.559499e+01
… V21 V22 V23 V24 \
count … 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean … 1.654067e-16 -3.568593e-16 2.578648e-16 4.473266e-15
std … 7.345240e-01 7.257016e-01 6.244603e-01 6.056471e-01
min … -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00
25% … -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01
50% … -2.945017e-02 6.781943e-03 -1.119293e-02 4.097606e-02
75% … 1.863772e-01 5.285536e-01 1.476421e-01 4.395266e-01
max … 2.720284e+01 1.050309e+01 2.252841e+01 4.584549e+00
V25 V26 V27 V28 Amount \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 284807.000000
mean 5.340915e-16 1.683437e-15 -3.660091e-16 -1.227390e-16 88.349619
std 5.212781e-01 4.822270e-01 4.036325e-01 3.300833e-01 250.120109
min -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01 0.000000
25% -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02 5.600000
50% 1.659350e-02 -5.213911e-02 1.342146e-03 1.124383e-02 22.000000
75% 3.507156e-01 2.409522e-01 9.104512e-02 7.827995e-02 77.165000
max 7.519589e+00 3.517346e+00 3.161220e+01 3.384781e+01 25691.160000
Class
count 284807.000000
mean 0.001727
std 0.041527
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 31 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
None
[2]: import matplotlib.pyplot as plt
import seaborn as sns
# Check class distribution
class_counts = data['Class'].value_counts()
print(class_counts)
# Visualize class distribution
sns.countplot(x='Class', data=data)
plt.title('Class Distribution')
plt.show()
# Visualize correlations between features
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, fmt=".2f")
plt.title('Feature Correlation Matrix')
plt.show()
Class
0 284315
1 492
Name: count, dtype: int64
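Only 492 of 284,807 transactions (about 0.17%) are fraudulent, so a classifier that always predicts the majority class would already reach roughly 99.83% accuracy without detecting a single fraud. A minimal illustrative check of that baseline, reusing the class_counts computed above:

# Illustrative only: the naive "always predict 0" baseline shows why accuracy alone is misleading here.
fraud_share = class_counts[1] / class_counts.sum()
print(f"Fraud share: {fraud_share:.4%}")                  # ~0.17%
print(f"Majority-class accuracy: {1 - fraud_share:.4%}")  # ~99.83% while catching no fraud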
[3]: from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Handle missing values (if any)
data = data.dropna()
# Separate features and target variable
X = data.drop('Class', axis=1)
y = data['Class']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
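Because stratify=y is passed to train_test_split, both splits should retain the original fraud rate of roughly 0.17%; a minimal sanity-check sketch:

# Sanity check (illustrative): stratified splitting preserves the class ratio in both splits.
print(f"Train fraud rate: {y_train.mean():.4%}")
print(f"Test fraud rate:  {y_test.mean():.4%}")

Note that decision trees themselves are insensitive to feature scaling; the StandardScaler mainly matters for SMOTE's nearest-neighbour interpolation in the next step.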
[4]: from imblearn.over_sampling import SMOTE
# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
# Check the new class distribution
print(pd.Series(y_train_resampled).value_counts())
Class
0 199020
1 199020
Name: count, dtype: int64
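SMOTE balances the training set by synthesising minority-class samples, but it is not the only mitigation strategy. A hedged alternative sketch uses cost-sensitive learning instead of oversampling; class_weight='balanced' is a standard DecisionTreeClassifier parameter, and the variable names below are illustrative:

from sklearn.tree import DecisionTreeClassifier

# Alternative (illustrative): reweight the impurity criterion by inverse class frequency
# instead of generating synthetic samples, leaving the training data unchanged.
clf_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=42)
clf_weighted.fit(X_train_scaled, y_train)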
[10]: from sklearn.tree import DecisionTreeClassifier
# Train a decision tree to get feature importance
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_resampled, y_train_resampled)
# Get feature importance scores
importances = clf.feature_importances_
feature_importance = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(feature_importance)
# Convert resampled data back to DataFrame to select features by their names
X_train_resampled_df = pd.DataFrame(X_train_resampled, columns=X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns)
# Select top features (e.g., top 10 features)
selected_features = feature_importance.head(10).index
X_train_selected = X_train_resampled_df[selected_features]
X_test_selected = X_test_scaled_df[selected_features]
V14 0.773470
V4 0.058559
V12 0.020056
V10 0.014436
V8 0.013011
V13 0.010029
V7 0.007680
V1 0.006985
V11 0.006978
Time 0.006907
V23 0.006864
V19 0.006841
V6 0.006788
V17 0.005915
V26 0.005891
V3 0.005694
V21 0.005551
V18 0.005128
V24 0.005046
V9 0.003899
V25 0.003670
V20 0.003646
V16 0.003503
V22 0.003228
V5 0.002949
V15 0.002785
Amount 0.001979
V27 0.001449
V2 0.000643
V28 0.000419
dtype: float64
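V14 alone accounts for about 77% of the importance, and only the top 10 features are kept for the final model. To further guard against overfitting the SMOTE-augmented training data, the tree's complexity can also be tuned; a sketch using cross-validated pruning parameters (the grid values here are illustrative assumptions, not tuned results):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch: limit depth and leaf size, selected by stratified cross-validation,
# scoring on minority-class F1 rather than accuracy.
param_grid = {'max_depth': [4, 6, 8, 12, None], 'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      scoring='f1',
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                      n_jobs=-1)
search.fit(X_train_selected, y_train_resampled)
print(search.best_params_)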
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Train the decision tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_selected, y_train_resampled)
# Make predictions
y_pred = clf.predict(X_test_selected)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
Accuracy: 0.9964654799105837
Confusion Matrix:
[[85035 260]
[ 42 106]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 85295
1 0.29 0.72 0.41 148
accuracy 1.00 85443
macro avg 0.64 0.86 0.71 85443
weighted avg 1.00 1.00 1.00 85443
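The 99.6% accuracy is driven almost entirely by the majority class: the fraud class reaches only 0.29 precision at 0.72 recall. A threshold-independent summary such as average precision (area under the precision-recall curve) is usually more informative for this kind of imbalance; a minimal evaluation sketch using the already-trained clf:

from sklearn.metrics import average_precision_score, precision_recall_curve

# Illustrative: score the positive class with predicted probabilities and summarise
# performance with average precision, which is not dominated by the ~99.8% majority class.
y_scores = clf.predict_proba(X_test_selected)[:, 1]
print("Average precision (PR-AUC):", average_precision_score(y_test, y_scores))
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)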