January 3, 2025
Given a dataset credit.csv with imbalanced class distributions and a high-dimensional feature space,
discuss the challenges and considerations in using decision trees for classification. Propose strategies
for mitigating the impact of class imbalance and for feature selection to improve model robustness and
generalisation performance. You are also provided with metadata; please read it before starting the
implementation.
[1]: import pandas as pd
# Load the dataset
url = "https://itv-contentbucket.s3.ap-south-1.amazonaws.com/Exams/ML/
↪Decision+Tree/credit.csv"
data = pd.read_csv(url)
# Display the first few rows of the dataset
print(data.head())
# Display summary statistics
print(data.describe())
# Display information about the dataset
print(data.info())
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 … V21 V22 V23 V24 V25 \
0 0.098698 0.363787 … -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 … -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 … 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 … -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 … -0.009431 0.798278 -0.137458 0.141267 -0.206010
V26 V27 V28 Amount Class
0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
Time V1 V2 V3 V4 \
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 94813.859575 1.168375e-15 3.416908e-16 -1.379537e-15 2.074095e-15
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00 1.415869e+00
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -1.984653e-02
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00 7.433413e-01
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00 1.687534e+01
V5 V6 V7 V8 V9 \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 9.604066e-16 1.487313e-15 -5.556467e-16 1.213481e-16 -2.406331e-15
std 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00 1.098632e+00
min -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01
25% -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01
50% -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -5.142873e-02
75% 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01 5.971390e-01
max 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01 1.559499e+01
… V21 V22 V23 V24 \
count … 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean … 1.654067e-16 -3.568593e-16 2.578648e-16 4.473266e-15
std … 7.345240e-01 7.257016e-01 6.244603e-01 6.056471e-01
min … -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00
25% … -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01
50% … -2.945017e-02 6.781943e-03 -1.119293e-02 4.097606e-02
75% … 1.863772e-01 5.285536e-01 1.476421e-01 4.395266e-01
max … 2.720284e+01 1.050309e+01 2.252841e+01 4.584549e+00
V25 V26 V27 V28 Amount \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 284807.000000
mean 5.340915e-16 1.683437e-15 -3.660091e-16 -1.227390e-16 88.349619
std 5.212781e-01 4.822270e-01 4.036325e-01 3.300833e-01 250.120109
min -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01 0.000000
25% -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02 5.600000
50% 1.659350e-02 -5.213911e-02 1.342146e-03 1.124383e-02 22.000000
75% 3.507156e-01 2.409522e-01 9.104512e-02 7.827995e-02 77.165000
max 7.519589e+00 3.517346e+00 3.161220e+01 3.384781e+01 25691.160000
Class
count 284807.000000
mean 0.001727
std 0.041527
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 31 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
None
[2]: import matplotlib.pyplot as plt
import seaborn as sns
# Check class distribution
class_counts = data['Class'].value_counts()
print(class_counts)
# Visualize class distribution
sns.countplot(x='Class', data=data)
plt.title('Class Distribution')
plt.show()
# Visualize correlations between features
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, fmt=".2f")
plt.title('Feature Correlation Matrix')
plt.show()
Class
0 284315
1 492
Name: count, dtype: int64
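Only 492 of 284,807 transactions (about 0.17%) are fraudulent, so a classifier that always predicts the majority class would already reach roughly 99.83% accuracy without detecting a single fraud. A minimal illustrative check of that baseline, reusing the class_counts computed above:

# Illustrative only: the naive "always predict 0" baseline shows why accuracy alone is misleading here.
fraud_share = class_counts[1] / class_counts.sum()
print(f"Fraud share: {fraud_share:.4%}")                  # ~0.17%
print(f"Majority-class accuracy: {1 - fraud_share:.4%}")  # ~99.83% while catching no fraud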
[3]: from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Handle missing values (if any)
data = data.dropna()
# Separate features and target variable
X = data.drop('Class', axis=1)
y = data['Class']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
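Because stratify=y is passed to train_test_split, both splits should retain the original fraud rate of roughly 0.17%; a minimal sanity-check sketch:

# Sanity check (illustrative): stratified splitting preserves the class ratio in both splits.
print(f"Train fraud rate: {y_train.mean():.4%}")
print(f"Test fraud rate:  {y_test.mean():.4%}")

Note that decision trees themselves are insensitive to feature scaling; the StandardScaler mainly matters for SMOTE's nearest-neighbour interpolation in the next step.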
[4]: from imblearn.over_sampling import SMOTE
# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
# Check the new class distribution
print(pd.Series(y_train_resampled).value_counts())
Class
0 199020
1 199020
Name: count, dtype: int64
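SMOTE balances the training set by synthesising minority-class samples, but it is not the only mitigation strategy. A hedged alternative sketch uses cost-sensitive learning instead of oversampling; class_weight='balanced' is a standard DecisionTreeClassifier parameter, and the variable names below are illustrative:

from sklearn.tree import DecisionTreeClassifier

# Alternative (illustrative): reweight the impurity criterion by inverse class frequency
# instead of generating synthetic samples, leaving the training data unchanged.
clf_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=42)
clf_weighted.fit(X_train_scaled, y_train)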
[10]: from sklearn.tree import DecisionTreeClassifier
# Train a decision tree to get feature importance
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_resampled, y_train_resampled)
# Get feature importance scores
importances = clf.feature_importances_
feature_importance = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(feature_importance)
# Convert resampled data back to DataFrame to select features by their names
X_train_resampled_df = pd.DataFrame(X_train_resampled, columns=X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns)
# Select top features (e.g., top 10 features)
selected_features = feature_importance.head(10).index
X_train_selected = X_train_resampled_df[selected_features]
X_test_selected = X_test_scaled_df[selected_features]
V14 0.773470
V4 0.058559
V12 0.020056
V10 0.014436
V8 0.013011
V13 0.010029
V7 0.007680
V1 0.006985
V11 0.006978
Time 0.006907
V23 0.006864
V19 0.006841
V6 0.006788
V17 0.005915
V26 0.005891
V3 0.005694
V21 0.005551
V18 0.005128
V24 0.005046
V9 0.003899
V25 0.003670
V20 0.003646
V16 0.003503
V22 0.003228
V5 0.002949
V15 0.002785
Amount 0.001979
V27 0.001449
V2 0.000643
V28 0.000419
dtype: float64
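V14 alone accounts for about 77% of the importance, and only the top 10 features are kept for the final model. To further guard against overfitting the SMOTE-augmented training data, the tree's complexity can also be tuned; a sketch using cross-validated pruning parameters (the grid values here are illustrative assumptions, not tuned results):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch: limit depth and leaf size, selected by stratified cross-validation,
# scoring on minority-class F1 rather than accuracy.
param_grid = {'max_depth': [4, 6, 8, 12, None], 'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      scoring='f1',
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                      n_jobs=-1)
search.fit(X_train_selected, y_train_resampled)
print(search.best_params_)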
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Train the decision tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_selected, y_train_resampled)
# Make predictions
y_pred = clf.predict(X_test_selected)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
Accuracy: 0.9964654799105837
Confusion Matrix:
[[85035 260]
[ 42 106]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 85295
1 0.29 0.72 0.41 148
accuracy 1.00 85443
macro avg 0.64 0.86 0.71 85443
weighted avg 1.00 1.00 1.00 85443
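The 99.6% accuracy is driven almost entirely by the majority class: the fraud class reaches only 0.29 precision at 0.72 recall. A threshold-independent summary such as average precision (area under the precision-recall curve) is usually more informative for this kind of imbalance; a minimal evaluation sketch using the already-trained clf:

from sklearn.metrics import average_precision_score, precision_recall_curve

# Illustrative: score the positive class with predicted probabilities and summarise
# performance with average precision, which is not dominated by the ~99.8% majority class.
y_scores = clf.predict_proba(X_test_selected)[:, 1]
print("Average precision (PR-AUC):", average_precision_score(y_test, y_scores))
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)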