
PREDICTIVE ANALYTICS LAB PROJECT

AIM:
To build a machine-learning model that predicts whether a loan
applicant will default (binary classification), and to evaluate the model
using standard metrics (accuracy, precision, recall, F1, ROC-AUC). The
project also demonstrates preprocessing, feature engineering, class-imbalance
handling, model explanation, and a simple deployment demo.

OBJECTIVES:
1. Acquire and explore a real credit dataset.
2. Clean and preprocess the data: missing values, encoding categorical
variables, scaling numeric features.
3. Engineer and select predictive features (e.g., income ratios, credit
history flags); a short sketch follows this list.
4. Train several classifiers (Logistic Regression, Random Forest, XGBoost)
and compare performance.
5. Handle class imbalance (SMOTE, class weighting) and evaluate effects.
6. Use model explainability tools (SHAP or feature importances) to
interpret decisions.
7. Package the model for deployment (Flask API or simple pickle + demo
notebook).
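
A minimal feature-engineering sketch for objective 3. The column names
('credit_amount', 'duration') are assumptions that vary by dataset export,
and since the German Credit data carries no direct income field, a
repayment-rate ratio stands in for an income ratio:

# Hedged sketch: assumes df has 'credit_amount' and 'duration' columns.
df["amount_per_month"] = df["credit_amount"] / df["duration"].clip(lower=1)
df["long_loan_flag"] = (df["duration"] > 24).astype(int)  # example binary flag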

DESCRIPTION / DATASET:
German Credit (UCI Statlog): a small, classic dataset (1,000 instances, 20
attributes) for "good/bad" credit-risk classification, well suited to quick
experiments and interpretability work.

Typical features (vary by dataset version):

• Age, Sex, Job type, Housing status
• Credit amount, Duration (months), Purpose of loan
• Credit history, Existing credits at the bank, Number of dependents
• Target: credit_risk (Good/Bad or 0/1)


(Full variable descriptions are on the dataset pages.)
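
If you start from the raw UCI file instead of a prepared CSV, note that
german.data is whitespace-separated, with 20 coded attributes plus a target
column (1 = good, 2 = bad). A minimal loading sketch; the generic column
names below are placeholders, not the official attribute names:

import pandas as pd

cols = [f"attr_{i}" for i in range(1, 21)] + ["target"]
df = pd.read_csv("german.data", sep=r"\s+", header=None, names=cols)
df["target"] = df["target"].map({1: 0, 2: 1})  # recode: 1 (good) -> 0, 2 (bad) -> 1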
PROGRAM:
# credit_risk_with_visuals.py
# Requirements: pandas, numpy, scikit-learn, imbalanced-learn, xgboost,
#               shap, matplotlib, seaborn, joblib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (classification_report, roc_auc_score,
                             confusion_matrix, roc_curve)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from xgboost import XGBClassifier

# 1) Load dataset
df = pd.read_csv("german_credit_data.csv")  # replace with your dataset
print(df.shape)                             # dataset shape, e.g. (1000, 21)
print(df['target'].value_counts())          # target distribution

# 2) Target distribution visualization
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=df, palette="Set2")
plt.title("Target Distribution (Good vs Bad Credit)")
plt.xlabel("Credit Risk")
plt.ylabel("Count")
plt.savefig("target_distribution.png")
plt.close()

# 3) Define features/target
y = df['target']  # 0 = good, 1 = bad (check dataset encoding)
X = df.drop(columns=['target', 'ID'], errors='ignore')

# 4) Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5) Preprocessing
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # sparse_output replaces the deprecated sparse argument (scikit-learn >= 1.2)
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

# 6) Model pipeline with SMOTE (the sampler is applied only during fit,
# so the test set stays untouched). use_label_encoder is deprecated in
# recent xgboost versions and is omitted here.
clf = XGBClassifier(eval_metric='logloss', random_state=42)

pipe = ImbPipeline([
    ('preproc', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('clf', clf)
])

# 7) Train
pipe.fit(X_train, y_train)

# 8) Predictions and metrics
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))

# 9) Confusion matrix heatmap
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Good", "Bad"], yticklabels=["Good", "Bad"])
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("confusion_matrix.png")
plt.close()

# 10) ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.2f}")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.savefig("roc_curve.png")
plt.close()

# 11) Feature importance (from XGBoost)
model = pipe.named_steps['clf']
importance = model.feature_importances_

# Feature names after preprocessing
preproc = pipe.named_steps['preproc']
ohe_cols = []
if cat_cols:
    ohe = preproc.named_transformers_['cat'].named_steps['onehot']
    ohe_cols = ohe.get_feature_names_out(cat_cols)
feature_names = list(num_cols) + list(ohe_cols)

# Plot top 10 important features
sorted_idx = np.argsort(importance)[-10:]
plt.figure(figsize=(8, 6))
plt.barh(np.array(feature_names)[sorted_idx], importance[sorted_idx],
         color="green")
plt.title("Top 10 Feature Importances (XGBoost)")
plt.xlabel("Importance Score")
plt.savefig("feature_importance.png")
plt.close()
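
# 12) Optional: SHAP explanation (objective 6). A sketch, assuming the shap
# package is installed; TreeExplainer supports fitted XGBoost models.
import shap

X_test_enc = preproc.transform(X_test)  # reuse the fitted preprocessor
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_enc)
shap.summary_plot(shap_values, X_test_enc, feature_names=feature_names,
                  show=False)
plt.savefig("shap_summary.png", bbox_inches="tight")
plt.close()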

# 13) Save trained model (the pickled pipeline bundles preprocessing,
# SMOTE, and the classifier)
joblib.dump(pipe, "credit_risk_model.pkl")
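
For objective 7, a minimal deployment sketch: a separate app.py exposing the
saved pipeline as a Flask endpoint. The route name and JSON layout are
assumptions; a request must supply the same feature columns as X.

# app.py -- minimal scoring API sketch
import pandas as pd
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("credit_risk_model.pkl")  # pipeline saved by the script above

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object whose keys match the training feature columns.
    row = pd.DataFrame([request.get_json()])
    proba = float(model.predict_proba(row)[:, 1][0])
    return jsonify({"default_probability": proba})

if __name__ == "__main__":
    app.run(debug=True)

Run it with python app.py and POST one applicant's fields as JSON to
http://127.0.0.1:5000/predict to receive the predicted default probability.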

OUTPUT:

(1000, 21)          # shape of dataset (example)

Target distribution:
0    700
1    300

Classification report:

              precision    recall  f1-score   support
           0       0.78      0.85      0.81       140
           1       0.66      0.52      0.58        60

Accuracy: 0.75
ROC-AUC: 0.81
