Predictive Analytics Lab Project
AIM :
To build a machine-learning model that predicts whether a loan
applicant will default (binary classification), to evaluate the model using
standard metrics (accuracy, precision, recall, F1, AUC-ROC), and to
demonstrate preprocessing, feature engineering, class-imbalance handling,
model explanation, and a simple deployment demo.
OBJECTIVES:
1. Acquire and explore a real credit dataset.
2. Clean and preprocess the data: missing values, encoding categorical
variables, scaling numeric features.
3. Engineer and select predictive features (e.g., income ratios, credit
history flags).
4. Train several classifiers (Logistic Regression, Random Forest, XGBoost)
and compare performance (see the comparison sketch after this list).
5. Handle class imbalance (SMOTE, class weighting) and evaluate the effects.
6. Use model explainability tools (SHAP or feature importances) to
interpret decisions (a SHAP sketch follows the code listing).
7. Package the model for deployment (Flask API or simple pickle + demo
notebook); a minimal Flask sketch follows the code listing.
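The sketch below illustrates objectives 4 and 5 together: it cross-validates Logistic Regression, Random Forest, and XGBoost with class weighting on the training split. It assumes the preprocessor, X_train, and y_train defined in the main script further below; the hyperparameters, the 5-fold ROC-AUC scoring, and the scale_pos_weight value (derived from the 700/300 class ratio) are illustrative choices, not fixed requirements.

# Sketch: compare three classifiers with class weighting (objectives 4-5).
# Assumes `preprocessor`, `X_train`, `y_train` are defined as in the main
# script below; hyperparameters here are illustrative defaults.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

candidates = {
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(n_estimators=300,
                                            class_weight="balanced",
                                            random_state=42),
    # XGBoost has no class_weight; scale_pos_weight plays the same role
    # (ratio of negative to positive instances, here 700/300).
    "xgboost": XGBClassifier(n_estimators=300, scale_pos_weight=700/300,
                             eval_metric="logloss", random_state=42),
}

for name, model in candidates.items():
    cand_pipe = Pipeline([("preproc", preprocessor), ("clf", model)])
    scores = cross_val_score(cand_pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")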
Description / Dataset :
German Credit (UCI Statlog) — small, classic dataset (1,000 instances, 20
attributes) for “good/bad” credit risk classification. Good for quick
experiments and interpretability work.
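Before the main script, the dataset can be pulled directly from the UCI repository. The sketch below assumes the classic UCI download path and the standard german.data layout (space-separated, 20 attribute columns plus a class column coded 1 = good, 2 = bad); the generic column names and the output file german_credit.csv are illustrative and feed the load step in the script that follows.

# Sketch: fetch the German Credit data from UCI and save a local CSV with a
# binary 'target' column (0 = good, 1 = bad) for the main script below.
# Column names are generic placeholders, not the official attribute names.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "statlog/german/german.data")
cols = [f"attr_{i}" for i in range(1, 21)] + ["target"]

df = pd.read_csv(URL, sep=" ", header=None, names=cols)
df["target"] = (df["target"] == 2).astype(int)   # original coding: 1 = good, 2 = bad
df.to_csv("german_credit.csv", index=False)
print(df["target"].value_counts())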
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_curve, roc_auc_score)
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
# 1) Load dataset (assumes a local CSV copy of the German Credit data with a
#    binary 'target' column, 0 = good / 1 = bad; adjust path and column name
#    to your copy)
df = pd.read_csv("german_credit.csv")

# 2) Explore the target distribution
print("Target distribution:")
print(df['target'].value_counts())
plt.figure(figsize=(6,4))
df['target'].value_counts().plot(kind='bar')
plt.title("Target Distribution")
plt.xlabel("Credit Risk")
plt.ylabel("Count")
plt.savefig("target_distribution.png")
plt.close()
# 3) Define features/target
X = df.drop(columns=['target'])
y = df['target']

# 4) Split data (stratified 80/20 split to preserve the class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5) Preprocessing: impute/scale numeric columns, impute/one-hot encode categorical columns
num_cols = X.select_dtypes(include=['int64','float64']).columns.tolist()
cat_cols = X.select_dtypes(include=['object','category']).columns.tolist()
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])
# 6) Imbalance-aware pipeline: preprocessing, SMOTE oversampling, XGBoost classifier
#    (hyperparameters are illustrative defaults)
clf = XGBClassifier(n_estimators=200, eval_metric='logloss', random_state=42)
pipe = ImbPipeline([
    ('preproc', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('clf', clf)
])
# 7) Train
pipe.fit(X_train, y_train)
# 8) Predictions
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:,1]
print(classification_report(y_test, y_pred))
# 9) Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
plt.imshow(cm, cmap='Blues')
plt.colorbar()
for (i, j), value in np.ndenumerate(cm):
    plt.text(j, i, str(value), ha='center', va='center')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("confusion_matrix.png")
plt.close()
# 10) ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC:", round(auc, 3))
plt.figure(figsize=(6,5))
plt.plot(fpr, tpr, label=f"XGBoost (AUC = {auc:.2f})")
plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.savefig("roc_curve.png")
plt.close()
# 11) Feature importance (from XGBoost)
model = pipe.named_steps['clf']
importance = model.feature_importances_
preproc = pipe.named_steps['preproc']
ohe_cols = []
if cat_cols:
    ohe = preproc.named_transformers_['cat'].named_steps['onehot']
    ohe_cols = list(ohe.get_feature_names_out(cat_cols))
# Feature order after the ColumnTransformer: numeric columns first, then one-hot columns
feature_names = num_cols + ohe_cols
sorted_idx = np.argsort(importance)[-10:]
plt.figure(figsize=(8,6))
plt.barh(np.array(feature_names)[sorted_idx], importance[sorted_idx],
         color="green")
plt.xlabel("Importance Score")
plt.title("Top 10 Feature Importances")
plt.tight_layout()
plt.savefig("feature_importance.png")
plt.close()

# 12) Save the fitted pipeline for the deployment demo
joblib.dump(pipe, "credit_risk_model.pkl")
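For objective 6, the following sketch adds a SHAP explanation on top of the trained pipeline. It assumes the shap package is installed and reuses pipe, X_test, and feature_names from the script above; the summary bar plot is one of several possible SHAP views.

# Sketch: SHAP explanation of the trained XGBoost model (objective 6).
# Assumes `pipe`, `X_test`, `feature_names` from the script above and that
# the `shap` package is installed (pip install shap).
import shap
import matplotlib.pyplot as plt

X_test_prep = pipe.named_steps['preproc'].transform(X_test)
if hasattr(X_test_prep, "toarray"):   # densify if the transformer returned a sparse matrix
    X_test_prep = X_test_prep.toarray()

explainer = shap.TreeExplainer(pipe.named_steps['clf'])
shap_values = explainer.shap_values(X_test_prep)

# Global importance summary (mean |SHAP| value per feature)
shap.summary_plot(shap_values, X_test_prep, feature_names=feature_names,
                  plot_type="bar", show=False)
plt.savefig("shap_summary.png", bbox_inches="tight")
plt.close()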
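For objective 7, a minimal Flask demo can serve the saved credit_risk_model.pkl. The /predict route, port 5000, and the JSON field layout below are illustrative choices for the demo, not a fixed API.

# Sketch: minimal Flask API serving the saved pipeline (objective 7).
# Save as app.py and run with `python app.py`; the route name, port, and
# expected JSON layout are illustrative.
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("credit_risk_model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object whose keys match the training feature columns
    payload = request.get_json()
    X_new = pd.DataFrame([payload])
    proba = float(model.predict_proba(X_new)[0, 1])
    return jsonify({"default_probability": proba,
                    "predicted_class": int(proba >= 0.5)})

if __name__ == "__main__":
    app.run(debug=True, port=5000)

The demo notebook (or curl) can then POST a JSON record whose keys match the training columns and read back the predicted default probability.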
OUTPUT:
Target distribution:
0 (good credit)    700
1 (bad credit)     300
Headline test-set metrics:
Accuracy: 0.75
ROC-AUC: 0.81