Predictive model plan report
1 Model logic (generated with GenAI)
1. Data Preparation
• Impute missing values (Income, Loan_Balance, Credit_Score)
• Cap Credit_Utilization at 1.0
• Encode categorical variables (e.g., Employment_Status, Month_1–Month_6)
• Normalize or scale numerical features if needed (see the preprocessing sketch below)
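A minimal preprocessing sketch for the steps above, assuming the raw data sits in a pandas DataFrame named df with the columns listed; median imputation and one-hot encoding are illustrative choices here, not requirements of the plan.

import pandas as pd

df_cleaned = df.copy()

# Impute missing numeric values with the column median (one reasonable default)
for col in ['Income', 'Loan_Balance', 'Credit_Score']:
    df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())

# Cap Credit_Utilization at 1.0
df_cleaned['Credit_Utilization'] = df_cleaned['Credit_Utilization'].clip(upper=1.0)

# One-hot encode the categorical variables
categorical_cols = ['Employment_Status'] + [f'Month_{i}' for i in range(1, 7)]
df_cleaned = pd.get_dummies(df_cleaned, columns=categorical_cols, drop_first=True)

# Scaling is optional here because tree-based models do not require it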
2. Feature Selection
Select key predictors (an illustrative selected_features definition follows this list):
• Credit_Utilization
• Debt_to_Income_Ratio
• Missed_Payments
• Account_Tenure
• Recent Payment Status (Month_6)
• Employment_Status
• Credit_Score, Income
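An illustrative definition of the selected_features list referenced in the pseudocode below, assuming the one-hot encoded df_cleaned frame from the preprocessing sketch; the exact encoded column names depend on the categories present in the data.

selected_features = [
    'Credit_Utilization',
    'Debt_to_Income_Ratio',
    'Missed_Payments',
    'Account_Tenure',
    'Credit_Score',
    'Income',
]
# Add the one-hot encoded Employment_Status and Month_6 (recent payment status) columns
selected_features += [c for c in df_cleaned.columns
                      if c.startswith(('Employment_Status_', 'Month_6_'))]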
3. Model Training
• Choose model: Logistic Regression (baseline), or Random Forest (for feature importance)
• Train on labeled data (Delinquent_Account as target)
4. Model Evaluation
• Use train-test split or cross-validation
• Evaluate with accuracy, precision, recall, F1-score
• Review confusion matrix for false positives/negatives
5. Prediction and Risk Scoring
• Output a binary label (0 = Non-delinquent, 1 = Delinquent)
• Optional: Output a probability score for risk ranking (see the extension after the pseudocode below).
Pseudocode
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Step 1: Define features and target
X = df_cleaned[selected_features] # e.g., Credit_Utilization, etc.
y = df_cleaned['Delinquent_Account']
# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Model training
model = RandomForestClassifier(random_state=42)  # fixed seed so results are reproducible
model.fit(X_train, y_train)
# Step 4: Prediction
y_pred = model.predict(X_test)
# Step 5: Evaluation
print(classification_report(y_test, y_pred))
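An optional extension of the pseudocode covering steps 4 and 5 in more detail: cross-validation, the confusion matrix, a probability-based risk score, and AUC. It reuses the model, X, y, X_test, y_test, and y_pred defined above; the choice of 5 folds and recall as the cross-validation metric is illustrative.

from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, roc_auc_score

# Cross-validated recall as a complement to the single train-test split
cv_recall = cross_val_score(model, X, y, cv=5, scoring='recall')
print('Mean cross-validated recall:', cv_recall.mean())

# Confusion matrix to review false positives and false negatives
print(confusion_matrix(y_test, y_pred))

# Step 5: probability score for risk ranking, plus AUC across thresholds
y_proba = model.predict_proba(X_test)[:, 1]
print('ROC AUC:', roc_auc_score(y_test, y_proba))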
2 Justification for model choice
For the task of predicting customer delinquency, the Random Forest model was selected based on a
balance of accuracy, interpretability, and operational fit for Geldium’s business environment.
Factor and justification:
• Accuracy: Random Forests are ensemble models that combine multiple decision trees to improve prediction performance and reduce overfitting. They consistently outperform simpler models on structured financial data.
• Transparency: While not as transparent as logistic regression, Random Forests provide feature importance scores, allowing business analysts to understand which variables most influence risk predictions (illustrated in the short sketch after this list).
• Ease of Use: Random Forests are easy to implement using libraries like scikit-learn, require minimal feature scaling, and are robust to outliers and missing data.
• Financial Relevance: Tree-based models like Random Forests have a proven track record in credit scoring and fraud detection, making them well-aligned with financial use cases.
• Business Suitability (Geldium): Geldium needs fast, interpretable, and deployable solutions to identify high-risk customers. Random Forests offer a strong trade-off between performance and explainability, making them ideal for risk-based decision support systems.
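A short sketch of the feature-importance output mentioned under Transparency, assuming the fitted model and feature matrix X from the section 1 pseudocode.

import pandas as pd

# Rank features by their contribution to the Random Forest's predictions
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))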
Alternative considerations
• Logistic Regression was considered for its simplicity and transparency, but it lacks the ability
to capture complex, non-linear relationships in behavioral data.
• Neural Networks were ruled out for now due to their "black-box" nature, which is not ideal
for regulated industries like finance where model interpretability is crucial.
3 Evaluation strategy
To ensure the model is both effective and responsible, we use a combination of performance
metrics, bias detection techniques, and ethical safeguards.
Metric, what it measures, and why it matters for delinquency prediction:
• Accuracy: overall correctness of the model's predictions. Useful for balanced datasets, but can be misleading if classes are imbalanced.
• Precision: percentage of predicted delinquents that were actually delinquent. Important for minimizing false positives (wrongly labeling someone as high-risk).
• Recall: percentage of actual delinquents correctly identified. Critical for catching at-risk customers and preventing financial losses.
• F1 score: harmonic mean of precision and recall. Balances false positives and false negatives, ideal for uneven class distributions.
• Area under the ROC curve (AUC): ability of the model to distinguish between delinquent and non-delinquent cases. A robust summary of performance across thresholds; an AUC closer to 1 is ideal.
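How the metrics in the list above could be computed with scikit-learn, assuming the y_test, y_pred, and y_proba arrays from the section 1 pseudocode and its extension.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))
print('ROC AUC  :', roc_auc_score(y_test, y_proba))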
Metric interpretation
• High Recall + Moderate Precision: Acceptable in early warning systems to flag potential risk,
followed by manual review.
• Low Recall: Risk of missing truly delinquent customers — unacceptable for financial
applications.
• High F1 Score: Signals strong balance — key indicator of a model ready for production.
Bias detection & mitigation
Technique and purpose:
• Stratified sampling: ensures balanced representation of delinquent and non-delinquent cases during training.
• Fairness audits: evaluate model performance across subgroups (e.g., gender, location, income level).
• Feature sensitivity analysis: detects whether non-relevant features (e.g., ZIP code, ethnicity) are unduly influencing outcomes.
• Re-weighting: adjusts the class distribution to prevent the model from favoring the majority class.
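A sketch of how two of these techniques could be applied with scikit-learn, reusing X, y, and df_cleaned from section 1: stratified sampling via the stratify argument and re-weighting via class_weight, followed by a simple fairness audit comparing recall across a hypothetical Gender column (an assumption for illustration; such attributes would not be model inputs).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Stratified sampling: keep the delinquent/non-delinquent ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Re-weighting: give the minority (delinquent) class more weight during training
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
y_pred = pd.Series(model.predict(X_test), index=X_test.index)

# Fairness audit: compare recall across subgroups of a protected attribute
groups = df_cleaned.loc[X_test.index, 'Gender']  # hypothetical column, for illustration only
for group in groups.unique():
    mask = groups == group
    print(group, 'recall:', recall_score(y_test[mask], y_pred[mask]))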
Ethical considerations
• Transparency: Customers have a right to know if and why they were flagged as high-risk. The
model must support explainability.
• Fairness: Avoid discrimination against protected groups. Model inputs should be behavior-
and performance-based, not demographic.
• Human Oversight: High-risk predictions should trigger manual review, not automatic
rejections or penalties.
• Data Privacy: All customer data used must be anonymized, securely stored, and aligned with
data protection regulations (e.g., GDPR).