Irjet V12i425
1,2,3,4 B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India
5 Associate Professor, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Loan risk assessment is crucial for financial security. This study combines ensemble learning for loan default prediction with time series analysis for ghost borrower detection. Using PyCaret, we optimize model selection to identify high-risk borrowers. Additionally, ARIMA, LSTMs, and anomaly detection techniques analyze transaction patterns to flag fraudulent behaviors such as sudden withdrawals and post-loan inactivity. By integrating predictive modeling and anomaly detection, we enhance early fraud detection. This approach provides financial institutions with a comprehensive risk management framework, improving decision-making and reducing potential losses.

Key Words: Machine Learning, Deep Learning, Ensemble, Loan Default Prediction, Ghost Borrower Detection, TCN

1. INTRODUCTION

Accurate loan default prediction is vital for financial institutions to mitigate risks. Traditional credit assessment methods often overlook hidden patterns, making machine learning a powerful alternative. This study utilizes PyCaret to compare ensemble techniques such as bagging, boosting, and stacking for loan default prediction. By analyzing borrower demographics, financial history, and loan details, we evaluate model performance using accuracy, precision, recall, and F1-score, identifying the most effective approach for credit risk assessment.

To enhance fraud detection, we address ghost borrowers: fraudsters who manipulate financial records to evade repayment. We integrate time series analysis using Temporal Convolutional Networks (TCNs) to identify suspicious transaction patterns. By combining predictive analytics with fraud detection, our study provides actionable insights to improve lending decisions and reduce financial losses.

2. LITERATURE REVIEW

2.1 Machine Learning Algorithms

Random Forest is a learning technique used to solve both regression and classification problems. It builds several trees during training and aggregates their outcomes to increase prediction accuracy. Features are randomly selected at every node (feature bagging), and each tree in the forest is trained on a random subset of the data (bootstrapping). By averaging the outcomes of these trees, Random Forest reduces variance and overfitting, which enhances generalization. The advantages of this algorithm are its ability to handle large datasets, its reduced overfitting, and its good accuracy even in the presence of missing data.

XGBoost (Extreme Gradient Boosting) is a scalable and highly efficient gradient boosting implementation. It constructs decision trees sequentially, with each new tree aiming to correct the errors of the preceding trees. XGBoost uses gradient descent to minimize the overall loss function. It is well known for its performance and speed, particularly with tabular or structured data. The advantages of this algorithm are its speed, scalability, handling of missing data, and regularization to minimize overfitting.

2.2 Deep Learning Algorithms

Neural Networks are loosely modeled on the structure of the human brain. They comprise multiple layers of neurons connected by weighted edges, and the weights are updated during training. Neural networks are widely used in deep learning because they can learn complex recurring patterns. The advantages of this algorithm are its flexibility, its ability to learn from very large datasets, and its capacity to model complex relationships.

Multi-Layer Perceptrons (MLPs) are used in deep learning tasks. An MLP consists of multiple interconnected layers: the output of every neuron in one layer acts as an input to the neurons in the next layer. MLPs are used for classification and regression and serve as the foundation for more intricate neural networks. This ability to handle both kinds of problem, together with their role as a building block for more advanced networks, is their main advantage.

Temporal Convolutional Networks (TCNs) are a type of deep learning architecture designed for sequence modeling tasks, offering an alternative to recurrent neural networks (RNNs) such as LSTMs and GRUs. TCNs leverage 1D
© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 157
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 04 | Apr 2025 [Link] p-ISSN: 2395-0072
dilated causal convolutions, ensuring that predictions at any time step depend only on past information, making them suitable for time-series forecasting, natural language processing, and anomaly detection. They utilize residual connections and dilation to capture long-range dependencies efficiently, allowing for parallel computation and stable gradients, unlike RNNs, which suffer from vanishing gradients and sequential processing limitations.

3. Proposed System

3.1 Problem Statement: “To predict loan default using machine learning techniques.”

3.2 Problem Elaboration: For financial organizations, loan default poses a serious problem since it can result in large losses and elevated risk. Conventional credit evaluation techniques are frequently ineffective and have trouble identifying subtle trends in borrower behavior, which leads to imprecise forecasts. The volume of loan data has increased with the growth of digital financial services, necessitating more advanced algorithms to forecast defaults. Machine learning can make loan default prediction automated and data-driven; nevertheless, choosing the best algorithm can be difficult, especially when the datasets are imbalanced and defaults are rare. To determine which machine learning model performs best, this study compares the efficacy of several algorithms in forecasting loan defaults. The aim is to help financial institutions manage risk better and make more informed loan decisions.

3.3 Architecture of the proposed models:

3.3.1 Loan Default Prediction System

Fig-1 Loan Default Prediction System Architecture

The architecture of our loan default prediction model follows a structured machine learning pipeline, integrating ensemble learning techniques to enhance predictive accuracy. The system is designed to preprocess financial datasets, extract meaningful patterns, and make robust predictions through a combination of machine learning models. The workflow consists of several key stages: data preprocessing, model training using ensemble techniques, and performance evaluation with hyperparameter tuning.

The flowchart represents an ensemble learning pipeline for loan default prediction that integrates Neural Networks, Random Forest, and XGBoost. The dataset undergoes preprocessing, including cleaning and scaling, before training. A weighted voting approach combines predictions based on model performance. After evaluation on a test set using accuracy metrics, hyperparameter tuning is applied if needed. Once optimized, final predictions are generated, leveraging the strengths of multiple models to enhance predictive accuracy and robustness.

Data:

Data Collection:

The dataset for this study was collected from Kaggle and contains loan default prediction records with 34 attributes. These attributes include loan_purpose, Credit_Worthiness, open_credit, business_or_commercial, Credit_Score, age, LTV, Region, Security_Type, and Status. These features capture essential financial and demographic details about loan applicants, providing insights into their creditworthiness and likelihood of default. The dataset also includes variables like credit_type, co-applicant_credit_type, and dtir1 (Debt-to-Income Ratio), which are crucial indicators for assessing risk. With a mix of categorical and numerical attributes, this dataset offers a well-rounded foundation for training predictive models.

Data Preprocessing:

Effective preprocessing enhances the reliability of machine learning models by systematically cleaning and transforming data. This study follows a structured pipeline, beginning with data cleaning and feature selection using pandas. Redundant columns (e.g., Interest_rate_spread, credit_type, Upfront_charges) are removed, and missing values are handled by imputing numerical features with the median and categorical features with the mode. Categorical variables are encoded using one-hot encoding with drop_first=True to avoid the dummy variable trap. PyCaret automates key preprocessing tasks, including feature scaling, transformation, imbalance handling, and feature selection,
before splitting the dataset into training (75%) and testing (25%) sets with Status as the target variable.

To address class imbalance, SMOTE generates synthetic samples for the minority class, improving predictive fairness. Numerical features are standardized via StandardScaler (mean = 0, std = 1), while PowerTransformer makes skewed features more Gaussian-like. Finally, column names are sanitized using regular expressions to remove special characters, ensuring compatibility with machine learning frameworks. This streamlined approach optimizes model performance and enhances predictive accuracy.

Ensemble Learning Approach

Ensemble learning enhances accuracy and robustness by integrating multiple models to make more reliable predictions. The strengths of MLP (captures complex patterns), Random Forest (reduces overfitting), and XGBoost (enhances generalization) are leveraged together. Predictions from each model are weighted based on accuracy, and a final probability score is calculated. This approach yields a better bias-variance tradeoff, improving loan default prediction performance compared to individual models.
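
The accuracy-weighted combination of model predictions described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the per-model accuracies, and the probability values are hypothetical, and a decision threshold of 0.5 is assumed.

```python
def weighted_ensemble_proba(model_probas, model_accuracies):
    """Blend per-model default probabilities using accuracy-proportional weights."""
    total = sum(model_accuracies)
    # Each model's weight is its accuracy divided by the sum of all accuracies,
    # so the weights add up to 1 and the blend stays a valid probability.
    weights = [acc / total for acc in model_accuracies]
    n_samples = len(model_probas[0])
    return [
        sum(w * probas[i] for w, probas in zip(weights, model_probas))
        for i in range(n_samples)
    ]


def predict_labels(probas, threshold=0.5):
    """Turn blended probabilities into 0/1 labels (1 = predicted default)."""
    return [1 if p >= threshold else 0 for p in probas]


# Hypothetical validation accuracies for MLP, Random Forest, and XGBoost.
accuracies = [0.90, 0.92, 0.93]

# Hypothetical predicted default probabilities for three applicants.
mlp_proba = [0.20, 0.70, 0.55]
rf_proba = [0.10, 0.80, 0.40]
xgb_proba = [0.15, 0.90, 0.45]

final_proba = weighted_ensemble_proba([mlp_proba, rf_proba, xgb_proba], accuracies)
final_labels = predict_labels(final_proba)
print(final_labels)
```

In this example the third applicant is labelled as a non-defaulter even though the MLP alone would flag them, because the two higher-weighted models disagree with it.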
Models Used in This Project and Their Working Mechanism

This project utilizes an ensemble learning approach, combining multiple machine learning models to improve predictive performance. The individual models used are:

1. Multi-Layer Perceptron (MLP) Classifier
2. Random Forest Classifier
3. XGBoost Classifier

The final prediction is made using a weighted ensemble method, where each model's contribution is proportional to its accuracy.

1. Multi-Layer Perceptron (MLP) Classifier

MLP is an artificial neural network with multiple layers that captures complex data relationships. It consists of an input layer, hidden layers, and an output layer, using activation functions like ReLU to introduce non-linearity. It learns through backpropagation and gradient descent, making it effective for the non-linear patterns found in financial data, as in loan default prediction.

2. Random Forest Classifier

Random Forest is an ensemble learning method that builds multiple decision trees using different data subsets. It reduces overfitting by training trees independently and making predictions through majority voting. It is efficient with high-dimensional and imbalanced datasets, making it suitable for robust classification tasks.

3. XGBoost (Extreme Gradient Boosting) Classifier

XGBoost is a gradient boosting algorithm optimized for high performance and accuracy. It builds decision trees sequentially, correcting previous errors, while using regularization to prevent overfitting. Its efficiency in handling large datasets makes it a top choice for fraud detection and credit risk modeling.

Results

The final ensemble model achieves an accuracy of 92.59%, indicating strong predictive performance. It has a precision of 79.98%, meaning 79.98% of predicted defaulters were actual defaulters, and a recall of 93.29%, showing it successfully identified 93.29% of all actual defaulters. The F1-score of 86.12% balances precision and recall. The ROC-AUC score of 98.47% suggests excellent differentiation between defaulters and non-defaulters. The classification report shows that for non-defaulters (Class 0), precision is 98% and recall is 92%, while for defaulters (Class 1), precision is 80% and recall is 93%. The confusion matrix reveals 25869 true negatives, 2139 false positives, 615 false negatives, and 8545 true positives, showing that while the model correctly predicts most cases, it misclassifies some non-defaulters as defaulters. Overall, the model favors recall over precision: it misses only 615 of the 9160 actual defaulters, at the cost of 2139 false alarms.

High recall (0.9329) indicates the model's strong ability to correctly identify positive cases, minimizing false negatives. This is particularly crucial in applications such as fraud detection, medical diagnosis, and intrusion detection, where missing a true positive can have serious consequences.

Additionally, the F1-score (0.8612) balances precision and recall, indicating that the model does not generate too many false positives while still identifying true positives. This trade-off is essential in scenarios where both false positives and false negatives carry significant consequences, that is, where false negatives are costly but precision cannot be sacrificed entirely.
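
The headline metrics follow directly from the confusion-matrix counts reported in the Results. The sketch below recomputes them as a consistency check; the function name and dictionary keys are illustrative, and only the four counts come from the text.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted defaulters that are real
    recall = tp / (tp + fn)      # share of real defaulters that are caught
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the reported confusion matrix (Class 1 = defaulter).
metrics = classification_metrics(tn=25869, fp=2139, fn=615, tp=8545)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Run as-is, the counts reproduce the reported accuracy (92.59%), precision (79.98%), recall (93.29%), and F1-score (86.12%) to two decimal places.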
Data Preprocessing:
This study follows a structured preprocessing pipeline to
enhance model performance and reliability. Missing values
in activity_after_large_withdrawal, loan_payment_total, and
loan_payment_count are imputed with zero to maintain
consistency. Feature engineering introduces derived
variables such as withdrawal_to_loan_ratio,
repayment_ratio, and transaction_activity_ratio to improve
predictive power. The target variable borrower_type is
transformed into a binary format (1 for ghost borrowers, 0
for normal borrowers) for supervised classification. Key
features selected include loan_amount,
pre_loan_transaction_count, first_week_withdrawal_ratio,
post_loan_transaction_count, and repayment_ratio. Finally,
numerical features are standardized using StandardScaler
to ensure uniformity and optimize model performance.
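
The preprocessing steps above can be sketched in plain Python as follows. The text does not give the formulas for the derived ratios, so the definitions used here are assumptions, as are the total_withdrawal input column, the "ghost"/"normal" label strings, the is_ghost target name, and the two sample records. Standardization is written out by hand with the population standard deviation, which is the convention StandardScaler follows.

```python
from statistics import mean, pstdev

ZERO_IMPUTED = ("activity_after_large_withdrawal", "loan_payment_total", "loan_payment_count")
SELECTED = ("loan_amount", "pre_loan_transaction_count", "first_week_withdrawal_ratio",
            "post_loan_transaction_count", "repayment_ratio")


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def preprocess(records):
    """Zero-impute, add derived ratios, and encode the binary target."""
    rows = []
    for rec in records:
        row = dict(rec)
        for col in ZERO_IMPUTED:
            if row.get(col) is None:
                row[col] = 0
        # Assumed definitions of the derived features (not specified in the text).
        row["withdrawal_to_loan_ratio"] = safe_ratio(row["total_withdrawal"], row["loan_amount"])
        row["repayment_ratio"] = safe_ratio(row["loan_payment_total"], row["loan_amount"])
        row["transaction_activity_ratio"] = safe_ratio(
            row["post_loan_transaction_count"], row["pre_loan_transaction_count"])
        # 1 for ghost borrowers, 0 for normal borrowers.
        row["is_ghost"] = 1 if row["borrower_type"] == "ghost" else 0
        rows.append(row)
    return rows


def standardize(rows, columns):
    """Z-score the given columns (mean 0, population std 1)."""
    scaled = [dict(r) for r in rows]
    for col in columns:
        values = [r[col] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        for r in scaled:
            r[col] = (r[col] - mu) / sigma if sigma else 0.0
    return scaled


# Two hypothetical borrowers, for illustration only.
records = [
    {"loan_amount": 1000, "total_withdrawal": 950, "first_week_withdrawal_ratio": 0.95,
     "loan_payment_total": None, "loan_payment_count": None,
     "activity_after_large_withdrawal": None, "pre_loan_transaction_count": 20,
     "post_loan_transaction_count": 1, "borrower_type": "ghost"},
    {"loan_amount": 2000, "total_withdrawal": 400, "first_week_withdrawal_ratio": 0.10,
     "loan_payment_total": 1000, "loan_payment_count": 5,
     "activity_after_large_withdrawal": 12, "pre_loan_transaction_count": 10,
     "post_loan_transaction_count": 15, "borrower_type": "normal"},
]

rows = preprocess(records)
scaled_rows = standardize(rows, SELECTED)
```

In practice the scaling statistics would be fitted on the training split only and then reused on the test split, so that no information leaks from the test data.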
[3] Wanjun Wu, "Machine Learning Approaches to Predict Loan Default," 2022. DOI:10.4236/iim.2022.145011

[4] Platur Gashi, "Loan Default Prediction Model," 2023. DOI:10.13140/RG.2.2.22985.01126

[5] Lai, L., "Loan default prediction with machine learning techniques." In: 2020 International Conference on Computer Communication and Network Security (CCNS), pp. 5–9. IEEE (2020)

[6] Xu Zhu, Qingyong Chu, Xinchang Song, Ping Hu, Lu Peng, "Explainable prediction of loan default based on machine learning models." DOI:10.1016/[Link].2023.04.003

[7] Zhao X, Guan S. CTCN: a novel credit card fraud detection method based on Conditional Tabular Generative Adversarial Networks and Temporal Convolutional Network. PeerJ Comput Sci. 2023 Oct 10;9:e1634. doi: 10.7717/peerj-cs.1634. PMID: 37869461; PMCID: PMC10588710.

[8] A. Mandge, R. Fatehchandka, K. Goudani, T. Shelke, and P. M. Chawan, "A Survey on Loan Default Prediction using Machine Learning Techniques," International Research Journal of Engineering and Technology (IRJET), vol. 11, no. 11, pp. XX-XX, Nov. 2024

BIOGRAPHIES

Adwait Mandge,
B. Tech Student, Dept. of Computer Engineering and IT, VJTI, Mumbai, Maharashtra, India

Workshops/STTPs/FDPs. She has participated in 16 National/International Conferences. Worked as Consulting Editor on JEECER, JETR, JETMS, Technology Today, JAM&AER Engg. Today, The Tech. World; Editor: Journals of ADR; Reviewer: IJEF, Inderscience. She has worked as NBA Coordinator of the Computer Engineering Department of VJTI for 5 years. She had written a proposal under TEQIP-I in June 2004 for ‘Creating Central Computing Facility at VJTI’. Rs. Eight Crore were sanctioned by the World Bank under TEQIP-I on this proposal. Central Computing Facility was set up at VJTI through this fund, which has played a key role in improving the teaching-learning process at VJTI. Awarded by SIESRP with the Innovative & Dedicated Educationalist Award, Specialization: Computer Engineering & I.T., in 2020. AD Scientific Index Ranking (World Scientist and University Ranking 2022): 2nd Rank, Best Scientist, VJTI, Computer Science domain; 1138th Rank, Best Scientist, Computer Science, India.

Kunal Goudani,
B. Tech Student, Dept. of Computer Engineering and IT, VJTI, Mumbai, Maharashtra, India