Report
Group Members
Sidra Nadeem
Hussain Ali
Aruba Irshad
Rizwan
Chapter 1: Predict Loan Default
Introduction:
The first chapter focuses on building three machine learning models to forecast loan defaults for the bank: logistic regression, Random Forest, and Support Vector Machines (SVM). The chapter is divided into five main sections. The KNIME Analytics Platform is used to create these supervised machine learning models. Understanding the likelihood of loan defaults is essential for managing portfolio health and risk exposure under regulatory requirements and prevailing economic conditions. The strategy used to create the supervised learning models and forecast loan default is detailed below.
Data Interpretation:
The data is supplied in two files. The first, the training data, is used to train the three supervised models; the second is the test data to which the models are then applied. The historical loan data comprises 10,000 entries, with 1 denoting a default and 0 a non-default. The numerical features (the independent variables, or predictors) include Loan ID, Age, Income, Credit Score, Loan Term, Loan Amount, Interest Rate, DTI Ratio, Months Employed, and Num Credit Lines. The categorical features include Education, Marital Status, Mortgage status, Loan Purpose, Co-signer, and Dependents. Our primary objective is the target variable, Default, which takes values 0 (non-default) and 1 (default). Both files are supplied in CSV format for the KNIME Analytics Platform to read. The final objective is to assign a default probability to each case using class probabilities. Table 0 (appendix) details the features.
Task Analysis Plan:
The task analysis plan is divided into the following main phases: data import and exploratory analysis, feature engineering, bivariate analysis, model training, testing, and class probability estimation.
Phase 0: Data Import:
Importing the training data was the first step; the File Reader node was used to read it. This stage is shown in fig. 1 below.
(figure.1)
(figure.2)
The primary objective is to view the data at a single glance. The first section, exploratory data analysis (EDA), covered the following nodes: Statistics, Histogram, Value Counter, and GroupBy.
EDA Analysis:
The main findings, displayed in table 1 (appendix), come from the Statistics node. The typical applicant is nearly 44 years old, with an average yearly income of about €83,000 and an average loan amount of €127,775. The income and loan distributions are wide, as indicated by standard deviations of €39,148 and €70,615 respectively; their skewness and kurtosis values are close to those of a normal distribution. The average credit score is 575, ranging from 300 to 849. The Debt-to-Income (DTI) ratio averages around 0.5, meaning debt obligations consume almost half of an applicant's income. The loan default rate is around 11.8%. Additionally, the target variable's kurtosis (3.59) indicates heavier tails, suggesting that some applicants engage in extreme risk behaviours. These statistics characterize the entire dataset. As seen in fig. 3 (appendix), the Value Counter node is used to view the number of defaults. As illustrated in fig. 4, there are 1,182 defaults and 8,818 non-defaults, meaning that only about 12% of applicants default and 88% do not. The key attributes of defaulters and non-defaulters are separated using another node, GroupBy; table 2 displays the findings. The defaulted candidates average about 36 years of age, younger than the non-defaulters, indicating less maturity and experience. Income is a powerful deterrent against default, as evidenced by the average income of €73,245 for defaulters versus €84,323 for non-defaulters. Furthermore, even though defaulters had lower credit scores (554.7 vs. 577.5), they took out larger loans (average €141,233 vs. €125,990). This implies that the defaulters' characteristics directly related to the target variable are: low income, young age, low credit scores, larger loan amounts, higher financial burden, and risk tolerance. A slightly higher DTI ratio (0.517 vs. 0.51) suggests that their disposable incomes are squeezed more tightly, and lower months employed indicates shorter work histories and less job stability.
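The same exploratory summary can be sketched outside KNIME. The following Python snippet mirrors the Statistics, Value Counter, and GroupBy nodes on a tiny hypothetical frame; the column names and values are illustrative stand-ins, not the actual dataset:

```python
import pandas as pd

# Hypothetical sample standing in for the loan CSV read by the File Reader node;
# column names ("Age", "Income", "Default") are assumptions based on the report.
df = pd.DataFrame({
    "Age":     [25, 44, 61, 38, 52, 29],
    "Income":  [40000, 83000, 120000, 76000, 95000, 51000],
    "Default": [1, 0, 0, 1, 0, 0],
})

# Equivalent of the Statistics node: per-column summary statistics.
summary = df.describe()

# Equivalent of the Value Counter node: default vs non-default counts and rate.
counts = df["Default"].value_counts()
default_rate = df["Default"].mean()

# Equivalent of the GroupBy node: mean attributes split by default status.
profile = df.groupby("Default")[["Age", "Income"]].mean()
print(default_rate)  # fraction of defaulters in the sample
```

On the real data, the same three steps yield the default rate and the defaulter/non-defaulter profiles reported above.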
Outlier Detection:
Outlier detection began with a Numeric Outliers node, which reported zero outliers for every variable. Figure 5 presents the results of another node, the Box Plot. Variables such as income and loan amount exhibit high variability and multiple upper outliers, indicating a large financial range among borrowers and the existence of high-risk lending cases. The distribution of Months Employed shows a positive skew, with many cases having comparatively brief work histories. With medians of 13.5% and 36 months respectively, the interest rate and loan term distributions appear more structured and reflect the terms of typical loan products. Overall, the plots confirm the integrity of the data while highlighting regions of possible modelling significance and risk concentration.
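The rule behind the Numeric Outliers node and the box-plot whiskers is the interquartile-range (IQR) fence; a minimal sketch, using made-up income-like values, is:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the same rule the
    KNIME Numeric Outliers node and box plots use by default."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Toy income-like data: one extreme value stands out as an upper outlier.
incomes = [40, 55, 60, 62, 70, 75, 80, 300]
print(iqr_outliers(incomes))  # -> [300]
```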
Phase 2: Feature Engineering and Data Transformation:
In order to prepare and preprocess the data for the models, phase 2 involved scaling, cleaning,
and encoding the data. Figure 6 illustrates this as follows:
(fig.6)
The first step imputes missing values to avoid data loss and complete the feature matrix for model training. Numerical features were imputed using median and mean values, while categorical features were filled with the most frequent value; this prevents data loss and model bias. Figure 7 (appendix) shows this. The categorical data was then converted to numerical data using a One to Many node, which creates a separate 0/1 indicator column for each category (e.g., High School, Bachelor's, Master's), so the model can assess each level separately while the categorical data stays machine-readable. Next, a Column Filter was used to eliminate features that were not relevant, such as the ID column, and one indicator from each categorical feature (e.g., Bachelor's from education, Married from marital status) to avoid redundant, collinear columns and the bias they introduce. Table 3 (appendix) illustrates this. Normalisation was completed last; it scales all numeric features to the [0, 1] range, preventing biased weighting, ensuring that all features contribute comparably to model training, and improving the convergence rate of gradient-based models. Table 4 (appendix) illustrates this. Finally, a Number to String node converts the "default" variable so that the target outputs are human-readable.
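The preprocessing chain (imputation, one-to-many encoding, column filtering, normalisation, number-to-string) can be sketched in Python as follows; the mini-frame and its column names are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical mini-frame with a missing income and a categorical column.
df = pd.DataFrame({
    "Income":    [40000.0, None, 120000.0, 76000.0],
    "Education": ["High School", "Bachelor's", None, "Master's"],
    "Default":   [1, 0, 0, 1],
})

# 1) Missing values: median for numeric, most frequent for categorical.
df["Income"] = df["Income"].fillna(df["Income"].median())
df["Education"] = df["Education"].fillna(df["Education"].mode()[0])

# 2) One-to-many (one-hot) encoding of the categorical feature,
#    dropping one level per feature to avoid redundant, collinear columns.
df = pd.get_dummies(df, columns=["Education"], drop_first=True)

# 3) Min-max normalisation of numeric predictors to the [0, 1] range.
df["Income"] = (df["Income"] - df["Income"].min()) / (df["Income"].max() - df["Income"].min())

# 4) Number-to-string conversion of the target for readable class labels.
df["Default"] = df["Default"].map({0: "non-default", 1: "default"})
```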
Phase 3: Bivariate Analysis:
The third stage covered bivariate analysis with scatter plots and correlations. Figure 8 below
shows this:
(fig.8)
Correlation was the first step; the screenshot is included in the appendix as table 5. The correlation analysis shows that Age, Income, Credit Score, and Months Employed are negatively correlated with Default: younger applicants with lower incomes and credit scores are more likely to default. Notably, the DTI ratio (r = 0.022) and interest rate (r = 0.137) exhibit weak but statistically significant positive correlations with default, supporting the notion that greater financial strain raises risk. Among the categorical variables, being unemployed, single, or without a co-signer shows weak-to-moderate positive correlations with default, while full-time employment, marriage, and co-signing are linked to lower risk. Although the majority of correlations are weak (r < 0.2), their cumulative effect on model performance is still valuable, as indicated by statistical significance (p < 0.05 in many cases). Scatter plots were covered in the second step; the goal is to find predictive features while reducing noise and multicollinearity. Features such as Months Employed (r ≈ -0.078), Loan Term (r ≈ 0.007), and Num Credit Lines (r ≈ 0.02) have very weak correlations. The best candidates to keep were credit score, income, interest rate, DTI ratio, co-signer status, dependents, employment type, and marital status.
1.1. Phase 4: Model Training:
The fourth and the most important phase is the model training. We trained three models:
Logistic, Random Forest and SVM. The workflow is shown below in fig.9:
(fig.9)
The algorithms we used, Logistic Regression, Random Forest, and Support Vector Machine (SVM), represent a strategic balance of model capacity and the bias-variance trade-off that is especially relevant to predicting loan default. Given the complex, partially non-linear relationships in our dataset between credit score, income, DTI ratio, and default, Logistic Regression serves as a transparent baseline model that helps stakeholders identify direct relationships, even if it cannot capture the underlying non-linear patterns. Random Forest, by contrast, is known for its high capacity and low bias; it can capture interactions among multiple borrower attributes while managing overfitting risk through ensembling. SVM, applied with an RBF kernel, offers flexibility by modelling non-linear decision boundaries and is beneficial for separating borderline applicants, especially those whose profiles are not linearly distinguishable. Together, these models allow a thorough evaluation of both simple and complex feature interactions, which is essential for building a model that generalizes to real-world loan applicants.
In the first step, we partitioned the data into an 80/20 train/test split using the X-Partitioner node, with 10-fold cross-validation. This is shown in fig.10 (appendix). After this, the learner nodes for all three models were connected to the partitioning node, and predictor nodes were used to form the models. The following are some key characteristics of our three models:
1. Cross Validation:
This was done using the X-Partitioner node to minimize the risk of overfitting. The number of folds that maximized our AUC was 10.
2. SMOTE:
Because our dataset is imbalanced (defaulters are a 12% minority), we used SMOTE to balance it, so that our models can distinguish defaulters without bias toward the majority class. The best configurations for Logistic Regression and Random Forest are shown in fig.11 (appendix). Oversampling of the minority class was chosen, and several static seeds (1, 2, 3, 4, 5, and others) were tried; the best turned out to be '987654', which also guarantees the same split every time.
3. Parameter Optimization:
This was done to tune hyperparameters such as tree depth, node size, and regularization. Different ranges were tried and widened until the best values were found, which were then entered manually in the advanced settings. This boosts the predictive power of our models; a sample screenshot with the optimized parameters is attached in fig.12. For SVM, a polynomial kernel was used, as the boundary between default and non-default is not linearly separable.
4. Gradient Boost:
Finally, gradient-boosted learners were used to boost the performance of the random forest. Unlike a single decision tree, gradient boosting builds many small trees one at a time, where each new tree learns from the mistakes of the previous ones. This gave us high accuracy while limiting overfitting.
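To make the SMOTE step concrete, here is a minimal, simplified sketch of the oversampling idea: interpolate between a minority point and one of its nearest minority neighbours. It is a stand-in for the KNIME SMOTE node, not its exact implementation, and the static seed mirrors our reproducibility choice:

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=987654):
    """Minimal SMOTE-style oversampling sketch: for each new sample, pick a
    minority point, pick one of its k nearest minority neighbours, and
    interpolate at a random position between them."""
    rng = np.random.default_rng(seed)  # static seed -> reproducible samples
    X_min = np.asarray(X_min, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # k nearest neighbours, excluding self
        j = rng.choice(nbrs)
        gap = rng.random()
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))  # interpolate
    return np.array(new)

# Minority (defaulter) points in a 2-feature space; generate 4 synthetic ones.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
synthetic = smote_like(minority, n_new=4)
print(synthetic.shape)  # (4, 2)
```

Each synthetic point lies on a segment between two real minority points, so the oversampled class stays inside the region the defaulters already occupy.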
(fig.13)
First, regarding the choice of evaluation metrics: we used the Scorer node to evaluate the overall results based on recall, precision, sensitivity, specificity, F-measure, and overall accuracy. We chose this set of metrics to get a sufficient understanding of performance, because our dataset has class imbalance. In credit-risk modelling, recall is essential, as it measures how effectively the model finds real defaulters, the primary concern for banks seeking to minimize financial losses. Precision confirms that we do not mistakenly categorize too many trustworthy clients as risky, which would harm the business. Specificity balances this by showing how effectively non-defaulters are identified. The F-measure merges precision and recall into a single score for trade-off evaluation, and although overall accuracy is less informative under class imbalance, it still gives an overall idea of correct predictions. This multi-metric approach captures both business risk and model dependability. Last but not least are the AUC and ROC curve. The ROC curve plots the true positive rate (sensitivity) against the false positive rate, helping visualize the balance between accurately detecting defaulters and minimizing false positives. AUC condenses this curve into a single value; the nearer it is to 1.0, the better the model is at ranking high-risk applicants above low-risk ones. This is particularly useful in our loan-default setting, since banks often care more about ranking customers by risk than about a binary classification.
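The Scorer-node metrics follow directly from the confusion-matrix counts; the sketch below uses illustrative counts (not our actual confusion matrix) to show each formula:

```python
# Metrics from a confusion matrix, as reported by the KNIME Scorer node.
# Illustrative counts only:
TP, FN = 60, 40    # defaulters caught / missed
TN, FP = 850, 50   # non-defaulters cleared / wrongly flagged

recall      = TP / (TP + FN)              # sensitivity: share of defaulters found
precision   = TP / (TP + FP)              # share of flagged cases that truly default
specificity = TN / (TN + FP)              # share of non-defaulters correctly cleared
f_measure   = 2 * precision * recall / (precision + recall)
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(recall, precision, specificity, round(f_measure, 3), accuracy)
```

Note that with only 10% defaulters, a model that flags no one still reaches 90% accuracy, which is exactly why recall and the F-measure matter more here.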
Results:
The results of each model are given in screenshots (fig.14 for Logistic Regression, fig.15 for Random Forest, fig.16 for SVM) (appendix) and summarized below:
Logistic Regression showed notable accuracy (0.882) and outstanding specificity (0.992), indicating that it effectively recognized the majority of non-defaulters. However, its recall for defaulters was very low (0.059), showing that the model missed almost all actual default instances. This indicates underfitting, a sign of high bias and low capacity, where the model oversimplifies relationships and cannot capture non-linear interactions. Although it achieved the highest AUC (0.811), showing strong overall ranking capability, its effectiveness at actually finding defaulters was the weakest, so it is considered inappropriate for use.
Random Forest gave more balanced outcomes. It showed a somewhat reduced AUC (0.790), but this was counterbalanced by considerably greater recall (0.627), guaranteeing a markedly improved capacity to identify defaulters. While its precision (0.247) was modest, its F1-score (0.355) demonstrates the best balance between catching actual defaulters and limiting false positives. Its specificity (0.745) showed adequate handling of dependable customers. Random Forest, as discussed, is a high-capacity, low-bias model with variance controlled via ensembling, and it is adept at managing the difficult, non-linear feature interactions found in credit data. The slightly reduced AUC is acceptable, since the model performs well in the critical area of recognizing genuine defaulters.
Support Vector Machine (SVM) attained flawless specificity (1.000) and considerable accuracy (0.882), but it failed to identify any defaulters (recall = 0.000). This indicates a collapse onto the dominant class and a failure to generalize to minority-class patterns, i.e., the applicants who are actually going to default, probably caused by inadequate parameter tuning. Even with a competitive AUC of 0.799, its classification performance renders it impractical for predicting credit risk in this setup.
All these results are summarized in table 6 (appendix).
1.3. Testing phase:
The testing phase tests our developed models; a high testing error would mean our models are overfitting. The flow for the testing phase is given below in fig.17:
(fig.17)
First, a Reader node was used to load the test data on loan defaults. A meta node containing the data preprocessing from the training phase was connected to the test data. Predictor nodes were connected to the respective learners from the trained models. Probabilities were appended in the predictor nodes, and Column Filters kept only P(Default) from each model; these columns were renamed to 'p_SVM', 'p_Log', and 'p_RF' and then joined using a Joiner node. Here, we introduced an ensemble method to combine the probabilities from all three models into one centralized decision-making strategy, aggregating their forecasts to minimize both bias and variance in the result. Rather than depending on a single algorithm, the ensemble improves generalization by combining different decision-making logics. In our case, averaging the class probabilities (P(Default)) stabilized the forecasts and reduced the likelihood of overfitting or underfitting linked to the individual learners. A Math Formula node was then added using the formula: if(($Prob_Default_SVM$ + $Prob_Default_LogReg$ + $Prob_Default_RF$) / 3 > 0.5, 1, 0), and finally we had each applicant ID with its respective probability of default.
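The Math Formula node's expression amounts to soft voting; a small Python equivalent, with made-up probabilities, is:

```python
# Soft-voting ensemble: average the three models' P(Default) per applicant
# and apply the 0.5 cutoff, mirroring the Math Formula node expression.
def ensemble_default(p_svm, p_log, p_rf, threshold=0.5):
    p_avg = (p_svm + p_log + p_rf) / 3
    return (1 if p_avg > threshold else 0), p_avg

# Example applicant: the models disagree, so the average decides.
label, p = ensemble_default(0.70, 0.40, 0.55)
print(label, round(p, 3))  # -> 1 0.55
```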
1.4. Class Probability Estimation (Default Test Results)
The final stage of task 1 produced the default results. This was done using class probability estimation, which is useful for ranking when the class distribution is skewed. The workflow is given below in fig. 18; a Math Formula node combines the probabilities of the three models and aggregates them into a single one.
(fig.18)
Initially, a probability above 50% was associated with default; using the Value Counter we found 89 defaults and 4,911 non-defaults. With a 40% threshold, we found 511 defaults and 4,489 non-defaults. This is shown in fig. 19 (appendix). Next, we wanted to rank the defaulters, so we used the Sorter node to rank them from highest to lowest probability, then Row Sampling and a Column Filter to find our top 10 applicants. This is shown in fig.20 (appendix), with 0.66 (66%) being the highest predicted probability of default. A Statistics node was also used, whose results are given in table 7 (appendix). The ensemble result shows a low average default probability (0.018) alongside high skewness (7.296) and kurtosis (51.25), suggesting a cautious prediction tendency. This demonstrates the influence of Logistic Regression and SVM, which preferred non-default outcomes; Random Forest showed increased variance and a higher mean (0.38), reflecting its improved recall. These findings validate our previous assessment: the ensemble reconciles prudent forecasts with enhanced detection of defaulters. Finally, a Rule Engine was used to categorize defaulters into high (50% and greater), medium (40% to 49%), and low (less than 40%) risk; we found 89 high risk, 422 medium, and 4,489 low, as attached in fig.21 (appendix). Additionally, we used the GroupBy node to find the key characteristics of defaulters by loan purpose, marital status, employment, and education. These results are given in fig.22 (appendix), along with the summary table (table 8) (appendix). In conclusion, we found that lower income, younger age, and specific statuses such as self-employed or divorced are associated with higher predicted default probabilities, as can be verified in table 2 (appendix) via the GroupBy results described above.
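The Rule Engine banding can be expressed as a small function; the thresholds are the ones stated above, while the sample probabilities are illustrative:

```python
def risk_band(p_default):
    """Rule Engine logic used to band applicants:
    high >= 0.50, medium 0.40-0.49, low < 0.40."""
    if p_default >= 0.50:
        return "high"
    if p_default >= 0.40:
        return "medium"
    return "low"

probs = [0.66, 0.45, 0.12, 0.50, 0.39]
bands = [risk_band(p) for p in probs]
print(bands)  # -> ['high', 'medium', 'low', 'high', 'low']
```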
Chapter 2: Clustering using Unsupervised Learning:
This chapter provides a basic understanding of loan applicants by dividing them into three different groups, called clusters, based on their personal and financial characteristics. The purpose of this analysis is to understand which group is more likely to default on its loans and how a bank can design strategies to manage risk, treat customers accordingly, and adjust its strategies based on customer data. The task started with loading the training data using a Reader node, followed by data preprocessing: a Column Filter to remove the ID, Missing Value to replace missing entries with mean and most frequent values, One to Many for categorical conversions, and finally a Normalizer to scale the data to the [0, 1] range. K-means clustering was used, and the results were displayed using PCA, scatter plots, bar charts, and the silhouette coefficient. The entire workflow for the clustering analysis is given below in fig.23:
(fig.23)
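As a sketch of the clustering step, the following Python snippet runs k-means with k = 3 on toy normalised features and checks the silhouette coefficient, mirroring the KNIME k-Means workflow; the data is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy normalised applicant features (income, loan amount in [0, 1]);
# the real workflow feeds the preprocessed training data instead.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.2, 0.2], 0.05, (20, 2)),   # low-income / small-loan group
    rng.normal([0.5, 0.8], 0.05, (20, 2)),   # mid-income / large-loan group
    rng.normal([0.9, 0.4], 0.05, (20, 2)),   # high-income / mid-loan group
])

# k = 3 clusters, as in the KNIME k-Means node configuration.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Silhouette coefficient: closer to 1 means well-separated clusters.
score = silhouette_score(X, km.labels_)
print(round(score, 2))
```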
Cluster Insights using categorical features:
Cluster 0:
● Contains fewer high-risk indicators than Cluster 1, but displays a fairly uniform distribution of marital status.
Cluster 1:
● A high percentage of jobless people and part-time employees, as well as a lower
percentage of people with advanced degrees.
● Less likely to have co-signers or mortgages, and more likely to be unmarried or divorced.
● This group suggests less financial stability, which may call for stricter risk control
procedures.
Cluster 2:
● The majority of married applicants have steady employment and higher loan
commitments (many have mortgages).
● Overall, education levels are strong, especially for high school and bachelor's degrees.
● Most likely have co-signers and dependents, which suggests more intricate
arrangements for financial or familial responsibilities.
Cluster Insights using numerical features:
We used pie chart nodes to gain insight into clusters, and fig. 28 (appendix) displays the
corresponding numerical features. For each cluster, we obtained the following results:
Cluster 0:
● Has a high average income (€82,794.33) and the highest credit score (577.68), both of which point to strong repayment capacity.
● This group appears to be low-risk, making them perfect for favourable credit terms.
Cluster 1:
● Shows signs of financial stress, with the lowest credit score (571.10) and a slightly lower average income (€82,603.22).
● Has a slightly shorter employment duration (58.46 months) and an average loan of
€128,726.14, both of which could be signs of instability.
● The group may require additional verification or more stringent lending conditions.
Cluster 2:
● Has the lowest loan amount (€128,051.09) and the highest average income
(€83,358.20), demonstrating responsible borrowing habits.
● The credit score is average (574.71) and the length of employment is constant (58.56
months).
● This demographic is steady and moderately risky, making them perfect for tailored loan
offers.
(table 8)
Appendix:
NumCreditLines — Number of currently active/open credit lines — Numerical — Predictor
DTIRatio — Debt payments divided by monthly income — Numerical — Predictor
(table 0)
(table 1)
(fig.3)
(fig.4)
(table 2)
(fig.5)
(fig.7)
(table 3)
(table 4)
(table 5)
(fig.10)
(fig.11)
(fig.12)
(fig.14)
(fig.15)
(fig.16)
(table 6)
(fig.19)
(fig.20)
(table 7)
(fig.21)
(fig.22)
(table 8)
(fig.24)
(fig.25)
(fig.26)
(fig.27)
(fig.28)
Task 3
Cluster 0
Profile: Senior candidates with low default rates, steady incomes, and clean credit histories.
Strategy
● Provide rewards for loyalty, such as increased credit lines or reduced interest rates.
● Encourage cross-selling of high-end banking products, such as insurance and savings plans.
Cluster 1
Profile: Younger, poorer people who are more likely to default and have lower credit scores.
Strategy
● Put in place more stringent post-disbursement oversight
● Provide budgeting resources, behavioural nudges, and financial literacy initiatives.
● To increase payment consistency, take into account modified repayment plans.
Cluster 2
Candidates in Cluster 2 have balanced risk profiles, moderate credit performance, and are
middle-aged or middle-income. They do not belong to the low-risk, high-trust category, but they
are also not high-risk.
Strategy
● Use up-to-date financial data to periodically reevaluate their risk score.
● Offer optional add-ons (such as payment protection plans and auto-debit discounts).
● Keep an eye out for behavioural changes that might lead them to fall into higher-risk
groups; early detection can help avoid unpleasant surprises.
The bank can transition from generic credit policies to a more customised, risk-adjusted approach by incorporating the predicted default probabilities from Task 1 into the cluster assignments from Task 2. For example, borrowers with medium-to-high default scores may be flagged for closer monitoring, even within the low-risk Cluster 0. Medium-risk borrowers in the moderate Cluster 2 can receive proactive assistance before their default risk increases.
Through this layered application of descriptive segmentation (Task 2) and predictive modelling (Task 1), the bank can implement focused risk interventions, tailor product offerings to the type of customer, and boost the resilience of its portfolio.
Question 2
We used a wide range of financial and demographic characteristics in Task 1, including income,
credit score, loan term, loan amount, education, and work history. We can add more variables
that represent changing borrower behaviour and the economic environment to further improve
the model's predictability. These could include recent large transactions, account balance
fluctuations, or missed payments, among other behavioural characteristics. In order to generate
summary indicators, these features can be extracted from transactional data and combined with
KNIME's Joiner and GroupBy nodes.
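As an illustration, the GroupBy-then-Joiner pattern for deriving such behavioural indicators could look as follows in Python; the transaction table, column names, and aggregates are hypothetical:

```python
import pandas as pd

# Hypothetical transaction log; in KNIME the same summary would come from
# GroupBy (aggregation) followed by Joiner (merge back onto applicants).
tx = pd.DataFrame({
    "LoanID": [1, 1, 2, 2, 2],
    "Amount": [120.0, 900.0, 40.0, 35.0, 60.0],
    "Missed": [0, 1, 0, 0, 1],
})
applicants = pd.DataFrame({"LoanID": [1, 2], "Income": [52000, 83000]})

# Summarise behaviour per applicant: total missed payments, largest transaction.
behaviour = tx.groupby("LoanID").agg(
    missed_payments=("Missed", "sum"),
    max_transaction=("Amount", "max"),
).reset_index()

# Join the behavioural indicators onto the applicant table.
enriched = applicants.merge(behaviour, on="LoanID", how="left")
print(enriched)
```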
By adding macroeconomic variables like the inflation rate, trends in unemployment, or shifts in
interest rates, we can further improve the model. To help the model produce reliable
predictions in the face of shifting economic conditions, these can be imported into KNIME from
outside sources and combined using CSV Reader, Joiner, or Database Connector nodes.
Additionally, the model can adjust to temporal trends by incorporating time-sensitive features
like "change in job status," "time since last credit inquiry," or "repayment behaviour over time."
KNIME's Date and Time Manipulation nodes can be used to build these.
Within KNIME, feature selection methods like Recursive Feature Elimination (RFE) and
Correlation Filter can be used to reduce noise and enhance generalisability. By preserving
transparency and minimising the chance of overfitting, these techniques assist in keeping only
the most significant features.
Furthermore, by communicating which features affect predictions, KNIME's built-in
visualisation tools—such as Box Plot, Bar Chart, and Colour Manager—can help non-technical
stakeholders or regulators understand and accept the model.
Question 3
Model performance was first assessed using 10-fold cross-validation with KNIME's X-
Partitioner node, as shown in Task 1. Robust validation was thus guaranteed. A similar strategy
should be applied on a regular basis to monitor the health of the model during deployment.
a. Retraining Schedule: Using the most recent data, we advise retraining the model every three to six months. This comprises:
● Re-executing the preprocessing procedures (such as encoding, normalisation, and imputation)
● Reapplying SMOTE if the class imbalance persists
● Re-tuning the hyperparameters for best results
b. Tracking Performance:
Utilise KNIME's Scorer and ROC Curve nodes to monitor metrics like AUC, recall, and precision.
Any notable drop in these scores could be a sign of concept drift, which is a shift in borrower
behaviour that lessens the efficacy of the model.
c. Model Version Control:
Use Model Writer/Reader nodes in KNIME to save and reload model versions. Maintain a changelog of:
● Feature additions or removals
● Algorithm changes or tuning adjustments
● Data quality or scope changes
d. Business Explanation & Communication:
Model outputs, especially default probabilities, should be categorized (e.g., Low: < 40%,
Medium: 40–50%, High: > 50%) using Rule Engine. These risk groups can be communicated
clearly in management reports.
Additionally, cluster trends (from Task 2) can be monitored over time. If the share of applicants
in Cluster 1 rises sharply, this could signal emerging risk in the customer base.
e. Visualization:
Set up KNIME dashboards with Table View, Bar Chart, and Line Plot nodes to visualize:
● Default risk distribution
● Cluster trends
● Model performance over time
This supports transparent decision-making across technical and non-technical teams.
Question 4: Consider equity, adherence to regulations, and bias in sampling (CRISP-DM Stage: Business Understanding & Evaluation)
Sampling bias is the most significant limitation found in Task 1. The model does not represent the whole applicant population, because the dataset contains only loan applicants who were approved. Because of this selection bias, the model cannot be applied to initial loan-approval decisions. Its use during the screening process could result in:
● Discrimination against candidates whose profiles differ from those already accepted.
● Exclusion of historically under-represented applicants who might be creditworthy.
As a result, the model should only be applied for risk stratification, monitoring, and focused interventions following initial approval.
a. Fairness audits: Clustering in Task 2 identified applicant subgroups with varying financial
practices. We advise performing routine bias checks with KNIME's GroupBy and Statistics nodes
to evaluate:
● Are some groups disproportionately identified as high-risk (for example, based on marital status or type of employment)?
● Does any cluster have a significantly different rate of false positives or false negatives?
Rebalancing techniques like re-weighting, threshold adjustment, or feature debiasing might be
necessary if biases are found.
b. Regulatory Compliance (e.g., EU Banking Regulations, GDPR): Regulators demand that models be:
● Transparent: we must be able to describe the decision-making process.
● Documented: we must demonstrate the data's use, preprocessing, and modelling.
● Auditable: all forecasts and model iterations must be traceable.
c. Explainability for Compliance: The bank can demonstrate the reasoning behind a decision by
using SHAP, or summary statistics, which is a crucial prerequisite for adhering to the GDPR's
"right to explanation."
d. Ethical Governance: Create a model governance framework that consists of:
● Regular evaluations of fairness
● Meetings with stakeholders regarding model modifications
● Risk committee approvals prior to redeployment
This guarantees that the model meets the bank's financial objectives without sacrificing
compliance or fairness.