
Final Project

Data Science for Business

Group Members

Sidra Nadeem
Hussain Ali
Aruba Irshad
Rizwan

Submitted to: Professor Benoit DEPAIRE & Haroon Tharwat

Date: May 13th, 2025


Table of Contents

SECTION 0

Introduction and Business Understanding


Assessing and reducing credit risk is essential for long-term profitability and organisational stability, particularly in the face of financial volatility and regulatory oversight. Banks struggle to identify potential defaulters early in the lending cycle and to evaluate their risk exposure. This study addresses such an issue raised by a Belgian bank: using an evidence-based approach, it develops a predictive model to estimate the likelihood of loan default among applicants who have already been approved. By identifying high-risk borrowers early on, the bank can update lending policies, enhance post-approval tracking, and manage portfolio risk more proactively.
The provided dataset contains 10,000 records of approved loan applications, each described by financial, demographic, and behavioural characteristics. The main goals of the project are to create a supervised machine learning model to forecast default outcomes, to conduct unsupervised clustering to identify borrower segments, and finally to provide actionable insights for credit strategy. A separate test dataset is also used to mimic real-world application. This paper uses the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework, since it provides a methodical and iterative approach for converting business challenges into data science solutions. CRISP-DM also suits the banking sector for another reason: its balance of technical rigour and business alignment. Most importantly, this dataset comprises previously approved applicants only. Because of this sampling bias, the model is not suitable for screening first-time loan applications; used outside the approved population, it could skew the forecasts.

Putting the Issue in Data Science Tasks


The following tasks can be derived from the business problem described above.
● Classification: predict whether an applicant will default, a binary outcome.
● Probability estimation: determine the likelihood of default in order to rank applicants by risk level, making priority-based intervention possible.
● Clustering: use unsupervised learning to divide applicants into distinct profiles for targeted intervention.

SECTION 1
Chapter 1: Predict Loan Default
Introduction:
The first chapter focuses on creating three machine learning models to forecast loan default for the bank: Support Vector Machines (SVM), Random Forest, and Logistic Regression. The chapter is divided into five main sections. The KNIME Analytics Platform is used to build these supervised machine learning models. Understanding the likelihood of loan default is essential for managing portfolio health and risk exposure in line with regulatory requirements and the state of the economy. The specifics of the strategy used to build the supervised learning models and forecast loan default are described below.
Data Interpretation:
The data comes in two files. The first is the training data, on which all three supervised models are trained; the second is the test data, to which the models are then applied. The historical loan data consists of 10,000 entries, with 1 denoting a default and 0 a non-default. The numerical features (the independent variables, or predictors) include Age, Income, Credit Score, Loan Term, Loan Amount, Interest Rate, DTI Ratio, Months Employed, and Number of Credit Lines, while Loan ID serves only as an identifier. The categorical features include education, marital status, mortgage status, loan purpose, co-signer, and dependents information. The primary objective is the target variable Default, which takes the values 0 (non-default) and 1 (default). Both files are supplied in CSV format for the KNIME Analytics Platform to read. The final objective is to use class probabilities to assign a default probability to each case. Table 0 (appendix) describes the variables.
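As an illustration only, this import step can be sketched in Python with pandas; the report itself uses KNIME's File Reader node, and the file names train.csv and test.csv below are assumptions rather than names given in the brief:

import pandas as pd

# Assumed file names; the report loads these CSV files through KNIME's File Reader node.
train = pd.read_csv("train.csv")   # 10,000 historical loans with the Default label (0/1)
test = pd.read_csv("test.csv")     # approved applicants to be scored later

print(train.shape)
print(train["Default"].value_counts())   # 0 = non-default, 1 = default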
Task Analysis Plan:
The task analysis plan is divided into the following main phases: data import, descriptive analysis, feature engineering and data transformation, bivariate analysis, model training, evaluation, testing, and class probability estimation.
Phase 0: Data Import:
Importing the historical loan data was the first step; the File Reader node was used to read it. This stage is shown in fig. 1 below.

(figure.1)

Phase 1: Descriptive Analysis:


Understanding the patterns, trends, central tendency, and variation in the data is the goal of this phase. As seen in fig. 2, it is divided into two sections: exploratory analysis and outlier handling.

(figure.2)

The primary objective is to summarize the data at a glance. The first section, exploratory data analysis (EDA), used the following nodes: Statistics, Histogram, Value Counter, and GroupBy.
EDA Analysis:

The main findings, displayed in table 1 (appendix), come from the Statistics node. With an average yearly income of about €83,000 and an average loan amount of €127,775, the typical applicant is nearly 44 years old. The income and loan distributions are wide, as indicated by standard deviations of €39,148 and €70,615 respectively, while their skewness and kurtosis are close to those of a normal distribution. The average credit score is 575, ranging from 300 to 849. The Debt-to-Income ratio averages around 0.5, meaning that debt obligations absorb almost half of an applicant's income. The loan default rate is around 11.8%. Additionally, the target variable's kurtosis (3.59) indicates heavier tails, suggesting that some applicants engage in extreme risk behaviours. These statistics characterize the dataset as a whole. As seen in fig. 3 (appendix), the Value Counter node is used to view the number of defaults. As illustrated in fig. 4, there are 1,182 defaults and 8,818 non-defaults, meaning that only 12% of applicants default and 88% do not. The main attributes of defaulters and non-defaulters are then compared using the GroupBy node; table 2 displays the findings. The average age of defaulted applicants is 36 years, younger than that of non-defaulters, suggesting less financial experience. Income is a powerful deterrent against default, as evidenced by the average income of €73,245 for defaulters versus €84,323 for non-defaulters. Furthermore, even though defaulters have lower credit scores (554.7 vs. 577.5), they took out larger loans (average €141,233 vs. €125,990). This implies that the following characteristics of defaulters are directly related to the target variable: low income, young age, low credit scores, larger loan amounts, higher financial burden, and risk tolerance. A slightly higher DTI ratio (0.517 vs. 0.51) suggests that their disposable income is squeezed more tightly, and fewer months employed indicate shorter work histories and less job stability.
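A minimal pandas sketch of this GroupBy comparison is shown below; it assumes the column names from table 0 and the train.csv file name used earlier, and it reproduces the idea of contrasting mean attributes for defaulters and non-defaulters rather than the exact KNIME output:

import pandas as pd

train = pd.read_csv("train.csv")   # assumed file name

# Class balance, mirroring the Value Counter node
print(train["Default"].value_counts())

# Mean profile of defaulters (1) vs non-defaulters (0), mirroring the GroupBy node
profile = train.groupby("Default")[
    ["Age", "Income", "CreditScore", "LoanAmount", "DTIRatio", "MonthsEmployed"]
].mean()
print(profile.round(2))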
Outlier Detection:
The outlier detection process began with a Numeric Outliers node, which reported zero outliers for every variable. Figure 5 presents the results of the Box Plot node. Variables such as income and loan amount exhibit high variability and multiple upper outliers, indicating a wide financial range among borrowers and the existence of high-risk lending cases. The distribution of months employed shows a positive skew, with many cases having comparatively brief work histories. With medians of 13.5% and 36 months respectively, the interest rate and loan term distributions appear more structured and reflect the terms of typical loan products. Overall, the plots confirm the integrity of the data while highlighting regions of potential modelling significance and risk concentration.
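For reference, the interquartile-range rule behind this check can be sketched in pandas as below; the 1.5 x IQR fences are an assumption about the Numeric Outliers node's default settings:

import pandas as pd

train = pd.read_csv("train.csv")   # assumed file name
numeric_cols = ["Age", "Income", "LoanAmount", "CreditScore",
                "MonthsEmployed", "InterestRate", "LoanTerm", "DTIRatio"]

for col in numeric_cols:
    q1, q3 = train[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((train[col] < lower) | (train[col] > upper)).sum()
    print(f"{col}: {n_out} values outside the 1.5*IQR fences")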
Phase 2: Feature Engineering and Data Transformation:
In order to prepare and preprocess the data for the models, phase 2 involved scaling, cleaning,
and encoding the data. Figure 6 illustrates this as follows:

(fig.6)

The first step imputes missing values, avoiding data loss and completing the feature matrix for model training. Numerical features were imputed with mean or median values, while categorical features were imputed with the most frequent value; this was done to avoid data loss and model bias. Figure 7 (appendix) shows this step. The categorical data was then converted to numerical form using the One to Many node, which creates a separate indicator column for each category (for example, one column each for high school, bachelor's, and master's) so that the model can assess each category separately while keeping the data machine-readable. A Column Filter was then used to remove features that should not enter the models: the ID column and one dummy column from each categorical feature (for example, bachelor's from education and married from marital status), avoiding redundancy among the dummy columns. Table 3 (appendix) illustrates this. The Normalizer was applied last; it scales all numerical features to the [0, 1] range, preventing biased weighting, ensuring that all features contribute comparably to model training, and improving the convergence rate of gradient-based models. Table 4 (appendix) illustrates this. Finally, to provide human-readable outputs for the target variable, the last step converts the "Default" variable from number to string.
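An illustrative pandas/scikit-learn equivalent of this preprocessing chain (imputation, one-to-many encoding with one dummy dropped per feature, removal of the ID column, and min-max normalisation) is sketched below; it is a stand-in for the KNIME nodes, not the workflow itself, and the file name is again an assumption:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.read_csv("train.csv")   # assumed file name

num_cols = ["Age", "Income", "LoanAmount", "CreditScore", "MonthsEmployed",
            "NumCreditLines", "InterestRate", "LoanTerm", "DTIRatio"]
cat_cols = ["Education", "EmploymentType", "MaritalStatus", "HasMortgage",
            "HasDependents", "LoanPurpose", "HasCoSigner"]

# Missing-value handling: median for numeric features, most frequent value for categorical ones
train[num_cols] = train[num_cols].fillna(train[num_cols].median())
train[cat_cols] = train[cat_cols].fillna(train[cat_cols].mode().iloc[0])

# One-to-many encoding with one dummy dropped per feature; the LoanID column is removed
X = pd.get_dummies(train.drop(columns=["LoanID", "Default"]),
                   columns=cat_cols, drop_first=True, dtype=int)

# Min-max normalisation to the [0, 1] range
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

y = train["Default"].astype(str)   # target converted to string for readable outputs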
Phase 3: Bivariate Analysis:
The third stage covered bivariate analysis with scatter plots and correlations. Figure 8 below
shows this:

(fig.8)

Correlation was the first step; the screenshot is included in the appendix as table 5. The correlation analysis shows that Age, Income, Credit Score, and Months Employed are negatively correlated with Default: younger applicants with lower incomes and credit scores are more likely to default. Notably, the DTI ratio (r = 0.022) and interest rate (r = 0.137) exhibit weak but statistically significant positive correlations with default, supporting the notion that greater financial strain raises risk. Among the categorical variables, being unemployed, single, or without a co-signer shows weak-to-moderate positive correlations with default, while full-time employment, marriage, and co-signing are linked to lower risk. Although most correlations are weak (r < 0.2), their cumulative effect on model performance is still valuable, as indicated by statistical significance (p < 0.05 in many cases). Scatter plots were the second step; the goal here is to find predictive features while reducing noise and multicollinearity. Features such as Months Employed (r ≈ -0.078), Loan Term (r ≈ 0.007), and Number of Credit Lines (r ≈ 0.02) show very weak correlations. The best candidates to keep were credit score, income, interest rate, DTI ratio, co-signer status, dependents, employment type, and marital status.
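The correlation step can be reproduced with pandas as in the sketch below, an assumed equivalent of KNIME's correlation node; the 0.02 cut-off used to shortlist features is illustrative, not a threshold stated in the report:

import pandas as pd

train = pd.read_csv("train.csv")   # assumed file name
encoded = pd.get_dummies(train.drop(columns=["LoanID"]), drop_first=True, dtype=int)

# Pearson correlation of every (encoded) feature with the Default flag
corr_with_default = encoded.corrwith(encoded["Default"]).drop("Default").sort_values()
print(corr_with_default)

# Shortlist features whose absolute correlation clears a small, assumed threshold
selected = corr_with_default[corr_with_default.abs() > 0.02].index.tolist()
print(selected)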
1.1. Phase 4: Model Training:
The fourth and most important phase is model training. We trained three models: Logistic Regression, Random Forest, and SVM. The workflow is shown below in fig. 9:

(fig.9)

The choice of Logistic Regression, Random Forest, and Support Vector Machine (SVM) reflects a deliberate balance of model capacity and the bias-variance trade-off, which is especially relevant to predicting loan default. Given the complex, partially non-linear relationships in our dataset between credit score, income, DTI ratio, and default, Logistic Regression serves as a transparent baseline that helps stakeholders identify direct relationships, even if it cannot capture the underlying non-linear patterns. Random Forest, known for its high capacity and low bias, can capture interactions among multiple borrower attributes while managing overfitting risk through ensemble learning. SVM, applied with an RBF kernel, offers the flexibility of modelling non-linear decision boundaries, which is useful for separating borderline applicants whose profiles are not linearly distinguishable. Together, these models allow a thorough evaluation of both simple and complex feature interactions, which is essential for building a model that generalizes to a wide range of real-world loan applicants.
In the first step, the data was split into an 80/20 train/test ratio using the X-Partitioner node, with 10-fold cross-validation (see the screenshot in the appendix). The learner nodes for all three models were then connected to the partitioning node, and the predictor nodes were used to form the models. Key characteristics of the three model set-ups are listed below, followed by an illustrative sketch of the training configuration:
1. Cross Validation:
This was done using the X-Partitioner node to minimize the risk of overfitting. Ten folds gave the best AUC.
2. SMOTE:
Because the dataset is imbalanced (defaulters make up only 12% of cases), SMOTE was used to balance it, so the models can distinguish defaulters without being biased towards the majority class. The best configurations for Logistic Regression and Random Forest are shown in fig. 11 (appendix). Oversampling of the minority class was chosen, and several static seed values (such as 1, 2, 3, 4, 5) were tried, with '987654' turning out to be the best; a fixed seed ensures the same split every time.
3. Parameter Optimization:
This was done to tune hyperparameters such as tree depth, node size, and regularization. Different ranges were tried and widened until the best values were found, which were then entered manually in the advanced settings to boost the predictive power of the models. A sample screenshot with the optimized parameters is attached in fig. 12. For SVM, a polynomial kernel was used, as the boundary between default and non-default is not linearly separable.
4. Gradient Boost:
Finally, gradient-boosted learners were used to boost the performance of the random forest. Unlike a single decision tree, gradient boosting builds many small trees one at a time, with each new tree learning from the mistakes of the previous ones. This gave high accuracy while limiting overfitting.
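A condensed Python sketch of this training set-up is given below, using scikit-learn and imbalanced-learn as stand-ins for the KNIME learner, SMOTE, and X-Partitioner nodes; the hyperparameter values are placeholders rather than the tuned values in fig. 12, and only the seed 987654 is taken from the text:

import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

train = pd.read_csv("train.csv")   # assumed file name
X = pd.get_dummies(train.drop(columns=["LoanID", "Default"]), drop_first=True, dtype=int)
X = X.fillna(X.median())           # stand-in for the Missing Value node
y = train["Default"]

models = {
    "logistic":      LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, max_depth=10, random_state=987654),
    "svm":           SVC(kernel="poly", probability=True, random_state=987654),
}

for name, model in models.items():
    # SMOTE sits inside the pipeline so oversampling happens only on each training fold
    pipe = Pipeline([("scale", MinMaxScaler()),
                     ("smote", SMOTE(random_state=987654)),
                     ("model", model)])
    auc = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")   # 10-fold cross-validation
    print(f"{name}: mean AUC = {auc.mean():.3f}")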

1.2. Phase 5: Evaluation:


The final and most important stage is evaluation, shown in fig. 13 below:

(fig.13)

Regarding the choice of evaluation metrics, the Scorer node was used to report recall, precision, sensitivity, specificity, F-measure, and overall accuracy. This set of metrics was chosen to give a sufficiently complete picture of performance given the class imbalance in the dataset. In credit risk modelling, recall is essential because it measures how effectively the model finds real defaulters, the primary concern for banks seeking to minimize financial losses. Precision confirms that we do not mistakenly categorize too many trustworthy clients as problematic, which would harm the business. Specificity balances this by showing how effectively non-defaulters are identified. The F-measure combines precision and recall into one score for evaluating the trade-off, and although overall accuracy is less informative under class imbalance, it still gives a general idea of the share of correct predictions. This multi-metric approach provides a clear understanding of both business risk and model dependability. Last but not least are the AUC and the ROC curve. The ROC curve plots the true positive rate (sensitivity) against the false positive rate, visualizing the balance between accurately detecting defaulters and minimizing false positives. AUC condenses this curve into a single value: the closer it is to 1.0, the better the model is at prioritizing high-risk applicants over low-risk ones. This is particularly valuable in our loan default setting, since banks frequently care more about ranking customers by risk than about binary classification.
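For illustration, the Scorer and ROC Curve outputs correspond roughly to the scikit-learn metrics below; the helper function and its inputs (hold-out labels, hard predictions, and P(Default) scores) are assumptions used to show the calculation, not part of the KNIME workflow:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def credit_risk_report(y_true, y_pred, y_prob):
    """Print the metric set discussed above (assumed equivalents of the KNIME Scorer / ROC nodes)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("recall (sensitivity):", recall_score(y_true, y_pred))    # share of real defaulters caught
    print("precision:", precision_score(y_true, y_pred))            # share of flagged clients who truly default
    print("specificity:", tn / (tn + fp))                           # share of non-defaulters correctly cleared
    print("F-measure:", f1_score(y_true, y_pred))
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("AUC:", roc_auc_score(y_true, y_prob))                    # ranking quality of P(Default)

# Hypothetical usage with a fitted model and hold-out set:
# credit_risk_report(y_test, model.predict(X_test), model.predict_proba(X_test)[:, 1])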
Results:
The results of each model are shown in the appendix screenshots (fig. 14 for Logistic Regression, fig. 15 for Random Forest, fig. 16 for SVM) and summarized below:
Logistic Regression showed notable accuracy (0.882) and outstanding specificity (0.992), indicating that it effectively recognized the majority of non-defaulters. However, its recall for defaulters was very low (0.059), meaning the model missed almost all actual default cases. This points to underfitting, a sign of high bias and low capacity, where the model oversimplifies relationships and cannot capture non-linear interactions. Although it achieved the highest AUC (0.811), demonstrating strong overall ranking capability, its ability to actually identify defaulters was the weakest, and it is therefore considered unsuitable for this use.
Random Forest produced more balanced outcomes. Its AUC (0.790) was somewhat lower, but this was counterbalanced by a considerably higher recall (0.627), giving it a markedly better capacity to identify defaulters. While its precision (0.247) was modest, its F1-score (0.355) shows the best balance between catching actual defaulters and limiting false positives, and its specificity (0.745) indicates adequate handling of dependable customers. As already discussed, Random Forest is a high-capacity, low-bias model with variance controlled through ensembling, and it is well suited to the complex, non-linear feature interactions found in credit data. The somewhat lower AUC is acceptable because the model performs well in the critical area of recognizing genuine defaulters.
Support Vector Machine (SVM) attained perfect specificity (1.000) and considerable accuracy (0.882), but failed to identify any defaulters (recall = 0.000). This indicates that the model collapsed onto the dominant class and fails to generalize to minority-class patterns such as likely defaulters, probably because the parameters were not tuned accurately. Even with a competitive AUC of 0.799, its classification performance makes it impractical for predicting credit risk in this set-up.
All of these results are summarized in table 6 (appendix).
1.3. Testing phase:
The testing phase applies the developed models to unseen data; a high testing error would indicate that the models are overfitting. The flow for the testing phase is given below in fig. 17:

(fig.17)

First, a reader node was used to load the test data on loan defaults. A metanode for data preprocessing was created and connected to the test data. Predictor nodes were then connected to the respective learners of the trained models. In the predictor nodes, class probabilities were appended; Column Filter nodes kept only P(Default) from each model, the columns were renamed to p_SVM, p_Log, and p_RF, and they were joined using the Joiner node. Here we introduced an ensemble method to combine the probabilities from all three models into one centralized decision-making strategy. By aggregating the forecasts, this helps to reduce both bias and variance in the result. Rather than depending on a single algorithm, the ensemble improves generalization by combining different decision-making logics. In our case, merging the class probabilities P(Default) helped to stabilize the forecasts and reduce the likelihood of overfitting or underfitting linked to the separate learners. A Math Formula node was then added using the formula: if(($Prob_Default_SVM$ + $Prob_Default_LogReg$ + $Prob_Default_RF$) / 3 > 0.5, 1, 0), and finally we obtained each applicant ID with its respective probability of default.
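The Math Formula step above averages the three class probabilities and applies a 0.5 cut-off; a pandas sketch of the same ensemble rule is given below, where the toy applicant IDs and probabilities are made up for illustration and the column names follow the renaming step (p_Log, p_RF, p_SVM):

import pandas as pd

# Hypothetical joined predictor output: one row per applicant with each model's P(Default)
scored = pd.DataFrame({
    "LoanID": ["A1", "A2", "A3"],
    "p_Log":  [0.10, 0.55, 0.42],
    "p_RF":   [0.20, 0.60, 0.45],
    "p_SVM":  [0.05, 0.52, 0.38],
})

# Same rule as the Math Formula node: average the three P(Default) values, flag default if > 0.5
scored["p_ensemble"] = scored[["p_Log", "p_RF", "p_SVM"]].mean(axis=1)
scored["Default_pred"] = (scored["p_ensemble"] > 0.5).astype(int)
print(scored)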
1.4. Class Probability Estimation (Default Test Results)
The final stage of Task 1 presents the default results. This was done using class probability estimation, which is useful for ranking when the class distribution is skewed. The workflow is given below in fig. 18; a Math Formula node combines the probabilities of each model and aggregates them into a single value.

(fig.18)

With a 50% probability threshold for default, the Value Counter showed 89 defaults and 4,911 non-defaults; with a 40% threshold, 511 defaults and 4,489 non-defaults. This is shown in fig. 19 (appendix). To rank the defaulters, the Sorter node was used to order applicants from the highest default probability to the lowest, followed by Row Sampling and Column Filter nodes to extract the top 10 applicants. This is shown in fig. 20 (appendix), with the highest predicted chance of default being 0.66 (66%). A Statistics node was also used, and its results are given in table 7 (appendix). The ensemble result shows a low average default probability (0.018) alongside high skewness (7.296) and kurtosis (51.25), suggesting a cautious prediction tendency. This reflects the influence of Logistic Regression and SVM, which favoured non-default outcomes, while Random Forest showed higher variance and a higher mean (0.38), improving recall. These findings validate our previous assessment: the ensemble reconciles prudent forecasts with enhanced detection of defaulters. Finally, a Rule Engine was used to categorize applicants into high risk (50% and greater), medium risk (40% to 49%), and low risk (less than 40%): 89 applicants were high risk, 422 medium, and 4,489 low, as attached in fig. 21 (appendix). Additionally, the GroupBy node was used to find the key characteristics of defaulters by loan purpose, marital status, employment, and education. These results are given in fig. 22 (appendix) along with the summary table (table 8, appendix). In conclusion, lower income, younger age, and specific statuses such as self-employed or divorced are associated with higher predicted default probabilities, consistent with the GroupBy comparison in table 2 (appendix).
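The sorting, top-10 selection, and Rule Engine banding described above can be sketched as follows, continuing from the scored table of the previous sketch; the 40% and 50% thresholds are the ones stated in the text, while the helper function name is our own:

# Rank applicants by ensemble default probability, highest risk first (Sorter node equivalent)
ranked = scored.sort_values("p_ensemble", ascending=False)
top10 = ranked.head(10)[["LoanID", "p_ensemble"]]   # Row Sampling + Column Filter equivalent
print(top10)

# Rule Engine equivalent: band applicants into risk categories
def risk_band(p):
    if p >= 0.50:
        return "high"      # 50% and greater
    if p >= 0.40:
        return "medium"    # 40% to 49%
    return "low"           # below 40%

scored["risk_band"] = scored["p_ensemble"].apply(risk_band)
print(scored["risk_band"].value_counts())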
Chapter 2: Clustering using Unsupervised Learning:
This chapter provides a basic understanding of loan applicants by dividing them into three groups, called clusters, based on their personal and financial characteristics. The purpose of this analysis is to understand which group is more likely to default on its loans and how the bank can design strategies to manage risk, treat customers accordingly, and adjust its strategies in line with customer data. The task started with loading the training data using a reader node, followed by data preprocessing: a Column Filter removed the ID, a Missing Value node replaced missing numerical values with the mean and missing categorical values with the most frequent category, a One to Many node encoded the categorical variables, and a Normalizer scaled the features to the [0, 1] range. K-Means clustering was then applied, and the results were displayed using PCA, scatter plots, bar charts, and the silhouette coefficient. The entire workflow for the clustering analysis is given below in fig. 23:

(fig.23)

2.1. Clustering Process:


We used K-Means clustering to divide the applicants into three groups with similar characteristics. The clustering algorithm groups applicants so that members of the same cluster are more similar to each other than to members of other clusters. K-Means was chosen because, as an unsupervised method, it uncovers natural groupings without the need for labelled data, it groups similar applicants together in a way that helps the bank tailor loan strategies, and it is computationally efficient and adapts well to large datasets like ours.
To better visualize and understand the clustering, we also reduced the dimensions of the data using PCA (Principal Component Analysis), which helped us examine the applicant groups more meaningfully. K-Means was configured with 3 clusters and connected to PCA for visualization, with a Colour Manager and scatter plot added, as shown in fig. 24 (appendix). Cluster 2 (green) covers an extensive area on PCA dimension 0, indicating increased variability in features and possibly a mixed or medium-risk category. Cluster 1 (purple) is tight and separate in the lower right quadrant, suggesting a possibly high-risk category with uniform characteristics. Cluster 0 (red) is closely packed on the extreme left, probably indicating a low-risk category with consistent financial characteristics. Bar charts were used to find the mean values for each cluster; these are attached in fig. 25 (appendix).
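An illustrative scikit-learn equivalent of this clustering workflow (K-Means with k = 3, followed by a two-component PCA projection and per-cluster means) is sketched below; the preprocessing mirrors the steps listed above, while the file name, seed, and the decision to drop the Default column are assumptions:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

train = pd.read_csv("train.csv")   # assumed file name
X = pd.get_dummies(train.drop(columns=["LoanID", "Default"]), dtype=int)   # Default dropped (assumption)
X = X.fillna(X.median())                                                   # missing-value replacement
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)       # Normalizer equivalent

# K-Means with three clusters; the fixed seed is ours, chosen only for reproducibility
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)

# PCA to two dimensions, used purely to plot and inspect the clusters
coords = PCA(n_components=2).fit_transform(X)

# Per-cluster means, mirroring the bar-chart comparison of numerical features
summary = train.assign(cluster=clusters).groupby("cluster")[
    ["Income", "CreditScore", "LoanAmount", "MonthsEmployed"]].mean()
print(summary.round(2))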

Cluster Insights using Crosstab (Categorical):


Figure 27 in the appendix illustrates this. The outcomes of this process are as follows:
Cluster 0:

● Primarily employed full-time, possessing a moderate level of education (Bachelor's, Master's), and having a variety of loan-related reasons.
● Predominantly applicants without co-signers or mortgages, suggesting either first-time financial independence or those looking for entry-level credit.
● Contains fewer high-risk indicators than Cluster 1, but displays a fairly uniform distribution of marital status.
Cluster 1:

● A high percentage of jobless people and part-time employees, as well as a lower percentage of people with advanced degrees.
● Less likely to have co-signers or mortgages, and more likely to be unmarried or divorced.
● This group suggests less financial stability, which may call for stricter risk control procedures.
Cluster 2:

● The majority of married applicants have steady employment and higher loan commitments (many have mortgages).
● Overall, education levels are strong, especially for high school and bachelor's degrees.
● Most likely to have co-signers and dependents, which suggests more intricate arrangements for financial or familial responsibilities.
Cluster Insights using numerical features:
We used pie chart nodes to gain insight into clusters, and fig. 28 (appendix) displays the
corresponding numerical features. For each cluster, we obtained the following results:
Cluster 0:

● Has a high average income (€82,794.33) and the highest credit score (577.68), both of which point to responsible borrowing capacity.
● Applicants usually have 58.78 months of work experience, indicating long-term commitment, and the average loan amount is €128,746.33.
● This group appears to be low-risk, making them perfect for favourable credit terms.

Cluster 1:

● Shows signs of financial stress, with the lowest credit score (571.10) and a slightly lower average income (€82,603.22).
● Has a slightly shorter employment duration (58.46 months) and an average loan of €128,726.14, both of which could be signs of instability.
● This group may require additional verification or more stringent lending conditions.

Cluster 2:

● Has the lowest loan amount (€128,051.09) and the highest average income (€83,358.20), demonstrating responsible borrowing habits.
● The credit score is average (574.71) and the length of employment is consistent (58.56 months).
● This group is steady and moderately risky, making them well suited for tailored loan offers.

2.5 Overall Cluster Characteristics Table:


The following table summarizes the key characteristics of our clusters:

(table 8)

Appendix:

Variable Name | Description | Type | Role
LoanID | Unique identifier for each loan application (ignored in modelling) | Categorical (ID) | Identifier
Age | Age of the applicant in years | Numerical | Predictor
Income | Annual income of the applicant | Numerical | Predictor
LoanAmount | Amount of loan requested by the applicant | Numerical | Predictor
CreditScore | Credit score indicating the creditworthiness of the applicant | Numerical | Predictor
MonthsEmployed | Number of months the applicant has been employed | Numerical | Predictor
NumCreditLines | Number of currently active/open credit lines | Numerical | Predictor
InterestRate | Interest rate applied to the loan | Numerical | Predictor
LoanTerm | Duration of the loan granted (in months) | Numerical | Predictor
DTIRatio | Debt-to-Income ratio: total monthly debt payments divided by monthly income | Numerical | Predictor
Education | Highest level of education attained by the applicant (High School, Bachelor's, Master's, PhD) | Categorical | Predictor
EmploymentType | Employment status (Self-employed, Full-time, Part-time, Unemployed) | Categorical | Predictor
MaritalStatus | Marital status of the applicant (Single, Married, Divorced) | Categorical | Predictor
HasMortgage | Yes if the applicant has a mortgage, otherwise No | Binary | Predictor
HasDependents | Yes if the applicant has dependents, otherwise No | Binary | Predictor
LoanPurpose | Stated purpose of the loan (Auto, Business, Education, Home Improvement, Other) | Categorical | Predictor
HasCoSigner | Whether the loan has a co-signer, i.e. someone willing to repay if the primary borrower fails (Yes/No) | Binary | Predictor
Default | Whether the applicant defaulted: 1 if defaulted, otherwise 0 | Binary | Target

(table 0)

(table 1)

(fig.3)

(fig.4)

(table 2)

(fig.5)

(fig.7)

(table 3)

(table 4)

(table 5)

(fig.10)

(fig.11)

(fig.12)

(fig.14)

(fig.15)

(fig.16)

(table 6)

(fig.19)

(fig.20)

(table 7)

(fig.21)

(fig.22)

(table 8)

(fig.24)

(fig.25)

(fig.26)

(fig.27)

(fig.28)

Task 3

How could your model be used in ongoing credit risk processes?


In Task 1, Random Forest, SVM, and Logistic Regression models were developed to predict the probability of loan default among approved applicants. After preparing the data, training the models, and evaluating them using recall and AUC, an ensemble model was created for more accurate predictions. This ensemble outputs a default probability for each borrower, which is then categorized into high, medium, or low risk using rule-based classification. The model can be integrated into the bank's ongoing credit risk processes through post-approval monitoring. Because the dataset only included approved applicants, the model cannot be used for the initial screening phase due to sampling bias, but it is extremely useful after loan disbursement. Applied periodically, it allows the bank to examine its portfolio's health, identify borrowers showing increased default risk, and take early intervention measures such as revised repayment plans or more frequent follow-ups.
In Task 2, unsupervised clustering with K-Means grouped applicants into three distinct clusters based on their financial and demographic characteristics, providing deeper insight into borrower profiles. This helps the bank tailor strategies for each group.

Cluster 0
Profile: Senior candidates with low default rates, steady incomes, and clean credit histories.
Strategy
● Provide rewards for loyalty, such as increased credit lines or reduced interest rates.
● Encourage the cross-selling of high-end banking products, such as insurance and savings plans.
Cluster 1
Profile: Younger, lower-income applicants who are more likely to default and have lower credit scores.
Strategy
● Put in place more stringent post-disbursement oversight.
● Provide budgeting resources, behavioural nudges, and financial literacy initiatives.
● To increase payment consistency, consider modified repayment plans.
Cluster 2
Candidates in Cluster 2 have balanced risk profiles, moderate credit performance, and are
middle-aged or middle-income. They do not belong to the low-risk, high-trust category, but they
are also not high-risk.
Strategy
● Use up-to-date financial data to periodically reevaluate their risk score.
● Offer optional add-ons (such as payment protection plans and auto-debit discounts).
● Keep an eye out for behavioural changes that might lead them to fall into higher-risk
groups; early detection can help avoid unpleasant surprises.
By incorporating the predicted default probabilities from Task 1 into the cluster assignments from Task 2, the bank can transition from generic credit policies to a more customised, risk-adjusted approach. For example, borrowers with medium-to-high default scores may be flagged for closer monitoring even within the low-risk Cluster 0, and medium-risk borrowers in the moderate Cluster 2 can receive proactive assistance before their default risk increases.
Through the layered application of descriptive segmentation (Task 2) and predictive modelling (Task 1), the bank is able to implement focused risk interventions, align product offerings with customer type, and boost the resilience of its portfolio.
Question 2
We used a wide range of financial and demographic characteristics in Task 1, including income,
credit score, loan term, loan amount, education, and work history. We can add more variables
that represent changing borrower behaviour and the economic environment to further improve
the model's predictability. These could include recent large transactions, account balance
fluctuations, or missed payments, among other behavioural characteristics. In order to generate
summary indicators, these features can be extracted from transactional data and combined with
KNIME's Joiner and GroupBy nodes.
By adding macroeconomic variables like the inflation rate, trends in unemployment, or shifts in
interest rates, we can further improve the model. To help the model produce reliable
predictions in the face of shifting economic conditions, these can be imported into KNIME from
outside sources and combined using CSV Reader, Joiner, or Database Connector nodes.
Additionally, the model can adjust to temporal trends by incorporating time-sensitive features
like "change in job status," "time since last credit inquiry," or "repayment behaviour over time."
KNIME's Date and Time Manipulation nodes can be used to build these.
Within KNIME, feature selection methods like Recursive Feature Elimination (RFE) and
Correlation Filter can be used to reduce noise and enhance generalisability. By preserving
transparency and minimising the chance of overfitting, these techniques assist in keeping only
the most significant features.

Furthermore, KNIME's built-in visualisation tools, such as Box Plot, Bar Chart, and Colour Manager, can help non-technical stakeholders or regulators understand and accept the model by communicating which features influence its predictions.
Question 3
Model performance was first assessed using 10-fold cross-validation with KNIME's X-
Partitioner node, as shown in Task 1. Robust validation was thus guaranteed. A similar strategy
should be applied on a regular basis to monitor the health of the model during deployment.
a. Retraining Schedule: Using the most recent data, we advise retraining the model every three to six months. This comprises:
● Executing the preprocessing procedures (such as encoding, normalisation, and imputation)
● Reapplying SMOTE if the class imbalance persists
● Readjusting the hyperparameters for best results
b. Tracking Performance:
Utilise KNIME's Scorer and ROC Curve nodes to monitor metrics like AUC, recall, and precision.
Any notable drop in these scores could be a sign of concept drift, which is a shift in borrower
behaviour that lessens the efficacy of the model.
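As a hedged illustration of this monitoring idea, the sketch below compares the AUC of the deployed model on a freshly labelled batch of loans against a recorded baseline and flags possible concept drift; both the baseline value and the five-point tolerance are assumptions, not figures from the report:

from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.79        # reference AUC recorded at deployment (assumed value)
DRIFT_TOLERANCE = 0.05     # assumed rule of thumb: investigate if AUC drops by more than 5 points

def check_for_drift(y_true, y_prob):
    """Compare the current AUC on newly labelled loans with the deployment baseline."""
    current_auc = roc_auc_score(y_true, y_prob)
    if BASELINE_AUC - current_auc > DRIFT_TOLERANCE:
        print(f"AUC fell from {BASELINE_AUC:.2f} to {current_auc:.2f}: possible concept drift, consider retraining.")
    else:
        print(f"AUC {current_auc:.2f} is within tolerance of the baseline.")

# Hypothetical usage on a batch of loans scored by the ensemble:
# check_for_drift(new_loans["Default"], new_loans["p_ensemble"])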
c. Model Version Control:
Use Model Writer/Reader nodes in KNIME to save and reload model versions. Maintain a changelog of:
● Feature additions or removals
● Algorithm changes or tuning adjustments
● Data quality or scope changes
d. Business Explanation & Communication:
Model outputs, especially default probabilities, should be categorized (e.g., Low: < 40%,
Medium: 40–50%, High: > 50%) using Rule Engine. These risk groups can be communicated
clearly in management reports.
Additionally, cluster trends (from Task 2) can be monitored over time. If the share of applicants
in Cluster 1 rises sharply, this could signal emerging risk in the customer base.
e. Visualization:
Set up KNIME dashboards with Table View, Bar Chart, and Line Plot nodes to visualize:
● Default risk distribution
● Cluster trends
● Model performance over time
This supports transparent decision-making across technical and non-technical teams.
Question 4: Consider equity, adherence to regulations, and sampling bias (CRISP-DM stages: Business Understanding & Evaluation)

Sampling bias is the most significant restriction found in Task 1. The model does not represent
the whole applicant population because the dataset only contains loan applicants who have
been approved. Because of this selection bias issue, the model cannot be applied to initial loan
approval decisions.
Its use during the screening process could result in:
● Discrimination against candidates whose profiles differ from those that have already been accepted.
● Exclusion of historically under-represented applicants who might be creditworthy.
As a result, the model should only be applied for risk stratification, monitoring, and focused interventions following initial approval.
a. Fairness audits: Clustering in Task 2 identified applicant subgroups with varying financial
practices. We advise performing routine bias checks with KNIME's GroupBy and Statistics nodes
to evaluate:
● Are some groups disproportionately identified as high-risk (for example, based on marital status or type of employment)?
● Does each cluster have a significantly different rate of false positives or false negatives?
Rebalancing techniques like re-weighting, threshold adjustment, or feature debiasing might be
necessary if biases are found.
b. Regulatory Compliance (e.g., EU banking regulations, GDPR): Regulators demand that models be:
● Transparent: we need to describe the decision-making process.
● Documented: we need to demonstrate how the data was used, preprocessed, and modelled.
● Auditable: all forecasts and model iterations need to be traceable.
c. Explainability for Compliance: The bank can demonstrate the reasoning behind a decision by using SHAP values or summary statistics, which is a crucial prerequisite for adhering to the GDPR's "right to explanation".
d. Ethical Governance: Create a framework for model governance that consists of:
● Regular evaluations of fairness
● Stakeholder meetings regarding model modifications
● Risk committee approvals prior to redeployment
This guarantees that the model meets the bank's financial objectives without sacrificing
compliance or fairness.

