Report
Group Members
Sidra Nadeem
Hussain Ali
Aruba Irshad
Rizwan
Chapter 1: Predict Loan Default
Introduction:
The first chapter focuses on building three machine learning models to forecast loan defaults for the bank: logistic regression, Random Forest, and Support Vector Machines (SVM). The chapter is divided into five main sections. The KNIME Analytics Platform is used to create these supervised machine learning models. Understanding the likelihood of loan defaults is essential for managing portfolio health and risk exposure under regulatory requirements and prevailing economic conditions. The strategy used to create the supervised learning models and forecast loan default is detailed below.
Data Interpretation:
The data is supplied in two files. The first, the training data, is used to train the three supervised models; the second is the test data to which the models are then applied. The historical loan data comprises 10,000 entries, with 1 denoting a default and 0 a non-default. The numerical features (the independent variables, or predictors) include Loan ID, Age, Income, Credit Score, Loan Term, Loan Amount, Interest Rate, DTI Ratio, Months Employed, and Num Credit Lines. The categorical features include Education, Marital Status, Mortgage status, Loan Purpose, Co-signer, and Dependents. Our primary objective is the target variable, Default, which takes values 0 (non-default) and 1 (default). Both files are supplied in CSV format for the KNIME Analytics Platform to read. The final objective is to assign a default probability to each case using class probabilities. Table 0 (appendix) details the features.
Task Analysis Plan:
The task analysis plan is divided into the following main phases: data import and exploratory analysis, feature engineering, bivariate analysis, model training, testing, and class probability estimation.
Phase 0: Data Import:
Importing the training data was the first step; the File Reader node was used to read it. This stage is shown in fig. 1 below.
(figure.1)
(figure.2)
The primary objective is to view the data at a single glance. The first section, exploratory data analysis (EDA), covered the following nodes: Statistics, Histogram, Value Counter, and GroupBy.
EDA Analysis:
The main findings, displayed in table 1 (appendix), come from the Statistics node. The typical applicant is nearly 44 years old, with an average yearly income of about €83,000 and an average loan amount of €127,775. The income and loan distributions are wide, as indicated by standard deviations of €39,148 and €70,615 respectively; their skewness and kurtosis values are close to those of a normal distribution. The average credit score is 575, ranging from 300 to 849. The Debt-to-Income (DTI) ratio averages around 0.5, meaning debt obligations consume almost half of an applicant's income. The loan default rate is around 11.8%. Additionally, the target variable's kurtosis (3.59) indicates heavier tails, suggesting that some applicants engage in extreme risk behaviours. These statistics characterize the entire dataset. As seen in fig. 3 (appendix), the Value Counter node is used to view the number of defaults. As illustrated in fig. 4, there are 1,182 defaults and 8,818 non-defaults, meaning that only about 12% of applicants default and 88% do not. The key attributes of defaulters and non-defaulters are separated using another node, GroupBy; table 2 displays the findings. The defaulted candidates average about 36 years of age, younger than the non-defaulters, indicating less maturity and experience. Income is a powerful deterrent against default, as evidenced by the average income of €73,245 for defaulters versus €84,323 for non-defaulters. Furthermore, even though defaulters had lower credit scores (554.7 vs. 577.5), they took out larger loans (average €141,233 vs. €125,990). This implies that the defaulters' characteristics directly related to the target variable are: low income, young age, low credit scores, larger loan amounts, higher financial burden, and risk tolerance. A slightly higher DTI ratio (0.517 vs. 0.51) suggests that their disposable incomes are squeezed more tightly, and lower months employed indicates shorter work histories and less job stability.
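The same exploratory summary can be sketched outside KNIME. The following Python snippet mirrors the Statistics, Value Counter, and GroupBy nodes on a tiny hypothetical frame; the column names and values are illustrative stand-ins, not the actual dataset:

```python
import pandas as pd

# Hypothetical sample standing in for the loan CSV read by the File Reader node;
# column names ("Age", "Income", "Default") are assumptions based on the report.
df = pd.DataFrame({
    "Age":     [25, 44, 61, 38, 52, 29],
    "Income":  [40000, 83000, 120000, 76000, 95000, 51000],
    "Default": [1, 0, 0, 1, 0, 0],
})

# Equivalent of the Statistics node: per-column summary statistics.
summary = df.describe()

# Equivalent of the Value Counter node: default vs non-default counts and rate.
counts = df["Default"].value_counts()
default_rate = df["Default"].mean()

# Equivalent of the GroupBy node: mean attributes split by default status.
profile = df.groupby("Default")[["Age", "Income"]].mean()
print(default_rate)  # fraction of defaulters in the sample
```

On the real data, the same three steps yield the default rate and the defaulter/non-defaulter profiles reported above.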
Outlier Detection:
Outlier detection began with a Numeric Outliers node, which reported zero outliers for every variable. Figure 5 presents the results of another node, the Box Plot. Variables such as income and loan amount exhibit high variability and multiple upper outliers, indicating a large financial range among borrowers and the existence of high-risk lending cases. The distribution of Months Employed shows a positive skew, with many cases having comparatively brief work histories. With medians of 13.5% and 36 months respectively, the interest rate and loan term distributions appear more structured and reflect the terms of typical loan products. Overall, the plots confirm the integrity of the data while highlighting regions of possible modelling significance and risk concentration.
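The rule behind the Numeric Outliers node and the box-plot whiskers is the interquartile-range (IQR) fence; a minimal sketch, using made-up income-like values, is:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the same rule the
    KNIME Numeric Outliers node and box plots use by default."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Toy income-like data: one extreme value stands out as an upper outlier.
incomes = [40, 55, 60, 62, 70, 75, 80, 300]
print(iqr_outliers(incomes))  # -> [300]
```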
Phase 2: Feature Engineering and Data Transformation:
In order to prepare and preprocess the data for the models, phase 2 involved scaling, cleaning,
and encoding the data. Figure 6 illustrates this as follows:
(fig.6)
The first step imputes missing values to avoid data loss and complete the feature matrix for model training. Numerical features were imputed using median and mean values, while categorical features were filled with the most frequent value; this prevents data loss and model bias. Figure 7 (appendix) shows this. The categorical data was then converted to numerical data using a One to Many node, which creates a separate 0/1 indicator column for each category (e.g., High School, Bachelor's, Master's), so the model can assess each level separately while the categorical data stays machine-readable. Next, a Column Filter was used to eliminate features that were not relevant, such as the ID column, and one indicator from each categorical feature (e.g., Bachelor's from education, Married from marital status) to avoid redundant, collinear columns and the bias they introduce. Table 3 (appendix) illustrates this. Normalisation was completed last; it scales all numeric features to the [0, 1] range, preventing biased weighting, ensuring that all features contribute comparably to model training, and improving the convergence rate of gradient-based models. Table 4 (appendix) illustrates this. Finally, a Number to String node converts the "default" variable so that the target outputs are human-readable.
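The preprocessing chain (imputation, one-to-many encoding, column filtering, normalisation, number-to-string) can be sketched in Python as follows; the mini-frame and its column names are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical mini-frame with a missing income and a categorical column.
df = pd.DataFrame({
    "Income":    [40000.0, None, 120000.0, 76000.0],
    "Education": ["High School", "Bachelor's", None, "Master's"],
    "Default":   [1, 0, 0, 1],
})

# 1) Missing values: median for numeric, most frequent for categorical.
df["Income"] = df["Income"].fillna(df["Income"].median())
df["Education"] = df["Education"].fillna(df["Education"].mode()[0])

# 2) One-to-many (one-hot) encoding of the categorical feature,
#    dropping one level per feature to avoid redundant, collinear columns.
df = pd.get_dummies(df, columns=["Education"], drop_first=True)

# 3) Min-max normalisation of numeric predictors to the [0, 1] range.
df["Income"] = (df["Income"] - df["Income"].min()) / (df["Income"].max() - df["Income"].min())

# 4) Number-to-string conversion of the target for readable class labels.
df["Default"] = df["Default"].map({0: "non-default", 1: "default"})
```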
Phase 3: Bivariate Analysis:
The third stage covered bivariate analysis with scatter plots and correlations. Figure 8 below
shows this:
(fig.8)
Correlation was the first step; the screenshot is included in the appendix as table 5. The correlation analysis shows that Age, Income, Credit Score, and Months Employed are negatively correlated with Default: younger applicants with lower incomes and credit scores are more likely to default. Notably, the DTI ratio (r = 0.022) and interest rate (r = 0.137) exhibit weak but statistically significant positive correlations with default, supporting the notion that greater financial strain raises risk. Among the categorical variables, being unemployed, single, or without a co-signer shows weak-to-moderate positive correlations with default, while full-time employment, marriage, and co-signing are linked to lower risk. Although the majority of correlations are weak (r < 0.2), their cumulative effect on model performance is still valuable, as indicated by statistical significance (p < 0.05 in many cases). Scatter plots were covered in the second step; the goal is to find predictive features while reducing noise and multicollinearity. Features such as Months Employed (r ≈ -0.078), Loan Term (r ≈ 0.007), and Num Credit Lines (r ≈ 0.02) have very weak correlations. The best candidates to keep were credit score, income, interest rate, DTI ratio, co-signer status, dependents, employment type, and marital status.
1.1. Phase 4: Model Training:
The fourth and the most important phase is the model training. We trained three models:
Logistic, Random Forest and SVM. The workflow is shown below in fig.9:
(fig.9)
The algorithms we used, Logistic Regression, Random Forest, and Support Vector Machine (SVM), represent a strategic balance of model capacity and the bias-variance trade-off that is especially relevant to predicting loan default. Given the complex, partially non-linear relationships in our dataset between credit score, income, DTI ratio, and default, Logistic Regression serves as a transparent baseline model that helps stakeholders identify direct relationships, even if it cannot capture the underlying non-linear patterns. Random Forest, by contrast, is known for its high capacity and low bias; it can capture interactions among multiple borrower attributes while managing overfitting risk through ensembling. SVM, applied with an RBF kernel, offers flexibility by modelling non-linear decision boundaries and is beneficial for separating borderline applicants, especially those whose profiles are not linearly distinguishable. Together, these models allow a thorough evaluation of both simple and complex feature interactions, which is essential for building a model that generalizes to real-world loan applicants.
In the first step, we partitioned the data into an 80/20 train/test split using the X-Partitioner node, with 10-fold cross-validation. This is shown in fig.10 (appendix). After this, the learner nodes for all three models were connected to the partitioning node, and predictor nodes were used to form the models. The following are some key characteristics of our three models:
1. Cross Validation:
This was done using the X-Partitioner node to minimize the risk of overfitting. The number of folds that maximized our AUC was 10.
2. SMOTE:
Because our dataset is imbalanced (defaulters are a 12% minority), we used SMOTE to balance it, so that our models can distinguish defaulters without bias toward the majority class. The best configurations for Logistic Regression and Random Forest are shown in fig.11 (appendix). Oversampling of the minority class was chosen, and several static seeds (1, 2, 3, 4, 5, and others) were tried; the best turned out to be '987654', which also guarantees the same split every time.
3. Parameter Optimization:
This was done to tune hyperparameters such as tree depth, node size, and regularization. Different ranges were tried and widened until the best values were found, which were then entered manually in the advanced settings. This boosts the predictive power of our models; a sample screenshot with the optimized parameters is attached in fig.12. For SVM, a polynomial kernel was used, as the boundary between default and non-default is not linearly separable.
4. Gradient Boost:
Finally, gradient-boosted learners were used to boost the performance of the random forest. Unlike a single decision tree, gradient boosting builds many small trees one at a time, where each new tree learns from the mistakes of the previous ones. This gave us high accuracy while limiting overfitting.
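To make the SMOTE step concrete, here is a minimal, simplified sketch of the oversampling idea: interpolate between a minority point and one of its nearest minority neighbours. It is a stand-in for the KNIME SMOTE node, not its exact implementation, and the static seed mirrors our reproducibility choice:

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=987654):
    """Minimal SMOTE-style oversampling sketch: for each new sample, pick a
    minority point, pick one of its k nearest minority neighbours, and
    interpolate at a random position between them."""
    rng = np.random.default_rng(seed)  # static seed -> reproducible samples
    X_min = np.asarray(X_min, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # k nearest neighbours, excluding self
        j = rng.choice(nbrs)
        gap = rng.random()
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))  # interpolate
    return np.array(new)

# Minority (defaulter) points in a 2-feature space; generate 4 synthetic ones.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
synthetic = smote_like(minority, n_new=4)
print(synthetic.shape)  # (4, 2)
```

Each synthetic point lies on a segment between two real minority points, so the oversampled class stays inside the region the defaulters already occupy.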
(fig.13)
First, regarding the choice of evaluation metrics: we used the Scorer node to evaluate the overall results based on recall, precision, sensitivity, specificity, F-measure, and overall accuracy. We chose this set of metrics to get a sufficient understanding of performance, because our dataset has class imbalance. In credit-risk modelling, recall is essential, as it measures how effectively the model finds real defaulters, the primary concern for banks seeking to minimize financial losses. Precision confirms that we do not mistakenly categorize too many trustworthy clients as risky, which would harm the business. Specificity balances this by showing how effectively non-defaulters are identified. The F-measure merges precision and recall into a single score for trade-off evaluation, and although overall accuracy is less informative under class imbalance, it still gives an overall idea of correct predictions. This multi-metric approach captures both business risk and model dependability. Last but not least are the AUC and ROC curve. The ROC curve plots the true positive rate (sensitivity) against the false positive rate, helping visualize the balance between accurately detecting defaulters and minimizing false positives. AUC condenses this curve into a single value; the nearer it is to 1.0, the better the model is at ranking high-risk applicants above low-risk ones. This is particularly useful in our loan-default setting, since banks often care more about ranking customers by risk than about a binary classification.
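The Scorer-node metrics follow directly from the confusion-matrix counts; the sketch below uses illustrative counts (not our actual confusion matrix) to show each formula:

```python
# Metrics from a confusion matrix, as reported by the KNIME Scorer node.
# Illustrative counts only:
TP, FN = 60, 40    # defaulters caught / missed
TN, FP = 850, 50   # non-defaulters cleared / wrongly flagged

recall      = TP / (TP + FN)              # sensitivity: share of defaulters found
precision   = TP / (TP + FP)              # share of flagged cases that truly default
specificity = TN / (TN + FP)              # share of non-defaulters correctly cleared
f_measure   = 2 * precision * recall / (precision + recall)
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(recall, precision, specificity, round(f_measure, 3), accuracy)
```

Note that with only 10% defaulters, a model that flags no one still reaches 90% accuracy, which is exactly why recall and the F-measure matter more here.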
Results:
The results of each model are given in screenshots (fig.14 for Logistic Regression, fig.15 for Random Forest, fig.16 for SVM) (appendix) and summarized below:
Logistic Regression showed notable accuracy (0.882) and outstanding specificity (0.992), indicating that it effectively recognized the majority of non-defaulters. However, its recall for defaulters was very low (0.059), showing that the model missed almost all actual default instances. This indicates underfitting, a sign of high bias and low capacity, where the model oversimplifies relationships and cannot capture non-linear interactions. Although it achieved the highest AUC (0.811), showing strong overall ranking capability, its effectiveness at actually finding defaulters was the weakest, so it is considered inappropriate for use.
Random Forest gave more balanced outcomes. It showed a somewhat reduced AUC (0.790), but this was counterbalanced by considerably greater recall (0.627), guaranteeing a markedly improved capacity to identify defaulters. While its precision (0.247) was modest, its F1-score (0.355) demonstrates the best balance between catching actual defaulters and limiting false positives. Its specificity (0.745) showed adequate handling of dependable customers. Random Forest, as discussed, is a high-capacity, low-bias model with variance controlled via ensembling, and it is adept at managing the difficult, non-linear feature interactions found in credit data. The slightly reduced AUC is acceptable, since the model performs well in the critical area of recognizing genuine defaulters.
Support Vector Machine (SVM) attained flawless specificity (1.000) and considerable accuracy (0.882), but it failed to identify any defaulters (recall = 0.000). This indicates a collapse onto the dominant class and a failure to generalize to minority-class patterns, i.e., the applicants who are actually going to default, probably caused by inadequate parameter tuning. Even with a competitive AUC of 0.799, its classification performance renders it impractical for predicting credit risk in this setup.
All these results are summarized in table 6 (appendix).
1.3. Testing phase:
The testing phase tests our developed models; a high testing error would mean our models are overfitting. The flow for the testing phase is given below in fig.17:
(fig.17)
First, a Reader node was used to load the test data on loan defaults. A meta node containing the data preprocessing from the training phase was connected to the test data. Predictor nodes were connected to the respective learners from the trained models. Probabilities were appended in the predictor nodes, and Column Filters kept only P(Default) from each model; these columns were renamed to 'p_SVM', 'p_Log', and 'p_RF' and then joined using a Joiner node. Here, we introduced an ensemble method to combine the probabilities from all three models into one centralized decision-making strategy, aggregating their forecasts to minimize both bias and variance in the result. Rather than depending on a single algorithm, the ensemble improves generalization by combining different decision-making logics. In our case, averaging the class probabilities (P(Default)) stabilized the forecasts and reduced the likelihood of overfitting or underfitting linked to the individual learners. A Math Formula node was then added using the formula: if(($Prob_Default_SVM$ + $Prob_Default_LogReg$ + $Prob_Default_RF$) / 3 > 0.5, 1, 0), and finally we had each applicant ID with its respective probability of default.
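The Math Formula node's expression amounts to soft voting; a small Python equivalent, with made-up probabilities, is:

```python
# Soft-voting ensemble: average the three models' P(Default) per applicant
# and apply the 0.5 cutoff, mirroring the Math Formula node expression.
def ensemble_default(p_svm, p_log, p_rf, threshold=0.5):
    p_avg = (p_svm + p_log + p_rf) / 3
    return (1 if p_avg > threshold else 0), p_avg

# Example applicant: the models disagree, so the average decides.
label, p = ensemble_default(0.70, 0.40, 0.55)
print(label, round(p, 3))  # -> 1 0.55
```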
1.4. Class Probability Estimation (Default Test Results)
The final stage of task 1 produced the default results. This was done using class probability estimation, which is useful for ranking when the class distribution is skewed. The workflow is given below in fig. 18; a Math Formula node combines the probabilities of the three models and aggregates them into a single one.
(fig.18)
Initially, a probability above 50% was associated with default; using the Value Counter we found 89 defaults and 4,911 non-defaults. With a 40% threshold, we found 511 defaults and 4,489 non-defaults. This is shown in fig. 19 (appendix). Next, we wanted to rank the defaulters, so we used the Sorter node to rank them from highest to lowest probability, then Row Sampling and a Column Filter to find our top 10 applicants. This is shown in fig.20 (appendix), with 0.66 (66%) being the highest predicted probability of default. A Statistics node was also used, whose results are given in table 7 (appendix). The ensemble result shows a low average default probability (0.018) alongside high skewness (7.296) and kurtosis (51.25), suggesting a cautious prediction tendency. This demonstrates the influence of Logistic Regression and SVM, which preferred non-default outcomes; Random Forest showed increased variance and a higher mean (0.38), reflecting its improved recall. These findings validate our previous assessment: the ensemble reconciles prudent forecasts with enhanced detection of defaulters. Finally, a Rule Engine was used to categorize defaulters into high (50% and greater), medium (40% to 49%), and low (less than 40%) risk; we found 89 high risk, 422 medium, and 4,489 low, as attached in fig.21 (appendix). Additionally, we used the GroupBy node to find the key characteristics of defaulters by loan purpose, marital status, employment, and education. These results are given in fig.22 (appendix), along with the summary table (table 8) (appendix). In conclusion, we found that lower income, younger age, and specific statuses such as self-employed or divorced are associated with higher predicted default probabilities, as can be verified in table 2 (appendix) via the GroupBy results described above.
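The Rule Engine banding can be expressed as a small function; the thresholds are the ones stated above, while the sample probabilities are illustrative:

```python
def risk_band(p_default):
    """Rule Engine logic used to band applicants:
    high >= 0.50, medium 0.40-0.49, low < 0.40."""
    if p_default >= 0.50:
        return "high"
    if p_default >= 0.40:
        return "medium"
    return "low"

probs = [0.66, 0.45, 0.12, 0.50, 0.39]
bands = [risk_band(p) for p in probs]
print(bands)  # -> ['high', 'medium', 'low', 'high', 'low']
```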
Chapter 2: Clustering using Unsupervised Learning:
This chapter provides a basic understanding of loan applicants by dividing them into three different groups, called clusters, based on their personal and financial characteristics. The purpose of this analysis is to understand which group is more likely to default on its loans and how a bank can design strategies to manage risk, treat customers accordingly, and adjust its strategies based on customer data. The task started with loading the training data using a Reader node, followed by data preprocessing: a Column Filter to remove the ID, Missing Value to replace missing entries with mean and most frequent values, One to Many for categorical conversions, and finally a Normalizer to scale the data to the [0, 1] range. K-means clustering was used, and the results were displayed using PCA, scatter plots, bar charts, and the silhouette coefficient. The entire workflow for the clustering analysis is given below in fig.23:
(fig.23)
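As a sketch of the clustering step, the following Python snippet runs k-means with k = 3 on toy normalised features and checks the silhouette coefficient, mirroring the KNIME k-Means workflow; the data is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy normalised applicant features (income, loan amount in [0, 1]);
# the real workflow feeds the preprocessed training data instead.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.2, 0.2], 0.05, (20, 2)),   # low-income / small-loan group
    rng.normal([0.5, 0.8], 0.05, (20, 2)),   # mid-income / large-loan group
    rng.normal([0.9, 0.4], 0.05, (20, 2)),   # high-income / mid-loan group
])

# k = 3 clusters, as in the KNIME k-Means node configuration.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Silhouette coefficient: closer to 1 means well-separated clusters.
score = silhouette_score(X, km.labels_)
print(round(score, 2))
```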
Cluster Insights using categorical features:
Cluster 0:
● Contains fewer high-risk indicators than Cluster 1, but displays a fairly uniform distribution of marital status.
Cluster 1:
● A high percentage of jobless people and part-time employees, as well as a lower
percentage of people with advanced degrees.
● Less likely to have co-signers or mortgages, and more likely to be unmarried or divorced.
● This group suggests less financial stability, which may call for stricter risk control
procedures.
Cluster 2:
● The majority of married applicants have steady employment and higher loan
commitments (many have mortgages).
● Overall, education levels are strong, especially for high school and bachelor's degrees.
● Most likely have co-signers and dependents, which suggests more intricate
arrangements for financial or familial responsibilities.
Cluster Insights using numerical features:
We used pie chart nodes to gain insight into clusters, and fig. 28 (appendix) displays the
corresponding numerical features. For each cluster, we obtained the following results:
Cluster 0:
● Has a high average income (€82,794.33) and the highest credit score (577.68), both of which point to strong repayment capacity.
● This group appears to be low-risk, making them perfect for favourable credit terms.
Cluster 1:
● Shows signs of financial stress, with the lowest credit score (571.10) and a slightly lower average income (€82,603.22).
● Has a slightly shorter employment duration (58.46 months) and an average loan of
€128,726.14, both of which could be signs of instability.
● The group may require additional verification or more stringent lending conditions.
Cluster 2:
● Has the lowest loan amount (€128,051.09) and the highest average income
(€83,358.20), demonstrating responsible borrowing habits.
● The credit score is average (574.71) and the length of employment is constant (58.56
months).
● This demographic is steady and moderately risky, making them perfect for tailored loan
offers.
(table 8)
Appendix:
NumCreditLines — Number of currently active/open credit lines — Numerical — Predictor
DTIRatio — Debt payments divided by monthly income — Numerical — Predictor
(table 0)
(table 1)
(fig.3)
(fig.4)
(table 2)
(fig.5)
(fig.7)
(table 3)
(table 4)
(table 5)
(fig.10)
(fig.11)
(fig.12)
(fig.14)
(fig.15)
(fig.16)
(table 6)
(fig.19)
(fig.20)
(table 7)
(fig.21)
(fig.22)
(table 8)
(fig.24)
(fig.25)
(fig.26)
(fig.27)
(fig.28)
Task 3
Cluster 0
Profile: Senior candidates with low default rates, steady incomes, and clean credit histories.
Strategy
● Provide rewards for loyalty, such as increased credit lines or reduced interest rates.
● Encourage cross-selling of high-end banking products, such as insurance and savings plans.
Cluster 1
Profile: Younger, poorer people who are more likely to default and have lower credit scores.
Strategy
● Put in place more stringent post-disbursement oversight
● Provide budgeting resources, behavioural nudges, and financial literacy initiatives.
● To increase payment consistency, take into account modified repayment plans.
Cluster 2
Candidates in Cluster 2 have balanced risk profiles, moderate credit performance, and are
middle-aged or middle-income. They do not belong to the low-risk, high-trust category, but they
are also not high-risk.
Strategy
● Use up-to-date financial data to periodically reevaluate their risk score.
● Offer optional add-ons (such as payment protection plans and auto-debit discounts).
● Keep an eye out for behavioural changes that might lead them to fall into higher-risk
groups; early detection can help avoid unpleasant surprises.
The bank can transition from generic credit policies to a more customised, risk-adjusted approach by incorporating the predicted default probabilities from Task 1 into the cluster assignments from Task 2. For example, borrowers with medium-to-high default scores may be flagged for closer monitoring, even within the low-risk Cluster 0. Medium-risk borrowers in the moderate Cluster 2 can receive proactive assistance before their default risk increases.
Through this layered application of descriptive segmentation (Task 2) and predictive modelling (Task 1), the bank can implement focused risk interventions, tailor product offerings to the type of customer, and boost the resilience of its portfolio.
Question 2
We used a wide range of financial and demographic characteristics in Task 1, including income,
credit score, loan term, loan amount, education, and work history. We can add more variables
that represent changing borrower behaviour and the economic environment to further improve
the model's predictability. These could include recent large transactions, account balance
fluctuations, or missed payments, among other behavioural characteristics. In order to generate
summary indicators, these features can be extracted from transactional data and combined with
KNIME's Joiner and GroupBy nodes.
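As an illustration, the GroupBy-then-Joiner pattern for deriving such behavioural indicators could look as follows in Python; the transaction table, column names, and aggregates are hypothetical:

```python
import pandas as pd

# Hypothetical transaction log; in KNIME the same summary would come from
# GroupBy (aggregation) followed by Joiner (merge back onto applicants).
tx = pd.DataFrame({
    "LoanID": [1, 1, 2, 2, 2],
    "Amount": [120.0, 900.0, 40.0, 35.0, 60.0],
    "Missed": [0, 1, 0, 0, 1],
})
applicants = pd.DataFrame({"LoanID": [1, 2], "Income": [52000, 83000]})

# Summarise behaviour per applicant: total missed payments, largest transaction.
behaviour = tx.groupby("LoanID").agg(
    missed_payments=("Missed", "sum"),
    max_transaction=("Amount", "max"),
).reset_index()

# Join the behavioural indicators onto the applicant table.
enriched = applicants.merge(behaviour, on="LoanID", how="left")
print(enriched)
```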
By adding macroeconomic variables like the inflation rate, trends in unemployment, or shifts in
interest rates, we can further improve the model. To help the model produce reliable
predictions in the face of shifting economic conditions, these can be imported into KNIME from
outside sources and combined using CSV Reader, Joiner, or Database Connector nodes.
Additionally, the model can adjust to temporal trends by incorporating time-sensitive features
like "change in job status," "time since last credit inquiry," or "repayment behaviour over time."
KNIME's Date and Time Manipulation nodes can be used to build these.
Within KNIME, feature selection methods like Recursive Feature Elimination (RFE) and
Correlation Filter can be used to reduce noise and enhance generalisability. By preserving
transparency and minimising the chance of overfitting, these techniques assist in keeping only
the most significant features.
Furthermore, by communicating which features affect predictions, KNIME's built-in
visualisation tools—such as Box Plot, Bar Chart, and Colour Manager—can help non-technical
stakeholders or regulators understand and accept the model.
Question 3
Model performance was first assessed using 10-fold cross-validation with KNIME's X-
Partitioner node, as shown in Task 1. Robust validation was thus guaranteed. A similar strategy
should be applied on a regular basis to monitor the health of the model during deployment.
a. Retraining Schedule: Using the most recent data, we advise retraining the model every three to six months. This comprises:
● Re-executing the preprocessing procedures (such as encoding, normalisation, and imputation)
● Reapplying SMOTE if the class imbalance persists
● Re-tuning the hyperparameters for best results
b. Tracking Performance:
Utilise KNIME's Scorer and ROC Curve nodes to monitor metrics like AUC, recall, and precision.
Any notable drop in these scores could be a sign of concept drift, which is a shift in borrower
behaviour that lessens the efficacy of the model.
c. Model Version Control:
Use Model Writer/Reader nodes in KNIME to save and reload model versions. Maintain a changelog of:
● Feature additions or removals
● Algorithm changes or tuning adjustments
● Data quality or scope changes
d. Business Explanation & Communication:
Model outputs, especially default probabilities, should be categorized (e.g., Low: < 40%,
Medium: 40–50%, High: > 50%) using Rule Engine. These risk groups can be communicated
clearly in management reports.
Additionally, cluster trends (from Task 2) can be monitored over time. If the share of applicants
in Cluster 1 rises sharply, this could signal emerging risk in the customer base.
e. Visualization:
Set up KNIME dashboards with Table View, Bar Chart, and Line Plot nodes to visualize:
● Default risk distribution
● Cluster trends
● Model performance over time
This supports transparent decision-making across technical and non-technical teams.
Question 4: Consider equity, adherence to regulations, and bias in sampling (CRISP-DM Stage: Business Understanding & Evaluation)
Sampling bias is the most significant limitation found in Task 1. The model does not represent the whole applicant population, because the dataset contains only loan applicants who were approved. Because of this selection bias, the model cannot be applied to initial loan-approval decisions. Its use during the screening process could result in:
● Discrimination against candidates whose profiles differ from those already accepted.
● Exclusion of historically under-represented applicants who might be creditworthy.
As a result, the model should only be applied for risk stratification, monitoring, and focused interventions following initial approval.
a. Fairness audits: Clustering in Task 2 identified applicant subgroups with varying financial
practices. We advise performing routine bias checks with KNIME's GroupBy and Statistics nodes
to evaluate:
● Are some groups disproportionately identified as high-risk (for example, based on marital status or type of employment)?
● Does any cluster have a significantly different rate of false positives or false negatives?
Rebalancing techniques like re-weighting, threshold adjustment, or feature debiasing might be
necessary if biases are found.
b. Regulatory Compliance (e.g., EU Banking Regulations, GDPR): Regulators demand that models be:
● Transparent: we must be able to describe the decision-making process.
● Documented: we must demonstrate the data's use, preprocessing, and modelling.
● Auditable: all forecasts and model iterations must be traceable.
c. Explainability for Compliance: The bank can demonstrate the reasoning behind a decision by
using SHAP, or summary statistics, which is a crucial prerequisite for adhering to the GDPR's
"right to explanation."
d. Ethical Governance: Create a model governance framework that consists of:
● Regular evaluations of fairness
● Meetings with stakeholders regarding model modifications
● Risk committee approvals prior to redeployment
This guarantees that the model meets the bank's financial objectives without sacrificing
compliance or fairness.