INTRODUCTION TO DATA SCIENCE
“Credit Card Default Prediction”
Project Report By
Group - 32
Harshit Chaudhary (21UCS089)
Namit Kapoor (21UCS135)
Devansh Bhatt (21UCS058)
Raj Kansagra (21UCC080)
Course Instructors
Dr. Subrat Dash
Dr. Lal Upendra Pratap Singh
Dr. Aloke Dutta
Department of Computer Science Engineering
The LNM Institute of Information Technology
TABLE OF CONTENTS
S.No Topic
1. Problem Statement
2. Description of Dataset
3. Data Analysis
4. Data Visualisation
5. Hypothesis Testing
6. Data Pre-Processing
7. ML Classification Model
8. Model Evaluation
9. Conclusion
10. References and GitHub link
1. Problem Statement
In recent years, credit card issuers in Taiwan have faced a cash and credit card debt crisis, with delinquency expected to peak in the third quarter of 2006 (Chou, 2006). In order to increase market share, card-issuing banks in Taiwan over-issued cash and credit cards to unqualified applicants. At the same time, most cardholders, irrespective of their repayment ability, overused credit cards for consumption and accumulated heavy credit and cash-card debt. The crisis dealt a blow to confidence in consumer finance and poses a major challenge for both banks and cardholders. This project aims to predict which credit card customers in Taiwan will default.
1.1 Objective
The primary objective of this data science project is to develop a predictive model that
can accurately forecast the likelihood of credit card default among customers in
Taiwan. By leveraging historical data on customer behavior, financial transactions, and
credit profiles, the aim is to create a robust predictive tool that assists card-issuing
banks in identifying high-risk customers. This predictive model will contribute to risk
mitigation efforts, enabling timely intervention and targeted strategies to reduce default
rates.
1.2 Tools and Frameworks Used
❖ Libraries used in EDA & Machine Learning:
1. Pandas
2. NumPy
3. Matplotlib
4. Seaborn
5. Scikit-learn (sklearn)
6. SciPy
❖ Graphs used for representation:
1. Bar plot
2. Box plot
3. Grouped bar plot
4. Heatmap
❖ ML Models used for training & testing:
1. Logistic Regression
2. KNN
3. Random Forest
4. Support Vector Classifier
2. Description of the Dataset
• The dataset provides insights into credit card holders during a financial crisis,
aiming to predict customer defaults. Exploratory Data Analysis (EDA) uncovers
key trends across demographics. Females show slightly higher default rates,
while higher credit limits correlate with reduced defaults. Education, marital
status, and age exhibit distinct connections with default behaviors. These findings
inform predictive models, offering vital insights into factors influencing credit
defaults among diverse customer groups.
• This dataset contains information on default payments, demographic factors,
credit data, history of payment, and bill statements of credit card clients in Taiwan
from April 2005 to September 2005.
There are 25 features in the dataset:
• ID: ID of each client
• LIMIT_BAL: Amount of given credit in NT dollars (includes individual and
family/supplementary credit)
• SEX: Gender (1=male, 2=female)
• EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others,
5=unknown, 6=unknown)
• MARRIAGE: Marital status (1=married, 2=single, 3=others)
• AGE: Age in years
• PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, …, 8=payment delay for eight months, 9=payment delay for nine months and above)
• PAY_2: Repayment status in August, 2005 (scale same as above)
• PAY_3: Repayment status in July, 2005 (scale same as above)
• PAY_4: Repayment status in June, 2005 (scale same as above)
• PAY_5: Repayment status in May, 2005 (scale same as above)
• PAY_6: Repayment status in April, 2005 (scale same as above)
• BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
• BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
• BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
• BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
• BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
• BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
• PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
• PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
• PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
• PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
• PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
• PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
• Default: Default payment (1=yes, 0=no)
Source of our dataset:
https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients
(All the above information about the dataset is taken from this link as well)
3. Data Analysis
• First, we import all the libraries required by our code.
• Next, we read the dataset into a variable using pandas.read_csv( ) and print its first 5 rows.
• Using dataset.info(), we obtain summary information about the dataset: the number of attributes, the data type of each attribute, and whether any column contains null values.
• We use DataFrame.isnull().sum() to check whether the dataset contains any null values.
• We also use dataset.describe( ), which shows basic statistical details of the dataset such as minimum and maximum values, standard deviation, mean, etc. (a sketch of these inspection steps is shown below).
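A minimal sketch of the loading and inspection steps described above; the CSV file name used here is an assumption and should be adjusted to the actual dataset file.

import pandas as pd

# Hypothetical file name; adjust to the actual dataset file.
dataset = pd.read_csv("UCI_Credit_Card.csv")

print(dataset.head())          # first 5 rows
dataset.info()                 # attributes, dtypes, non-null counts
print(dataset.isnull().sum())  # null values per column
print(dataset.describe())      # min, max, mean, std, quartiles, etc.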
• Transforming two numerical columns into categorical ones reduces the number of distinct values and enables us to visualize the information using bar graphs (see the binning sketch below).
Created a new Limit_cat column by binning LIMIT_BAL into:
o 10,000 to 140,000.
o 140,000 to 240,000.
o Above 240,000.
Created a new Age_cat column by binning AGE into:
o 21 to 35 years.
o 35 to 45 years.
o Above 45 years of age.
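A minimal sketch of this binning step, assuming pd.cut is used with the ranges listed above; the labels and exact bin edges are illustrative assumptions.

import numpy as np
import pandas as pd

# Bin LIMIT_BAL and AGE into the categories listed above.
dataset["Limit_cat"] = pd.cut(
    dataset["LIMIT_BAL"],
    bins=[10000, 140000, 240000, np.inf],
    labels=["10k-140k", "140k-240k", "above 240k"],
    include_lowest=True,
)
dataset["Age_cat"] = pd.cut(
    dataset["AGE"],
    bins=[21, 35, 45, np.inf],
    labels=["21-35", "35-45", "above 45"],
    include_lowest=True,
)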
• Checking outliers with seaborn boxplot:
We ignore the statistical outliers, as these values are well within the accepted ranges for age and balance limit.
Inference:
• In our dataset, we can observe the following:
1. The dataset contains 30000 rows & 25 columns.
2. All 25 columns have numerical values.
3. Columns SEX, EDUCATION, and MARRIAGE contain categorical variables encoded as integers:
• SEX: Gender (1 = male; 2 = female).
• EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 0, 4, 5, 6 = others).
• MARRIAGE: Marital status (1 = married; 2 = single; 3 = divorced; 0 = others).
4. The dataset has no null or duplicate values.
5. The default column is the dependent variable, while the rest are independent variables.
4. Data Visualization and Plots
We chose bar graphs because they summarise a large set of data in a simple visual form, display the frequency of each category, show the trend of the data more clearly than a table, and help in estimating key values at a glance. Since we are comparing default behaviour across groups such as gender, education, marital status, and age, bar charts make the data easy to visualize.
4.1 Barplot of the percentage of defaulters vs non defaulters.
22.12% of the customers default on their payment in the following month.
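A minimal sketch of how such a bar plot can be produced with pandas and Matplotlib; the labels and styling are assumptions.

import matplotlib.pyplot as plt

# Percentage of defaulters vs non-defaulters (index 0 = no default, 1 = default)
default_pct = dataset["default"].value_counts(normalize=True).sort_index() * 100
default_pct.plot.bar()
plt.xticks([0, 1], ["Non-defaulter", "Defaulter"], rotation=0)
plt.ylabel("Percentage of customers")
plt.show()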
4.2 Bar plot of the number of Male vs Female customers:
There are more female customers than male customers.
4.3 Bar graph showing the number of customers as per education
using dataset.EDUCATION.plot.hist()
The largest number of customers are university-educated, while the fewest are from the 'others' category.
4.4 Gender-wise distribution of number of customers by
education.
The maximum number of female customers are university graduates.
4.5 Bar graph showing default percentage distribution as per education.
The highest percentage of defaulters is among university-educated customers, while the lowest is among the other categories.
4.6 Correlation Heatmap:
Heatmaps are used to show relationships between two variables, one plotted on each axis. By observing how cell colours change across each axis, you can see whether there are any patterns in the values of one or both variables. Since we want to find the relationships between the different variables in the data frame, a heatmap is a convenient way to visualize them (a sketch follows the findings below).
➢ PAY_0 to PAY_6 are highly correlated.
➢ BILL_AMT1 to BILL_AMT6 are highly correlated.
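A minimal sketch of the correlation heatmap, assuming seaborn is used for plotting; the figure size and colour map are illustrative choices.

import matplotlib.pyplot as plt
import seaborn as sns

numeric_data = dataset.select_dtypes(include="number")  # only numeric columns
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_data.corr(), cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()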
4.7 Inference from the Exploratory Data Analysis and Visualisation:
1) Insights about Demographic Distribution:-
• 22.12% of the customers are going to default.
• Female customers are more than male customers.
• The number of female defaulters is more than male defaulters.
• There is not much difference in default percentage based on gender.
2) Insights about Education Qualification:-
• Across both genders, the largest number of customers are university-educated, while the fewest are from the 'others' category.
• The absolute numbers and percentage-wise defaults are maximum among
university graduates.
• A higher default percentage is observed among male high-school graduates.
3) Insights about Marital status:-
• More customers are single than married.
• The absolute value and percentage of defaults are more among single
customers.
• The maximum default percentage is observed in customers who are
married and high school graduates.
4) Insights about Age Distribution
• Most customers are between 25 and 35 years old, while very few are above 60 years of age.
• The maximum number of defaulting customers are between the age of 25
to 30.
• The highest default percentages are observed in the 20-25 and 60-80 year age groups.
5. Hypothesis Testing
• After thoroughly analysing the dataset through visualisations (bar plots and the correlation heatmap), we perform hypothesis testing.
• Hypothesis testing is a statistical method used to make inferences about population parameters based on a sample of data. It involves formulating hypotheses (a null hypothesis H0 and an alternate hypothesis H1), collecting and analysing data, and drawing conclusions about the population from the sample.
• Many types of tests can be used to assess different aspects of data; here we employ two commonly used ones: the two-tailed t-test and the Chi-square test of independence.
5.1 Two-tailed t-test
A two-tailed t-test is a statistical test used in hypothesis testing to determine if there is
a significant difference between the means of two independent groups. The "two-
tailed" part refers to the fact that the test considers the possibility of a difference in
both directions, either a positive or a negative difference.
(a) Two-tailed t-test for Age and Default Payment Status next month.
H0(null hypothesis): There is no significant difference in the age with respect to
default payment next month.
H1(alternate hypothesis): There is a significant difference in the age with respect to
default payment next month.
The objective of this test is to assess whether there is a significant difference in the
ages of individuals who defaulted on their credit card payments compared to those
who did not.
• Two subsets of the dataset are created based on the 'default' column: one for
individuals who defaulted (‘default’) and one for those who did not
(‘not_default’)
• The ‘ttest_ind’ function from the ‘scipy.stats’ module is used to perform a two-
sample independent t-test on the 'AGE' variable for the two groups (default
and not_default).
• The p-value from the t-test is extracted and stored in the variable ‘p_val’.
• The code then checks whether the p-value is less than 0.05, a common significance level (corresponding to a 95% confidence level). If the p-value is less than 0.05, the null hypothesis is rejected; otherwise, we fail to reject it (see the sketch below).
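A minimal sketch of the two-tailed t-test on AGE described above; the variable names follow the report's description.

from scipy.stats import ttest_ind

default = dataset[dataset["default"] == 1]      # customers who defaulted
not_default = dataset[dataset["default"] == 0]  # customers who did not default

t_stat, p_val = ttest_ind(default["AGE"], not_default["AGE"])

if p_val < 0.05:
    print("Reject the null hypothesis: ages differ between the two groups.")
else:
    print("Fail to reject the null hypothesis.")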
Inference: Here the null hypothesis is rejected. It implies that there is a significant
difference in age with respect to default payments next month. In practical terms, this
could suggest that age plays a statistically significant role in predicting credit card
default. This could have implications for credit risk assessment or targeted financial
planning based on age demographics. The specific direction of the difference (whether
defaulters are generally older or younger) is not inferred from this test, only that a
difference exists.
(b) Two-tailed t-test for the Limit Balance and Default Payment Status next month.
H0(null hypothesis): There is no significant difference in the limit balance with
respect to default payment next month.
H1(alternate hypothesis): There is a significant difference in the limit balance with
respect to default payment next month.
The objective of this test is to assess whether there is a significant difference in the
limit balance of individuals who defaulted on their credit card payments compared to
those who did not.
Similar steps are followed as in the previous two-tailed test. But here the two-sample
t-test is performed on the 'LIMIT_BAL' variable for the two groups (default and
not_default).
Inference: Here the null hypothesis is rejected. It implies that there is a significant difference in limit balance with respect to default payments next month. It indicates that the limit balance is statistically significant in differentiating individuals who will default from those who won't. Depending on the direction of the significant difference, it may provide insights into the financial dynamics influencing default behavior; for example, higher or lower limit balances could be associated with increased or decreased default risk.
5.2 Chi-Square Test of Independence
The Chi-square test of independence is applied when dealing with categorical data
and aims to assess whether there is a significant association between two categorical
variables. It examines whether the observed distribution of frequencies differs from
what would be expected under the assumption of independence between the
variables. This test is commonly used in contingency table analysis.
(a) Chi-Square test for Education and Default Payment Status next
month.
H0(null hypothesis): There is no significant dependency between the default
payment next month and education.
H1(alternate hypothesis): There is a significant dependency between the default
payment next month and education.
The objective of this test is to assess whether there is a significant dependency between an individual's education level and whether they defaulted on their credit card payment.
• A contingency table (‘chi_table’) is created using the ‘pd.crosstab’ function,
showing the counts of observations for each combination of education level and
default status.
• The ‘chi2_contingency’ function from ‘scipy.stats’ module performs the chi-
square test of independence on the contingency table. This test assesses
whether the observed distribution of counts is significantly different from what
would be expected under the assumption of independence between education
level and default status.
• The variables ‘chi2_stat’, ‘p_val’, ‘dof’, and ‘expected’ represent key outputs
from the chi-square test:
o chi2_stat (Chi-square Statistic): The test statistic quantifies how much
the observed counts in the contingency table deviate from the expected
counts under the assumption of independence.
o p_val (p-value): The p-value is the probability of observing a test
statistic as extreme as, or more extreme than, the one calculated from
the sample data, assuming that the null hypothesis (no association
between variables) is true.
o dof (Degrees of Freedom): The degrees of freedom are determined by
the number of categories in the variables being tested. For a Chi-square
test of independence in a contingency table, the degrees of freedom are
calculated using the formula (number of rows - 1) * (number of columns
- 1).
o Expected (Expected Counts): This variable contains the expected
counts for each cell in the contingency table under the assumption of
independence. Expected values are calculated based on the marginal
totals of the table.
• Then the code checks whether the p-value is less than 0.05, a common significance level (corresponding to a 95% confidence level). If the p-value is less than 0.05, the null hypothesis is rejected; otherwise, we fail to reject it (see the sketch below).
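A minimal sketch of the Chi-square test of independence described above, using pd.crosstab and chi2_contingency as named in the report.

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of education level vs default status
chi_table = pd.crosstab(dataset["EDUCATION"], dataset["default"])
chi2_stat, p_val, dof, expected = chi2_contingency(chi_table)

if p_val < 0.05:
    print("Reject the null hypothesis: default depends on education level.")
else:
    print("Fail to reject the null hypothesis.")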
Inference: Here the null hypothesis is rejected. It implies that there is a significant
dependency between the default payment next month and the education level of the
individual. The rejection of the null hypothesis suggests that there is a statistically
significant relationship or association between an individual's education level and their
likelihood of defaulting on a credit card payment next month. The results may have
implications for decision-making in credit risk assessment or financial planning. For
instance, lenders might implement specific risk management practices for individuals
with certain education levels, such as adjusting credit limits, interest rates, or payment
terms based on the observed risk patterns.
(b) Chi-Square test for Marital Status and Default Payment Status
next month.
H0(null hypothesis): There is no significant dependency between the default
payment next month and marital status.
H1(alternate hypothesis): There is a significant dependency between the default
payment next month and marital status.
The objective of this test is to assess whether there is a significant dependency between an individual's marital status and whether they defaulted on their credit card payment. Similar steps are followed as in the previous Chi-Square test, but here the contingency table is created from the combination of default status and the MARRIAGE column.
Inference: Here the null hypothesis is rejected. It implies that there is a significant
dependency between the default payment next month and the marital status of the
individual. The rejection of the null hypothesis suggests that there is a statistically
significant relationship or association between an individual's marital status and their
likelihood of defaulting on a credit card payment next month. The results may have
implications for decision-making in credit risk assessment or financial planning. Marital
status may influence financial behaviors, including default, which can impact the
design and marketing of financial products. Financial institutions might therefore customize product offerings and marketing strategies to cater to the unique financial needs and challenges associated with different marital-status groups.
6. Data Pre-Processing
• After the hypothesis testing with the two-tailed t-test and the Chi-Square test of independence, we pre-process the data.
• First, histograms are plotted for the numerical features of the dataset, followed by normalization using Min-Max scaling.
• The dataset is then split into training and testing subsets, followed by standard scaling of both subsets.
6.1 Histogram Plots
Histograms are plotted for each numerical field in the dataset with the help of plt (pyplot), imported from the matplotlib module. Two vertical lines are added to each histogram: one representing the mean (in pink) and another representing the median (in red). A sketch of this step is shown below.
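A minimal sketch of this plotting step, assuming one figure per numeric column; the bin count is an illustrative choice.

import matplotlib.pyplot as plt

numeric_col = dataset.select_dtypes(include="number").columns
for col in numeric_col:
    plt.figure()
    dataset[col].plot.hist(bins=30)
    plt.axvline(dataset[col].mean(), color="pink", label="mean")     # mean line
    plt.axvline(dataset[col].median(), color="red", label="median")  # median line
    plt.title(col)
    plt.legend()
    plt.show()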
Histograms for PAY_3 and BILL_AMT2
Here it can be seen that the features PAY_3 and BILL_AMT2 have very different ranges. To ensure that no feature dominates the others simply because of its scale, the features need to be normalized to a common scale.
6.2 MinMax Scaling (Normalization)
• MinMax scaling is particularly useful when the features in the dataset have
different ranges, and you want to bring them to a common scale without
distorting the differences in the ranges.
• Here the ‘MinMaxScaler()’ class from the ‘sklearn.preprocessing’ module is used. The ‘MinMaxScaler’ scales numerical features to a specific range, usually between 0 and 1.
• ‘scaler.fit_transform(dataset[numeric_col])’ fits the scaler to the selected numeric columns and transforms their values using Min-Max scaling. The ‘fit_transform’ method calculates the minimum and maximum values of each column and scales the values accordingly (see the sketch below).
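A minimal sketch of the Min-Max scaling step; the choice of columns assigned to numeric_col is an assumption.

from sklearn.preprocessing import MinMaxScaler

# numeric_col is assumed to hold the numeric feature columns (excluding ID and the target)
numeric_col = dataset.select_dtypes(include="number").columns.drop(["ID", "default"])

scaler = MinMaxScaler()
dataset[numeric_col] = scaler.fit_transform(dataset[numeric_col])  # values now in [0, 1]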
After this scaling step, the specified numeric columns in the dataset are transformed so that their values lie within the range [0, 1]. The main reason for Min-Max scaling is that some machine learning models, such as support vector machines (SVMs) and k-nearest neighbors (KNN), are sensitive to the scale of the input features; Min-Max scaling can improve the performance and convergence of these models.
The histograms for the numerical features are plotted again after Min-Max scaling; the numeric features are now normalized to a common scale between 0 and 1.
Histograms for PAY_3 and BILL_AMT2 after MinMax Scaling (Normalization)
6.3 Data Splitting
Data Splitting involves dividing a dataset into at least two subsets: a training set and
a test set. The primary purpose of data splitting is to have a designated subset (the
training set) on which the machine learning model is trained. The training set is crucial
for the model to adjust its parameters and make predictions. The test set is reserved
for evaluating the performance of the trained model on data it has never seen before.
This helps to assess how well the model generalizes to new, unseen data. It provides
an estimate of how the model is expected to perform on real-world, future data.
• Selecting the features first: X = dataset.drop(['default', 'Limit_cat', 'Age_cat', 'ID'], axis=1) creates a new DataFrame X by dropping the specified columns ('default', 'Limit_cat', 'Age_cat', 'ID') from the original dataset; the axis=1 parameter specifies that columns are to be dropped. y = dataset['default'] holds the target variable (the 'default' column) that the model aims to predict.
• Splitting of the dataset is done into training and testing sets with the help of
‘train_test_split’ imported from the module ‘sklearn.model_selection’. The
parameters are
o X: features
o y: Target variable,
o test_size: The proportion of the dataset to include in the test split. Here,
it's set to 20%, meaning 80% of the data will be used for training, and
20% for testing.
o random_state: It is set to 0 for reproducibility, ensuring that the same
split is obtained each time the code is run.
• Then the shapes of the training and testing sets are printed (see the sketch below).
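A minimal sketch of the feature selection and split described above, using the column names and parameters stated in the report.

from sklearn.model_selection import train_test_split

X = dataset.drop(["default", "Limit_cat", "Age_cat", "ID"], axis=1)  # features
y = dataset["default"]                                               # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)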
• Now standardization, or z-score normalization, of the features is done. Standardization is a common practice that transforms the features of the dataset so that they have a mean of 0 and a standard deviation of 1.
• It is done by ‘StandardScaler’ imported from the ‘sklearn.preprocessing’
module. Two instances of the StandardScaler class are created. These
instances will be used to scale the features of the training set (train_scaler)
and the test set (test_scaler) separately.
• train_scaler.fit_transform(X) fits the train_scaler on the entire feature set X and transforms the features, so that the mean and standard deviation are computed from the entire set. Similarly, train_scaler.fit_transform(X_train) is applied to the training set and test_scaler.fit_transform(X_test) to the test set. A separate scaler is used for the test set, so its scaling parameters are computed independently of the training set.
• After this process, the standardized features are stored in the DataFrames
X_scaled_df, X_scaled_train_df, and X_scaled_test_df. These
standardized features can then be used for training and evaluating machine
learning models.
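A minimal sketch of the standardization step as described above, with a separate scaler fitted to the test set to match the report; a common alternative is to reuse the scaler fitted on the training set to transform the test set.

import pandas as pd
from sklearn.preprocessing import StandardScaler

train_scaler = StandardScaler()
test_scaler = StandardScaler()

# Standardize the full feature set, the training set, and the test set
X_scaled_df = pd.DataFrame(train_scaler.fit_transform(X), columns=X.columns)
X_scaled_train_df = pd.DataFrame(train_scaler.fit_transform(X_train), columns=X_train.columns)
X_scaled_test_df = pd.DataFrame(test_scaler.fit_transform(X_test), columns=X_test.columns)

print(X_scaled_df.head(2))  # first two rows of the standardized features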
When X_scaled_df.head(2) is executed, we see a tabular representation of the
first two rows of the standardized features in X_scaled_df. The ‘.head(2)’ method
displays the first two rows of the data frame. The number 2 passed to the head
specifies the number of rows to be shown. If no parameter is provided, it defaults to
showing the first five rows. Each column will represent a feature, and the values will
be the standardized versions of the original feature values from X. This gives a quick
glimpse of how the data looks after the standardization process.
7. Machine Learning Classification Model
• After the data pre-processing, the training and testing sets are passed to different ML models for fitting (training) and scoring (performance or accuracy evaluation).
• First, the models are compared using cross-validation; then each model is fitted and scored.
• We have used four ML classification models: LogisticRegression, Support Vector Classifier (SVC), KNeighborsClassifier, and RandomForestClassifier.
7.1 Comparing the Models using Cross Validation
• The compare_models_cross_validation() function evaluates multiple
machine learning models using cross-validation and prints the cross-validation
accuracies for each model. This helps in comparing how well the models
perform on different subsets of the training data and aids in selecting the best-
performing model for further analysis or fine-tuning.
• The cross_val_score method, imported from the ‘sklearn.model_selection’ module, is used to perform cross-validation on the current model (model) using the training set (X_train_scaled and y_train). cv=5 specifies 5-fold cross-validation, dividing the training set into 5 subsets (folds) and using each in turn as a validation set while the model is trained on the rest.
• The mean accuracy across the folds is calculated and expressed as a
percentage. The accuracy is rounded to two decimal places for clarity.
• The cross-validation accuracies for the current model and its mean
accuracy are printed to the console.
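A minimal sketch of the cross-validation comparison described above; the model dictionary and hyperparameters are assumptions (library defaults, plus a higher iteration limit for logistic regression so it converges on scaled data).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "KNeighbors Classifier": KNeighborsClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
}

def compare_models_cross_validation(models, X_train_scaled, y_train):
    for name, model in models.items():
        scores = cross_val_score(model, X_train_scaled, y_train, cv=5)  # 5-fold CV
        mean_acc = round(np.mean(scores) * 100, 2)
        print(f"{name}: fold accuracies = {scores}, mean accuracy = {mean_acc} %")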
Cross Validation and Mean Accuracy for the Different ML Classification Models
• Logistic Regression Classifier: mean accuracy of 80.72 %
• Support Vector Classifier: mean accuracy of 80.26 %
• KNeighbors Classifier: mean accuracy of 79.18 %
• Random Forest Classifier: mean accuracy of 79.67 %
The Logistic Regression Classifier has the highest mean accuracy, as calculated by cross-validation.
7.2 Fitting and Scoring of ML classification models
Fitting a model refers to the process of training or teaching the model on a dataset.
During the fitting process, the model adjusts its internal parameters based on the
provided training data. The model learns patterns, relationships, and underlying
structures in the data that allow it to make predictions. The primary purpose of fitting
a model is to enable it to generalize and make accurate predictions on new, unseen
data.
Scoring a model involves evaluating its performance on a specific dataset using a
predefined metric. Once a model is trained, it needs to be assessed on how well it
can make predictions on data it hasn't seen during training. A scoring metric is used
to quantify the model's performance, and the result is a numerical value representing
how well the model is doing. The purpose of scoring a model is to assess its
effectiveness, understand how well it generalizes to new data, and compare its
performance with other models.
• The function ‘fit_and_score’ takes a dictionary of machine learning models,
training, and testing data. The purpose of this function is to streamline the
process of training multiple models, evaluating their performance, and storing
relevant information for later analysis.
• Three dictionaries are initialized to store the trained models
(models_trained), accuracy scores(model_scores), and classification reports
(clf_reports) for each model. The function iterates through each model in the
provided dictionary.
• The current model is fitted to the training data by model.fit(X_train,y_train) ,
and the trained model is stored in the models_trained dictionary. The accuracy
score of the model on the test data is calculated by
model.score(X_test,y_test) and stored in the model_scores dictionary.
• The model makes predictions on the test data (X_test) by
model.predict(X_test), and a classification report is generated using the
classification_report function imported from module sklearn.metrics.
• A data frame (model_compare) is created to store the accuracy scores of each
model. The function returns the accuracy scores, model comparison
DataFrame, classification reports, and the trained models.
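A minimal sketch of the fit_and_score function following the description above; the exact structure of the returned comparison DataFrame is an assumption.

import pandas as pd
from sklearn.metrics import classification_report

def fit_and_score(models, X_train, X_test, y_train, y_test):
    models_trained, model_scores, clf_reports = {}, {}, {}
    for name, model in models.items():
        model.fit(X_train, y_train)                       # train the model
        models_trained[name] = model
        model_scores[name] = model.score(X_test, y_test)  # accuracy on the test set
        y_pred = model.predict(X_test)
        clf_reports[name] = classification_report(y_test, y_pred)
    model_compare = pd.DataFrame(model_scores, index=["accuracy"]).T
    return model_scores, model_compare, clf_reports, models_trained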
After fitting and scoring the models with the fit_and_score function on the scaled training and test sets, we report the accuracy score for each model:
• Accuracy score for Logistic Regression is 0.820833
• Accuracy score for KNN is 0.790833
• Accuracy score for Support Vector is 0.825833
• Accuracy score for Random Forest is 0.79933
On our given dataset, the Support Vector Classifier performs the best.
8. Model Evaluation Using Confusion Matrix
A confusion matrix is a performance measurement tool in machine learning,
particularly in classification tasks, that provides a comprehensive summary of a
model's predictions. The matrix compares predicted class labels to actual class labels,
breaking down the outcomes into four categories: true positives (TP), true negatives
(TN), false positives (FP), and false negatives (FN).
A confusion matrix is plotted for each of the four classifiers that we have used. The code uses the trained models and predictions obtained from the ‘fit_and_score’ function together with ‘confusion_matrix’, imported from the sklearn.metrics module and displayed as heatmaps, to plot the confusion matrices for all four classifiers.
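A minimal sketch of how the four confusion matrices can be plotted; the seaborn heatmap display and the 2x2 layout are assumptions.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# models_trained is the dictionary returned by fit_and_score
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (name, model) in zip(axes.ravel(), models_trained.items()):
    cm = confusion_matrix(y_test, model.predict(X_test))
    sns.heatmap(cm, annot=True, fmt="d", cbar=False, ax=ax)
    ax.set_title(name)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
plt.tight_layout()
plt.show()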
Confusion Matrix for Logistic Regression, KNN, Support Vector,
Random Forest Classifiers
Finally, we print the classification report for all the implemented models.
Logistic Regression: Achieves good accuracy among the models (82%) but struggles with recall for the default class, indicating challenges in correctly identifying instances of the positive class.
K-Nearest Neighbours (KNN): Shows reasonable accuracy but has a lower recall for the default class, suggesting difficulty in capturing true positive instances.
Support Vector: Outperforms other models in terms of accuracy (83%).
Demonstrates better precision and recall for Class 1, making it effective in identifying
positive instances.
Random Forest: Provides a good balance between precision and recall but has lower
overall accuracy compared to the Support Vector model.
9. Conclusion
In this project, we performed data analysis and visualisation, hypothesis testing, and data pre-processing on the default of credit card clients dataset. We then trained different machine learning classification models: Logistic Regression, KNN, Support Vector, and Random Forest. The Support Vector model emerges as the most effective in this context, showing the highest accuracy and a better precision-recall trade-off for positive instances; this characteristic is valuable in scenarios where false positives have significant consequences. Challenges with defaulter class prediction: across the models, there is a common difficulty in effectively predicting instances of the defaulter class, as indicated by the lower recall and F1-score for that class.
10. References
1. https://archive.ics.uci.edu/
2. https://online.stanford.edu/courses/cs229-machine-learning
3. https://www.datacamp.com/blog/classification-machine-learning
4. IDS course material
GitHub Link for Code: https://github.com/Chaudhary-Harshit/IDS_Project.git