CreditCardDetection Project

The Credit Card Fraud Detection Dataset 2023 consists of over 550,000 anonymized transaction records from European cardholders, aimed at developing fraud detection models. Key features include transaction ID, anonymized variables (V1-V28), transaction amount, and a class label indicating fraud status. The project utilizes various statistical and machine learning techniques, including logistic regression, to identify significant predictors of fraudulent transactions.

Uploaded by

dmarker331

🚨 Credit Card Fraud Detection Dataset 2023 – Project Overview
By: Akshat Kumar Makwana 23029
Welcome to this exciting project where we dive into the world of fraud detection using the
Credit Card Fraud Detection Dataset 2023! With 550,000+ real-world transaction records from
European cardholders (all anonymized for privacy), this dataset gives us a great playground to
build and test fraud detection models that can actually make a difference.

🧾 Key Features:

• id : A unique tag for every transaction

• V1-V28 : Mysterious but powerful anonymized features extracted from transaction
metadata (possibly time, location, behavior patterns, etc.)
• Amount : How much money was spent
• Class : The game-changer: 0 means genuine transaction, 1 means fraud!

The goal here? Train models that can spot fraudulent activity before it causes harm. Think of it as
building a security system that never sleeps.

Source: This dataset was built from 2023 European cardholder transactions. All personal
identifiers have been stripped out to ensure ethical use and complete privacy.

[Link]

# Data Wrangling
import pandas as pd
import numpy as np

# Statistics / Logistic Regression
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from scipy import stats

# Cross Validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Confusion Matrix
from sklearn.metrics import confusion_matrix

# Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Trees
from sklearn import tree
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

# Scores
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Model tuning
from sklearn.model_selection import GridSearchCV

RANDOM_STATE = 42
%matplotlib inline

Load Dataset & Get Overview of the Data

credit_original_data = pd.read_csv("creditcard_2023.csv")
credit_original_data.head()

   id        V1        V2        V3        V4        V5        V6        V7  \
0   0 -0.260648 -0.469648  2.496266 -0.083724  0.129681  0.732898  0.519014
1   1  0.985100 -0.356045  0.558056 -0.429654  0.277140  0.428605  0.406466
2   2 -0.260272 -0.949385  1.728538 -0.457986  0.074062  1.419481  0.743511
3   3 -0.152152 -0.508959  1.746840 -1.090178  0.249486  1.143312  0.518269
4   4 -0.206820 -0.165280  1.527053 -0.448293  0.106125  0.530549  0.658849

         V8        V9  ...       V21       V22       V23       V24       V25  \
0 -0.130006  0.727159  ... -0.110552  0.217606 -0.134794  0.165959  0.126280
1 -0.133118  0.347452  ... -0.194936 -0.605761  0.079469 -0.577395  0.190090
2 -0.095576 -0.261297  ... -0.005020  0.702906  0.945045 -1.154666 -0.605564
3 -0.065130 -0.205698  ... -0.146927 -0.038212 -0.214048 -1.893131  1.003963
4 -0.212660  1.049921  ... -0.106984  0.729727 -0.161666  0.312561 -0.414116

        V26       V27       V28    Amount  Class
0 -0.434824 -0.081230 -0.151045  17982.10      0
1  0.296503 -0.248052 -0.064512   6531.37      0
2 -0.312895 -0.300258 -0.244718   2513.54      0
3 -0.515950 -0.165316  0.048424   5384.44      0
4  1.071126  0.023712  0.419117  14278.97      0

[5 rows x 31 columns]

cc_data = credit_original_data.copy()
cc_data.info()
cc_data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568630 entries, 0 to 568629
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 568630 non-null int64
1 V1 568630 non-null float64
2 V2 568630 non-null float64
3 V3 568630 non-null float64
4 V4 568630 non-null float64
5 V5 568630 non-null float64
6 V6 568630 non-null float64
7 V7 568630 non-null float64
8 V8 568630 non-null float64
9 V9 568630 non-null float64
10 V10 568630 non-null float64
11 V11 568630 non-null float64
12 V12 568630 non-null float64
13 V13 568630 non-null float64
14 V14 568630 non-null float64
15 V15 568630 non-null float64
16 V16 568630 non-null float64
17 V17 568630 non-null float64
18 V18 568630 non-null float64
19 V19 568630 non-null float64
20 V20 568630 non-null float64
21 V21 568630 non-null float64
22 V22 568630 non-null float64
23 V23 568630 non-null float64
24 V24 568630 non-null float64
25 V25 568630 non-null float64
26 V26 568630 non-null float64
27 V27 568630 non-null float64
28 V28 568630 non-null float64
29 Amount 568630 non-null float64
30 Class 568630 non-null int64
dtypes: float64(29), int64(2)
memory usage: 134.5 MB
                  id            V1            V2            V3            V4  \
count  568630.000000  5.686300e+05  5.686300e+05  5.686300e+05  5.686300e+05
mean   284314.500000 -1.109271e-14 -3.429498e-14 -1.209242e-14  3.825991e-15
std    164149.486121  1.000001e+00  1.000001e+00  1.000001e+00  1.000001e+00
min         0.000000 -3.495584e+00 -4.996657e+01 -3.183760e+00 -4.951222e+00
25%    142157.250000 -5.652859e-01 -4.866777e-01 -6.492987e-01 -6.560203e-01
50%    284314.500000 -9.363846e-02 -1.358939e-01  3.528579e-04 -7.376152e-02
75%    426471.750000  8.326582e-01  3.435552e-01  6.285380e-01  7.070047e-01
max    568629.000000  2.229046e+00  4.361865e+00  1.412583e+01  3.201536e+00

                 V5            V6            V7            V8            V9  \
count  5.686300e+05  5.686300e+05  5.686300e+05  5.686300e+05  5.686300e+05
mean   6.288281e-15 -2.751174e-14  1.240002e-14  8.208047e-15 -1.002980e-14
std    1.000001e+00  1.000001e+00  1.000001e+00  1.000001e+00  1.000001e+00
min   -9.952786e+00 -2.111111e+01 -4.351839e+00 -1.075634e+01 -3.751919e+00
25%   -2.934955e-01 -4.458712e-01 -2.835329e-01 -1.922572e-01 -5.687446e-01
50%    8.108788e-02  7.871758e-02  2.333659e-01 -1.145242e-01  9.252647e-02
75%    4.397368e-01  4.977881e-01  5.259548e-01  4.729905e-02  5.592621e-01
max    4.271689e+01  2.616840e+01  2.178730e+02  5.958040e+00  2.027006e+01

       ...           V21           V22           V23           V24  \
count  ...  5.686300e+05  5.686300e+05  5.686300e+05  5.686300e+05
mean   ...  2.210679e-15 -8.767441e-16  4.376179e-16  6.825608e-16
std    ...  1.000001e+00  1.000001e+00  1.000001e+00  1.000001e+00
min    ... -1.938252e+01 -7.734798e+00 -3.029545e+01 -4.067968e+00
25%    ... -1.664408e-01 -4.904892e-01 -2.376289e-01 -6.515801e-01
50%    ... -3.743065e-02 -2.732881e-02 -5.968903e-02  1.590123e-02
75%    ...  1.479787e-01  4.638817e-01  1.557153e-01  7.007374e-01
max    ...  8.087080e+00  1.263251e+01  3.170763e+01  1.296564e+01

                V25           V26           V27           V28         Amount  \
count  5.686300e+05  5.686300e+05  5.686300e+05  5.686300e+05  568630.000000
mean   2.545689e-15  1.781906e-15  2.817586e-15  2.891419e-15   12041.957635
std    1.000001e+00  1.000001e+00  1.000001e+00  1.000001e+00    6919.644449
min   -1.361263e+01 -8.226969e+00 -1.049863e+01 -3.903524e+01      50.010000
25%   -5.541485e-01 -6.318948e-01 -3.049607e-01 -2.318783e-01    6054.892500
50%   -8.193162e-03 -1.189208e-02 -1.729111e-01 -1.392973e-02   12030.150000
75%    5.500147e-01  6.728879e-01  3.340230e-01  4.095903e-01   18036.330000
max    1.462151e+01  5.623285e+00  1.132311e+02  7.725594e+01   24039.930000

          Class
count  568630.0
mean        0.5
std         0.5
min         0.0
25%         0.0
50%         0.5
75%         1.0
max         1.0

[8 rows x 31 columns]

Significant Variables
Based on the logistic regression, the z-scores for the coefficients of V1 through V28 (excluding
V5) all exceed 2 in magnitude, which indicates that these variables are statistically significant
predictors of Class (i.e., whether or not a transaction is fraudulent). V13 only barely clears the
threshold (z = 2.014, p = 0.044). V5 and Amount have z-scores with a magnitude less than 2, which
suggests that these variables are not statistically significant in predicting Class once the other
variables are in the model.

However, including a large number of independent variables without strong theoretical
justification can lead to overfitting.
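
To make the link between a z-score and the P>|z| column of the regression summary concrete, here is a minimal sketch that converts a z-score into a two-sided p-value under a standard normal null. The helper name two_sided_p is hypothetical, and the z-values are simply copied from the full-model GLM summary in this section; nothing is refit here.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal null."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2))

# z-scores as reported in the full-model GLM summary (copied, not recomputed)
reported_z = {"V5": -1.057, "V13": 2.014, "Amount": -0.062}

p_values = {name: two_sided_p(z) for name, z in reported_z.items()}
for name, p in p_values.items():
    print(f"{name}: z = {reported_z[name]:+.3f}, p = {p:.3f}")
```

A |z| of roughly 1.96 corresponds to p = 0.05, which is why "magnitude greater than 2" is used as the rule of thumb above.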

# Perform logistic regression using the glm (generalized linear model) method
logit_equation = (
    'Class~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14+V15+V16+V17+V18+'
    'V19+V20+V21+V22+V23+V24+V25+V26+V27+V28+Amount'
)
fit1 = smf.glm(logit_equation, data=cc_data,
               family=sm.families.Binomial()).fit()
print(fit1.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  Class   No. Observations:               568630
Model:                            GLM   Df Residuals:                   568600
Model Family:                Binomial   Df Model:                           29
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -53585.
Date:                Sat, 04 Nov 2023   Deviance:                   1.0717e+05
Time:                          [Link]   Pearson chi2:                 7.82e+16
No. Iterations:                    13
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      9.0810      0.084    108.499      0.000       8.917       9.245
V1            -0.6880      0.019    -36.668      0.000      -0.725      -0.651
V2             0.1582      0.016      9.703      0.000       0.126       0.190
V3            -1.1804      0.020    -59.558      0.000      -1.219      -1.142
V4             3.6329      0.027    136.191      0.000       3.581       3.685
V5            -0.0153      0.014     -1.057      0.290      -0.044       0.013
V6            -0.4789      0.017    -28.909      0.000      -0.511      -0.446
V7            -1.1092      0.026    -42.289      0.000      -1.161      -1.058
V8            -2.8335      0.045    -63.570      0.000      -2.921      -2.746
V9            -0.4779      0.022    -21.598      0.000      -0.521      -0.435
V10           -1.8979      0.031    -60.689      0.000      -1.959      -1.837
V11            1.8757      0.018    103.216      0.000       1.840       1.911
V12           -2.8171      0.026   -109.223      0.000      -2.868      -2.767
V13            0.0176      0.009      2.014      0.044       0.000       0.035
V14           -3.3569      0.026   -127.275      0.000      -3.409      -3.305
V15           -0.2418      0.009    -27.972      0.000      -0.259      -0.225
V16           -0.8542      0.024    -35.459      0.000      -0.901      -0.807
V17           -1.9342      0.027    -71.811      0.000      -1.987      -1.881
V18           -0.9296      0.021    -45.195      0.000      -0.970      -0.889
V19           -0.0804      0.012     -6.443      0.000      -0.105      -0.056
V20            0.1328      0.012     10.902      0.000       0.109       0.157
V21            0.2487      0.028      8.737      0.000       0.193       0.304
V22            0.4401      0.015     29.085      0.000       0.410       0.470
V23           -0.3412      0.011    -31.866      0.000      -0.362      -0.320
V24           -0.1660      0.009    -18.049      0.000      -0.184      -0.148
V25            0.1617      0.011     14.900      0.000       0.140       0.183
V26           -0.1069      0.010    -10.920      0.000      -0.126      -0.088
V27            0.1873      0.022      8.642      0.000       0.145       0.230
V28            0.1534      0.011     14.578      0.000       0.133       0.174
Amount     -7.252e-08   1.17e-06     -0.062      0.950   -2.36e-06    2.22e-06
==============================================================================

For simplicity and to avoid overfitting, let's focus on V1-V5, Amount, and Class for our analysis.
# Columns to keep
target_columns = ['id', 'V1', 'V2', 'V3', 'V4', 'V5', 'Amount',
'Class']
# Create a new DataFrame with the selected columns
cc_data = cc_data[target_columns]
cc_data.head()

   id        V1        V2        V3        V4        V5    Amount  Class
0   0 -0.260648 -0.469648  2.496266 -0.083724  0.129681  17982.10      0
1   1  0.985100 -0.356045  0.558056 -0.429654  0.277140   6531.37      0
2   2 -0.260272 -0.949385  1.728538 -0.457986  0.074062   2513.54      0
3   3 -0.152152 -0.508959  1.746840 -1.090178  0.249486   5384.44      0
4   4 -0.206820 -0.165280  1.527053 -0.448293  0.106125  14278.97      0

Distribution of V1-V5 and Amount

# Define the columns you want to plot
columns_to_plot = ['V1', 'V2', 'V3', 'V4', 'V5', 'Amount']

# Create subplots for the histograms
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.5)

for i, col in enumerate(columns_to_plot):
    row_idx = i // 3
    col_idx = i % 3

    sns.histplot(data=cc_data, x=col, kde=True, color='blue',
                 ax=axes[row_idx, col_idx])
    axes[row_idx, col_idx].set_title(f'Histogram of {col}')
    axes[row_idx, col_idx].set_xlabel(col)
    axes[row_idx, col_idx].set_ylabel('Frequency')

# Remove empty subplots, if any
for i in range(len(columns_to_plot), 2 * 3):
    fig.delaxes(axes.flatten()[i])

plt.show()
Distribution of Fraudulent Transactions in Dataset
plt.figure(figsize=(6, 4))
sns.countplot(x='Class', data=cc_data, palette='Set1')
plt.title('Distribution of Classes')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

fraud_count = (cc_data['Class']==1).sum()
nonfraud_count = (cc_data['Class']==0).sum()
print("Count of Fraudulent Transactions:", fraud_count)
print("Count of Non-Fraudulent Transactions:", nonfraud_count)
Count of Fraudulent Transactions: 284315
Count of Non-Fraudulent Transactions: 284315

Significant Variables Revised

We see that with fewer independent variables in the binary logistic regression model, V5 is now
a statistically significant variable, as evidenced by its Z-score magnitude greater than 2.
Additionally, V1 through V4 also continue to exhibit statistically significant Z-scores.

However, Amount remains a statistically insignificant variable for predicting fraudulent
transactions, as its Z-score does not exceed the threshold of 2.

Negative Coefficients (V1, V3, V5):

When a coefficient is negative, it indicates an inverse relationship between the corresponding
independent variable and the log-odds of the outcome.

This means that as the values of V1, V3, and V5 increase, the log-odds of a transaction being
fraudulent (Class = 1) decrease. In other words, higher values of V1, V3, and V5 are associated
with a lower likelihood of a fraudulent transaction.

Positive Coefficients (V2, V4):

Positive coefficients indicate a direct relationship between the independent variable and the
log-odds of the outcome.
This means that as the values of V2 and V4 increase, the log-odds of a transaction being
fraudulent (Class = 1) increase. Higher values of V2 and V4 are associated with a higher
likelihood of a fraudulent transaction.
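
To put the sign interpretation in more tangible terms, the sketch below exponentiates a coefficient to get an odds ratio. It is illustrative only: the two coefficients are copied from the full-model GLM summary earlier in the notebook (the reduced V1-V5 model's coefficients are not shown here), and the helper logistic is a hypothetical name. Since V1-V28 have a standard deviation of about 1, a one-unit increase is roughly a one-standard-deviation increase.

```python
import math

# Coefficients copied from the full-model GLM summary (illustrative only;
# the reduced V1-V5 model will have different values).
coefs = {"V1": -0.6880, "V4": 3.6329}

# exp(coef) is the multiplicative change in the odds of fraud for a
# one-unit increase in the feature, holding the other features fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

def logistic(log_odds: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio = {ratio:.3f} ({direction} the odds of fraud)")
```

A negative coefficient gives an odds ratio below 1 and a positive one gives a ratio above 1, which matches the inverse and direct relationships described above.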

# Perform logistic regression using the glm (generalized linear model) method.
# Note: we leave out Amount as it isn't statistically significant
logit_eq = 'Class~V1+V2+V3+V4+V5'
fit1 = smf.glm(logit_eq, data=cc_data,
               family=sm.families.Binomial()).fit()
print(fit1.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  Class   No. Observations:               568630
Model:                            GLM   Df Residuals:                   568624
Model Family:                Binomial   Df Model:                            5
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.0149e+05
Date:                Sat, 04 Nov 2023   Deviance:                   2.0298e+05
Time:                          [Link]   Pearson chi2:                 3.39e+08
No. Iterations:                     9
Covariance Type:            nonrobust

Cross Validation of our Model

Using the validation set approach, we estimate the test error of this model.

• Split the sample set into a random training set and a random validation set. Use a test
size of 20%
• Store the training set in train, testing set in test
# Split data into training and validation/test set
train, test = train_test_split(cc_data, test_size=0.2, random_state=42)

# Fit logistic regression model with training set
fit2 = smf.glm(logit_eq, data=train,
               family=sm.families.Binomial()).fit()

• Obtain a fraud prediction for each transaction in the test set by computing the predicted
probability of fraud and classifying the transaction as fraudulent if that probability is
greater than or equal to 0.5 (the Bayes classifier decision rule). Store the predictions in
predicted.

• Compute the misclassification rate, which is the fraction of the observations in the
validation set that are misclassified. Name this variable mis_rate.

# Make predictions on the validation set with the new model
predictions = fit2.predict(test)

# Convert predicted probabilities to binary predictions: 1 if prob >= 0.5
encode = lambda x: 1 if x >= 0.5 else 0
predicted = predictions.map(encode)

# Compare the predicted values with the actual values in the test set
misclassified = (predicted != test['Class']).sum()

mis_rate = misclassified / len(test)
print("Our model has a misclassification rate of:", mis_rate*100, "%")

Our model has a misclassification rate of: 6.686245889242566 %

Confusion Matrix
true_labels = test['Class']
cm = confusion_matrix(true_labels, predicted)
# Plot the confusion matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Non-Fraudulent', 'Fraudulent'],
            yticklabels=['Non-Fraudulent', 'Fraudulent'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

Results
The printed misclassification rate of approximately 0.0669 (6.69%) corresponds to about 7,604
misclassified transactions out of the 113,726 in the test set (20% of 568,630). This means that
about 6.69% of the predictions made by the model on the test data are incorrect.
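
For readers who want to see how the rate follows from the confusion matrix, here is a small sketch. The counts are toy values chosen for easy arithmetic, NOT the notebook's results, and the helper name rates_from_confusion is hypothetical. It assumes scikit-learn's layout for binary labels 0/1, where the matrix is [[TN, FP], [FN, TP]].

```python
def rates_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive common rates from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        # Off-diagonal cells are the misclassified observations
        "misclassification_rate": (fp + fn) / total,
        "accuracy": (tn + tp) / total,
        # Of the transactions flagged as fraud, how many really were fraud
        "precision": tp / (tp + fp),
        # Of the truly fraudulent transactions, how many were flagged
        "recall": tp / (tp + fn),
    }

# Toy counts (hypothetical, not taken from the notebook's confusion matrix)
toy = rates_from_confusion(tn=50, fp=5, fn=3, tp=42)
for name, value in toy.items():
    print(f"{name}: {value:.4f}")
```

The misclassification rate and accuracy always sum to 1, so either one can be reported; precision and recall add information about which kind of error dominates.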

Let's repeat this process three more times, using three different splits of the observations into a
training set and a validation set.
• Use test sizes of 10%, 30%, and 50%.
• Name the misclassification rates mis_rate_10, mis_rate_30, mis_rate_50,
respectively.
• Store the misclassification rate for each model within model_rates and the model
parameters within model_params
test_sizes = [0.1, 0.3, 0.5]
# To store the misclassification rate for each model
model_rates = []
# To store the model parameters
model_params = []

# Iterate through each test size
for test_size in test_sizes:
    # Split data into training and validation sets
    train, test = train_test_split(cc_data, test_size=test_size,
                                   random_state=42)

    # Fit the logistic regression model on the training set
    model = smf.glm(logit_eq, data=train,
                    family=sm.families.Binomial()).fit()

    # Store model parameters
    model_params.append(model.params)

    # Get predicted probabilities on the validation set
    predicted_probs = model.predict(test)

    # Classify predictions using the 0.5 threshold
    predicted = predicted_probs.map(encode)

    # Calculate and store the misclassification rate
    misclassified = (predicted != test['Class']).sum()
    mis_rate = misclassified / len(test)
    model_rates.append(mis_rate)

# Store misclassification rates in designated variables
mis_rate_10, mis_rate_30, mis_rate_50 = model_rates

print("Misclassification Rate for 10% Test Size:", mis_rate_10*100, "%")
print("Misclassification Rate for 30% Test Size:", mis_rate_30*100, "%")
print("Misclassification Rate for 50% Test Size:", mis_rate_50*100, "%")

Misclassification Rate for 10% Test Size: 6.853314105833319 %

Misclassification Rate for 30% Test Size: 6.631728892249793 %
Misclassification Rate for 50% Test Size: 6.641225401403374 %
K-Fold Cross Validation, K=10
Using KFold cross validation with 10 folds across 10 trials to further test our model, we calculate
the average misclassification rate.

• Name this mis_rate_kfold.

The average misclassification rate of 0.0662 suggests that, on average, the model is making
correct predictions for approximately 93.38% of the data points, which is a reasonable level of
performance.

# Store all misclassification rates
misclass_rates = []

model = LogisticRegression()  # Define ML model

for trial in range(10):
    # Define cross-validation method (a different shuffle for each trial)
    cv_method = KFold(n_splits=10, shuffle=True, random_state=trial)

    # Perform CV and get accuracy scores for each fold
    scores = cross_val_score(model,
                             cc_data[['V1', 'V2', 'V3', 'V4', 'V5']],
                             cc_data['Class'], cv=cv_method,
                             scoring='accuracy')

    # Calculate misclassification rate for each fold
    misclass_rate = 1 - scores
    misclass_rates.extend(misclass_rate)

# Average over all 100 folds (10 trials x 10 folds)
mis_rate_kfold = sum(misclass_rates) / len(misclass_rates)

print(mis_rate_kfold)

0.06621212387668608
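
To show what a shuffled k-fold split does mechanically, here is a plain-Python sketch. It is a stand-in for illustration, not scikit-learn's implementation: the helper kfold_indices is hypothetical, the sample size of 23 is arbitrary, and the shuffle order will differ from what KFold produces. The point is that every observation lands in exactly one validation fold.

```python
import random

def kfold_indices(n_samples: int, n_splits: int, seed: int):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds get one extra observation
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=23, n_splits=10, seed=0))
fold_sizes = [len(val) for _, val in folds]
print(fold_sizes)  # prints [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```

Repeating the split with a different seed per trial, as the notebook does with random_state=trial, changes which observations share a fold and so averages out the luck of any single partition.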

Project Summary So Far

Logistic Regression Modeling:
• Conducted logistic regression analysis to build a predictive model for fraud detection.
• Investigated the significance of individual features ('V1' to 'V28' and 'Amount') by
examining Z-scores and p-values.
• Iteratively built and evaluated logistic regression models with different subsets of
features and test sizes, assessing their misclassification rates.

K-Fold Cross Validation:

• Employed 10-fold cross-validation across 10 trials to estimate the average
misclassification rate of the logistic regression model.
• Calculated the average misclassification rate as a measure of model performance across
different subsets of the data.

Key Findings:
• The model's performance, as measured by the misclassification rate, varied based on the
choice of features and test sizes.
• Specific features, such as 'V5,' demonstrated varying levels of significance depending on
the model configuration.
• Average misclassification rates across K-Fold Cross Validation suggest the model's
performance in correctly classifying transactions as fraudulent or non-fraudulent.
• We obtained a misclassification rate of roughly 6.6% to 6.9% when classifying credit-card
transactions using V1-V5.

Next Steps:
• Explore advanced techniques such as ensemble methods (e.g., Bagging and Random
Forest) for improved fraud detection and hyperparameter tuning.
# Get the original dataset
cc_data = credit_original_data.copy()
cc_data = cc_data.drop(columns = ['id', 'Amount'])

cc_data.head()

         V1        V2        V3        V4        V5        V6        V7  \
0 -0.260648 -0.469648  2.496266 -0.083724  0.129681  0.732898  0.519014
1  0.985100 -0.356045  0.558056 -0.429654  0.277140  0.428605  0.406466
2 -0.260272 -0.949385  1.728538 -0.457986  0.074062  1.419481  0.743511
3 -0.152152 -0.508959  1.746840 -1.090178  0.249486  1.143312  0.518269
4 -0.206820 -0.165280  1.527053 -0.448293  0.106125  0.530549  0.658849

         V8        V9       V10  ...       V20       V21       V22       V23  \
0 -0.130006  0.727159  0.637735  ...  0.091202 -0.110552  0.217606 -0.134794
1 -0.133118  0.347452  0.529808  ... -0.233984 -0.194936 -0.605761  0.079469
2 -0.095576 -0.261297  0.690708  ...  0.361652 -0.005020  0.702906  0.945045
3 -0.065130 -0.205698  0.575231  ... -0.378223 -0.146927 -0.038212 -0.214048
4 -0.212660  1.049921  0.968046  ...  0.247237 -0.106984  0.729727 -0.161666

        V24       V25       V26       V27       V28  Class
0  0.165959  0.126280 -0.434824 -0.081230 -0.151045      0
1 -0.577395  0.190090  0.296503 -0.248052 -0.064512      0
2 -1.154666 -0.605564 -0.312895 -0.300258 -0.244718      0
3 -1.893131  1.003963 -0.515950 -0.165316  0.048424      0
4  0.312561 -0.414116  1.071126  0.023712  0.419117      0

[5 rows x 29 columns]

# Calculate pairwise correlations
corr_matrix = cc_data.corr()

# Find the correlation of 'Class' with all predictor variables
corr_class = corr_matrix['Class']

# Find the predictor with the highest positive correlation
highest_corr = corr_class.drop('Class').idxmax()

print(f"The feature that has the highest correlation with Credit Card Fraud is: {highest_corr}")

The feature that has the highest correlation with Credit Card Fraud is: V4

Splitting Training & Test Data

X = cc_data.drop('Class', axis=1) # All predictor variables
y = cc_data['Class'] # Target Output

X_train, X_test, y_train, y_test = train_test_split(X, y,

test_size=0.2, random_state=42)

Bootstrap Aggregation (Bagging)

• Using all predictor variables (V1-V28), train a base model to predict Credit Card Fraud
# Create Bagging Classifier
bag = BaggingClassifier(random_state=42)

# Train Bagging Classifier
bag.fit(X_train, y_train)

BaggingClassifier(random_state=42)

# Make predictions using the Bagging Classifier on test data
y_pred = bag.predict(X_test)

# Calculate Precision, Recall, and Accuracy Scores

bag_precision = precision_score(y_test, y_pred)
bag_recall = recall_score(y_test, y_pred)
bag_accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier - Precision: {bag_precision}")
print(f"Bagging Classifier - Recall: {bag_recall}")
print(f"Bagging Classifier - Accuracy: {bag_accuracy}")

Bagging Classifier - Precision: 0.9991053730243654

Bagging Classifier - Recall: 0.9996489750070205
Bagging Classifier - Accuracy: 0.9993756924537924

bag_misclass_rate = (1 - bag_accuracy)*100
print(f"Bagging misclassification rate: {bag_misclass_rate:.2f}%")

Bagging misclassification rate: 0.06%

Tuning Bagging
• Tuning may not be necessary since our precision, recall, and accuracy scores are already
favorable.
• However, below is an implementation on how we can tune using GridSearchCV with 10-
fold CV if needed
"""
# Define the hyperparameter grid to search through (range 5 to 29)
params = {'n_estimators': range(5, 30)}

# Create GridSearchCV with 10-Fold CV

grid_search = GridSearchCV(bag, params, cv=10, scoring='accuracy')

# Fit GridSearchCV to training data

grid_search.fit(X_train, y_train)

# Get best estimator with tuned parameters

best_bag = grid_search.best_estimator_

# Make predictions on test data

y_pred = best_bag.predict(X_test)

bag_tuned_precision = precision_score(y_test, y_pred)

bag_tuned_recall = recall_score(y_test, y_pred)
bag_tuned_accuracy = accuracy_score(y_test, y_pred)

# Store best performing # of estimators

#number = best_bag.n_estimators
bag_best_param = {'n_estimators': best_bag.n_estimators}

print(f"Tuned Bagging Classifier - Precision: {bag_tuned_precision}")

print(f"Tuned Bagging Classifier - Recall: {bag_tuned_recall}")
print(f"Tuned Bagging Classifier - Accuracy: {bag_tuned_accuracy}")
print(bag_best_param)
"""

Random Forest
• Using all predictor variables (V1-V28), train a base model to predict Credit Card Fraud
# Create Random Forest Classifier
rf = RandomForestClassifier()

# Train rf classifier
rf.fit(X_train, y_train)

RandomForestClassifier()

# Predict test outcomes using random forest classifier
y_pred_rf = rf.predict(X_test)

rf_precision = precision_score(y_test, y_pred_rf)

rf_recall = recall_score(y_test, y_pred_rf)
rf_accuracy = accuracy_score(y_test, y_pred_rf)

print(f"RandomForest Classifier - Precision: {rf_precision}")

print(f"RandomForest Classifier - Recall: {rf_recall}")
print(f"RandomForest Classifier - Accuracy: {rf_accuracy}")

RandomForest Classifier - Precision: 0.9997543428671697

RandomForest Classifier - Recall: 1.0
RandomForest Classifier - Accuracy: 0.9998768971035648

rf_misclass_rate = (1 - rf_accuracy)*100
print(f"Random Forest misclassification rate: {rf_misclass_rate:.2f}%")

Random Forest misclassification rate: 0.01%

# Access the feature importances
feature_importances = rf.feature_importances_

# Get the feature names from your dataset

feature_names = X_train.columns

# Print all features and their importances

for feature_name, importance in zip(feature_names, feature_importances):
    print(f"Feature {feature_name}: Importance = {importance:.4f}")

Feature V1: Importance = 0.0120

Feature V2: Importance = 0.0326
Feature V3: Importance = 0.0612
Feature V4: Importance = 0.1463
Feature V5: Importance = 0.0070
Feature V6: Importance = 0.0068
Feature V7: Importance = 0.0173
Feature V8: Importance = 0.0108
Feature V9: Importance = 0.0186
Feature V10: Importance = 0.1262
Feature V11: Importance = 0.0955
Feature V12: Importance = 0.0714
Feature V13: Importance = 0.0064
Feature V14: Importance = 0.1842
Feature V15: Importance = 0.0058
Feature V16: Importance = 0.0538
Feature V17: Importance = 0.0741
Feature V18: Importance = 0.0068
Feature V19: Importance = 0.0075
Feature V20: Importance = 0.0057
Feature V21: Importance = 0.0152
Feature V22: Importance = 0.0041
Feature V23: Importance = 0.0056
Feature V24: Importance = 0.0043
Feature V25: Importance = 0.0056
Feature V26: Importance = 0.0057
Feature V27: Importance = 0.0043
Feature V28: Importance = 0.0055
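
The printed list is easier to read once ranked. The sketch below sorts the importances by value; the numbers are copied by hand from the output above rather than read from the fitted model, and because rf was created without a random_state, a rerun would give slightly different values.

```python
# Importances copied from the printed Random Forest output above
importances = {
    "V1": 0.0120, "V2": 0.0326, "V3": 0.0612, "V4": 0.1463,
    "V5": 0.0070, "V6": 0.0068, "V7": 0.0173, "V8": 0.0108,
    "V9": 0.0186, "V10": 0.1262, "V11": 0.0955, "V12": 0.0714,
    "V13": 0.0064, "V14": 0.1842, "V15": 0.0058, "V16": 0.0538,
    "V17": 0.0741, "V18": 0.0068, "V19": 0.0075, "V20": 0.0057,
    "V21": 0.0152, "V22": 0.0041, "V23": 0.0056, "V24": 0.0043,
    "V25": 0.0056, "V26": 0.0057, "V27": 0.0043, "V28": 0.0055,
}

# Sort features from most to least important
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top3 = [name for name, _ in ranked[:3]]

for name, value in ranked[:5]:
    print(f"{name}: {value:.4f}")
print("Top 3:", top3)  # prints Top 3: ['V14', 'V4', 'V10']
```

With the fitted model at hand, the same ranking could be obtained by pairing X_train.columns with rf.feature_importances_ and sorting in the same way.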

Tuning Random Forest

• Similar to the reasoning for Tuning Bagging, here is an implementation on how we can
tune the parameters determining the depth of a tree and the number of trees to possibly
further improve our Random Forest Classifier.
"""
# Define our hyperparameters
params = {'max_depth':range(3,20), # Range of tree depth
'n_estimators':range(10,29) # Range of # of trees
}

# Create GridSearchCV for RandomForest Classifier with 10fold CV

grid_search_rf = GridSearchCV(rf, params, cv=10, scoring='accuracy')

# Fit GridSearch with training data

grid_search_rf.fit(X_train, y_train)

# Get best estimator with tuned hyperparameters

best_rf = grid_search_rf.best_estimator_

# Make predictions using tuned rf classifier

y_pred = best_rf.predict(X_test)

rf_tuned_precision = precision_score(y_test, y_pred)

rf_tuned_recall = recall_score(y_test, y_pred)
rf_tuned_accuracy = accuracy_score(y_test, y_pred)

rf_tuned = {'max_depth': best_rf.max_depth, 'n_estimators':

best_rf.n_estimators}

print(f"Tuned Random Forest - Accuracy: {rf_tuned_accuracy}")

print(f"Tuned Random Forest - Recall: {rf_tuned_recall}")
print(f"Tuned Random Forest - Precision: {rf_tuned_precision}")
"""

Summary
In this project, I aimed to develop machine learning models for credit card fraud prediction using
a dataset containing features V1-V28. These features were essential for detecting fraudulent
transactions and building effective classification models.

Results
• Initially, I experimented with a simple subset of features, V1-V5, for the logistic
regression classification model, resulting in a misclassification rate of about 6.6%. While
this provided a baseline, I sought to improve model performance further.

• I then explored more advanced techniques, including a Bagging Classification model, which
reduced the misclassification rate to just 0.06% using all V1-V28 features. This improvement
demonstrated the value of ensemble methods in handling complex classification tasks.

• Subsequently, I employed a Random Forest Classification model using all V1-V28 features,
which further improved performance to a 0.01% misclassification rate. The Random Forest's
ability to capture complex relationships within the data was a key contributor to this result.

• The top 3 anonymized features by Random Forest feature importance were V14, V4, and V10.
Future Improvements
• Investigate optimal subsets of features and optimize the selection process. Methods
like forward and backward selection can help identify the most informative feature
subsets for improved model performance.

• Explore the potential for incorporating regularization techniques such as L1 (Lasso)
and L2 (Ridge) regularization to control overfitting and further enhance model
generalization.

No ratings yet
Curentul Electric in Functie de Radacina Patrata A Tensiunii de Franare
5 pages
Assignment 4
No ratings yet
Assignment 4
7 pages
Ionosphere Radar Data Analysis in R
No ratings yet
Ionosphere Radar Data Analysis in R
22 pages
DS Practice
No ratings yet
DS Practice
3 pages
Newton-Raphson Method for Root Finding
No ratings yet
Newton-Raphson Method for Root Finding
17 pages
Data Summary of Demographic Variables
No ratings yet
Data Summary of Demographic Variables
14 pages
Bank Loan
No ratings yet
Bank Loan
85 pages
Credit Card Fraud Detection Model
No ratings yet
Credit Card Fraud Detection Model
1 page
Machine Learning Lab Manual
No ratings yet
Machine Learning Lab Manual
42 pages
ASSi2 DSBDA
No ratings yet
ASSi2 DSBDA
4 pages
OLS Regression with Singular Matrix Handling
No ratings yet
OLS Regression with Singular Matrix Handling
26 pages
Probability Statistics Report
No ratings yet
Probability Statistics Report
34 pages
E17ddba2 8f28 4673 85b5 C301bceae633 Ether Fraud Transaction Case Study
No ratings yet
E17ddba2 8f28 4673 85b5 C301bceae633 Ether Fraud Transaction Case Study
219 pages
Assignment Lab 1
No ratings yet
Assignment Lab 1
3 pages
Python ML for Engineers: Week 3
No ratings yet
Python ML for Engineers: Week 3
12 pages
ModuleAr Merged
No ratings yet
ModuleAr Merged
42 pages
Linear Regression for CPU Usage Prediction
No ratings yet
Linear Regression for CPU Usage Prediction
31 pages
Shill Bidding Data Analysis in R
No ratings yet
Shill Bidding Data Analysis in R
15 pages
Assignment: (Statistical Methods 2) - B. Dheeraj
No ratings yet
Assignment: (Statistical Methods 2) - B. Dheeraj
11 pages
Tsne On Credit Card
No ratings yet
Tsne On Credit Card
9 pages
ML Lab Manual (BCSL606)
No ratings yet
ML Lab Manual (BCSL606)
24 pages
Installing Missingno and Seaborn
No ratings yet
Installing Missingno and Seaborn
23 pages
Danmairo - Analysis - Ipynb - Colaboratory
No ratings yet
Danmairo - Analysis - Ipynb - Colaboratory
18 pages
Proc Robust Reg
No ratings yet
Proc Robust Reg
56 pages
Columns 1 Through 4
No ratings yet
Columns 1 Through 4
4 pages
Analyse of Variance Data Fertilisation
No ratings yet
Analyse of Variance Data Fertilisation
29 pages
Principles of Inheritance and Variation
No ratings yet
Principles of Inheritance and Variation
45 pages
Seaborn Ipynb
No ratings yet
Seaborn Ipynb
514 pages
CPIndex - Jan14 To Mar25
No ratings yet
CPIndex - Jan14 To Mar25
2 pages
BARC Interview Experience
No ratings yet
BARC Interview Experience
5 pages
Quiz1 Keys
No ratings yet
Quiz1 Keys
2 pages
Ees 23
No ratings yet
Ees 23
20 pages
Cbse Ja Admit Card
No ratings yet
Cbse Ja Admit Card
2 pages
CH 11 Quiz
No ratings yet
CH 11 Quiz
3 pages
Midterm 3401 Version 2
No ratings yet
Midterm 3401 Version 2
1 page
119 SEM Measures of Fit
No ratings yet
119 SEM Measures of Fit
1 page
Sustainability of Sharia Rural Bank in Central Java
No ratings yet
Sustainability of Sharia Rural Bank in Central Java
8 pages
Poldrack - Can Cognitive Processes Be Inferred From Neuroimaging Data
No ratings yet
Poldrack - Can Cognitive Processes Be Inferred From Neuroimaging Data
5 pages
T PFN: A T T S S T C P S: AB Ransformer HAT Olves Mall Abular Lassification Roblems in A Econd
No ratings yet
T PFN: A T T S S T C P S: AB Ransformer HAT Olves Mall Abular Lassification Roblems in A Econd
33 pages
Hasn Bot
No ratings yet
Hasn Bot
2 pages
Confidence Intervals for Population Proportions
50% (8)
Confidence Intervals for Population Proportions
11 pages
ANOVA Analysis in Manufacturing Processes
No ratings yet
ANOVA Analysis in Manufacturing Processes
7 pages
Module 9 10 Data Gathering Revised 2024
No ratings yet
Module 9 10 Data Gathering Revised 2024
5 pages
l3 Pox7003 T Test
No ratings yet
l3 Pox7003 T Test
73 pages
Introduction To Machine Learning - Unit 5 - Week 2
No ratings yet
Introduction To Machine Learning - Unit 5 - Week 2
4 pages
SAT Suite Question Bank - MathLogic
No ratings yet
SAT Suite Question Bank - MathLogic
44 pages
Variable 1 Variable 2
No ratings yet
Variable 1 Variable 2
7 pages
2math 4
100% (1)
2math 4
30 pages
Get Tools For Decision Making A Practical Guide For Local Government 2nd Edition David N. Ammons Free All Chapters
100% (17)
Get Tools For Decision Making A Practical Guide For Local Government 2nd Edition David N. Ammons Free All Chapters
84 pages
Social Statistics Concepts Explained
No ratings yet
Social Statistics Concepts Explained
4 pages
Forecast Methods For Time Series Data A Survey
No ratings yet
Forecast Methods For Time Series Data A Survey
18 pages
Basic Statistical Concepts Review
100% (6)
Basic Statistical Concepts Review
227 pages
Guitguit, Jazmine B. - Module 3
No ratings yet
Guitguit, Jazmine B. - Module 3
3 pages
ECO - Chapter 2 SLRM
No ratings yet
ECO - Chapter 2 SLRM
40 pages
DH302 Spring2025 Quiz01 Solutions
No ratings yet
DH302 Spring2025 Quiz01 Solutions
2 pages
Parametric Vs Nonparametric Tests
No ratings yet
Parametric Vs Nonparametric Tests
10 pages
PSQT
No ratings yet
PSQT
2 pages
Simple Linear Regression Guide
No ratings yet
Simple Linear Regression Guide
99 pages
Analyzing Within vs. Between Subject Effects
No ratings yet
Analyzing Within vs. Between Subject Effects
34 pages
Statexer#7
No ratings yet
Statexer#7
2 pages
Linear Regression with Gradient Descent
100% (1)
Linear Regression with Gradient Descent
8 pages
Statistics and Prob 11 Summative Test Q4
No ratings yet
Statistics and Prob 11 Summative Test Q4
5 pages
Econometrics Cheat Sheet Stock and Watson
100% (5)
Econometrics Cheat Sheet Stock and Watson
2 pages