Uploaded by Pulkit Dubey
Credit Card Fraud Detection

For this project, I worked on building a machine learning model to detect fraudulent credit card transactions. The dataset had 1,000 entries, and the target column is_fraud indicated whether a transaction was fraudulent or legitimate.

Data Preprocessing
The dataset included both numerical and categorical columns. I used one-hot encoding for
categorical features like gender, transaction category, and state, while I applied frequency
encoding for city names to avoid a high number of dummy variables. I also removed columns like
latitude, longitude, and credit card number, which I felt were either too specific or irrelevant for
the model. After encoding, I checked for imbalance and used SMOTE to oversample the minority
(fraud) class to help models learn better.
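The encoding steps above can be sketched in pandas. The column names and toy values here are illustrative stand-ins for the real transaction data, not the actual schema:

```python
import pandas as pd

# Toy frame standing in for the transaction data (column names are illustrative)
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "category": ["grocery", "travel", "grocery", "gas"],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "is_fraud": [0, 1, 0, 0],
})

# One-hot encode the low-cardinality categoricals
df = pd.get_dummies(df, columns=["gender", "category"])

# Frequency-encode the high-cardinality city column instead of dummying it:
# each city is replaced by its share of all transactions
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))
df = df.drop(columns=["city"])

# Class imbalance could then be addressed with imbalanced-learn's SMOTE, e.g.:
#   from imblearn.over_sampling import SMOTE
#   X_res, y_res = SMOTE().fit_resample(X, y)
```

Frequency encoding keeps one numeric column per high-cardinality feature, which is why it is preferable to one-hot encoding for something like city names.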

Model Selection and Evaluation


I tried three models: Logistic Regression, Random Forest, and XGBoost. I chose these because
they’re commonly used for classification tasks and work well on tabular data. For evaluation, I
used accuracy, precision, recall, and F1-score — since fraud detection is an imbalanced
classification problem, precision and recall matter more than just accuracy.
Here’s what I found:
- Random Forest gave the best result overall with an F1-score of ~0.28.
- XGBoost followed closely.
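The point about accuracy being misleading on imbalanced data can be shown with a small worked example, computed by hand with no libraries. The label counts here are made up for illustration:

```python
# Precision, recall and F1 from raw predictions (toy imbalanced data)
y_true = [0] * 90 + [1] * 10                  # 10% fraud
y_pred = [0] * 88 + [1] * 2 + [1] * 3 + [0] * 7  # 2 false alarms, 3 caught, 7 missed

tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))  # true positives
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))  # false positives
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))  # false negatives

precision = tp / (tp + fp)                    # 3 / 5  = 0.6
recall = tp / (tp + fn)                       # 3 / 10 = 0.3
f1 = 2 * precision * recall / (precision + recall)

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
```

Here accuracy comes out at 0.91 even though the model misses 70% of the fraud cases, which is exactly why F1 is the headline metric for this problem.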

Visualizations and Insights


I created several graphs during EDA. One important plot was the correlation heatmap, which shows how features are correlated with one another. Interestingly, most features weren’t strongly correlated, suggesting the model has to learn from patterns across multiple weak signals. I also visualized transaction trends by day of the week and the fraud distribution, which helped guide some of the preprocessing steps.
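The heatmap itself is just a rendering of the pairwise correlation matrix. A minimal sketch, using a toy numeric frame in place of the real encoded data:

```python
import pandas as pd

# Toy numeric frame; in the project this would be the encoded transaction data
df = pd.DataFrame({
    "amount": [10.0, 200.0, 15.0, 180.0],
    "hour": [9, 2, 11, 3],
    "is_fraud": [0, 1, 0, 1],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()

# seaborn renders this matrix as a heatmap, e.g.:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Scanning the is_fraud row of this matrix is a quick way to see which individual features carry a (weak) signal on their own.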

Challenges and Learnings


One of the biggest challenges was handling the imbalanced dataset: many models performed poorly on fraud cases despite decent accuracy. I also had to clean and encode the data carefully to avoid errors such as unencoded string columns crashing the models.
In just a week, I learned a lot. Even though the metrics weren’t perfect, this was a valuable starting point, and I’m excited to keep improving it.
