0% found this document useful (0 votes)

39 views7 pages

Phase 3 IBM

The document outlines the process of model development and evaluation, including data preparation, algorithm selection, and performance metrics. It details the implementation of various models such as Random Forest, XGBoost, and LightGBM, along with their evaluation results and insights on model performance. The conclusion emphasizes the importance of advanced data cleaning, model building, and the role of AI in optimizing performance for fraud detection.

Uploaded by

Shreya Patil

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

39 views7 pages

Phase 3 IBM

Uploaded by

Shreya Patil

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

Model Development and Evaluation

In this phase, Model development and evaluation is a comprehensive process that begins with preparing
the data, including cleaning, feature engineering, and splitting it into training and testing sets. After
selecting an appropriate algorithm, such as Random Forest or Logistic Regression, the model is trained
on the training data. Once trained, the model’s performance is evaluated on test data using various
metrics such as accuracy, precision, recall, ROC curve, or for regression, MAE, MSE, and R². Cross-
validation can be used to assess the model's generalization, and hyperparameter tuning is applied to
optimize performance. The goal is to ensure the model is accurate, robust, and capable of performing
well on unseen data.

Step 1: Importing necessary libraries and loading the dataset

Importing all the required and necessary libraries to successfully implement the project. Advanced
techniques ensure the dataset is free from missing values, outliers, and imbalance issues.

1.1 Importing Libraries and loading the dataset

## Importing the necessary libraries
# General libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
import lightgbm as lgb

# Preprocessing libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

#loading the dataset

df = pd.read_csv(r"C:\Users\Downloads\archive (4)\data\dataset.csv")

1.2 Exploratory Data Analysis (EDA)

Displaying the metadata about the dataset

# display metadata about the dataset

df.info ()

1.3 Calculate the average popularity

# Calculate the average popularity for each genre

top_15_popular_genres =
df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False).head(15)

# Plot the top 15 most popular genres based on average popularity

plt.figure(figsize=(12, 6))
sns.barplot(x=top_15_popular_genres.index, y=top_15_popular_genres.values, palette='viridis')
plt.title("Top 15 Most Popular Genres (Based on Average Popularity)")
plt.xlabel("Genre")
plt.ylabel("Average Popularity")
plt.xticks(rotation=45)
plt.show()

1.4 Checking for duplicates, data preprocessing and label encoding

duplicates = df.duplicated().sum()

#4) Data Preprocessing

# Fill missing values in 'artists', 'album_name', and 'track_name' with 'Unknown'
df['artists'].fillna('Unknown', inplace=True)
df['album_name'].fillna('Unknown', inplace=True)
df['track_name'].fillna('Unknown', inplace=True)

# Label encode the target variable

le = LabelEncoder()
df['track_genre'] = le.fit_transform(df['track_genre'])

Step 2: Building and Training Models

2.1 Feature Selection and Splitting the Data

We started in feature selection by targeting the particular required variable and splited the data into
training and testing sets

• Training set of data includes 80%

• Testing set of data includes 20%

##5) Feature Selection and Splitting Data

# Features (all columns except 'track_genre')
X = df.drop(columns=['track_genre'])

# Target variable
y = df['track_genre']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

2.2 Model Development and Evauation

We initialized the Random Forest Classifier then the model is trained by fitting the values later the
predictions are done as per the predictions the model is evaluated.

##6) Model Development and Evaluation

#Random Forest Model
# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model

rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf.predict(X_test)

# Evaluate the model

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

2.3 Trying of Gradient Boosting Libraries

XGBoost model - XGBoost is an optimized and scalable version of gradient boosting that was
developed by Tianqi Chen. It has become very popular due to its speed, accuracy, and ability to handle
large datasets with complex features.
• Initialize the XGBoost Classifier
• Train the model
• Make predictions
• Evaluate

#XGBoost Model
# Initialize the XGBoost classifier
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)

# Train the model

xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model

print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))

LightGBM - LightGBM, developed by Microsoft, is another gradient boosting framework designed to

be fast and efficient, particularly for large datasets. It focuses on faster training times and lower
memory usage.
• Initialize the LightGBM Classifier
• Train the model
• Make predictions
• Evaluate

#LightGBM
# Initialize the LightGBM classifier
lgbm_model = lgb.LGBMClassifier(random_state=42)
# Train the model
lgbm_model.fit(X_train, y_train)

# Make predictions
y_pred_lgbm = lgbm_model.predict(X_test)

# Evaluate the model

print("LightGBM Accuracy:", accuracy_score(y_test, y_pred_lgbm))
print("Classification Report:\n", classification_report(y_test, y_pred_lgbm))

Trying ELBOW Method plot

#ELBOW METHOD PLOT
#to find optimal K

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

# Determine the optimal number of clusters using the Elbow Method

inertia = []
k_values = range(1, 11)

for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(train_data)
inertia.append(kmeans.inertia_)

# Plot the Elbow Method

plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

KMeans algotithm -K-Means is a clustering algorithm that groups similar data points into K clusters
based on their features, by minimizing the distance between each point and its closest cluster center.

2.3 Calculations
#CALCULATING ACCURACY,PRECISION,RECALL AND ROC CURVE FOR THE DESCRIBED
ALGORITHM{RANDOM FOREST CLASSIFIER}

accuracy = accuracy_score(y_test, predicted_labels)

precision = precision_score(y_test, predicted_labels, average='weighted')
recall = recall_score(y_test, predicted_labels, average='weighted')
f1 = f1_score(y_test, predicted_labels, average='weighted')

# Calculate ROC AUC Score

# For multi-class, we calculate the ROC AUC score using a one-vs-rest scheme
y_test_bin = label_binarize(y_test, classes=[0, 1, 2]) # Binarize the true labels
y_score = rf.predict_proba(X_test) # Get the probability estimates for each class
# Compute ROC AUC score
roc_auc = roc_auc_score(y_test_bin, y_score, multi_class='ovr')
print(f"ROC AUC Score: {roc_auc}")
2.3 Recommendation system training with kmeans
#recommendation system with kmeans
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend_songs(song_name, df, num_recommendations=5):

# Get the cluster of the input song
song_cluster = df[df["name"] == song_name]["Cluster"].values[0]

# Filter songs from the same cluster

same_cluster_songs = df[df["Cluster"] == song_cluster]

# Calculate similarity within the cluster

song_index = same_cluster_songs[same_cluster_songs["name"] == song_name].index[0]
cluster_features = same_cluster_songs[numerical_features]
similarity = cosine_similarity(cluster_features, cluster_features)

# Get top recommendations

similar_songs = np.argsort(similarity[song_index])[-(num_recommendations + 1):-1][::-1]
recommendations = same_cluster_songs.iloc[similar_songs][["name", "year", "artists"]]

return recommendations

Step 3: Using of AI
3.1 What is AI?
AI can be used to automate tasks, gain insights from data, and make decisions or predictions more
accurately and efficiently than humans. It can also be used to enhance customer experiences, improve
productivity, and drive innovation.

3.2 Why Use of AI?

AI in model development helps unlock the potential of complex, large-scale data, automates repetitive
tasks, improves performance and decision-making, and makes the process more scalable and
adaptable. By incorporating AI, organizations can create more accurate, efficient, and intelligent
models that provide actionable insights, enhance user experiences, and drive innovation.

Step 4: Model Evaluation

4.1 Why Evaluate?
Evaluation refers to the process of assessing the performance of a machine learning model or algorithm.
This involves measuring how well the model is able to make predictions, classify data, or complete other
tasks.

Metrics Used:

1. Accuracy: Measures overall correctness.

2. Precision & Recall: For imbalanced datasets.
3. ROC AUC: Evaluates classification performance.
4. Fairness Metrics: Evaluates model bias.

Step 5: Results and Insights

Comparison:

1. Random Forest Results:

• Accuracy: 0.3
• Precision (Weighted): 0.3132867132867133
• Recall (Weighted): 0.3
• F1-Score (Weighted): 0.30295652173913046
• ROC AUC Score: 0.5200066137566136
2. XGBoost Classifer
• XGBoost Accuracy: 0.31407894736842107
• Classification Report:
precision recall f1-score support

0 0.19 0.17 0.18 213

1 1 0.37 0.34 0.36 203
…………….
3. LightGBM Classifier
• LightGBM Accuracy: 0.1287280701754386
• Classification Report:
precision recall f1-score support

0 0.06 0.04 0.05 213

1 0.09 0.15 0.11 203

Observations:

1. Model Performance Analysis:

● Random Forest: Enhanced accuracy and recall due to its ensemble approach, demonstrating
robustness and effectiveness as an initial model.
● XGBoost: Calculates the accuracy, ,acro avg and also weighted avg. A better scalable version
among gradient boosting libraries.
● LightGBM: Auto-choosing col-wise multi-threading, the overhead of testing is done. Starts
the training from score and tries to split the data with positive gain and best gain until the
accuracy and average is computed.

2. Evaluation Metrics:

● Precision: Measures the proportion of correctly identified fraud cases out of all predicted fraud
cases.
● Recall: Focuses on how many actual fraud cases were detected out of the total fraud cases
present.
● F1-score: Balances precision and recall, offering a comprehensive view of model performance
in the context of fraud detection.

3. Insights on Model Accuracy:

While the model achieved 99% accuracy, this metric can be misleading due to the following factors:

● Class Imbalance: Fraud cases often constitute less than 1% of the dataset. A model predicting
all cases as "not fraud" can yield high accuracy but fail at identifying fraud effectively.
● Overfitting: High accuracy may indicate the model has overfitted the training data, reducing
its ability to generalize to unseen data.
● Evaluation Metrics: For imbalanced datasets, metrics like precision, recall,
F1-score, and ROC AUC provide deeper insights into the model’s true performance.
.
Key Takeaways:

● XGBoost Classifier: Regularization includes both L1 (Lasso) and L2 (Ridge) regularization to

prevent overfitting, which is an important feature missing in traditional gradient boosting.
Parallelization: It supports parallel processing for faster training by splitting tasks like
computing gradients across multiple cores.
Handling Missing Data: XGBoost has built-in support for handling missing values
during training, automatically learning the best way to deal with them.
● LightGBM: Histogram-based Approach in LightGBM uses a histogram-based approach for
binning continuous features, which leads to faster training and lower memory consumption.
Leaf-wise Growth: LightGBM grows trees leaf-wise, as opposed to level-wise growth
in traditional gradient boosting. This often leads to deeper trees and better model performance,
but it can be more prone to overfitting in smaller datasets.
Efficient for Large Datasets: It is highly efficient when dealing with large datasets and
can be distributed across multiple machines for even faster training.
● Metrics Comparison: A detailed comparison of metrics like precision, recall,
F1-score, and ROC AUC highlighted the trade-offs in detecting fraud. These metrics proved
more informative than accuracy for evaluating model performance on imbalanced datasets.

Conclusion:
This project underscored the critical importance of advanced data cleaning, iterative model building,
and comprehensive evaluation to handle real-world challenges.AI emerged as a valuable asset,
offering efficient optimization and reliable performance, especially in resource-constrained
environments. By integrating manual techniques with automated solutions, the project achieved a
robust and fair model that effectively addresses the complexities of fraud detection, providing
actionable insights and scalable solutions.

Pa Unit 4
No ratings yet
Pa Unit 4
5 pages
CS326 Report
No ratings yet
CS326 Report
36 pages
Module 4 - Supervised Learning - First ML Model
No ratings yet
Module 4 - Supervised Learning - First ML Model
23 pages
Minor Project
No ratings yet
Minor Project
21 pages
Evaluating Machine Learning Models
100% (2)
Evaluating Machine Learning Models
10 pages
Data Science Assignment 2
No ratings yet
Data Science Assignment 2
14 pages
AIML-Unit 5 Notes-Assignment 5
No ratings yet
AIML-Unit 5 Notes-Assignment 5
24 pages
Machine Learning Evaluation Metrics Guide
No ratings yet
Machine Learning Evaluation Metrics Guide
7 pages
Progress of GRADIENT BOOSTING ALGORITHM FOR ELECTRICITY THEFT DETECTION IN POWER UTILITIES
No ratings yet
Progress of GRADIENT BOOSTING ALGORITHM FOR ELECTRICITY THEFT DETECTION IN POWER UTILITIES
10 pages
Lab 2
No ratings yet
Lab 2
17 pages
DA PRA WEEK 13 (Random Forest) - 054551
No ratings yet
DA PRA WEEK 13 (Random Forest) - 054551
12 pages
Ds Notes Mca
No ratings yet
Ds Notes Mca
30 pages
Unit 5
No ratings yet
Unit 5
11 pages
Assignment 9
No ratings yet
Assignment 9
8 pages
04 - Model Selection
No ratings yet
04 - Model Selection
62 pages
(REPORT) LAB - 2 - Decision - Tree
No ratings yet
(REPORT) LAB - 2 - Decision - Tree
17 pages
Unit 4.modelselection
No ratings yet
Unit 4.modelselection
26 pages
ML Overview
No ratings yet
ML Overview
11 pages
Draft Xai
No ratings yet
Draft Xai
16 pages
Amlt Bca Unit-2
No ratings yet
Amlt Bca Unit-2
5 pages
Lecture 5 - Feature Extraction, Model Building & Evaluation
No ratings yet
Lecture 5 - Feature Extraction, Model Building & Evaluation
35 pages
Machine Learning Lecture 2,3,4
No ratings yet
Machine Learning Lecture 2,3,4
26 pages
How To Create A Python Model
No ratings yet
How To Create A Python Model
29 pages
ES335
No ratings yet
ES335
22 pages
Unit 04 EDA 02
No ratings yet
Unit 04 EDA 02
7 pages
AI For Eng Supervised-Learning
No ratings yet
AI For Eng Supervised-Learning
25 pages
Model Selection On ML
No ratings yet
Model Selection On ML
49 pages
DMBI
No ratings yet
DMBI
15 pages
DTC Algorithm Implementation Guide
No ratings yet
DTC Algorithm Implementation Guide
7 pages
Machine Learning Cheat Sheet
No ratings yet
Machine Learning Cheat Sheet
15 pages
ML Lecture 15 Ensemble
No ratings yet
ML Lecture 15 Ensemble
27 pages
05 - Ensemble Learning
No ratings yet
05 - Ensemble Learning
39 pages
Assign2 01clc.06 Duongmt
No ratings yet
Assign2 01clc.06 Duongmt
23 pages
Divorce Prediction Using ML
No ratings yet
Divorce Prediction Using ML
12 pages
Model Learning Steps
No ratings yet
Model Learning Steps
12 pages
Machine Learning Cheat Sheet: Karn Singh
No ratings yet
Machine Learning Cheat Sheet: Karn Singh
13 pages
Northbay Summarizes Data Pre-Processing Algorithms
No ratings yet
Northbay Summarizes Data Pre-Processing Algorithms
10 pages
CH 3
No ratings yet
CH 3
33 pages
Assessing Predictive Models
No ratings yet
Assessing Predictive Models
25 pages
Machine Learning Lecture1 - 26-27 Aug
No ratings yet
Machine Learning Lecture1 - 26-27 Aug
30 pages
Machine Learning Fundamentals Overview
No ratings yet
Machine Learning Fundamentals Overview
4 pages
Hyperparameter Optimization Techniques
No ratings yet
Hyperparameter Optimization Techniques
4 pages
Python ML Methods Cheatsheet
No ratings yet
Python ML Methods Cheatsheet
6 pages
Chapter 2 Machine Learning Draft-85-172
No ratings yet
Chapter 2 Machine Learning Draft-85-172
88 pages
Jalali@mshdiua - Ac.ir Jalali - Mshdiau.ac - Ir: Data Mining
No ratings yet
Jalali@mshdiua - Ac.ir Jalali - Mshdiau.ac - Ir: Data Mining
50 pages
ML Algorithms Comprehensive Study
No ratings yet
ML Algorithms Comprehensive Study
9 pages
Moocs Ritesh
No ratings yet
Moocs Ritesh
22 pages
AIML 7 To 11
No ratings yet
AIML 7 To 11
7 pages
Week 7 Laboratory Activity
No ratings yet
Week 7 Laboratory Activity
12 pages
Module 3 Data Science Machine Learning
No ratings yet
Module 3 Data Science Machine Learning
53 pages
BigML WhizzML Tutorials
No ratings yet
BigML WhizzML Tutorials
45 pages
Unit 3
No ratings yet
Unit 3
63 pages
INT524 Unit3
No ratings yet
INT524 Unit3
35 pages
CH 7 Ensemble Learning
No ratings yet
CH 7 Ensemble Learning
34 pages
Class 2a-Decision Trees
No ratings yet
Class 2a-Decision Trees
28 pages
Social Policy of The Family Timeline
No ratings yet
Social Policy of The Family Timeline
3 pages
BPL SC
No ratings yet
BPL SC
1 page
Danijel Turina A Yogi Approach OUROBOROS (2004)
No ratings yet
Danijel Turina A Yogi Approach OUROBOROS (2004)
238 pages
ISTA Rules 2018
100% (1)
ISTA Rules 2018
298 pages
Emotional Design Analysis
No ratings yet
Emotional Design Analysis
15 pages
Minor Project Final
No ratings yet
Minor Project Final
39 pages
Principles and Practice of Assisted Reproductive Technology (3 Volumes) online version
No ratings yet
Principles and Practice of Assisted Reproductive Technology (3 Volumes) online version
306 pages
First Quarter - BDRRMC
No ratings yet
First Quarter - BDRRMC
2 pages
Neurodevelopmental Disorders Across The Lifespan: A Neuroconstructivist Approach
100% (8)
Neurodevelopmental Disorders Across The Lifespan: A Neuroconstructivist Approach
39 pages
MANUAL OF TOXI-Forensic Toxicology
No ratings yet
MANUAL OF TOXI-Forensic Toxicology
14 pages
Muhammad Imran Rahmanzai: New Road Chilsetoon, Kabul, Afghanistan Cell: +93-786-242526 25 July, 2023
No ratings yet
Muhammad Imran Rahmanzai: New Road Chilsetoon, Kabul, Afghanistan Cell: +93-786-242526 25 July, 2023
4 pages
Organic Chemistry Basics
100% (1)
Organic Chemistry Basics
44 pages
Chinese Special - English
No ratings yet
Chinese Special - English
3 pages
AP Socio Economic Survey 2014-2015
No ratings yet
AP Socio Economic Survey 2014-2015
359 pages
ICT Trading Roadmap
No ratings yet
ICT Trading Roadmap
7 pages
Postharvest Machinery 1
No ratings yet
Postharvest Machinery 1
4 pages
Paper With Sol IJSO Stage-1 Code-JS533 19-11-2017
No ratings yet
Paper With Sol IJSO Stage-1 Code-JS533 19-11-2017
18 pages
Tutorial
No ratings yet
Tutorial
10 pages
Welder Qualification
No ratings yet
Welder Qualification
57 pages
Environmental
No ratings yet
Environmental
124 pages
Volvo KAD Etc Operators
100% (1)
Volvo KAD Etc Operators
104 pages
Drug Study - Amoxicillin
No ratings yet
Drug Study - Amoxicillin
2 pages
Installation Manual (MAST-A)
No ratings yet
Installation Manual (MAST-A)
15 pages
The Photoconductive Cell
No ratings yet
The Photoconductive Cell
2 pages
Wuthering Heights: Dark Love Analysis
No ratings yet
Wuthering Heights: Dark Love Analysis
1 page
Siemens Installation Scope Matrix for QP-6
No ratings yet
Siemens Installation Scope Matrix for QP-6
1 page
Hexam Mine Nickel Chloride Synthesis
No ratings yet
Hexam Mine Nickel Chloride Synthesis
4 pages
1719 Service Manual
No ratings yet
1719 Service Manual
112 pages
HABTE BEZ Proposal
No ratings yet
HABTE BEZ Proposal
39 pages
BORJA Genetics Lab Exercise 6 Multiple Alleles and ABO Blood Groups in Humans
No ratings yet
BORJA Genetics Lab Exercise 6 Multiple Alleles and ABO Blood Groups in Humans
5 pages