Data Mining COMP5009
Assignment Report
Student ID 22442066 – David Nathanael Dharma Humala
Faculty of Science and Engineering – Curtin University
Contents

Figure List
Executive Summary
Methodology
    Data Preparation
        Convert Data Type
        Duplicate Values and Irrelevant Data
        Missing Values
        Data Transformation
        Class Balancing and Data Splitting
    Data Classification
        K-NN (K Nearest Neighbors)
        Naïve Bayes
        Decision Tree
        Random Forest
    Prediction
Conclusion
References
Figure List

Figure 1 Converting Data Type
Figure 2 Data Correlation
Figure 3 Boxplot Before Scaling
Figure 4 z-scale formula
Figure 5 Scaling Result
Figure 6 Boxplot After Scaling
Figure 7 Class Distribution
Figure 8 Confusion Matrix Random Forest and K-NN
Figure 9 Model Comparison
Executive Summary
In this report, I describe how the dataset was prepared and handled, and how it was subsequently used to predict the target variable. The dataset consists of 5,000 rows of training data and 500 rows of test data. After loading the data into the notebook, I first checked the data types, structure, and general information.
The first step in preparation was to check for duplicate rows and remove them. Next, I converted data types from string to categorical or Boolean where appropriate. I then checked for missing data in the dataset. Missing data can affect the prediction results because it leads to incomplete information. Columns with more than 60% missing values were dropped. For columns with less than 5% missing data, the missing values were filled with the mean using the ‘fillna’ method. For columns with around 18% missing values, I used iterative imputation, which employs regression models based on the other columns to predict the missing values; randomness is used in these predictions.
I also checked the correlation among the attributes using Pearson correlation to determine whether each relationship is positive or negative. If two features are highly correlated, one of them can be dropped, because they carry almost the same information and keeping both can produce overfitting at prediction time.
After cleaning the data of missing values, irrelevant data, and duplicates, I checked for outliers
using boxplots to visualize the data distribution. I applied scaling (z-score standardization), as it
is less sensitive to outliers and ideal when the data is roughly normal. This concludes the data
preparation process.
The next step was to check the class distribution. The class data needed to be balanced before
classification. I used the SMOTE technique to balance the classes, especially when one class
(typically the "positive" or "rare" class) had far fewer samples than the others. For cross-validation,
I used StratifiedKFold because it splits the data randomly while keeping class proportions
consistent, which is beneficial for imbalanced data.
After all data preparation steps were completed, I classified the data using four methods: k-NN (K-Nearest Neighbors), Naïve Bayes, Decision Tree, and Random Forest. Among these classifiers, k-NN and Random Forest achieved the highest accuracy. Using these two models, I predicted the class labels for the test data.
Methodology
Data Preparation
Convert Data Type
The first step is to check the data type of every attribute in the dataset and make sure all data are either categorical or numerical (float or int). In this dataset, I converted the string columns into the category dtype, because each string column holds a small set of unique, frequently repeated values; I then converted the categorical values into integer codes (1, 2, 3, 4, etc.).
Figure 1 Converting Data Type
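A minimal sketch of this conversion (assuming the data is loaded into a pandas DataFrame; the file name is a placeholder):

    import pandas as pd

    # Load the training data (file name is an assumption).
    df = pd.read_csv("train.csv")

    # Convert every string (object) column to the category dtype, then
    # replace each category with its integer code (0, 1, 2, ...).
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes.astype("int64")

    print(df.dtypes)  # all columns are now numeric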
Duplicate Values and Irrelevant Data
After that, I checked for duplicate rows in the dataset and counted them; no duplicates were found. Irrelevant attributes should also be dropped: the Index attribute can be removed because it carries no meaning, only the row order. Redundant attributes can be found using a correlation matrix between the numeric attributes (Figure 2). If a pair of features has a correlation above 0.9 or below −0.9, the two features are highly correlated and one of them can be dropped, because it carries almost the same information.

Figure 2 Data Correlation
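A sketch of the duplicate and correlation checks under the same assumptions (df is the training DataFrame; "Index" is the order-only attribute):

    import numpy as np

    # Count duplicate rows (none were found in this dataset).
    print("Duplicates:", df.duplicated().sum())

    # Drop the attribute that only encodes the row order.
    df = df.drop(columns=["Index"], errors="ignore")

    # Pearson correlation matrix of the numeric attributes (Figure 2).
    corr = df.select_dtypes(include="number").corr(method="pearson")

    # List feature pairs whose correlation is above 0.9 or below -0.9.
    pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
    print(pairs[pairs.abs() > 0.9])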
Missing Values
At the beginning, I checked the missing values (for example with ‘isna().sum()’ or the helper ‘missing()’), which reports each attribute together with its percentage of missing values.

In the data, three attributes have missing values: “Balak”, ”Djoop” and “Woorine”, with 2.96%, 18.12% and 61.74% missing respectively. Woorine has the highest proportion of missing values and was dropped, because a column that is mostly empty contributes little reliable information. Because Balak has the fewest missing values, I filled them with the mean: the standard deviation of Balak is quite small and the column is symmetrically distributed, so the mean is a good central estimate. Djoop has a moderate amount of missing data, so I imputed it using a regression model repeated for up to 10 iterations; this approach works well for complex datasets with multiple interrelated variables. After handling the missing values, I checked again using ‘missing()’ or ‘isna().sum()’: all the missing values are gone.
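A sketch of this handling, assuming all columns are already numeric and using scikit-learn's IterativeImputer for the regression-based step:

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Percentage of missing values per attribute.
    print(df.isna().mean().mul(100).round(2))

    # Woorine (~62% missing) is too sparse to impute reliably: drop it.
    df = df.drop(columns=["Woorine"])

    # Balak (~3% missing, symmetric distribution): fill with the mean.
    df["Balak"] = df["Balak"].fillna(df["Balak"].mean())

    # Djoop (~18% missing): regression-based iterative imputation, run
    # for up to 10 rounds, seeded so the randomness is reproducible.
    imputer = IterativeImputer(max_iter=10, random_state=42)
    df[df.columns] = imputer.fit_transform(df)

    assert df.isna().sum().sum() == 0  # all missing values are gone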
Data Transformation
Data transformation modifies the data into a format or structure suitable for analysis, modeling, or storage. This includes handling extreme values so they do not distort the models. For scaling/standardization I used the z-scale, because it rescales the dataset toward a standard normal distribution and is less sensitive to extreme values than min-max scaling.

I checked for outliers using a boxplot to visualize the data distribution.

Figure 3 Boxplot Before Scaling

In the picture, we can see the distribution and the outliers in every attribute. The attributes have different distributions and scales; to fix that, I applied the z-scale and transformed the dataset.
z = (x − μ) / σ

Figure 4 z-scale formula

Where:
x = original value of the feature
μ = mean of the feature
σ = standard deviation of the feature
z = standardized (z-scaled) value
After scaling, a sample of the resulting values is shown below:
Dooga Booladarlung Dembart Darbal Miro Balak Yonga Djarlma Ngoornt Ngooloormayup Amangu
-0.1170 -1.7882 0.9583 0.1156 -0.4953 -2.0030 -0.1559 -0.2689 0.1754 -0.9615 0.7240
0.9326 0.1766 0.2745 1.0872 -0.3431 -0.5462 -0.4981 0.7972 0.0233 -0.6897 0.6456
0.9061 1.1314 -0.6510 -1.3458 -0.6879 -0.0218 -1.3469 -0.6497 0.0322 1.6200 0.0098
0.1232 1.0543 0.1081 1.1352 -0.2735 -0.1024 0.0762 0.3138 -0.0097 -0.1472 0.5771
1.2439 0.1942 0.3829 -0.2859 -1.2294 -0.2870 -0.1915 0.3569 -0.0874 -0.6604 -0.7291
0.1472 -0.4523 0.7977 -1.7025 -1.0259 0.6033 -1.0477 0.6503 -0.0940 0.4420 -0.4770
Figure 5 Scaling Result
The data has been scaled using the mean and standard deviation of each feature. The boxplot has also changed:
Figure 6 Boxplot After Scaling
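A sketch of the scaling step with scikit-learn's StandardScaler (in a full pipeline the scaler would be fitted on the training split only):

    from sklearn.preprocessing import StandardScaler

    # z-score standardization: z = (x - mean) / std for each feature.
    scaler = StandardScaler()
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    print(df[numeric_cols].head())  # compare with Figure 5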
Class Balancing and Data Splitting
In the dataset, the classes were found to be imbalanced and needed to be balanced, because models trained on imbalanced data are biased toward the majority class, leading to misleading accuracy and poor performance (Aggarwal, 2015). After confirming the imbalance, I balanced the classes using SMOTE. SMOTE is an oversampling technique that creates artificial examples of the minority class (here, "Ngoon") instead of duplicating existing ones. This helps prevent overfitting and improves model generalization (Meng & Li, 2022).
New sample = Original + random(0, 1) × (Neighbor − Original)
Class      Before Balancing   After Balancing
Djook      1993 (39.86%)      1594 (33%)
Koolang    1991 (39.82%)      1594 (33%)
Ngoon      1016 (20.32%)      1594 (33%)

(Each class count after balancing is 1594 ≈ 0.8 × 1993, which indicates the balancing was applied to the 80% training split described below.)
Figure 7 Class Distribution
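A sketch of the balancing step with imbalanced-learn's SMOTE; the target column name "Class" is a placeholder:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    X = df.drop(columns=["Class"])  # feature matrix ("Class" is a placeholder)
    y = df["Class"]                 # target labels

    # SMOTE interpolates each new minority sample between a real point
    # and one of its nearest minority neighbours:
    #   new = original + random(0, 1) * (neighbor - original)
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y_bal))           # all classes now have equal counts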
Imbalanced data can also be handled through data splitting and cross-validation, which helps avoid overfitting and gives robust performance estimates. In this scenario I used StratifiedKFold to split the data, because it works well with SMOTE: each fold maintains the same class ratio as the original dataset, whereas a standard KFold might create folds where minority classes are underrepresented (Chugani, 2024).

I split the data for cross-validation into 80% training and 20% validation with random_state=42, and set StratifiedKFold to 10 folds. All attributes must be numeric (np.number); the procedure cannot run if the dataset contains object or other non-numeric dtypes.
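A sketch of the split and cross-validation setup, assuming SMOTE is applied after the split (consistent with the class counts in Figure 7):

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import StratifiedKFold, train_test_split

    # 80/20 stratified split (random_state=42), then balance only the
    # training portion.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42)
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # 10-fold stratified cross-validation used for model selection below.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)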
Data Classification
For classification, I used four classifiers to estimate the accuracy and predict the class attribute.
K-NN (K Nearest Neighbors)
K-NN uses an approach based on the nearest data points. K-NN does not build an explicit model during training; instead, it stores all the training data and makes predictions based on the similarity between new, unseen data points and the stored training examples. For this data, I tuned the weight type over Uniform and Distance: uniform gives a straightforward majority vote, while distance assumes that closer points are more likely to share the class of the query point. The distance metric was tuned over Euclidean and Manhattan.
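A sketch of this tuning with GridSearchCV; the n_neighbors range is an assumption (the reported best value is 9), and cv, X_train, y_train come from the setup above:

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    param_grid = {
        "n_neighbors": list(range(1, 21)),   # assumed search range
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan"],
    }
    knn = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv)
    knn.fit(X_train, y_train)
    print(f"Estimated prediction accuracy: {knn.best_score_:.3f}")
    print(knn.best_params_)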
K-NN produced the following estimated prediction accuracy and best parameters:
Estimated prediction accuracy: 0.860 = 86%
{'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'distance'}
The estimated prediction accuracy of K-NN is 86%, and the prediction accuracy is 82%.
Naïve Bayes
The Naïve Bayes classifier is a probabilistic model based on Bayes’ theorem (Chugani, 2024), which assumes that the features are conditionally independent given the class label. This means that, to make a prediction, the model calculates the probability of each class given the input features, under the assumption that each feature contributes independently to that probability. For this dataset, I used Gaussian Naïve Bayes, which is suitable for continuous, normally distributed data. For model tuning, I performed a grid search over the var_smoothing parameter, which controls the amount of variance added to the data to prevent division by zero and improve numerical stability. The grid search tested 100 logarithmically spaced values between 10^0 and 10^−9. Model selection and performance estimation were conducted using 10-fold stratified cross-validation with the weighted F1-score as the evaluation metric. This approach ensures that class proportions are preserved in each fold and that the metric fairly evaluates performance across all classes: the weighted F1-score handles imbalanced data and gives a more realistic measure of overall model performance.
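A sketch of this grid search; the logarithmic grid matches the reported best value of about 1.23 × 10^−4:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import GaussianNB

    # 100 logarithmically spaced values from 1e0 down to 1e-9.
    param_grid = {"var_smoothing": np.logspace(0, -9, num=100)}

    nb = GridSearchCV(GaussianNB(), param_grid, cv=cv, scoring="f1_weighted")
    nb.fit(X_train, y_train)
    print("Best Naive Bayes Parameters:", nb.best_params_)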
Naïve Bayes produced the following estimated prediction accuracy and best parameter:
Estimated prediction accuracy: 0.621 = 62.1%
Best Naive Bayes Parameters: {'var_smoothing': np.float64(0.0001232846739442066)}
The estimated prediction accuracy of Naïve Bayes is 62.1%, and the prediction accuracy is 59%.
Decision Tree
The Decision Tree classifier is a learning algorithm that models decisions and their possible consequences as a tree-like structure (Chugani, 2024). At each internal node, the algorithm selects a feature and a threshold to split the data into subsets, aiming to maximize the separation between classes according to a chosen criterion. The process continues recursively, creating branches until the leaves represent class labels or stopping criteria are met.
In this dataset, I used the DecisionTreeClassifier from scikit-learn. To optimize the model, I
performed a grid search over several key hyperparameters:
- max_depth, which controls the maximum depth of the tree and helps prevent overfitting by limiting how specific the model can become.
- min_samples_split, which determines the minimum number of samples required to split an internal node, affecting the granularity of the splits.
- criterion, which specifies the function used to measure the quality of a split, with options including "gini" for Gini impurity and "entropy" for information gain.
The grid search was conducted using 10-fold stratified cross-validation to ensure that class
proportions were preserved in each fold, which is particularly important for imbalanced datasets.
The weighted F1-score was used as the evaluation metric, providing a balanced assessment of
model performance across all classes by accounting for both precision and recall, weighted by
class frequency.
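A sketch of this grid search; the candidate values are assumptions chosen to include the reported best parameters:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    param_grid = {
        "max_depth": [5, 10, 15, 20, None],  # assumed candidate values
        "min_samples_split": [2, 5, 10],
        "criterion": ["gini", "entropy"],
    }
    dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=cv, scoring="f1_weighted")
    dt.fit(X_train, y_train)
    print("Best Decision Tree Parameters:", dt.best_params_)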
Decision Tree produced the following estimated prediction accuracy and best parameters:
Estimated prediction accuracy: 0.763 = 76.3%
Best Decision Tree Parameters: {'criterion': 'gini', 'max_depth': 15, 'min_samples_split': 2}
The estimated prediction accuracy of Decision Tree is 76.3%, and the prediction accuracy is 72%.
Random Forest
The Random Forest classifier is an ensemble model that combines multiple decision trees to improve generalization and reduce overfitting. Every tree is trained on a random subset of the data and considers a random subset of features at each split, which introduces diversity among the trees. The prediction is made by aggregating the votes from all trees (Chugani, 2024).
Key hyperparameters tuned:
- n_estimators: the number of decision trees in the forest. More trees generally improve performance but increase computational cost.
- max_depth: the maximum depth of individual trees. Limiting depth prevents overfitting by restricting tree complexity.
- min_samples_split: the minimum number of samples required to split an internal node. Higher values create simpler trees.
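A sketch of the tuning, with assumed candidate values that include the reported best parameters; as before, the weighted F1-score and the stratified folds are used:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 200],          # assumed candidate values
        "max_depth": [10, 20, None],
        "min_samples_split": [2, 5],
    }
    rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=cv, scoring="f1_weighted")
    rf.fit(X_train, y_train)
    print("Best Random Forest Parameters:", rf.best_params_)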
Random Forest produced the following estimated prediction accuracy and best parameters:
Estimated prediction accuracy: 0.908 = 90.8%
Best Random Forest Parameters: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 200}
The estimated prediction accuracy of Random Forest is 90.8%, and the prediction accuracy is 88%.
Prediction
After completing the classification models, I predicted the test data using the models trained on the training data. I chose the two models with the highest accuracy.
Classifier      Estimated Accuracy   Prediction Accuracy   Delta
K-NN            86%                  82%                   4%
Naïve Bayes     62.1%                59%                   3.1%
Decision Tree   76.3%                72%                   4.3%
Random Forest   90.8%                88%                   2.8%
1. K-NN with 82%
2. Random Forest with 88%
The accuracy of the models can also be examined through the confusion matrix:
Figure 8 Confusion Matrix Random Forest and K-NN
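A sketch of producing these matrices for the two best models on the held-out validation data (variable names follow the earlier sketches):

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    # Confusion matrices of the two best models (Figure 8).
    ConfusionMatrixDisplay.from_estimator(rf.best_estimator_, X_val, y_val)
    ConfusionMatrixDisplay.from_estimator(knn.best_estimator_, X_val, y_val)
    plt.show()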
To use the test data, I had to prepare it so that it has the same shape as the training data: I dropped the Woorine and Index attributes and converted the object dtypes into integer dtypes. This gives the test data the same structure as the training data, so the class labels can be predicted.
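A sketch of this preparation and the final prediction; the file name is a placeholder, and in practice the exact preprocessing fitted on the training data (encoding, imputation, scaling) must be reused on the test features:

    import pandas as pd

    # Load and align the test data with the training structure.
    test = pd.read_csv("test.csv")  # placeholder file name
    test = test.drop(columns=["Woorine", "Index"], errors="ignore")
    for col in test.select_dtypes(include="object").columns:
        test[col] = test[col].astype("category").cat.codes.astype("int64")

    # Reuse the scaler fitted on the training features, then predict
    # with the two highest-accuracy models.
    test[test.columns] = scaler.transform(test)
    pred_rf = rf.best_estimator_.predict(test)
    pred_knn = knn.best_estimator_.predict(test)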
Conclusion
After completing the data preparation, from cleaning irrelevant data, handling missing data, scaling with the z-scale, converting the data types within the dataset, and clearing the outliers/distortion, I was able to train the data using the four classifiers and obtain the predicted classes for the test data. Random Forest achieved the highest accuracy with 88%, followed by K-NN with 82%.

After running the four classifiers on the dataset, the results can be summarized in the comparison below (Figure 9). The best models are Random Forest and K-NN. The small delta (≤4.3%) between estimated and actual accuracy across all models suggests that cross-validation reliably estimates performance, and Random Forest's result confirms that it handles the data's complexity best.
Figure 9 Model Comparison
References

Aggarwal, C. C. (2015). Data Mining: The Textbook. Springer.

Chugani, V. (2024, June 21). A Comprehensive Guide to K-Fold Cross Validation. DataCamp. https://www.datacamp.com/tutorial/k-fold-cross-validation

Meng, D., & Li, Y. (2022). An imbalanced learning method by combining SMOTE with Center Offset Factor. Applied Soft Computing, 120, 108618. https://doi.org/10.1016/j.asoc.2022.108618