Data Mining COMP5009
Assignment Report
Student ID 22442066 – David Nathanael Dharma Humala
Faculty of Science and Engineering – Curtin University
Contents

Figure List
Executive Summary
Methodology
    Data Preparation
        Convert Data Type
        Duplicate Values and Irrelevant Data
        Missing Values
        Data Transformation
        Class Balancing and Data Splitting
    Data Classification
        K-NN (K Nearest Neighbors)
        Naïve Bayes
        Decision Tree
        Random Forest
    Prediction
Conclusion
References
Figure List

Figure 1 Converting Data Type
Figure 2 Data Correlation
Figure 3 Boxplot Before Scaling
Figure 4 z-scale formula
Figure 5 Scaling Result
Figure 6 Boxplot After Scaling
Figure 7 Class Distribution
Figure 8 Confusion Matrix Random Forest and K-NN
Figure 9 Model Comparison
Executive Summary
In this report, I describe how the dataset was prepared and handled, and how it was subsequently used to predict the target variable. The dataset consists of 5,000 rows of training data and 500 rows of test data. After loading the data into the notebook, I first checked the data types, structure, and general information.
The first step in preparation was to check for duplicate rows and remove them. Next, I converted data types from string to categorical or Boolean where appropriate. I then checked for missing data in the dataset. Missing data can affect the prediction results because it leads to incomplete information. Columns with more than 60% missing values were dropped. For columns with less than 5% missing data, the missing values were filled with the mean using the ‘fillna’ method. For columns with around 18% missing values, I used iterative imputation, which employs regression models based on the other columns to predict the missing values; randomness is used in these predictions.
I also checked the correlation among the attributes using Pearson correlation to determine whether each relationship is positive or negative. If two features are highly correlated, one of them can be dropped, because they carry almost the same information and keeping both can produce overfitting at prediction time.
After cleaning the data of missing values, irrelevant data, and duplicates, I checked for outliers
using boxplots to visualize the data distribution. I applied scaling (z-score standardization), as it
is less sensitive to outliers and ideal when the data is roughly normal. This concludes the data
preparation process.
The next step was to check the class distribution. The class data needed to be balanced before
classification. I used the SMOTE technique to balance the classes, especially when one class
(typically the "positive" or "rare" class) had far fewer samples than the others. For cross-validation,
I used StratifiedKFold because it splits the data randomly while keeping class proportions
consistent, which is beneficial for imbalanced data.
After all data preparation steps were completed, I classified the data using four methods: k-NN (K-Nearest Neighbors), Naïve Bayes, Decision Tree, and Random Forest. Among these classifiers, k-NN and Random Forest achieved the highest accuracy. Using these two models, I predicted the class labels for the test data.
Methodology
Data Preparation
Convert Data Type
The first step is to check the data type of every attribute in the dataset and make sure all data are either categorical or numerical (float or int). In this dataset, I converted the string columns into the category dtype, because each string column holds a small set of unique, frequently repeated values; I then converted the categorical values into integer codes (1, 2, 3, 4, etc.).
Figure 1 Converting Data Type
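A minimal sketch of this conversion (assuming the data is loaded into a pandas DataFrame; the file name is a placeholder):

    import pandas as pd

    # Load the training data (file name is an assumption).
    df = pd.read_csv("train.csv")

    # Convert every string (object) column to the category dtype, then
    # replace each category with its integer code (0, 1, 2, ...).
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes.astype("int64")

    print(df.dtypes)  # all columns are now numeric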
Duplicate Values and Irrelevant Data
After that, I checked for duplicate rows in the dataset and counted them; no duplicates were found. Irrelevant attributes should also be dropped: the Index attribute can be removed because it carries no meaning, only the row order. Redundant attributes can be found using a correlation matrix between the numeric attributes (Figure 2). If a pair of features has a correlation above 0.9 or below −0.9, the two features are highly correlated and one of them can be dropped, because it carries almost the same information.

Figure 2 Data Correlation
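A sketch of the duplicate and correlation checks under the same assumptions (df is the training DataFrame; "Index" is the order-only attribute):

    import numpy as np

    # Count duplicate rows (none were found in this dataset).
    print("Duplicates:", df.duplicated().sum())

    # Drop the attribute that only encodes the row order.
    df = df.drop(columns=["Index"], errors="ignore")

    # Pearson correlation matrix of the numeric attributes (Figure 2).
    corr = df.select_dtypes(include="number").corr(method="pearson")

    # List feature pairs whose correlation is above 0.9 or below -0.9.
    pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
    print(pairs[pairs.abs() > 0.9])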
Missing Values
At the beginning, I checked the missing values (for example with ‘isna().sum()’ or the helper ‘missing()’), which reports each attribute together with its percentage of missing values.

In the data, three attributes have missing values: “Balak”, ”Djoop” and “Woorine”, with 2.96%, 18.12% and 61.74% missing respectively. Woorine has the highest proportion of missing values and was dropped, because a column that is mostly empty contributes little reliable information. Because Balak has the fewest missing values, I filled them with the mean: the standard deviation of Balak is quite small and the column is symmetrically distributed, so the mean is a good central estimate. Djoop has a moderate amount of missing data, so I imputed it using a regression model repeated for up to 10 iterations; this approach works well for complex datasets with multiple interrelated variables. After handling the missing values, I checked again using ‘missing()’ or ‘isna().sum()’: all the missing values are gone.
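A sketch of this handling, assuming all columns are already numeric and using scikit-learn's IterativeImputer for the regression-based step:

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Percentage of missing values per attribute.
    print(df.isna().mean().mul(100).round(2))

    # Woorine (~62% missing) is too sparse to impute reliably: drop it.
    df = df.drop(columns=["Woorine"])

    # Balak (~3% missing, symmetric distribution): fill with the mean.
    df["Balak"] = df["Balak"].fillna(df["Balak"].mean())

    # Djoop (~18% missing): regression-based iterative imputation, run
    # for up to 10 rounds, seeded so the randomness is reproducible.
    imputer = IterativeImputer(max_iter=10, random_state=42)
    df[df.columns] = imputer.fit_transform(df)

    assert df.isna().sum().sum() == 0  # all missing values are gone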
Data Transformation
Data transformation modifies the data into a format or structure suitable for analysis, modeling, or storage. This includes handling extreme values so they do not distort the models. For scaling/standardization I used the z-scale, because it rescales the dataset toward a standard normal distribution and is less sensitive to extreme values than min-max scaling.

I checked for outliers using a boxplot to visualize the data distribution.

Figure 3 Boxplot Before Scaling

In the picture, we can see the distribution and the outliers in every attribute. The attributes have different distributions and scales; to fix that, I applied the z-scale and transformed the dataset.
z = (x − μ) / σ

Figure 4 z-scale formula

Where:
x = original value of the feature
μ = mean of the feature
σ = standard deviation of the feature
z = standardized (z-scaled) value
After scaling, a sample of the resulting values is shown below:
Dooga Booladarlung Dembart Darbal Miro Balak Yonga Djarlma Ngoornt Ngooloormayup Amangu
-0.1170 -1.7882 0.9583 0.1156 -0.4953 -2.0030 -0.1559 -0.2689 0.1754 -0.9615 0.7240
0.9326 0.1766 0.2745 1.0872 -0.3431 -0.5462 -0.4981 0.7972 0.0233 -0.6897 0.6456
0.9061 1.1314 -0.6510 -1.3458 -0.6879 -0.0218 -1.3469 -0.6497 0.0322 1.6200 0.0098
0.1232 1.0543 0.1081 1.1352 -0.2735 -0.1024 0.0762 0.3138 -0.0097 -0.1472 0.5771
1.2439 0.1942 0.3829 -0.2859 -1.2294 -0.2870 -0.1915 0.3569 -0.0874 -0.6604 -0.7291
0.1472 -0.4523 0.7977 -1.7025 -1.0259 0.6033 -1.0477 0.6503 -0.0940 0.4420 -0.4770
Figure 5 Scaling Result
The data has been scaled using the mean and standard deviation of each feature. The boxplot has also changed:
Figure 6 Boxplot After Scaling
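A sketch of the scaling step with scikit-learn's StandardScaler (in a full pipeline the scaler would be fitted on the training split only):

    from sklearn.preprocessing import StandardScaler

    # z-score standardization: z = (x - mean) / std for each feature.
    scaler = StandardScaler()
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    print(df[numeric_cols].head())  # compare with Figure 5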
Class Balancing and Data Splitting
In the dataset, the classes were found to be imbalanced and needed to be balanced, because models trained on imbalanced data are biased toward the majority class, leading to misleading accuracy and poor performance (Aggarwal, 2015). After confirming the imbalance, I balanced the classes using SMOTE. SMOTE is an oversampling technique that creates artificial examples of the minority class (here, "Ngoon") instead of duplicating existing ones. This helps prevent overfitting and improves model generalization (Meng & Li, 2022).
New sample = Original + random(0, 1) × (Neighbor − Original)
Class      Before Balancing   After Balancing
Djook      1993 (39.86%)      1594 (33%)
Koolang    1991 (39.82%)      1594 (33%)
Ngoon      1016 (20.32%)      1594 (33%)

(Each class count after balancing is 1594 ≈ 0.8 × 1993, which indicates the balancing was applied to the 80% training split described below.)
Figure 7 Class Distribution
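A sketch of the balancing step with imbalanced-learn's SMOTE; the target column name "Class" is a placeholder:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    X = df.drop(columns=["Class"])  # feature matrix ("Class" is a placeholder)
    y = df["Class"]                 # target labels

    # SMOTE interpolates each new minority sample between a real point
    # and one of its nearest minority neighbours:
    #   new = original + random(0, 1) * (neighbor - original)
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y_bal))           # all classes now have equal counts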
Imbalanced data can also be handled through data splitting and cross-validation, which helps avoid overfitting and gives robust performance estimates. In this scenario I used StratifiedKFold to split the data, because it works well with SMOTE: each fold maintains the same class ratio as the original dataset, whereas a standard KFold might create folds where minority classes are underrepresented (Chugani, 2024).

I split the data for cross-validation into 80% training and 20% validation with random_state=42, and set StratifiedKFold to 10 folds. All attributes must be numeric (np.number); the procedure cannot run if the dataset contains object or other non-numeric dtypes.
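A sketch of the split and cross-validation setup, assuming SMOTE is applied after the split (consistent with the class counts in Figure 7):

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import StratifiedKFold, train_test_split

    # 80/20 stratified split (random_state=42), then balance only the
    # training portion.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42)
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # 10-fold stratified cross-validation used for model selection below.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)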
Data Classification
For classification, I used four classifiers to estimate the accuracy and predict the class attribute.
K-NN (K Nearest Neighbors)
K-NN uses an approach based on the nearest data points. K-NN does not build an explicit model during training; instead, it stores all the training data and makes predictions based on the similarity between new, unseen data points and the stored training examples. For this data, I tuned the weight type over Uniform and Distance: uniform gives a straightforward majority vote, while distance assumes that closer points are more likely to share the class of the query point. The distance metric was tuned over Euclidean and Manhattan.
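A sketch of this tuning with GridSearchCV; the n_neighbors range is an assumption (the reported best value is 9), and cv, X_train, y_train come from the setup above:

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    param_grid = {
        "n_neighbors": list(range(1, 21)),   # assumed search range
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan"],
    }
    knn = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv)
    knn.fit(X_train, y_train)
    print(f"Estimated prediction accuracy: {knn.best_score_:.3f}")
    print(knn.best_params_)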
K-NN produced the following estimated prediction accuracy and best parameters:
Estimated prediction accuracy: 0.860 = 86%
{'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'distance'}
The estimated prediction accuracy of K-NN is 86%, and the prediction accuracy is 82%.
Naïve Bayes
The Naïve Bayes classifier is a probabilistic model based on Bayes’ theorem (Chugani, 2024), which assumes that the features are conditionally independent given the class label. This means that, to make a prediction, the model calculates the probability of each class given the input features, under the assumption that each feature contributes independently to that probability. For this dataset, I used Gaussian Naïve Bayes, which is suitable for continuous, normally distributed data. For model tuning, I performed a grid search over the var_smoothing parameter, which controls the amount of variance added to the data to prevent division by zero and improve numerical stability. The grid search tested 100 logarithmically spaced values between 10^0 and 10^−9. Model selection and performance estimation were conducted using 10-fold stratified cross-validation with the weighted F1-score as the evaluation metric. This approach ensures that class proportions are preserved in each fold and that the metric fairly evaluates performance across all classes: the weighted F1-score handles imbalanced data and gives a more realistic measure of overall model performance.
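A sketch of this grid search; the logarithmic grid matches the reported best value of about 1.23 × 10^−4:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import GaussianNB

    # 100 logarithmically spaced values from 1e0 down to 1e-9.
    param_grid = {"var_smoothing": np.logspace(0, -9, num=100)}

    nb = GridSearchCV(GaussianNB(), param_grid, cv=cv, scoring="f1_weighted")
    nb.fit(X_train, y_train)
    print("Best Naive Bayes Parameters:", nb.best_params_)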
Naïve Bayes produced the following estimated prediction accuracy and best parameter:
Estimated prediction accuracy: 0.621 = 62.1%
Best Naive Bayes Parameters: {'var_smoothing': np.float64(0.0001232846739442066)}
The estimated prediction accuracy of Naïve Bayes is 62.1%, and the prediction accuracy is 59%.
Decision Tree
The Decision Tree classifier is a learning algorithm that models decisions and their possible consequences as a tree-like structure (Chugani, 2024). At each internal node, the algorithm selects a feature and a threshold to split the data into subsets, aiming to maximize the separation between classes according to a chosen criterion. The process continues recursively, creating branches until the leaves represent class labels or stopping criteria are met.
In this dataset, I used the DecisionTreeClassifier from scikit-learn. To optimize the model, I
performed a grid search over several key hyperparameters:
- max_depth, which controls the maximum depth of the tree and helps prevent overfitting by limiting how specific the model can become.
- min_samples_split, which determines the minimum number of samples required to split an internal node, affecting the granularity of the splits.
- criterion, which specifies the function used to measure the quality of a split, with options including "gini" for Gini impurity and "entropy" for information gain.
The grid search was conducted using 10-fold stratified cross-validation to ensure that class
proportions were preserved in each fold, which is particularly important for imbalanced datasets.
The weighted F1-score was used as the evaluation metric, providing a balanced assessment of
model performance across all classes by accounting for both precision and recall, weighted by
class frequency.
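A sketch of this grid search; the candidate values are assumptions chosen to include the reported best parameters:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    param_grid = {
        "max_depth": [5, 10, 15, 20, None],  # assumed candidate values
        "min_samples_split": [2, 5, 10],
        "criterion": ["gini", "entropy"],
    }
    dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=cv, scoring="f1_weighted")
    dt.fit(X_train, y_train)
    print("Best Decision Tree Parameters:", dt.best_params_)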
Decision Tree produced the following estimated prediction accuracy and best parameters:
Estimated prediction accuracy: 0.763 = 76.3%
Best Decision Tree Parameters: {'criterion': 'gini', 'max_depth': 15, 'min_samples_split': 2}
The estimated prediction accuracy of Decision Tree is 76.3%, and the prediction accuracy is 72%.
Random Forest
The Random Forest classifier is an ensemble model that combines multiple decision trees to improve generalization and reduce overfitting. Every tree is trained on a random subset of the data and considers a random subset of features at each split, which introduces diversity among the trees. The prediction is made by aggregating the votes from all trees (Chugani, 2024).
Key hyperparameters tuned:
- n_estimators: the number of decision trees in the forest. More trees generally improve performance but increase computational cost.
- max_depth: the maximum depth of individual trees. Limiting depth prevents overfitting by restricting tree complexity.
- min_samples_split: the minimum number of samples required to split an internal node. Higher values create simpler trees.
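A sketch of the tuning, with assumed candidate values that include the reported best parameters; as before, the weighted F1-score and the stratified folds are used:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 200],          # assumed candidate values
        "max_depth": [10, 20, None],
        "min_samples_split": [2, 5],
    }
    rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=cv, scoring="f1_weighted")
    rf.fit(X_train, y_train)
    print("Best Random Forest Parameters:", rf.best_params_)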
Random Forest produced the following estimated prediction accuracy and best parameters:
Estimated prediction accuracy: 0.908 = 90.8%
Best Random Forest Parameters: {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 200}
The estimated prediction accuracy of Random Forest is 90.8%, and the prediction accuracy is 88%.
Prediction
After completing the classification models, I predicted the test data using the models trained on the training data. I chose the two models with the highest accuracy.
Classifier      Estimated Accuracy   Prediction Accuracy   Delta
K-NN            86%                  82%                   4%
Naïve Bayes     62.1%                59%                   3.1%
Decision Tree   76.3%                72%                   4.3%
Random Forest   90.8%                88%                   2.8%
1. K-NN with 82%
2. Random Forest with 88%
The accuracy of the models can also be examined through the confusion matrix:
Figure 8 Confusion Matrix Random Forest and K-NN
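A sketch of producing these matrices for the two best models on the held-out validation data (variable names follow the earlier sketches):

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    # Confusion matrices of the two best models (Figure 8).
    ConfusionMatrixDisplay.from_estimator(rf.best_estimator_, X_val, y_val)
    ConfusionMatrixDisplay.from_estimator(knn.best_estimator_, X_val, y_val)
    plt.show()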
To use the test data, I had to prepare it so that it has the same shape as the training data: I dropped the Woorine and Index attributes and converted the object dtypes into integer dtypes. This gives the test data the same structure as the training data, so the class labels can be predicted.
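A sketch of this preparation and the final prediction; the file name is a placeholder, and in practice the exact preprocessing fitted on the training data (encoding, imputation, scaling) must be reused on the test features:

    import pandas as pd

    # Load and align the test data with the training structure.
    test = pd.read_csv("test.csv")  # placeholder file name
    test = test.drop(columns=["Woorine", "Index"], errors="ignore")
    for col in test.select_dtypes(include="object").columns:
        test[col] = test[col].astype("category").cat.codes.astype("int64")

    # Reuse the scaler fitted on the training features, then predict
    # with the two highest-accuracy models.
    test[test.columns] = scaler.transform(test)
    pred_rf = rf.best_estimator_.predict(test)
    pred_knn = knn.best_estimator_.predict(test)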
Conclusion
After completing the data preparation, from cleaning irrelevant data, handling missing data, scaling with the z-scale, converting the data types within the dataset, and clearing the outliers/distortion, I was able to train the data using the four classifiers and obtain the predicted classes for the test data. Random Forest achieved the highest accuracy with 88%, followed by K-NN with 82%.

After running the four classifiers on the dataset, the results can be summarized in the comparison below (Figure 9). The best models are Random Forest and K-NN. The small delta (≤4.3%) between estimated and actual accuracy across all models suggests that cross-validation reliably estimates performance, and Random Forest's result confirms that it handles the data's complexity best.
Figure 9 Model Comparison
References

Aggarwal, C. C. (2015). Data Mining: The Textbook. Springer.

Chugani, V. (2024, June 21). A Comprehensive Guide to K-Fold Cross Validation. DataCamp. https://www.datacamp.com/tutorial/k-fold-cross-validation

Meng, D., & Li, Y. (2022). An imbalanced learning method by combining SMOTE with Center Offset Factor. Applied Soft Computing, 120, 108618. https://doi.org/10.1016/j.asoc.2022.108618