lOMoARcPSD|12245914
Assignment 3
Introduction to Data Analytics (University of Technology Sydney)
Downloaded by arun neupane (arunneupane20@[Link])
Introduction to Data Analytics
Assessment Task 3: Data mining in action
Table of Contents
Data Mining
  The Task
  Input
  Output
Preprocessing
  Column Filter
  Missing Value
  Number to String
  Normalizer
  Partitioning
Classifiers
  Decision Trees
  Random Forest
  K Nearest Neighbor (KNN)
  SVM
  Neural Networks
  Tree Ensemble
Best Classifier
  Result Summary
  Conclusion
Data Mining
The Task
Following on from the last assignment, the main focus of this assignment is to build
classifiers and choose the best one to predict the attribute “QUALIFIED” for a property
data set. There are a number of tools for this. KNIME, which has a graphical interface,
was chosen because it shows the process visually.
Input
There are three files for this assignment: the training data set, the unknown data set,
and the sample prediction data set. The training data set has the attribute “QUALIFIED”,
but the unknown data set does not. The last data set, sample prediction, is filled with
random values to show how a Kaggle submission works.
For the assignment, KNIME handles the training and unknown data sets to predict
the attribute value for the unknown data set.
Output
It is not mandatory, but once the predicted data is created, uploading it to Kaggle will
score it and show how effective the process is.
Preprocessing
Column Filter
Within the data set, the attribute “GIS_LAST_MOD_DTTM”, which is column number 37,
has the same value for all rows. Therefore, a Column Filter is used to remove this
column from the data set.
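The check behind this step can be sketched in Python. This is only an illustration of the idea, not the KNIME node itself: the helper names `constant_columns` and `drop_columns` and the sample rows are hypothetical, and only the column names come from the data set.

```python
def constant_columns(rows):
    """Return the names of columns whose value is identical in every row."""
    if not rows:
        return []
    return [name for name in rows[0]
            if len({row[name] for row in rows}) == 1]


def drop_columns(rows, names):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


# Hypothetical sample rows; all values are made up for illustration.
rows = [
    {"AYB": 1910, "GIS_LAST_MOD_DTTM": "same-for-all", "QUALIFIED": 1},
    {"AYB": 1954, "GIS_LAST_MOD_DTTM": "same-for-all", "QUALIFIED": 0},
    {"AYB": 2001, "GIS_LAST_MOD_DTTM": "same-for-all", "QUALIFIED": 1},
]

to_drop = constant_columns(rows)
filtered = drop_columns(rows, to_drop)
print(to_drop)
```

A column with a single value carries no information that could separate the classes, which is why dropping it does not affect the prediction.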
Missing Value
Missing values, which may disturb the prediction, are removed.
Number to String
Some attributes hold numbers that are category codes rather than numeric quantities,
such as “HEAT”, “STYLE”, “STRUCT”, “GRADE”, “CNDTN”, “EXTWALL”, “ROOF”, “INTWALL”,
and “USECODE”. These are treated as strings to improve the learner’s performance.
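As a rough sketch of what this conversion does, the Python below casts the listed code columns to strings and leaves other columns alone. The helper name `codes_to_strings` and the sample row are hypothetical; the column names are the ones listed above.

```python
# Columns named in the report whose numbers are category codes, not quantities.
CODE_COLUMNS = ["HEAT", "STYLE", "STRUCT", "GRADE", "CNDTN",
                "EXTWALL", "ROOF", "INTWALL", "USECODE"]


def codes_to_strings(row, columns=CODE_COLUMNS):
    """Return a copy of the row with the listed columns converted to str."""
    return {k: (str(v) if k in columns and v is not None else v)
            for k, v in row.items()}


# Hypothetical row; the code values are made up for illustration.
row = {"HEAT": 13, "GRADE": 4, "AYB": 1925}
converted = codes_to_strings(row)
print(converted)
```

Once the codes are strings, a learner treats them as unordered categories instead of assuming that, for example, code 13 is "larger" than code 4.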
Normalizer
The Normalizer node applies min-max normalization to the attribute “AYB”.
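Min-max normalization rescales each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A minimal Python sketch follows; the function name and the sample AYB values are hypothetical, and the target range of 0 to 1 is an assumption, since the report does not state the range used.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical AYB values, made up for illustration.
ayb = [1900, 1950, 2000]
scaled = min_max_normalize(ayb)
print(scaled)
```

Rescaling matters most for distance-based learners such as KNN, where an attribute with a large raw range would otherwise dominate the distance.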
Partitioning
The Partitioning node separates the training data into two portions, split 70-30: 70%
is used for training and 30% for testing.
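The split can be sketched in Python as below. This assumes a simple random draw with a fixed seed. The report does not say which sampling mode was selected in the Partitioning node, so the shuffle, the seed, and the function name `partition` are assumptions made for illustration.

```python
import random


def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into (train, test) by the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 100 row ids standing in for the training data set.
rows = list(range(100))
train, test = partition(rows)
print(len(train), len(test))
```

Calling `partition(rows, train_fraction=0.9)` would give the 90-10 split used for the Tree Ensemble.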
Classifiers
Decision Trees
The data is learned and predicted with the Decision Tree nodes. Decision trees are
well suited to categorical data. The accuracy is 83.074%, with 1544 misclassified
rows.
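Accuracy is the number of correctly classified rows divided by the number of rows tested. The report does not state the size of the test partition, but it can be estimated from the two reported figures, assuming both refer to the same test partition. The sketch below does that arithmetic; the resulting row count is an inference, not a figure from the report.

```python
def accuracy(correct, total):
    """Fraction of rows that were classified correctly."""
    return correct / total


wrong = 1544                 # misclassified rows, from the report
reported_accuracy = 0.83074  # 83.074%, from the report

# wrong / total = 1 - accuracy, so total = wrong / (1 - accuracy).
implied_total = round(wrong / (1 - reported_accuracy))
implied_correct = implied_total - wrong

print(implied_total)
print(round(100 * accuracy(implied_correct, implied_total), 3))
```

Under that assumption the test partition has about 9122 rows, and recomputing the accuracy from the implied counts gives back the reported 83.074%.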
Random Forest
The pre-processed data was passed to the Random Forest Learner with default settings.
The accuracy is 88.043%.
K Nearest Neighbor (KNN)
The pre-processed data was passed to the KNN node. The “Number of Neighbors to
consider (K)” setting was changed from its original value of 3 to 5. The accuracy is 85.855%.
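To illustrate what the K setting controls, here is a minimal KNN sketch in Python: it finds the 5 training points closest to a query and returns the majority class among them. This is not the KNIME implementation. Euclidean distance, the function name `knn_predict`, the class labels "A" and "B", and the sample points are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=5):
    """Predict the label of query by majority vote among its k nearest points.

    train is a list of (features, label) pairs; query is a feature tuple.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, already-normalized training points with made-up labels.
train = [
    ((0.10, 0.20), "A"), ((0.15, 0.25), "A"), ((0.20, 0.10), "A"),
    ((0.30, 0.30), "A"),
    ((0.80, 0.90), "B"), ((0.85, 0.80), "B"), ((0.90, 0.95), "B"),
]

prediction = knn_predict(train, (0.25, 0.20), k=5)
print(prediction)
```

With k=5 the vote here is four "A" neighbours against one "B". An odd K also avoids ties when there are two classes.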
SVM
The SVM Learner ran for over 24 hours without completing; thus, no results were
obtained.
Neural Networks
The pre-processed data was passed to the PNN Learner with default settings.
The accuracy is 87.01%.
Tree Ensemble
The pre-processed data was passed to the Tree Ensemble Learner with default
settings, except for the partitioning, which is 90-10. The accuracy is 88.662%.
Best Classifier
Result Summary
The accuracy of each method is as follows:
Decision Tree: 83.074%
Random Forest: 88.043%
K Nearest Neighbor: 85.855%
SVM: no result (did not complete)
Neural Networks: 87.01%
Tree Ensemble: 88.662%
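The comparison can also be reproduced with a few lines of Python, using the accuracies reported above. The SVM run is recorded as None because it produced no result. Note that the Tree Ensemble figure comes from a 90-10 split while the other partitioning described is 70-30, so the comparison is not strictly like-for-like. The variable names are hypothetical.

```python
# Accuracies in percent, as reported; None marks the run that did not finish.
results = {
    "Decision Tree": 83.074,
    "Random Forest": 88.043,
    "K Nearest Neighbor": 85.855,
    "SVM": None,
    "Neural Networks": 87.01,
    "Tree Ensemble": 88.662,
}

# Keep only the classifiers that produced a result, then rank them.
completed = {name: acc for name, acc in results.items() if acc is not None}
ranking = sorted(completed.items(), key=lambda item: item[1], reverse=True)
best = ranking[0][0]
margin = round(ranking[0][1] - ranking[1][1], 3)

print(best, margin)
```

The margin between the top two methods, Tree Ensemble and Random Forest, is about 0.6 percentage points.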
Conclusion
Based on the result summary above, Tree Ensemble has the highest accuracy of all the
classifiers. Thus, the Tree Ensemble method is used to make the prediction for the
unknown data set. The resulting prediction was uploaded to Kaggle.
The complete KNIME workflow: