IT132 – Introduction to Data Mining
Week 1 – Tutorial
Class 1: Introduction to Weka
1.1. Introduction
Weka is open-source software available at www.cs.waikato.ac.nz/ml/weka. Weka stands for the
Waikato Environment for Knowledge Analysis. It offers clean, spare implementations of the simplest
techniques, designed to aid understanding of data mining methods. It also provides a workbench
that includes full, working, state-of-the-art implementations of many popular learning schemes that can
be used for practical data mining or for research.
In the first class, we are going to get started with Weka: exploring the “Explorer” interface, exploring
some datasets, building a classifier, using filters, and visualizing your dataset (see the Class 1 lecture
by Ian H. Witten [1]).
Task: Take notes on how you find the Explorer, and answer the questions in the following sections.
1.2. Exploring the Explorer
Follow the instructions in [1]
1.3. Exploring datasets
Follow the instructions in [1]
In the dataset weather.nominal.arff, how many attributes are there in the relation? What are their values?
What is the class and what are its values? Count the instances for each attribute value.
Answer: There are 5 attributes in the relation; their values, the class, and its values are described in the
following table. Instances are counted per value for nominal attributes; numeric attributes are summarized
in terms of min, max, mean, and standard deviation.
weather.nominal.arff (relation: weather.symbolic, 14 instances, 5 attributes)

Attribute     Distinct   Value      #Instances
outlook       3          sunny      5
                         overcast   4
                         rainy      5
temperature   3          hot        4
                         mild       6
                         cool       4
humidity      2          high       7
                         normal     7
windy         2          TRUE       6
                         FALSE      8
play (class)  2          yes        9
                         no         5
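These counts and statistics can also be read programmatically through Weka's Java API. The sketch
below is a minimal example, assuming weka.jar is on the classpath; the file path is an assumption,
so point it at your own copy of the dataset.

    import weka.core.AttributeStats;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ExploreAttributes {
        public static void main(String[] args) throws Exception {
            // Path is an example; adjust it to your Weka data directory
            Instances data = DataSource.read("data/weather.nominal.arff");
            for (int i = 0; i < data.numAttributes(); i++) {
                AttributeStats stats = data.attributeStats(i);
                System.out.println(data.attribute(i).name()
                        + " (distinct: " + stats.distinctCount + ")");
                if (data.attribute(i).isNominal()) {
                    // One count per label, in declaration order
                    for (int v = 0; v < stats.nominalCounts.length; v++) {
                        System.out.println("  " + data.attribute(i).value(v)
                                + ": " + stats.nominalCounts[v]);
                    }
                } else {
                    // Numeric attributes: min, max, mean, standard deviation
                    System.out.println("  min=" + stats.numericStats.min
                            + " max=" + stats.numericStats.max
                            + " mean=" + stats.numericStats.mean
                            + " stddev=" + stats.numericStats.stdDev);
                }
            }
        }
    }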
Similarly, examine datasets: weather.numeric.arff and glass.arff.
weather.numeric.arff (relation: weather, 14 instances, 5 attributes)

Attribute     Distinct   Values / statistics
outlook       3          sunny 5, overcast 4, rainy 5
temperature   12         min 64, max 85, mean 73.571, stddev 6.572
humidity      10         min 65, max 96, mean 81.643, stddev 10.285
windy         2          TRUE 6, FALSE 8
play (class)  2          yes 9, no 5
glass.arff (relation: Glass, 214 instances, 10 attributes)

Attribute     Distinct   Values / statistics
RI            178        min 1.511, max 1.534, mean 1.518, stddev 0.003
Na            142        min 10.73, max 17.38, mean 13.408, stddev 0.817
Mg            94         min 0, max 4.49, mean 2.685, stddev 1.442
Al            118        min 0.29, max 3.5, mean 1.445, stddev 0.499
Si            133        min 69.81, max 75.41, mean 72.651, stddev 0.775
K             65         min 0, max 6.21, mean 0.497, stddev 0.652
Ca            143        min 5.43, max 16.19, mean 8.957, stddev 1.423
Ba            34         min 0, max 3.15, mean 0.175, stddev 0.497
Fe            32         min 0, max 0.51, mean 0.057, stddev 0.097
Type (class)  6          build wind float 70, build wind non-float 76,
                         vehic wind float 17, vehic wind non-float 0,
                         containers 13, tableware 9, headlamps 29

(Type declares 7 labels but has only 6 distinct values, since vehic wind non-float never occurs.)
Create a file of ARFF format and examine it.
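For reference, ARFF is a plain-text format: a @relation line, one @attribute declaration per attribute,
and a @data section with one comma-separated instance per line. A minimal file with the same
structure as weather.nominal.arff could look like the sketch below (the data rows are illustrative,
not the full 14 instances):

    @relation weather.symbolic

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature {hot, mild, cool}
    @attribute humidity {high, normal}
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,hot,high,FALSE,no
    overcast,hot,high,FALSE,yes
    rainy,mild,high,FALSE,yes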
1.4. Building a classifier
Follow the instructions in [1]
Examine the output of J48 vs. RandomTree applied to the dataset glass.arff:
Algorithm     Pruned/unpruned   minNumObj   Leaves   Correctly classified instances
J48           Unpruned          15          8        131
RandomTree    N/A               N/A         N/A      150
Examine the confusion matrix each time you run an algorithm.
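The same experiment can be scripted against the Weka Java API instead of the Explorer. The following
is a minimal sketch, assuming weka.jar on the classpath; the file path and the 10-fold cross-validation
setup are assumptions (the Explorer's default test option), so adjust them to match your own run.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.RandomTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareTrees {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/glass.arff"); // path is an example
            data.setClassIndex(data.numAttributes() - 1);        // Type is the last attribute

            // J48, unpruned, minimum 15 instances per leaf
            J48 j48 = new J48();
            j48.setUnpruned(true);
            j48.setMinNumObj(15);

            RandomTree rt = new RandomTree();

            for (Classifier c : new Classifier[] { j48, rt }) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold CV, fixed seed
                System.out.println(c.getClass().getSimpleName());
                System.out.println(eval.toSummaryString());
                System.out.println(eval.toMatrixString()); // the confusion matrix
            }
        }
    }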
J48 – unpruned – minNumObj=15:
=== Confusion Matrix ===
a b c d e f g <-- classified as
50 15 3 0 0 1 1 | a = build wind float
16 47 6 0 2 3 2 | b = build wind non-float
5 5 6 0 0 1 0 | c = vehic wind float
0 0 0 0 0 0 0 | d = vehic wind non-float
0 2 0 0 10 0 1 | e = containers
1 1 0 0 0 7 0 | f = tableware
3 2 0 0 0 1 23 | g = headlamps
The classifier is skewed towards predicting a = build wind float and b = build wind non-float.
RandomTree:
=== Confusion Matrix ===
a b c d e f g <-- classified as
53 11 6 0 0 0 0 | a = build wind float
13 53 4 0 2 2 2 | b = build wind non-float
5 4 8 0 0 0 0 | c = vehic wind float
0 0 0 0 0 0 0 | d = vehic wind non-float
0 1 0 0 11 0 1 | e = containers
0 4 0 0 1 4 0 | f = tableware
2 2 0 0 2 2 21 | g = headlamps
RandomTree is similarly skewed towards a = build wind float and b = build wind non-float, but it
achieves better results than J48 (150 vs. 131 correctly classified instances).
1.5. Using a filter
Follow the instructions in [1], and note the following.
Use a filter to remove an attribute:
- What are attributeIndices?
Answer: The range of attributes to be acted upon by the filter.
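As a code-level illustration of the same step, here is a minimal sketch using
weka.filters.unsupervised.attribute.Remove (the attribute index and file path are examples):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class RemoveAttribute {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/weather.nominal.arff");

            Remove remove = new Remove();
            remove.setAttributeIndices("3"); // attributeIndices: humidity (3rd attribute, 1-based)
            remove.setInputFormat(data);     // must be called before filtering

            Instances filtered = Filter.useFilter(data, remove);
            System.out.println(filtered.numAttributes() + " attributes remain");
        }
    }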
Remove instances where humidity is high:
- What are nominalIndices?
Answer: The range of label indices used for selection on a nominal attribute.
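A corresponding sketch for the instance filter, weka.filters.unsupervised.instance.RemoveWithValues
(indices are 1-based and assume the attribute order shown in section 1.3; invertSelection flips the
matching sense if the wrong side of the data ends up removed):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemoveWithValues;

    public class RemoveHighHumidity {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/weather.nominal.arff");

            RemoveWithValues rwv = new RemoveWithValues();
            rwv.setAttributeIndex("3");  // humidity is the 3rd attribute (1-based)
            rwv.setNominalIndices("1");  // nominalIndices: 1st label of humidity, i.e. "high"
            // rwv.setInvertSelection(true) would flip which instances are matched
            rwv.setInputFormat(data);

            Instances filtered = Filter.useFilter(data, rwv);
            System.out.println(filtered.numInstances() + " instances remain");
        }
    }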
Fewer attributes, better classification:
Answer:
This is not true in all cases. When it does hold, the removed attributes were likely no more than
unnecessary complications for the model, or the model could not find the global optimum while they
were included. However, when important attributes are removed (for example, size measures when
classifying cats versus tigers), classification results deteriorate badly. Either way, the notion that
fewer attributes can lead to better classification must be confirmed by observation and experiment;
it depends on both the model and the set of attributes.
Follow the instructions in [1] and review the outputs of J48 applied to glass.arff:
Filter                    Leaves   Correctly classified   Remark
                                   instances
Original                  8        131                    The classifier built in section 1.4.
Remove Fe                 8        133                    Incremental improvement over the
                                                          previous model: higher true positives,
                                                          lower false positives.
Remove all attributes     7        142                    Good improvement over the previous
except RI and Mg                                          model: higher true positives, lower
                                                          false positives. Because the number of
                                                          attributes has been greatly reduced,
                                                          the number of leaves decreased.
                                                          Interestingly, the model works better
                                                          with only a few attributes; the larger
                                                          set must have complicated the model.
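The last row of the table can be reproduced in code with a Remove filter whose selection is inverted,
so that only RI, Mg, and the class attribute survive. A minimal sketch (indices assume the attribute
order shown in section 1.3; the path is an example):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class FewAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/glass.arff");

            Remove remove = new Remove();
            remove.setAttributeIndices("1,3,last"); // RI, Mg, and the class attribute
            remove.setInvertSelection(true);        // keep these, remove everything else
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            reduced.setClassIndex(reduced.numAttributes() - 1);

            J48 j48 = new J48();
            j48.setUnpruned(true);
            j48.setMinNumObj(15);
            j48.buildClassifier(reduced);
            System.out.println(j48); // prints the tree, including the number of leaves
        }
    }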
1.6. Visualizing your data
Follow the instructions in [1]. How do you find “Visualize classifier errors”?
- Answer: By right-clicking the desired entry in the Result list.
After running J48 on iris.arff, determine:
- How many instances are predicted wrong?
Answer: 9 (given J48 classifier – unpruned – minNumObj=15).
- What are they?
Answer:
Instance   Predicted class    Actual class
63         Iris-versicolor    Iris-virginica
80         Iris-versicolor    Iris-virginica
92         Iris-versicolor    Iris-virginica
109        Iris-versicolor    Iris-virginica
123        Iris-versicolor    Iris-virginica
98         Iris-versicolor    Iris-setosa
73         Iris-virginica     Iris-versicolor
105        Iris-virginica     Iris-versicolor
119        Iris-virginica     Iris-versicolor
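The table above can also be produced without the visualizer by iterating over the evaluation's
predictions. A minimal sketch, assuming evaluation on the training set (instance numbers here are
0-based and will differ under cross-validation or a different numbering in the Explorer):

    import weka.classifiers.Evaluation;
    import weka.classifiers.evaluation.Prediction;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ListErrors {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            J48 j48 = new J48();
            j48.setUnpruned(true);
            j48.setMinNumObj(15);
            j48.buildClassifier(data);

            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(j48, data); // evaluate on the training set

            // predictions() follows data order, so the index is the instance number
            int i = 0;
            for (Prediction p : eval.predictions()) {
                if (p.actual() != p.predicted()) {
                    System.out.println("instance " + i
                            + ": predicted " + data.classAttribute().value((int) p.predicted())
                            + ", actual " + data.classAttribute().value((int) p.actual()));
                }
                i++;
            }
        }
    }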