Data Science with R
Cluster Analysis
[email protected]
2nd August 2013
Visit http://onepager.togaware.com/ for more OnePageR’s.
The required packages for this module include:
library(rattle) # Weather dataset and normVarNames()
library(randomForest) # na.roughfix()
library(wskm) # Weighted subspace clustering.
library(amap) # hclusterpar
library(cba) # Dendrogram plot
library(dendroextras) # To colour clusters
As we work through this module, new R commands will be introduced. Be sure to review the
command’s documentation and understand what the command does. You can ask for help using
the ? command as in:
?read.csv
We can obtain documentation on a particular package using the help= option of library():
library(help=rattle)
This module is intended to be hands on. To learn effectively, you are encouraged to have R
running (e.g., RStudio) and to run all the commands as they appear here. Check that you get
the same output, and you understand the output. Try some variations. Explore.
Copyright © 2013 Graham J Williams. You can freely copy, distribute,
transmit, adapt, or make commercial use of this module, as long as the
attribution is retained and derivative work is provided under the same license.
Data Science with R OnePageR Survival Guides Cluster Analysis
1 Load Weather Dataset for Modelling
Use the weather dataset from rattle (Williams, 2013).
library(rattle) # normVarNames()
library(randomForest) # na.roughfix()
dsname <- "weather"
ds <- get(dsname)
names(ds) <- normVarNames(names(ds))
vars <- names(ds)
target <- "rain_tomorrow"
risk <- "risk_mm"
id <- c("date", "location")
ignore <- c(id, if (exists("risk")) risk)
mvc <- sapply(ds[vars], function(x) sum(is.na(x)))
mvn <- names(which(mvc == nrow(ds)))
ignore <- union(ignore, mvn)
vars <- setdiff(vars, ignore)
ds[vars] <- na.roughfix(ds[vars]) # Maximise available data.
inputs <- setdiff(vars, target)
nobs <- nrow(ds)
numerics <- intersect(inputs, vars[which(sapply(ds[vars], is.numeric))])
categorics <- intersect(inputs, vars[which(sapply(ds[vars], is.factor))])
We summarise here the processing we have done above:
We use na.roughfix() to impute missing values, as k-means cannot handle missing values.
We identify the numerics, since most clustering will be performed on numeric data.
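The idea behind this rough imputation can be sketched in base R. na.roughfix() replaces missing numeric values with the column median and missing factor values with the most frequent level. The helper name rough_impute() and the toy data frame below are hypothetical, introduced only for illustration; they are not part of randomForest.

```r
# Hypothetical helper that mimics the idea behind na.roughfix():
# median for numeric columns, most frequent level for factors.
rough_impute <- function(df) {
  for (v in names(df)) {
    x <- df[[v]]
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else if (is.factor(x)) {
      # table() ignores NA by default; which.max() picks the first mode.
      x[is.na(x)] <- names(which.max(table(x)))
    }
    df[[v]] <- x
  }
  df
}

# Toy data (an assumption, not the weather dataset).
toy <- data.frame(temp = c(10, NA, 14, 30),
                  wind = factor(c("N", "S", NA, "N")))
fixed <- rough_impute(toy)
print(fixed)
```

In the toy data the missing temp becomes 14 (the median of 10, 14 and 30) and the missing wind becomes "N".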
Copyright © 2013 [email protected] Module: ClustersO Page: 1 of 10
2 K-Means
Here is our first attempt.
m.km <- kmeans(ds, 10)
## Warning: NAs introduced by coercion
## Error: NA/NaN/Inf in foreign function call (arg 1)
The error arises because ds contains non-numeric variables, which kmeans() cannot cluster on.
m.km <- kmeans(ds[numerics], 10)
That appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be
listed.
m.km$size
## [1] 7 57 35 15 40 24 49 63 30 46
The cluster centers (the means) can also be listed.
m.km$centers
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 10.3000 24.59 0.34286 7.114 11.029 72.71
## 2 3.0667 17.59 0.35088 2.775 8.209 25.00
## 3 12.8886 31.67 0.12000 8.051 10.974 44.49
....
The component m.km$cluster reports which of the 10 clusters each of the original observations
belongs to.
head(m.km$cluster)
## [1] 8 8 4 6 6 10
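To see that the centers are the per-cluster means, the means can be recomputed from the cluster assignments. This is a minimal sketch that assumes the built-in USArrests data as a stand-in for ds[numerics], so that it runs without rattle. The choice of 4 clusters is arbitrary.

```r
set.seed(42)
x <- as.matrix(USArrests)                  # stand-in numeric data (ships with R)
km <- kmeans(x, centers = 4, nstart = 10)

# Sum each variable within each cluster, then divide by the cluster size.
sums  <- rowsum(x, group = km$cluster)     # one row per cluster, sorted by cluster id
means <- sums / km$size

print(round(means, 2))
print(all.equal(means, km$centers, check.attributes = FALSE))
```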
Exercise: Visualise the clusters.
Exercise: Notice the data is not scaled to a common range. What is the impact of this?
Rebuild the clusters after rescaling the variables.
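One possible starting point for the rescaling is sketched here. It assumes the built-in USArrests data as a stand-in for ds[numerics] and uses a hypothetical helper, rescale01(), to map every variable onto the range 0 to 1. Other rescalings, such as scale(), are equally reasonable.

```r
set.seed(42)
x <- as.matrix(USArrests)                  # stand-in numeric data (ships with R)

# Hypothetical helper: linear rescaling of a vector onto [0, 1].
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
x01 <- apply(x, 2, rescale01)

print(apply(x, 2, range))                  # very different ranges before
print(apply(x01, 2, range))                # all 0 to 1 afterwards

km.raw <- kmeans(x, centers = 4, nstart = 10)
km.01  <- kmeans(x01, centers = 4, nstart = 10)

# Cross-tabulate the two clusterings to see how far they disagree.
print(table(raw = km.raw$cluster, rescaled = km.01$cluster))
```

Without rescaling, the variable with the largest spread tends to dominate the Euclidean distances, so the two clusterings will generally differ.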
3 Evaluate Clusters
Exercise: What is the use of the sums of squares?
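As a hint, a kmeans() model stores the total, within-cluster and between-cluster sums of squares in totss, tot.withinss and betweenss. The sketch below assumes the built-in USArrests data (scaled) as a stand-in for the weather data, and the range of k values tried is arbitrary.

```r
set.seed(42)
x <- scale(USArrests)                      # stand-in numeric data (ships with R)
km <- kmeans(x, centers = 3, nstart = 10)

# The total sum of squares splits into within and between components.
print(c(total = km$totss, within = km$tot.withinss, between = km$betweenss))

# Total within-cluster sum of squares for a range of k: a lower value means
# tighter clusters, but it keeps falling as k grows.
ks  <- 2:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
print(data.frame(k = ks, wss = round(wss, 1)))
```

Looking for the value of k beyond which wss stops dropping sharply is one common, informal way to compare candidate numbers of clusters.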
4 Entropy Weighted K-Means
Use ewkm() from wskm (Williams et al., 2012).
set.seed(42)
library(wskm)
m.ewkm <- ewkm(ds, 10)
## Warning: NAs introduced by coercion
## Error: NA/NaN/Inf in foreign function call (arg 1)
Once again, only numeric variables can be clustered.
library(wskm)
m.ewkm <- ewkm(ds[numerics], 10)
## **********Clustering converged. Terminate!
round(100*m.ewkm$weights)
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 0 0 100 0 0 0
## 2 0 0 0 100 0 0
## 3 0 0 100 0 0 0
## 4 0 0 0 0 0 0
## 5 6 6 6 6 6 6
## 6 0 0 0 100 0 0
## 7 0 0 0 100 0 0
## 8 0 0 0 0 0 0
## 9 6 6 6 6 6 6
## 10 0 0 100 0 0 0
....
Exercise: Plot the clusters.
Exercise: Rescale the data so all variables have the same range and then rebuild the clusters,
and comment on the differences.
Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables
as an advantage, particularly in the context of the curse of dimensionality.
5 Hierarchical Cluster in Parallel
Use hclusterpar() from amap.
library(amap)
model <- hclusterpar(na.omit(ds[numerics]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
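For comparison, a similar model can be sketched with base R's dist() and hclust(), which run on a single processor but need no extra packages. This assumes the built-in USArrests data as a stand-in for ds[numerics]. The choice of method = "ward.D" is an assumption and may not match amap's link = "ward" exactly.

```r
x <- scale(USArrests)                      # stand-in numeric data (ships with R)

d  <- dist(x, method = "euclidean")        # pairwise distances between observations
hc <- hclust(d, method = "ward.D")         # agglomerative clustering

# Cut the tree into 4 groups (an arbitrary choice) and list the group sizes.
groups <- cutree(hc, k = 4)
print(table(groups))
```

cutree() returns one cluster label per observation, which plays the same role as the cluster component of a kmeans() model.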
6 Plotting Hierarchical Cluster
Plot the dendrogram, using cba.
plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)
# Add in rectangles to show the clusters.
rect.hclust(model, k=10)
[Figure: Cluster Dendrogram. Vertical axis: Height (0 to 1500). Caption: hclusterpar (*, "ward").]
7 Add Colour to the Hierarchical Cluster
Using the dendroextras package to add colour to the dendrogram:
library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")
[Figure: coloured dendrogram of the 10 clusters. Height axis: 0 to 1500. Leaves are labelled by observation number.]
Data Science with R OnePageR Survival Guides Cluster Analysis
8 Hierarchical Cluster Binary Variables
Exercise: Clustering a large population based on the patterns of missing data within the
population is a technique for grouping observations exhibiting similar patterns of
behaviour, assuming missing by pattern.... We can convert each variable to a binary 1/0
indicating present/missing and then use mona() for a hierarchical clustering. Demonstrate this.
Include a levelplot.
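A possible starting point is sketched below. It assumes mona() from the cluster package and a small synthetic data frame with missing values injected at random. The names toy and present are hypothetical, and the weather data itself is not used.

```r
library(cluster)                           # mona()

set.seed(42)
n <- 40
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))

# Inject 10 missing values into each variable (synthetic, for illustration).
for (v in names(toy)) toy[sample(n, 10), v] <- NA

# Binary indicator matrix: 1 = present, 0 = missing.
present <- !is.na(toy)
storage.mode(present) <- "integer"

# Divisive hierarchical clustering on the binary indicators.
# mona() requires every variable to take both values.
m <- mona(present)
print(m$variable)                          # variable used for each split
print(head(m$order))                       # observations in clustering order
```

A levelplot of the reordered indicators, for example lattice::levelplot(present[m$order, ]), would then show the blocks of observations sharing a missingness pattern.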
9 Further Reading
The Rattle Book, published by Springer, provides a comprehensive
introduction to data mining and analytics using Rattle and R. It
is available from Amazon. Other documentation on a broader
selection of R topics of relevance to the data scientist is freely
available from http://datamining.togaware.com, including the
Datamining Desktop Survival Guide.
This module is one of many OnePageR modules available from
http://onepager.togaware.com. In particular, follow the links
on the website marked with a *, which indicate the generally more developed
OnePageR modules.
10 References
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Williams GJ (2009). “Rattle: A Data Mining GUI for R.” The R Journal, 1(2), 45–55. URL
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.
Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge
discovery. Use R! Springer, New York. URL
http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.
Williams GJ (2013). rattle: Graphical user interface for data mining in R. R package version
2.6.27, URL http://rattle.togaware.com/.
Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted Subspace KMeans
Clustering. R package version 1.3.8, URL http://CRAN.R-project.org/package=wskm.
This document, sourced from ClustersO.Rnw revision 207, was processed by KnitR version 1.2
of 2013-04-10 and took 2.7 seconds to process. It was generated by gjw on nyx running Ubuntu
13.04 with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It
completed the processing 2013-08-02 06:23:39.