Data Science with R

Cluster Analysis

2nd August 2013

Visit http://onepager.togaware.com/ for more OnePageR’s.

The required packages for this module include:

library(rattle) # Weather dataset and normVarNames()
library(randomForest) # na.roughfix()
library(wskm) # Weighted subspace clustering.
library(amap) # hclusterpar
library(cba) # Dendrogram plot
library(dendroextras) # To colour clusters

As we work through this module, new R commands will be introduced. Be sure to review the
command’s documentation and understand what the command does. You can ask for help using
the ? command as in:
?read.csv

We can obtain documentation on a particular package using the help= option of library():
library(help=rattle)

This module is intended to be hands on. To learn effectively, you are encouraged to have R
running (e.g., RStudio) and to run all the commands as they appear here. Check that you get
the same output, and you understand the output. Try some variations. Explore.

Copyright © 2013 Graham J Williams. You can freely copy, distribute,
transmit, adapt, or make commercial use of this module, as long as the
attribution is retained and derivative work is provided under the same license.
Data Science with R OnePageR Survival Guides Cluster Analysis

1 Load Weather Dataset for Modelling

Use the weather dataset from rattle (Williams, 2013).
library(rattle) # normVarNames()
library(randomForest) # na.roughfix()
dsname <- "weather"
ds <- get(dsname)
names(ds) <- normVarNames(names(ds))
vars <- names(ds)
target <- "rain_tomorrow"
risk <- "risk_mm"
id <- c("date", "location")
ignore <- c(id, if (exists("risk")) risk)
mvc <- sapply(ds[vars], function(x) sum(is.na(x)))
mvn <- names(which(mvc == nrow(ds)))
ignore <- union(ignore, mvn)
vars <- setdiff(vars, ignore)
ds[vars] <- na.roughfix(ds[vars]) # Maximise available data.
inputs <- setdiff(vars, target)
nobs <- nrow(ds)
numerics <- intersect(inputs, vars[which(sapply(ds[vars], is.numeric))])
categorics <- intersect(inputs, vars[which(sapply(ds[vars], is.factor))])

We summarise here the processing we have done above:

We use na.roughfix() to impute missing values, as k-means cannot handle missing values.
We identify the numerics, since most clustering will be performed on numeric data.
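
To make the imputation step concrete, here is a minimal base R sketch of the kind of fix na.roughfix() applies: missing numeric values are replaced by the column median and missing factor values by the most frequent level. The toy data frame and the helper name rough_fix_sketch are made up for illustration; the module itself uses the real na.roughfix() from randomForest.

```r
# Toy data with missing values (hypothetical, for illustration only).
toy <- data.frame(temp = c(10, NA, 14, 18),
                  wind = factor(c("N", "S", NA, "S")))

# Hypothetical helper mimicking the idea behind na.roughfix():
# median for numeric columns, most frequent level for factors.
rough_fix_sketch <- function(df) {
  for (v in names(df)) {
    miss <- is.na(df[[v]])
    if (!any(miss)) next
    if (is.numeric(df[[v]])) {
      df[[v]][miss] <- median(df[[v]], na.rm = TRUE)
    } else if (is.factor(df[[v]])) {
      df[[v]][miss] <- names(which.max(table(df[[v]])))
    }
  }
  df
}

fixed <- rough_fix_sketch(toy)
fixed
```

This is a crude fix: it keeps every observation available for clustering, but it also pulls imputed values toward the centre of each variable.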

Copyright © 2013 [email protected] Module: ClustersO Page: 1 of 10

2 K-Means
Here is our first attempt.
m.km <- kmeans(ds, 10)
## Warning: NAs introduced by coercion
## Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on.
m.km <- kmeans(ds[numerics], 10)

That appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be
listed.
m.km$size
## [1] 7 57 35 15 40 24 49 63 30 46

The cluster centers (the means) can also be listed.

m.km$centers
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 10.3000 24.59 0.34286 7.114 11.029 72.71
## 2 3.0667 17.59 0.35088 2.775 8.209 25.00
## 3 12.8886 31.67 0.12000 8.051 10.974 44.49
....

The component m.km$cluster records which of the 10 clusters each of the original observations
belongs to.
head(m.km$cluster)
## [1] 8 8 4 6 6 10
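
The same sequence can be tried on any data frame. The sketch below uses the built-in iris dataset instead of the weather data, so it runs without rattle. The choice of 3 clusters, the seed and nstart=10 are assumptions made for this illustration only.

```r
set.seed(42)                                   # k-means starts from random centres
num <- names(iris)[sapply(iris, is.numeric)]   # keep only the numeric columns
m <- kmeans(iris[num], centers = 3, nstart = 10)

m$size            # number of observations in each cluster
m$centers         # one row of variable means per cluster
head(m$cluster)   # cluster label for each of the first six observations

# Cross-tabulate cluster membership against a known factor.
table(cluster = m$cluster, species = iris$Species)
```

Cluster numbers are arbitrary labels, so the same grouping can be numbered differently from one run to the next unless the seed is fixed.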

Exercise: Visualise the clusters.

Exercise: Notice the data is not scaled to a common range. What is the impact of this?
Rebuild the clusters after rescaling the variables.
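
As a possible starting point for this exercise, the sketch below uses the built-in USArrests dataset, in which one variable (Assault) has a much larger spread than the others. Here scale() standardises each variable to mean 0 and standard deviation 1. That is only one of several ways to rescale, and the choice of 4 clusters is an assumption.

```r
set.seed(42)
sapply(USArrests, sd)                 # Assault varies far more than the rest

m.raw    <- kmeans(USArrests, centers = 4, nstart = 10)
scaled   <- scale(USArrests)          # centre to mean 0, scale to sd 1
m.scaled <- kmeans(scaled, centers = 4, nstart = 10)

# Compare the two partitions. A row spread over several columns shows a
# raw-data cluster that was split up once no single variable dominates
# the Euclidean distance.
table(raw = m.raw$cluster, scaled = m.scaled$cluster)
```

The same comparison can then be repeated with ds[numerics] from the weather data.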

3 Evaluate Clusters
Exercise: What is the use of the sums of squares?
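
As a hint, kmeans() returns several sums of squares: totss (total), withinss (one per cluster), tot.withinss (their sum) and betweenss, where totss = tot.withinss + betweenss. The sketch below inspects them on the built-in USArrests data, which is an assumption made to keep the example self-contained, and traces tot.withinss across a range of k.

```r
set.seed(42)
x <- scale(USArrests)
m <- kmeans(x, centers = 4, nstart = 10)

m$withinss                     # within-cluster sum of squares, one per cluster
m$tot.withinss                 # their total: smaller means tighter clusters
m$betweenss / m$totss          # share of total variation explained by the clustering

# Total within-cluster sum of squares for k = 1..8. A bend ("elbow")
# in this curve is one common heuristic for choosing k.
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
round(wss, 1)
```

The within sum of squares tends to shrink as k grows, so it is more useful for comparing models with the same k, or for spotting where extra clusters stop helping, than as a score to minimise outright.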

4 Entropy Weighted K-Means

Use ewkm() from wskm (Williams et al., 2012).
set.seed(42)
library(wskm)
m.ewkm <- ewkm(ds, 10)
## Warning: NAs introduced by coercion
## Error: NA/NaN/Inf in foreign function call (arg 1)

Once again, only numeric variables can be clustered.

library(wskm)
m.ewkm <- ewkm(ds[numerics], 10)
## **********Clustering converged. Terminate!

round(100*m.ewkm$weights)
## min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1 0 0 100 0 0 0
## 2 0 0 0 100 0 0
## 3 0 0 100 0 0 0
## 4 0 0 0 0 0 0
## 5 6 6 6 6 6 6
## 6 0 0 0 100 0 0
## 7 0 0 0 100 0 0
## 8 0 0 0 0 0 0
## 9 6 6 6 6 6 6
## 10 0 0 100 0 0 0
....
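
To give some intuition for weights like those above, the sketch below computes entropy-style variable weights for a fixed cluster assignment: within each cluster, a variable with small dispersion receives a large weight, through a softmax of the negative dispersions. It follows the general weight update described in the entropy weighting k-means literature and is not the wskm implementation. The helper name entropy_weights_sketch, its lambda argument and the use of USArrests are assumptions.

```r
# Hypothetical helper: entropy-style variable weights for each cluster.
# x: numeric data, cluster: integer labels, lambda: smoothing (assumed name).
entropy_weights_sketch <- function(x, cluster, lambda = 1) {
  x <- as.matrix(x)
  ids <- sort(unique(cluster))
  w <- t(sapply(ids, function(k) {
    xk <- x[cluster == k, , drop = FALSE]
    # Dispersion of each variable around the cluster centre.
    d <- colSums(sweep(xk, 2, colMeans(xk))^2)
    # Softmax of negative dispersion; subtracting min(d) avoids underflow.
    e <- exp(-(d - min(d)) / lambda)
    e / sum(e)
  }))
  rownames(w) <- ids
  w
}

set.seed(42)
cl <- kmeans(USArrests, centers = 3, nstart = 10)$cluster

# Raw scale: weights tend to collapse onto a single variable.
round(100 * entropy_weights_sketch(USArrests, cl))

# Standardised data with a larger lambda: weights tend to be more even.
round(100 * entropy_weights_sketch(scale(USArrests), cl, lambda = 10))
```

On unscaled data the dispersions differ by orders of magnitude, so the softmax puts almost all weight on one variable. This is one reason the rescaling exercise below is worth doing.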

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range and then rebuild the cluster,
and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of
variables as an advantage, particularly in the context of the curse of dimensionality.

5 Hierarchical Cluster in Parallel

Use hclusterpar() from the amap package.
library(amap)
model <- hclusterpar(na.omit(ds[numerics]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
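
hclusterpar() computes the distances and builds the hierarchy in one call. For comparison, the sketch below does the same kind of agglomerative clustering in two steps with base R's dist() and hclust(), then cuts the tree into a fixed number of groups with cutree(). It is a different implementation and is not claimed to reproduce the amap tree. The USArrests data and k=4 are assumptions for illustration.

```r
x <- scale(USArrests)                      # built-in data, standardised
d <- dist(x, method = "euclidean")         # pairwise distance matrix
hc <- hclust(d, method = "ward.D2")        # Ward's criterion on Euclidean distances

groups <- cutree(hc, k = 4)                # cut the tree into 4 clusters
table(groups)                              # cluster sizes
```

Unlike k-means, the tree is built once and can then be cut at any number of clusters without refitting.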

6 Plotting Hierarchical Cluster

Plot the dendrogram (the cba package is loaded for its dendrogram plotting support).
plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

# Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: "Cluster Dendrogram" produced by the code above. Vertical axis: Height (0 to 1500). Subtitle: hclusterpar (*, "ward").]

7 Add Colour to the Hierarchical Cluster

Using the dendroextras package to add colour to the dendrogram:
library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: the dendrogram with its 10 clusters coloured. Height axis: 0 to 1500; leaves labelled by observation number.]
8 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the
population is a technique for grouping observations exhibiting similar patterns of
behaviour, assuming the data is missing by pattern. We can convert each variable to a
binary 1/0 indicating present/missing and then use mona() for a hierarchical clustering.
Demonstrate this. Include a levelplot.
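
One possible way into this exercise is sketched below on made-up data: build the 1/0 present/missing indicator matrix in base R, drop variables that are never or always missing (they carry no pattern), and pass the result to mona() from the cluster package if it is installed. The toy data frame, its variable names and the amount of missing data are all hypothetical, and the levelplot is left to the reader.

```r
set.seed(42)
# Hypothetical toy data: 60 observations, 4 variables, 15 values removed from each.
toy <- as.data.frame(matrix(rnorm(240), ncol = 4,
                            dimnames = list(NULL, c("a", "b", "c", "d"))))
for (v in names(toy)) toy[sample(60, 15), v] <- NA

# 1 = present, 0 = missing.
ind <- !is.na(as.matrix(toy))
storage.mode(ind) <- "integer"

# Keep only variables that take both values.
keep <- apply(ind, 2, function(col) length(unique(col)) == 2)
ind <- ind[, keep, drop = FALSE]

# Frequency of each distinct missingness pattern.
head(sort(table(apply(ind, 1, paste, collapse = "")), decreasing = TRUE))

# mona() performs a divisive hierarchical clustering of binary variables.
if (requireNamespace("cluster", quietly = TRUE)) {
  m.mona <- cluster::mona(ind)
  head(m.mona$order)
}
```

With the weather data, the indicator matrix would need to be built before na.roughfix() is applied, since imputation removes the missingness pattern.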

9 Further Reading
The Rattle Book, published by Springer, provides a comprehensive
introduction to data mining and analytics using Rattle and R. It
is available from Amazon. Other documentation on a broader
selection of R topics of relevance to the data scientist is freely
available from http://datamining.togaware.com, including the
Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from
http://onepager.togaware.com. In particular, follow the links
on the website marked with a *, which indicate the generally more
developed OnePageR modules.

10 References
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Williams GJ (2009). “Rattle: A Data Mining GUI for R.” The R Journal, 1(2), 45–55. URL
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.
Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for
knowledge discovery. Use R! Springer, New York. URL
http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.
Williams GJ (2013). rattle: Graphical user interface for data mining in R. R package version
2.6.27, URL http://rattle.togaware.com/.
Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted Subspace KMeans
Clustering. R package version 1.3.8, URL http://CRAN.R-project.org/package=wskm.

This document, sourced from ClustersO.Rnw revision 207, was processed by KnitR version 1.2
of 2013-04-10 and took 2.7 seconds to process. It was generated by gjw on nyx running Ubuntu
13.04 with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It
completed the processing 2013-08-02 06:23:39.
