Cluster Analysis Techniques

This document discusses hierarchical clustering techniques for analyzing PFGE fingerprinting data. Key points:
- Dendrograms can have multiple valid interpretations and may not accurately reflect relationships between clusters. Alternative clustering methods such as single and complete linkage may provide different views.
- UPGMA averages similarities between clusters, which can obscure relationships; other methods consider maximum or minimum similarities.
- Degenerate solutions are common with discrete PFGE data, and different analyses may produce different dendrograms.
- Visual inspection and validation are needed to properly interpret any dendrogram. Large datasets can provide more robust clusterings.


Dendrograms & PFGE analysis

Paul Vauterin, Applied Maths BVBA

Outline of this talk:
- Simple explanation of mainstream hierarchical clustering (UPGMA)
- Interesting alternatives to UPGMA
- How to interpret a dendrogram?
- Problem of degenerate (equivalent) solutions

Bottom line:
- Be careful in interpreting dendrograms!
- Consider alternatives to UPGMA (i.e. single & complete linkage)

Relevance of cluster analysis

Cluster analysis is the mathematical study of methods for recognizing natural groups within a set of entities.
- Simply a tool that groups together related entities, based on the observed similarities between them
- Used as a data exploration/mining tool in virtually every field (psychology, economy, finance, astronomy, ...)
- Applicable to virtually any type of data: only a similarity matrix is needed
- Applicable to large data sets (>10 000 entities)
- Easy to interpret (simple & intuitive mathematical principle), so weak points are easier to anticipate

UPGMA algorithm
Organisms A, B, C, D

Biological characterisation technique

Data set: 127.3kb, 125.3kb, 140.9kb, 128.6kb, 83.6kb, 56.4kb, ... 101.6kb, 66.8kb, ... 129.6kb, 58.0kb, ... 101.3kb, 98.2kb, ...

Matrix of pairwise similarities

       A    B    C    D
A    100
B     68  100
C     76   96  100
D     95   85   71  100

1. Find & merge the two best matching: B + C (similarity 96)

2. Update the similarities (averaging)

       BC    A    D
BC     96
A      72  100
D      78   95  100

3. Find & merge the two best matching: A + D (similarity 95)

4. Update the similarities

       BC   AD
BC     96
AD     75   95

5. Final merge: BC + AD (similarity 75)

Resulting dendrogram (similarity scale 80 to 100), leaf order: B C A D
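The merge sequence of this example can be reproduced with a minimal Python sketch. The names `SIM`, `group_similarity` and `upgma` are illustrative only and are not taken from any software mentioned in the talk. The sketch assumes that, on a tie, the first best pair found is merged.

```python
from itertools import combinations

# Pairwise similarities (%) from the example matrix.
SIM = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("AD"): 95,
    frozenset("BC"): 96, frozenset("BD"): 85, frozenset("CD"): 71,
}

def group_similarity(g1, g2):
    """UPGMA rule: average of all similarities between members of two groups."""
    values = [SIM[frozenset((a, b))] for a in g1 for b in g2]
    return sum(values) / len(values)

def upgma(labels):
    """Return the merges as a list of (group1, group2, similarity level)."""
    groups = [(label,) for label in labels]
    merges = []
    while len(groups) > 1:
        # Find the two best-matching groups (the first pair wins on a tie).
        g1, g2 = max(combinations(groups, 2), key=lambda p: group_similarity(*p))
        merges.append((g1, g2, group_similarity(g1, g2)))
        # Replace the two merged groups by their union.
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return merges

for g1, g2, level in upgma("ABCD"):
    print("".join(g1), "+", "".join(g2), "at", level)
```

The merges come out as B + C at 96, A + D at 95 and BC + AD at 75, matching the five steps of the example.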

UPGMA algorithm
Crucial step: determine similarities between two groups

- UPGMA: average of all similarities
- Single linkage: highest similarity (best case scenario)
- Complete linkage: lowest similarity (worst case scenario)

... Other alternative schemes have been developed ...
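As a small illustration, the three rules can be applied to the same pair of groups. The sketch below uses the similarities between group {B, C} and group {A, D} from the four-organism example matrix. The function name `linkage_similarity` and the method labels are hypothetical.

```python
# Similarities (%) between members of group {B, C} and group {A, D},
# taken from the four-organism example matrix.
cross_similarities = {
    ("B", "A"): 68, ("B", "D"): 85,
    ("C", "A"): 76, ("C", "D"): 71,
}

def linkage_similarity(values, method):
    """Similarity between two groups under one of the three rules."""
    values = list(values)
    if method == "upgma":      # average of all similarities
        return sum(values) / len(values)
    if method == "single":     # highest similarity (best case)
        return max(values)
    if method == "complete":   # lowest similarity (worst case)
        return min(values)
    raise ValueError(f"unknown method: {method}")

results = {
    method: linkage_similarity(cross_similarities.values(), method)
    for method in ("upgma", "single", "complete")
}
print(results)
```

The same two groups would join at 75 under UPGMA, at 85 under single linkage and at 68 under complete linkage, so the branch height depends on the rule chosen.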

How to interpret a dendrogram?

UPGMA tree:

A B C

What does this tell you? Are A & B closer to each other than to C? Not necessarily true!

Fundamental problem: potential alternative solutions
- Equally valid
- Hidden
- Might give another view
This is not restricted to UPGMA or PFGE; it is a major problem for most methods that summarise the original data.

Degenerate dendrogram solutions

A simple example: PFGE, 3 organisms (A, B, C)

[Figure: band patterns of A, B, C]

Similarities:

       A    B    C
A    100   50   50
B         100    0
C              100

UPGMA rule: join the highest similarities. A+B and A+C tie at 50, so there are two equally valid solutions:
- First A+B (leaf order A B C)
- First A+C (leaf order A C B)

How to solve this? Detect and visualise in a special way.

Happens very often with discrete data with few degrees of freedom (bands on PFGE, but also MLST, MLVA, Spa typing, ...)
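Detecting such a situation amounts to checking whether more than one pair shares the highest similarity at a merge step. The following is a minimal sketch for the three-organism example. `best_pairs` is a hypothetical helper, not a function of any particular package.

```python
from itertools import combinations

# Similarities (%) from the three-organism example.
SIM = {frozenset("AB"): 50, frozenset("AC"): 50, frozenset("BC"): 0}

def best_pairs(labels, sim):
    """Return every pair that shares the highest similarity."""
    pairs = list(combinations(labels, 2))
    best = max(sim[frozenset(p)] for p in pairs)
    return [p for p in pairs if sim[frozenset(p)] == best]

ties = best_pairs("ABC", SIM)
if len(ties) > 1:
    print("Degenerate step, equally valid first merges:", ties)
```

Here both (A, B) and (A, C) are returned, so either could be merged first and the dendrogram is not unique.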

Degenerate dendrogram solutions

PFGE + band matching: even worse!

[Figure: band patterns of A, B, C]

       A    B    C
A    100  100    0
B         100  100
C              100

A=B and B=C, yet A≠C

This compromises the concept of a cluster of identical fingerprints:
- Relaxed view (single linkage): each member is identical to at least one other in the cluster
- Strict view (complete linkage): each member is identical to all other members of the cluster

Human inspection is ALWAYS needed anyway!
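The difference between the two views can be sketched in a few lines of Python for the A, B, C example. The helpers `relaxed_clusters` and `is_strict` are hypothetical names. The relaxed view is implemented here as simple chaining of pairwise identities, which is an assumption about how such clusters would be built.

```python
from itertools import combinations

# Band-matching result from the example: A=B and B=C, but A and C differ.
IDENTICAL = {frozenset("AB"), frozenset("BC")}

def relaxed_clusters(labels, identical):
    """Relaxed view: grow clusters by chaining pairwise identities."""
    clusters = [{label} for label in labels]
    for pair in identical:
        # Merge every cluster that contains a member of this identical pair.
        touching = [c for c in clusters if c & pair]
        merged = set().union(*touching)
        clusters = [c for c in clusters if not (c & pair)] + [merged]
    return clusters

def is_strict(cluster, identical):
    """Strict view: every pair of members must be identical."""
    return all(frozenset(p) in identical
               for p in combinations(sorted(cluster), 2))

clusters = relaxed_clusters("ABC", IDENTICAL)
print(clusters, [is_strict(c, IDENTICAL) for c in clusters])
```

The relaxed view puts A, B and C in one cluster, but that cluster fails the strict test because A and C are not identical.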

Case Study

PFGE fingerprints
(Dis)similarity: # of different bands (dendrogram scale: 6 5 4 3 2 1 0)

Complete linkage clustering
Result = groups whose members have no more than n bands different from any other member
= Good starting point for pattern naming

Single linkage clustering
Result = groups whose members have no more than n bands different from some other member
= Good starting point for finding clusters of related patterns
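The two guarantees can be checked directly on a group of patterns. The sketch below uses hypothetical band patterns, each represented as a set of band classes. It assumes that the number of different bands is the size of the symmetric difference between two such sets. The functions test the stated guarantees only and do not perform the clustering itself.

```python
from itertools import combinations

# Hypothetical band patterns: each fingerprint is a set of band classes.
PATTERNS = {
    "P1": {1, 2, 3, 4},
    "P2": {1, 2, 3, 5},   # 2 bands different from P1
    "P3": {1, 2, 5, 6},   # 2 bands different from P2, 4 from P1
}

def different_bands(x, y):
    """Number of bands present in one pattern but not in the other."""
    return len(PATTERNS[x] ^ PATTERNS[y])

def holds_complete(group, n):
    """Complete linkage guarantee: every pair differs by at most n bands."""
    return all(different_bands(x, y) <= n
               for x, y in combinations(group, 2))

def holds_single(group, n):
    """Single linkage guarantee: each member is within n bands of some other member."""
    return all(any(different_bands(x, y) <= n for y in group if y != x)
               for x in group)

group = ["P1", "P2", "P3"]
print(holds_single(group, 2), holds_complete(group, 2))
```

With n = 2 the chain P1, P2, P3 satisfies the single linkage guarantee but not the complete linkage one, because P1 and P3 differ by 4 bands.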

How to interpret a dendrogram?

Dendrogram: [Figure: tree with leaves A, B, C]

Suppose the solution is unique. What does this tell you? Still not necessarily anything!
Garbage In, Garbage Out: a cluster algorithm will always produce a tree.

Need for methods to address the reliability of a dendrogram
- Phylogenetics: standard tool = Felsenstein's bootstrap
- Not (well) suited to most typing data sets (PFGE, MLST, VNTR)

How to interpret a dendrogram?

Back to less sophisticated methods
- E.g. error flags on cluster levels. Principle: each branch is an average representative of a variety of similarities -> show standard deviation
- Visual inspection
- Cross-validation
- Large data sets are your friends!
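To make the error-flag principle concrete, the sketch below computes the average and the standard deviation of the similarities summarised by one branch. It uses the similarities between groups {B, C} and {A, D} from the earlier UPGMA example. Using the population standard deviation (`pstdev`) is an assumption, as the talk does not say which variant is displayed.

```python
from statistics import mean, pstdev

# Similarities (%) between members of {B, C} and {A, D} in the UPGMA example.
branch_similarities = [68, 85, 76, 71]

level = mean(branch_similarities)     # height at which the branch is drawn
spread = pstdev(branch_similarities)  # candidate "error flag" for that branch

print(f"branch at {level}% with standard deviation {spread:.1f}")
```

The branch is drawn at 75%, but the underlying similarities range from 68 to 85, a spread of roughly 6.4 percentage points that the dendrogram alone does not show.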

Recipe 1: finding seed groups for pattern naming

1. Make sure you have a temporary field.
2. Install the plugin Dendrogram tools.
3. Select Complete Linkage and Different bands.
4. Use Fill field with cluster number.
5. Use 100% similarity, specify the minimum group size, and choose the destination field. This will overwrite any content!

Results: the resulting groups are guaranteed to consist of all identical fingerprints and have at least 5 members.

Warning: numbering is not persistent: another data set might give different values.
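Outside the software, the outcome of this recipe can be approximated with a short Python sketch. It assumes each fingerprint is stored as a set of band classes, so that identity is exact and transitive. The function `seed_groups` and the entry keys are hypothetical and do not reproduce the plugin's actual behaviour.

```python
from collections import defaultdict

def seed_groups(patterns, min_size=5):
    """Number the groups of identical band patterns with at least min_size members."""
    members = defaultdict(list)
    for key, bands in patterns.items():
        members[frozenset(bands)].append(key)
    large = [keys for keys in members.values() if len(keys) >= min_size]
    # Largest group gets number 1; the numbering depends on the data set.
    large.sort(key=len, reverse=True)
    return {key: number
            for number, keys in enumerate(large, start=1)
            for key in keys}

# Hypothetical data set: five identical fingerprints and two others.
patterns = {f"iso{i}": {1, 2, 3} for i in range(5)}
patterns.update({"iso5": {1, 2, 4}, "iso6": {1, 2, 4}})

print(seed_groups(patterns))
```

Only the five identical fingerprints receive a cluster number; the pair sharing the other pattern is below the minimum group size and stays unnumbered.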

Recipe 2: find largest clusters in data set

1. Select Single Linkage and Different bands.
2. Use 100% similarity (or 99% for 1 band difference), specify the minimum group size, and choose the destination field.
3. Use the Chart & Statistics tool and add the Temp field.
4. Use sort by frequency.

Results:
- Clusters ranked by size (use CTRL+click to select entries)
- Fingerprints not associated with any (large) cluster
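The sort-by-frequency step corresponds to counting how often each cluster number occurs in the temporary field. The sketch below uses hypothetical field content, with `None` standing for fingerprints that were not assigned to any (large) cluster. It illustrates the idea only and is not the tool's implementation.

```python
from collections import Counter

# Hypothetical content of the temporary field: cluster number per entry,
# with None for fingerprints not assigned to any (large) cluster.
temp_field = {"e1": 2, "e2": 1, "e3": 2, "e4": None, "e5": 2, "e6": 1}

counts = Counter(v for v in temp_field.values() if v is not None)
ranked = counts.most_common()  # clusters ranked by size, largest first
unassigned = [k for k, v in temp_field.items() if v is None]

print(ranked, unassigned)
```

Cluster 2 comes first with three entries, followed by cluster 1 with two, and entry e4 is reported as unassigned.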
