Dendrograms & PFGE analysis
Paul Vauterin, Applied Maths BVBA
Outline of this talk:
- Simple explanation of mainstream hierarchical clustering (UPGMA)
- Interesting alternatives to UPGMA
- How to interpret a dendrogram?
- Problem of degenerate (equivalent) solutions
Bottom line:
- Be careful in interpreting dendrograms!
- Consider alternatives to UPGMA (i.e. single & complete linkage)
Relevance of cluster analysis
Cluster Analysis is the mathematical study of methods for recognizing natural groups within a set of entities
- Simply a tool that groups together related entities, based on the observed similarities between them
- Used as a data exploration/mining tool in virtually every field (psychology, economics, finance, astronomy, ...)
- Applicable to virtually any type of data: only a similarity matrix is needed
- Applicable to large data sets (>10 000 entities)
- Easy to interpret (simple & intuitive mathematical principle), so weak points are easier to anticipate
UPGMA algorithm
Organisms A, B, C, D
Biological characterisation technique
127.3kb, 125.3kb, 140.9kb, 128.6kb, 83.6kb, 56.4kb, ... 101.6kb, 66.8kb, ... 129.6kb, 58.0kb, ... 101.3kb, 98.2kb, ...
Data set
Matrix of pairwise similarities

       A    B    C    D
  A  100
  B   68  100
  C   76   96  100
  D   95   85   71  100
1. Find & merge the two best-matching entries: B + C (similarity 96)

2. Update the similarities (averaging)

        BC    A    D
  BC    96
  A     72  100
  D     78   95  100

3. Find & merge the two best-matching entries: A + D (similarity 95)

4. Update the similarities

        BC   AD
  BC    96
  AD    75   95

5. Final merge: BC + AD (similarity 75)

[Figure: resulting dendrogram with leaf order B, C, A, D; similarity axis with ticks at 80, 90, 100]
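The five steps above can be reproduced with a short Python sketch. It is only an illustration of the averaging rule on the slide's 4x4 matrix, not the implementation used by any particular software package; the function name `upgma` and the dictionary layout are assumptions made for this example. When two pairs tie, it simply takes the first one it finds.

```python
from itertools import combinations

def upgma(labels, sim):
    """Agglomerate clusters by average pairwise similarity (UPGMA).

    labels: list of entry names
    sim:    dict mapping frozenset({a, b}) -> similarity for every pair
    Returns the merges in order as (cluster_1, cluster_2, similarity).
    """
    clusters = [(name,) for name in labels]
    merges = []

    def group_similarity(c1, c2):
        # UPGMA: plain average over all cross-pairs of original entries
        values = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(values) / len(values)

    while len(clusters) > 1:
        # best-matching pair of clusters (first one found if there is a tie)
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: group_similarity(*pair))
        merges.append((c1, c2, group_similarity(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Similarity matrix from the slide (lower triangle)
similarities = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("BC"): 96,
    frozenset("AD"): 95, frozenset("BD"): 85, frozenset("CD"): 71,
}

merges = upgma(["A", "B", "C", "D"], similarities)
for c1, c2, level in merges:
    print("+".join(c1), "with", "+".join(c2), "at", level)
# B with C at 96.0
# A with D at 95.0
# B+C with A+D at 75.0
```

The merge levels 96, 95 and 75 match the updated matrices in steps 2 and 4.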
Crucial step: determine similarities between two groups
- UPGMA: average of all similarities
- Single linkage: highest similarity (best case scenario)
- Complete linkage: lowest similarity (worst case scenario)
- ... Other alternative schemes have been developed ...
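To make the difference between the three rules concrete, the following sketch computes the similarity between the groups BC and AD from the example matrix under each rule. The function name `group_similarity` and its `rule` parameter are hypothetical, chosen only for this illustration.

```python
def group_similarity(group_1, group_2, sim, rule="upgma"):
    """Similarity between two groups under one of the three linkage rules."""
    values = [sim[frozenset((a, b))] for a in group_1 for b in group_2]
    if rule == "upgma":
        return sum(values) / len(values)   # average of all similarities
    if rule == "single":
        return max(values)                 # highest similarity (best case)
    if rule == "complete":
        return min(values)                 # lowest similarity (worst case)
    raise ValueError(f"unknown rule: {rule}")

# Similarity matrix from the UPGMA example
similarities = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("BC"): 96,
    frozenset("AD"): 95, frozenset("BD"): 85, frozenset("CD"): 71,
}

results = {rule: group_similarity("BC", "AD", similarities, rule)
           for rule in ("upgma", "single", "complete")}
print(results)
# {'upgma': 75.0, 'single': 85, 'complete': 68}
```

The same two groups therefore join at 75, 85 or 68 depending only on the rule chosen.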
How to interpret a dendrogram?
[Figure: UPGMA tree with leaves A, B, C]
What does this tell you? That A & B are closer to each other than to C? Not necessarily true!
Fundamental problem: potential alternative solutions
- Equally valid
- Hidden
- Might give another view
This is not restricted to UPGMA or PFGE; it is a major problem for most methods that summarise the original data.
Degenerate dendrogram solutions
A simple example: PFGE, 3 organisms (A, B, C)
[Figure: band patterns of A, B and C]
Similarities:

       A    B    C
  A  100   50   50
  B       100    0
  C            100

UPGMA rule: join the highest similarities. Here A-B and A-C tie at 50, so two first merges are equally valid:
- First A+B
- First A+C
How to solve this? Detect and visualise in a special way
[Figure: dendrograms with the alternative leaf orders A B C and A C B]
Happens very often with discrete data with few degrees of freedom (bands on PFGE, but also MLST, MLVA, Spa typing, ...)
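Detecting a degenerate step is straightforward in principle: look for more than one pair sharing the highest similarity. The sketch below does this for the three-organism example; the function name `tied_best_pairs` is an assumption for illustration, and it only checks the first merge step, not the whole tree.

```python
from itertools import combinations

def tied_best_pairs(labels, sim):
    """Return every pair of entries that shares the highest similarity."""
    pairs = list(combinations(labels, 2))
    best = max(sim[frozenset(pair)] for pair in pairs)
    return [pair for pair in pairs if sim[frozenset(pair)] == best]

# Similarities from the three-organism example
similarities = {
    frozenset("AB"): 50, frozenset("AC"): 50, frozenset("BC"): 0,
}

candidates = tied_best_pairs(["A", "B", "C"], similarities)
print(candidates)            # [('A', 'B'), ('A', 'C')]
print(len(candidates) > 1)   # True: the first merge is not unique
```

A complete check would repeat this test after every merge, since ties can also appear between groups later in the clustering.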
PFGE + band matching: even worse!
[Figure: band patterns of A, B and C]

       A    B    C
  A  100  100    0
  B       100  100
  C            100

A = B and B = C, but A ≠ C
This compromises the concept of a cluster of identical fingerprints:
- Relaxed view (single linkage): each member is identical to at least one other member of the cluster
- Strict view (complete linkage): each member is identical to all other members of the cluster
Human inspection is ALWAYS needed anyway!
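The two views can be written down as two small checks on a candidate group. This is a sketch only; the function names and the representation of "identical" as a set of pairs are assumptions made for this example.

```python
from itertools import combinations

def is_relaxed_cluster(members, identical):
    """Relaxed view: each member is identical to at least one other member."""
    return all(
        any(frozenset((m, other)) in identical for other in members if other != m)
        for m in members
    )

def is_strict_cluster(members, identical):
    """Strict view: each member is identical to all other members."""
    return all(frozenset(pair) in identical for pair in combinations(members, 2))

# Pairs with 100% similarity in the band matching example: A=B and B=C only
identical_pairs = {frozenset("AB"), frozenset("BC")}
group = ["A", "B", "C"]

print(is_relaxed_cluster(group, identical_pairs))  # True
print(is_strict_cluster(group, identical_pairs))   # False, because A and C differ
```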
Case Study
PFGE fingerprints
(Dis)similarity: # of different bands

Complete linkage clustering
[Figure: dendrogram; axis: # of different bands, from 6 down to 0]
Result = groups whose members differ by no more than n bands from any other member
= Good starting point for pattern naming

Single linkage clustering
[Figure: dendrogram; axis: # of different bands, from 6 down to 0]
Result = groups whose members differ by no more than n bands from at least one other member
= Good starting point for finding clusters of related patterns
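The effect of the two linkage rules on the resulting groups can be sketched with a small agglomerative routine that stops merging once the group distance exceeds n different bands. The fingerprint names P, Q, R, S and their band differences are invented purely for illustration, and the function `threshold_groups` is a hypothetical helper, not part of any software mentioned in the talk.

```python
from itertools import combinations

def threshold_groups(labels, diff, max_diff, linkage="complete"):
    """Merge fingerprints until the closest groups differ by more than max_diff bands.

    diff maps frozenset({a, b}) -> number of different bands between a and b.
    """
    if linkage not in ("single", "complete"):
        raise ValueError(f"unknown linkage: {linkage}")
    # complete linkage uses the largest difference, single linkage the smallest
    pick = max if linkage == "complete" else min
    clusters = [(name,) for name in labels]

    def group_distance(c1, c2):
        return pick(diff[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: group_distance(*pair))
        if group_distance(c1, c2) > max_diff:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return sorted(tuple(sorted(c)) for c in clusters)

# Hypothetical numbers of different bands between four fingerprints
band_diff = {
    frozenset("PQ"): 0, frozenset("PR"): 1, frozenset("QR"): 1,
    frozenset("RS"): 2, frozenset("PS"): 3, frozenset("QS"): 3,
}
names = ["P", "Q", "R", "S"]

complete_groups = threshold_groups(names, band_diff, max_diff=2, linkage="complete")
single_groups = threshold_groups(names, band_diff, max_diff=2, linkage="single")
print(complete_groups)  # [('P', 'Q', 'R'), ('S',)]
print(single_groups)    # [('P', 'Q', 'R', 'S')]
```

With n = 2, complete linkage keeps S out because it differs by 3 bands from P and Q, while single linkage admits S because it is within 2 bands of R.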
How to interpret a dendrogram?
[Figure: dendrogram with leaves A, B, C]
Suppose the solution is unique. What does this tell you? ... Still not necessarily anything!
- Garbage in, garbage out
- A cluster algorithm will always produce a tree
Need for methods to address the reliability of a dendrogram
- Phylogenetics: standard tool = Felsenstein's bootstrap
- Not (well) suited to most typing data sets (PFGE, MLST, VNTR)
Back to less sophisticated methods
- E.g. error flags on cluster levels
  Principle: each branch is an average representative of a variety of similarities -> show standard deviation
- Visual inspection
- Cross-validation
- Large data sets are your friends!
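The error-flag principle mentioned above can be illustrated on the final branch of the earlier UPGMA example (BC joined with AD). The sketch assumes that the flag is the population standard deviation of the similarities averaged at that branch; the exact definition used by a given software package may differ, and the function name `branch_spread` is hypothetical.

```python
from statistics import mean, pstdev

def branch_spread(group_1, group_2, sim):
    """Mean and standard deviation of the similarities averaged at one branch."""
    values = [sim[frozenset((a, b))] for a in group_1 for b in group_2]
    # assumption: population standard deviation over the cross-group pairs
    return mean(values), pstdev(values)

# Similarity matrix from the UPGMA example
similarities = {
    frozenset("AB"): 68, frozenset("AC"): 76, frozenset("BC"): 96,
    frozenset("AD"): 95, frozenset("BD"): 85, frozenset("CD"): 71,
}

level, spread = branch_spread("BC", "AD", similarities)
print(level, round(spread, 2))   # 75 6.44
```

The branch drawn at 75 thus summarises similarities ranging from 68 to 85, which is the kind of spread an error flag is meant to expose.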
Recipe 1: finding seed groups for pattern naming
- Make sure you have a temporary field
- Install the plugin Dendrogram tools
- Select Complete Linkage and Different bands
- Use Fill field with cluster number
- Use 100% similarity
- Specify minimum group size
- Choose destination field (will overwrite any content!)
Results: resulting groups are guaranteed to consist of all identical fingerprints and have at least 5 members
Warning: numbering is not persistent: another data set might give different values
Recipe 2: find largest clusters in data set
- Select Single Linkage and Different bands
- Use 100% similarity (or 99% for 1 band difference)
- Specify minimum group size
- Choose destination field
- Use Chart & Statistics tool
- Add Temp field
- Use sort by frequency
- Clusters ranked by size; use CTRL+click to select entries
- Fingerprints not associated with any (large) cluster
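Outside the software, the last steps of this recipe (sort by frequency, then separate the unassigned fingerprints) amount to counting cluster numbers. The sketch below shows the idea on invented data; the entry keys, the field contents and the function name `rank_clusters` are all hypothetical and do not correspond to any real export format.

```python
from collections import Counter

def rank_clusters(cluster_field):
    """Rank cluster numbers by how many entries carry them (largest first)."""
    counts = Counter(number for number in cluster_field.values()
                     if number is not None)
    return counts.most_common()

# Hypothetical content of the temporary field: entry key -> cluster number,
# None for fingerprints that were not assigned to any (large) cluster
temp_field = {
    "entry1": 1, "entry2": 1, "entry3": 1,
    "entry4": 2, "entry5": 2,
    "entry6": None,
}

ranking = rank_clusters(temp_field)
unassigned = [key for key, number in temp_field.items() if number is None]
print(ranking)      # [(1, 3), (2, 2)]
print(unassigned)   # ['entry6']
```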