Tree building algorithms
Microbiology
Microbial Systematics
Presented by:
Miss Vidya Vijaykumar Sonalkar
CSIR NET
The steps in a phylogenetic analysis
are as follows:
1. Decide which gene and species to analyze (small-subunit ribosomal
RNA [SSU rRNA])
2. Determine the gene sequences (polymerase chain reaction [PCR] and
DNA sequencing, database “mining”)
3. Identify homologous residues (sequence alignment)
4. Perform the phylogenetic analysis
The most common type of phylogenetic analysis is tree construction. A tree is nothing more than a graph representing the
similarity relationships between the sequences in an alignment.
Tree construction starts with an alignment. Neighbor joining is a distance matrix method, meaning that the alignment is
first reduced to a table of evolutionary distances, a distance matrix.
The distance matrix cannot be generated directly from the alignment, however, because actual evolutionary distance
cannot be directly measured. Instead, the alignment is reduced to a table of observed (measurable) similarity, the
similarity matrix. The distance matrix is calculated from the similarity matrix, and then the tree is generated from the
distance matrix.
Generating a similarity matrix
In this example, sequences A and B are 0.90 (90%) similar, A and C are 0.75 similar, B and C are 0.75 similar, and so forth.
Note that values on the diagonal (A:A, B:B, . . .) do not need to be calculated; they are always 1.
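The similarity calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequences are already aligned to equal length and defining similarity simply as the fraction of matching positions; the toy alignment is hypothetical, not from the text.

```python
# Sketch: build a pairwise similarity matrix from an alignment.
# Assumes pre-aligned sequences of equal length; similarity is the
# fraction of positions at which two sequences carry the same base.

def similarity(seq1, seq2):
    """Fraction of aligned positions that match."""
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)

def similarity_matrix(seqs):
    """Dict-of-dicts similarity table; the diagonal is always 1."""
    names = sorted(seqs)
    return {x: {y: similarity(seqs[x], seqs[y]) for y in names} for x in names}

# Hypothetical toy alignment (three 10-base sequences):
aln = {
    "A": "GATTACAGCT",
    "B": "GATTACAGTT",   # differs from A at 1 of 10 positions
    "C": "GATCACAGTA",   # differs from A at 3 of 10 positions
}
m = similarity_matrix(aln)
print(m["A"]["B"])  # 0.9
print(m["A"]["A"])  # 1.0
```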
Converting a similarity matrix into an evolutionary distance matrix
More than one evolutionary change at a single position (e.g., A to G to U, or A to G in one sequence and the same A to U in
another) counts as only one difference between the two sequences, and in the case of reversion or convergence it counts as
no change at all (e.g., A to G to A, or A to G in one organism and the same A to G in another). As a result, the observed
similarity between two sequences underestimates the evolutionary distance that separates them.
One common way to estimate evolutionary distances from similarity is the Jukes and Cantor method, which uses the
following equation:
Evolutionary distance = −(3/4) ln[1 − (4/3)(1 − similarity)]
Similarity and distance are very closely related initially (e.g., 0.90 similarity ≈ 0.10 distance) but level off at 0.25 similarity,
where evolutionary distance is infinite. This makes sense; for two sequences that are very similar, the probable frequency of
more than one change at a single site is low, requiring only a small correction, whereas two sequences that have changed
beyond all recognition (infinite evolutionary distance) are still approximately 25% similar just because there are only four
bases and so approximately one of the four will match entirely by chance.
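The Jukes and Cantor correction is straightforward to compute directly from the equation above; a sketch, with the saturation point at 0.25 similarity handled explicitly:

```python
import math

def jukes_cantor(similarity):
    """Jukes-Cantor correction: observed similarity -> evolutionary distance.

    distance = -(3/4) * ln(1 - (4/3) * (1 - similarity))
    At or below 0.25 similarity the sequences are saturated and the
    distance is infinite.
    """
    if similarity <= 0.25:
        return math.inf
    p = 1.0 - similarity  # observed fraction of differing sites
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

print(round(jukes_cantor(0.90), 3))  # 0.107: close to 0.10, small correction
print(jukes_cantor(0.25))            # inf
```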
Generating a tree from a distance matrix
In the neighbor-joining method, the structure of the tree is determined first and then the branch lengths are fit to this
skeleton.
The tree starts out with a single internal node and a branch out to each sequence: an n-pointed star, where n is the number
of sequences in the alignment.
The pair of sequences with the smallest evolutionary distance separating them is joined onto a single branch (i.e., the
neighbors are joined, hence the name of the method), and then the process is repeated after merging these two sequences
in the distance matrix by averaging their distances from every other sequence in the matrix, continuing until the structure
of the entire tree is determined.
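The join-and-merge loop described above can be sketched as follows. This is only the simplified scheme the text describes (join the closest pair, average its distances); real neighbor joining additionally rate-corrects the choice of pair, and the branch lengths, which the method fits afterwards, are omitted. The distance matrix is hypothetical.

```python
from itertools import combinations

def join_neighbors(dist):
    """dist: symmetric {name: {name: distance}}; returns a nested-tuple tree."""
    nodes = {k: dict(v) for k, v in dist.items()}  # don't mutate the input
    while len(nodes) > 1:
        # find the pair separated by the smallest evolutionary distance
        a, b = min(combinations(nodes, 2), key=lambda p: nodes[p[0]][p[1]])
        merged = (a, b)
        others = [n for n in nodes if n not in (a, b)]
        # merge a and b by averaging their distances to everything else
        nodes[merged] = {n: (nodes[a][n] + nodes[b][n]) / 2 for n in others}
        for n in others:
            nodes[n][merged] = nodes[merged][n]
        del nodes[a], nodes[b]
    return next(iter(nodes))

# Hypothetical distance matrix for four sequences:
d = {
    "A": {"B": 0.10, "C": 0.30, "D": 0.40},
    "B": {"A": 0.10, "C": 0.30, "D": 0.40},
    "C": {"A": 0.30, "B": 0.30, "D": 0.35},
    "D": {"A": 0.40, "B": 0.40, "C": 0.35},
}
print(join_neighbors(d))  # ('D', ('C', ('A', 'B'))): A and B join first
```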
Fitch-Margoliash: an alternative distance-matrix treeing method
Like neighbor joining, the Fitch-Margoliash method works from a distance matrix, but it chooses among candidate trees by
how well each tree's branch lengths can be fit to the pairwise distances.
Parsimony
No distance matrix is calculated; instead, trees are searched and each ancestral sequence is calculated, allowing
for all uncertainties, in a process analogous to solving Sudoku puzzles
The number of “mutations” required is added up, and the tree with the best score wins
Testing every possible tree is not usually possible (the number of trees grows exponentially with the number of
sequences), so a variety of search algorithms are used to examine only the most likely trees. Likewise, there are a
variety of ways of counting (scoring) sequence changes
Parsimony methods are typically slower than distance-matrix methods but very much faster than the maximum-
likelihood methods
Parsimony uses more of the information in an alignment, since it does not reduce all of the individual sequence
differences to a distance matrix, but it seems to work best with relatively closely related sequences and is not
usually used for rRNA sequences.
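The scoring step above can be sketched with the Fitch small-parsimony algorithm, a standard scheme for counting the minimum number of changes one alignment column requires on a fixed tree. The trees and bases below are hypothetical.

```python
# Fitch small-parsimony sketch: walk the tree from the leaves up, keeping
# the set of possible ancestral bases at each node; each time the two
# child sets don't overlap, one "mutation" is charged.

def fitch_score(tree, leaf_base):
    """tree: nested tuples of leaf names; returns (possible_bases, changes)."""
    if isinstance(tree, str):                  # leaf: its base is observed
        return {leaf_base[tree]}, 0
    left, right = tree
    lset, lcost = fitch_score(left, leaf_base)
    rset, rcost = fitch_score(right, leaf_base)
    common = lset & rset
    if common:                                 # children can agree: no change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1      # disagreement: one mutation

# One alignment column for four taxa (hypothetical data):
column = {"A": "G", "B": "G", "C": "A", "D": "A"}
tree1 = (("A", "B"), ("C", "D"))   # groups the G's and the A's together
tree2 = (("A", "C"), ("B", "D"))   # splits them up
print(fitch_score(tree1, column)[1])  # 1 change needed
print(fitch_score(tree2, column)[1])  # 2 changes: tree1 wins
```

Summing this score over every column of the alignment, and repeating for each candidate tree, gives the parsimony scores to compare.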
Maximum likelihood
The maximum-likelihood method turns the tree construction process on its head, starting with a cluster analysis to
generate a “guide” tree, from which a very complete substitution model is calculated
The algorithm then goes back and calculates the likelihood of any particular tree by summing the probabilities of
all of the possible intermediates required to get to the observed sequences.
Maximum-likelihood tree construction is by far the most computationally intensive of the methods in common use.
However, it is generally also the best, in the sense that the trees are more consistent and robust. The limitation is
that fewer and shorter sequences can be analyzed by the maximum-likelihood method because of its computational
demands.
A tree that might take a few seconds by neighbor joining or a few minutes by parsimony or Fitch can take a few
hours or a couple of days by maximum likelihood.
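The "summing over all possible intermediates" idea can be sketched for the simplest case: one alignment column and two leaves joined at an unobserved root, under a Jukes-Cantor substitution model. The branch lengths are hypothetical, and a real analysis would multiply such site likelihoods across the whole alignment and over a full tree.

```python
import math

BASES = "ACGT"

def p_jc(from_base, to_base, branch_len):
    """Jukes-Cantor probability of observing to_base after branch_len
    expected substitutions per site, starting from from_base."""
    e = math.exp(-4.0 * branch_len / 3.0)
    return 0.25 + 0.75 * e if from_base == to_base else 0.25 - 0.25 * e

def site_likelihood(obs1, obs2, b1, b2):
    """Two leaves joined at a root: sum over the 4 possible (unobserved)
    root bases, each weighted by a uniform 1/4 prior."""
    return sum(
        0.25 * p_jc(root, obs1, b1) * p_jc(root, obs2, b2)
        for root in BASES
    )

# Identical observed bases are more likely when the branches are short:
print(site_likelihood("A", "A", 0.1, 0.1) > site_likelihood("A", "C", 0.1, 0.1))  # True
```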
Bayesian inference
Bayesian inference is a relatively new approach to tree construction. This approach starts with a random tree structure,
random branch lengths, and random substitution parameters for an alignment, and the probability of the tree being
generated from the alignment with these parameters is scored.
Obviously the initial score is likely to be very poor. Then a random change is made in this tree (branch order, branch length,
or substitution parameter) and the result is rescored.
Then a choice is made whether to accept the change; this choice is partially random, but the greater the improvement in
tree score, the more likely it is to be accepted. If the change is accepted, the process is repeated starting with this new
tree; if the change is rejected, the process is repeated starting with the old tree.
After many, many cycles of this process, the algorithm settles into a collection of trees that are nearly optimal. Various
tricks are used to keep the algorithm from getting stuck at locally optimal scores.
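The accept/reject cycle described above can be sketched with a Metropolis rule. Everything here is illustrative: a toy one-dimensional "tree score" stands in for a real tree likelihood, and the proposal is a simple random nudge rather than a change to branch order, branch length, or substitution parameters.

```python
import math
import random

def score(x):
    """Stand-in for the log-probability of a tree; peaks at x = 3."""
    return -(x - 3.0) ** 2

def metropolis(steps=20000, seed=42):
    rng = random.Random(seed)
    x = rng.uniform(-10, 10)                # random starting "tree"
    for _ in range(steps):
        candidate = x + rng.gauss(0, 0.5)   # a random change
        delta = score(candidate) - score(x)
        # always keep an improvement; keep a worsening with prob e^delta,
        # so the bigger the improvement, the more likely the change sticks
        if delta >= 0 or rng.random() < math.exp(delta):
            x = candidate
    return x

print(metropolis())  # typically settles near 3, the peak of the toy score
```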