Practical Bioinformatics
Lecture 8
Dayhoff Algorithm
Dayhoff’s Algorithm - Foundation
● Dayhoff (1978) created a "base dataset" to learn from
● 34 protein “superfamilies” grouped into 71 phylogenetic trees.
● Range of conservation: from highly conserved proteins (e.g., histones and
glutamate dehydrogenase) to rapidly changing ones (e.g., immunoglobulin (Ig)
chains and kappa casein)
● Protein families were aligned, and the number of times each amino acid in
the alignment was replaced by another was counted.
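The counting step can be sketched in a few lines. This is a simplified illustration only: it counts replacements directly between two aligned sequences, whereas Dayhoff counted changes along phylogenetic trees with inferred ancestral sequences. The function name `count_substitutions` and the toy sequences are hypothetical.

```python
from collections import Counter


def count_substitutions(seq_a, seq_b):
    """Count unordered amino acid replacements between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for a, b in zip(seq_a, seq_b):
        # Skip gap columns and columns where the residue is unchanged.
        if a == "-" or b == "-" or a == b:
            continue
        # Sort the pair so that A<->S and S<->A land in the same bin.
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy aligned pair, invented for illustration (not Dayhoff's data).
toy_counts = count_substitutions("ACDEAG", "SCDEAS")
print(dict(toy_counts))
```

Pairs are stored unordered because the direction of a change cannot be read from two present-day sequences alone.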
Accepted Point Mutations
Dayhoff Model (Step 1)
An amino acid change that is accepted by natural selection occurs when:
(1) a gene undergoes a DNA mutation such that it encodes a different amino
acid; and
(2) the entire species adopts that change as the predominant form of the
protein.
PAM rate of proteins used by Dayhoff et al.
Dayhoff Model (Step 2): Frequency of AA
Dayhoff Model (Step 3): Mutability
Dayhoff Model (Step 4): Mutation Prob over 1 PAM
One PAM: the unit of evolutionary divergence in which 1% of the
amino acids have been changed between the two protein sequences
PAM1 Mutation Probability Matrix: e.g., 98.7% of Ala in the sequence stays the same over 1 PAM
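Steps 2 to 4 can be tied together with a small sketch. It assumes the commonly stated Dayhoff formulation: the off-diagonal entry M[i][j] (original residue j replaced by i) is lambda * m_j * A[i][j] / sum_i A[i][j], and the diagonal is 1 - lambda * m_j, with lambda chosen so that 99% of residues are expected to stay unchanged. The three-residue alphabet, the counts, and the frequencies below are invented, and `pam1_from_counts` is a hypothetical name.

```python
def pam1_from_counts(counts, freqs):
    """Build a 1-PAM mutation probability matrix from symmetric counts.

    counts[i][j]: accepted point mutations observed between residues i and j.
    freqs[j]: background frequency of residue j.
    Returns m with m[i][j] = P(original j is replaced by i); columns sum to 1.
    Assumes every residue has at least one observed change.
    """
    n = len(freqs)
    # Total number of observed changes involving each residue j.
    totals = [sum(counts[i][j] for i in range(n) if i != j) for j in range(n)]
    # Relative mutability: changes involving j per occurrence of j.
    mutability = [totals[j] / freqs[j] for j in range(n)]
    # Scale so that, on average, 1% of residues change.
    lam = 0.01 / sum(freqs[j] * mutability[j] for j in range(n))
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i == j:
                m[i][j] = 1.0 - lam * mutability[j]
            else:
                m[i][j] = lam * mutability[j] * counts[i][j] / totals[j]
    return m


# Toy inputs for a 3-letter alphabet (invented, not Dayhoff's values).
toy_counts = [[0, 30, 10], [30, 0, 5], [10, 5, 0]]
toy_freqs = [0.5, 0.3, 0.2]
toy_pam1 = pam1_from_counts(toy_counts, toy_freqs)
unchanged = sum(toy_freqs[j] * toy_pam1[j][j] for j in range(3))
print(round(unchanged, 4))
```

The diagonal entries play the role of the 98.7% figure for Ala: each gives the probability that a residue is still the same after 1 PAM.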
PAM 10
Notice the switches!
Notice higher penalties, e.g. D to R in PAM 10; E to N switches
How can that happen?
Computational Intuition: Matrix Exponentiation is not a Linear Process
Biological Intuition: A change may occur in multiple steps via indirect paths
For example, if direct A→G is rare, but A→S→G is more probable over multiple
steps, then raising PAM1 to the 10th power (PAM10) can make A→G much more
frequent.
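The indirect-path effect can be reproduced with a toy matrix. In this sketch the three states stand for A, S, and G, the direct A→G probability is set to exactly zero, and all values are invented for illustration. `mat_mul` and `mat_pow` are hypothetical helper names; the convention is m[i][j] = P(original j is replaced by i).

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Order of states: A, S, G. Each column sums to 1.
A, S, G = 0, 1, 2
toy_step = [
    [0.99, 0.01, 0.00],  # replaced by A
    [0.01, 0.98, 0.01],  # replaced by S
    [0.00, 0.01, 0.99],  # replaced by G
]
toy_ten_steps = mat_pow(toy_step, 10)

print("direct A->G in one step:", toy_step[G][A])
print("A->G after ten steps:   ", toy_ten_steps[G][A])
```

Ten times the one-step probability would still be zero, so the nonzero ten-step value comes entirely from paths through S. This is the sense in which exponentiation is not linear.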
Dayhoff Model (Step 5): PAM 250
Simply PAM1^250 (PAM1 multiplied by itself 250 times)
This matrix applies to an evolutionary distance where proteins share about 20%
amino acid identity.
RECALL:
● Mutability
● NOT Symmetric
PAM250 mutation probability matrix. At this evolutionary distance only one in
five amino acid residues remains unchanged from the original AA sequence.
What do the PAM matrices mean?
Which PAM matrix to use?
Human beta globin (NP_000509.1) and chimp beta globin (XP_508242.1): 100% amino acid identity.
Human beta globin and alpha globin: divergent.
Mismatches are assigned large negative scores.
Most broadly useful scoring matrices, such as BLOSUM62
Twilight Zone
Dayhoff Model (Step 6): Mutation Probability to Odds
1: substitution occurs as often as can be expected by chance.
> 1: Alignment of two residues occurs more often than expected by chance (e.g., a
conservative substitution of serine for threonine)
< 1: Alignment of two residues occurs less often than expected by chance (not favored)
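The three cases above can be sketched as a small calculation. It assumes the usual definition of the odds (relatedness) ratio, R[i][j] = M[i][j] / f[i]: the probability that original residue j is replaced by i, divided by the background frequency of i. The matrix and frequencies are invented toy values for a 3-letter alphabet, and the function names are hypothetical.

```python
def odds_matrix(mutation_probs, freqs):
    """R[i][j] = M[i][j] / f[i], with M[i][j] = P(original j replaced by i)."""
    n = len(freqs)
    return [[mutation_probs[i][j] / freqs[i] for j in range(n)]
            for i in range(n)]


def interpret(ratio):
    """Translate an odds ratio into the three cases listed above."""
    if ratio > 1:
        return "more often than chance"
    if ratio < 1:
        return "less often than chance"
    return "as often as chance"


# Toy mutation probability matrix at a large distance (columns sum to 1).
toy_probs = [
    [0.70, 0.35, 0.225],
    [0.21, 0.55, 0.150],
    [0.09, 0.10, 0.625],
]
toy_freqs = [0.5, 0.3, 0.2]
toy_odds = odds_matrix(toy_probs, toy_freqs)

print(interpret(toy_odds[0][0]))  # residue 0 aligned with itself
print(interpret(toy_odds[2][0]))  # residue 0 replaced by residue 2
```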
Dayhoff Model (Step 7): log Odds as the score
Relatedness
IS symmetric
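A short sketch of the final step, assuming the usual PAM convention of scoring with 10 * log10 of the odds ratio, rounded to the nearest integer. The toy matrix and frequencies are invented. They were chosen so that f[j] * M[i][j] equals f[i] * M[j][i], which is what symmetric substitution counts produce and is the reason the scores come out symmetric even though the mutation probabilities are not.

```python
import math


def log_odds_scores(mutation_probs, freqs):
    """S[i][j] = round(10 * log10(M[i][j] / f[i]))."""
    n = len(freqs)
    return [[round(10 * math.log10(mutation_probs[i][j] / freqs[i]))
             for j in range(n)]
            for i in range(n)]


# Toy mutation probability matrix (columns sum to 1) and frequencies.
toy_probs = [
    [0.70, 0.35, 0.225],
    [0.21, 0.55, 0.150],
    [0.09, 0.10, 0.625],
]
toy_freqs = [0.5, 0.3, 0.2]
toy_scores = log_odds_scores(toy_probs, toy_freqs)

for row in toy_scores:
    print(row)
```

Taking logarithms means the scores of aligned positions can be added instead of multiplying their odds.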
Using scores to align sequences
Using PAM 250
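A minimal sketch of how log-odds scores are used: the score of an ungapped alignment is the sum of the scores of its aligned pairs. The scores below are invented values for a 3-letter alphabet, not real PAM250 entries, and `score_alignment` is a hypothetical helper. Gaps are not handled here.

```python
def score_alignment(seq_a, seq_b, scores):
    """Sum substitution scores over the columns of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # The score table is symmetric, so look the pair up in either order.
        key = (a, b) if (a, b) in scores else (b, a)
        total += scores[key]
    return total


# Toy symmetric scores (invented for illustration; not PAM250 values).
toy_scores = {
    ("A", "A"): 1, ("A", "S"): -2, ("A", "G"): -3,
    ("S", "S"): 3, ("S", "G"): -3,
    ("G", "G"): 5,
}

identical = score_alignment("ASGA", "ASGA", toy_scores)
diverged = score_alignment("ASGA", "ASAG", toy_scores)
print(identical, diverged)
```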