Scoring Matrices
Sequence alignment and database searching programs compare sequences
to each other as series of characters (residues). Every comparison algorithm
relies on a scoring scheme, and scoring matrices are used to assign a score
to each comparison of a pair of residues.
A scoring matrix is a set of values (a weighting scheme) representing the
likelihood of one residue being substituted by another during sequence
divergence through evolution. This is why the scoring matrix is also known as
the substitution matrix.
SCORING MATRICES: DNA SEQUENCES
• The scoring system for nucleotide sequences is relatively simple.
• Match +1
• Mismatch -1
• However, even this scheme is not completely realistic, as it assumes
that every base has the same probability of being replaced by any other
base (the Jukes and Cantor assumption).
• Observations show that purine to purine and pyrimidine to pyrimidine
transitions occur more frequently than purine to pyrimidine (or vice versa)
transversions.
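The uniform match/mismatch scheme and the transition/transversion distinction above can be sketched as follows. The function names and the specific transition and transversion penalties (-1 and -2) are assumptions made for illustration; they are not values from a published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def simple_score(a, b, match=1, mismatch=-1):
    """Uniform scheme: every mismatch costs the same."""
    return match if a == b else mismatch


def is_transition(a, b):
    """True if a and b differ but are both purines or both pyrimidines."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def ts_tv_score(a, b, match=1, transition=-1, transversion=-2):
    """Penalise transversions more heavily than transitions (assumed values)."""
    if a == b:
        return match
    return transition if is_transition(a, b) else transversion


def score_ungapped(seq1, seq2, scorer):
    """Sum per-position scores for two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(scorer(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    # G/A is a transition, G/C is a transversion.
    print(score_ungapped("ACGT", "ACAT", simple_score))
    print(score_ungapped("ACGT", "ACAT", ts_tv_score))
    print(score_ungapped("ACGT", "ACCT", ts_tv_score))
```

Under the uniform scheme both mismatches would cost the same; the second scorer separates them only because of the assumed penalties.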
To deal with this difference in mutation frequency in DNA sequences,
more sophisticated statistical models have been developed by Kimura and others.
In practice, simple scoring matrices are still used to generate DNA
sequence-alignment scores, such as the NUC4.2 and NUC4.4 DNA scoring matrices.
SCORING OF PROTEIN ALIGNMENT
To score matches and mismatches in alignment of proteins, it is necessary to
know how often one amino acid is substituted for another in related proteins.
The substitutions found are counted, and each count is divided by the
frequency expected for that type of substitution to give an odds score.
The logarithm of the odds score is placed in a scoring matrix of substitution
values, which can then be used to score the residues in a sequence
alignment.
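A minimal sketch of the count, divide, and take-the-logarithm procedure described above. The function name log_odds_matrix, the toy list of aligned pairs, and the choice of base-2 logarithms are assumptions made for illustration; published matrices are derived from large curated alignments and usually scale and round the resulting values.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Build log2-odds scores from aligned residue pairs.

    aligned_pairs: iterable of (a, b) tuples, one per aligned residue pair.
    Pairs are treated as unordered, so (a, b) and (b, a) are pooled.
    Only pairs that were actually observed receive a score.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1
        residue_counts[a] += 1
        residue_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    freq = {r: n / total_residues for r, n in residue_counts.items()}

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        # Expected frequency of the unordered pair if residues paired at random.
        expected = freq[a] * freq[b] if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores


if __name__ == "__main__":
    # Toy data (hypothetical): mostly conserved F and I, with a few F/I swaps.
    toy_pairs = [("F", "F")] * 6 + [("I", "I")] * 2 + [("F", "I")] * 2
    for pair, score in sorted(log_odds_matrix(toy_pairs).items()):
        print(pair, round(score, 2))
```

A positive score means the pair is seen more often than random pairing would predict; a negative score means it is seen less often.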
AMINO ACID SCORING MATRICES
• Amino acid substitution matrices (20x20) have been developed for scoring.
• Each matrix entry gives the ratio of the observed frequency of substitution
between each possible pair of amino acids in related proteins to that
expected by chance, given the frequencies of amino acids in proteins.
• These ratios are called odds scores.
• These ratios are transformed to logarithms, called log odds scores.
• Odds scores and log odds scores are used to score protein alignments.
• For example, compare 10 sequences: at an aligned position, 9 sequences have
a phenylalanine (F) and the remaining one has an isoleucine (I).
• Observed frequency of mutation/substitution is 1 in 10 (0.1).
• Probability that F will be substituted by I by chance is 1 in 20 (0.05),
assuming all 20 amino acids are equally likely.
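Completing the arithmetic for the example above: the odds score is the observed frequency divided by the chance frequency, and the log odds score is its logarithm. The base-2 logarithm is an assumption here; a different base only rescales the score.

```python
import math

observed = 1 / 10  # one of ten sequences has I where the others have F
expected = 1 / 20  # chance of one particular amino acid out of 20

odds = observed / expected      # observed relative to chance
log_odds = math.log2(odds)      # log odds score, base 2 (assumed)

print(f"odds score     = {odds:.2f}")
print(f"log odds score = {log_odds:.2f}")
```

An odds score above 1 (a log odds score above 0) indicates that the substitution is observed more often than chance alone would predict.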
Probability is a measure of how often an event is expected to occur. An odds
score, as used here, is a ratio of two probabilities: how likely an event is
under one model relative to how likely it is under another.
In the case of amino acid substitution (mutation), the odds of substitution
is the ratio of the probability that one specific amino acid is
substituted by another specific amino acid during evolution to the probability
that such a substitution occurs by chance.
By assigning a score (odds score) to all possible pairs of amino-acid
substitution, a scoring matrix can be obtained.
The basic principle behind amino acid scoring matrices is that amino acids
with similar physicochemical properties can be substituted for one another
and still preserve the function of the protein, so such substitutions
receive a higher score.
• Substitutions that involve very different amino acids tend to disrupt protein
structure and lead to non-functional or less functional proteins. These are
selected against during evolution, so they receive a lower score.
• Cysteine is very important in disulfide bond formation and metal ion binding.
Replacing a cysteine can lead to loss of activity. It is therefore infrequently
substituted (negative score).
• Glycine and proline are also unique in their contributions to protein
structure, so they are not frequently substituted (negative score).
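To illustrate how such scores are applied, the sketch below sums per-column scores over an ungapped protein alignment. The table TOY_SCORES and the function names are hypothetical; the values are chosen only to mirror the principle above (conservative swaps score positive, dissimilar swaps and cysteine replacements score negative) and should not be treated as a published matrix.

```python
# Illustrative values only; a real analysis would load a published
# PAM or BLOSUM matrix instead of this hand-written table.
TOY_SCORES = {
    ("C", "C"): 9, ("D", "D"): 6, ("I", "I"): 4, ("L", "L"): 4,
    ("I", "L"): 2,    # similar hydrophobic side chains
    ("D", "I"): -3,   # charged versus hydrophobic
    ("D", "L"): -4,
    ("C", "D"): -3, ("C", "I"): -1, ("C", "L"): -1,
}


def pair_score(a, b, matrix=TOY_SCORES):
    """Look up a symmetric score; keys are stored in sorted order."""
    return matrix[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2, matrix=TOY_SCORES):
    """Sum pair scores over two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(alignment_score("CIL", "CLL"))  # conservative I/L swap
    print(alignment_score("CIL", "DIL"))  # cysteine replaced by aspartate
```

With these assumed values, the alignment with the conservative I/L swap scores higher than the one in which cysteine is replaced.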
Substitution matrices are constructed by assembling a large and diverse
sample of verified pairwise alignments (or multiple sequence alignments) of
protein sequences. Substitution matrices should reflect the true probabilities
of mutations occurring through a period of evolution.
• BLOSUM: BLOcks SUbstitution Matrix
• PAM: Point Accepted Mutation
Dayhoff is known as the founder of bioinformatics. She earned this reputation
by pioneering the application of mathematics and computational techniques to
the sequencing of proteins and nucleic acids, and by establishing the first
publicly available database for research in the area.
PAM MATRICES
• Also known as Dayhoff (1978) PAM matrices.
• PAM: Point Accepted Mutation
• A PAM represents a substitution of one amino acid by another that has
been fixed by natural selection because either it does not alter the protein
function or it is beneficial to the organism.