Sequence Comparison: Motivation: Finding Similarity Between Sequences Is Important For Many Biological Questions

Sequence comparison is important for identifying homologous proteins and DNA sequences. The Needleman-Wunsch algorithm finds the optimal global alignment of two sequences by using dynamic programming to recursively calculate alignment scores. The Smith-Waterman algorithm was later developed to allow for local sequence alignments by finding the highest-scoring aligned segment between two sequences. Both algorithms represent the basis for sequence alignment.

Uploaded by

Somashree Chakraborty

Sequence comparison: Motivation

Finding similarity between sequences is important for many biological questions.

• Find homologous proteins
– Allows prediction of structure and function
• Locate similar subsequences in DNA
– e.g., allows identification of regulatory elements
• Locate DNA sequences that might overlap
– Helps in sequence assembly
Sequence Alignment
• Input: two sequences over the same alphabet
• Output: an alignment of the two sequences

• Two basic variants of sequence alignment:

Global – all characters in both sequences participate
• Needleman-Wunsch, 1970
Local – find related regions within sequences
• Smith-Waterman, 1981
Sequence Alignment - Example
• Input:
GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA

• Possible output:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

• Three elements:
– Perfect matches
– Mismatches
– Insertions & deletions (indels)
For example, the two hypothetical sequences
abcdefghajklm
abbdhijk
could be aligned like this
abcdefghajklm
|| |   | ||
abbd...hijk
As shown, there are 6 matches, 2 mismatches, and one gap of length 3.

Given the payoff matrix
payoff = { match => 4,
           mismatch => -3,
           gap_open => -2,
           gap_extend => -1 };

The sequences
abcdefghajklm
abbdhijk
are aligned and scored like this
             a  b  c  d  e  f  g  h  a  j  k  l  m
             |  |     |           |     |  |
             a  b  b  d  .  .  .  h  i  j  k
match        4  4     4           4     4  4
mismatch           -3                -3
gap_open                -2
gap_extend              -1 -1 -1
for a total score of 24-6-2-3 = 13.
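The scoring above can be sketched in a few lines of Python. This is only an illustration: the function name `payoff_score` and its parameters are hypothetical, a dot marks a gap position as in the slide, the opening penalty is charged once per run of gap positions, and the unaligned overhang ("lm") is left unscored, as in the slide's total.

```python
def payoff_score(top, bottom, match=4, mismatch=-3,
                 gap_open=-2, gap_extend=-1, gap_char="."):
    """Score an alignment with the payoff matrix from the slide.

    Assumption: zip() stops at the shorter string, so any trailing
    overhang of the longer sequence is not scored.
    """
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            if not in_gap:
                score += gap_open      # charged once when a gap starts
            score += gap_extend        # charged for every gap position
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


print(payoff_score("abcdefghajklm", "abbd...hijk"))  # 13
```

The sketch does not distinguish a gap in the first sequence from an adjacent gap in the second; for the single gap in this example that makes no difference.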
Scoring Function
• Score each position independently:
– Match: +1
– Mismatch: -1
– Indel: -2
• Score of an alignment is sum of position scores
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
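As a minimal sketch of this position-by-position scoring (the function name `alignment_score` and its parameter names are illustrative, not from the slides), each column of the alignment contributes +1, -1 or -2 and the contributions are summed:

```python
def alignment_score(top, bottom, match=1, mismatch=-1, indel=-2, gap_char="-"):
    """Sum independent per-column scores of a gapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            score += indel        # insertion or deletion
        elif x == y:
            score += match        # perfect match
        else:
            score += mismatch     # mismatch
    return score


print(alignment_score("-GCGC-ATGGATTGAGCGA",
                      "TGCGCCATTGAT-GACC-A"))  # 3
```

Because every column is scored independently, a gap of length k simply costs k times the indel score under this scheme.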
Needleman-Wunsch Method
The algorithm guarantees that no other
alignment of these two sequences has a
higher score under this payoff matrix.
Output:

An alignment of two sequences is represented by three lines.
The first line shows the first sequence.
The third line shows the second sequence.
The second line has a row of symbols. The symbol is a vertical bar wherever the characters in the two sequences match, and a space wherever they do not.
Dots may be inserted in either sequence to represent gaps.
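The three-line representation described above can be produced with a short helper. This is a sketch only; `format_alignment` is a hypothetical name, and the two inputs are assumed to be already aligned strings of equal length with dots marking gaps.

```python
def format_alignment(top, bottom, gap_char="."):
    """Return the three-line view: sequence 1, match symbols, sequence 2."""
    middle = "".join(
        "|" if x == y and x != gap_char else " "
        for x, y in zip(top, bottom)
    )
    return "\n".join([top, middle, bottom])


print(format_alignment("abcdefghajk", "abbd...hijk"))
```

Output programs such as the one shown on the next slide often add a third symbol (for example a colon) for similar but non-identical residues; this sketch marks identities only.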

Needleman-Wunsch Method
Typical output file
Global: HBA_HUMAN vs HBB_HUMAN
Score: 290.50

HBA_HUMAN 1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44
|:| :|: | | |||| : | | ||| |: : :| |: :|
HBB_HUMAN 1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE 43

HBA_HUMAN 45 HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL 83
| ||| |: :|| ||||| | :: :||:|:: : |
HBB_HUMAN 44 SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL 88

HBA_HUMAN 84 SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128
|:|| || ||| ||:|| : |: || | |||| | |: |
HBB_HUMAN 89 SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133

HBA_HUMAN 129 LASVSTVLTSKYR 141
:| |: | ||
HBB_HUMAN 134 VAGVANALAHKYH 146

%id = 45.32 %similarity = 63.31

Overall %id = 43.15; Overall %similarity = 60.27
Needleman-Wunsch Method
Dynamic Programming
Potential difficulty: how does one come up with the optimal alignment in the first place? We now introduce the concept of dynamic programming (DP).

DP can be applied to a large search space that can be structured into a succession of stages such that:
1) the initial stage contains trivial solutions to sub-problems;
2) each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in an earlier stage;
3) the final stage contains the overall solution.
Three steps in Dynamic Programming
1. Initialization
2. Matrix fill or scoring
3. Traceback and alignment

Two sequences will be aligned:
GAATTCAGTTA (sequence #1)
GGATCGA (sequence #2)

A simple scoring scheme will be used:
$S_{i,j} = 1$ if the residue at position $i$ of sequence #1 is the same as the residue at position $j$ of sequence #2 (the match score)
$S_{i,j} = 0$ otherwise (the mismatch score)
$w$ = gap penalty
Initialization step: Create a matrix with M + 1 columns and N + 1 rows, where M and N are the lengths of the two sequences. The first row and first column are filled with 0.
Matrix fill step: Each position $M_{i,j}$ is defined to be the MAXIMUM score at position $i,j$:
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S_{i,j} & \text{(match or mismatch in the diagonal)} \\ M_{i,j-1} + w & \text{(gap in sequence \#1)} \\ M_{i-1,j} + w & \text{(gap in sequence \#2)} \end{cases}$$
Fill in rest of row 1 and column 1
Fill in column 2
Fill in column 3
Column 3 with answers
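The initialization and matrix fill steps can be sketched as follows. The function name `nw_fill`, the choice of rows for sequence #2 and columns for sequence #1, and the default `gap=0` are assumptions made for this illustration; the slides state only "w = gap penalty", and a value of 0 is what makes a zero-filled first row and column consistent with the recurrence.

```python
def nw_fill(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the Needleman-Wunsch score matrix.

    Assumed layout: rows follow seq2, columns follow seq1, so the matrix
    has len(seq2) + 1 rows and len(seq1) + 1 columns.
    """
    n_rows, n_cols = len(seq2) + 1, len(seq1) + 1
    M = [[0] * n_cols for _ in range(n_rows)]
    # Initialization: cumulative gap penalties, all 0 when gap == 0.
    for j in range(1, n_cols):
        M[0][j] = j * gap
    for i in range(1, n_rows):
        M[i][0] = i * gap
    # Matrix fill: each cell is the best of its three predecessors.
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal
                          M[i][j - 1] + gap,     # gap, moving along the row
                          M[i - 1][j] + gap)     # gap, moving down the column
    return M


matrix = nw_fill("GAATTCAGTTA", "GGATCGA")
print(matrix[-1][-1])  # 6
```

The bottom-right cell holds the score of the best global alignment; with this scoring scheme it equals the number of matched residues.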
Traceback step:
Position at current cell and look at direct predecessors

Seq#1 A
      |
Seq#2 A
Traceback step:
Position at current cell and look at direct predecessors

Seq#1 G A A T T C A G T T A
      |   | |   |   |     |
Seq#2 G G A T - C - G - - A
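A minimal end-to-end sketch of the three steps, including traceback, is given below. The name `needleman_wunsch`, the matrix orientation, the default `gap=0` and the tie-breaking order (diagonal first, then left, then up) are assumptions of this sketch. Several alignments share the optimal score, so the one returned may differ from the alignment drawn above while still containing the same number of matches.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=0):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    # Steps 1 and 2: initialization and matrix fill.
    M = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        M[0][j] = j * gap
    for i in range(1, rows):
        M[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)

    # Step 3: traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:      # came from the diagonal
                top.append(seq1[j - 1])
                bottom.append(seq2[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and M[i][j] == M[i][j - 1] + gap:  # came from the left
            top.append(seq1[j - 1])
            bottom.append("-")
            j -= 1
        else:                                       # came from above
            top.append("-")
            bottom.append(seq2[i - 1])
            i -= 1
    return M[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, a1, a2 = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(a1)
print(a2)
```

The predecessor is re-derived from the scores during traceback rather than stored as a pointer, which keeps the sketch short at the cost of recomputing the comparison.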
Over a decade after the initial publication of the Needleman-Wunsch algorithm, a
modification was made to allow for local alignments (Smith and Waterman, 1981).
In this adaptation, the alignment path does not need to reach the edges of the
search graph, but may begin and end internally. In order to accomplish this, 0 was
added as a term in the score calculation described by Needleman and Wunsch.
Smith-Waterman Algorithm (cont.)
• Only works effectively when gap penalties are used
• In example shown
– match = +1
– mismatch = -1/3
– gap = -(1 + k/3) (k = extent of gap)
• Start with all cell values = 0
• Looks in subcolumn and subrow shown and in direct diagonal for a score that is the highest when you take alignment score or gap penalty into account

     C    A    G    C    C    U    C    G    C    U    U    A    G
A  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
A  0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
U  0.0  0.0  0.8  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
G  0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
C  1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
C  1.0  0.7  0.0  1.0  3.0  1.7    ?
A
U
U
G
A
C
G
G

$$H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \right\}$$

Smith-Waterman Algorithm (cont.)
• Four possible ways of forming a path
For every residue in the query sequence
1. Align with next residue of db sequence … score is previous
score plus similarity score for the two residues
2. Deletion (i.e. match residue of query with a gap) … score is
previous score minus gap penalty dependent on size of gap
3. Insertion (i.e. match residue of db sequence with a gap) …
score is previous score minus gap penalty dependent on size
of gap
4. Stop … score is zero

• Choose whichever of these is the highest
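The four options can be sketched as a matrix fill in Python. The function name `sw_fill` and the row/column orientation are assumptions of this sketch; the scores are those of the example (match +1, mismatch -1/3, gap penalty W_k = 1 + k/3), and exact fractions are used so that ties are not disturbed by floating-point rounding.

```python
from fractions import Fraction


def sw_fill(a, b, match=Fraction(1), mismatch=Fraction(-1, 3),
            gap=lambda k: 1 + Fraction(k, 3)):
    """Fill the Smith-Waterman matrix H (rows follow a, columns follow b)."""
    H = [[Fraction(0)] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + s                                   # 1. align
            up = max(H[i - k][j] - gap(k) for k in range(1, i + 1))      # 2. gap of length k
            left = max(H[i][j - l] - gap(l) for l in range(1, j + 1))    # 3. gap of length l
            H[i][j] = max(diag, up, left, Fraction(0))                   # 4. stop
    return H


H = sw_fill("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
print(float(H[6][7]))                     # the cell marked "?" in the example
print(float(max(max(row) for row in H)))  # highest score in the matrix
```

Scanning the whole subcolumn and subrow for every cell makes this fill cubic in the sequence lengths, which is acceptable for a small worked example.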

Smith-Waterman Algorithm (cont.)
Construct Alignment
• The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates
• Trace pathway back from highest scoring cell
• This cell can be anywhere in the array
• Align highest scoring segment

     C    A    G    C    C    U    C    G    C    U    U    A    G
A  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
A  0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
U  0.0  0.0  0.8  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
G  0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
C  1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
C  1.0  0.7  0.0  1.0  3.0  1.7  1.3  1.0  1.3  1.7  0.3  0.0  0.0
A  0.0  2.0  0.7  0.3  1.7  2.7  1.3  1.0  0.7  1.0  1.3  1.3  0.0
U  0.0  0.7  1.7  0.3  1.3  2.7  2.3  1.0  0.7  1.7  2.0  1.0  1.0
U  0.0  0.3  0.3  1.3  1.0  2.3  2.3  2.0  0.7  1.7  2.7  1.7  1.0
G  0.0  0.0  1.3  0.0  1.0  1.0  2.0  3.3  2.0  1.7  1.3  2.3  2.7
A  0.0  1.0  0.0  1.0  0.3  0.7  0.7  2.0  3.0  1.7  1.3  2.3  2.0
C  1.0  0.0  0.7  1.0  2.0  0.7  1.7  1.7  3.0  2.7  1.3  1.0  2.0
G  0.0  0.7  1.0  0.3  0.7  1.7  0.3  2.7  1.7  2.7  2.3  1.0  2.0
G  0.0  0.0  1.7  0.7  0.3  0.3  1.3  1.3  2.3  1.3  2.3  2.0  2.0

GCC-UCG
GCCAUUG
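The traceback from the highest-scoring cell can be sketched as below. The name `smith_waterman`, the orientation (rows follow the first argument) and the tie-breaking order are assumptions of this sketch; the scoring values are those of the example. With the example sequences it should recover the segment shown above, with the row sequence returned first.

```python
from fractions import Fraction


def smith_waterman(a, b, match=Fraction(1), mismatch=Fraction(-1, 3),
                   gap=lambda k: 1 + Fraction(k, 3)):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    def s(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Matrix fill, starting with all cell values = 0.
    H = [[Fraction(0)] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            up = max(H[i - k][j] - gap(k) for k in range(1, i + 1))
            left = max(H[i][j - l] - gap(l) for l in range(1, j + 1))
            H[i][j] = max(H[i - 1][j - 1] + s(i, j), up, left, Fraction(0))

    # The highest-scoring cell can be anywhere in the array.
    score, i, j = max((H[i][j], i, j)
                      for i in range(len(a) + 1) for j in range(len(b) + 1))

    # Trace the pathway back until a cell with score 0 is reached.
    top, bottom = [], []
    while H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s(i, j):          # aligned pair
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, i + 1)
                  if H[i][j] == H[i - k][j] - gap(k)), None)
        if k is not None:                                  # k residues of a face a gap
            top.extend(reversed(a[i - k:i]))
            bottom.extend("-" * k)
            i -= k
            continue
        l = next(l for l in range(1, j + 1)
                 if H[i][j] == H[i][j - l] - gap(l))       # l residues of b face a gap
        top.extend("-" * l)
        bottom.extend(reversed(b[j - l:j]))
        j -= l
    return score, "".join(reversed(top)), "".join(reversed(bottom))


score, seg_a, seg_b = smith_waterman("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
print(float(score))
print(seg_b)
print(seg_a)
```

Stopping at the first zero cell is what lets the alignment begin and end in the interior of the matrix instead of at its edges.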
Differences
• Needleman-Wunsch
1. Global alignments
2. Requires alignment score for a pair of residues to be >= 0
3. No gap penalty required
4. Score cannot decrease between two cells of a pathway
• Smith-Waterman
1. Local alignments
2. Residue alignment score may be positive or negative
3. Requires a gap penalty to work effectively
4. Score can increase, decrease or stay level between two cells of a pathway