CS 466
Introduction to Bioinformatics
Lecture 5
Mohammed El-Kebir
September 13, 2024
Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir) HW1: To be released on 09/13
Office hours: Wednesdays, 3:30-4:30pm in and due 09/21 at 11:59pm
Siebel 3216
Piazza / Gradescope: Midterm:
• [Link] week of 10/09 – 10/11,
• Gradescope entry code: ‘VDEE52’ details to follow
TA:
• Nicole (nsdong2): Mondays 4-5pm in CS Tutoring Center (Siebel basement)
• Claire (czchou2): Tuesdays 2-3pm in CS Tutoring Center (Siebel basement)
• Mrinmoy (mroddur2): Thursdays 2-3pm in Siebel 2219
2
Global, Fitting and Local Alignment
Global Alignment problem: Given strings 𝐯 ∈ Σ !
and 𝐰 ∈ Σ " and scoring function 𝛿, find alignment
of 𝐯 and 𝐰 with maximum score. 𝐯\𝐰 0 T A C G G C
[Needleman-Wunsch algorithm] 0
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! A
and 𝐰 ∈ Σ " and scoring function 𝛿, find an
alignment of 𝐯 and a substring of 𝐰 with maximum G
global alignment score 𝑠 ∗ among all global
alignments of 𝐯 and all substrings of 𝐰 G
Local Alignment problem: Given strings 𝐯 ∈ Σ !
and 𝐰 ∈ Σ " and scoring function 𝛿, find a substring
of 𝐯 and a substring of 𝐰 whose alignment has Question: How to assess
maximum global alignment score 𝑠 ∗ among all resulting algorithms?
global alignments of all substrings of 𝐯 and 𝐰
[Smith-Waterman algorithm]
3
Time Complexity
Edit graph is a weighed, directed grid
graph 𝐺 = (𝑉, 𝐸) with source vertex
A T C G
(0, 0) and target vertex (𝑚, 𝑛). Each
W 0 1 2 3 n=4 edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight
V
depending on direction.
0 O O O O O
A 1 O O O O O Alignment is a path from source (0, 0)
to target (𝑚, 𝑛) in edit graph
T 2 O O O O O
G 3 O O O O O Running time is 𝑂(𝑚𝑛)
T [quadratic time]
m=4 O O O O O
4
Time Complexity
Edit graph is a weighed, directed grid
graph 𝐺 = (𝑉, 𝐸) with source vertex
A T C G
(0, 0) and target vertex (𝑚, 𝑛). Each
W 0 1 2 3 n=4 edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight
V
depending on direction.
0 O O O O O
A 1 O O O O O Alignment is a path from source (0, 0)
to target (𝑚, 𝑛) in edit graph
T 2 O O O O O
G 3 O O O O O Running time is 𝑂(𝑚𝑛)
T [quadratic time]
m=4 O O O O O
Question: Compute alignment faster
than 𝑂(𝑚𝑛) time? [subquadratic time]
5
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3
T m=4
6
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3
T Question: How long is an alignment?
m=4
7
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3
T Question: How long is an alignment?
m=4
Question: Compute alignment in 𝑂(𝑚)
space? [linear space]
8
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes
9
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
10
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.
11
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.
Question: What if we want alignment itself? 𝑚
12
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.
w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.
w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.
w w
v v
Best score for column
might not be part of
best alignment!
Space Efficient Alignment – Second Attempt
𝑛/2
Alignment is a path from
source (0, 0) to target (𝑚, 𝑛)
in edit graph
𝑖∗
Maximum weight path from
(0,0) to (𝑚, 𝑛) passes
through (𝑖 ∗ , 𝑛/2) 𝑖
Question: What is 𝑖 ∗ ? 𝑚
16
Space Efficient Alignment – Second Attempt
𝑛/2
𝑖∗
𝑚
17
Space Efficient Alignment – Second Attempt
𝑛/2
𝑖∗
𝑚
18
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
19
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)
20
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)
Question: How to reconstruct
alignment from reported vertices?
21
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗
𝑖 ∗ = arg max{ pre3ix 𝑖 + suf3ix 𝑖 }
$%&%!
𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖
22
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗
𝑖 ∗ = arg max{ pre3ix 𝑖 + suf3ix 𝑖 }
$%&%!
𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖
Want: Compute suf3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time
and O(𝑚) space 𝑚
Doing a longest path from each 𝑖, 𝑗 to 𝑚, 𝑛 (for all 0 ≤ 𝑖 ≤ 𝑚) will not achieve desired running time!
Reversing edges enables single-source multiple destination computation in desired time and space bound!
23
Hirschberg Algorithm: Reconstructing Alignment
A T C G C Hirschberg(𝑖, 𝑗, 𝑖 ' , 𝑗′)
W 0 1 2 3 4 5 1. if 𝑗 ' − 𝑗 > 1
V
2. 𝑖 ∗ ß arg max wt(𝑖)
0 0 -1 -2 -3 -4 -5 $%&%!
( ! )(
3. Report (𝑖 ∗, 𝑗 + )
A 1 -1 1 0 -1 -2 -3 *
( ! )(
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗, 𝑗 + )
T 2 -2 0 2 1 0 -1
*
( ! )( '
5. Hirschberg(𝑖 ∗, 𝑗 + , 𝑖 , 𝑗′)
*
G 3 -3 -1 1 1 2 1
Problem: Given reported vertices and
T 4 -4 -2 0 0 1 1 scores { 𝑖$ , 0, 𝑠$ , … , 𝑖" , 𝑛, 𝑠" },
find intermediary vertices.
C 5 -5 -3 -1 1 0 2
Transposing matrix does not help,
A T - G T C because gaps could occur in both
A T C G - C input sequences (and there might be
multiple opt. alignments) 24
Linear Space Alignment – The Hirschberg Algorithm
25
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes
26
Banded Alignment x1
y1 y2 y3 ... ... ... ... yN
x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .
Gives a good approximation of .
.
highly identical sequences .
xM
Constrain traceback to band of DP matrix (penalize big gaps)
Figure source: [Link] 27
Banded Alignment x1
y1 y2 y3 ... ... ... ... yN
x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .
Gives a good approximation of .
.
highly identical sequences .
Question: How to change
xM
recurrence to accomplish this?
Constrain traceback to band of DP matrix (penalize big gaps)
Figure source: [Link] 28
Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
w1, …, wt wt+1, …, w2t … wn-t+1, …, wn
29
Introduction to Bioinformatics Algorithms
Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
w1, …, wt wt+1, …, w2t … wn-t+1, …, wn
Require that paths in edit graph pass
through corners of blocks
30
Introduction to Bioinformatics Algorithms
Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
w1, …, wt wt+1, …, w2t … wn-t+1, …, wn
Require that paths in edit graph pass
through corners of blocks
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0.
0 ≤ 𝑖, 𝑗 ≤ 𝑡 and 𝛽(𝑖, 𝑗) is max score alignment
between block 𝑖 of 𝐯 and block 𝑗 of 𝐰 31
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t
2t
nt
32
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t
Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t
nt
33
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t
Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t
Computing 𝛽 𝑖, 𝑗 takes 𝑂 𝑡 ( time
There are 𝑛/𝑡 × 𝑛/𝑡 values 𝛽 𝑖, 𝑗
) ) nt
Total: 𝑂 × × 𝑡( =𝑂 𝑛( time
* * 34
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings
nt
Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
35
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings
nt
Question: How to choose 𝑡 for DNA?
Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
36
Block Alignment – Four Russians Technique
Question: How to choose 𝑡 for DNA?
37
Fastest Subquadratic Alignment* Algorithm
Edit distance in
O(𝑛N / log 𝑛) time
Barely subquadratic!
Want: O(𝑛NOP ) time
where 𝜀 > 0
*for edit distance 38
Fastest Subquadratic Alignment* Algorithm
Edit distance in
O(𝑛N / log 𝑛) time
Barely subquadratic!
Want: O(𝑛NOP ) time
where 𝜀 > 0
Question: Is 𝑛NOP in O(𝑛N / log 𝑛) for any 𝜀 > 0?
*for edit distance 39
Hardness Result for Edit Distance [STOC 2015]
40
41
Take Home Messages
1. Global alignment in O(𝑚𝑛) time and O(𝑚) space
• Hirschberg algorithm
2. Block alignment can be done in subquadratic time
• Four Russians Technique: O(𝑛N / log 𝑛) time
3. Global alignment cannot be done in O(𝑛#$% ) time under SETH
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes
42