0% found this document useful (0 votes)
30 views42 pages

Lecture 5

Uploaded by

chunfeng277
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views42 pages

Lecture 5

Uploaded by

chunfeng277
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

CS 466

Introduction to Bioinformatics
Lecture 5
Mohammed El-Kebir
September 13, 2024
Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir) HW1: To be released on 09/13
Office hours: Wednesdays, 3:30-4:30pm in and due 09/21 at 11:59pm
Siebel 3216

Piazza / Gradescope: Midterm:


• [Link] week of 10/09 – 10/11,
• Gradescope entry code: ‘VDEE52’ details to follow

TA:
• Nicole (nsdong2): Mondays 4-5pm in CS Tutoring Center (Siebel basement)
• Claire (czchou2): Tuesdays 2-3pm in CS Tutoring Center (Siebel basement)
• Mrinmoy (mroddur2): Thursdays 2-3pm in Siebel 2219
2
Global, Fitting and Local Alignment
Global Alignment problem: Given strings 𝐯 ∈ Σ !
and 𝐰 ∈ Σ " and scoring function 𝛿, find alignment
of 𝐯 and 𝐰 with maximum score. 𝐯\𝐰 0 T A C G G C
[Needleman-Wunsch algorithm] 0

Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! A


and 𝐰 ∈ Σ " and scoring function 𝛿, find an
alignment of 𝐯 and a substring of 𝐰 with maximum G
global alignment score 𝑠 ∗ among all global
alignments of 𝐯 and all substrings of 𝐰 G

Local Alignment problem: Given strings 𝐯 ∈ Σ !


and 𝐰 ∈ Σ " and scoring function 𝛿, find a substring
of 𝐯 and a substring of 𝐰 whose alignment has Question: How to assess
maximum global alignment score 𝑠 ∗ among all resulting algorithms?
global alignments of all substrings of 𝐯 and 𝐰
[Smith-Waterman algorithm]
3
Time Complexity
Edit graph is a weighed, directed grid
graph 𝐺 = (𝑉, 𝐸) with source vertex
A T C G
(0, 0) and target vertex (𝑚, 𝑛). Each
W 0 1 2 3 n=4 edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight
V
depending on direction.
0 O O O O O

A 1 O O O O O Alignment is a path from source (0, 0)


to target (𝑚, 𝑛) in edit graph
T 2 O O O O O

G 3 O O O O O Running time is 𝑂(𝑚𝑛)


T [quadratic time]
m=4 O O O O O

4
Time Complexity
Edit graph is a weighed, directed grid
graph 𝐺 = (𝑉, 𝐸) with source vertex
A T C G
(0, 0) and target vertex (𝑚, 𝑛). Each
W 0 1 2 3 n=4 edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight
V
depending on direction.
0 O O O O O

A 1 O O O O O Alignment is a path from source (0, 0)


to target (𝑚, 𝑛) in edit graph
T 2 O O O O O

G 3 O O O O O Running time is 𝑂(𝑚𝑛)


T [quadratic time]
m=4 O O O O O

Question: Compute alignment faster


than 𝑂(𝑚𝑛) time? [subquadratic time]
5
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3

T m=4

6
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3

T Question: How long is an alignment?


m=4

7
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3

T Question: How long is an alignment?


m=4

Question: Compute alignment in 𝑂(𝑚)


space? [linear space]
8
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment

Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes

9
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖

10
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.

11
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.

Question: What if we want alignment itself? 𝑚

12
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.

w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.

w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.

w w
v v

Best score for column


might not be part of
best alignment!
Space Efficient Alignment – Second Attempt
𝑛/2
Alignment is a path from
source (0, 0) to target (𝑚, 𝑛)
in edit graph

𝑖∗
Maximum weight path from
(0,0) to (𝑚, 𝑛) passes
through (𝑖 ∗ , 𝑛/2) 𝑖

Question: What is 𝑖 ∗ ? 𝑚
16
Space Efficient Alignment – Second Attempt
𝑛/2

𝑖∗

𝑚
17
Space Efficient Alignment – Second Attempt
𝑛/2

𝑖∗

𝑚
18
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(

19
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)

20
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)
Question: How to reconstruct
alignment from reported vertices?
21
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗

𝑖 ∗ = arg max{ pre3ix 𝑖 + suf3ix 𝑖 }


$%&%!

𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖

22
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗

𝑖 ∗ = arg max{ pre3ix 𝑖 + suf3ix 𝑖 }


$%&%!

𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖

Want: Compute suf3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time


and O(𝑚) space 𝑚

Doing a longest path from each 𝑖, 𝑗 to 𝑚, 𝑛 (for all 0 ≤ 𝑖 ≤ 𝑚) will not achieve desired running time!

Reversing edges enables single-source multiple destination computation in desired time and space bound!
23
Hirschberg Algorithm: Reconstructing Alignment
A T C G C Hirschberg(𝑖, 𝑗, 𝑖 ' , 𝑗′)
W 0 1 2 3 4 5 1. if 𝑗 ' − 𝑗 > 1
V
2. 𝑖 ∗ ß arg max wt(𝑖)
0 0 -1 -2 -3 -4 -5 $%&%!
( ! )(
3. Report (𝑖 ∗, 𝑗 + )
A 1 -1 1 0 -1 -2 -3 *
( ! )(
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗, 𝑗 + )
T 2 -2 0 2 1 0 -1
*
( ! )( '
5. Hirschberg(𝑖 ∗, 𝑗 + , 𝑖 , 𝑗′)
*
G 3 -3 -1 1 1 2 1
Problem: Given reported vertices and
T 4 -4 -2 0 0 1 1 scores { 𝑖$ , 0, 𝑠$ , … , 𝑖" , 𝑛, 𝑠" },
find intermediary vertices.
C 5 -5 -3 -1 1 0 2
Transposing matrix does not help,
A T - G T C because gaps could occur in both
A T C G - C input sequences (and there might be
multiple opt. alignments) 24
Linear Space Alignment – The Hirschberg Algorithm

25
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment

Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes

26
Banded Alignment x1
y1 y2 y3 ... ... ... ... yN

x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .

Gives a good approximation of .


.
highly identical sequences .

xM

Constrain traceback to band of DP matrix (penalize big gaps)


Figure source: [Link] 27
Banded Alignment x1
y1 y2 y3 ... ... ... ... yN

x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .

Gives a good approximation of .


.
highly identical sequences .

Question: How to change


xM
recurrence to accomplish this?
Constrain traceback to band of DP matrix (penalize big gaps)
Figure source: [Link] 28
Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

29
Introduction to Bioinformatics Algorithms

Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass


through corners of blocks

30
Introduction to Bioinformatics Algorithms

Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass


through corners of blocks
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0.

0 ≤ 𝑖, 𝑗 ≤ 𝑡 and 𝛽(𝑖, 𝑗) is max score alignment


between block 𝑖 of 𝐯 and block 𝑗 of 𝐰 31
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t

2t

nt

32
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t

Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t

nt

33
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t

Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t

Computing 𝛽 𝑖, 𝑗 takes 𝑂 𝑡 ( time

There are 𝑛/𝑡 × 𝑛/𝑡 values 𝛽 𝑖, 𝑗

) ) nt
Total: 𝑂 × × 𝑡( =𝑂 𝑛( time
* * 34
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings

nt

Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
35
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings

nt
Question: How to choose 𝑡 for DNA?
Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
36
Block Alignment – Four Russians Technique
Question: How to choose 𝑡 for DNA?

37
Fastest Subquadratic Alignment* Algorithm

Edit distance in
O(𝑛N / log 𝑛) time

Barely subquadratic!

Want: O(𝑛NOP ) time


where 𝜀 > 0

*for edit distance 38


Fastest Subquadratic Alignment* Algorithm

Edit distance in
O(𝑛N / log 𝑛) time

Barely subquadratic!

Want: O(𝑛NOP ) time


where 𝜀 > 0

Question: Is 𝑛NOP in O(𝑛N / log 𝑛) for any 𝜀 > 0?


*for edit distance 39
Hardness Result for Edit Distance [STOC 2015]

40
41
Take Home Messages
1. Global alignment in O(𝑚𝑛) time and O(𝑚) space
• Hirschberg algorithm
2. Block alignment can be done in subquadratic time
• Four Russians Technique: O(𝑛N / log 𝑛) time
3. Global alignment cannot be done in O(𝑛#$% ) time under SETH

Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes

42

You might also like