0% ont trouvé ce document utile (0 vote)

22 vues52 pages

Recherche Motifs

Transféré par

Nous prenons très au sérieux les droits relatifs au contenu. Si vous pensez qu’il s’agit de votre contenu, signalez une atteinte au droit d’auteur ici.

Formats disponibles

Téléchargez aux formats PDF, TXT ou lisez en ligne sur Scribd

0% ont trouvé ce document utile (0 vote)

22 vues52 pages

Recherche Motifs

Transféré par

Ilham Imene

Nous prenons très au sérieux les droits relatifs au contenu. Si vous pensez qu’il s’agit de votre contenu, signalez une atteinte au droit d’auteur ici.

Formats disponibles

Téléchargez aux formats PDF, TXT ou lisez en ligne sur Scribd

Algorithmes

de recherche de motifs
dans les séquences

Liffré Novembre 2005

Jacques Nicolas

Participants

LEROY Hugues ing. rech. Symbiose

ASSI Anthony ing. expert Symbiose
BIMBOT Frédéric chercheur Cnrs Metiss
GUESDON Maxence ing. rech. Rocq Miriad, Cristal
CENAC Peggy doctorante Rocq Preval
CORDIER Marie-Odile professeure Dream
BAYOUDH Sabri doctorant Cordial
QUINIOU René chercheur Inria Dream
ROSSI Vivien postdoc Aspi
GAIGNARD Alban ing. associé Visages
ILLANES MANRIQUEZ Alfredo doctorant Sosso 2
SERICOLA Bruno chercheur Armor
TUFFIN Bruno chercheur Armor
CAMBEFORT Jeanne ingénieure Symbiose

1
Recherche de motifs (String matching)

Problème fondamental dans de nombreux domaines

Initialement recherche dans des fichiers ou des documents;
Regain d’intérêt depuis le web et la génomique.

Problème de complexité faible (algorithmes linéaires en

temps/espace) mais algorithmique sophistiquée (difficile
à implémenter).

Recherche exacte de mots ok, erreurs (fréquent en

biologie) et motifs plus difficiles à prendre en compte.

Le cours d’aujourd’hui

Introduit rapidement la biologie moléculaire;

Présente les concepts fondamentaux de la

recherche exacte de chaînes;

Propose un choix des meilleurs algorithmes;

Termine par quelques problèmes de recherche.

2
Chromosomes: le support de l’hérédité

10 14 cellules chez
chaque individu

Eukaryotes :
chromosomes
dans un noyau
(sinon prokaryote
bactéries +archées)

Homme :
génome= 23 paires
de chromosomes

Structure de l’ADN : une double hélice

Deux brins complémentaires orientés

Alphabet à 4 lettres: Purines A et T Pyrimidines C et G

3
5'P 3'OH
ACGTATGCCCATACGCGCG

Le séquençage TGCATACGGGTATGCGCGC
3'OH 5'P

Séquencer un génome, c’est obtenir la série

de nucléotides des chromosomes;
De nombreux génomes sont maintenant
disponibles.
Combien de génomes complètement
séquencés ?

Données Gold oct. 2005 :

315 génomes complets
804 prokaryotes,
547 eukaryotes en cours

HGP : le projet génome humain

4
Taille de quelques génomes

16 700 MB

2 734 MB

278 MB

1,7 MB

9749 B

Les Gènes codent

pour des protéines

Médiateur
ARN :

Structure & Fonction

5
Transcription : synthèse de l’ARN

Alphabet
équivalent
A, U, G, C

Structure de l’ARN :
simple brin et palindromes…

5'P A U C G G C U U A G A C G A U G A A G C C G U C C C G G A A A 3'OH

A
G A
U G
HIV A C
protéase GC
CG
AU
GC
5'P A U C G G C U U A C C G G A A A 3'OH

6
Traduction : synthèse des protéines

Alphabet des
acides aminés :
20 lettres

Le code génétique

7
L’art des mots…

Quelques notations sur les mots

A word is a finite suite of letters, elements of a finite non empty alphabet.

|S| or len(S) denotes the length of word S.
The empty word, of size 0, is denoted ε or λ .
Positions in a word start from 0 : S[0] is the 1st letter of S.
A word x is a subword (substring) of a word y if there exists 2 words u and v
such that y=uxv. A proper subword is an non empty subword.
If u= ε, then x is a prefix, if v= ε , then x is a suffix of y.
More generally, x is a subsequence of y if there exists |x+1| words wi such
that y= w0 x[0] w1 x[1] w2 … w |x-1| x[|x-1|] w |x|
S[a..b] is an occurrence of a subword of S between positions a and b.
S= GATATAAATACATATATG
Recherche ATAT ATATAT
du mot 1 11 13
ATA
S[0]=G, S[1]=A, S[1..4]=ATAT len(S)=18

8
Deux notions fondamentales:
periode et bord
Period of word x : Positive integer p such that

∀i∈[0, x − p−1], x[i+ p]=x[i]

The period of x is its smallest period.

Border of word x : non empty word which is

a proper prefix and suffix of x.
The border of x is the largest border of x.

If p is the period of x and b the border of x,

then p=|x|-|b|

Exercice

Quels sont les facteurs de taille 3 du mot

aaababbbaa ?

Trouvez toutes les périodes du mot

babbababbab.

Trouver la taille des bords des préfixes du

mot abbabaabbabb.

9
Solution

aaa, aab, aba, abb, baa, bab, bba, bbb

1 occurrence, all words on {a,b} : word of de Bruijn.

babbababbab 5
babbababbab 8
babbababbab 10
babbababbab 11

abbabaabbabb
---121123453
period of abbab : 5-2=3, period of the word : 12-3=9

Complexité : hypothèses
d’opérations en temps constant

Access to a position in a sequence of size n;

Comparison of 2 letters or 2 numbers ≤ n;

Boolean operations on o(log2(n))- bit vectors

where n is the length of the sequence.

10
Exercice : complexité
What is the complexity (in time and space) of a naive
string matching approach for the search of a word of length
m in a sequence of length n ?

How many time will require the naïve search of a 1.6kb

segment in the human genome (3. 109 b) with a computer
doing 300 millions letter comparisons per second ?

Recherche exacte de chaîne :

algorithme naïf

Naive Exact string matching (Pattern,Sequence):

n
For each Position in Sequence
Take a Substring of length
the length of the pattern at Position in Sequence

if for each letter of the Pattern m

Substring and Pattern are equal
then print Position

Complexité : o(m) space, o(n.m) time 16000s ~4H

11
Recherche exacte de chaîne :
deux voies

Pretreatment of the pattern

one marks multiple starts of the pattern in it;
algorithm in O(n+m) (typically sublinear !);
Knuth-Morris-Pratt 1977 (Aho-Corasick 1975),
Boyer-Moore 1977 (Horspool 1980, Apostolico-
Giancarlo 1986), BDM 1994 (BOM 2001).

Pretreatment of the sequence

one builds the dictionary of words in the sequence;
algorithm in O(m) + initial step in O(n);
suffix tree : McCreight 1976, Ukonnen 1995.

Prétraitement du motif

Three types of approaches exist

1. Reading of the text, one character at a time, updating

variables and doing inclusion test on pattern prefixes at
the current position (Knuth Morris Pratt)

2. Reading of the text with a sliding window and inclusion

test on pattern suffixes in this window. (Boyer Moore)

3. Reading of the text with a sliding window and inclusion

test on pattern subwords or reverse prefixes in this
window. (BDM)

12
Prétraitement du motif
Quelques notations
But :
Calculer tous les départs possibles (via Zi) du mot

Zi
longueur du plus long mot commun aux positions 0 et i≠0

[li , ri ]
Position du bord le plus à droite calculé par Zj , j= 1 à i

Prétraitement du motif, un exemple

ATAT
AT présent en 0 et 2
Z1= 0, Z2= 2, Z3= 0
[l1,r1]= [0,0] [l2,r2]= [l3,r3] = [2,3]

The naive algorithm computing Zi works on a

smaller string, but has the same complexity
than the initial problem : m2
How to compute it linearly with respect to m?

13
Exercice

Calculer les valeurs de Zi pour le motif

attataattataa.

Solution

¾
attataattataa
002017002011

14
Comment calculer Zi incrémentalement

a)
p i-l l i r

b)
c)

a) Z-word at l (i.e. r) does not reach i : compare from i

b) Z-word at l (i.e. r) passes i and Z i-l < r-i: do nothing !

c) Z-word at l (i.e. r) passes i and Z i-l ≥ r-i: compare from r

Prétraitement du motif, algorithme

def update_Z (Seq,Z,i,l,r):

% Compute Z[i], updates initial & final positions l and r
If (i>r) : % a) Complete comparison necessary
Z[i] = prefix length (Seq,Seq,i, length(Seq)-i+1, 0)
if non empty common word : [l, r] = [i, i+Z[i]-1]

else if (Z[i-l] < r +1 –i ) : % b) no comparison

Z[i] = Z[i-l] % No update of l and r

else : % c) comparison from the previous r

q = prefix length (Seq,Seq,r+1, length(Seq)-r, r+1-i)
Z[i] = q+r+1-i ; [l, r] = [i, r+q]

15
Exercice

Quelles sont les valeurs de Zi, li et ri

pour la séquence ATATCATATA ?

Exemple de prétraitement

ATATCATATA

Current i:1 Z[1]=0 , l=0, r=0 Case a)

rightmost i:2 Z[2]=2, l=2, r=3
Zbox
AT[AT]CATATA Case b)
i:3 l=2, r=3, i-l=1, r+1-i=1, Z[1]<1, Z[3]=Z[1] =0

AT(AT)C[AT{AT]A} Case c)
i:7 l=5, r=8, Z[2]=2, q=1 Z[7]=1+8+1-7=3 , l=7, r=9

16
Table des résultats

i Zi new l new r i-l r+1-i case

1 (T) 0 0 0 a)
2 (A) 2 2 3 2 -1 a)
3 (T) 0 2 3 1 1 b)
4 (C) 0 2 3 2 0 a)
5 (A) 4 5 8 3 -2 a)
6 (T) 0 5 8 1 3 b)
7 (A) 3 7 9 2 2 c)
8 (T) 0 7 9 1 2 b)
9 (A) 1 7 9 2 1 c)

Sur le nombre de comparaisons dans

update_Z

If (i>r) : % a) Complete comparison necessary

Z[i] = prefix length (Seq,Seq,i, length(Seq)-i+1, 0)
if non empty common word : [l, r] = [i, i+Z[i]-1]
r increases by the number of good comparisons

else if (Z[i-l] < r +1 –i ) : % b) no comparison

Z[i] = Z[i-l] % No update of l and r
r stable and no comparison

else : % c) comparison from the previous r

q = prefix length (Seq,Seq,r+1, length(Seq)-r, r+1-i)
Z[i] = q+r+1-i ; [l, r] = [i, r+q]
r increases by the number of good comparisons

17
Complexité linéaire du prétraitement

Pretreatment is linear with respect to the length m of the

sequence, independent of the length of the alphabet.

• m loops on update_Z ;
• Other loops are calls to prefix length,
making a number of comparisons;
• r increases each time by the number of successful
comparisons and r is bounded by m.
• at each step, there is at most 1 failure comparison, that is at
most m altogether.

Un algorithme linéaire
(Main & Lorentz 1984)

Append pattern, $,and sequence;

Compute Z on the whole sequence;

Print positions where Z=m.

ATAT$GATATAAATACATATATG
0200040301130104040200

18
Note sur l’ algorithme équivalent
KMP

KMP, in a similar way, looks at every position in the

sequence for the longest prefix of the pattern that is
also a suffix of the sequence.

In practice, these algorithms are not the most

efficient. They are overperformed by the « subword »
approach for long patterns and the « numerical »
approach for short patterns.

Principe de KMP
i+1
i
x
sequence

pattern y
border z
y
z

x ≠ y, y≠z:
In case of matching failure on y, it is sufficient to restart
after the border of the correct prefix (MP), or better, a
smaller border starting with a different letter (KMP)

19
Exercice : chromosomes circulaires

La plupart des chromosomes

bactériens sont en fait circulaires
Comment rechercher un motif dans
ces conditions conditions ?

Solution

Travailler sur la séquence CC !

(longueur C+ motif suffisant)

(pour d’autres problèmes, ce n’est

plus forcément si simple…)

20
Les meilleurs algorithmes actuels

Approche Numérique

Start from a binary representation of words.

« Shift-and » or « shift-or » method depending

of used boolean operators.

Shift-or is slightly better than shift-and, working on the

complement, but slightly less easy to explain.

21
Notations pour la méthode
Shift-And
R. Baeza-Yates, G. Gonnet 1992.
Efficient search of short patterns.

Alphabet of size t, coded from 0 to t-1. Pattern of size m.

Looking for a match of a prefix of the pattern
ending at current position k in the sequence.

Data structures : Bit Matrix I(m,t) (Vector of t integers)

on the pattern Bit Vector M(m) (Integer)

I[i,j]=1 iff pattern [i]==j, Beware,

M[i]=1 iff pattern [0..i]==Seq[k-i..k] position 0 right !

Structures de données du Shift-And

ATAT ATATCATATA
Pattern Sequence
I TATA
A 0101 0
C 0000 1
G 0000 2
T 1010 3
k=2 k=2
M 0000 4

Seq ATA M2= 0101 ATA

Motif ATAT 3210 ATAT
i=0 i=2

22
«Shift-And» : l’ algorithme
Let’s consider an increasing position k in sequence Seq
M=0m; I=0mt ; for i from 0 to m-1, I[Pattern[i] ]= I[Pattern[i] ] or 0m-i-110i

For k from 0 to n-1

M = (1 Or M<<1) And I [.,Seq [k]]
If M[m-1] , then pattern has an occurrence at k
M[0]
M[1]
k M[2]
M[3]
M[0]
M[1]
M[2]
M[3]

Exercice
Essayer le Shift-And algorithm et donner les
valeurs de M pour chaque valeur de k.

ATAT ATATCATATA
Motif Séquence

23
Méthode « Shift-And » : Exemple

ATATCATATA
K=
0 A M= 0001∧0101= 0001
TATA 1 T M= 0011∧1010= 0010
A 0101 0
2 A M= 0101∧0101= 0101
C 0000 1
3 T M= 1011∧1010= 1010
G 0000 2
4 C M= 0101∧0000= 0000
T 1010 3
5 A M= 0001∧0101= 0001
M 0000 4 6 T M= 0011∧1010= 0010
7 A M= 0101∧0101= 0101
8 T M= 1011∧1010= 1010
9 A M= 0101∧0101= 0101

L’approche par facteurs

G. Navarro & M. Raffinot 2000 :

BNDM (Backward Nondeterministic Dawg Matching)
Improvement of algorithm BDM of Crochemore & al. 94 with a
numerical approach.

The idea is to build and use an automaton recognizing all

subwords of the pattern.
Worst case complexity o(n.m) but on average (one
assumes that the probability of occurrence of each
character is equal and that letters of the sequence are
independent) sub-linear : o(n. log |Σ|m/m)

24
Principle of the subword approach
Property 1
Given a pattern p of size m and a sequence Seq, if there exists 0≤i <m,
Seq[k+i..k+m-1] is not a subword of p, then there exists no match of p
in Seq between position k and k+i.
(contrapositive of if p matches w, then all its subwords match w).

Property 2
Given a pattern p of size m and a sequence Seq, if there exists last <m
such that for all i such that 0≤i <last, Seq[k+i..k+m-1] is not a prefix of
p, then there exists not match of p in Seq between position k and
k+last-1.

Principe de l’approche facteur

Comparison
subword xu fails
k+last Beware !, reverse
k reading direction on sequences
during the search of subwords
Sequence x u
0 last
Pattern
subword y u x≠y Property 1
v Property 2
Prefix subword
BOM
Safe shifts
xu not present in the pattern
BNDM
v longest recognized prefix

25
BOM : la structure de données
d’oracle des facteurs

Recent algorithm :
reference A. Allauzen, M. Crochemore, M. Raffinot 2001

Applicable for longer patterns (> 1 word of memory).

Backward Oracle Matching idea : replace suffix

automaton with factor oracle, simply indicating words
that are not subwords, i.e. recognizing a superset of
the set of subwords.

The algorithm is o(n.m) worst case,

efficient in practice, «conjectured optimal» on average.

Décalage correct dans BOM

Reading direction on sequences

during the search of subwords

k
x
sequence
pattern

subword y u x≠y

xu not present in the pattern Safe shift

26
Oracle des facteurs:
1ère caractérisation

C’est un automate déterministe uniquement défini par sa

construction
(on ne connaît toujours pas le langage qu’il accepte !)

1. Build the maximal canonical automaton recognizing the pattern;

2. For increasing state number i in the automaton,

For each letter α in the alphabet such that no α−transition
exist from i,
if u is the shortest word regognized at i and uα is a subword ending
at j, after position i in the sequence, add an α−transition from i to j.

The factor oracle recognizes a (little) superset of the set of

subwords. It has m+1 states and no more than 2m-1
transitions.
It can be built in linear time.

Schéma

u ≠ αv ≠ αw

0 i j k
α
α

i j k
uα uα

27
Oracle des facteurs:
caractérisation en ligne
1. Build the maximal canonical automaton recognizing
the pattern (internal transitions);

2. All Suffixes : All states are final states.

3. All prefixes : For each state from the initial state
There is an backward link to the «closest» supply
state with an in-transition with the same letter
(default, initial state). Closest is with respect to the
backward path from the previous state.
Add for all states on the backward path an (external)
transition with the same letter to the new state.

Algorithme construisant l’oracle

create-factor-oracle(word):

1. Create the initial state q0; link(0):=⊥;

2. For i from 1 to m do
1. Create state qi; δ(qi-1, word[i]):=qi;
No word [i] transition
2. j:= link(i-1); from j
3. While ( j ≠ ⊥ and δ(qj, word[i])= ⊥Go) Backward
do
1. δ(qj, word[i]):= qi; j:= link(j); on the link path
4. od
5. If j= ⊥ then link(i):= 0
else link(i):= k / qk=δ(qj,word[i]) fi
3. od

28
Oracle des facteurs pour reverse(ATAT)
A

0 0 1 0 1 2
T T A

⊥ ⊥ ⊥
A

0 1 2 3 4
T A T A

⊥
The automaton recognizes just factors in this case
It recognizes just one word of the size of the pattern : the pattern itself.

Algorithme BOM

1. k:=0; create-factor-oracle(reverse pattern);

2. While (k ≤ n-m) do % window of size m at k on the sequence
1. state:= q0 ; i:= m-1 % initial state, end of window
2. While (i≥0 & state ≠ ⊥) do % progress on the window
2. state := δ(state,Seq[k+i]); i:=i-1 % and the automaton
3. od
4. If state ≠ ⊥ then write(‘pattern occurrence at’, k)
5. k:=k+ i +2 %slide the window on the next possible factor
3. od

29
Exercice

Run BOM algorithm and give the value of

used oracle states and i for each value of k.

ATAT ATATCATATA
Motif Sequence

0 1 2 3 4
T A T A

BOM : exemple d’exécution

k= 0 window = ATAT, state=0, i= 3

state=1, i= 2 state=2, i= 1
state=3, i= 0
state=4, i=-1 occurrence at 0
k= 1 window = TATC , state=0, i= 3
state= ⊥, i= 2
k= 5 window = ATAT , state=0, i= 3
state=1, i= 2 state=2, i= 1
state=3, i= 0
state=4, i= -1 occurrence at 5
k= 6 window = TATA, state=0, i= 3
state=2, i= 2 state=3, i= 1
state=4, i= 0 state=⊥, i=-1
k=7

30
Comparaison empirique des algorithmes
Extrait de Navarro et Raffinot “Flexible pattern matching in strings” 2002

Size
alphabet

Size
Pattern

UltraSparc 32 bits Machine, Random Sequences 10Mbytes

Quelques problèmes de recherche

de motifs en biologie moléculaire

Recherche de motifs spécifiques,

avec « don’t care » et « erreurs »;

Visualisation du contenu de génomes;

Recherche de « répétitions »;

Recherche de modèles syntaxiques.

Assemblage de séquences

31
Exemple de Motif de Protéine/ADN

Motif de la protéine « En Doigt de Zinc » :

utilisation de don’t cares pour modéliser les distances.
HVRQH-X(10,20)-CLGC

Idée de résolution

Consider the set of k patterns delimited by Xs;

Look in parallel for the k patterns;

Adapt Aho-Corasick’s multiple matching

algorithm, to manage shifts.

32
Recherche de chaînes multiples
If one has to search for a set of k patterns, a trivial
solution is to repeat k times the previous algorithm :
complexity o(k(n+m)).

For a large number of patterns (a dictionnary to

check or code the sequence), it is possible to design
better algorithms, with a complexity not depending on
k, with the help of a pretreatment of the set of
patterns.

The most useful structure for this is a pattern tree

with suffix links

Arbre des motifs avec liens suffixes

G I M H
E A N E U
E T D M
N M R
E X L A
O E R
9 T N
I
M T A O
1 I
N 8
E E C N
A
E
2 T
3 4 6 Link from each node to
7 the node accepting the
1 edge = 1 letter 5
longest proper suffix of
1 node = word read on path from root node the word on this node.
1 number= 1 word

33
Application : recherche de tags

STSs
(Sequenced Tagged Site)
are small DNA sequences
150-300 bases long
whose extremities
(20-30 nucleotides) are
unique in a whole genome

Recherche de tags : taille des données

Jan 2005 gbdiv sts[prop] AND

Homo sapiens 117 925
Mus musculus 120 837
Canis 61 328
Gallus 1 354
Maize 59 271
Arabidopsis 28 584

Given a new sequence, locate it in the whole genome, wrt these STSs.

Direct application of set matching, where an algorithm with complexity not

depending on the number of STS is welcome !

34
Aho-Corasick’s algorithm

Start from the root of the tree and the beginning

of the sequence;

While it is possible, go down in the tree with

the current letter in the sequence;

If a labeled node is reached

output presence of a pattern;
Follow the suffix link
to reach the new starting node

Visualisation de génomes (Reputer)

35
Visualisation de génomes (paysages)

« Landscapes »: Number of occurrences of subwords along the sequence

(B. Clift & al NAR 86, thèse R. Verin Univ. Marne la vallee 98)

Elementary suffix tree

Aim : a dictionary of all subwords of a sequence

It is the pattern tree of suffixes of a sequence (Trie).

Leaves of the tree refer to positions in the sequence.

Remark :
Subwords may label either the nodes or the branches of the tree

36
An elementary Suffix Tree

Root
Example : GOOGOL
G O L

O O G L

O L G O

G O L

O L

Compact suffix tree (ST)

It is the elementary suffix tree of a word W where all

branching nodes of degree 1 have been compacted.

It is the deterministic automaton recognizing

the set of suffixes of M, where :
There are n final states;
Each state that is neither initial or final

has at least 2 successors;

Transitions are labeled with a subword of W

In order to get unique suffixes, one considers the string

W$, instead of W, where $ is a new letter.

37
Exemple de ST sur GOOGOL

GO $
L$ O

OGOL$ L$ GOL$ L$

OGOL$

Subwords may be also represented at the level of nodes

Arbre des suffixes généralisé (GST)

GTTAT, GTTGTT et ATTGT
root

AT GT T

$1,3 TGT$3,0 T $3,3

AT$1,2 GT $1,4

AT$1,0 GTT$2,0 $2,3 $2,5

T$2,2 $3,2

$3,4

T
Plusieurs séquences
représentées dans le
même arbre AT$1,1 GT $2,4

T$2,1 $3,1

38
Exercice :
Le GST au secours de la contamination
When one purifies, clones or sequences samples of DNA
or proteins in a laboratory, there may be a possible
contamination of these samples with the experimenter,
the host organism or used products.

In most cases, sequences of contaminants are known

(keratines, yeast, enzymes…). How to filter the produced
sequences ?

It is sufficient to build the GST of potential contaminants

and the produced sequence and to filter from it all
positions corresponding to a common subword (two
leaves with the same label, one from the sequence and
the other from a contaminant sequence).

L’arbre des suffixes est

une structure “à tout faire”

ε (Root):2
A:2 C:1 $:0
AC$:0 C:1 AAC$:0 $:0
AAC$:0 $:0

• A great variety of usages :

• A calculus on attributes may string matching, dictionary,
be associated to each node : common subsequence, or on
see example of longest repeat the contrary, superstring ,
with ACAAC$ compression ….

39
Exercice : attributs de ST & GST

How to compute the following entities, using

attributes within ST or GST ?

Longest repeat;
Length of the longest common word of 2
sequences (Knuth’s conjecture in 70 : the
problem is not solvable in linear time !)

Solution : attributes de ST et GST

• Longest repeat :
Synthesized attribute S on internal nodes
S(Parent)= if ( leaf, 0, max Child ( S( Child))+ length( Parent) )

Length of the longest common word of 2 sequences :

Synthesized attributes O, T and S on nodes of the GST
O( Parent)= if( leaf, pointer 1st sequence, or Child ( O( Child) ) )
T ( Parent)= if( leaf, pointer 2nd sequence, or Child ( T ( Child) ) )
S ( Parent)= if( not( O( Parent) and T( Parent) ), 0,
max Child (S( Child))+ length( Parent) )

40
L’arbre des suffixes est
une structure de données efficace
The size of the sequence is n, size of pattern is m
Construction of the tree in O(n) ;
Memory space complexity O(n) ;
(Max 2n+1 nodes, 2n edges, obtained for a « comb » on An)
String matching in O(m).

From a practical point of view, considering the value of constants α

and β in the complexity formula αn+β is important, particularly for
memory space.
The structure is efficient only if it is stored in main memory !
Several variations exist, more compact (array, automaton…)

Looking for a pattern in a suffix tree

Go down in the tree with the pattern

If it is not possible : no match

If one reaches a leaf : match

at position given in the leaf
ABB

Else : multiple match, with occurrences

pointed by leaves under the reached node
A

41
Ukkonen’s construction algorithm (1995)

Systematic, lazy exploration of subwords Seq[j..i] :

1st loop i : 0..n ([0..i])

building a suffix tree for increasing prefixes prefi

2nd loop j : 0..i ([j..i])

inserting decreasing suffixes sufj..i (letter Seq[i])

Importance des repeats :

génome de C. elegans

1 number= 1 word

1 node = 1 word read on the path

from the root node

42
Implication des répétitions dans
les maladies

Ubiquité des répétitions dans le

génome : exemple du prion

43
Reputer :
un outil utilisant les arbres de suffixes
Developed in 1999 at Bielefeld University by S. Kurtz
and C. Schleiermacher

Find all repeats in complete genomes (contrary to

programs such as Repeatfinder of GCG).

Constant α =12.5 (repeats + palindroms on S.

cerevisiae, 11.5 Mb, with 160 Mb of memory and
<1 mn on Sparc 400 MHz).

Allows to take into account errors, based on exact

repeats and includes a significance calculus.

Recherche de maximal repeats

A maximal repeat of a sequence is a subword occurring at

2 different positions in the sequence such that it cannot
be enlarged to the left or to the right without loosing the
double occurrence. The issue is to study their
construction with a suffix tree.

Give 2 maximal repeats of xabcyuuzabcvabcywxaw

Show that if w is a maximal repeat , it is associated to a
internal node of the suffix tree.
Deduce from this fact that there is at most n maximal
repeats in a sequence, if n is the size of the sequence.

44
Recherche de maximal repeats :
solution

abc and abcy

Nodes of a suffix trees are repeats.

But not all repeats are present in the tree.
Consider the tree of ACAC$ . A is not in the tree because all A are
followed with a C in the sequence.
However given a maximal repeat w, it exists by definition 2 different
letters x et y (one of them being possibly $) such that words wx and wy
appear in the sequence, with a common prefix parent, w that will thus
appear in the tree.

There exists n internal nodes thus at most n maximal repeats

Un calcul de maximal repeats

dans l’arbre des suffixes

T(Parent):= if (leaf(Parent,i),
Sequence[i-1],
(S:=∪ Child(T(Child));
if |S| >1 then true else S) )

Solution = words leading to nodes Parent

such that T(Parent).

45
Exemple d’arbre d’analyse de la
séquence d’un gène (illustrations de D. Searls)

R CS
CF RE RE
Hiérarchie de Chomsky

Languages Automaton Grammar Recognition Dependency Biology

Turing Machine Unrestricted Undecidable Arbitrary Unknown
Recursively
Enumerable Baa → A

Linear-Bounded Context- NP-Complete Crossing Pseudoknots, etc.

Context- Sensitive
Sensitive At → aA

Pushdown Context-Free Polynomial Nested Orthodox

(stack) 2o Structure
Context-Free S → gSc

Finite-State Regular Linear Strictly Local Central Dogma

Machine
Regular A → cA

From D. Searls

46
Inverted repeats (Tige-boucle)
et pseudo-noeuds

Exemples simples de grammaires

Genlang

purine ---> “g” | “a”.

pyrimidine ---> “c” | “t”.
purine_rich ---> purine:[cost=0], mostly_purine,
purine:[cost=0], .
mostly_purine---> purine, mostly_purine | [].

(…, purine_rich: [size>25, cost <9], …): P ==> 1.

tandem_repeat ---> X: [size>9], X.

palindrom ---> X: [size>9], 2…1000, ~X.

gene ---> “atg”, @-3, modules, stop_codon.

modules ---> exon, intron, modules | exon.
stop_codon ---> “tga” | “ta”, purine.

47
Schéma de base de l’assemblage

Recherche de recouvrements

Shotgun

Fragments obtenus par ultrasons ou flux d’air haute pression

Taille fragments = Sequencing reads =

2-3 Kb + 8-10 Kb paires de séquences ~ 600-700 b

48
Assemblage :
un problème NP-complet/difficile

2 formulations simplifiées possibles

Recherche de la plus courte

superchaîne d’un ensemble de mots;

Recherche de chemin hamiltonien dans

un graphe (nœud = séquence clone, arc
= chevauchement clone).

Illustration
du problème abstrait d’assemblage

{A=ACGA, B=CGAC, C=ACTT, D=CTTG, E=ATAC, H=TTGC}

E
A
ATACGACTTGC G
E
A B
B F
C Genome
D
H
C
I
H D

Superchaîne la plus courte Chemin hamiltonien

49
Programmes d’assemblage:
état de l’art rapide
3 méthodes principales

Algorithme glouton de recherche de

superchaîne

Algorithme Overlap-Layout-Consensus

Algorithme de recherche de chemin eulérien

[1] M. Pop, S. L. Salzberg, M. Shumway. (2002) Genome Sequence

Assembly: Algorithms and Issues. IEEE Computer 35(7), pp. 47-54.

Algorithme overlap-layout-consensus

1 4 7
2 5 8
3 6 9

1 2 3 4 5 6 7 8 9
Overlap

2 3 1 3 1
1 3

2 2
Layout
(oter la
2 3 2 3
1 transitivité)
1 1 2
3

ACCTGA
ACCTGA Consensus
AGCTGA (correction
ACCAGA d’erreurs)

50
De la difficulté de l’assemblage…

Le problème principal est celui de l’existence de

répétitions dans les génomes.
I II III
a b c a b c d

Tandem I
c b

a d
III
+b II
c
a c
b
Excision
I II III IV
a b c d e f
Réarrangement
I a III II IV

a d e b c f

Ce problème est compliqué par la présence d’erreurs de

séquençage et de polymorphisme sur les séquences.

Quelques résultats sur le génome du chien

(Genome Biology 6/10 2005 P. Quignon, M. Giraud, M. Rimbault, P. Lavigne, S.
Tacher, E. Retout, E. Morin, A.-S. Valin, K. Linblad-Toh, J. Nicolas, F. Galibert)

Jeu d’apprentissage de 60 gènes;

Apprentissage par Pratt de 5 patterns caractéristiques,
dont la localisation est distribuée le long du gène;
Filtrage de 63745 séquences par automates pondérés
déduits des motifs (RDISK);
Nettoyage conduisant à 61321 séquences;
Assemblage par CAP3 avec 97% d’identité des
recouvrements donnant 6727 contigs d’en moyenne 7
fragments;
Finition par pattern le plus discriminant, Blast et filtrage
manuel conduisant à 1050 gènes;
Travail ensuite sur le génome assemblé quand
disponible conduisant à 1000 gènes.

51
52

Vous aimerez peut-être aussi

Algorithmique du texte : méthodes et applications
Pas encore d'évaluation
Algorithmique du texte : méthodes et applications
13 pages
(CCINP) 2020 TSI Informatique e
Pas encore d'évaluation
(CCINP) 2020 TSI Informatique e
20 pages
Recherche textuelle et algorithmes en bio-info
Pas encore d'évaluation
Recherche textuelle et algorithmes en bio-info
8 pages
CCP Tsi 2020 Informatique Corrige
Pas encore d'évaluation
CCP Tsi 2020 Informatique Corrige
9 pages
KMP Knuth Morris Pratt
Pas encore d'évaluation
KMP Knuth Morris Pratt
17 pages
Structures de Données Et Algorithmes: Nadi Tomeh Tomeh@
Pas encore d'évaluation
Structures de Données Et Algorithmes: Nadi Tomeh Tomeh@
170 pages
Serie - 1 Complexite Algo Iteratifs
100% (1)
Serie - 1 Complexite Algo Iteratifs
4 pages
Recherche Textuelle : Algorithmes Naïf et Boyer-Moore
100% (1)
Recherche Textuelle : Algorithmes Naïf et Boyer-Moore
9 pages
Algo Cours2
Pas encore d'évaluation
Algo Cours2
22 pages
Algorithmes et démonstrations avancés
100% (1)
Algorithmes et démonstrations avancés
118 pages
Algorithmes de Recherche Dans Les Chaînes de Caractères
100% (1)
Algorithmes de Recherche Dans Les Chaînes de Caractères
15 pages
Algorithmes et Complexité: Exercices
Pas encore d'évaluation
Algorithmes et Complexité: Exercices
3 pages
Chap1 Complexit
Pas encore d'évaluation
Chap1 Complexit
13 pages
Recherche de motifs et automates finis
Pas encore d'évaluation
Recherche de motifs et automates finis
6 pages
Support Cours AlgoAv 1
Pas encore d'évaluation
Support Cours AlgoAv 1
13 pages
TD Complexite
Pas encore d'évaluation
TD Complexite
9 pages
Memento L1 2024 Section2 Files
Pas encore d'évaluation
Memento L1 2024 Section2 Files
24 pages
Devoir de Contrôle N°3 (Théorique) - Algorithmique Programmation - Bac Info (2008-2009)
Pas encore d'évaluation
Devoir de Contrôle N°3 (Théorique) - Algorithmique Programmation - Bac Info (2008-2009)
1 page
Algorithmique Du Texte
Pas encore d'évaluation
Algorithmique Du Texte
5 pages
Concours Admission 2021 Informatique MP
Pas encore d'évaluation
Concours Admission 2021 Informatique MP
12 pages
Introduction à l'algorithmique ULB
100% (1)
Introduction à l'algorithmique ULB
33 pages
Complexité des Algorithmes Récursifs
Pas encore d'évaluation
Complexité des Algorithmes Récursifs
30 pages
Corrigé Devoir Maison M2 AIGEME 2008-2009
Pas encore d'évaluation
Corrigé Devoir Maison M2 AIGEME 2008-2009
3 pages
Complexité des Algorithmes et Données
Pas encore d'évaluation
Complexité des Algorithmes et Données
45 pages
Introduction à la Programmation Dynamique
Pas encore d'évaluation
Introduction à la Programmation Dynamique
6 pages
Analyse des Algorithmes en Programmation 3
Pas encore d'évaluation
Analyse des Algorithmes en Programmation 3
9 pages
Algorithme (Con17)
Pas encore d'évaluation
Algorithme (Con17)
4 pages
Complexité et Représentation des Algorithmes
Pas encore d'évaluation
Complexité et Représentation des Algorithmes
5 pages
Complexité des algorithmes de tri et recherche
Pas encore d'évaluation
Complexité des algorithmes de tri et recherche
5 pages
Recherche de mots dans un texte
Pas encore d'évaluation
Recherche de mots dans un texte
2 pages
LCTD3
Pas encore d'évaluation
LCTD3
3 pages
Corrigé Exercices Supplémentaires Sur La Complexité Des Algorithmes
100% (1)
Corrigé Exercices Supplémentaires Sur La Complexité Des Algorithmes
5 pages
DS1 4SI Algo Prog GHARBI Moncef 2024 2025
Pas encore d'évaluation
DS1 4SI Algo Prog GHARBI Moncef 2024 2025
5 pages
SuporAlgo&Prg R F C
Pas encore d'évaluation
SuporAlgo&Prg R F C
20 pages
Cours Complexité U-FOAD
Pas encore d'évaluation
Cours Complexité U-FOAD
33 pages
Chapitre 1 Intro
Pas encore d'évaluation
Chapitre 1 Intro
36 pages
Motifs en Bioinformatique et Algorithmes
Pas encore d'évaluation
Motifs en Bioinformatique et Algorithmes
7 pages
Modèles informatiques et bio-informatique
Pas encore d'évaluation
Modèles informatiques et bio-informatique
17 pages
Corrigé d'Examen Algorithmique II
Pas encore d'évaluation
Corrigé d'Examen Algorithmique II
5 pages
Guitari Exercices
Pas encore d'évaluation
Guitari Exercices
25 pages
Exercices Corrigés sur les Sous-Programmes
100% (2)
Exercices Corrigés sur les Sous-Programmes
11 pages
TD 082021
Pas encore d'évaluation
TD 082021
2 pages
Chapitre 1
Pas encore d'évaluation
Chapitre 1
72 pages
Exercices d'Algorithmique du Texte en Bioinformatique
Pas encore d'évaluation
Exercices d'Algorithmique du Texte en Bioinformatique
2 pages
Correction TD N°5
Pas encore d'évaluation
Correction TD N°5
5 pages
Examen: Algorithmique Avancée et C
Pas encore d'évaluation
Examen: Algorithmique Avancée et C
12 pages
Cours Arbres
Pas encore d'évaluation
Cours Arbres
47 pages
Introduction aux structures de données
Pas encore d'évaluation
Introduction aux structures de données
22 pages
Analyse des Algorithmes et Complexité
Pas encore d'évaluation
Analyse des Algorithmes et Complexité
4 pages
Exercices sur algorithmes et complexité
Pas encore d'évaluation
Exercices sur algorithmes et complexité
12 pages
Algorithmes de recherche de chaînes
Pas encore d'évaluation
Algorithmes de recherche de chaînes
89 pages
Exercices sur les algorithmes récurrents
100% (2)
Exercices sur les algorithmes récurrents
4 pages
Algorithmes et structures en Python
Pas encore d'évaluation
Algorithmes et structures en Python
8 pages
Chap.5 Recherche Textuelle
100% (1)
Chap.5 Recherche Textuelle
9 pages
Efficacité et Complexité des Algorithmes
Pas encore d'évaluation
Efficacité et Complexité des Algorithmes
198 pages
Devoir Algorithmique 3ème SI 2013-2014
Pas encore d'évaluation
Devoir Algorithmique 3ème SI 2013-2014
4 pages
Correction TD Complexite Final
Pas encore d'évaluation
Correction TD Complexite Final
5 pages
SI Algo 2025
Pas encore d'évaluation
SI Algo 2025
5 pages
Performances du Vision Transformer
Pas encore d'évaluation
Performances du Vision Transformer
35 pages
Vente en L'état Futur D'achèvenement
Pas encore d'évaluation
Vente en L'état Futur D'achèvenement
12 pages
Chapitre 1 GRH
Pas encore d'évaluation
Chapitre 1 GRH
5 pages
Réussite et Richesse : Clés Essentielles
100% (1)
Réussite et Richesse : Clés Essentielles
3 pages
Exercices sur le courant alternatif 9ème
100% (4)
Exercices sur le courant alternatif 9ème
5 pages
Leica Captivate BRO
Pas encore d'évaluation
Leica Captivate BRO
4 pages
L'Homme et sa Place dans la Nature
Pas encore d'évaluation
L'Homme et sa Place dans la Nature
3 pages
Corrigé Type Examen de Remplacement Etat de L'art
100% (2)
Corrigé Type Examen de Remplacement Etat de L'art
2 pages
09nu03 PR
Pas encore d'évaluation
09nu03 PR
1 page
Qualités Clés d'un Entrepreneur Performant
Pas encore d'évaluation
Qualités Clés d'un Entrepreneur Performant
32 pages
Floristique TP 04
Pas encore d'évaluation
Floristique TP 04
15 pages
Vulnérabilité sociale au Cameroun
Pas encore d'évaluation
Vulnérabilité sociale au Cameroun
2 pages
Formation en Couture par Apprentissage
Pas encore d'évaluation
Formation en Couture par Apprentissage
54 pages
6exercices Sur Les Vecteurs
Pas encore d'évaluation
6exercices Sur Les Vecteurs
1 page
Rapport de négociation des prix 2023
Pas encore d'évaluation
Rapport de négociation des prix 2023
2 pages
FEMD Présentation Heri
100% (2)
FEMD Présentation Heri
35 pages
Composants Arduino et Développement
Pas encore d'évaluation
Composants Arduino et Développement
52 pages
USTHB FI M2 RSD RSF Chapitre0 Introduction
Pas encore d'évaluation
USTHB FI M2 RSD RSF Chapitre0 Introduction
3 pages
Gestion Des Carrières
Pas encore d'évaluation
Gestion Des Carrières
4 pages
Amplificateurs Opérationnels : Régime Non-Linéaire
Pas encore d'évaluation
Amplificateurs Opérationnels : Régime Non-Linéaire
8 pages
Examen de Physique-Chimie 3AC 2023-2024
Pas encore d'évaluation
Examen de Physique-Chimie 3AC 2023-2024
3 pages
Intégration des MES en entreprise
Pas encore d'évaluation
Intégration des MES en entreprise
7 pages
TD 4
Pas encore d'évaluation
TD 4
7 pages
Influenceurs Instagram et marchandisation
Pas encore d'évaluation
Influenceurs Instagram et marchandisation
26 pages
Fiche Revision Intro Au Droit
Pas encore d'évaluation
Fiche Revision Intro Au Droit
7 pages
Speedtec180c 200c 215
Pas encore d'évaluation
Speedtec180c 200c 215
1 page
CCTP SP1 Japoma
Pas encore d'évaluation
CCTP SP1 Japoma
167 pages
Machines Asynchrones : Principe et Types
Pas encore d'évaluation
Machines Asynchrones : Principe et Types
11 pages
Entraînement Natation: Filières Énergétiques
Pas encore d'évaluation
Entraînement Natation: Filières Énergétiques
16 pages
1spe SVT The Me 1A Chapitre 2
Pas encore d'évaluation
1spe SVT The Me 1A Chapitre 2
4 pages

Recherche Motifs

Transféré par

Recherche Motifs

Transféré par

Algorithmes

Liffré Novembre 2005

 LEROY Hugues ing. rech. Symbiose

 Problème fondamental dans de nombreux domaines

 Problème de complexité faible (algorithmes linéaires en

 Recherche exacte de mots ok, erreurs (fréquent en

 Introduit rapidement la biologie moléculaire;

 Présente les concepts fondamentaux de la

 Propose un choix des meilleurs algorithmes;

 Termine par quelques problèmes de recherche.

Structure de l’ADN : une double hélice

Deux brins complémentaires orientés

Alphabet à 4 lettres: Purines A et T Pyrimidines C et G

 Séquencer un génome, c’est obtenir la série

Données Gold oct. 2005 :

HGP : le projet génome humain

Les Gènes codent

Structure & Fonction

Quelques notations sur les mots

 A word is a finite suite of letters, elements of a finite non empty alphabet.

∀i∈[0, x − p−1], x[i+ p]=x[i]

 Border of word x : non empty word which is

 If p is the period of x and b the border of x,

 Quels sont les facteurs de taille 3 du mot

 Trouvez toutes les périodes du mot

 Trouver la taille des bords des préfixes du

 aaa, aab, aba, abb, baa, bab, bba, bbb

 Access to a position in a sequence of size n;

 Comparison of 2 letters or 2 numbers ≤ n;

 Boolean operations on o(log2(n))- bit vectors

 How many time will require the naïve search of a 1.6kb

Recherche exacte de chaîne :

Naive Exact string matching (Pattern,Sequence):

if for each letter of the Pattern m

Complexité : o(m) space, o(n.m) time 16000s ~4H

 Pretreatment of the pattern

 Pretreatment of the sequence

Three types of approaches exist

1. Reading of the text, one character at a time, updating

2. Reading of the text with a sliding window and inclusion

3. Reading of the text with a sliding window and inclusion

Prétraitement du motif, un exemple

The naive algorithm computing Zi works on a

 Calculer les valeurs de Zi pour le motif

a) Z-word at l (i.e. r) does not reach i : compare from i

b) Z-word at l (i.e. r) passes i and Z i-l < r-i: do nothing !

c) Z-word at l (i.e. r) passes i and Z i-l ≥ r-i: compare from r

Prétraitement du motif, algorithme

def update_Z (Seq,Z,i,l,r):

else if (Z[i-l] < r +1 –i ) : % b) no comparison

else : % c) comparison from the previous r

 Quelles sont les valeurs de Zi, li et ri

Current i:1 Z[1]=0 , l=0, r=0 Case a)

i Zi new l new r i-l r+1-i case

Sur le nombre de comparaisons dans

If (i>r) : % a) Complete comparison necessary

else if (Z[i-l] < r +1 –i ) : % b) no comparison

else : % c) comparison from the previous r

Pretreatment is linear with respect to the length m of the

 Append pattern, $,and sequence;

 Compute Z on the whole sequence;

 Print positions where Z=m.

 KMP, in a similar way, looks at every position in the

 In practice, these algorithms are not the most

La plupart des chromosomes

Travailler sur la séquence CC !

(pour d’autres problèmes, ce n’est

Start from a binary representation of words.

« Shift-and » or « shift-or » method depending

Shift-or is slightly better than shift-and, working on the

 Alphabet of size t, coded from 0 to t-1. Pattern of size m.

 Data structures : Bit Matrix I(m,t) (Vector of t integers)

 I[i,j]=1 iff pattern [i]==j, Beware,

Structures de données du Shift-And

Seq ATA M2= 0101 ATA

 For k from 0 to n-1

LEROY Hugues ing. rech. Symbiose

Problème fondamental dans de nombreux domaines

Problème de complexité faible (algorithmes linéaires en

Recherche exacte de mots ok, erreurs (fréquent en

Introduit rapidement la biologie moléculaire;

Présente les concepts fondamentaux de la

Propose un choix des meilleurs algorithmes;

Termine par quelques problèmes de recherche.

Séquencer un génome, c’est obtenir la série

A word is a finite suite of letters, elements of a finite non empty alphabet.

Border of word x : non empty word which is

If p is the period of x and b the border of x,

Quels sont les facteurs de taille 3 du mot

Trouvez toutes les périodes du mot

Trouver la taille des bords des préfixes du

aaa, aab, aba, abb, baa, bab, bba, bbb

Access to a position in a sequence of size n;

Comparison of 2 letters or 2 numbers ≤ n;

Boolean operations on o(log2(n))- bit vectors

How many time will require the naïve search of a 1.6kb

Pretreatment of the pattern

Pretreatment of the sequence

Calculer les valeurs de Zi pour le motif

Quelles sont les valeurs de Zi, li et ri

Append pattern, $,and sequence;

Compute Z on the whole sequence;

Print positions where Z=m.

KMP, in a similar way, looks at every position in the

In practice, these algorithms are not the most

Alphabet of size t, coded from 0 to t-1. Pattern of size m.

Data structures : Bit Matrix I(m,t) (Vector of t integers)

I[i,j]=1 iff pattern [i]==j, Beware,

For k from 0 to n-1

G. Navarro & M. Raffinot 2000 :

The idea is to build and use an automaton recognizing all

Applicable for longer patterns (> 1 word of memory).

Backward Oracle Matching idea : replace suffix

The algorithm is o(n.m) worst case,

C’est un automate déterministe uniquement défini par sa

The factor oracle recognizes a (little) superset of the set of

k= 0 window = ATAT, state=0, i= 3

Recherche de motifs spécifiques,

Visualisation du contenu de génomes;

Recherche de modèles syntaxiques.

Consider the set of k patterns delimited by Xs;

Look in parallel for the k patterns;

Adapt Aho-Corasick’s multiple matching

For a large number of patterns (a dictionnary to

The most useful structure for this is a pattern tree

Direct application of set matching, where an algorithm with complexity not

It is the pattern tree of suffixes of a sequence (Trie).

Leaves of the tree refer to positions in the sequence.

It is the elementary suffix tree of a word W where all

It is the deterministic automaton recognizing

In order to get unique suffixes, one considers the string

In most cases, sequences of contaminants are known

It is sufficient to build the GST of potential contaminants

Length of the longest common word of 2 sequences :

From a practical point of view, considering the value of constants α

If it is not possible : no match

If one reaches a leaf : match

Else : multiple match, with occurrences

1st loop i : 0..n ([0..i])

2nd loop j : 0..i ([j..i])