Sequence alignments (similarity & homology)
What is a sequence?
- Bioinformatics: a biological sequence is simply a word
- A word is an ordered collection of symbols from an alphabet
- Only the primary structure is taken into account
Bremen
P. Thbault- Alignment sequence
The sequence is represented in a given format
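The slide does not name a format. As an illustration only, the sketch below assumes FASTA (a header line starting with `>` followed by sequence lines), one widely used choice. The function name `parse_fasta` and the two example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {identifier: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'
            words = line[1:].split()
            current_id = words[0] if words else f"unnamed_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append it to the record being read
            records[current_id] += line
    return records


# Hypothetical records, for illustration only
example = """>seq1 hypothetical protein fragment
MKFLFLILSC
LGAAVAF
>seq2 hypothetical DNA fragment
ACTTAGGC
"""
records = parse_fasta(example)
print(records)
```

Whatever the format, the point is the same: the word (the sequence) is stored together with an identifier and some annotation.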
Sequence analysis, what for?
- A sequence contains information about:
  - function
  - relationships with other molecules
- A sequence reflects physico-chemical constraints due to:
  - the environment (water, lipid)
  - molecular evolution
The objective is to predict important information about the function of a macromolecule from its sequence alone.
Sequence analysis, what for?
[Figure: analyses that start from a sequence and databases: homologous sequence search, pairwise alignment, multiple alignment, motif search, phylogeny, genome annotation]
Text search
- Search for terms/criteria within sequence annotations: function, keywords, organisms, features
- Generic sites:
  - Entrez: NCBI server (http://www.ncbi.nlm.nih.gov/Entrez/)
  - SRS: available at different sites
- Specialized sites:
  - SGD (all about yeast: http://genome-www.stanford.edu/Saccharomyces/)
Homology search
- Goal: search for sequence similarities in order to infer structural or functional information
- Method: sequence alignment or dot matrix
- Definitions:
  - Similarity: a measurable quantity, how much two sequences resemble each other
  - Homology:
    - is a hypothesis based on sequence similarity
    - states that 2 sequences are derived from a common ancestor
Evolution of sequences
- The concept of homology relates to the mechanisms of molecular evolution
- Principles:
  - homologous sequences are derived from a common ancestor whose sequence is not available (unfortunately!)
  - at the molecular level, the events of evolution are substitutions, insertions and deletions
  - there is selection pressure at the structural or functional level on genes or their products: this pressure guides sequence evolution
Information inference and evolution
- Most bioinformatics methods rely on transferring information from known sequences to new sequences: reasoning by inference
- This inference relies on evolutionary events such as:
  - speciation (one ancestral species => different species)
  - gene duplication
  - merging / splitting of genes (which leads to the domain composition of genes and proteins)
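[Figure: tree drawn along a time axis from a common ancestor, with a speciation event and a duplication event; proteins P1, P2, P2a and P2b in species 1 and species 2, with functions F and F2; orthologous and paralogous relationships are marked]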
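Functional inference
[Figure: P1 (species 1) and P2 (species 2) derive from a common ancestor P by speciation. The function F of P2 is known from experimental work and stored in a database. Question: what is the function of P1? Sequence comparison software detects the homology between P1 and P2, and function F is deduced for P1 from this homology.]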
Ex. 1: trypsin, human & chicken (~80 % id.)
TRY3_CHICK MKFLFLILSCLGAAVAFPGGADDDKIVGGYTCPEHSVPYQVSLNSGYHFCGGSLINSQWV
TRY3_HUMAN MN-PFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWV
           *: ****: :*****.* *********** *:*:********* ********..***

TRY3_CHICK LSAAHCYKSRIQVRLGEYNIDVQEDSEVVRSSSVIIRHPKYSSITLNNDIMLIKLASAVE
TRY3_HUMAN VSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAV
           :*******:********:**.* *..* . .:: *******. **:********:*..

TRY3_CHICK YSADIQPIALPSSCAKAGTECLISGWGNTLSNGYNYPELLQCLNAPILSDQECQEAYPGD
TRY3_HUMAN INARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLREAECKASCPGK
           .* :..*:**:: . *************** * :**: *:**:**:* : **: : **.

TRY3_CHICK ITSNMICVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGIGCALKGYPGVYTKVCNYVDW
TRY3_HUMAN ITNSMFCVGFLEGGKDSWKRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDW
           **..*:*********** **********:***:**** *** *. ******* *****
Ex. 2: trypsin, human & mosquito (~30 % id.)
TRY3_ANOGA MISNKIAILLAVLVVAVACAQARVALKHRSVQALPRFLPRPQYDVGHRIVGGFEIDVSET
TRY3_HUMAN --------MNPFLILAFVGAA--V--------AVP------FDDDDKIVGGYTCEENSL
           : ..*::*.. * * *:* :* ..:****:

TRY3_ANOGA PYQVSLQYFNSHRCGGSVLNSKWILTAAHCTVNLQPSSLAVRLGSS--RHASGGTVVRV
TRY3_HUMAN PYQVSLN-SGSHFCGGSLISEQWVVSAAHC---YKTRIQVRLGEHNIKVLEGNEQFINA
           ******: .** ****::..:*:::**** : : ****. : .*. .:..

TRY3_ANOGA ARVLEHPNYDDSTIDYDFSLMELETELTFSDVVQPVSLPEQDEAVEDGTMTTVSGWGNTQ
TRY3_HUMAN AKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAA-GTECLISGWGNTL
           *:::.**:*: .*:* *: *::*.: .:. *..:*** *. ** : .. :******

TRY3_ANOGA SAAESNAILRAANIPTVNQKECTIAYSSSGGITDRMLCAGYKRGGKDACQGDSGGPLVV
TRY3_HUMAN SFGADYPDELKCLDAPVLREAECKA-SCPGKITNSMFCVGFLEGGKDSWKRDSGGPVVC
           * .*: *:. : *.:.: **. *..* **: *:*.*: .****: : *****:*

TRY3_ANOGA DGKLVGVVSWGFGCAMPGYPGVYARVAVVRNWVRENSGA--
The trypsin case
- Highly conserved sequence
- Strong structural constraints: 3 cysteine bridges (Cys-Cys)
- Sequence similarity is consistent with the phylogenetic distances between the species
- Identity of function has been proved experimentally
Difficulties
- With time, mutations accumulate until the similarity between sequences disappears: homology is no longer detectable = false negatives
- Some mechanisms, independent of evolution, produce artifactual similarities (low-complexity regions): similarity but no homology = false positives
Modular composition of proteins
- Many proteins appear as combinations of domains
- Domains can be repeated, and can occur in different proteins in various orders
- Similarity (and homology!) between proteins can thus be partial: this makes alignment more complicated and affects functional inference (a shared domain might not be enough to imply a shared function)
Example of modular proteins
[Figure: domain architectures of F12 and PLAT, each built from F1, F2, E and K domains followed by a catalytic domain]
F12 & PLAT are 2 proteins involved in blood coagulation (the catalytic domain has a serine protease activity). Domains frequently correspond to exons.
How to compare 2 sequences?
- Based on a graphical view -> dot matrix approach
- Based on a sequence view -> alignment approach
"Dot matrix" view
[Figure: dot matrix, with Protein 1 along one axis and Protein 2 along the other]
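A dot matrix can be sketched in a few lines. The version below is a minimal illustration, not the tool used for the figure: it puts a dot wherever two windows share enough identical positions. The function names and the parameters `window` and `min_matches` are assumptions introduced here; the two short DNA words are the ones used in the alignment slides.

```python
def dot_matrix(seq1, seq2, window=1, min_matches=1):
    """Return a grid of booleans: cell [i][j] is True when the windows
    starting at seq2[i] and seq1[j] share at least min_matches identical positions."""
    rows = len(seq2) - window + 1
    cols = len(seq1) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq2[i + k] == seq1[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


def render(seq1, seq2, grid):
    """Draw the grid as text: seq1 along the top, seq2 down the side."""
    lines = ["  " + seq1[: len(grid[0])]]
    for i, row in enumerate(grid):
        lines.append(seq2[i] + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


grid = dot_matrix("ACTTAGGC", "AGTGGC")
print(render("ACTTAGGC", "AGTGGC", grid))

# A larger window filters out isolated dots and keeps only diagonal runs
strict = dot_matrix("ACTTAGGC", "AGTGGC", window=3, min_matches=3)
```

Similar regions show up as diagonal runs of dots; with `window=3` only the shared word GGC survives.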
Alignment of 2 sequences
- Problem: a huge number of possible alignments
- Which one to choose?

A C - T T A G G C
A - G T - G G C
* * * * *

A C T T A G G C
- A G T G G C
* * * *

A C T T A G G C
A G T - G G C
* * * * *
Alignment of 2 sequences
- Evaluation
- Similarity criteria

A C T T A G G C
- A G T G G C
* * * *
4 matches, 2 mismatches, 2 gaps

A C - T T A G G C
A - G T - G G C
* * * * *
5 matches, 0 mismatches, 4 gaps

A C T T A G G C
A G T - G G C
* * * * *
5 matches, 1 mismatch, 2 gaps
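Counting the three kinds of events is a column-by-column comparison of the two aligned rows. The sketch below shows this; the function name `count_events` is hypothetical, and the two alignments passed to it are illustrative alignments of the same two words (ACTTAGGC and AGTGGC), written with rows of equal length, not transcriptions of the slide.

```python
def count_events(row1, row2):
    """Count (matches, mismatches, gaps) in a pairwise alignment.

    row1 and row2 are the two aligned rows, of equal length,
    with '-' marking a gap.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gaps += 1          # a residue facing a gap
        elif a == b:
            matches += 1       # identical residues
        else:
            mismatches += 1    # substitution
    return matches, mismatches, gaps


# Illustrative alignments of ACTTAGGC and AGTGGC
print(count_events("ACTTAGGC", "A--GTGGC"))    # 4 matches, 2 mismatches, 2 gaps
print(count_events("AC-TTAGGC", "A-GT--GGC"))  # 5 matches, 0 mismatches, 4 gaps
```

The counts alone do not say which alignment is better: that requires a score for each kind of event.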
Score Systems
- Alignment score = sum of the scores at each position
- Different events: indel / substitution / identity
  - match = +2
  - mismatch = -1
  - gap = -2
- Not all substitutions are equivalent:
  - DNA: transitions / transversions
  - Proteins: physico-chemical properties, models of evolution -> substitution matrix
- Gap penalties:
  - linear, log
  - opening / extending
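The score system above can be applied to an alignment as follows. This is a minimal sketch: match = +2, mismatch = -1 and gap = -2 come from the slide, while the function name, the separate `gap_open` / `gap_extend` parameters and the extension value of -1 used in the second call are assumptions chosen to illustrate the opening / extending distinction.

```python
def score_alignment(row1, row2, match=2, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Sum the per-column scores of a pairwise alignment.

    With gap_open == gap_extend the gap penalty is linear.
    With a milder gap_extend, the first position of a gap run costs gap_open
    and each following position of the same run costs gap_extend.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap = None  # which row held a gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            current_gap = 1 if a == "-" else 2
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if a == b else mismatch
            previous_gap = None
    return score


# Linear gap penalty: 4 matches, 2 mismatches, 2 gaps -> 4*2 - 2*1 - 2*2
print(score_alignment("ACTTAGGC", "A--GTGGC"))
# Hypothetical milder extension penalty: the 2-position gap costs -2 + -1
print(score_alignment("ACTTAGGC", "A--GTGGC", gap_extend=-1))
```

Separating opening from extension favours a few long gaps over many scattered ones.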
Substitution matrices (amino acids)
[Figure: the BLOSUM62 matrix]
Substitution matrices (amino acids)
BLOSUM62:
- matches: always > 0, but with different scores
- mismatches:
  - < 0: penalty
  - = 0: neutral
  - > 0: nearly like a match
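The three cases can be illustrated with a handful of BLOSUM62 entries. The sketch below contains only a small excerpt of the matrix (the full matrix covers all 20 amino acids); the names `BLOSUM62_EXCERPT`, `substitution_score` and `describe` are hypothetical.

```python
# Small excerpt of BLOSUM62, stored once per unordered pair
BLOSUM62_EXCERPT = {
    ("W", "W"): 11, ("C", "C"): 9, ("A", "A"): 4,  # matches: all > 0, but different
    ("I", "L"): 2, ("D", "E"): 2,                  # mismatches scored > 0
    ("A", "G"): 0, ("A", "T"): 0,                  # mismatches scored = 0
    ("D", "L"): -4, ("G", "W"): -2,                # mismatches scored < 0
}


def substitution_score(a, b):
    """Look up a pair in the excerpt; the matrix is symmetric."""
    key = (a, b) if (a, b) in BLOSUM62_EXCERPT else (b, a)
    return BLOSUM62_EXCERPT[key]


def describe(a, b):
    """Classify a pair the way the slide does."""
    score = substitution_score(a, b)
    if a == b:
        return "match"
    if score < 0:
        return "penalty"
    if score == 0:
        return "neutral"
    return "nearly like a match"


for pair in [("W", "W"), ("A", "A"), ("L", "I"), ("A", "G"), ("L", "D")]:
    print(pair, substitution_score(*pair), describe(*pair))
```

Aligning two tryptophans is rewarded far more than aligning two alanines, and swapping isoleucine for leucine still scores positively.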
Algorithms
- How to find the best alignment?
- Exact algorithms:
  - Dynamic programming (Needleman & Wunsch, Smith & Waterman)
  - Time-consuming when searching databases
- Heuristics = no guarantee of finding the optimal solution
  - Blast, Fasta
Global or local?
- 2 types of alignment:
  - global: over the total length (Needleman & Wunsch)
  - local: by pieces (Smith & Waterman)
- Heuristic programs: Fasta, Blast
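Both exact algorithms fill the same dynamic-programming table and differ in three places: how the borders are initialised, whether a cell may fall back to 0, and where the traceback starts and stops. The sketch below is a simplified illustration with a linear gap penalty, in the style of Needleman & Wunsch (global) and Smith & Waterman (local); it is not the published formulation of either, and the function name `align` and its `local` flag are assumptions. The default scores are the ones from the score-system slide.

```python
def align(seq1, seq2, local=False, match=2, mismatch=-1, gap=-2):
    """Dynamic-programming alignment with a linear gap penalty.

    local=False: global alignment over the total length.
    local=True:  best local alignment (a piece of each sequence).
    Returns (score, aligned_row1, aligned_row2).
    """
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global: leading gaps are penalised
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap

    def pair(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [score[i - 1][j - 1] + pair(i, j),  # align two residues
                          score[i - 1][j] + gap,             # gap in seq2
                          score[i][j - 1] + gap]             # gap in seq1
            if local:
                candidates.append(0)                         # local: restart here
            score[i][j] = max(candidates)
            if score[i][j] > best:
                best, end = score[i][j], (i, j)

    # Traceback: from the last cell (global) or the best cell (local)
    i, j = end if local else (n, m)
    row1, row2 = [], []
    while (i > 0 or j > 0) and not (local and score[i][j] == 0):
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    final = best if local else score[n][m]
    return final, "".join(reversed(row1)), "".join(reversed(row2))


print(align("ACTTAGGC", "AGTGGC"))              # global
print(align("ACTTAGGC", "AGTGGC", local=True))  # local
```

The table has one cell per pair of positions, which is why the exact methods become slow when a query is compared with a whole database.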
Comparing a sequence with those of a database
The goal is to compare a query sequence with all the subject sequences of the database
[Figure: a query sequence compared against a sequence database]
For each sequence of the database, the program tries to find the best alignment
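The search loop itself is simple: score the query against every subject and rank the results. The sketch below uses an exact local-alignment score for each comparison, which is what a heuristic such as Blast approximates much faster. The function names, the score values (taken from the score-system slide) and the three-entry database are hypothetical.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty), score only."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diag = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(0, diag, previous[j] + gap, current[j - 1] + gap))
        best = max(best, max(current))
        previous = current
    return best


def search(query, database):
    """Compare the query with every subject; return (score, name) pairs, best first."""
    hits = [(local_score(query, subject), name) for name, subject in database.items()]
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


# Hypothetical toy database
database = {
    "subject_1": "ACTTAGGC",
    "subject_2": "TTTTTTTT",
    "subject_3": "AGTGGC",
}
hits = search("AGTGGC", database)
for score, name in hits:
    print(name, score)
```

A raw score ranks the hits but does not say whether the best one is meaningful; that is the role of the statistical evaluation.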
Blast Hit evaluation
- Statistical evaluation: could the hit be due to chance?
- E-value:

  $E = K \cdot m \cdot n \cdot e^{-\lambda S}$

  - S: score of the alignment (reported by Blast in normalized form as the "bit score")
  - K, $\lambda$: parameters (score system, sequence composition)
  - m, n: lengths of the sequences (or size of the database)
E = number of alignments expected in the database with a score of at least S under the random hypothesis
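The formula can be evaluated directly. In the sketch below the function name and all numeric values (K = 0.1, lambda = 0.3, the lengths and the scores) are made-up placeholders chosen only to show how E behaves; real values of K and lambda depend on the score system and on sequence composition.

```python
import math


def expected_hits(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


# Hypothetical parameters, for illustration only
K, lam = 0.1, 0.3
query_length = 300          # m
database_size = 1_000_000   # n

for score in (20, 50, 100):
    print(score, expected_hits(score, query_length, database_size, K, lam))
```

E falls exponentially as the score rises and grows linearly with the size of the database, so the same score is less convincing in a larger database.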
Blast Tools
Using Blast
- Questions:
  - Which database?
    - General (GenBank, UniProt)
    - Specialized (EST, limited to one organism, family of proteins, etc.)
  - Nucleic acid or protein sequences?
  - Are the default parameters suitable?
- Interpretation of the results:
  - Which maximum E-value? No simple rule:
      E < 1e-10 => clear homology
      1e-10 < E < 1e-5 => maybe???
      1e-5 < E => not significant enough
  - But also: size, % identity, % gaps => examine the alignments
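The rule of thumb above can be written as a small filter. This is only a sketch of the slide's thresholds, which the slide itself presents as "no simple rule"; the function name and the example E-values are hypothetical.

```python
def interpret_evalue(e_value):
    """Apply the rule-of-thumb thresholds to a Blast E-value."""
    if e_value < 1e-10:
        return "clear homology"
    if e_value < 1e-5:
        return "maybe"
    return "not significant enough"


# Hypothetical E-values
for e in (3e-42, 2e-7, 0.5):
    print(e, interpret_evalue(e))
```

A hit in the "maybe" range still has to be judged from the alignment itself: its length, % identity and % gaps.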
Blast programs
Program   Query seq.    Database      Comment
blastp    proteins      proteins
blastn    nucleotides   nucleotides
blastx    nucleotides   proteins      Translation of the query seq.
tblastn   proteins      nucleotides   Translation of the database
tblastx   nucleotides   nucleotides   Translation of the query seq. and the database
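The table amounts to a lookup on three questions: what the query is, what the database contains, and whether sequences are translated before comparison. The sketch below encodes it; the function name, the type labels and the `translated` flag are assumptions made for this illustration, not options of the Blast programs.

```python
BLAST_PROGRAMS = {
    # (query type, database type, translated search?): program
    ("protein", "protein", False): "blastp",
    ("nucleotide", "nucleotide", False): "blastn",
    ("nucleotide", "protein", True): "blastx",      # query is translated
    ("protein", "nucleotide", True): "tblastn",     # database is translated
    ("nucleotide", "nucleotide", True): "tblastx",  # both are translated
}


def choose_blast_program(query_type, database_type, translated=None):
    """Pick the Blast program for a query/database combination.

    If translated is not given, a translated search is assumed
    exactly when the two types differ.
    """
    if translated is None:
        translated = query_type != database_type
    key = (query_type, database_type, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no Blast program for {key}")
    return BLAST_PROGRAMS[key]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", translated=True))
```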