Biopython Tutorial and Cookbook
Contents
1 Introduction
1.1 What is Biopython?
1.2 What can I find in the Biopython package
1.3 Installing Biopython
1.4 Frequently Asked Questions (FAQ)
3 Sequence objects
3.1 Sequences and Alphabets
3.2 Sequences act like strings
3.3 Slicing a sequence
3.4 Turning Seq objects into strings
3.5 Concatenating or adding sequences
3.6 Changing case
3.7 Nucleotide sequences and (reverse) complements
3.8 Transcription
3.9 Translation
3.10 Translation Tables
3.11 Comparing Seq objects
3.12 MutableSeq objects
3.13 UnknownSeq objects
3.14 Working with strings directly
5 Sequence Input/Output
5.1 Parsing or Reading Sequences
5.1.1 Reading Sequence Files
5.1.2 Iterating over the records in a sequence file
5.1.3 Getting a list of the records in a sequence file
5.1.4 Extracting data
5.2 Parsing sequences from compressed files
5.3 Parsing sequences from the net
5.3.1 Parsing GenBank records from the net
5.3.2 Parsing SwissProt sequences from the net
5.4 Sequence files as Dictionaries
5.4.1 Sequence files as Dictionaries - In memory
5.4.2 Sequence files as Dictionaries - Indexed files
5.4.3 Sequence files as Dictionaries - Database indexed files
5.4.4 Indexing compressed files
5.4.5 Discussion
5.5 Writing Sequence Files
5.5.1 Round trips
5.5.2 Converting between sequence file formats
5.5.3 Converting a file of sequences to their reverse complements
5.5.4 Getting your SeqRecord objects as formatted strings
7 BLAST
7.1 Running BLAST over the Internet
7.2 Running BLAST locally
7.2.1 Introduction
7.2.2 Standalone NCBI BLAST+
7.2.3 Other versions of BLAST
7.3 Parsing BLAST output
7.4 The BLAST record class
7.5 Deprecated BLAST parsers
7.5.1 Parsing plain-text BLAST output
7.5.2 Parsing a plain-text BLAST file full of BLAST runs
7.5.3 Finding a bad record somewhere in a huge plain-text BLAST file
7.6 Dealing with PSI-BLAST
7.7 Dealing with RPS-BLAST
15 Cluster analysis
15.1 Distance functions
15.2 Calculating cluster properties
15.3 Partitioning algorithms
15.4 Hierarchical clustering
15.5 Self-Organizing Maps
15.6 Principal Component Analysis
15.7 Handling Cluster/TreeView-type files
15.8 Example calculation
15.9 Auxiliary functions
20 Advanced
20.1 Parser Design
20.2 Substitution Matrices
20.2.1 SubsMat
20.2.2 FreqTable
Chapter 1
Introduction
1.1
What is Biopython?
The Biopython Project is an international association of developers of freely available Python
(http://www.python.org) tools for computational molecular biology. Python is an object-oriented, interpreted,
flexible language that is becoming increasingly popular for scientific computing. Python is easy to learn, has
a very clear syntax and can easily be extended with modules written in C, C++ or FORTRAN.
The Biopython web site (http://www.biopython.org) provides an online resource for modules, scripts,
and web links for developers of Python-based software for bioinformatics use and research. The goal of
Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality,
reusable modules and classes. Biopython features include parsers for various bioinformatics file formats
(BLAST, ClustalW, FASTA, GenBank, ...), access to online services (NCBI, ExPASy, ...), interfaces to common
and not-so-common programs (ClustalW, DSSP, MSMS, ...), a standard sequence class, various clustering
modules, a KD tree data structure, and even documentation.
Basically, we just like to program in Python and want to make it as easy as possible to use Python for
bioinformatics by creating high-quality, reusable modules and scripts.
1.2
1.3
Installing Biopython
All of the installation information for Biopython was separated from this document to make it easier to keep
updated.
The short version: go to our downloads page (http://biopython.org/wiki/Download), download and
install the listed dependencies, then download and install Biopython. Biopython runs on many platforms
(Windows, Mac, and the various flavors of Linux and Unix). For Windows we provide pre-compiled
click-and-run installers, while for Unix and other operating systems you must install from source as described
in the included README file. This is usually as simple as the standard commands:
python setup.py build
python setup.py test
sudo python setup.py install
(You can in fact skip the build and test steps and go straight to the install, but it's better to make sure
everything seems to be working.)
The longer version of our installation instructions covers installation of Python, Biopython dependencies
and Biopython itself. It is available in PDF (http://biopython.org/DIST/docs/install/Installation.pdf)
and HTML formats (http://biopython.org/DIST/docs/install/Installation.html).
1.4
Frequently Asked Questions (FAQ)
If the import Bio line fails, Biopython is not installed. If the second line fails, your version is very
out of date. If the version string ends with a plus, you don't have an official release, but a snapshot of
the in-development code.
5. Where is the latest version of this document?
If you download a Biopython source code archive, it will include the relevant version in both HTML
and PDF formats. The latest published version of this document (updated at each release) is online:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
If you are using the very latest unreleased code from our repository you can find copies of the in-progress
tutorial here:
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf
6. Why is the Seq object missing the upper & lower methods described in this Tutorial?
You need Biopython 1.53 or later. Alternatively, use str(my_seq).upper() to get an upper case
string. If you need a Seq object, try Seq(str(my_seq).upper()) but be careful about blindly re-using
the same alphabet.
7. Why doesn't the Seq object translation method support the cds option described in this Tutorial?
You need Biopython 1.51 or later.
8. What file formats do Bio.SeqIO and Bio.AlignIO read and write?
Check the built-in docstrings (from Bio import SeqIO, then help(SeqIO)), or see
http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO on the wiki for the latest listing.
9. Why won't the Bio.SeqIO and Bio.AlignIO functions parse, read and write take filenames? They
insist on handles!
You need Biopython 1.54 or later, or just use handles explicitly (see Section 22.1). It is especially
important to remember to close output handles explicitly after writing your data.
10. Why won't the Bio.SeqIO.write() and Bio.AlignIO.write() functions accept a single record or
alignment? They insist on a list or iterator!
You need Biopython 1.54 or later, or just wrap the item with [...] to create a list of one element.
11. Why doesn't str(...) give me the full sequence of a Seq object?
You need Biopython 1.45 or later.
12. Why doesn't Bio.Blast work with the latest plain text NCBI BLAST output?
The NCBI keep tweaking the plain text output from the BLAST tools, and keeping our parser up
to date is/was an ongoing struggle. If you aren't using the latest version of Biopython, you could
try upgrading. However, we (and the NCBI) recommend you use the XML output instead, which is
designed to be read by a computer program.
13. Why doesn't Bio.Entrez.parse() work? The module imports fine but there is no parse function!
You need Biopython 1.52 or later.
14. Why has my script using Bio.Entrez.efetch() stopped working?
This could be due to NCBI changes in February 2012 introducing EFetch 2.0. First, they changed
the default return modes - you probably want to add retmode="text" to your call. Second, they are
now stricter about how to provide a list of IDs - Biopython 1.59 onwards turns a list into a
comma-separated string automatically.
15. Why doesn't Bio.Blast.NCBIWWW.qblast() give the same results as the NCBI BLAST website?
You need to specify the same options - the NCBI often adjust the default settings on the website, and
they do not match the QBLAST defaults anymore. Check things like the gap penalties and expectation
threshold.
16. Why doesn't Bio.Blast.NCBIXML.read() work? The module imports but there is no read function!
You need Biopython 1.50 or later. Or, use next(Bio.Blast.NCBIXML.parse(...)) instead.
17. Why doesn't my SeqRecord object have a letter_annotations attribute?
Per-letter-annotation support was added in Biopython 1.50.
18. Why can't I slice my SeqRecord to get a sub-record?
You need Biopython 1.50 or later.
19. Why can't I add SeqRecord objects together?
You need Biopython 1.53 or later.
20. Why doesn't Bio.SeqIO.convert() or Bio.AlignIO.convert() work? The modules import fine but
there is no convert function!
You need Biopython 1.52 or later. Alternatively, combine the parse and write functions as described
in this tutorial (see Sections 5.5.2 and 6.2.1).
21. Why doesn't Bio.SeqIO.index() work? The module imports fine but there is no index function!
You need Biopython 1.52 or later.
22. Why doesn't Bio.SeqIO.index_db() work? The module imports fine but there is no index_db function!
You need Biopython 1.57 or later (and a Python with SQLite3 support).
23. Where is the MultipleSeqAlignment object? The Bio.Align module imports fine but this class isn't
there!
You need Biopython 1.54 or later. Alternatively, the older Bio.Align.Generic.Alignment class
supports some of its functionality, but using this is now discouraged.
24. Why can't I run command line tools directly from the application wrappers?
You need Biopython 1.55 or later. Alternatively, use the Python subprocess module directly.
25. I looked in a directory for code, but I couldn't find the code that does something. Where's it hidden?
One thing to know is that we put code in __init__.py files. If you are not used to looking for code
in this file this can be confusing. The reason we do this is to make the imports easier for users. For
instance, instead of having to do a repetitive import like from Bio.GenBank import GenBank, you
can just use from Bio import GenBank.
26. Why does the code from CVS seem out of date?
In late September 2009, just after the release of Biopython 1.52, we switched from using CVS to git,
a distributed version control system. The old CVS server will remain available as a static and read-only
backup, but if you want to grab the latest code, you'll need to use git instead. See our website
for more details.
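FAQ item 14 can be made concrete with a small sketch. The helper below is hypothetical (efetch_params is not part of Biopython); it only builds the keyword arguments one might pass to Bio.Entrez.efetch(), joining the IDs into a comma-separated string and setting retmode explicitly. The two GI numbers are the ones that appear in the FASTA example of Section 2.4.1; the db and rettype values are assumptions chosen for illustration.

```python
def efetch_params(db, ids, rettype, retmode="text"):
    """Build keyword arguments for Bio.Entrez.efetch (hypothetical helper).

    The ID list is joined into one comma-separated string, which the FAQ
    says Biopython 1.59 onwards does automatically, and retmode is set
    explicitly instead of relying on the server default.
    """
    return {
        "db": db,
        "id": ",".join(str(i) for i in ids),
        "rettype": rettype,
        "retmode": retmode,
    }


params = efetch_params("nucleotide", [2765658, 2765564], "fasta")
print(params["id"])       # the IDs as a single comma-separated string
print(params["retmode"])  # the explicit return mode
```

With Biopython installed and network access available, the dictionary could then be handed on as Entrez.efetch(**params), after setting Entrez.email as the NCBI requests.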
For more general questions, the Python FAQ pages http://www.python.org/doc/faq/ may be useful.
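The installation check described in the FAQ answer about import Bio can be sketched as a short script. This assumes that the "second line" mentioned there prints Bio.__version__; describe_version is a hypothetical helper added here only to spell out the "ends with a plus" rule.

```python
def describe_version(version):
    """Classify a Biopython version string using the rule from the FAQ."""
    if version.endswith("+"):
        return "snapshot of the in-development code"
    return "official release"


try:
    import Bio
except ImportError:
    # The first line of the check failed.
    print("Biopython is not installed")
else:
    # Very old releases have no __version__ attribute (assumption: this is
    # the failure of the "second line" that the FAQ refers to).
    version = getattr(Bio, "__version__", None)
    if version is None:
        print("Biopython is installed, but this version is very out of date")
    else:
        print("Biopython", version, "-", describe_version(version))
```

The script only reports what it finds, so it is safe to run on a machine where Biopython may be missing.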
Chapter 2
2.1
As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with things
of interest to biologists working on the computer. In general this means that you will need to have at
least some programming experience (in Python, of course!) or at least an interest in learning to program.
Biopython's job is to make your job easier as a programmer by supplying reusable libraries so that you
can focus on answering your specific question of interest, instead of focusing on the internals of parsing a
particular file format (of course, if you want to help by writing a parser that doesn't exist and contributing
it to Biopython, please go ahead!). So Biopython's job is to make you happy!
One thing to note about Biopython is that it often provides multiple ways of doing the same thing.
Things have improved in recent releases, but this can still be frustrating as in Python there should ideally
be one right way to do something. However, this can also be a real benefit because it gives you lots of
flexibility and control over the libraries. The tutorial helps to show you the common or easy ways to do
things so that you can just make things work. To learn more about the alternative possibilities, look in the
Cookbook (Chapter 18, which has some cool tricks and tips), the Advanced section (Chapter 20), the built-in
docstrings (via the Python help command, or the API documentation) or ultimately the code itself.
2.2
Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we'll start with a quick
introduction to the Biopython mechanisms for dealing with sequences, the Seq object, which we'll discuss in
more detail in Chapter 3.
Most of the time when we think about sequences we have in mind a string of letters like AGTACACTGGT.
You can create such a Seq object with this sequence as follows - the >>> represents the Python prompt
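A minimal sketch of creating and using such a Seq object, written as a script rather than typed at the >>> prompt. The import guard and the particular method calls shown (complement and reverse_complement) are choices made for this sketch; the guard simply lets the script do nothing when Biopython is not installed.

```python
try:
    from Bio.Seq import Seq
except ImportError:
    Seq = None  # Biopython is not installed (see Section 1.3)

if Seq is not None:
    my_seq = Seq("AGTACACTGGT")
    print(my_seq)                       # the letters of the sequence
    print(len(my_seq))                  # its length
    print(my_seq.complement())          # the complementary strand
    print(my_seq.reverse_complement())  # the reverse complement
```

With Biopython available this prints AGTACACTGGT, 11, TCATGTGACCA and ACCAGTGTACT, one per line. Chapter 3 covers what else a Seq object can do.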
2.3
A usage example
Before we jump right into parsers and everything else to do with Biopython, let's set up an example to
motivate everything we do and make life more interesting. After all, if there wasn't any biology in this
tutorial, why would you want to read it?
Since I love plants, I think we're just going to have to have a plant-based example (sorry to all the fans
of other organisms out there!). Having just completed a recent trip to our local greenhouse, we've suddenly
developed an incredible obsession with Lady Slipper Orchids (if you wonder why, have a look at some Lady
Slipper Orchids photos on Flickr, or try a Google Image Search).
Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying
evolution and systematics. So let's suppose we're thinking about writing a funding proposal to do a molecular
study of Lady Slipper evolution, and would like to see what kind of research has already been done and how
we can add to that.
After a little bit of reading up we discover that the Lady Slipper Orchids are in the Orchidaceae family and
the Cypripedioideae sub-family and are made up of 5 genera: Cypripedium, Paphiopedilum, Phragmipedium,
Selenipedium and Mexipedium.
That gives us enough to get started delving for more information. So, let's look at how the Biopython
tools can help us. We'll start with sequence parsing in Section 2.4, but the orchids will be back later on as
well - for example we'll search PubMed for papers about orchids and extract sequence data from GenBank in
Chapter 9, extract data from Swiss-Prot from certain orchid proteins in Chapter 10, and work with ClustalW
multiple sequence alignments of orchid proteins in Section 6.4.1.
2.4
A large part of much bioinformatics work involves dealing with the many types of file formats designed to
hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing
these files into a format so that you can manipulate them with some kind of programming language. However,
the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and
that formats may contain small subtleties which can break even the most well-designed parsers.
We are now going to briefly introduce the Bio.SeqIO module - you can find out more in Chapter 5. We'll
start with an online search for our friends, the lady slipper orchids. To keep this introduction simple, we're
just using the NCBI website by hand. Let's just take a look through the nucleotide databases at NCBI,
using an Entrez online search (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide)
for everything mentioning the text Cypripedioideae (this is the subfamily of lady slipper orchids).
When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA
formatted text file and as a GenBank formatted text file (files ls_orchid.fasta and ls_orchid.gbk, also
included with the Biopython source code under docs/tutorial/examples/).
If you run the search today, you'll get hundreds of results! When following the tutorial, if you want to
see the same list of genes, just download the two files above or copy them from docs/examples/ in the
Biopython source code. In Section 2.5 we will look at how to do a search like this from within Python.
2.4.1
If you open the lady slipper orchids FASTA file ls_orchid.fasta in your favourite text editor, you'll see
that the file starts like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
It contains 94 records, each of which has a line starting with > (the greater-than symbol) followed by the
sequence on one or more lines. Now try this in Python:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather
generic SingleLetterAlphabet() rather than something DNA specific.
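To make the record layout described above concrete, here is a minimal pure-Python sketch of what any FASTA reader has to do: start a new record at each > line and join the sequence lines that follow it. It is an illustration only, not how Bio.SeqIO is implemented, and read_fasta is a hypothetical name. The sample text is the (truncated) start of the first record shown above.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA-formatted text handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


example = io.StringIO(
    ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n"
    "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n"
    "AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n"
)
records = list(read_fasta(example))
for title, sequence in records:
    record_id = title.split(None, 1)[0]  # first word of the title line
    print(record_id, len(sequence))
```

Taking the first word of the title line as the identifier matches the seq_record.id values printed above. Nothing in the file says whether the letters are DNA, RNA or protein, which is why a generic alphabet is the only safe default.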
2.4.2
Now let's load the GenBank file ls_orchid.gbk instead - notice that the code to do this is almost identical
to the snippet used above for the FASTA file - the only difference is we change the filename and the format
string:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
This should give:
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. You'll also
notice that a shorter string has been used as the seq_record.id in this case.
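If you want to match records between the two files, the longer FASTA identifiers shown earlier can be reduced to the shorter GenBank-style ones. The following is a small plain-Python sketch; it assumes the pipe-separated layout seen in the two identifiers printed above (gi, GI number, database, accession.version, locus name), and the helper name accession_from_fasta_id is hypothetical.

```python
def accession_from_fasta_id(fasta_id):
    """Return the accession.version field of a 'gi|...' style identifier."""
    fields = fasta_id.split("|")
    # Assumed layout: gi | GI number | database | accession.version | locus name
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    return fasta_id  # anything else is returned untouched


print(accession_from_fasta_id("gi|2765658|emb|Z78533.1|CIZ78533"))
print(accession_from_fasta_id("gi|2765564|emb|Z78439.1|PBZ78439"))
```

Identifiers that do not follow this layout are passed through unchanged, so the helper is safe to apply to the GenBank identifiers as well.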
2.4.3
Biopython has a lot of parsers, each with its own quirks that depend on the sequence format it
handles. Chapter 5 covers Bio.SeqIO in more detail, while Chapter 6 introduces Bio.AlignIO
for sequence alignments.
While the most popular file formats have parsers integrated into Bio.SeqIO and/or Bio.AlignIO, for
some of the rarer and unloved file formats there is either no parser at all, or an old parser which has
not been linked in yet. Please also check the wiki pages http://biopython.org/wiki/SeqIO and http:
//biopython.org/wiki/AlignIO for the latest information, or ask on the mailing list. The wiki pages
should include an up-to-date list of supported file types, and some additional examples.
The next place to look for information about specific parsers and how to do cool things with them is in
the Cookbook (Chapter 18 of this Tutorial). If you don't find the information you are looking for, please
consider helping out your poor overworked documentors by submitting a cookbook entry about it (once
you figure out how to do it, that is)!
2.5
One of the very common things that you need to do in bioinformatics is extract information from biological
databases. It can be quite tedious to access these databases manually, especially if you have a lot of repetitive
work to do. Biopython attempts to save you time and energy by making some on-line databases available
from Python scripts. Currently, Biopython has code to extract information from the following databases:
- Entrez (and PubMed) from the NCBI: see Chapter 9.
- ExPASy: see Chapter 10.
- SCOP: see the Bio.SCOP.search() function.
The code in these modules basically makes it easy to write Python code that interacts with the CGI
scripts on these pages, so that you can get results in an easy-to-deal-with format. In some cases, the results
can be tightly integrated with the Biopython parsers to make it even easier to extract information.
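To give a feel for what "interacting with the CGI scripts" means, here is a rough sketch that only builds a search URL for the NCBI Entrez E-utilities service, without sending any request. It uses just the Python standard library rather than Biopython itself; the helper name build_esearch_url and the search term are assumptions made for illustration, and Chapter 9 describes the interface Biopython actually provides.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, retmax=20):
    """Build (but do not send) an Entrez search URL."""
    # db, term and retmax are standard E-utilities query parameters.
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return ESEARCH_URL + "?" + query


# Hypothetical search term, chosen only for illustration.
url = build_esearch_url("nucleotide", "orchid ITS1", retmax=5)
print(url)
```

Wrapping this kind of URL construction, the network request and the parsing of the reply is what saves you the tedious manual work.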
2.6 What to do next
Now that you've made it this far, you hopefully have a good understanding of the basics of Biopython and
are ready to start using it for doing useful work. The best thing to do now is finish reading this tutorial,
and then, if you want, start snooping around in the source code and looking at the automatically generated
documentation.
Once you get a picture of what you want to do, and what libraries in Biopython will do it, you should
take a peek at the Cookbook (Chapter 18), which may have example code to do something similar to what
you want to do.
If you know what you want to do, but can't figure out how to do it, please feel free to post questions
to the main Biopython list (see http://biopython.org/wiki/Mailing_lists). This will not only help us
answer your question, it will also allow us to improve the documentation so it can help the next person do
what you want to do.
Enjoy the code!
Chapter 3
Sequence objects
Biological sequences are arguably the central object in Bioinformatics, and in this chapter we'll introduce
the Biopython mechanism for dealing with sequences, the Seq object. Chapter 4 will introduce the related
SeqRecord object, which combines the sequence information with any annotation, used again in Chapter 5
for Sequence Input/Output.
Sequences are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the
most common way that sequences are seen in biological file formats.
There are two important differences between Seq objects and standard Python strings. First of all, they
have different methods. Although the Seq object supports many of the same methods as a plain string, its
translate() method differs by doing biological translation, and there are also additional biologically relevant
methods like reverse_complement(). Secondly, the Seq object has an important attribute, alphabet, which
is an object describing what the individual characters making up the sequence string mean, and how they
should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that
happens to be rich in Alanines, Glycines, Cysteines and Threonines?
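The first difference can be illustrated with plain strings. In the sketch below, the string method translate() performs a character-for-character mapping, which is quite unlike biological translation, and a reverse complement has to be written by hand. This is only an illustration, not Biopython's implementation; it assumes an upper-case sequence containing just A, C, G and T, and the names COMPLEMENT and reverse_complement are defined here for the example.

```python
# Mapping table for the four unambiguous DNA bases (assumption: upper case only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse-complement an unambiguous, upper-case DNA string."""
    # str.translate swaps each base for its partner; the slice reverses the result.
    return dna.translate(COMPLEMENT)[::-1]


dna = "AGTACACTGGT"
print(reverse_complement(dna))
```

With a Seq object none of this bookkeeping is needed, because the biologically relevant operations are provided as methods.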
3.1 Sequences and Alphabets
The alphabet object is perhaps the most important thing that makes the Seq object more than just a string.
The currently available alphabets for Biopython are defined in the Bio.Alphabet module. We'll use the
IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects:
DNA, RNA and Proteins.
Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides
the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic
IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional
elements "U" (or "Sec" for selenocysteine) and "O" (or "Pyl" for pyrrolysine), plus the ambiguous symbols
"B" (or "Asx" for asparagine or aspartic acid), "Z" (or "Glx" for glutamine or glutamic acid), "J" (or "Xle"
for leucine or isoleucine) and "X" (or "Xxx" for an unknown amino acid). For DNA you've got choices of
IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for
ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified
bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.
The advantages of having an alphabet class are twofold. First, this gives an idea of the type of information
the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type
checking.
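The idea of constraining a sequence to an alphabet can be sketched in plain Python as follows. This is not the Bio.Alphabet API; the names DNA_LETTERS, PROTEIN_LETTERS and check_alphabet are hypothetical, and the letter sets are assumed to be the four unambiguous DNA bases and the twenty standard amino acids.

```python
# Letter sets assumed for this sketch.
DNA_LETTERS = frozenset("GATC")
PROTEIN_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def check_alphabet(sequence, letters):
    """Return the sequence unchanged, or raise ValueError on unexpected letters."""
    unexpected = set(sequence) - letters
    if unexpected:
        raise ValueError("unexpected letters: " + "".join(sorted(unexpected)))
    return sequence


text = "AGTACACTGGT"
# The same string passes both checks, so the letters alone cannot tell us
# whether it is DNA or protein; a declared alphabet records that intent.
check_alphabet(text, DNA_LETTERS)
check_alphabet(text, PROTEIN_LETTERS)

try:
    check_alphabet("MKVLA", DNA_LETTERS)  # made-up peptide, not valid DNA
except ValueError as error:
    print(error)
```

The check catches letters that cannot belong to the declared type, while the declaration itself settles cases such as AGTACACTGGT that are valid under more than one alphabet.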
Now that we know what we are dealing with, let's look at how to utilize this class to do interesting work.
You can create an ambiguous sequence with the default generic alphabet like this:
>>> from Bio.Seq import Seq