MBI-25.1: Programming in Python and Biopython
Dr. S. Zeeshan Hussain
Dept. of Computer Science
Jamia Millia Islamia
New Delhi
Unit-Wise Syllabus
MBI-25.1: Programming in Python & Biopython
1. An Introduction to Python Programming: Working with Python, An interpreter for python, Relational
operators, Logical operators, Bitwise operators, Variables and assignment Statements, Keywords, Script
mode.
2. Basic Concepts: Control structures, if-else conditional statement, Looping statements, Nested loops,
break, continue and pass, Debugging, Scope of variables, Strings, String manipulations, Regular
Expressions, Built-in Functions, I/O Functions, Function Definition and Call, Importing user-defined
modules, Command-line arguments, Mutable and Immutable objects, Recursion.
3. Advanced Concepts: Lists, Accessing lists, Working with lists, Operations, related Functions and
Methods, Tuples, Accessing tuples, Working with tuples, Operations, related Functions and Methods,
Dictionary, Working with dictionary, Accessing values in dictionaries, Working with dictionaries,
Operations, related Functions and Methods. Files and Exceptions: File Handling, Writing structures to a
file, Errors and Exceptions, Handling exceptions using try-except, File processing examples.
4. OOP concepts: OOPs concepts, Classes and objects, Constructor, Destructor, Attributes,
Encapsulation, Data Hiding and Data Abstraction, Inheritance, Polymorphism, Overloading,
overriding, Inbuilt Object-Oriented functions and modules. Managing Databases using SQL.
5. Biopython: Introduction to Biopython, Installation, Inbuilt modules related to sequence
objects, sequence annotation objects, sequence analysis, sequence input/output, sequence
alignment objects and tools, Applications of Biopython. Overview of Scikit module.
References
Taneja & Kumar: Python Programming: A Modular Approach, Pearson
Kenneth A. Lambert: Fundamentals of Python, Course Technology
Chang, Chapman, et al.: Biopython Tutorial and Cookbook (ebook)
Biopython
Biopython is a set of freely available tools for biological computation, written in Python by an
international team of developers. It is a distributed, collaborative effort to develop Python libraries
and applications that address the needs of current and future work in bioinformatics. At the time of
writing, the latest release is Biopython 1.76, released on 20 December 2019.
These lecture notes walk through the basics of the Biopython package: an overview of bioinformatics,
sequence objects, sequence annotation objects, sequence analysis and manipulation, sequence
input/output, sequence alignment objects and tools, applications of Biopython, and an overview of
the Scikit module.
The notes are prepared for students who aspire to a career in bioinformatics programming with
Python as the programming tool. They are intended to make you comfortable getting started with
Biopython concepts and functions.
The notes assume that readers are already familiar with bioinformatics. A sound knowledge of
Python will also be very helpful.
Biopython is the largest and most popular bioinformatics package for Python. It contains a
number of sub-modules for common bioinformatics tasks. Its original developers include Chapman
and Chang, and it is written mainly in Python, with some C code to speed up the computationally
intensive parts. It runs on Windows, Linux, Mac OS X, etc.
In essence, Biopython is a collection of Python modules that provide functions for DNA, RNA
and protein sequence operations, such as reverse complementing a DNA string or finding motifs in
protein sequences. It provides parsers for the major sequence file and database formats, such as GenBank,
SwissProt and FASTA, as well as wrappers/interfaces to run other popular bioinformatics
software and services, such as NCBI BLASTN and Entrez, from inside the Python environment. It has sibling projects
such as BioPerl, BioJava and BioRuby.
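To make the two operations named above concrete, here is a minimal plain-Python sketch of what "reverse complementing a DNA string" and "finding a motif in a protein sequence" compute. It does not use Biopython itself (Biopython's Seq class offers its own reverse_complement() method); the helper names, the input strings and the motif pattern are illustrative assumptions.

```python
import re

# Complement table for the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_motif(protein, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead (?=...) lets overlapping occurrences be reported too.
    return [m.start() for m in re.finditer("(?=%s)" % pattern, protein)]


rc = reverse_complement("AGCTTG")
# Hypothetical motif: N, then any residue except P, then S or T.
hits = find_motif("MKTNGSQLNVSA", "N[^P][ST]")
print(rc)    # CAAGCT
print(hits)  # [3, 8]
```

The sketch handles only upper-case, unambiguous bases; a library implementation also has to deal with ambiguity codes and RNA.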
Features
Biopython is portable and clear, and has an easy-to-learn syntax.
Some of the salient features are listed below −
Interpreted, interactive and object-oriented.
Supports FASTA, PDB, GenBank, Blast, SCOP,
PubMed/Medline, ExPASy-related formats.
Option to deal with sequence formats.
Tools to manage protein structures.
BioSQL − Standard set of SQL tables for storing sequences
plus features and annotations.
Access to online services and databases, including NCBI
services (Blast, Entrez, PubMed) and ExPASy services
(SwissProt, Prosite).
Access to local services, including Blast, Clustalw,
EMBOSS.
Goals
The goal of Biopython is to provide simple, standard
and extensive access to bioinformatics through python
language. The specific goals of the Biopython are listed
below −
Providing standardized access to bioinformatics
resources.
High-quality, reusable modules and scripts.
Fast array manipulation that can be used in Cluster
code, PDB, NaiveBayes and Markov Model.
Genomic data analysis.
Advantages
Biopython requires very little code and offers the
following advantages −
Provides microarray data type used in clustering.
Reads and writes Tree-View type files.
Supports structure data used for PDB parsing,
representation and analysis.
Supports journal data used in Medline applications.
Supports the BioSQL database, a standard database schema
widely used across bioinformatics projects.
Supports parser development by providing modules to
parse a bioinformatics file into a format-specific record
object or a generic class of sequence plus features.
Clear, cookbook-style documentation.
Applications
The main Biopython releases have lots of functionality, including:
The ability to parse bioinformatics files into Python utilizable data structures, including
support for the following formats:
Blast output – both from standalone and WWW Blast
Clustalw
FASTA
GenBank
PubMed and Medline
ExPASy files, like Enzyme and Prosite
SCOP, including ‘dom’ and ‘lin’ files
UniGene
SwissProt
Files in the supported formats can be iterated over record by record or indexed and
accessed via a Dictionary interface.
Code to deal with popular on-line bioinformatics destinations such as:
NCBI – Blast, Entrez and PubMed services
ExPASy – Swiss-Prot and Prosite entries, as well as Prosite searches
EMBOSS command line tools
Interfaces to common bioinformatics programs such as:
Standalone Blast from NCBI
Clustalw alignment program
A standard sequence class that deals with sequences, ids on sequences, and sequence
features.
Tools for performing common operations on sequences, such as translation,
transcription and weight calculations.
Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support
Vector Machines.
Code for dealing with alignments, including a standard way to create and deal with
substitution matrices.
Code making it easy to split up parallelizable tasks into separate processes.
GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
Extensive documentation and help with using the modules, including this file, on-line
wiki documentation, the web site, and the mailing list.
Integration with BioSQL, a sequence database schema also supported by the BioPerl and
BioJava projects.
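The list above mentions that supported files can be read record by record or accessed through a dictionary interface. The following sketch shows both styles with Bio.SeqIO, assuming Biopython is installed. The two FASTA records are made up for illustration and are read from an in-memory string rather than a real file.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA file held in memory.
fasta_text = """>seqA hypothetical first record
ACGTACGT
>seqB hypothetical second record
TTGACA
"""

# Style 1: iterate over the file record by record.
lengths = {}
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    lengths[record.id] = len(record.seq)

# Style 2: load every record into a dictionary keyed by record id.
by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))
seq_b = str(by_id["seqB"].seq)

print(lengths)
print(seq_b)
```

Iteration keeps only one record in memory at a time, while the dictionary form holds all records and so suits small files.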
Biopython - Creating Simple Application
Let us create a simple Biopython application to parse a bioinformatics file and print the content.
This will help us understand the general concept of Biopython and how it helps in the field of
bioinformatics.
Step 1 − First, create a sample sequence file with the .fasta extension (the script below assumes it is named example.fasta) and put the below content into it.
>sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin)
MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV
NNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITID
SNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKSTTTTTTTKGT
>sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin)
MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS
NTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDK
NAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIKGGGTKKGTTDKK
The extension, fasta, refers to the file format of the sequence file. The FASTA format takes its
name from the bioinformatics software FASTA. A FASTA file holds multiple sequences arranged one
after another, and each sequence has its own id, name, description and the actual sequence data.
Step 2 − Create a new Python script, "simple_example.py", enter the below code and save it.
from Bio.SeqIO import parse
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# Open the FASTA file created in Step 1
file = open("example.fasta")

# parse() yields one SeqRecord per sequence in the file
records = parse(file, "fasta")
for record in records:
    print("Id: %s" % record.id)
    print("Name: %s" % record.name)
    print("Description: %s" % record.description)
    print("Annotations: %s" % record.annotations)
    print("Sequence Data: %s" % record.seq)
    # The alphabet attribute exists only in Biopython releases before 1.78
    print("Sequence Alphabet: %s" % record.seq.alphabet)

file.close()
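To see what a FASTA parser has to do with such a file, here is a minimal plain-Python sketch that splits FASTA text into (id, description, sequence) tuples. It is an illustration of the file layout only, not Biopython's parser; the function name read_fasta and the two short records are assumptions made for this example.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header.split(None, 1)[0], header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header.split(None, 1)[0], header, "".join(chunks)


# Hypothetical sample text in FASTA layout.
sample = """>id1 first hypothetical protein
MKLKKTIG
AMALATLF
>id2 second hypothetical protein
MLKIKYLL
"""

records = list(read_fasta(sample.splitlines()))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```

As in Biopython's SeqRecord, the identifier here is the first word of the header line and the description is the whole header line.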
Biopython - Sequence
A sequence is a series of letters used to represent an
organism’s protein, DNA or RNA. It is represented by the Seq
class, which is defined in the Bio.Seq module.
Let’s create a simple sequence in Biopython as shown
below −
>>> from Bio.Seq import Seq
>>> seq = Seq("AGCT")
>>> seq
Seq('AGCT')
>>> print(seq)
AGCT
Here, we have created a simple sequence AGCT. Read as one-letter amino acid codes, the
letters represent Alanine, Glycine, Cysteine and Threonine; read as DNA, the same letters
are the bases adenine, guanine, cytosine and thymine. Without an alphabet, the Seq object
does not record which is meant.
Each Seq object has two important attributes −
data − the actual sequence string (AGCT)
alphabet − used to represent the type of sequence, e.g.
DNA sequence, RNA sequence, etc. By default, it does not
represent any particular sequence type and is generic in nature.
The alphabet attribute is available only in Biopython releases before 1.78.
Basic Operations
This section briefly explains the basic operations available in
the Seq class. Sequences are similar to Python strings. We can perform
Python string operations like slicing, counting, concatenation, find,
split and strip on sequences.
Use the code below to get the various outputs.
To get the first value in the sequence:
>>> seq_string = Seq("AGCTAGCT")
>>> seq_string[0]
'A'
To print the first two values:
>>> seq_string[0:2]
Seq('AG')
To print all the values:
>>> seq_string[:]
Seq('AGCTAGCT')
To perform length and count operations:
>>> len(seq_string)
8
>>> seq_string.count('A')
2
To add two sequences (Bio.Alphabet is available only in Biopython releases before 1.78):
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> seq1 = Seq("AGCT", generic_dna)
>>> seq2 = Seq("TCGA", generic_dna)
>>> seq1 + seq2
Seq('AGCTTCGA', DNAAlphabet())
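The same string-like operations can also be written without any alphabet argument, which should run on both older and newer Biopython releases. The sketch below assumes Biopython is installed; the variable names are chosen for this example.

```python
from Bio.Seq import Seq

seq_string = Seq("AGCTAGCT")

first = seq_string[0]              # indexing returns a plain one-letter string
prefix = seq_string[0:2]           # slicing returns a new Seq object
length = len(seq_string)
ag_count = seq_string.count("AG")  # counts non-overlapping occurrences
position = seq_string.find("CT")   # index of the first match, or -1 if absent
joined = Seq("AGCT") + Seq("TCGA") # concatenation returns a new Seq object

print(first, str(prefix), length, ag_count, position, str(joined))
```

Wrapping a Seq in str() gives the underlying letters, which is convenient for comparisons and for printing.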
Biopython - Sequence I/O Operations
Biopython provides a module, Bio.SeqIO, to read and write sequences from
and to a file (or any stream). It supports nearly all file formats
available in bioinformatics. Most software provides a different approach
for each file format. Biopython, by contrast, deliberately follows a single approach:
it presents the parsed sequence data to the user through its SeqRecord object.
Let us learn more about SeqRecord in the following section.
SeqRecord
The Bio.SeqRecord module provides SeqRecord to hold meta information about the
sequence as well as the sequence data itself, as given below −
seq − It is the actual sequence.
id − It is the primary identifier of the given sequence. The default type is string.
name − It is the name of the sequence. The default type is string.
description − It holds human-readable information about the sequence.
annotations − It is a dictionary of additional information about the sequence.
SeqRecord can be imported as specified below −
from Bio.SeqRecord import SeqRecord
Let us understand the nuances of
parsing a sequence file using a real sequence file in the coming sections.
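As a small illustration of the SeqRecord fields listed above, the sketch below builds a record by hand, writes it in FASTA format with Bio.SeqIO, and reads it back. It assumes Biopython is installed; the id, name, description and annotation values are hypothetical, and an in-memory StringIO stands in for a file.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Build a record by hand; all field values are made up for illustration.
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGC"),
    id="demo1",
    name="demo",
    description="hypothetical demo record",
)
record.annotations["source"] = "made-up example"

# Write the record to an in-memory handle in FASTA format.
handle = StringIO()
written = SeqIO.write(record, handle, "fasta")
fasta_text = handle.getvalue()

# Read the single record back from the FASTA text.
round_trip = SeqIO.read(StringIO(fasta_text), "fasta")

print(written)
print(fasta_text)
print(round_trip.id, round_trip.annotations)
```

The annotation added before writing does not survive the round trip, because the FASTA format stores only a header line and the sequence.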
Parsing Sequence File Formats
This slide explains how to parse two of the most popular sequence file
formats, FASTA and GenBank.
FASTA
FASTA is the most basic file format for storing sequence data. FASTA was originally a
software package for sequence alignment of DNA and protein, developed during the early
evolution of bioinformatics and used mostly to search for sequence similarity.
Biopython provides an example FASTA file, which can be accessed
at [Link]
Download and save this file into your Biopython sample directory (the examples below assume it is saved as orchid.fasta).
The Bio.SeqIO module provides the parse() method to process sequence files. It can be
imported as follows −
from Bio.SeqIO import parse
The parse() method takes two arguments: the first is the file
handle and the second is the file format.
>>> file = open('path/to/biopython/sample/orchid.fasta')
>>> for record in parse(file, "fasta"):
...     print(record.id)
...
gi|2765658|emb|Z78533.1|CIZ78533
gi|2765657|emb|Z78532.1|CCZ78532
..........
..........
gi|2765565|emb|Z78440.1|PPZ78440
gi|2765564|emb|Z78439.1|PBZ78439
>>>
Here, the parse() method returns an iterable object that yields a SeqRecord on every
iteration. Because it is iterable, it works with Python's standard iteration tools; let us see
some of these features.
next() Function
The next() function returns the next item available in an iterable object,
which can be used to get the first sequence as given below (assuming the example
file was saved as orchid.fasta) −
>>> from Bio import SeqIO
>>> first_seq_record = next(SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta'))
>>> first_seq_record.id
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> first_seq_record.name
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> first_seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
>>> first_seq_record.description
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> first_seq_record.annotations
{}
>>>
Here, first_seq_record.annotations is empty because the FASTA format does
not support sequence annotations.
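The behaviour of next() on a parsed file can be tried without downloading anything. The sketch below, which assumes Biopython is installed, uses two made-up FASTA records held in memory and also shows the optional default argument of next(), which avoids a StopIteration error once the records run out.

```python
from io import StringIO

from Bio import SeqIO

# Two hypothetical records; StringIO stands in for an open file.
fasta_text = ">rec1 first\nACGT\n>rec2 second\nGGCC\n"
records = SeqIO.parse(StringIO(fasta_text), "fasta")

first = next(records)          # first SeqRecord in the file
second = next(records)         # the iterator remembers its position
missing = next(records, None)  # default value once the records are exhausted

print(first.id, second.id, missing)
```

When a file is known to hold exactly one record, SeqIO.read() is an alternative that returns that record directly.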
To transcribe a DNA Sequence into mRNA and
back-transcribe an mRNA sequence into DNA
Biopython has a method to transcribe a DNA sequence
into an mRNA sequence (and one to go back):
Seq.transcribe(), which transcribes a DNA sequence
into mRNA (converting Ts to Us).
Seq.back_transcribe(), which back-transcribes an mRNA
sequence into DNA (converting Us to Ts).
Let’s transcribe the following coding sequence into mRNA:
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
from Bio.Seq import Seq
# Bio.Alphabet is available only in Biopython releases before 1.78
from Bio.Alphabet import IUPAC

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
                 IUPAC.unambiguous_dna)
print(coding_dna)
mrna = coding_dna.transcribe()
print(mrna)
print("")
print(mrna.alphabet)
print("")
print("... and back")
print(mrna.back_transcribe())
Output:
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG

IUPACUnambiguousRNA()

... and back
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
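For readers on a Biopython release without Bio.Alphabet, the same transcription round trip can be sketched without an alphabet argument. This is an assumed variant of the example above, not a replacement for it; the translate() call at the end is an extra step added here to show the related translation operation.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

mrna = coding_dna.transcribe()      # T -> U
restored = mrna.back_transcribe()   # U -> T
protein = coding_dna.translate()    # standard genetic code, '*' marks stop codons

print(mrna)
print(restored)
print(protein)
```

Transcription here is a letter substitution on the coding strand, so back-transcribing returns exactly the original DNA.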
Biopython - Sequence Alignments
Sequence alignment is the process of arranging two or more sequences (of DNA, RNA or
protein) in a specific order to identify the regions of similarity between them.
Identifying the similar regions enables us to infer a lot of information, such as which traits are
conserved between species, how genetically close different species are, and how species evolve.
Biopython provides extensive support for sequence alignment.
Let us learn some of the important features provided by Biopython in this chapter −
Parsing Sequence Alignment
Biopython provides a module, Bio.AlignIO, to read and write sequence alignments. In
bioinformatics, there are many formats available for sequence alignment data, just as there are
for the sequence data covered earlier. Bio.AlignIO provides an API similar to Bio.SeqIO, except that
Bio.SeqIO works on sequence data and Bio.AlignIO works on sequence alignment data.
Before starting to learn, let us download a sample sequence alignment file from the Internet.
To download the sample file, follow the below steps −
Step 1 − Open your favorite browser and go to the [Link] website. It
will show all the Pfam families in alphabetical order.
Step 2 − Choose any one family with a small number of seed sequences. It contains minimal data and
lets us work easily with the alignment. Here, we have selected PF18225; clicking it
opens [Link] and shows complete details about the family,
including sequence alignments.
Step 3 − Go to the alignment section and download the sequence alignment file in Stockholm
format (PF18225_seed.txt).
Import the Bio.AlignIO module
>>> from Bio import AlignIO
Read the alignment using the read method. The read method is
used to read a single alignment from the
given file. If the given file contains many alignments, we
can use the parse method, which returns an iterable
alignment object similar to the parse method in the Bio.SeqIO
module.
>>> alignment = AlignIO.read(open("PF18225_seed.txt"), "stockholm")
Print the alignment object
>>> print(alignment)
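To experiment with AlignIO.read() before downloading the Pfam file, a tiny alignment can be supplied from memory. The sketch below assumes Biopython is installed; the three-row Stockholm text is made up for illustration and is far smaller than a real seed alignment.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical minimal Stockholm alignment: header, one row per sequence, "//" terminator.
stockholm_text = """# STOCKHOLM 1.0
seqA ACGT-ACGT
seqB ACGTTACGT
seqC ACG--ACGT
//
"""

alignment = AlignIO.read(StringIO(stockholm_text), "stockholm")

n_rows = len(alignment)                      # number of aligned sequences
n_cols = alignment.get_alignment_length()    # number of alignment columns
row_ids = [record.id for record in alignment]
first_column = alignment[:, 0]               # one column, returned as a string

print(n_rows, n_cols, row_ids, first_column)
```

Each row of the alignment is a SeqRecord, so the attributes described for sequence records (id, seq, description) are available on alignment rows as well.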
Numpy Module
NumPy, which stands for Numerical Python, is a
library consisting of multidimensional array objects
and a collection of routines for processing those arrays.
Using NumPy, mathematical and logical operations on
arrays can be performed. This tutorial explains the
basics of NumPy such as its architecture and
environment. It also discusses the various array
functions, types of indexing, etc.
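As a brief illustration of the "mathematical and logical operations on arrays" mentioned above, the following sketch applies arithmetic, a comparison and boolean indexing to a small array. The numbers are hypothetical per-base quality scores invented for this example.

```python
import numpy as np

# Hypothetical per-base quality scores.
quality = np.array([30, 12, 38, 25, 8, 40])

scaled = quality / 10     # arithmetic is applied element by element
mask = quality >= 20      # comparison yields a boolean array
high = quality[mask]      # boolean indexing keeps entries where mask is True
mean_high = high.mean()

print(scaled)
print(mask)
print(high, mean_high)
```

No explicit loop is needed: each operation is applied to the whole array at once.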
Numpy Example
import numpy as np

arr = np.array((1, 2, 3, 4, 5))
print(arr)

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
print(arr)

# Printing type of arr object
print("Array is of type: ", type(arr))
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
Scipy Module
NumPy stands for Numerical Python while SciPy
stands for Scientific Python. Both NumPy and SciPy
are Python libraries, and they are used for various
operations on data. SciPy is an open-source Python
library used to solve scientific and
mathematical problems. It is built on the
NumPy extension and allows the user to manipulate
and visualize data with a wide range of high-level
commands. Because SciPy builds on NumPy, SciPy cannot be
used without NumPy being installed. To call NumPy functions
directly in your own code, however, you should still import
NumPy explicitly.
Scipy Example
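from scipy import special

a = special.exp10(3)   # 10 raised to the power 3
print(a)
b = special.exp2(3)    # 2 raised to the power 3
print(b)
# The next two function names are assumed: sindg and cosdg take angles in degrees
c = special.sindg(90)  # sine of 90 degrees
print(c)
d = special.cosdg(45)  # cosine of 45 degrees
print(d)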
NumPy vs SciPy
Both NumPy and SciPy are Python libraries used for
mathematical and numerical analysis. NumPy
provides the array data type and basic operations such as
sorting and indexing, whereas SciPy contains the bulk of the
higher-level numerical routines. Though NumPy provides a number
of functions for linear algebra,
Fourier transforms, etc., SciPy is the library that
contains fuller-featured versions of these
functions along with many others. If you are
doing scientific analysis using Python, you will need to
install both NumPy and SciPy, since SciPy builds on
NumPy.
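The overlap described above can be seen in linear algebra, where both libraries can solve the same system. The sketch below assumes NumPy and SciPy are installed; the 2x2 system is made up for illustration. It also uses an LU factorization through scipy.linalg as one example of a routine reached through SciPy here.

```python
import numpy as np
from scipy import linalg

# Hypothetical system: 3x + y = 9 and x + 2y = 8.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_numpy = np.linalg.solve(A, b)   # NumPy's basic solver
x_scipy = linalg.solve(A, b)      # SciPy's solver for the same problem

# SciPy also exposes the LU factorization so it can be reused for other right-hand sides.
lu, piv = linalg.lu_factor(A)
x_lu = linalg.lu_solve((lu, piv), b)

print(x_numpy, x_scipy, x_lu)
```

All three calls should agree on x = 2, y = 3 up to floating-point rounding.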
Scikit-learn
Scikit-learn (sklearn) is one of the most widely used
libraries for machine learning in Python. It provides a
selection of efficient tools for machine learning and
statistical modeling, including classification,
regression, clustering and dimensionality reduction,
via a consistent interface in Python. The library,
which is largely written in Python, is built upon
NumPy, SciPy and Matplotlib.
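The "consistent interface" mentioned above usually means the same fit() and predict() calls across different models. Here is a minimal classification sketch, assuming scikit-learn is installed. The features (GC fraction and length in kb) and the labels are invented for illustration and do not come from real data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features per sequence: [GC fraction, length in kb].
X = [[0.30, 1.0], [0.35, 1.2], [0.32, 0.9],
     [0.60, 2.0], [0.65, 2.2], [0.62, 1.9]]
y = ["low_gc", "low_gc", "low_gc", "high_gc", "high_gc", "high_gc"]

# k-nearest-neighbours classifier: a new point gets the majority label of its 3 nearest neighbours.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

predictions = model.predict([[0.33, 1.1], [0.63, 2.1]])
print(list(predictions))
```

Swapping in another scikit-learn classifier would normally change only the line that constructs the model; the fit() and predict() calls stay the same.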