Introduction to bioinformatics (databases)
Course Code:- BioTech 202
Course Title:- Introductory Bioinformatics
Credit- 3(2+1)
By:
Course Instructor
Ashutosh Ranjan (Guest Faculty)
Department of Basic Sciences and Language, CBS & H
What is bioinformatics?
In biology, bioinformatics is defined as “the use of computers
to store, retrieve, analyze or predict the composition or
structure of biomolecules”. Bioinformatics is the application
of computational techniques and information technology to
the organization and management of biological data. Classical
bioinformatics deals primarily with sequence analysis.
Aims of bioinformatics
❑Development of databases containing all biological
information.
❑Development of better tools for data design, annotation
and mining.
❑Design and development of drugs by using simulation
software.
❑Design and development of software tools for protein
structure prediction and functional annotation.
❑Creation and development of software to improve tools for
analyzing sequences for their function and similarity with other
sequences.
Applications of bioinformatics
❑Crop development
❑Drought resistance
❑Medicine
❑Biotechnology
❑Forensic analysis
❑Drug designing
❑Weather analysis
❑Gene therapy
❑Veterinary science
Biological databases
Biological data are complex, exception-ridden, vast and
incomplete. Therefore, several databases have been created
and curated to ensure unambiguous results. A collection
of biological data arranged in computer-readable form, which
speeds up search and retrieval and is convenient to
use, is called a biological database. A good database must hold
up-to-date information.
Importance of biological databases
A range of information such as biological sequences, structures,
binding sites, metabolic interactions, molecular action, functional
relationships, protein families, motifs and homologs can be
retrieved by using biological databases. The main purpose of a
biological database is to store and manage biological data and
information in computer-readable form.
Types of biological database
❑Main classes: primary database, secondary database and derived database.
❑Categories by content: protein sequence, nucleotide sequence, protein
structure, domain and motif, gene structure, gene expression, metabolic
pathway and specialized databases.
❑Protein sequence databases: Swiss-Prot, PIR, GenPept.
❑Nucleotide sequence databases: GenBank, DDBJ, EMBL.
❑Structure databases: PDB, EBI-MSD, MMDB, SCOP, CATH.
Primary database vs. secondary database
❑A primary database contains only sequence or structural
information.
❑A database derived from the analysis or treatment of
primary data is a secondary database. It is very important
for inferring protein function.
Examples of some primary
biological databases
GenBank
❑One of the fastest growing repositories of known nucleotide
sequences, GenBank (Genetic Sequence Databank), has a flat
file structure. It is an ASCII text file, readable by both humans and
computers. Besides sequence data, GenBank files contain
information such as accession numbers, gene names and
phylogenetic classification.
❑This database has been developed and maintained at the NCBI,
Bethesda, MD, USA, as a part of the International Nucleotide Sequence
Database Collaboration (INSDC).
❑It is an open access sequence database.
❑It coordinates with individual laboratories and other sequence
databases like EMBL and DDBJ.
❑It is an annotated collection of all nucleotide sequences that are
available to the public.
❑The nucleotide database was divided into three databases at
NCBI: Core Nucleotide database, Expressed Sequence Tag (EST)
and Genome Survey Sequence (GSS).
❑The Core Nucleotide database holds most of the nucleotide
sequences used. It includes all nucleotide records that are
not in the EST and GSS databases.
❑Submission of sequences to GenBank can be done using the
BankIt, Sequin and tbl2asn tools.
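To illustrate why a flat file is readable by both humans and computers, here is a minimal Python sketch that reads a shortened, made-up record written in a GenBank-like layout. The record text, the accession X00000 and the function name parse_flat_file are assumptions for illustration only; real GenBank records carry many more fields.

```python
# Hypothetical, shortened GenBank-style record (not a real database entry).
RECORD = """\
LOCUS       X00000                  24 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example gene, partial sequence.
ACCESSION   X00000
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_flat_file(text):
    """Collect a few keyword fields and the sequence from one record."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: a position number, then blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):  # sequence section begins
            in_sequence = True
        elif line[:1].isalpha():         # keyword starts in the first column
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["SEQUENCE"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_file(RECORD)
print(record["ACCESSION"], len(record["SEQUENCE"]))
```

For real work an established parser is usually a better choice than hand-written code like this.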
EMBL (European Molecular Biology Laboratory)
❑ A comprehensive database of DNA and RNA sequences, the EMBL nucleotide
sequence database is collected from scientific literature and patent offices, and is
directly submitted by researchers. EMBL has been prepared in collaboration with
GenBank (USA) and the DNA Database of Japan (DDBJ).
❑ It was established in 1980.
❑ It is maintained by EBI (European Bioinformatics Institute).
Swiss-Prot
❑ This is a curated protein sequence database that offers a high level of integration
with other databases and also has a very low level of redundancy. Swiss-Prot strives
to provide protein sequences with a high level of annotation (for instance, the
description of protein function, domain structure and post-translational
modifications, etc.).
❑ It was established in 1986 and has been maintained collaboratively, since 1987, by the
Department of Medical Biochemistry of the University of Geneva and the EMBL Data
Library.
❑ TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all
translations of EMBL nucleotide sequence entries that are not yet integrated in
Swiss-Prot.
❑ Currently Swiss-Prot has 0.5 million and TrEMBL has 7.6 million sequences.
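TrEMBL entries are translations of nucleotide sequences. As a rough sketch of what "translation" means computationally, the Python below converts a coding DNA sequence into a protein sequence using the standard genetic code. The input sequence and the function name translate are made up for illustration; this is not the pipeline used to build TrEMBL.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon until the first stop."""
    dna = dna.upper().replace("U", "T")   # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":             # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical coding sequence: ATG GCC ATT GTA ATG GGC CGC TGA
print(translate("ATGGCCATTGTAATGGGCCGCTGA"))  # prints MAIVMGR
```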
Protein Information Resource (PIR)
❑ PIR is an integrated public bioinformatics resource that supports genomic and
proteomic research and scientific studies. PIR now offers a wide variety of
resources, such as PIRSF, ProClass and ProLINK, mainly oriented to assisting the
propagation and consistency of protein annotations.
Examples of Some Secondary
Biological Databases
Motif Databases
❑ A protein sequence motif is a set of conserved amino acid residues that are
important for protein function and are located within a certain distance from one
another. These motifs usually provide clues to the functions of otherwise
uncharacterized proteins.
❑ The PROSITE database consists of documentation entries describing protein
domains, families and functional sites as well as associated patterns and profiles
to identify them.
❑ PRINTS is a database of protein fingerprints. A fingerprint is a group of
conserved motifs used to characterize a protein family.
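PROSITE patterns are written in a compact notation that maps closely onto regular expressions. The sketch below converts a simple pattern into a Python regular expression and scans a sequence with it. The function name prosite_to_regex and the test sequence are assumptions for illustration, and only the basic syntax elements (x, [ ], { }, repeat counts and the < > anchors) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a regular expression."""
    regex = pattern.rstrip(".")                          # trailing period ends a pattern
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex


# N-glycosylation site pattern: Asn, not Pro, Ser or Thr, not Pro.
pattern = "N-{P}-[ST]-{P}."
regex = prosite_to_regex(pattern)

sequence = "MKNGSAANPTQ"   # hypothetical protein fragment
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

Positions are reported with 0-based indexing. The second asparagine in the test sequence is followed by proline, so it is not reported as a hit.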
Domain Databases
❑ A protein domain is an independently folded, structurally compact unit that
forms a stable three-dimensional structure and shows a certain level of
evolutionary conservation. Typically, a conserved domain contains one or more
motifs.
❑ ProDom is a protein domain database automatically generated from the
Swiss-Prot and TrEMBL sequence databases.
❑ SMART is a highly reliable and sensitive tool for domain identification.
❑ COG is a database and a convenient tool for motif and domain identification.
3D Structure databases
❑ PDB (Protein Data Bank) is the main primary database for 3D structures of
biological macromolecules determined by X-ray crystallography and NMR. It also
accepts experimental data used to determine the structures and homology models.
❑ SCOP (Structural Classification of Proteins database) classifies protein 3D
structures in a hierarchical scheme of structural classes. All the protein structures in
PDB are classified here, and new structures are deposited in PDB.
❑ The CATH database (Class, Architecture, Topology, Homologous superfamily) contains a
hierarchical classification of protein domain structures.
Protein Data Bank
❑ PDB (Protein Data Bank) is a repository for 3D structural data of proteins and
nucleic acids obtained by X-ray crystallography or NMR spectroscopy.
❑ The Research Collaboratory for Structural Bioinformatics (RCSB) PDB provides a
variety of tools and resources for studying the structures of biological
macromolecules and their relationships to sequence, function and
any associated diseases.
Annotation of genes
❑ In molecular biology, the genome is the basic genetic material of an organism and
typically consists of DNA. A genome includes genes (coding regions) and non-coding
regions. Of most interest here are the coding regions, as they actively influence basic life
processes. The genes contain useful biological information that is required for
building up and maintaining an organism. Gene annotation can be defined simply
as the process of making a nucleotide sequence meaningful.
❑ Gene annotation involves taking the raw DNA sequence produced
by genome sequencing projects and adding the layers of analysis and interpretation
necessary to extract biologically significant information and place such derived
details into context. Annotation is the process by which pertinent information about
these raw DNA sequences is added to the databases.
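One of the simplest layers of analysis that can be added to a raw DNA sequence is a search for open reading frames (ORFs), which are candidate coding regions. The sketch below scans the three forward reading frames for stretches that run from ATG to a stop codon. The function name find_orfs, the min_codons threshold and the test sequence are assumptions for illustration; real annotation pipelines also handle the reverse strand, introns and much more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # look for the next ORF
    return orfs


# Hypothetical raw sequence with one short ORF in the third reading frame.
raw_sequence = "CCATGGCCATTGTAATGGGCCGCTGACC"
orfs = find_orfs(raw_sequence)
print(orfs)  # prints [(2, 26, 2)]
```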
Accession number
Accession numbers are unique identifiers which permanently identify sequences in
the database.
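As a small sketch of how a program can work with such identifiers, the Python below checks whether a string looks like a nucleotide accession number and separates the optional version suffix. The pattern used here (one letter plus five digits, or two letters plus six or eight digits) is an assumption covering common nucleotide accession formats only, and the identifiers in the example are hypothetical.

```python
import re

# Assumed format: 1 letter + 5 digits, or 2 letters + 6 or 8 digits,
# optionally followed by ".version" (for example X00000.1).
ACCESSION_RE = re.compile(
    r"^(?:[A-Z]\d{5}|[A-Z]{2}(?:\d{6}|\d{8}))(?:\.(\d+))?$"
)


def split_accession(identifier):
    """Return (accession, version) or None if the format is not recognized."""
    match = ACCESSION_RE.match(identifier)
    if match is None:
        return None
    accession = identifier.split(".")[0]
    version = int(match.group(1)) if match.group(1) else None
    return accession, version


for identifier in ["X00000.1", "AB123456", "12345"]:
    print(identifier, split_accession(identifier))
```

The accession part stays the same for the life of a record, while the version number increases when the sequence itself is revised.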