Search Sequence Database
Submitted to:
Dr. Samrah
Submitted by:
Group no#2 (M)
BS Zoology 2021-2025
Institute Of Zoology
Bahauddin Zakariya University, Multan
Biological Sequence Database
Biological sequence databases are digital libraries that store and organize biological sequences,
such as DNA, RNA, and protein sequences. These databases are crucial resources for researchers
in various fields of biology, enabling them to access, analyze, and compare sequence data.
Key Characteristics:
o Digital Repositories: They exist as computerized systems capable of storing and
managing vast amounts of sequence information.
o Types of Sequences: They primarily hold nucleotide sequences (DNA and RNA) and
amino acid sequences (proteins). Some may also include other polymer sequences.
o Accessibility: Most major biological sequence databases are publicly accessible via the
internet, making them an indispensable tool for the global scientific community.
o Search and Analysis Tools: They typically provide tools and interfaces that allow users to
search for specific sequences, perform sequence alignments, and conduct other
bioinformatic analyses.
Types of Biological Sequence Databases
Primary databases hold original data submitted by researchers, while curated databases add
reviewed or computationally derived annotation to those records. Examples include:
GenBank (National Center for Biotechnology Information - NCBI, USA) for nucleotide
sequences.
ENA (European Nucleotide Archive, maintained by EMBL-EBI, the European Molecular Biology
Laboratory's European Bioinformatics Institute, Europe) for nucleotide sequences.
DDBJ (DNA Data Bank of Japan, Japan) for nucleotide sequences.
Protein Data Bank (PDB) for 3D structural data of proteins and nucleic acids.
UniProtKB/Swiss-Prot (part of UniProt) for high-quality, manually annotated protein
sequences.
TrEMBL (part of UniProt) for computationally annotated protein sequences.
Tools for Sequence Searching
Bioinformatics tools help researchers find similar sequences in large databases. Key tools
include:
o BLAST
o FASTA
o HMMER
1. BLAST
BLAST, which stands for Basic Local Alignment Search Tool, is a
fundamental and widely used algorithm and program in bioinformatics. Its primary purpose is to
compare a query biological sequence (DNA, RNA, or protein) against a large database of
sequences to identify regions of local similarity.
Core Function
BLAST takes a query sequence and searches a database for sequences that have similar
segments. It doesn't try to find a perfect, end-to-end match of the entire query sequence. Instead,
it focuses on identifying local alignments, which are regions of significant similarity within the
sequences.
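To make the idea of a local alignment concrete, the sketch below scores the best-matching region between two sequences with the Smith-Waterman dynamic programming method, the exact technique that BLAST approximates with faster heuristics. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman with a linear gap penalty. Cell scores never drop
    below zero, so an alignment may start and end anywhere in either
    sequence; this is what makes the alignment "local".
    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment "ACG" is found even though the flanking regions differ.
score = local_alignment_score("TTTACGTTT", "GGGACGGGG")
```

Because only the shared segment contributes to the score, two long sequences with one short conserved region can still produce a clear hit.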
How it works
Query Segmentation: The query sequence is broken down into short "words" of a specific
length (e.g., 3 amino acids for proteins, 11 nucleotides for DNA).
Database Searching for Word Matches: The algorithm quickly scans the database for
exact or near-exact matches to these query words. These matches are called "seeds" or
"hits."
Extending the Matches: Once a seed is found, BLAST extends the alignment in both
directions along the query and database sequences. Extension continues until the running
score falls a set amount below the best score reached so far. Gaps (insertions or deletions)
can be introduced during this extension to improve the alignment score.
Scoring the Alignments: Each alignment is assigned a score based on the similarity of the
aligned residues (nucleotides or amino acids) and any gaps introduced. Higher scores
indicate greater similarity.
Statistical Significance: BLAST calculates the statistical significance of each alignment.
This is often expressed as an E-value (Expect value), which represents the number of
alignments with a score equal to or greater than the observed score that are expected to
occur by chance in a database of that size. A low E-value (close to zero) suggests that the
alignment is unlikely to be due to random chance and is therefore more significant.
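The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the NCBI implementation: it uses exact word matches only, extends seeds without gaps, and stops when the score drops too far below its best value. The function names, the word length of 4, the match/mismatch scores, and the drop-off limit are all assumptions chosen for readability. The last function applies the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S); the values of K and lambda depend on the scoring system and must be supplied by the caller.

```python
import math
from collections import defaultdict


def index_words(seq, k):
    """Map every k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def extend_seed(query, subject, q_pos, s_pos, k, match=1, mismatch=-3, x_drop=5):
    """Extend an exact k-letter seed in both directions without gaps.

    Extension in one direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    Returns (score, query_start, query_end, subject_start).
    """
    def walk(qi, si, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += match if query[qi] == subject[si] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = walk(q_pos + k, s_pos + k, +1)
    left_score, left_len = walk(q_pos - 1, s_pos - 1, -1)
    score = k * match + left_score + right_score
    return score, q_pos - left_len, q_pos + k + right_len, s_pos - left_len


def best_hit(query, subject, k=4):
    """Find seeds shared by query and subject and return the best extension."""
    index = index_words(subject, k)
    best = None
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            hit = extend_seed(query, subject, q_pos, s_pos, k)
            if best is None or hit[0] > best[0]:
                best = hit
    return best


def expect_value(score, query_len, db_len, lam, K):
    """Karlin-Altschul estimate of chance alignments scoring >= score."""
    return K * query_len * db_len * math.exp(-lam * score)


hit = best_hit("AAACGTACGTTT", "CCACGTACGTGG")
```

The E-value formula shows why the same score is less significant in a larger database: E grows in proportion to the database length and shrinks exponentially as the score rises.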
Types of BLAST
There isn't just one "BLAST." Several variations are designed for different types of comparisons:
BLASTn: Compares a nucleotide query sequence against a nucleotide database.
BLASTp: Compares a protein query sequence against a protein database.
BLASTx: Compares a nucleotide query sequence translated in all six reading frames
against a protein database. This is useful for finding potential protein-coding regions in a
new nucleotide sequence.
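The six-frame translation that BLASTx applies to a nucleotide query can be illustrated as follows. This is a minimal sketch assuming the standard genetic code and an unambiguous DNA input; the function names are hypothetical, and codons containing other letters are shown as "X".

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate dna in three forward and three reverse-strand frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
# frames["+1"] is "MA*": Met, Ala, then a stop codon.
```

Each of the six protein strings is then compared against the protein database, which is why BLASTx can detect a coding region without knowing its strand or reading frame in advance.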
Why is BLAST important?
Identifying Unknown Sequences: Determining the identity or potential function of a
newly sequenced DNA, RNA, or protein by finding similar sequences with known
functions.
Finding Homologous Sequences: Identifying genes or proteins in different organisms that
share a common evolutionary ancestor. This helps in understanding evolutionary
relationships and conserved functions.
Gene Annotation: In newly sequenced genomes, BLAST can help locate and identify
genes by comparing genomic sequences to databases of known genes.
Protein Function Prediction: If a newly discovered protein sequence is similar to a protein
with a known function, BLAST can provide clues about its potential role.
Drug Target Identification: Comparing pathogen sequences to human sequences can help
identify unique pathogen-specific targets for drug development.
2. FASTA
FASTA is both a sequence search program and a file format. The FASTA program, developed by
Pearson and Lipman, performs heuristic similarity searches against sequence databases in a
manner comparable to BLAST. The FASTA format that it introduced is a simple, text-based format
widely used in bioinformatics to represent nucleotide or amino acid sequences, and it has
become a standard way to store and share biological sequence data.
Structure of a FASTA File
A FASTA file can contain one or more sequences. Each sequence in the file has two main parts:
i. The Header Line (Definition Line):
It always begins with a greater-than symbol (>).
Immediately following the ">" is a sequence identifier (ID). This is a unique name or
code for the sequence.
After the ID, there can be an optional description or annotation of the sequence. This
information is usually separated from the ID by a space.
The header must fit on a single line of text; keeping it short (ideally under 80
characters) aids readability.
ii. The Sequence Lines:
These lines immediately follow the header line.
They contain the actual sequence data, using single-letter codes to represent
nucleotides (A, C, G, T, and sometimes U for RNA, or ambiguous bases like N) or
amino acids (using standard single-letter abbreviations).
The sequence can span multiple lines.
It's a common convention to break the sequence into lines of a certain length (e.g.,
60-80 characters) for readability, but this is not a strict requirement of the format.
There should be no extra formatting or spaces within the sequence lines themselves.
Example of a FASTA File (DNA Sequence):
>gi|12345|ref|NC_000001.10| Human chromosome 1, complete sequence
GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
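Because the format is so simple, a file like the one above can be read with a few lines of code. The parser below is a minimal sketch with hypothetical function names; it assumes well-formed input and returns each record as an (identifier, description, sequence) tuple. Established libraries handle more edge cases and are preferable for real work.

```python
def _make_record(header, chunks):
    """Split a header into ID and description and join the sequence lines."""
    seq_id, _, description = header.partition(" ")
    return seq_id, description, "".join(chunks)


def parse_fasta(text):
    """Parse FASTA-formatted text into (id, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


example = """>seq1 first example sequence
GATCGATC
GATCGATC
>seq2
MKV
"""
records = parse_fasta(example)
```

The sequence lines of each record are joined into one string, so the line length used in the file does not affect the result.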
Key Characteristics and Importance:
Simplicity: The FASTA format is very straightforward and easy to read and parse by both
humans and computer programs.
Universality: It has become a near-universal standard in bioinformatics. Most sequence
analysis tools, databases, and software packages recognize and use FASTA format.
Flexibility: It can represent both nucleotide and amino acid sequences.
Interoperability: Its plain text nature makes it easy to work with using standard text
editors and scripting languages.
Input for Bioinformatics Tools: FASTA files are commonly used as input for a wide
range of bioinformatics analyses, including sequence alignment (like BLAST and
FASTA software), phylogenetic analysis, and genome assembly.
Data Exchange: It's a standard format for exchanging sequence data between researchers
and databases.
3. HMMER
HMMER is a powerful and widely used software suite in bioinformatics for
sequence analysis using profile Hidden Markov Models (profile HMMs). Developed by Sean
Eddy and his lab, HMMER is designed to find homologous protein or nucleotide sequences and
to perform sequence alignments. It's particularly adept at detecting remote homologs – sequences
that are evolutionarily related but may have low sequence similarity, making them difficult to
identify with simpler methods like BLAST.
How HMMER works
The general workflow with HMMER involves these steps:
Building a Profile HMM (hmmbuild): Starting with a well-curated multiple sequence
alignment of a protein family (or a set of related nucleotide sequences), HMMER's
hmmbuild program constructs a profile HMM that statistically describes the family.
Searching Databases (hmmsearch, phmmer, hmmscan): Once a profile HMM is built,
HMMER provides tools to search sequence databases for sequences that are likely to be
members of the family represented by the HMM.
hmmsearch: Takes a profile HMM as a query and searches it against a database of
individual sequences (typically protein; nhmmer is the counterpart for nucleotide searches).
phmmer: Takes a single protein sequence as a query and searches it against a database
of protein sequences, much like BLASTp. No pre-built profile is needed.
hmmscan: Takes one or more query sequences and searches them against a database
of profile HMMs (like Pfam). This is useful for identifying which families a given
sequence might belong to.
Aligning Sequences to a Profile (hmmalign): HMMER can align a set of sequences to an
existing profile HMM using the hmmalign program. This produces a multiple sequence
alignment that is consistent with the profile.
Iterative Searching (jackhmmer): For even greater sensitivity in finding remote homologs,
HMMER offers an iterative search tool. jackhmmer performs iterative searches of a protein
sequence database using a query sequence, building a profile HMM from the hits in each
round to improve subsequent searches. PSI-BLAST, which belongs to the BLAST suite rather
than HMMER, takes a similar iterative approach but uses position-specific scoring matrices
instead of HMMs.
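The core idea behind a profile, scoring each position of a family separately, can be shown with a much simpler model than the one HMMER builds. The sketch below is not HMMER's algorithm: it derives per-column log-odds scores from an ungapped alignment and slides them along a target sequence. A real profile HMM additionally models insertions, deletions, and the transitions between states. All function names, the uniform background, and the pseudocount of 1 are assumptions made for illustration.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped alignment.

    Each column maps every letter to log2(observed frequency / background),
    with a uniform background and a pseudocount so unseen letters get a
    finite penalty. Letters outside the alphabet are not supported.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(alignment) + pseudocount * len(alphabet)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            letter: math.log2(((counts[letter] + pseudocount) / total) / background)
            for letter in alphabet
        })
    return profile


def score_window(profile, window):
    """Sum the column scores for a window as long as the profile."""
    return sum(column[letter] for column, letter in zip(profile, window))


def best_match(profile, sequence):
    """Return (score, start) of the best-scoring window in sequence."""
    width = len(profile)
    scored = [
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    if not scored:
        raise ValueError("sequence is shorter than the profile")
    return max(scored, key=lambda item: item[0])


family = ["ACGT", "ACGT", "ACGA", "ACGT"]
profile = build_profile(family)
score, start = best_match(profile, "TTTTACGTTTTT")
```

Columns that are strongly conserved in the family contribute large positive scores for the conserved letter, while variable columns are more tolerant. This position-specific weighting is what gives profile methods their sensitivity to distant relatives.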
Key Features and Significance of HMMER:
High Sensitivity: Profile HMMs are excellent at detecting distantly related sequences that
might be missed by other methods.
Probabilistic Framework: The underlying probabilistic models provide a more robust way
to handle sequence variation.
Position-Specific Alignments: Alignments generated with HMMER tend to be more
biologically meaningful because they are scored against the position-specific conservation
patterns captured in the profile HMM.
Widely Used: HMMER is used to build and search major protein family databases
such as Pfam, which is integrated into InterPro.
Versatile: It can be used for both protein and nucleotide sequence analysis (with
specialized tools like nhmmer for DNA homology search).
Efficient: Modern versions of HMMER (HMMER3) are significantly faster than earlier
versions, making large-scale database searches feasible.
Conclusion
Searching sequence databases is a foundational skill in bioinformatics and life sciences. Tools
like BLAST enable scientists to find similar sequences, annotate genes, and understand genetic
relationships across species. These techniques are crucial in genomics, evolutionary biology,
drug discovery, and diagnostics. With growing data, efficient search methods will remain at the
core of biological research and innovation.
References
https://blast.ncbi.nlm.nih.gov/Blast.cgi
https://www.ncbi.nlm.nih.gov/genbank/
https://www.uniprot.org/