Genome Browser
A genome browser is a graphical tool, usually web-based, that lets users explore and visualize entire
genomes and their annotations. For example, the UCSC Genome Browser provides “a rapid
and reliable display of any requested portion of genomes at any scale, together with dozens
of aligned annotation tracks (known genes, predicted genes, ESTs, etc.)”. These tracks are
stacked beneath the genomic coordinates, allowing researchers to zoom in from whole
chromosomes to individual genes and correlate different kinds of data. Genome browsers
often link items to external databases (e.g. PubMed or GenBank) for detailed information.
Key features: Multiple annotation tracks (genes, transcripts, SNPs, repeats, epigenetic
marks, etc.) with color-coded displays; search by gene name or coordinates; zoom/pan
across chromosomes; user controls to show/hide tracks; and custom-track upload for new
data. For instance, users can “look at a whole chromosome to get a feel for gene density,
open a specific cytogenetic band to see a positionally mapped disease gene candidate, or
zoom in to a particular gene to view its spliced ESTs and alternative splicing”.
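As a rough illustration of what a browser does when the view is zoomed or panned, the sketch below selects the features of one track that overlap the visible window. The `Feature` type, the gene names, and the coordinates are hypothetical, and the 0-based half-open interval convention is an assumption borrowed from BED-style files, not a description of any particular browser's internals.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed BED-style convention)
    end: int    # 0-based, exclusive
    name: str


def features_in_view(features, chrom, view_start, view_end):
    """Return the features on `chrom` that overlap the window [view_start, view_end)."""
    return [
        f for f in features
        if f.chrom == chrom and f.start < view_end and f.end > view_start
    ]


# Hypothetical gene track, for illustration only.
track = [
    Feature("chr1", 100, 500, "geneA"),
    Feature("chr1", 450, 900, "geneB"),
    Feature("chr1", 2000, 2500, "geneC"),
    Feature("chr2", 100, 500, "geneD"),
]

zoomed_out = [f.name for f in features_in_view(track, "chr1", 0, 3000)]
zoomed_in = [f.name for f in features_in_view(track, "chr1", 460, 480)]
```

Zooming in simply narrows the window, so fewer features pass the overlap test. Real browsers typically use indexed files or interval trees rather than a linear scan.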
Examples of genome browsers:
UCSC Genome Browser: A widely used browser with rich annotations (RefSeq genes,
mRNA, EST alignments, variation, epigenomics, conservation, etc.).
Ensembl Genome Browser: Part of the Ensembl project, it provides annotated genomes
and comparative views for many species. Ensembl is “one of three main systems that
annotate and display genome information, the other two being the UCSC genome browser
and the NCBI genome resources”.
NCBI Genome Data Viewer: NCBI’s browser supports thousands of eukaryotic genome
assemblies. It “supports the exploration and analysis of NCBI-annotated … eukaryotic
genome assemblies”.
Others: JBrowse/IGV for local/interactive viewing; specialized browsers (e.g. Gramene for
plant genomes).
Research uses: Genome browsers are essential for locating genes and regulatory elements
in context. Researchers use them to check gene structure (exons, splice variants), find
promoter or enhancer regions, view conservation across species, overlay experimental
data (e.g. RNA-seq, ChIP-seq tracks), and interpret variants. For example, one can click on
a gene to see its exon–intron structure and protein domains, or load a SNP track to find
candidate mutations. By integrating diverse datasets, browsers help form hypotheses (e.g.
candidate disease genes at a locus).
Genomic Database
A genomic database is a centralized repository that stores DNA sequence data and related
annotation for genomes. Major international databases include:
GenBank (USA): The NIH genetic sequence database. It is “an annotated collection of all
publicly available DNA sequences”, maintained by NCBI. GenBank contains raw
sequences (genomes, chromosomes, plasmids, transcripts) with annotations (genes,
features, references).
EMBL-ENA (Europe): The European Nucleotide Archive (originally EMBL-Bank) at EBI, which
“incorporates, organizes and distributes nucleotide sequences from all available public
sources”. EMBL-ENA exchanges data daily with GenBank and DDBJ.
DDBJ (Japan): The DNA Data Bank of Japan, a public nucleotide sequence database. It
“collects annotated nucleotide sequence data” in collaboration with GenBank and ENA as
part of the International Nucleotide Sequence Database Collaboration.
Ensembl (EMBL-EBI): A genome database and browser offering automatic annotation of
many eukaryotic genomes. Ensembl provides gene models, variation, regulatory elements
and comparative maps.
Others: Many countries/projects have genome DBs (e.g. UCSC’s GoldenPath, GigaDB,
etc.).
These databases are critical for research. They store raw sequences (full genomes,
chromosomes, mitochondrial DNA, plasmids, transcripts) along with annotations (gene
models, coding regions, UTRs, regulatory sites) and metadata (organism, references). All
data are openly accessible: for example, GenBank data can be searched via NCBI’s Entrez
and BLAST tools, retrieved programmatically via NCBI’s e-utilities, or downloaded in bulk
via FTP. The INSDC partners (GenBank, EMBL-ENA, DDBJ) ensure daily data exchange,
providing a comprehensive, up-to-date sequence archive. In short, genomic DBs enable
sequence searching, comparative analysis, and data sharing; they underpin genome
annotation and downstream analyses. Researchers routinely access these databases
through web portals, command-line tools, or APIs to retrieve sequences and information
for their studies.
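Programmatic retrieval through NCBI's E-utilities can be sketched as follows. The code only builds an `efetch` request URL and defines a fetch helper without calling it. The accession string is an example value, and the helper names `efetch_url` and `fetch_record` are made up for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an E-utilities efetch URL for one record (no network access here)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_record(accession, db="nuccore", rettype="fasta"):
    """Download the record as text. Requires network access; not called below."""
    with urlopen(efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")


# Example accession, used only to show the shape of the request.
url = efetch_url("NM_000518.5")
```

For bulk work, NCBI asks clients to limit request rates and recommends an API key; those details are left out of this sketch.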
Protein Database
A protein database stores protein sequences, structures, and annotations. Key examples
include:
UniProt (Universal Protein Resource): A comprehensive protein sequence and annotation
database. The UniProt Knowledgebase (UniProtKB) “is a comprehensive, high-quality and
freely accessible set of protein sequences annotated with functional information”. It
comprises Swiss-Prot (manually curated) and TrEMBL (automatically annotated) sections.
UniProt entries include amino acid sequences, functional descriptions, domain and family
annotations, post-translational modifications, and cross-references (e.g. to PDB).
Protein Data Bank (PDB): The global archive of 3D macromolecular structures. PDB (with
sites RCSB, PDBe, PDBj) is “the single worldwide archive of structural data of biological
macromolecules”. It stores experimentally determined 3D coordinates of proteins (and
nucleic acids), along with annotations (ligands, resolution, authors).
Others: Specialized protein databases include NCBI’s RefSeq Protein (curated sequences)
and PIR. Databases like InterPro or Pfam compile family/domain profiles, and
PROSITE catalogs sequence motifs.
Data stored: Protein DBs hold primary structures (amino-acid sequences), plus rich
metadata. For UniProt, this includes protein names, gene names, function summaries, GO
terms, active sites, and domain architecture. PDB provides atomic coordinates and
experimental details for each structure. UniProt and PDB entries also cross-reference
each other.
Applications: Protein databases are used for annotation and modeling. For example,
BLASTp searches against UniProt identify homologous proteins and infer function.
UniProt’s annotations help predict enzyme activity or pathways. PDB structures enable
homology modeling and structure-based drug design. Protein domains and motifs (see
below) are annotated based on these resources. In summary, researchers use protein DBs
to find sequences of interest, understand protein functions and interactions, and obtain
structural templates for modeling.
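The Swiss-Prot/TrEMBL split is visible in UniProt FASTA headers, where the prefix `sp` marks a Swiss-Prot entry and `tr` a TrEMBL entry. The sketch below pulls the main fields out of one header line. The regular expressions are a simplified assumption about the header layout, and the function name is hypothetical.

```python
import re

HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
SECTION = {"sp": "Swiss-Prot", "tr": "TrEMBL"}


def parse_uniprot_header(line):
    """Split a UniProt FASTA header into section, accession, entry name and key fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProt-style header: {line!r}")
    db, accession, entry_name, rest = match.groups()
    # The free-text description runs up to the first two-letter "XX=" field.
    description = re.split(r"\s[A-Z]{2}=", rest, maxsplit=1)[0]
    # Collect the OS=, OX=, GN=, PE=, SV= fields into a dictionary.
    fields = dict(re.findall(r"([A-Z]{2})=(.*?)(?=\s[A-Z]{2}=|$)", rest))
    return {
        "section": SECTION[db],
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
        "organism": fields.get("OS"),
        "gene": fields.get("GN"),
    }


# Illustrative header line in the UniProt FASTA style.
header = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
entry = parse_uniprot_header(header)
```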
Genome and Gene
Genome: The genome is the entire set of genetic material of an organism. As stated by
RCSB, “a genome is the complete set of genes and/or genetic material of a cell or
organism. It is the blueprint that cells use to make proteins and RNA.” In many organisms,
the genome is arranged on multiple linear chromosomes (e.g. humans have 46), whereas
others (bacteria, many mitochondria and chloroplasts) have a single circular DNA
molecule. Eukaryotes have both a large nuclear genome and smaller organellar genomes
(mitochondrial, and chloroplast in plants).
Gene: A gene is a DNA sequence that encodes a functional product (typically a protein or
an RNA). Genes can be protein-coding (encoding an amino-acid sequence) or non-coding
(encoding RNAs such as rRNA, tRNA, microRNAs, etc.). In eukaryotes, genes are typically
split into exons (segments retained in the mature mRNA) separated by introns (intervening
segments removed by splicing). Typical gene structure includes a promoter
(upstream regulatory DNA where transcription factors bind), a transcription start site,
exons/introns, untranslated regions (5′ and 3′ UTRs), and a termination site. Bacterial genes
usually lack introns and are organized in operons. Identifying genes involves locating open
reading frames (ORFs) and regulatory signals on the genome.
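The exon/intron layout described above can be made concrete with a toy example. The sequence and exon coordinates are invented for illustration, and the 0-based half-open coordinates are an assumption of this sketch.

```python
def splice(gene_seq, exons):
    """Join exon slices (0-based, half-open, in order) into the mature transcript."""
    return "".join(gene_seq[start:end] for start, end in exons)


def introns_between(exons):
    """Return the intervals lying between consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


# Toy gene: exon 1 + intron (starts with GT, ends with AG) + exon 2.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "GATTGA"
exons = [(0, 6), (18, 24)]

mrna = splice(gene, exons)
introns = introns_between(exons)
intron_seq = gene[introns[0][0]:introns[0][1]]
```

The spliced sequence reads ATG GCC GAT TGA in frame, so the coding region only becomes contiguous once the intron is removed.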
Structure of DNA: DNA is a double helix of nucleotide bases (A, T, C, G) paired across two
strands. A stretch of DNA contains many genes plus regulatory regions (promoters,
enhancers) and non-genic sequences. Chromosomes also include centromeres and
telomeres. Genes sit at specific loci, and their arrangement and context (neighboring
genes, repetitive elements) can affect gene regulation.
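Because the two strands pair A with T and C with G and run in opposite directions, the sequence of one strand determines the other. A minimal sketch of deriving the opposite strand (the reverse complement), assuming an unambiguous A/C/G/T alphabet:

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


opposite = reverse_complement("ATGCCG")
```

Applying the function twice returns the original sequence, which is a quick sanity check when working with strand-specific annotations.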
Functional genomics: This field studies how genes and genomes function and interact in
the cell. It involves genome-wide experiments (transcriptomics, proteomics, epigenomics)
to link sequence information with function. As Wikipedia notes, “functional genomics …
attempts to describe gene (and protein) functions and interactions” on a genome-wide
scale. For example, RNA-seq (a functional genomics method) measures expression of all
genes at once. Functional genomics uses techniques like gene knockout libraries,
expression microarrays, and CRISPR screens to reveal gene roles. In short, after identifying
gene sequences, functional genomics seeks to determine what those genes do in the
organism.
Phytozome
Phytozome is a plant genomics portal maintained by the U.S. Department of Energy’s Joint
Genome Institute. It is a “comparative hub for plant genome and gene family data and
analysis.” Phytozome provides access to the sequences and annotations of many
complete plant genomes (spanning the major land-plant lineages and selected algae) and
lets users explore them comparatively. Key features include:
Species covered: Dozens of green plants – e.g. model species like Arabidopsis thaliana,
crop genomes (rice, maize, soybean, wheat, etc.), legumes, trees, and some algae. (The
original publication mentioned ~25 genomes, but newer versions include 50+ plant
genomes.)
Interface & tools: Users can search genes or families and view Gene pages with details on
each gene (sequence, exons/introns, functional annotation). There are Gene Family pages
showing phylogenetic trees of related genes across species. Each gene page links to a
genomic GBrowse view (a genome browser) that shows gene context and flanking genes.
The portal also supports BLAST searches, and a synteny viewer displays conserved gene
order (five neighboring genes on each side) among species.
Function: Phytozome makes it easy to compare genes across plants. For any gene, one can
see homologs in other species, conserved gene structures, and family relationships. By
linking model and crop genomes, Phytozome helps researchers infer function: for example,
an uncharacterized gene in maize might be annotated by its well-studied Arabidopsis
ortholog. The portal thus accelerates plant genomics research and breeding.
Phytozome is widely used in plant biology for gene discovery, evolutionary studies, and
pathway analysis, as it ties together gene sequence, structure, and annotation in a
comparative framework.
BLAST
BLAST (Basic Local Alignment Search Tool) is a suite of sequence alignment programs that
finds regions of similarity between a query sequence and those in a database. It “finds
regions of local similarity between protein or nucleotide sequences” by comparing a query
to database sequences and computing the statistical significance of matches.
Types of BLAST: Common variants include BLASTn (nucleotide query vs. nucleotide
database), BLASTp (protein vs. protein), BLASTx (translated nucleotide query vs. protein
db), tBLASTn (protein query vs. translated nucleotide db), and tBLASTx (translated
nucleotide vs. translated nucleotide). As described in the NCBI guide, the original BLAST
was protein-to-protein, then a nucleotide version was developed, and translation
approaches (BLASTx/ tBLASTx) were added to allow cross-comparisons between DNA and
protein.
Algorithm: BLAST uses a fast heuristic: it breaks the query into short “words” (k-mers), finds
high-scoring word matches in the database, then extends these matches to generate
longer alignments. It reports alignments (High-scoring Segment Pairs) with bit scores and
E-values, where the E-value estimates how many alignments of that quality would be
expected by chance. Lower E-values indicate more significant matches.
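The word-seeding and extension idea can be sketched in a few lines. This is a teaching simplification, not NCBI's implementation: it uses exact word matches only, ungapped extension with a simple drop-off rule, and arbitrary scoring values (`match`, `mismatch`, `x_drop`) chosen for the example. The last function applies the standard relation between a bit score S' and the E-value, E = m · n · 2^(−S'), where m and n are the (effective) query and database lengths.

```python
from collections import defaultdict


def index_words(subject, k):
    """Map every length-k word in the subject to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend_one_way(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk from (q, s) in direction `step`; return (best score gain, residues kept)."""
    gain = best_gain = length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += match if query[q] == subject[s] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen: stop extending
        q += step
        s += step
    return best_gain, best_len


def seed_and_extend(query, subject, k=4, match=1, mismatch=-2, x_drop=3):
    """Return ungapped HSPs as (score, query_start, subject_start, length), best first."""
    index = index_words(subject, k)
    hsps = set()  # seeds that extend to the same segment collapse to one entry
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            right_gain, right_len = extend_one_way(
                query, subject, q + k, s + k, 1, match, mismatch, x_drop)
            left_gain, left_len = extend_one_way(
                query, subject, q - 1, s - 1, -1, match, mismatch, x_drop)
            score = k * match + left_gain + right_gain
            hsps.add((score, q - left_len, s - left_len, k + left_len + right_len))
    return sorted(hsps, reverse=True)


def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Toy sequences: an 8-base region shared between query and subject.
query = "TTACGTACGGAA"
subject = "GGGACGTACGGCCC"
hsps = seed_and_extend(query, subject)
best_score, q_start, s_start, length = hsps[0]
```

Here the best HSP covers the shared region `ACGTACGG`. The E-value function shows why the same bit score is less significant against a larger database: the expected number of chance hits grows with the search space m · n.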
Input/Output: Users input a sequence (FASTA format) and select a target database (e.g. nr,
RefSeq, specialized sets). The output typically has a summary table listing top hits (with
score, E-value, % identity), followed by detailed alignments. BLAST also provides graphical
views of match distributions along the query.
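When BLAST+ is run with tabular output (`-outfmt 6`), each hit is one tab-separated line with twelve default columns. The sketch below parses such lines and keeps hits under an E-value cutoff. The sample rows and sequence identifiers are invented, and the cutoff of 1e-5 is just a commonly used example value.

```python
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_tabular(handle, max_evalue=1e-5):
    """Read tabular BLAST lines and return hits with E-value <= max_evalue."""
    hits = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Invented example rows: one strong hit and one weak hit.
sample = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t501\t700\t2e-100\t360\n"
    "query1\tsubjB\t75.00\t40\t10\t0\t5\t44\t10\t49\t0.5\t22.3\n"
)
hits = parse_tabular(io.StringIO(sample))
```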
Applications: BLAST is used for sequence homology searches, gene identification, and
functional annotation. For example, running BLASTn on a new DNA sequence can reveal
known genes or repeats. BLASTp can suggest protein function by matching to annotated
proteins. In research, BLAST helps find orthologs, design PCR primers (by checking
specificity), and place novel sequences in phylogenetic context. In essence, BLAST is a
fundamental tool for comparing biological sequences across all areas of bioinformatics.
Domain / Motif
Protein domain: A domain is a compact region of a protein’s structure that “is stable
and often folds independently of the rest of the protein chain”. Domains are typically
about 50–200 amino acids long, though larger ones exist, and
often confer a specific function. For instance, many proteins share a common kinase
domain or zinc-finger domain, indicating they may share functional or evolutionary
relationships. Domains can be recombined by evolution (or engineered) to create proteins
with new combinations of functions.
Protein motif: A motif is a short sequence or structural pattern, usually smaller than a
domain, that is conserved among proteins. By one common definition, “Protein motifs are
small regions of protein three-dimensional structure or amino acid sequence shared among
different proteins”. Motifs often correspond to key functional sites (for example, an
active-site loop) but may not fold on their own. An example is the helix-turn-helix
DNA-binding motif found in many transcription factors.
Difference: Domains are larger, autonomously folding units (like Lego blocks of structure),
whereas motifs are short consensus patterns (like a small sequence logo). Domains often
define entire structural/functional modules (e.g. SH3 domain), while motifs identify
conserved residues or substructures (e.g. a phosphorylation site motif).
Biological significance: Both domains and motifs help predict protein function. If a protein
has a kinase domain, it likely has phosphotransferase (kinase) activity. Likewise, the
homeodomain encoded by the homeobox of Hox genes marks a protein as DNA-binding.
Detecting domains and motifs in a sequence can
suggest its family and role. For example, the presence of a Pfam “PH domain” in a protein
suggests it binds phosphoinositides.
Detection tools: Bioinformatics databases and tools identify domains/motifs in protein
sequences. Pfam is a large database of protein families represented by profile Hidden
Markov Models, used to detect domains. PROSITE is a database of sequence motifs and
patterns. Other resources include SMART, CDD (Conserved Domain Database), and MEME.
Given a protein, these tools scan against libraries of known domains/motifs to annotate its
features.
Examples: Common domains include the kinase domain (involved in phosphorylation),
immunoglobulin domains (in antibodies), and leucine-rich repeats (in receptor proteins).
Typical motifs include the N-linked glycosylation motif “N-X-S/T” (where X is any residue
except proline) or the classical Cys2His2 zinc-finger pattern. Identifying these elements is
a key step in protein characterization and annotation.
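Short sequence motifs of this kind are commonly searched with pattern matching. The sketch below scans a protein for the N-X-S/T glycosylation motif using a regular expression, taking X to be any residue except proline. The test sequence is invented, and a match only marks a candidate site, not a confirmed modification.

```python
import re

# N, then any residue except P, then S or T.
# The lookahead makes overlapping matches visible.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(protein):
    """Return (1-based position, matched tripeptide) for each N-X-S/T occurrence."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein)]


# Invented peptide: N-P-T at position 8 is skipped because X is proline.
sites = find_sequons("MKNGSAANPTLNNTS")
```

Profile-based tools such as those behind Pfam work differently (they score a sequence against a statistical model), so this regular-expression approach applies only to simple consensus patterns.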
ORF (Open Reading Frame)
An Open Reading Frame (ORF) is a stretch of DNA that can potentially encode a protein.
Formally, an ORF is a sequence with a length divisible by three, bounded by stop codons
(and usually containing a start codon like ATG). In practice, it is any run of DNA without
in-frame stop codons that could be translated into a polypeptide. ORFs are used as hints for
gene prediction: “Long ORFs are often used, along with other evidence, to initially identify
candidate protein-coding regions”.
Detection: ORFs are found by scanning the sequence in all six reading frames. Tools like
NCBI’s ORFfinder automate this process: given a DNA sequence, it identifies all ORFs
above a specified length and reports the deduced peptide sequences. In microbial
genomes (no introns), a long ORF often corresponds directly to a gene. In eukaryotes, ORFs
in genomic DNA may span several exons separated by introns, so gene prediction also uses
splice signals in addition to ORFs.
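A six-frame scan can be sketched as below. This version assumes one particular convention: an ORF starts at the first ATG in a frame and ends at the next in-frame stop codon, with the stop included in the reported interval. It ignores alternative start codons and ORFs that run off the end of the sequence, and it is not a reimplementation of any NCBI tool.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ATG...stop ORFs on one strand (0-based, half-open)."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None  # look for the next ORF in this frame


def find_orfs(seq, min_codons=3):
    """Scan all six frames; minus-strand coordinates refer to the reverse complement."""
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(strand_seq, min_codons):
            results.append((strand, frame, start, end, strand_seq[start:end]))
    return results


# Toy sequence containing ATG AAA TGG TAA in frame 2 of the plus strand.
orfs = find_orfs("ccATGAAATGGTAAcc")
```

In practice `min_codons` would be set much higher (tens to hundreds of codons) so that short chance ORFs are filtered out.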
Importance: Identifying ORFs is a first step in annotating a new genome or DNA fragment.
The longest ORFs are strong candidates for protein-coding genes. ORFs also help in
annotating metagenomic sequences and checking gene models. For example, if a
predicted gene lacks an ORF in the correct frame, it may be a pseudogene or a
misannotation.
Full Forms of Common Bioinformatics Databases
GenBank (NIH Genetic Sequence Database): The NCBI repository of all publicly available
DNA sequences. (Note: GenBank is a proper name, not an abbreviation.)
EMBL-ENA (European Molecular Biology Laboratory – European Nucleotide Archive): EBI’s
nucleotide sequence database, part of the INSDC collaboration.
DDBJ (DNA Data Bank of Japan): The Japanese partner in INSDC, a public database of
nucleotide sequences.
PDB (Protein Data Bank): The global archive of 3D structures of proteins and nucleic acids.
(Maintained by RCSB, PDBe, PDBj.)
UniProt (Universal Protein Resource): A comprehensive protein sequence and annotation
database (combining Swiss-Prot and TrEMBL).
KEGG (Kyoto Encyclopedia of Genes and Genomes): A database linking genes to
higher-level functions, pathways and networks (metabolism, regulatory networks). It allows
systems-level analysis by mapping gene sets onto metabolic and signaling pathways.
(Other examples: EMBL (European Molecular Biology Laboratory, whose nucleotide sequence
database is now part of ENA), PIR (Protein Information Resource), RefSeq (Reference
Sequence Database), etc.)
Operating Systems in Bioinformatics
Bioinformatics relies heavily on Unix-like operating systems (Linux and macOS/UNIX)
because many tools and pipelines are developed for command-line environments. Linux
distributions (Ubuntu, CentOS, Debian, etc.) dominate computational biology, especially
on servers and clusters. Windows is used primarily on desktops; some tools support
Windows or run under the Windows Subsystem for Linux (WSL), but most
high-performance bioinformatics software is native to Linux/Unix.
Command-line usage: Proficiency with the shell is essential. Common tasks (file
manipulation, text processing, launching analyses) use commands like ls, grep, awk, sed,
and sort. Scripting languages (bash, Python, Perl) automate workflows. For example, a
bioinformatician might use a Bash script to loop over hundreds of FASTA files, running
BLAST or alignment tools on each. Command-line environments also allow job schedulers
(SLURM, PBS) to manage large computing tasks on clusters.
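The batch pattern described above (one BLAST run per FASTA file) is sketched here in Python instead of Bash. The code only builds the `blastn` command lines and demonstrates them on temporary placeholder files. Nothing is executed unless `run_all` is called, which assumes BLAST+ is installed. The database name `hypothetical_db` and the `.fasta`/`.tsv` file naming are assumptions of this example.

```python
import subprocess
import tempfile
from pathlib import Path


def blast_commands(fasta_dir, db, out_dir, evalue="1e-5"):
    """Build one blastn command (tabular output) per *.fasta file, in sorted order."""
    commands = []
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        out_file = Path(out_dir) / f"{fasta.stem}.tsv"
        commands.append([
            "blastn", "-query", str(fasta), "-db", db,
            "-out", str(out_file), "-outfmt", "6", "-evalue", evalue,
        ])
    return commands


def run_all(commands):
    """Run the commands one after another; requires BLAST+ on the PATH."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Dry run on placeholder files: commands are built but not executed.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_b.fasta", "sample_a.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">seq\nACGT\n")
    commands = blast_commands(tmp, db="hypothetical_db", out_dir=tmp)
    queries = [Path(cmd[2]).name for cmd in commands]
```

On a cluster, each command would typically be submitted as its own scheduler job and not run in a serial loop.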
Advantages of open-source/Linux: Open-source OS and tools are free, transparent, and
widely community-supported. Many bioinformatics packages (Bioconductor, SAMtools,
Bowtie, etc.) are open-source and run on Linux. Linux allows easy installation of software
and dependencies via package managers. Open-source environments facilitate
reproducibility: one can share a pipeline (Docker or Conda environment) that others can
run identically. The flexibility of Linux (e.g. running processes in the background, piping
output between programs, editing files with vi/nano) makes it well suited for data-intensive
bioinformatics work. In summary, the Linux/UNIX command line is the standard platform
for bioinformaticians due to its power, flexibility, and open-source ecosystem.
Sources: Authoritative descriptions and tutorials on the above topics have been used.
Further reading includes primary manuals and database documentation (e.g. NCBI Book
on BLAST, Genome Browser FAQs).