**Protein Sequence Databases: Primary and Secondary**
**1. Primary Databases**
Primary databases archive protein sequences as submitted by
researchers or as translated from sequenced nucleotide
records, together with associated metadata. Entries often
carry annotations such as source organism, function, and
literature references. Key examples include:
- **UniProt** (Universal Protein Resource):
  - **Swiss-Prot**: Manually curated entries with detailed
    annotations, including function, structure, and
    post-translational modifications.
  - **TrEMBL**: Automatically annotated entries awaiting
    curation, derived from EMBL-Bank/GenBank/DDBJ
    translations.
  - **UniProtKB**: Combines Swiss-Prot and TrEMBL,
    offering comprehensive coverage.
- **NCBI Protein**: Part of the Entrez system, aggregating
sequences from GenBank translations, RefSeq, PDB, and other
sources such as Swiss-Prot and PIR. RefSeq provides
non-redundant, curated reference sequences.
- **DDBJ** (DNA Data Bank of Japan): A nucleotide archive
that exchanges data with GenBank and ENA through the INSDC
collaboration; protein sequences are available as
translations of annotated coding regions.
- **PIR** (Protein Information Resource): Now a member of
the UniProt consortium; historically focused on protein
classification.
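
The Swiss-Prot/TrEMBL split is visible in UniProtKB FASTA headers, where the prefix `sp` marks a reviewed (Swiss-Prot) entry and `tr` an unreviewed (TrEMBL) one. The following is a minimal sketch of reading that prefix; the function and class names are illustrative, the first header is modelled on the human hemoglobin alpha entry, and the second uses a made-up accession.

```python
import re
from typing import NamedTuple, Optional

# UniProtKB FASTA headers start with ">db|accession|entry_name ...",
# where db is "sp" (Swiss-Prot, reviewed) or "tr" (TrEMBL, unreviewed).
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")

SECTION_NAMES = {"sp": "Swiss-Prot", "tr": "TrEMBL"}


class UniProtHeader(NamedTuple):
    section: str
    accession: str
    entry_name: str
    description: str


def parse_uniprot_header(line: str) -> Optional[UniProtHeader]:
    """Return the parsed header, or None if the line is not a UniProtKB header."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    db, accession, entry_name, description = match.groups()
    return UniProtHeader(SECTION_NAMES[db], accession, entry_name, description)


# Illustrative headers: the second accession is invented for this example.
reviewed = parse_uniprot_header(
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)
unreviewed = parse_uniprot_header(
    ">tr|A0A000XXX0|A0A000XXX0_9BACT Uncharacterized protein OS=Example bacterium OX=0 PE=4 SV=1"
)
print(reviewed.section, reviewed.accession)
print(unreviewed.section, unreviewed.accession)
```

Headers from other sources (for example NCBI records) do not follow this layout, so the function returns `None` for them rather than guessing.
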
**2. Secondary Databases**
Secondary databases analyze, classify, or predict features
from primary data, adding value through computational or
manual curation. They focus on domains, families,
structures, or functional annotations. Examples include:
- **Pfam**: Protein family database using hidden Markov
models (HMMs) to identify domains and families.
- **PROSITE**: Catalogs protein domains, families, and
functional sites using patterns and profiles.
- **InterPro**: Integrates multiple databases (Pfam,
PROSITE, PRINTS, etc.) to provide comprehensive protein
signature analysis.
- **PRINTS**: Fingerprint database in which each family is
characterized by a group of conserved motifs.
- **SMART**: Focuses on domain architectures, particularly
in signaling and extracellular proteins.
- **CDD** (Conserved Domain Database): Annotates
conserved domains using tools like RPS-BLAST.
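
To make the idea of a PROSITE pattern concrete, the sketch below translates the basic pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors) into a Python regular expression. It is a simplified illustration, not the scanner PROSITE itself uses; the function names and the test sequence are made up, and rarer syntax is rejected rather than handled.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off a repeat count such as "(3)" or "(2,4)".
        core, _, repeat = element.partition("(")
        if core in ("x", "X"):
            core = "."                   # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        elif not (core.startswith("[") and core.endswith("]")) and len(core) != 1:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if repeat:
            core += "{" + repeat.rstrip(")") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched residues) for every match, overlaps included."""
    # A lookahead lets overlapping matches be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# N-glycosylation site pattern (PROSITE PS00001) on an invented sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTQNITR"))
```

Short patterns like this one match often by chance, which is one reason profile- and HMM-based resources such as Pfam complement pattern databases.
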
**Structural and Functional Secondary Databases**:
- **SCOP** (Structural Classification of Proteins) &
**CATH**: Classify protein structures into hierarchies (e.g.,
folds, superfamilies).
- **KEGG**: Maps proteins to metabolic pathways and
functional networks.
- **STRING**: Collects known and predicted protein-protein
associations, drawing on genomic context, experimental data,
co-expression, and text mining.
**Key Differences**:
- **Primary**: Store sequence records (e.g., UniProtKB, NCBI Protein).
- **Secondary**: Provide derived information (e.g., Pfam for
families, SCOP for structural classification).
**Applications**:
- **Primary**: Direct access to sequence data for research
like cloning or phylogenetics.
- **Secondary**: Facilitate functional annotation,
evolutionary studies, and structural predictions.
**Integration**: Tools like BLAST search a query against
primary databases to find similar sequences, while secondary
databases help interpret the hits (e.g., identifying the
domains of a BLAST hit by scanning it with InterProScan).
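
One way to connect the two tiers in practice is to take the accessions of significant BLAST hits and look them up in a secondary database. The sketch below covers only the first half: it parses BLAST+ tabular output (`-outfmt 6`, default columns) and keeps hits below an E-value cutoff. The cutoff, the function and class names, and the example rows are assumptions for illustration; the rows are not real BLAST results.

```python
from typing import List, NamedTuple

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
           "qstart qend sstart send evalue bitscore").split()


class Hit(NamedTuple):
    query: str
    accession: str
    percent_identity: float
    evalue: float


def parse_hits(tabular: str, max_evalue: float = 1e-5) -> List[Hit]:
    """Parse -outfmt 6 text and keep hits at or below the E-value cutoff."""
    hits = []
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        # UniProt subject IDs look like "sp|P69905|HBA_HUMAN"; keep the accession.
        fields = row["sseqid"].split("|")
        accession = fields[1] if len(fields) >= 3 else row["sseqid"]
        evalue = float(row["evalue"])
        if evalue <= max_evalue:
            hits.append(Hit(row["qseqid"], accession, float(row["pident"]), evalue))
    return hits


# Made-up example rows, not real BLAST results.
example = (
    "query1\tsp|P69905|HBA_HUMAN\t98.6\t142\t2\t0\t1\t142\t1\t142\t1e-95\t280\n"
    "query1\ttr|A0A000XXX0|A0A000XXX0_9BACT\t31.0\t100\t60\t3\t5\t100\t8\t104\t0.5\t30.1\n"
)
for hit in parse_hits(example):
    print(hit.accession, hit.evalue)
```

The retained accessions can then be used to retrieve domain and family annotations, for instance from InterPro.
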
Together, the two tiers give researchers both the underlying
sequence records and the derived annotations needed to
interpret them in genomics and proteomics.