Bioinformatics Study Note – Data Repositories & Biological Databases

PART 1: DATA REPOSITORIES

What is a Data Repository?

A data repository is a digital storage system where data is collected, maintained, and made available for users or
systems. In bioinformatics and scientific research, a data repository acts like a library of data, often domain-specific.
Key Characteristics:
• Stores data sets or collections from various sources.
• Organized logically or by subject area.
• Accessible by users for sharing, reuse, or research.
• May have submission restrictions (based on academic level, funding, data type).
• Some are open-access, while others are restricted.
Purpose:
• To preserve scientific data.
• To enable collaboration among researchers.
• To support reproducibility in research.
• To provide a permanent digital archive.

Examples of Data Repositories

1. Data Warehouse

• A large, centralized repository that integrates data from different departments or systems.
• Used mainly for reporting, data mining, and decision-making.
• Example: A company’s data warehouse that combines finance, sales, and HR data.

2. Data Mart

• A department-specific segment of a data warehouse.
• Focused on a single domain like marketing or customer service.
• Provides faster access to relevant data.
• Example: A marketing data mart focused only on campaign performance.

3. Data Lake

• A flexible storage system for storing raw data in any format: structured (tables), semi-structured (XML,
JSON), or unstructured (text, images).
• Used in big data and machine learning environments.
• Allows different analytics use cases, including real-time dashboards and predictive models.

4. Data Cubes

• A multidimensional representation of data, often used in OLAP (Online Analytical Processing).
• Enables quick analysis by slicing and dicing across dimensions like time, location, and product.
• Example: A sales cube analyzing revenue across year, product, and region.
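
The slicing and dicing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, product names, and revenue figures are invented, and the helper names (roll_up, slice_cube, dice_cube) are hypothetical rather than part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, product, region, revenue). All values are invented.
SALES = [
    (2023, "kit_A", "EU", 120.0),
    (2023, "kit_A", "US", 200.0),
    (2023, "kit_B", "EU", 80.0),
    (2024, "kit_A", "EU", 150.0),
    (2024, "kit_B", "US", 95.0),
]
DIMENSIONS = ("year", "product", "region")


def roll_up(rows, keep):
    """Sum revenue over every dimension not named in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose value on each named dimension is in the allowed set."""
    checks = [(DIMENSIONS.index(name), set(values)) for name, values in allowed.items()]
    return [row for row in rows if all(row[i] in ok for i, ok in checks)]


revenue_by_year = roll_up(SALES, ("year",))
eu_by_product = roll_up(slice_cube(SALES, "region", "EU"), ("product",))
kit_a_2023 = dice_cube(SALES, year={2023}, product={"kit_A"})
```

Here revenue_by_year collapses the product and region dimensions, eu_by_product first slices on one region and then aggregates, and kit_a_2023 restricts two dimensions at once. A real OLAP engine would precompute such aggregates rather than scan the rows each time.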

5. Metadata Repository

• Stores metadata, or "data about data".
• Explains where the data came from, how it was collected, and what it represents.
• Helps in data governance, data cataloging, and regulatory compliance.
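
As a rough sketch of what one entry in a metadata repository might hold, the Python below records the source, collection method, and meaning of a dataset and supports a simple catalog lookup. The field names, dataset IDs, and descriptions are all assumptions made up for illustration; real repositories follow their own metadata schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """One hypothetical catalog entry: data about a dataset, not the data itself."""
    dataset_id: str
    source: str             # where the data came from
    collection_method: str  # how it was collected
    description: str        # what it represents
    data_format: str


catalog = {}


def register(record):
    """Add a record to the in-memory catalog, keyed by dataset ID."""
    catalog[record.dataset_id] = record


def find_by_format(data_format):
    """Return the IDs of all catalogued datasets stored in the given format."""
    return sorted(r.dataset_id for r in catalog.values() if r.data_format == data_format)


# Invented example entries.
register(MetadataRecord("ds001", "Lab A", "Sanger sequencing", "16S rRNA reads", "FASTA"))
register(MetadataRecord("ds002", "Lab B", "Microarray", "Expression profile", "CSV"))
fasta_datasets = find_by_format("FASTA")
```

Keeping the metadata separate from the data itself is what lets a catalog answer questions such as "which datasets are in FASTA format?" without opening any of the files.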

Database vs. Data Repository

Feature     | Database                              | Data Repository
------------|---------------------------------------|---------------------------------------------------------
Primary Use | Storing and managing operational data | Archiving, sharing, and reusing data
Data Type   | Structured (SQL)                      | Any type (structured, semi-structured, unstructured)
Access      | Often private or user-specific        | Can be public or open-access
Example     | MySQL for an app’s backend            | GenBank for public DNA data

PART 2: BIOLOGICAL DATABASES

Biological databases are essential tools in bioinformatics, providing organized collections of data related to
nucleotides, proteins, metabolites, and macromolecular structures.

Types of Biological Data Stored

• Nucleotide Sequences (DNA/RNA)
• Protein Sequences
• Protein Motifs/Patterns
• 3D Structures of Biomolecules
• Gene Expression Profiles
• Metabolic Pathways

Classification of Biological Databases

1. Primary Databases

• Store raw experimental data submitted directly by researchers.
• Data is archival and usually unprocessed.
• Once assigned, an accession number is permanent. Entries form part of the scientific record and are normally revised only by the original submitter, with changes tracked as new versions.
Examples:
• GenBank (NCBI): DNA/RNA sequences
• EMBL (EBI): European nucleotide archive
• DDBJ (Japan): Japanese DNA database
• SWISS-PROT: High-quality protein sequences
• PDB (Protein Data Bank): 3D biomolecular structures
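
GenBank-style identifiers are commonly written as accession.version, where the accession stays fixed and the version number increases when the sequence is revised. The Python below is a minimal sketch of handling such identifiers; the helper names are hypothetical, and the parsing assumes the simple "ACCESSION.N" form only.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version is None if absent."""
    accession, dot, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    if not dot:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version is not a number: {identifier!r}")
    return accession, int(version)


def same_record(id_a, id_b):
    """True if two identifiers point at the same accession, whatever the version."""
    return split_accession(id_a)[0] == split_accession(id_b)[0]


# U49845 is used here purely as an example identifier.
parsed = split_accession("U49845.1")
unversioned = split_accession("U49845")
```

Comparing on the accession alone (as same_record does) is useful when a citation omits the version, while the version number matters whenever the exact sequence has to be reproduced.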

2. Secondary Databases

• Contain analyzed or derived data from primary databases.
• Often curated or generated using algorithms.
• More informative, with annotations, predictions, or interpretations.
Examples:
• Pfam: Protein families
• PROSITE: Protein domains/motifs
• PRINTS: Protein fingerprints
• BLOCKS: Aligned protein blocks
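
To make the idea of a stored motif concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It covers only the basic pattern syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) repeats, and the < and > anchors). The function names and the test sequence are invented for illustration, and this is not PROSITE's own scanning software.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        core, _, repeat = element.partition("(")
        repeat = "{" + repeat.rstrip(")") + "}" if repeat else ""
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # [ABC] groups and single residue letters are already valid regex.
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence)]


# N-glycosylation site pattern; the sequence below is a made-up toy example.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hits = find_motif("N-{P}-[ST]-{P}.", "MKNGSAANPTQ")
```

The lookahead in find_motif is a design choice so that overlapping occurrences are all reported; a plain finditer would skip matches that start inside an earlier match.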

3. Composite Databases

• Integrate entries from multiple primary databases.
• Allow users to search across various datasets quickly.
• Examples: NRDB, OWL, MIPSX, and SWISS-PROT + TrEMBL (TrEMBL is the computer-annotated supplement to SWISS-PROT).
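
The integration step can be pictured as merging entries from several sources and collapsing duplicates. The Python sketch below does this for exact sequence matches only; the source names, entry IDs, and sequences are invented, and real composite databases such as NRDB or OWL apply their own redundancy criteria, which this does not attempt to reproduce.

```python
def build_composite(sources):
    """Merge {source_name: [(entry_id, sequence), ...]} into {sequence: [cross-references]}.

    Entries with identical sequences (ignoring case) are collapsed into one record
    that remembers every source entry it came from.
    """
    merged = {}
    for source_name, entries in sources.items():
        for entry_id, sequence in entries:
            merged.setdefault(sequence.upper(), []).append(f"{source_name}:{entry_id}")
    return merged


# Hypothetical toy input: two sources sharing one identical sequence.
toy_sources = {
    "source_a": [("A001", "MKTAYIAK"), ("A002", "MLLPQRS")],
    "source_b": [("B107", "mktayiak"), ("B108", "MGGHHW")],
}
composite = build_composite(toy_sources)
```

Searching the merged dictionary once is cheaper than querying each source separately, and the cross-reference list lets a user trace each sequence back to its original entries.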

Examples of Major Biological Databases

Database   | Content                        | Description
-----------|--------------------------------|--------------------------------------------------------
GenBank    | DNA/RNA sequences              | Managed by NCBI, accepts public submissions
EMBL       | Nucleotide sequences           | Managed by EBI, European equivalent to GenBank
DDBJ       | DNA sequences                  | Japanese repository, works with EMBL and GenBank
SWISS-PROT | Annotated protein sequences    | High-quality, manually curated protein database
PIR        | Protein Information Resource   | Sequence data and functional information
TrEMBL     | Translated EMBL nucleotide data | Supplement to SWISS-PROT for automatic entries
PDB        | 3D macromolecular structures   | Critical for drug design and structural biology
Pfam       | Protein families and domains   | Categorizes proteins by function or evolutionary lineage
PROSITE    | Protein domains/motifs         | Identifies functional parts of proteins

Summary: Primary vs. Secondary Biological Databases

Feature     | Primary Database                        | Secondary Database
------------|-----------------------------------------|------------------------------------
Data Source | Direct from experiments                 | Derived by analysis of primary data
Content     | Raw sequences/structures                | Annotations, motifs, families
Update      | Archival; revised only by the submitter | Regularly updated
Examples    | GenBank, EMBL, DDBJ, SWISS-PROT         | Pfam, PROSITE, PRINTS, BLOCKS

Application in Research

• Track gene mutations and identify protein functions.
• Support disease gene discovery, personalized medicine, and vaccine design.
• Aid structure prediction, drug target identification, and comparative genomics.

Category | Definition / Purpose | Examples | Notes
---------|----------------------|----------|------
Data Repository | A storage system that holds, organizes, and shares data. | GenBank, Data Lake, Metadata Repositories | May store raw, structured, or unstructured data.
Data Warehouse | Centralized, structured data storage for reporting & analysis. | Enterprise Sales Data Warehouse | Integrates data from multiple sources.
Data Mart | Department-specific subset of a data warehouse. | Finance Data Mart, Marketing Mart | Faster, domain-focused access.
Data Lake | Stores raw data of any format for analytics & machine learning. | AWS S3, Hadoop HDFS | Accepts structured/unstructured/semi-structured data.
Data Cube | Multidimensional data storage for analytical processing (OLAP). | Sales cube (by region, product, time) | Enables fast slicing/dicing of data.
Metadata Repository | Stores metadata (information about other data sources). | Data catalog systems | Includes source, format, capture method, etc.
Primary Database | Stores original experimental biological data. | GenBank, EMBL, DDBJ, SWISS-PROT, PDB | Archival, submitted by researchers, unmodified.
Secondary Database | Contains derived/curated data from primary databases. | Pfam, PROSITE, PRINTS, BLOCKS | Annotations, protein domains, functional data.
Composite Database | Combines multiple primary sources for efficient searching. | NRDB, OWL, TrEMBL | Supports bulk search and integration.
GenBank | Annotated public nucleotide sequence database maintained by NCBI. | [Link] | Primary source for DNA/RNA sequences.
SWISS-PROT | High-quality, manually annotated protein sequence database. | [Link] | Partnered with TrEMBL.
Pfam / PROSITE / BLOCKS | Secondary databases for protein families and functional motifs. | [Link] / [Link] | Help predict protein function.
PDB (Protein Data Bank) | Global repository of 3D macromolecular structures. | [Link] | Essential for structural bioinformatics and drug design.
