Biological Databases
Zoya Khalid
Data Vs. Information
• Information produced by processing data
• Information used to reveal meaning in data
• Accurate, timely and relevant information is the key to good
decision making
• Good decision making is the key to organizational survival
What is a database
• Structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined data
related to the record.
• For example, a protein database would have protein entries as
records and protein properties as fields (e.g., name of protein,
length, amino-acid sequence)
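The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name ProteinRecord and the two sequences are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One database entry (record); each attribute is a field."""
    name: str       # field: name of the protein
    sequence: str   # field: amino-acid sequence (one-letter codes)

    @property
    def length(self) -> int:
        # field derived from the sequence: number of residues
        return len(self.sequence)


# A database is a structured collection of such records.
# Both entries below are invented placeholders, not real proteins.
database = [
    ProteinRecord(name="example peptide A", sequence="MKTAYIAK"),
    ProteinRecord(name="example peptide B", sequence="MSLLT"),
]

for record in database:
    print(record.name, record.length)
```

Every record has the same pre-defined fields, which is what makes the collection searchable (for example, by name or by length).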
Types of databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, Trace, SRA, SNP, GEO
• Derivative Databases
– Derived from primary data
– Content controlled by a third party (e.g., NCBI) through its algorithms
• Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO DataSets, UniGene, HomoloGene,
Structure, Conserved Domain
A flat-file database
Why Flat Files?
• Flat files are the universal mechanism for moving data from one
database or system to another.
• There are two common types of flat files: CSV (comma-separated
values) and files that use another delimiter (e.g., tab or pipe).
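As a rough sketch of moving data between systems with flat files, the snippet below reads a pipe-delimited file and rewrites it as CSV using Python's standard csv module. The header names in the first line are an assumption added for the example; the file content is held in memory so the snippet runs on its own.

```python
import csv
import io

# A small pipe-delimited flat file, kept in memory for the example.
# The header row is hypothetical; the data rows are sample values.
flat_file = io.StringIO(
    "first_name|last_name|department|institution|address\n"
    "Omar|Farooq|Computer Science|NUCES|Islamabad\n"
    "Hadiya|Ali|Electrical Engineering|FAST|Islamabad\n"
)

# Read: each line becomes one record, each delimited value one field.
reader = csv.DictReader(flat_file, delimiter="|")
records = list(reader)

# Write: the same records exported as comma-separated values.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = out.getvalue()

print(csv_text)
```

Only the delimiter changes between the two formats; the records and fields stay the same, which is why flat files work well as an exchange format.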
Relational databases
Relational database
• A relational database consists of relations (tables) containing attributes
(fields or columns). Each row in a table is known as a record or tuple.
• Information should be ‘normalized’ so that it is non-redundant. This means
that every row should be unique, although this ideal is not always observed.
Flat file (pipe-delimited):
Omar|Farooq|Computer Science|NUCES|Islamabad
Hadiya|Ali|Electrical Engineering|FAST|Islamabad
Ahmed|Khan|Dept of Computer Science|NUCES|Isb
Ahmed|Khan|Dept of Management|NUST|Islamabad

The same data as a single table:
First Name  Last Name  Institution  Department                Address
Omar        Farooq     NUCES        Computer Science          Islamabad
Hadiya      Ali        FAST         Electrical Engineering    Islamabad
Ahmed       Khan       NUCES        Dept of Computer Science  Isb
Ahmed       Khan       NUST         Dept of Management        Islamabad
Table Professor (primary key: Prof_id; foreign key: Contact_id)
Prof_id  First_name  Last_name  Contact_id
1        Omar        Farooq     1
2        Hadiya      Ali        2
3        Ahmed       Khan       1
4        Ahmed       Khan       3

Table Contacts (primary key: Contact_id)
Contact_id  Institution  Department              Address
1           NUCES        Computer Science        Islamabad
2           FAST         Electrical Engineering  Islamabad
3           NUST         Management              Islamabad
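The Professor and Contacts tables can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database and lower-case table and column names chosen for the example; any relational database system would work along the same lines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys

conn.executescript("""
CREATE TABLE contacts (
    contact_id  INTEGER PRIMARY KEY,
    institution TEXT,
    department  TEXT,
    address     TEXT
);
CREATE TABLE professor (
    prof_id    INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    contact_id INTEGER REFERENCES contacts(contact_id)
);
""")

# Each contact is stored once, however many professors refer to it.
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?, ?)", [
    (1, "NUCES", "Computer Science", "Islamabad"),
    (2, "FAST", "Electrical Engineering", "Islamabad"),
    (3, "NUST", "Management", "Islamabad"),
])
conn.executemany("INSERT INTO professor VALUES (?, ?, ?, ?)", [
    (1, "Omar", "Farooq", 1),
    (2, "Hadiya", "Ali", 2),
    (3, "Ahmed", "Khan", 1),
    (4, "Ahmed", "Khan", 3),
])

# A join follows the foreign key to rebuild the combined view.
rows = conn.execute("""
    SELECT p.first_name, p.last_name, c.institution, c.department
    FROM professor AS p
    JOIN contacts AS c ON p.contact_id = c.contact_id
    ORDER BY p.prof_id
""").fetchall()

for row in rows:
    print(row)
```

Because the department name lives in one row of the contacts table, inconsistent spellings such as "Computer Science" versus "Dept of Computer Science" cannot creep in the way they do in the flat file.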
Types of databases
Database providers
• The National Center for Biotechnology Information (NCBI, USA)
offers data banks, databases and tools
• The European Bioinformatics Institute (EBI) serves a similar
function in Europe
• GenomeNet gathers several databases from Japan
Data quality
• How are entries submitted?
– Step by step protocol
• What is the evidence?
– Automatic validation
– Manual curation
• How new is the data?
• Can the data be secret?
• Redundant or non-redundant?
Summary
NCBI
European Bioinformatics Institute
GenomeNet
NAR database issue
Nucleotide databases
• International Nucleotide Sequence Database Collaboration (INSDC)
– GenBank
– EMBL
– DDBJ
• The nucleotide sequence databases are data repositories,
accepting nucleic acid sequence data from the scientific
community and making it freely available.
– The databases strive for completeness, with the aim of recording
every publicly known nucleic acid sequence.
– These data are heterogeneous; they vary with respect to
• the source of the material (e.g. genomic versus cDNA)
• the intended quality (e.g. finished versus single-pass sequences)
• the extent of sequence annotation
• the intended completeness of the sequence relative to its biological target
(e.g. complete versus partial coverage of a gene or a genome).
GenBank entry
Genome specific databases
Protein Databases
• Sequences are in UniProt
• Structures are in PDB
• Enzyme classification: EC numbers
• Protein families: Pfam, InterPro, etc.
UniProt
• UniProtKB: Protein knowledgebase, consists of two sections:
– Swiss-Prot, which is manually annotated and reviewed.
– TrEMBL, which is automatically annotated and is not reviewed.
• Includes complete and reference proteome sets.
• UniRef: Sequence clusters, used to speed up sequence
similarity searches.
• UniParc: Sequence archive, used to keep track of sequences
and their identifiers.
• Supporting data
– Literature citations, keywords, subcellular locations, cross-referenced
databases and more.
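The idea of a sequence archive that tracks sequences and their identifiers can be illustrated with a toy data structure. This is not how UniParc is implemented; the class SequenceArchive, the source names "db_a" and "db_b", the identifiers, and the sequences are all hypothetical. The sketch only shows the principle of storing each distinct sequence once and attaching every identifier under which it was seen.

```python
import hashlib
from collections import defaultdict


class SequenceArchive:
    """Toy non-redundant archive: one entry per distinct sequence."""

    def __init__(self):
        self._sequences = {}                  # checksum -> sequence
        self._identifiers = defaultdict(set)  # checksum -> {(source, id)}

    @staticmethod
    def _checksum(sequence):
        # Normalise case and whitespace so identical sequences match.
        normalised = sequence.strip().upper()
        return hashlib.sha256(normalised.encode("ascii")).hexdigest()

    def add(self, source_db, identifier, sequence):
        key = self._checksum(sequence)
        self._sequences.setdefault(key, sequence.strip().upper())
        self._identifiers[key].add((source_db, identifier))
        return key

    def identifiers_for(self, sequence):
        # Every (source, identifier) pair recorded for this sequence.
        return sorted(self._identifiers.get(self._checksum(sequence), set()))

    def __len__(self):
        return len(self._sequences)


archive = SequenceArchive()
archive.add("db_a", "ID001", "MKTAYIAK")
archive.add("db_b", "X42", "mktayiak")   # same sequence, another source
archive.add("db_a", "ID002", "MSLLT")

print(len(archive))                      # number of distinct sequences
print(archive.identifiers_for("MKTAYIAK"))
```

Grouping similar (not just identical) sequences, as UniRef does, would need a sequence-similarity clustering step that this sketch does not attempt.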
UniProt
PDB
Pfam
Multiple Sequence Alignment and HMMs
KEGG