Protein Databases

Uploaded by

Ashutosh Singh

Protein database

Unit 2, Part -3
UniProt - Universal protein resource



Source of protein sequence data
• Direct protein sequencing is rare
• Most protein sequences are derived from
nucleotide data

[Figure: nucleotide sequence database → protein sequence database]
Protein sequence data is mainly derived data
Tools/services at UniProt
1. BLAST
2. Align
3. Retrieve/ID mapping
4. Peptide search
UniProt
● The Universal Protein Resource (UniProt) is a comprehensive resource for
protein sequence and annotation data.
● The UniProt databases are:
○ UniProt Knowledgebase (UniProtKB),
○ UniProt Reference Clusters (UniRef),
○ UniProt Archive (UniParc).
UniProtKB
● Reviewed (Swiss-Prot): manually annotated
○ Records with information extracted from literature and
curator-evaluated computational analysis, and scientific conclusions.
○ Is a high quality non-redundant protein sequence database
● Unreviewed (TrEMBL): computationally analyzed
○ Records that await full manual annotation.
○ Contains protein sequences associated with computationally generated
annotation and large-scale functional characterization
UniProtKB and its functions
● The UniProt Knowledgebase is the central hub for the
collection of functional information on proteins, with accurate,
consistent and rich annotation.
● It captures the core data mandatory for each UniProtKB entry,
mainly:
○ the amino acid sequence,
○ protein name or description
○ taxonomic data
○ citation information
● It also adds as much annotation information as possible.
● Annotation includes:
○ widely accepted biological ontologies and classifications
○ cross-references
● It also gives clear indications of the quality of annotation in the form of
evidence attribution of experimental and computational data.
UniProt automatic annotation
● UniProt has developed two complementary approaches to automatically
annotate protein sequences with a high degree of accuracy.
● UniRule is a collection of manually curated annotation rules which define
annotations that can be propagated based on specific conditions
● Statistical Automatic Annotation System (SAAS) is an automatic
decision-tree based rule-generating system.
● The central components of these approaches are rules based
on InterPro classification and the manually curated data in
UniProtKB/Swiss-Prot.
● Predictions of sequence features such as Signal, Transmembrane and Coil
regions are generated using software from external providers
● UniProt uses InterPro to classify sequences at superfamily, family and
subfamily levels and to predict the occurrence of functional domains and
important sites. InterPro integrates predictive models of protein function,
so-called ‘signatures’, from a number of member databases.
● InterPro matches are automatically annotated to UniProtKB entries as
database cross-references with every InterPro release.
● In UniProtKB/TrEMBL entries, domains from the InterPro member
databases PROSITE, SMART or Pfam are predicted and annotated
automatically, and their evidence/source labels indicate “InterPro
annotation”.

[Figure: automatic annotation using Swiss-Prot and InterPro]
UniProt Manual curation
● UniProt provides both manual curation and automatic annotation
● The UniProt manual curation process comprises manual review of
results from a range of sequence analysis programs and literature
curation of experimental data as well as attribution of all information
to its original source.
● Curators also assign GO terms to all manually curated entries.
Manual curation process
● This process consists of 6 major mandatory steps:
○ Sequence curation
○ Sequence analysis
○ Literature curation
○ Family-based curation
○ Evidence attribution
○ Quality assurance and integration of completed entries.
● Curation is performed by expert biologists using a range of
tools that have been iteratively developed in close
collaboration with curators.
The UniProt Reference Clusters (UniRef)
● UniRef provides clustered sets of sequences from the UniProt Knowledgebase
(including isoforms) and selected UniParc records. Clustering hides redundant
sequences while keeping complete coverage of the sequence space at three
resolutions:
○ UniRef100 combines identical sequences and sub-fragments with 11 or more
residues from any organism into a single UniRef entry.
○ UniRef90 is built by clustering UniRef100 sequences such that each cluster is
composed of sequences that have at least 90% sequence identity to, and 80%
overlap with, the longest sequence (a.k.a. seed sequence).
○ UniRef50 is built by clustering UniRef90 seed sequences that have at least
50% sequence identity to, and 80% overlap with, the longest sequence in the
cluster.
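The UniRef100 rule above can be sketched in a few lines of Python. This is only a toy illustration, assuming that "sub-fragment" can be approximated by an exact substring test; it is not the clustering procedure UniProt actually runs. The function name and the example sequences are made up for the illustration.

```python
MIN_FRAGMENT_LENGTH = 11  # from the slide: sub-fragments need 11 or more residues


def uniref100_like_clusters(sequences):
    """Group identical sequences and sub-fragments (naive substring test).

    `sequences` maps a sequence ID to its amino acid sequence.
    Returns a dict: representative sequence -> list of member IDs.
    """
    clusters = {}
    # Process the longest sequences first so fragments meet their parent.
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        for representative in clusters:
            identical = seq == representative
            fragment = len(seq) >= MIN_FRAGMENT_LENGTH and seq in representative
            if identical or fragment:
                clusters[representative].append(seq_id)
                break
        else:
            # No existing cluster fits: this sequence starts a new one.
            clusters[seq] = [seq_id]
    return clusters


# Hypothetical toy input, not real database records.
toy = {
    "seqA": "MKTAYIAKQRQISFVKSHFSRQ",
    "seqB": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to seqA
    "seqC": "AYIAKQRQISFVK",           # 13-residue fragment of seqA
    "seqD": "AYIAKQ",                  # too short to be merged as a fragment
}
clusters = uniref100_like_clusters(toy)
```

UniRef90 and UniRef50 additionally need pairwise identity and overlap values from a sequence alignment, which this sketch does not attempt.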
UniParc
● A comprehensive & non-redundant database
● Contains most of the publicly available protein sequences in the
world.
● Proteins may exist in different source databases and in multiple
copies in the same database. UniParc avoids such redundancy by
storing each unique sequence only once and giving it a stable and
unique identifier (UPI).
● A UPI is never removed, changed or reassigned.
● UniParc contains only protein sequences. All other information
about the protein must be retrieved from the source databases using
the database cross-references.
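The "store each unique sequence once" idea can be illustrated with a small in-memory sketch. The class name, the source database labels and the way identifiers are numbered here are assumptions made for the example, not UniParc's actual implementation.

```python
class ToyUniParc:
    """Toy archive: one stable identifier per unique sequence."""

    def __init__(self):
        self._upi_by_sequence = {}   # sequence -> identifier
        self._sources_by_upi = {}    # identifier -> set of (database, ID)

    def add(self, sequence, source_db, source_id):
        """Register a sequence from a source database and return its UPI."""
        upi = self._upi_by_sequence.get(sequence)
        if upi is None:
            # Assumed format for illustration: "UPI" plus a zero-padded counter.
            upi = "UPI{:010X}".format(len(self._upi_by_sequence) + 1)
            self._upi_by_sequence[sequence] = upi
            self._sources_by_upi[upi] = set()
        # An existing UPI is never changed; only cross-references are added.
        self._sources_by_upi[upi].add((source_db, source_id))
        return upi

    def cross_references(self, upi):
        """Return the source records that carry this sequence."""
        return sorted(self._sources_by_upi[upi])


archive = ToyUniParc()
# Hypothetical source records; the database names and IDs are placeholders.
first = archive.add("MKTAYIAKQR", "SourceDB1", "x1")
second = archive.add("MKTAYIAKQR", "SourceDB2", "y7")  # same sequence again
third = archive.add("MSLLTEVETY", "SourceDB1", "x2")
```

Here `first` and `second` are the same identifier because the sequence is identical, and everything else about the protein has to be looked up through `cross_references`.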
● Predictions of sequence features such
as Signal, Transmembrane and Coil regions are generated
using the following software from external providers:
● TMHMM
● SignalP
● Phobius
● Coils
● TMHMM and Phobius predictors are used to infer
transmembrane regions.
PDB
wwPDB
PDB

● An Information Portal to Biological Macromolecular Structures
(structure counts shown on the slide: 132661, 143581, 168358, 174014)
● RCSB PDB is a member of wwPDB
● Provides a structural view of biology
PDB Archive
● The Protein Data Bank (PDB) archive is the single worldwide repository
of information about the 3D structures of large biological molecules,
including proteins, nucleic acids and carbohydrates, from all organisms,
including bacteria, yeast, plants, flies, other animals, and humans.
● Understanding the shape of a molecule helps to deduce a structure's role
in human health and disease, and in drug development. The structures
in the archive range from tiny proteins and bits of DNA to complex
molecular machines like the ribosome.
● Freely available.
● Updated weekly.
History of PDB
● PDB was established in 1971 at Brookhaven National Laboratory by
Walter Hamilton and originally contained 7 structures.
● In 1998, the Research Collaboratory for Structural Bioinformatics
(RCSB) started managing the PDB.
● In 2003, the wwPDB was formed to maintain a single PDB archive of
macromolecular structural data that is freely and publicly available to
the global community. It consists of organizations that act as
deposition, data processing and distribution centres for PDB data.
● In addition, the RCSB PDB supports a website where visitors can
perform simple and complex queries on the data, analyze, and
visualize the results.
Homepage (https://www.rcsb.org/pdb/home/home.do)
wwPDB


Function of PDB
● It provides the Protein Data Bank archive-
○ information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps
students and researchers understand all aspects of biomedicine and agriculture, from protein
synthesis to health and disease.

● RCSB PDB curates and annotates PDB data.


● The RCSB PDB builds upon the data by creating tools and resources
for research and education in molecular biology, structural biology,
computational biology, and beyond.
● DEPOSIT (prepare, validate, deposit data)
● ANALYZE (Sequence & Structure Alignment, Protein Symmetry,
Structure Quality, Map Genomic Position to Protein)
● VISUALIZE (Viewers: 3D Structure Viewers, Pathway View,
Sequence-Protein Feature View, Human Gene View)
● SEARCH, DOWNLOAD and LEARN
RCSB PDB data deposition

● How to deposit data quickly, easily, accurately, and efficiently


● RCSB PDB tools like pdb_extract, Validation Server, LigandDepot, ADIT
● How the RCSB PDB annotates structures
Why is structural data deposited to the PDB?

○ Journal policies for the primary citation require it
○ Funding agencies require it
○ For safe-keeping of structural data
○ For the benefit of the entire scientific community
Information to be deposited

● The coordinates
● The structure factor file(s)
● and more...
○ Information that only you can provide
○ Information that you should complete and verify
- about the molecule(s) or complex
- about the crystallization and data collection
○ Information that can be extracted from log files of
crystallographic applications.
Data deposition at the RCSB PDB
Steps for Fast, Accurate, and Complete Data Deposition at the RCSB PDB
1. Use pdb_extract
2. Validate your entry
3. Verify sequence
4. Use Ligand Depot
5. Deposit with ADIT (no longer available) or the wwPDB OneDep system
Steps in data deposition
● Prepare data
■ pdb_extract
■ SF-Tool
■ Ligand Expo
■ MAXIT
■ ADIT (Auto Dep Input Tool) – no longer available
● Validate data
■ Validation reports contain an assessment of the quality of a
structure and highlight specific concerns by considering the
coordinates of the model, the experimental data and the fit between
the two (validation servers and task forces)
● Deposit data
■ wwPDB OneDep System
pdb_extract
● A resource which assembles specific details about your experiment and experimental model
from your coordinate and structure determination output files in preparation for PDB
deposition.
● Available as an online tool or a standalone program.
This tool will:
● provide an author information form for X-ray, NMR and EM experiments, which
can be saved/updated for multiple related entry depositions.
● assemble coordinate and log files pertaining to your specific experimental methods.
● allow you to fix the primary sequence of your protein/nucleotide chains to account for
unresolved residues.
● generate a complete data file ready for validation and deposition.
SF-TOOL can be used
➢ to convert between various structure factor formats
➢ to check the model coordinates against the structure factor data.
Ligand Expo (formerly Ligand Depot)
- Provides chemical and structural information about small molecules within
the structure entries of the Protein Data Bank.
RCSB PDB LigandDepot
- Use it to find the code for existing ligands
- Search by many attributes
- New ligands
MAXIT
● MAXIT assists in the processing and curation of macromolecular structure data. It
can do the following things:

○ Read and write PDB and mmCIF format files, and translate between file formats.
○ Perform consistency checks on coordinates, sequence, and crystal data.
○ Automatically construct, transform, and merge information between formats
○ Align residue numbering in the coordinates with the sequence
○ Reorder and rename atoms in standard and nonstandard residues and ligands according to the
Chemical Component Dictionary
○ Assign ligands the same chain IDs as the adjacent polymers
○ Detect missing or additional atoms
PDB-101
● Educational tool
● Online portal for teachers, students, and the general public to
promote exploration in the world of proteins and nucleic acids.
● Videos, interactive animations, paper models etc.
● Helps in learning about the diverse shapes and functions of the
biological macromolecules making it convenient to understand all
aspects of biomedicine and agriculture, from protein synthesis to
health and disease to biological energy
Annotation
● Check entry for self-consistency
● Check title
● Check citation references with PubMed (http://pubmed.gov/)
● Correct format errors in data and coordinates
● Check sequence
● Add sequence database reference
● Add protein name and synonyms
● Check source
● Check ligand nomenclature
● Add biological unit information
● Visually check entry
● Generate validation reports
Accession numbers

● The formats for GenBank accession numbers are:

Nucleotide: 1 letter + 5 numerals
OR 2 letters + 6 numerals
Protein: 3 letters + 5 numerals
WGS: 4 letters + 2 numerals for WGS assembly version + 6-8 numerals
MGA: 5 letters + 7 numerals
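The formats listed above translate directly into regular expressions. The sketch below covers only the patterns given on this slide (other accession formats exist and are not handled); the function name and the test strings are made up for illustration.

```python
import re

# One pattern per category, exactly as listed on the slide.
GENBANK_PATTERNS = [
    ("nucleotide", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("protein", re.compile(r"[A-Z]{3}\d{5}")),
    ("WGS", re.compile(r"[A-Z]{4}\d{2}\d{6,8}")),
    ("MGA", re.compile(r"[A-Z]{5}\d{7}")),
]


def classify_genbank_accession(accession):
    """Return the category whose format the accession matches, or None."""
    # Drop an optional version suffix such as ".1" before matching.
    core = accession.split(".")[0].upper()
    for label, pattern in GENBANK_PATTERNS:
        if pattern.fullmatch(core):
            return label
    return None


# Placeholder strings that merely follow each format.
examples = ["U12345", "AF123456.1", "AAA12345", "ABCD01000001", "ABCDE1234567"]
labels = [classify_genbank_accession(acc) for acc in examples]
```

`fullmatch` is used so that a string is accepted only if the whole accession fits a pattern, not just its beginning.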
MGA (terminated now)
● To accept large-scale sequence data that provide useful information for
annotation of genome assemblies/sequences, INSDC created a new
category: Mass sequence for Genome Annotation (MGA).
● MGA is defined as those sequences which are produced in large quantity
for the purpose of genome annotation.
● Data acceptable to the MGA category of INSDC are those
which include useful biological features for genome annotation (e.g. start or
end terminus of a transcript).
● A large quantity here means that the number of sequences in one
resource is 10,000 or more.
Accession Number prefixes
● Where are the sequences from?
● D, AB, LC for DDBJ direct submissions
● V, X, Y, Z, AJ, AM, FM, FN, HE, HF, HG, FO, LK, LL, LM, LN, LO, LR, LS, LT for
EMBL direct submissions
● U, AF, AY, DQ, EF, EU, FJ, GQ, GU, HM, HQ, JF, JN, JQ, JX, KC, KF,
KJ, KM, KP, KR, KT, KU, KX, KY, MF for GenBank direct submissions
● In the same way, genome projects, WGS, TPA, EST and protein records have
been assigned different accession number prefixes.
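A prefix lookup makes the idea concrete. The table below contains only the direct-submission prefixes listed on this slide, so any other prefix returns None; the function name and the test accessions are hypothetical.

```python
from itertools import takewhile

# Direct-submission prefixes as listed on the slide (not a complete list).
DIRECT_SUBMISSION_PREFIXES = {
    "DDBJ": {"D", "AB", "LC"},
    "EMBL": {"V", "X", "Y", "Z", "AJ", "AM", "FM", "FN", "HE", "HF", "HG",
             "FO", "LK", "LL", "LM", "LN", "LO", "LR", "LS", "LT"},
    "GenBank": {"U", "AF", "AY", "DQ", "EF", "EU", "FJ", "GQ", "GU", "HM",
                "HQ", "JF", "JN", "JQ", "JX", "KC", "KF", "KJ", "KM", "KP",
                "KR", "KT", "KU", "KX", "KY", "MF"},
}


def submission_database(accession):
    """Return the database a direct submission came from, or None."""
    # The prefix is the run of letters before the first numeral.
    prefix = "".join(takewhile(str.isalpha, accession.upper()))
    for database, prefixes in DIRECT_SUBMISSION_PREFIXES.items():
        if prefix in prefixes:
            return database
    return None


# Placeholder accessions that only demonstrate the prefix lookup.
origins = [submission_database(acc) for acc in ("AB123456", "X12345", "AY123456")]
```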
RefSeq Accession format

● The RefSeq projects are NCBI sequence annotation projects and are not part
of DDBJ/EMBL/GenBank.

● RefSeq accession numbers can be distinguished from GenBank accessions by
their distinct format: an underscore ("underbar") in the third position.
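As a minimal sketch, the third-position rule can be checked as follows. The function name is made up, and the check assumes the two characters before the underscore are letters; it says nothing about whether the accession actually exists.

```python
def looks_like_refseq(accession):
    """True if the accession has two letters followed by an underscore."""
    return (
        len(accession) > 3
        and accession[:2].isalpha()
        and accession[2] == "_"
    )


# Strings in RefSeq-like and GenBank-like formats, used only as examples.
checks = {acc: looks_like_refseq(acc) for acc in ("NM_000546", "AF123456", "P12345")}
```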
UniProtKB Accession
● This subsection of the ‘Entry information’ section provides one or more
accession number(s). These are stable identifiers and should be used to cite
UniProtKB entries.
● Upon integration into UniProtKB, each entry is assigned a unique accession
number, which is called ‘Primary (citable) accession number’.
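● UniProtKB accession numbers consist of 6 or 10 alphanumerical characters in
the format (patterns read off the regular expression that follows):

Pattern 1 (6 characters):  [OPQ] [0-9] [A-Z0-9] [A-Z0-9] [A-Z0-9] [0-9]
Pattern 2 (6 characters):  [A-NR-Z] [0-9] [A-Z] [A-Z0-9] [A-Z0-9] [0-9]
Pattern 3 (10 characters): [A-NR-Z] [0-9] [A-Z] [A-Z0-9] [A-Z0-9] [0-9] [A-Z] [A-Z0-9] [A-Z0-9] [0-9]
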
The three patterns can be combined into the following regular expression:

[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}
Examples: A2BC19, P12345, A0A023GPI8
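The regular expression above can be used as is to check whether a string has the shape of a UniProtKB accession. The sketch below assumes Python's `re` module; the function name is made up, and a match only shows that the format is valid, not that the entry exists.

```python
import re

# The regular expression quoted above, compiled unchanged.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_uniprot_accession(accession):
    """True if the whole string matches the UniProtKB accession format."""
    # fullmatch rejects strings with extra characters before or after.
    return UNIPROT_ACCESSION.fullmatch(accession) is not None


# The three examples from the slide, plus two strings that should fail.
results = {
    acc: is_valid_uniprot_accession(acc)
    for acc in ("A2BC19", "P12345", "A0A023GPI8", "B12345", "P1234")
}
```

"B12345" fails because accessions starting with a letter outside O, P and Q need a letter in the third position, and "P1234" is too short.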
Accession number
● Entries can have more than one accession number. This can be due to
two distinct mechanisms:
a) When two or more entries are merged, the accession numbers from all
entries are kept. The first accession number is referred to as the ‘Primary
(citable) accession number’, while the others are referred to as
‘Secondary accession numbers’. These are listed in alphanumerical order.
b) If an existing entry is split into two or more entries (‘demerged’), new
‘primary’ accession numbers are attributed to all the split entries while
all original accession numbers are retained as ‘secondary’ accession
numbers.
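The two mechanisms can be sketched with plain lists in which the first element is the primary accession. This is a toy model, assuming the caller decides which accession stays primary after a merge; the function names and the new accessions in the demerge example are made up.

```python
def merge_entries(primary, entries):
    """Merge entries: keep every accession, list secondaries in order."""
    all_accessions = {acc for entry in entries for acc in entry}
    if primary not in all_accessions:
        raise ValueError("primary accession must come from a merged entry")
    # Primary first, then the secondary accessions in alphanumerical order.
    return [primary] + sorted(all_accessions - {primary})


def demerge_entry(entry, new_primaries):
    """Split an entry: each part gets a new primary, old ones stay secondary."""
    secondaries = sorted(set(entry))
    return [[new_primary] + secondaries for new_primary in new_primaries]


# The slide's example accessions, used only to show the bookkeeping.
merged = merge_entries("P12345", [["P12345"], ["A2BC19"], ["A0A023GPI8"]])
# "Q11111" and "Q22222" are invented placeholders for newly assigned numbers.
split = demerge_entry(["P12345", "A2BC19"], ["Q11111", "Q22222"])
```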
