An embarras de richesses
The proliferation of primary sequence dbs gives rise to a
number of questions:
Do they all have the same format?
Which is the most accurate?
Which is the most up-to-date?
Which is the most comprehensive?
Given the choice, which should we use?
Of the protein sequence dbs, NRL-3D is the least
comprehensive because it reflects only the contents of PDB,
yet it has the advantage of relating directly to structural
information.
PIR (1-4) is the most comprehensive resource, but the quality of
its annotations is still relatively poor.
SWISS-PROT, on the other hand, is a highly structured db that
provides excellent annotations, but its sequence coverage is
poor compared to PIR.
Choosing the right db to search can seem an impossible
task; so is it, perhaps, better to search them all?
Composite Protein Sequence Dbs
One solution to the problem of proliferating primary dbs is to
compile a composite, i.e. a db that amalgamates a variety of
different primary sources.
Composite dbs: These dbs render sequence searching much
more efficient, because they obviate the need to interrogate
multiple resources.
The interrogation process is streamlined still further if the
composite has been designed to be non-redundant, as this
means that the same sequence need not be searched more
than once.
Different strategies can be used to create composite
resources.
The final product depends on the chosen data sources and the
criteria used to merge them; e.g.
A composite resource will be non-identical if it eliminates only
identical sequence copies during the amalgamation process.
But if both identical and highly similar sequences are removed
(e.g. those entries that differ by only one residue), then the
resulting db will be more truly non-redundant.
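The difference between the two merging criteria can be illustrated with a minimal Python sketch. It is not the procedure used by any of the composites described here: the entry identifiers, the toy sequences, the function names and the greedy "keep the first copy seen" rule are all assumptions made for illustration, and "highly similar" is simplified to mean equal-length sequences that differ by at most one substituted residue (insertions and deletions are ignored).

```python
from typing import List, Tuple

Entry = Tuple[str, str]  # (identifier, sequence); hypothetical toy format


def differs_by_at_most(a: str, b: str, max_diff: int) -> bool:
    """True if two equal-length sequences differ at no more than max_diff positions."""
    if len(a) != len(b):
        return False  # simplification: length differences are never "similar"
    diffs = 0
    for x, y in zip(a, b):
        if x != y:
            diffs += 1
            if diffs > max_diff:
                return False
    return True


def merge_non_identical(sources: List[List[Entry]]) -> List[Entry]:
    """Merge sources, keeping only the first copy of each exact sequence."""
    seen = set()
    merged = []
    for source in sources:
        for ident, seq in source:
            if seq not in seen:
                seen.add(seq)
                merged.append((ident, seq))
    return merged


def merge_non_redundant(sources: List[List[Entry]], max_diff: int = 1) -> List[Entry]:
    """Additionally drop sequences within max_diff substitutions of a kept one."""
    kept: List[Entry] = []
    for ident, seq in merge_non_identical(sources):
        if not any(differs_by_at_most(seq, k, max_diff) for _, k in kept):
            kept.append((ident, seq))
    return kept


# Invented toy data standing in for two primary sources.
source_a = [("A1", "MKTAYIAK"), ("A2", "GAVLIPFW")]
source_b = [("B1", "MKTAYIAK"),   # identical to A1
            ("B2", "MKTAYIAR"),   # one residue different from A1
            ("B3", "GAVLIPF")]    # shorter than A2, so treated as distinct

non_identical = merge_non_identical([source_a, source_b])
non_redundant = merge_non_redundant([source_a, source_b])

print([i for i, _ in non_identical])  # ['A1', 'A2', 'B2', 'B3']
print([i for i, _ in non_redundant])  # ['A1', 'A2', 'B3']
```

In this sketch the order of the sources decides which copy survives, so a real composite would also need a rule for which source takes priority.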
The choice of different sources and the application of different
redundancy criteria have led to the emergence of different
composites, each of which has its own particular format.
The main dbs are outlined below.
NRDB: The Non-Redundant Db is built at the NCBI.
The db is a composite of GenPept (derived from automatic
GenBank CDS translations), PDB sequences, SWISS-PROT,
SPupdate (the weekly updates of SWISS-PROT), PIR and
GenPeptupdate (the daily updates of GenPept).
This db is thus comprehensive and contains up-to-date
information.
However, strictly speaking, it is not non-redundant but
non-identical, i.e. only identical sequence copies are removed
from the resource.
OWL: This non-redundant protein sequence db is built at the
University of Leeds in collaboration with the Daresbury
Laboratory in Warrington.
The db is a composite of four major primary sources:
SWISS-PROT, PIR 1-4, GenBank (CDS translations) and NRL-3D.
MIPSX: This merged db is produced at the Max-Planck Institut in
Martinsried.
The db contains information from the following resources: PIR
1-4; MIPS preliminary entries (MIPSOwn); MIPS/PIR
preliminary entries (PIRMOD); MIPS preliminary translations
(MIPSTrn); MIPS yeast entries (MIPSH); NRL-3D; SWISS-PROT;
EMTrans; GBTrans; Kabat; and PSeqIP.
SWISS-PROT + TrEMBL: At the EBI, the combination of SWISS-
PROT and TrEMBL provides a resource that is both
comprehensive and minimally redundant.
This db has the advantage of containing fewer errors than do
those mentioned above.