
Database

We need to organize biological data into databases because scientists produce large amounts of it. Sharing data in databases allows it to help other researchers, even if a particular piece of data was not useful for the original scientist's paper. Biological databases organize data into standardized records with fields for items like identifiers, sequences, descriptions, and references. This allows the data to be stored efficiently and shared in easy-to-access ways.

Uploaded by

filymascolo
Copyright
© All Rights Reserved

BIOLOGICAL DATABASE

We need to organize our data because we produce a lot of it. Scientific literature is not really a
way to share data; it is used to share stories, in which the data come with our personal
interpretation and are not the main part. We need to share data in easy ways. Some data may not be
useful for my paper, so I will not use them and they will be lost, but in a database they can still
help other scientists with their research.
DATABASE ORGANIZATION
Flat collection. A database is a flat collection of items that are all of the same type, with the
same features in common. For a protein I certainly want to store the sequence, but that is not
enough: I also want to know when it was sequenced, how, where, etc. EMBL Databank: the idea is to
store each item in a plain-text file with tagged lines, so that the data are kept in an optimized
form. How could a human record a gene of 20,000 characters by hand?
A database is a collection of files, each of which is the record of one protein. Each file is
organized in a smart and economical way. Every piece of information is on its own line, labeled
with what it is, so it is readable by programs as well as by humans. A field is typically one line
(e.g. ID). A single field can also carry several updates, which was not possible before. If there
is no space left on line 1 for the description, I keep writing on the next line, and so on; that
next line is not another description of the same data but the continuation of the previous one,
which did not fit in the space available.
Example of an EMBL record: ID, accession number, dates, description, keywords, taxonomy, reference
block 1, reference block 2, reference block 3 (grouped under references), comment.
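The one-tag-per-line layout and the continuation rule described above can be sketched in a few lines of Python. This is only a minimal illustration: the record content is invented, the two-letter codes are modeled on the EMBL style, and `parse_flat_record` is a hypothetical helper, not part of any official tool.

```python
# Invented record, for illustration only. Each line starts with a
# two-letter code; "//" closes the record.
RECORD = """\
ID   EXAMPLE1
AC   X00000
DT   01-JAN-1985 (created)
DT   01-JAN-1990 (last updated)
DE   Hypothetical example gene, written to show that a long
DE   description simply continues on the next line.
KW   example; demo
//
"""


def parse_flat_record(text):
    """Group the lines of one record by their leading line code."""
    fields = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-record marker
        code, _, value = line.partition(" ")
        # A repeated code means either a continuation (DE) or a
        # further entry of the same kind (DT), so collect them in order.
        fields.setdefault(code, []).append(value.strip())
    return fields


fields = parse_flat_record(RECORD)
# Continuation lines are joined back into a single description.
description = " ".join(fields["DE"])
print(fields["ID"][0])
print(description)
```

Keeping every value in a list preserves both cases from the notes: the two DT lines stay separate entries, while the DE lines are joined into one description.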
At the very least we want the sequence, so the record stores the sequence itself in its own field.
It is written so that humans, not only machines, can read it: the sequence is split into groups of
10 nucleotides separated by spaces, and at the end of each line is the index of the last nucleotide
on that line (60, 120, 180, 240…). An additional field may be CDS (coding sequence), which tells
where the coding region lies (e.g. 20-1729), so that a program that does translation can use this
information. A file written in the 1980s is still readable.
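As a rough sketch of the two ideas above, the Python below prints a sequence in groups of 10 with a running index at the end of each line, and uses a CDS range to translate the coding region. The function names and the toy sequence are assumptions made for illustration; the codon table is the standard genetic code, and coordinates are taken as 1-based and inclusive, as in "20-1729".

```python
def format_sequence(seq, group=10, per_line=60):
    """Lay out a sequence in blocks of `group`, with the running index."""
    width = per_line + per_line // group - 1  # 6 blocks of 10 + 5 spaces
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        blocks = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        end_index = start + len(chunk)  # index of the last base on this line
        lines.append(f"{' '.join(blocks):<{width}} {end_index:>9}")
    return "\n".join(lines)


# Standard genetic code, built from the usual T, C, A, G ordering.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(seq, start, end):
    """Translate the region start..end (1-based, inclusive) of `seq`."""
    coding = seq[start - 1:end].lower()
    protein = []
    for i in range(0, len(coding) - 2, 3):
        amino = CODON_TABLE.get(coding[i:i + 3], "X")  # X = unknown codon
        if amino == "*":  # stop codon ends the protein
            break
        protein.append(amino)
    return "".join(protein)


toy = "cc" + "atggcttgttaa" + "gg"  # invented sequence, CDS at 3-14
print(format_sequence("a" * 70))
print(translate_cds(toy, 3, 14))
```

Without the CDS range the program would have to guess the reading frame; with it, the slice starts exactly at the first codon.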
DDBJ (DNA Data Bank of Japan) is the national sequence repository of Japan. At a certain point the
databases stopped competing with each other and started collaborating, forming the INSDC, the
International Nucleotide Sequence Database Collaboration (DDBJ, EMBL, NCBI…).
ENA: European Nucleotide Archive
EMBL DB + SRA (Sequence Read Archive) = ENA {look into this story: it is sure to be asked, Luca said so}
Today we usually retrieve the information through a web interface. It is important to distinguish
the web interface from the database: the interface is only a program that answers a query. Before
the internet there were no nightly updates; updates were monthly or quarterly, and data did not
travel over the internet but in suitcases on trains.
PROTEIN SEQUENCE DATABASE – DATABANKS
Electronic version of the Atlas of Protein Sequence and Structure (1965?).
Swiss-Prot (1986): usable only for sequences rich in annotation (description of function, domain
structure…).
TrEMBL (1996): useful for proteins that are not in Swiss-Prot and have no annotation, e.g. proteins
that could exist but have not been found or proven to exist.
SECONDARY DATABASE
E.g. TrEMBL does not come from experimental work but from the translation of already existing
nucleotide sequences.
If I sequence a 20-nt fragment, someone else a gene, another a mitochondrial genome, another some
tRNAs, there is a lot of chaos, and we need to bring order to this disorder by putting the pieces
together. E.g. if I put every sequencing result into one database, it may look as if a species has
billions of genes, but that is not possible. How many genes does a human or a chimpanzee have?
Certainly not billions: there is a sort of redundancy. I put all this material into a program and
it organizes it.
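To make the redundancy idea concrete, here is a deliberately naive Python sketch: it drops every sequence that is identical to, or fully contained in, a longer one already kept. This is an assumption made for illustration only; real secondary databases rely on alignment and clustering, not on exact substring matches, and `remove_redundant` is a hypothetical name.

```python
def remove_redundant(sequences):
    """Keep only sequences not contained in a longer kept sequence."""
    kept = []
    # Longest first, so fragments are always compared to longer entries.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Invented submissions: a gene, a fragment of it, a duplicate, and
# an unrelated sequence.
submissions = ["atgcatgc", "gcat", "atgcatgc", "tttt"]
print(remove_redundant(submissions))
```

Four submissions collapse to two entries, which is the same effect, on a tiny scale, as going from "billions of genes" to a realistic gene count.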
HOW TO ORGANIZE DATA
It is not so much an informatics issue as a logic problem. We have to make data easily
understandable; today we produce a huge amount of data, and thanks to a single experiment we can
produce in an afternoon more data than used to be produced in a year. We can study the expression
of several genes in several cell lines under several conditions.
A newer model is to organize the entries with an index. Take people as an example: some of them may
share the same house number (OK, let's pretend so) and the same address. Each address appears in
the database only once, not 4 or 5 times, and is associated with the 4 or 5 indexes of the 4 or 5
people who live there. If I then ask the database “who lives in Via Roma 21?”, the query compares
Via Roma 21 only once. In this way I reduce the computational load and speed up the search.
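The address example can be sketched with a Python dictionary used as the index. The names and addresses are invented, and `build_index` and `who_lives_at` are hypothetical helpers; the point is only that each address is stored once and points to the row numbers of the people living there.

```python
# Invented data: one row per person, addresses repeated in the raw table.
people = ["Anna", "Bruno", "Carla", "Dario", "Elena"]
addresses = [
    "Via Roma 21",
    "Via Roma 21",
    "Via Milano 3",
    "Via Roma 21",
    "Via Milano 3",
]


def build_index(values):
    """Map each distinct value to the list of rows where it occurs."""
    index = {}
    for row, value in enumerate(values):
        index.setdefault(value, []).append(row)
    return index


address_index = build_index(addresses)


def who_lives_at(address):
    """Answer the query with one lookup instead of scanning every row."""
    return [people[row] for row in address_index.get(address, [])]


print(who_lives_at("Via Roma 21"))
```

The index holds two addresses instead of five rows, so the query touches "Via Roma 21" once and then simply follows the stored row numbers.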
