COVID-19 DNA sequence data analysis using Python.
Major Modules Used:
Biopython
Squiggle
Pandas
Importing Modules:
from __future__ import division
from [Link] import ProtParam
import warnings
import pandas as pd
from Bio import SeqIO
from [Link] import CodonTable
We will use [Link] from Biopython to parse the
DNA sequence data (FASTA). It provides a simple,
uniform interface for reading and writing assorted
sequence file formats.
for sequence in [Link](r'[Link]', "fasta"):
    print([Link])
    print(len(sequence), 'nucleotides')
DNAsequence = [Link](r'[Link]', "fasta")
print(DNAsequence)
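To make clear what parsing a FASTA file involves, here is a minimal pure-Python sketch. It is not Biopython's implementation; the function name read_fasta and the sample record are hypothetical and exist only for illustration. A FASTA record is a header line starting with ">" followed by one or more lines of sequence.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical record; a real genome file is far longer.
sample = io.StringIO(">example_record made-up description\nATTAAAGGTT\nACGTACGTAC\n")
records = list(read_fasta(sample))
for header, seq in records:
    record_id = header.split()[0]  # the ID is the first word of the header
    print(record_id, len(seq), "nucleotides")
```

In practice the Biopython parser is preferable, since it also handles other file formats and returns richer record objects.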
Since the input sequence is DNA (FASTA) and the
coronavirus is an RNA virus, we need to:
Transcribe DNA to RNA (ATTAAAGGTT… =>
AUUAAAGGUU…)
Translate RNA to an amino acid sequence
(AUUAAAGGUU… => IKGLYLPR*Q…)
In the current scenario, the .fna file starts with
ATTAAAGGTT. Calling transcribe() replaces each T
(thymine) with U (uracil), so we get an RNA
sequence that starts with AUUAAAGGUU. In other
words, the transcribe() method converts the DNA to
mRNA.
DNA = [Link]
mRNA = [Link]()
print(mRNA)
print('Size : ', len(mRNA))
The difference between the DNA and the mRNA is
just that the bases T (for Thymine) are replaced
with U (for Uracil).
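The substitution described above can be sketched in plain Python. This is only an illustration of the idea, assuming the input is the coding strand written with the letters A, C, G and T; the helper name transcribe_dna is hypothetical and is not the Biopython method.

```python
def transcribe_dna(dna):
    """Return the RNA form of a coding-strand DNA string (every T becomes U)."""
    return dna.upper().replace("T", "U")


dna = "ATTAAAGGTT"  # the first ten bases quoted in the text
mrna = transcribe_dna(dna)
print(mrna)
print("Size : ", len(mrna))  # transcription does not change the length
```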
Next, we translate the mRNA sequence into an
amino acid sequence using the translate() method.
We get something like IKGLYLPR*Q, where * marks a
so-called STOP codon, which effectively acts as a
separator between proteins.
Amino_Acid = [Link](table=1, cds=False)
print('Amino Acid', Amino_Acid)
print("Length of Protein:", len(Amino_Acid))
print("Length of Original mRNA:", len(mRNA))
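To show what translation does codon by codon, here is a small pure-Python sketch that uses the standard genetic code (the same code as table 1). It is an illustration, not Biopython's implementation: the names CODON_TABLE and translate_mrna are hypothetical, and trailing bases that do not fill a complete codon are simply ignored here.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_mrna(mrna):
    """Translate an mRNA string three bases at a time; '*' marks a stop codon."""
    usable = len(mrna) - len(mrna) % 3  # drop an incomplete final codon
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


protein = translate_mrna("AUUAAAGGUU")  # AUU AAA GGU, plus one leftover base
print(protein)
print(translate_mrna("AUGGCCUAA"))      # a start codon, one codon, a stop codon
```

Because each amino acid is encoded by three bases, the translated sequence is about one third as long as the mRNA, which is what the two length printouts above are meant to show.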
The standard genetic code is traditionally
represented as an RNA codon table because, when
proteins are made in a cell by ribosomes, it is
mRNA that directs protein synthesis. The mRNA
sequence is determined by the sequence of
genomic DNA. Here are some features of codons:
Most codons specify an amino acid
Three “stop” codons mark the end of a protein
One “start” codon, AUG, marks the beginning of a
protein and also encodes the amino acid
methionine.
Part of a messenger RNA (mRNA) molecule can be
read as a series of codons. Each codon consists of
three nucleotides, usually corresponding to a single
amino acid. The nucleotides are abbreviated with
the letters A, U, G, and C. This is mRNA, which
uses U (uracil); DNA uses T (thymine) instead. The
mRNA molecule instructs a ribosome to
synthesize a protein according to this code.
print(CodonTable.unambiguous_rna_by_name['Standard'])
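The codon features listed above can be checked with a short pure-Python sketch. It rebuilds the standard genetic code from a compact string instead of using Biopython; the variable names are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Three codons are stop signals; the other 61 specify an amino acid.
stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
sense_codons = [c for c, aa in codon_table.items() if aa != "*"]
print("Stop codons:", stop_codons)
print("Codons that specify an amino acid:", len(sense_codons))

# The start codon AUG also encodes methionine (M).
print("AUG encodes:", codon_table["AUG"])

# Most amino acids are encoded by more than one codon.
codons_per_amino_acid = Counter(aa for aa in codon_table.values() if aa != "*")
print("Codons for leucine (L):", codons_per_amino_acid["L"])
```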
Now we extract the proteins (chains of amino
acids) by splitting the sequence at each stop
codon, marked by * (asterisk). Then we remove
any sequence fewer than 20 amino acids long, since
that is about the length of the smallest known
functional protein.
Proteins = Amino_Acid.split('*')
df = [Link](Proteins)
[Link]()
print('Total proteins:', len(df))
def conv(item):
    return len(item)
def to_str(item):
    return str(item)
df['sequence_str'] = df[0].apply(to_str)
df['length'] = df[0].apply(conv)
[Link](columns={0: "sequence"}, inplace=True)
[Link]()
functional_proteins = [Link][df['length'] >= 20]
print('Total functional proteins:', len(functional_proteins))
print(functional_proteins.describe())
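The split-and-filter step can also be sketched without pandas. The amino acid string below is made up for illustration, and the names fragments and functional are hypothetical; the 20-residue cutoff is the one used in the text.

```python
# Made-up translated sequence: '*' marks each stop codon.
amino_acids = "MKT*" + "A" * 25 + "**" + "G" * 20 + "*QL"

# Splitting at '*' gives one fragment per stretch between stop codons.
# Two adjacent stop codons produce an empty fragment.
fragments = amino_acids.split("*")
print("Total proteins:", len(fragments))

# Keep only fragments with at least 20 amino acids.
MIN_LENGTH = 20
functional = [frag for frag in fragments if len(frag) >= MIN_LENGTH]
print("Total functional proteins:", len(functional))
print("Lengths kept:", [len(frag) for frag in functional])
```

The empty fragments produced by adjacent stop codons are one reason a length filter is useful before any further analysis.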
Protein analysis with the ProtParam module in
Biopython.
poi_list = []
MW_list = []
for record in Proteins[:]:
    print("\n")
    X = [Link](str(record))
    # Count of each amino acid in this protein
    POI = X.count_amino_acids()
    poi_list.append(POI)
    MW = X.molecular_weight()
    MW_list.append(MW)
    print("Protein of Interest = ", POI)
    # The percentage-based measures are wrapped in try/except
    # so that a ZeroDivisionError does not stop the loop
    try:
        print("Amino acids percent = ", str(X.get_amino_acids_percent()))
    except ZeroDivisionError:
        pass
    print("Molecular weight = ", MW)
    try:
        print("Aromaticity = ", [Link]())
    except ZeroDivisionError:
        pass
    print("Flexibility = ", [Link]())
    try:
        print("Secondary structure fraction = ", X.secondary_structure_fraction())
    except ZeroDivisionError:
        pass
Because the code above produces output for all
775 proteins, we have attached only one of the
output screens.
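Two of the measures above, the amino acid counts and the aromaticity, are simple enough to sketch in plain Python. This is an illustration under stated assumptions, not the ProtParam implementation: the helper name composition and the example sequence are hypothetical, and aromaticity is taken here as the fraction of residues that are F, W or Y. The sketch also suggests why the loop above guards against ZeroDivisionError: an empty fragment has length zero, so any per-length fraction is undefined.

```python
from collections import Counter


def composition(protein):
    """Return (counts, fractions, aromaticity) for a protein string.

    For an empty sequence the fractions are empty and aromaticity is None,
    because dividing by a length of zero is undefined.
    """
    counts = Counter(protein)
    total = len(protein)
    if total == 0:
        return counts, {}, None
    fractions = {aa: n / total for aa, n in counts.items()}
    # Aromatic residues: phenylalanine (F), tryptophan (W), tyrosine (Y).
    aromaticity = sum(counts[aa] for aa in "FWY") / total
    return counts, fractions, aromaticity


counts, fractions, aromaticity = composition("MFWYAAAAAA")  # made-up sequence
print("Protein of Interest = ", dict(counts))
print("Amino acids fraction = ", fractions)
print("Aromaticity = ", aromaticity)

# An empty fragment (two adjacent stop codons) is handled without an error.
print(composition(""))
```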