Perl
tutorial
Working
with
DNA
Sequences
#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out
# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Next, we print the DNA onto the screen
print $DNA;
# Finally, we'll specifically tell the program to exit.
exit;
Concatenating
the
DNA
sequences
#!/usr/bin/perl -w
# Concatenating DNA
# Store two DNA fragments into variables called $DNA1
#and $DNA2
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';
# Print the DNA onto the screen
print "Here are the original two DNA fragments:\n\n";
print $DNA1, "\n";
print $DNA2, "\n\n";
# Concatenate the DNA fragments into a third variable and
#print them Using "string interpolation"
$DNA3 = "$DNA1$DNA2";
print "Here is the new DNA of the two fragments
version 1):\n\n";
print "$DNA3\n\n";
# An alternative way using the "dot operator":
# Concatenate the DNA fragments into a third variable and
# print them
$DNA3 = $DNA1 . $DNA2;
print "Here is the concatenation of the first two fragments
(version 2):\n\n";
print "$DNA3\n\n";
# Print the same thing without using the variable $DNA3
print "Here is the concatenation of the first two fragments
(version 3):\n\n";
print $DNA1, $DNA2, "\n";
exit;
TRANSCRIPTION:
DNA
->
RNA
#!/usr/bin/perl -w
# Transcribing DNA into RNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;
# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to
RNA:\n\n";
print "$RNA\n";
# Exit the program.
exit;
Reverse
Complement
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
# Calculate the reverse complement
# First, copy the DNA into new variable $revcom
# (short for REVerse COMplement)
#
# It doesn't matter if we first reverse the string and then
# do the complementation; or if we first do the
complementation
# and then reverse the string. Same result each time.
# So when we make the copy we'll do the reverse in the same
statement.
$revcom = reverse $DNA;
-----
The DNA is now reversed.. we neeed to complement the bases
in revcom - substitute all bases by their complements.
# A->T, T->A, G->C, C->G
####Attempt 1:
$revcom =~ s/A/T/g;
$revcom =~ s/T/A/g;
$revcom =~ s/G/C/g;
$revcom =~ s/C/G/g;
# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
#################
Does
this
work??
Why?
# See the text for a discussion of tr///
$revcom =~ tr/ACGTacgt/TGCAtgca/;
# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
print "\nThis time it worked!\n\n";
exit;
Reading
Proteins
in
files
#!/usr/bin/perl -w
# Reading protein sequence data from a file
# The filename of the file containing the protein sequence
data
$proteinfilename = 'Name_Of_your_sequence_file.txt';
# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename) || Die ("cannot open
file");
# Now we do the actual reading of the protein sequence data
from the file, by using the angle brackets < and > to get
the input from the filehandle. We store the data into our
variable $protein.
@protein = <PROTEINFILE>;
# Now that we've got our data, we can close the file.
close PROTEINFILE;
# Print the protein onto the screen
print "Here is the protein:\n\n";
print @protein;
exit;
Pattern
matching:
Motifs
and
Loops
Proceed
ONLY
if
condition
is
true...
code layout..
if (condition)
do something
Finding Motifs
#!/usr/bin/perl -w
# if-elsif-else
$word = 'MNIDDKL';
# if-elsif-else conditionals
if($word eq 'QSTVSGE') {
print "QSTVSGE\n";
} elsif($word eq 'MRQQDMISHDEL') {
print "MRQQDMISHDEL\n";
}
GC
CONTENT
In
PCR
experiments,
the
GC-content
of
primers
are
used
to
predict
their
annealing
temperature
to
the
template
DNA.
A
higher
GC-content
level
indicates
a
higher
melting
temperature.
GC
%
=
G
+
C
x100
A+G+C+T
Logical:
for each base in the DNA
if base is A
count_of_A = count_of_A + 1
if base is C
count_of_C = count_of_C + 1
if base is G
count_of_G = count_of_G + 1
if base is T
count_of_T = count_of_T + 1
done
print count_of_A, count_of_C, count_of_G, count_of_T
the
script
#!/usr/bin/perl -w
# Determining frequency of nucleotides
# Get the name of the file with the DNA sequence data
$dna_filename = File_name.txt;
# Remove the newline from the DNA filename
chomp $dna_filename;
# open the file, or exit
open(DNAFILE, $dna_filename) || die ("Cannot open file
\"$dna_filename\");
exit;
}
# Read the DNA sequence data from the file, and store it
# into the array variable @DNA
@DNA = <DNAFILE>;
# Close the file
close DNAFILE;
# From the lines of the DNA file,
# put the DNA sequence data into a single string.
$DNA = join( '', @DNA);
# Remove whitespace
$DNA =~ s/\s//g;
# Now explode the DNA into an array where each letter of
# the original string is now an element in the array.
# This will make it easy to look at each position.
# Notice that we're reusing the variable @DNA for this
purpose.
@DNA = split( '', $DNA );
# Initialize the counts.
# Notice that we can use scalar variables to hold numbers.
$count_of_A = 0;
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$errors = 0;
# In a loop, look at each base in turn, determine which of
# the four types of nucleotides it is, and increment the
# appropriate count.
foreach $base (@DNA)
{
if ( $base eq 'A' ) {
++$count_of_A;
}
elsif ( $base eq 'C' ) {
++$count_of_C;
}
elsif ( $base eq 'G' ) {
++$count_of_G;
}
elsif ( $base eq 'T' ) {
++$count_of_T;
}
else {
print "!!!!!!!! Error - I don\'t recognize this
base: $base\n";
++$errors;
}
}
# print the results
print "A = $count_of_A\n";
print "C = $count_of_C\n";
print "G = $count_of_G\n";
print "T = $count_of_T\n";
print "errors = $errors\n";
# exit the program
exit;
---using
regex
---
while($DNA =~ /a/ig){$a++}
while($DNA =~ /c/ig){$c++}
while($DNA =~ /g/ig){$g++}
while($DNA =~ /t/ig){$t++}
while($DNA =~ /[^acgt]/ig){$e++}
print "A=$a C=$c G=$g T=$t errors=$e\n";
----
Next
is
a
new
kind
of
loop,
the
foreach loop.
This
loop
works
over
the
elements
of
an
array.
The
line:
foreach $base (@DNA)
Wrtiting
to
files
# Also write the results to a file called "countbase"
$outputfile = "countbase";
(
unless
open(COUNTBASE, ">$outputfile") || die ("Cannot open file
\"$outputfile\" to write to!!\n\n");
print COUNTBASE "A=$a C=$c G=$g T=$t errors=$e\n";
close(COUNTBASE);