DATA STRUCTURES
PYTHON FOR GENOMIC DATA SCIENCE
Lists

A list is an ordered sequence of values:
['gene', 5.16e-08, 0.000138511, 7.33e-08]

You can create a variable to hold this list:
>>> gene_expression=['gene',5.16e-08, 0.000138511, 7.33e-08]

You can access individual list elements. Indices start at 0, and negative indices count from the end of the list:
>>> print(gene_expression[2])
0.000138511
>>> print(gene_expression[-1])
7.33e-08
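As a small sketch of the indexing rules (the names first, last and same_last are introduced here only for illustration), positive indices count from 0 at the front of the list and negative indices count from -1 at the back:

```python
gene_expression = ['gene', 5.16e-08, 0.000138511, 7.33e-08]

first = gene_expression[0]                             # index 0 is the first element
last = gene_expression[-1]                             # index -1 is the last element
same_last = gene_expression[len(gene_expression) - 1]  # equivalent positive index

print(first, last, same_last)
```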
Modifying Lists

You can change an individual list element:
>>> gene_expression[0]='Lif'
>>> print(gene_expression)
['Lif', 5.16e-08, 0.000138511, 7.33e-08]

Don't try to change an element in a string! Unlike strings, which are immutable, lists are a mutable type.
>>> motif ='nacggggtc'
>>> motif[0]='a'
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    motif[0]='a'
TypeError: 'str' object does not support item assignment
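The contrast between the two types can be sketched as follows. The names error_message and fixed_motif are illustrative assumptions, not part of the original slides. Since a string cannot be changed in place, the usual workaround is to build a new string from slices of the old one:

```python
gene_expression = ['gene', 5.16e-08, 0.000138511, 7.33e-08]
gene_expression[0] = 'Lif'       # lists can be changed in place

motif = 'nacggggtc'
error_message = None
try:
    motif[0] = 'a'               # strings cannot be changed in place
except TypeError as err:
    error_message = str(err)

# Build a new string instead: the new first character plus the rest of the old string.
fixed_motif = 'a' + motif[1:]

print(gene_expression)
print(error_message)
print(fixed_motif)
```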
Slicing Lists

You can slice a list (this creates a new list):
>>> gene_expression[-3:]
[5.16e-08, 0.000138511, 7.33e-08]

The following special slice returns a new copy of the list:
>>> gene_expression[:]
['Lif', 5.16e-08, 0.000138511, 7.33e-08]
Assignment to slices is also possible, and this can change the list, including its size:
>>> gene_expression[1:3]=[6.09e-07]
>>> gene_expression
['Lif', 6.09e-07, 7.33e-08]

Assigning an empty list to the full slice clears the list (shown for illustration only; the examples that follow assume it was not run):
>>> gene_expression[:]=[] # this clears the list
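One way to see that a slice is a new list is to compare it with a plain assignment, which only creates a second name for the same list. This is a minimal sketch; the names alias and snapshot are assumptions made for the example:

```python
gene_expression = ['Lif', 5.16e-08, 0.000138511, 7.33e-08]

alias = gene_expression        # a second name for the SAME list
snapshot = gene_expression[:]  # a new list holding the same elements

gene_expression[1:3] = [6.09e-07]  # slice assignment: two elements replaced by one

print(alias)     # follows the change, because it is the same list
print(snapshot)  # keeps the old contents, because it is a copy
```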
Common List Operations

Like strings, lists also support concatenation:
>>> gene_expression+[5.16e-08, 0.000138511]
['Lif', 6.09e-07, 7.33e-08, 5.16e-08, 0.000138511]

The built-in function len() also applies to lists:
>>> len(gene_expression)
3

The del statement can be used to remove elements and slices from a list destructively:
>>> del gene_expression[1]
>>> gene_expression
['Lif', 7.33e-08]
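The difference between concatenation, which builds a new list, and del, which changes the list in place, can be sketched like this. The names longer, size_before and size_after are illustrative only:

```python
gene_expression = ['Lif', 6.09e-07, 7.33e-08]

longer = gene_expression + [5.16e-08, 0.000138511]  # concatenation builds a new list
size_before = len(gene_expression)                   # the original still has 3 elements

del gene_expression[1:]                              # del removes a whole slice in place
size_after = len(gene_expression)

print(longer, size_before, size_after)
```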
Lists As Objects

The list data type has several methods. Among them:

a method to extend a list by appending all the items in a given list:
>>> gene_expression.extend([5.16e-08, 0.000138511])
>>> gene_expression
['Lif', 7.33e-08, 5.16e-08, 0.000138511]

a method to count the number of times an element appears in a list:
>>> print(gene_expression.count('Lif'),gene_expression.count('gene'))
1 0

a method to reverse all elements in a list:
>>> gene_expression.reverse()
>>> gene_expression
[0.000138511, 5.16e-08, 7.33e-08, 'Lif']

You can find all the methods of the list object using the help() function:
>>> help(list)
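A common point of confusion is the difference between extend() and append() when the argument is itself a list. The following sketch (with the illustrative names extended and nested) shows that extend() adds each item separately, while append() adds the whole list as a single element:

```python
extended = ['Lif', 7.33e-08]
extended.extend([5.16e-08, 0.000138511])  # adds the two values one by one

nested = ['Lif', 7.33e-08]
nested.append([5.16e-08, 0.000138511])    # adds the list itself as ONE element

print(len(extended), extended)
print(len(nested), nested)
```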
Lists As Stacks

The list methods append and pop make it very easy to use a list as a stack, where the last element added is the first element retrieved (last-in, first-out).
>>> stack=['a','b','c','d']

To add an item to the top of the stack, use append():
>>> stack.append('e')

To retrieve an item from the top of the stack, use pop():
>>> elem=stack.pop()
>>> elem
'e'
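The last-in, first-out behaviour can be illustrated with a short sketch that pushes the bases of a made-up sequence and then pops them all. The sequence 'acgt' and the name popped are assumptions for this example:

```python
stack = []
for base in 'acgt':
    stack.append(base)          # push each base onto the top of the stack

popped = []
while stack:                    # an empty list is false, so the loop stops when the stack is empty
    popped.append(stack.pop())  # pop always returns the most recently added item

print(popped)
```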
Sorting Lists

There are two ways to sort lists. One way uses the sorted() built-in function, which returns a new sorted list and leaves the original unchanged:
>>> mylist=[3,31,123,1,5]
>>> sorted(mylist)
[1, 3, 5, 31, 123]
>>> mylist
[3, 31, 123, 1, 5]

Another way is to use the list sort() method:
>>> mylist.sort()
>>> mylist
[1, 3, 5, 31, 123]
The sort() method modifies the list!

The elements of the list don't need to be numbers. Strings are compared character by character, and uppercase letters sort before lowercase ones:
>>> mylist=['c','g','T','a','A']
>>> print(sorted(mylist))
['A', 'T', 'a', 'c', 'g']
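Both sorted() and sort() accept the optional key and reverse arguments. As a hedged sketch (the variable names are illustrative), a case-insensitive sort can be obtained by comparing lowercase versions of the strings, and it is worth remembering that sort() returns None rather than the sorted list:

```python
mylist = ['c', 'g', 'T', 'a', 'A']

default_order = sorted(mylist)               # uppercase letters come first
ignore_case = sorted(mylist, key=str.lower)  # compare lowercase versions instead
descending = sorted([3, 31, 123, 1, 5], reverse=True)

result = mylist.sort()                       # sorts in place and returns None

print(default_order, ignore_case, descending, result)
```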
Tuples

A tuple consists of a number of values separated by commas, and is another standard sequence data type, like strings and lists.
>>> t=1,2,3
>>> t
(1, 2, 3)

Tuples may be written with or without surrounding parentheses.
>>> t=(1,2,3)
>>> t
(1, 2, 3)

Tuples share many properties with lists, such as indexing and slicing operations, but while lists are mutable, tuples are immutable, and they usually contain a heterogeneous sequence of elements.
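A typical use of a heterogeneous tuple is a small fixed record. The sketch below uses a hypothetical genomic interval (chromosome, start, end); the values and names are invented for illustration. It shows unpacking, the trailing comma needed for a one-element tuple, and the error raised on attempted modification:

```python
# Hypothetical interval record: (chromosome, start, end)
interval = ('chr1', 100, 200)

chrom, start, end = interval  # unpack the tuple into three variables
length = end - start

single = (5,)                 # a one-element tuple needs the trailing comma

try:
    interval[1] = 150         # tuples do not support item assignment
except TypeError:
    modified = False
else:
    modified = True

print(chrom, length, single, modified)
```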
Sets

A set is an unordered collection with no duplicate elements. Set objects support mathematical operations like union, intersection, and difference.
>>> brca1={'DNA repair','zinc ion binding','DNA binding',
'ubiquitin-protein transferase activity','DNA repair',
'protein ubiquitination'}
>>> brca1
{'DNA repair','zinc ion binding','DNA binding',
'ubiquitin-protein transferase activity','protein ubiquitination'}
The duplicate 'DNA repair' is stored only once, and the order in which the elements are displayed may vary.
>>> brca2={'protein binding','H4 histone acetyltransferase activity',
'nucleoplasm','DNA repair','double-strand break repair',
'double-strand break repair via homologous recombination'}
Operations with Sets

Union:
>>> brca1 | brca2
{'DNA repair','zinc ion binding','DNA binding',
'ubiquitin-protein transferase activity','protein ubiquitination',
'protein binding','H4 histone acetyltransferase activity',
'nucleoplasm','double-strand break repair',
'double-strand break repair via homologous recombination'}

Intersection:
>>> brca1 & brca2
{'DNA repair'}

Difference:
>>> brca1 - brca2
{'zinc ion binding','DNA binding',
'ubiquitin-protein transferase activity','protein ubiquitination'}
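The three operations can be checked with a short sketch that reuses the two annotation sets from the slides. The result names (all_terms, shared, only_brca1) are assumptions for this example. Because sets are displayed in no particular order, the sketch prints sizes and sorted lists rather than relying on display order:

```python
brca1 = {'DNA repair', 'zinc ion binding', 'DNA binding',
         'ubiquitin-protein transferase activity', 'DNA repair',
         'protein ubiquitination'}
brca2 = {'protein binding', 'H4 histone acetyltransferase activity',
         'nucleoplasm', 'DNA repair', 'double-strand break repair',
         'double-strand break repair via homologous recombination'}

all_terms = brca1 | brca2   # union: terms annotated to either gene
shared = brca1 & brca2      # intersection: terms annotated to both genes
only_brca1 = brca1 - brca2  # difference: terms annotated to brca1 only

print(len(brca1), len(brca2), len(all_terms))
print(sorted(shared))
print(sorted(only_brca1))
```

The same results are available through the methods union(), intersection() and difference().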
Dictionaries

A dictionary is a set of key: value pairs, with the requirement that the keys are unique (within one dictionary). Dictionaries were unordered in older Python versions; since Python 3.7 they preserve insertion order.

Keys can be of any immutable type (e.g. strings, numbers). Values can be of any type.
>>> TF_motif = {'SP1' :'gggcgg',
'C/EBP':'attgcgcaat',
'ATF':'tgacgtca',
'c-Myc':'cacgtg',
'Oct-1':'atgcaaat'}
Each key is separated from its value by a colon.
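The requirement that keys be immutable can be seen in a small sketch. The dictionary positions and its contents are hypothetical values chosen for illustration. A tuple works as a key, whereas a list raises a TypeError:

```python
# Hypothetical lookup table keyed by (chromosome, position) tuples
positions = {('chr1', 100): 'geneA'}    # a tuple is immutable, so it can be a key

message = None
try:
    positions[['chr1', 200]] = 'geneB'  # a list is mutable, so it cannot be a key
except TypeError as err:
    message = str(err)

print(positions[('chr1', 100)])
print(message)
```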
Accessing Values From A Dictionary

Use a dictionary key within square brackets to obtain its value:
>>> TF_motif={'SP1' : 'gggcgg', 'C/EBP':'attgcgcaat',
'ATF':'tgacgtca','c-Myc':'cacgtg','Oct-1':'atgcaaat'}
>>> print("The recognition sequence for the ATF transcription factor is %s." % TF_motif['ATF'])
The recognition sequence for the ATF transcription factor is tgacgtca.

Attempting to access a key that is not part of the dictionary produces an error:
>>> print("The recognition sequence for the NF-1 transcription factor is %s." % TF_motif['NF-1'])
Traceback (most recent call last):
  File "<pyshell#291>", line 1, in <module>
    print("The recognition sequence for the NF-1 transcription factor is %s." % TF_motif['NF-1'])
KeyError: 'NF-1'

Check first if a key is present!
>>> 'NF-1' in TF_motif
False
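Besides testing membership with in, the dictionary method get() returns a default value instead of raising KeyError when a key is missing. A minimal sketch, in which the names atf, nf1 and nf1_label and the default string 'unknown' are assumptions for the example:

```python
TF_motif = {'SP1': 'gggcgg', 'C/EBP': 'attgcgcaat', 'ATF': 'tgacgtca',
            'c-Myc': 'cacgtg', 'Oct-1': 'atgcaaat'}

atf = None
if 'ATF' in TF_motif:                        # check first, then index
    atf = TF_motif['ATF']

nf1 = TF_motif.get('NF-1')                   # missing key: get() returns None
nf1_label = TF_motif.get('NF-1', 'unknown')  # or a default of your choice

print(atf, nf1, nf1_label)
```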
Updating A Dictionary

>>> TF_motif={'SP1' : 'gggcgg', 'C/EBP':'attgcgcaat',
'ATF':'tgacgtca','c-Myc':'cacgtg'}

Add a new key:value pair to the dictionary:
>>> TF_motif['AP-1']='tgagtca'
>>> TF_motif
{'ATF': 'tgacgtca', 'c-Myc': 'cacgtg', 'SP1': 'gggcgg',
'C/EBP': 'attgcgcaat', 'AP-1': 'tgagtca'}

Modify an existing entry:
>>> TF_motif['AP-1']='tga(g/c)tca'
>>> TF_motif
{'ATF': 'tgacgtca', 'c-Myc': 'cacgtg', 'SP1': 'gggcgg',
'C/EBP': 'attgcgcaat', 'AP-1': 'tga(g/c)tca'}
Updating A Dictionary (cont'd)

>>> TF_motif
{'ATF': 'tgacgtca', 'c-Myc': 'cacgtg', 'SP1': 'gggcgg',
'C/EBP': 'attgcgcaat', 'AP-1': 'tga(g/c)tca'}

Delete a key from the dictionary:
>>> del TF_motif['SP1']
>>> TF_motif
{'ATF': 'tgacgtca', 'c-Myc': 'cacgtg', 'C/EBP': 'attgcgcaat',
'AP-1': 'tga(g/c)tca'}

Add another dictionary (multiple key:value pairs) to the current one. Note the overlap with the current dictionary: 'C/EBP' is already present, so its value is overwritten rather than added a second time.
>>> TF_motif.update({'SP1': 'gggcgg', 'C/EBP': 'attgcgcaat',
'Oct-1': 'atgcaaa'})
>>> TF_motif
{'ATF': 'tgacgtca', 'c-Myc': 'cacgtg', 'SP1': 'gggcgg',
'C/EBP': 'attgcgcaat', 'Oct-1': 'atgcaaa', 'AP-1': 'tga(g/c)tca'}
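The effect of overlapping keys in update() can be verified with a short sketch. The names new_entries, overlap, size, removed and missing are illustrative assumptions. The sketch also shows pop(), which deletes a key like del does but returns the value, and accepts a default for keys that are absent:

```python
TF_motif = {'ATF': 'tgacgtca', 'c-Myc': 'cacgtg', 'C/EBP': 'attgcgcaat',
            'AP-1': 'tga(g/c)tca'}
new_entries = {'SP1': 'gggcgg', 'C/EBP': 'attgcgcaat', 'Oct-1': 'atgcaaa'}

overlap = sorted(set(TF_motif) & set(new_entries))  # keys present in both dictionaries

TF_motif.update(new_entries)
size = len(TF_motif)                  # 4 + 3 - 1 overlapping key

removed = TF_motif.pop('SP1')         # deletes the key and returns its value
missing = TF_motif.pop('NF-1', None)  # the default avoids a KeyError

print(overlap, size, removed, missing)
```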
Listing All Elements In A Dictionary

The size of a dictionary can be easily obtained by using the built-in function len():
>>> len(TF_motif)
6

It is possible to get a list of all the keys in the dictionary:
>>> list(TF_motif.keys())
['ATF', 'c-Myc', 'SP1', 'C/EBP', 'Oct-1', 'AP-1']

Similarly you can get a list of all the values:
>>> list(TF_motif.values())
['tgacgtca', 'cacgtg', 'gggcgg', 'attgcgcaat', 'atgcaaa',
'tga(g/c)tca']

The lists obtained above are not sorted (their order is arbitrary before Python 3.7 and follows insertion order since then), but if you want them sorted you can use the sorted() function:
>>> sorted(TF_motif.keys())
['AP-1', 'ATF', 'C/EBP', 'Oct-1', 'SP1', 'c-Myc']
>>> sorted(TF_motif.values())
['atgcaaa', 'attgcgcaat', 'cacgtg', 'gggcgg', 'tga(g/c)tca',
'tgacgtca']
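Keys and values can also be visited together with the items() method. The following sketch (the names lines, name and motif are assumptions) walks through the dictionary in sorted key order and builds one tab-separated line per transcription factor:

```python
TF_motif = {'ATF': 'tgacgtca', 'c-Myc': 'cacgtg', 'SP1': 'gggcgg',
            'C/EBP': 'attgcgcaat', 'Oct-1': 'atgcaaa', 'AP-1': 'tga(g/c)tca'}

lines = []
for name, motif in sorted(TF_motif.items()):  # (key, value) pairs, sorted by key
    lines.append(name + '\t' + motif)

print('\n'.join(lines))
```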
Sequence Data Types Comparison

For strings, "Not possible" means not possible in place; the expression shown builds a new string instead.

Action                      | Strings                             | Lists                          | Dictionaries
Creation                    | "...", '...', """..."""             | [a, b, ..., n]                 | {keya: a, keyb: b, ..., keyn: n}
Access to an element        | s[i]                                | L[i]                           | D[key]
Membership                  | c in s                              | e in L                         | key in D
Remove an element           | Not possible; s = s[:i]+s[i+1:]     | del L[i]                       | del D[key]
Change an element           | Not possible; s = s[:i]+new+s[i+1:] | L[i]=new                       | D[key]=new
Add an element              | Not possible; s = s+new             | L.append(e)                    | D[newkey]=val
Remove consecutive elements | Not possible; s = s[:i]+s[k:]       | del L[i:k]                     | Not possible (not ordered), but D.clear() removes all
Change consecutive elements | Not possible; s = s[:i]+news+s[k:]  | L[i:k]=Lnew                    | Not possible
Add more than one element   | Not possible; s = s+news            | L.extend(newL) or L = L + Lnew | D.update(newD)
Adapted from [Link]
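The string workarounds in the comparison can be tried out on a small made-up sequence. In this sketch the string 'acgt', the indices i and k, and the result names are all assumptions chosen for illustration. Each expression builds a new string and leaves the original untouched:

```python
s = 'acgt'
i, k = 1, 3

removed_one = s[:i] + s[i+1:]        # drop the character at index i
changed_one = s[:i] + 'T' + s[i+1:]  # replace the character at index i
removed_run = s[:i] + s[k:]          # drop the characters at indices i..k-1

print(removed_one, changed_one, removed_run, s)
```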