Python Module 3
MODULE III
3.1 LISTS
A list is a sequence, Lists are mutable, Traversing a list, List operations, List slices, List Methods,
Deleting elements, Lists and functions, Lists and strings, Parsing lines, Objects and values , Aliasing,
List arguments, Debugging
3.2 DICTIONARIES
Introduction, Dictionary as a set of counters, Dictionaries and files, Looping and Advanced text
parsing, Debugging
3.3 TUPLES
Tuples are immutable, Comparing tuples, Tuple assignment, Dictionaries and tuples, Multiple
assignment with dictionaries, The most common words, Using tuples as keys in dictionaries, Sequences:
strings, lists, and tuples, Debugging
Character matching in regular expressions, Extracting data using regular expressions, Combining searching
and extracting, Escape character, Summary, Bonus section for Unix / Linux users
Python Application Programming (17CS664) EPCET
MODULE III
3.1 LISTS
A list is an ordered sequence of values.
It is a data structure in Python. The values inside a list can be of any type (integers, floats,
strings, lists, tuples, dictionaries etc.) and are called elements or items.
For example,
Here, ls1 is a list containing four integers, and ls2 is a list containing three strings.
A list need not contain data of same type.
We can have mixed type of elements in list.
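As a minimal sketch of the points above, the following lines build a list of four integers, a list of three strings, and a list of mixed types. The values themselves are illustrative assumptions and are not taken from the original notes.

```python
# Illustrative values only (assumed for this sketch)
ls1 = [10, -4, 25, 13]               # a list containing four integers
ls2 = ["Tiger", "Lion", "Cheetah"]   # a list containing three strings
mixed = [3, 4.5, "hi", [1, 2]]       # elements need not share a type

print(len(ls1), len(ls2))                  # 4 3
print([type(x).__name__ for x in mixed])   # ['int', 'float', 'str', 'list']
```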
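For example,
> ls=[34, 'hi', [2,3], -5]
> print(ls)
[34, 'hi', [2, 3], -5]
An empty list can be created in either of the following ways –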
> ls =[]
> type(ls)
<class 'list'>
or
> ls =list()
> type(ls)
<class 'list'>
In fact, list() is the constructor of the class list (constructors will be discussed in Module 4).
Hence, a new list can be created using it by passing an argument, as shown below –
> ls2=list([3,4,1])
> print(ls2)
[3, 4, 1]
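The elements of a list are accessed using an index within square brackets, and the index starts from 0.
> ls=[34, 'hi', [2,3], -5]
> print(ls[1])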
hi
> print(ls[2])
[2, 3]
Observe here that the inner list is treated as a single element by the outer list. If we would like to
access the elements within the inner list, we need to use double-indexing as shown below –
> print(ls[2][0])
2
> print(ls[2][1])
3
Thus, when we use double-indexing, the first index indicates the position of the inner list inside
the outer list, and the second index indicates the position of a particular value within the inner list.
Unlike strings, lists are mutable. That is, using indexing, we can modify any value within a list.
In the following example, the 3rd element (i.e. index is 2) is being modified –
> ls[2]='Hello'
> print(ls)
A list can be thought of as a relationship between indices and elements. This relationship is
called a mapping. That is, each index maps to one of the elements in the list.
The index for extracting list elements has the following properties –
Any integer expression can be used as an index.
An attempt to access a non-existing index will throw an IndexError.
> print(ls[4])
IndexError: list index out of range
A negative index counts backward from the end of the list.
> print(ls[-1])
-5
> print(ls[-3])
hi
The in operator can be used to check whether a value is present in a list.
> 34 in ls
True
> -2 in ls
False
Traversing a List
A list can be traversed using for loop.
If we need to use each element in the list, we can use the for loop and the in operator as below –
ls=[34, 'hi', [2,3], -5]
for item in ls:
    print(item)
#output is
34
hi
[2, 3]
-5
List elements can be accessed with the combination of the range() and len() functions as well –
ls=[1,2,3,4]
for i in range(len(ls)):
    ls[i]=ls[i]**2
print(ls)
#output is
[1, 4, 9, 16]
Here, we wanted to modify the elements of the list. Hence, referring to indices is more suitable
than referring to the elements directly.
The len() function returns the total number of elements in the list (here it is 4).
Then the range() function makes the loop run over the indices 0 to 3 (i.e. 4-1).
Then, for every index, we update the list element (replacing the original value by its square).
List Operations
Python allows the operators + and * to be used on lists.
The + operator takes two list objects and returns their concatenation.
The * operator takes one list object and one integer value, say n, and returns a list formed by
repeating the original list n times.
> ls1=[1,2,3]
> ls2=[5,6,7]
> print(ls1+ls2) #concatenation using +
[1, 2, 3, 5, 6, 7]
>>> ls1=[1,2,3]
>>> print(ls1*3) #repetition using *
[1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> [0]*4 #repetition using *
[0, 0, 0, 0]
List Slices
Similar to strings, the slicing can be applied on lists as well. Consider a list t given below, and a
series of examples following based on this object.
t=['a','b','c','d','e']
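Extracting the full list without using any index, but only the slicing operator –
>>> print(t[:])
['a', 'b', 'c', 'd', 'e']
Extracting the elements from index 2 up to, but not including, index 4 –
>>> print(t[2:4])
['c', 'd']
Slicing returns a new list and leaves the original list unchanged –
> print(t)
['a', 'b', 'c', 'd', 'e']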
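A few more slice forms on the same list t can be sketched as below. The variable names are chosen only for this illustration; the last part shows that assigning to a slice, unlike reading one, changes the list in place.

```python
t = ['a', 'b', 'c', 'd', 'e']

first_three = t[:3]    # from the start up to (not including) index 3
from_third = t[2:]     # from index 2 to the end
every_other = t[::2]   # every second element
copy_of_t = t[:]       # a new list with the same elements

t[1:3] = ['x', 'y']    # slice assignment modifies t in place

print(first_three)   # ['a', 'b', 'c']
print(from_third)    # ['c', 'd', 'e']
print(every_other)   # ['a', 'c', 'e']
print(t)             # ['a', 'x', 'y', 'd', 'e']
print(copy_of_t)     # ['a', 'b', 'c', 'd', 'e']
```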
List Methods
There are several built-in methods in list class for various purposes. Here, we will discuss some of
them.
append(): This method is used to add a new element at the end of a list.
> ls=[1,2,3]
> ls.append('hi')
> ls.append(10)
> print(ls)
[1, 2, 3, 'hi', 10]
extend(): This method takes a list as an argument, and all the elements of that list are added at the end of
the invoking list.
> ls1=[1,2,3]
> ls2=[5,6]
> ls2.extend(ls1)
> print(ls2)
[5, 6, 1, 2, 3]
sort(): This method is used to sort the contents of the list. By default, it sorts the items in
ascending order.
> ls.sort()
> print(ls)
When we want a list to be sorted in descending order, we need to set the argument reverse as shown –
> ls.sort(reverse=True)
> print(ls)
reverse(): This method reverses the order of the elements in the list.
> ls=[4,3,1,6]
> ls.reverse()
> print(ls)
[6, 1, 3, 4]
count(): This method is used to count the number of occurrences of a particular value within a list.
> ls=[1,2,5,2,1,3,2,10]
> ls.count(2)
3
clear(): This method removes all the elements in the list and makes the list empty.
> ls=[1,2,3]
> ls.clear()
> print(ls)
[]
insert(): Used to insert a value before a specified index of the list.
> ls=[3,5,10]
> ls.insert(1,"hi")
> print(ls)
[3, 'hi', 5, 10]
index(): This method is used to get the index position of a particular value in the list.
> ls=[4, 2, 10, 5, 3, 2, 6]
> ls.index(2)
1
Here, the number 2 is found at index position 1. Note that this method gives the index of only the
first occurrence of the specified value. The same method can be used with two more arguments, start and
end, to specify the range within which the search should take place.
> ls.index(2,3,7)
5
If the value is not present in the list, a ValueError is thrown.
> ls.index(53)
ValueError: 53 is not in list
1. There is a difference between the append() and extend() methods. The former adds its argument as a
single element, whereas the latter adds each element of its argument to the existing list. To understand
this, observe the following example –
> ls1=[1,2,3]
> ls2=[5,6]
> ls2.append(ls1)
> print(ls2)
[5, 6, [1, 2, 3]]
Here, the argument ls1 for the append() function is treated as one item, and made as an inner list to
ls2. On the other hand, if we replace append() by extend() then the result would be –
> ls1=[1,2,3]
> ls2=[5,6]
> ls2.extend(ls1)
> print(ls2)
[5, 6, 1, 2, 3]
2. The sort() method can be applied only when the list contains elements of compatible types. If
a list is a mix of non-compatible types like integers and strings, the comparison cannot be done. Hence,
Python will throw a TypeError.
For example,
> ls=[34, 'hi', -5]
> ls.sort()
TypeError: '<' not supported between instances of 'str' and 'int'
> ls=[34,[2,3],5]
> ls.sort()
TypeError: '<' not supported between instances of 'list' and 'int'
Integers and floats are compatible, and relational operations can be performed on them. Hence, we can
sort a list containing such items.
> ls=[3, 4.5, 2]
> ls.sort()
> print(ls)
[2, 3, 4.5]
3. The sort() method accepts an important optional argument, key. It is useful when a list contains
tuples. We will discuss tuples later in this Module.
4. Most of the list methods like append(), extend(), sort(), reverse() etc. modify the list object
internally and return None.
> ls=[2,3]
> ls1=ls.append(5)
> print(ls)
[2,3,5]
> print(ls1)
None
Deleting Elements
Elements can be deleted from a list in different ways. Python provides a few built-in methods for
removing elements, as given below –
pop(): This method deletes the last element in the list, by default, and returns it.
>>> ls=[3,6,-2,8,10]
>>> x=ls.pop()
>>> print(ls)
[3, 6, -2, 8]
>>> print(x)
10
When an element at a particular index position has to be deleted, we can give that position as an
argument to pop().
>>> t=['a', 'b', 'c']
>>> x=t.pop(1)
>>> print(t)
['a', 'c']
> print(x)
b
remove(): When we don't know the index, but know the value to be removed, this method can be
used.
> ls=[5,8,-12,34,2]
> ls.remove(34)
> print(ls)
[5, 8, -12, 2]
Note that this method removes only the first occurrence of the specified value, not
all occurrences.
> ls=[5,8,-12,34,2,6,34]
> ls.remove(34)
> print(ls)
[5, 8, -12, 2, 6, 34]
Unlike pop(), the remove() method does not return the value that has been deleted.
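del: This statement can be used to delete one item by index, or more than one item at a time by slice.
Here also, we do not get back the items deleted.
>>> ls=[3,6,-2,8,1]
>>> del ls[2] #item at index 2 is deleted
>>> print(ls)
[3, 6, 8, 1]
>>> ls=[3,6,-2,8,1]
>>> del ls[1:4] #items at indices 1 to 3 are deleted
>>> print(ls)
[3, 1]
> t=['a', 'b', 'c', 'd', 'e']
> del t[1::2] #every second item from index 1 is deleted
> print(t)
['a', 'c', 'e']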
When we need to read numbers from the user and compute their sum and average,
we can write the code as below –
ls= list()
while (True):
    x= input('Enter a number: ')
    if x== 'done':
        break
    x= float(x)
    ls.append(x)
average= sum(ls)/len(ls)
print('Average:', average)
As every input from the keyboard is in the form of a string, we need to convert x into float
type and then append it to the list.
When the keyboard input is the string 'done', the loop terminates.
After the loop, we will find the average of those numbers with the help of built-in functions sum()
and len().
> s="hello"
> ls=list(s)
> print(ls)
['h', 'e', 'l', 'l', 'o']
The function list() breaks a string into individual letters and constructs a list.
If we want a list of words from a sentence, we can use the following code –
Note that, when no argument is provided, the split() function takes the delimiter as white space.
If we need a specific delimiter for splitting the lines, we can use as shown in following example –
> dt="20/03/2018"
> ls=dt.split('/')
> print(ls)
['20', '03', '2018']
The method join() is the inverse of split(). It takes a list of strings as argument, and joins all the
strings into a single string based on the delimiter on which it is invoked.
When the delimiter d is taken as a white space, the words are joined with single spaces. Apart from
space, any string can be taken as the delimiter. When we don't need any delimiter, an empty string is
used as the delimiter.
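The use of split() and join() described above can be sketched as follows. The sample sentence is an assumption made for this illustration; d is the delimiter mentioned in the text.

```python
sentence = "Hello how are you"    # assumed sample text
words = sentence.split()          # no argument: split on white space
print(words)                      # ['Hello', 'how', 'are', 'you']

d = " "                           # delimiter used for joining
print(d.join(words))              # Hello how are you
print("-".join(words))            # Hello-how-are-you
print("".join(["a", "b", "c"]))   # abc
```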
Parsing Lines
In many situations, we would like to read a file and extract only the lines containing required
pattern. This is known as parsing.
As an illustration, let us assume that there is a log file containing details of email communication
between employees of an organization.
………………
Apart from such lines, the log file also contains mail-contents, to-whom the mail has been sent etc.
Now, if we are interested in extracting only the days of incoming mails, then we can go for parsing.
That is, we are interested in knowing on which of the days, the mails have been received. The code
would be –
fhand = open('logFile.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From '):
        continue
    words = line.split()
    print(words[2])
Every line recording a received mail starts with the word From. Hence, we search for only such lines
and then split them into words.
Observe that the first word in the line is From, the second word is the email-ID and the
3rd word is the day of the week. Hence, we extract words[2], which is the 3rd word.
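The same parsing logic can be tried without a file. In the sketch below, sample_lines is a hypothetical stand-in for the contents of the log file (the addresses and dates are invented for illustration), and the days are collected in a list instead of being printed one by one.

```python
# Hypothetical sample lines standing in for the contents of a log file
sample_lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008\n",
    "To: bob@example.com\n",
    "From carol@example.com Fri Jan  4 18:10:48 2008\n",
    "Subject: meeting\n",
]

days = []
for line in sample_lines:
    line = line.rstrip()
    if not line.startswith('From '):
        continue                 # skip lines that are not mail headers
    words = line.split()
    days.append(words[2])        # third word is the day of the week

print(days)   # ['Sat', 'Fri']
```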
a= "hi"
b= "hi"
Now, the question is whether both a and b refer to the same string.
There are two possible states –
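[Figure: in the first state, a and b refer to two separate objects, each holding 'hi'; in the second state, a and b refer to one and the same 'hi' object.]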
In the first situation, a and b are two different objects, but containing same value. The
modification in one object is nothing to do with the other.
Whereas, in the second case, both a and b are referring to the same object.
That is, a is an alias name for b and vice-versa. In other words, these two are referring to the same
memory location.
To check whether two variables are referring to the same object or not, we can use the is operator.
> a= "hi"
> b= "hi"
> a is b
True
When two variables are referring to the same object, they are called identical objects.
When two variables are referring to different objects that contain the same value, they are known as
equivalent objects.
For example, for two strings s1 and s2 that hold the same text read from the keyboard –
>>> s1 is s2 #check s1 and s2 are identical
False
>>> s1 == s2 #check s1 and s2 are equivalent
True
If two objects are identical, they are also equivalent, but if they are equivalent, they are not
necessarily identical.
In CPython, short string literals that look like identifiers are interned. That is, when two such
literals with the same value appear in a program, they refer to the same object. This is an
implementation detail and should not be relied upon. Strings read from the keyboard are created at
run time and are generally not interned, so two equal input strings are usually different objects.
Lists are not interned. Hence, we can see the following result –
> ls1=[1,2,3]
> ls2=[1,2,3]
>>> ls1 is ls2 #output is False
>>> ls1 == ls2 #output is True
Aliasing
When an object is assigned to another variable using the assignment operator, both variables refer to
the same object in memory.
> ls1=[1,2,3]
> ls2= ls1
>>> ls1 is ls2 #output is True
Now, ls2 is said to be a reference to the same object as ls1. In other words, there are two references
to the same object in memory.
An object with more than one reference has more than one name, hence we say that the object is
aliased. If the aliased object is mutable, changes made through one alias are visible through the other.
>>> ls2[1]= 34
>>> print(ls1)
#output is [1, 34, 3]
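When an independent list is needed instead of an alias, a copy can be made with the slice operator. The following is a small sketch contrasting the two; the names alias and copy are chosen only for this illustration.

```python
ls1 = [1, 2, 3]
alias = ls1          # second name for the same object
copy = ls1[:]        # new list object with the same elements

alias[1] = 34        # visible through ls1 as well
copy[0] = 99         # does not affect ls1

print(ls1)           # [1, 34, 3]
print(alias is ls1)  # True
print(copy is ls1)   # False
print(copy)          # [99, 2, 3]
```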
List Arguments
When a list is passed to a function as an argument, then function receives reference to this list.
Hence, if the list is modified within a function, the caller will get the modified version.
Consider an example –
def del_front(t):
    del t[0]
ls = ['a', 'b', 'c']
del_front(ls)
print(ls)
# output is
['b', 'c']
Here, the argument ls and the parameter t both are aliases to same object.
One should understand the operations that will modify the list and the operations that create a new
list.
For example, the append() function modifies the list, whereas the + operator creates a new list.
> t1 = [1, 2]
> t2 = t1.append(3)
>>> print(t1) #output is [1, 2, 3]
>>> print(t2) #prints None
>>> t3 = t1 + [5]
>>> print(t3) #output is [1, 2, 3, 5]
>>> t2 is t3 #output is False
Here, after applying append() on t1 object, the t1 itself has been modified and t2 is not going to
get anything.
But, when + operator is applied, t1 remains same but t3 will get the updated result.
The programmer should understand such differences when he/she creates a function intending to
modify a list.
For example, the following function has no effect on the original list –
def test(t):
    t=t[1:]
ls=[1,2,3]
test(ls)
print(ls) #prints [1, 2, 3]
To get the sliced list back, the function can return it and the caller can store the result –
def test(t):
    return t[1:]
ls=[1,2,3]
ls1=test(ls)
print(ls1) #prints [2, 3]
print(ls) #prints [1, 2, 3]
In the above example also, the original list is not modified, because the slice t[1:] creates a new
list, which is returned and assigned to the LHS variable at the position of the function call.
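The difference between a function that modifies its list argument and one that builds a new list can be sketched as below. The function names add_in_place and add_as_new are hypothetical and used only for this illustration.

```python
def add_in_place(t, value):
    """Modify the caller's list: append() changes the object itself."""
    t.append(value)

def add_as_new(t, value):
    """Leave the caller's list alone: + builds and returns a new list."""
    return t + [value]

ls = [1, 2]
add_in_place(ls, 3)
print(ls)            # [1, 2, 3]

ls2 = add_as_new(ls, 4)
print(ls)            # [1, 2, 3]
print(ls2)           # [1, 2, 3, 4]
```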
3.2 DICTIONARIES
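A dictionary is a collection of key:value pairs, with the requirement that keys
are unique within one dictionary. (From Python 3.7 onwards, a dictionary preserves the order in which
its keys were inserted; in earlier versions the order was not guaranteed.)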
Unlike lists and strings where elements are accessed using index values (which are integers),
the values in dictionary are accessed using keys.
A key in dictionary can be any immutable type like strings, numbers and tuples. (The tuple can
be made as a key for dictionary, only if that tuple consist of string/number/ sub-tuples).
As lists are mutable – that is, can be modified using index assignments, slicing, or using
methods like append(), extend() etc, they cannot be a key for dictionary.
One can think of a dictionary as a mapping between set of indices (which are actually keys)
and a set of values.
An empty dictionary can be created using either a pair of braces or the function dict() –
d= {}
OR
d=dict()
> d={}
> d["Mango"]="Fruit"
> d["Banana"]="Fruit"
> d["Cucumber"]="Veg"
> print(d)
{'Mango': 'Fruit', 'Banana': 'Fruit', 'Cucumber': 'Veg'}
To initialize a dictionary at the time of creation itself, one can use code like –
> tel_dir={'Tom': 3491, 'Jerry': 8135}
> print(tel_dir)
{'Tom': 3491, 'Jerry': 8135}
>>> tel_dir['Donald']=4793
>>> print(tel_dir)
{'Tom': 3491, 'Jerry': 8135, 'Donald': 4793}
NOTE that a dictionary is not indexed by position. In versions of Python before 3.7 the order of
elements was unpredictable; from Python 3.7 onwards the insertion order is preserved, but we still
cannot write tel_dir[0] to get the first item, as dictionary members are not indexed over integers.
Using a key, we can extract its associated value as shown below –
> print(tel_dir['Jerry'])
8135
Here, the key 'Jerry' maps to the value 8135, hence it doesn't matter where exactly it is inside
the dictionary.
If we try to access a key that is not in the dictionary, a KeyError is generated.
> print(tel_dir['Mickey'])
KeyError: 'Mickey'
The len() function on dictionary object gives the number of key-value pairs in that object.
>>> print(tel_dir)
{'Tom': 3491, 'Jerry': 8135, 'Donald': 4793}
> len(tel_dir)
3
The in operator can be used to check whether any key (not value) appears in the dictionary object.
> 'Tom' in tel_dir
True
> 3491 in tel_dir
False
We observe from the above example that the value 3491 is associated with the key 'Tom' in tel_dir.
But, the in operator returns False, because 3491 is a value and not a key.
The dictionary object has a method values() which returns a view of all the values associated
with the keys within a dictionary.
If we would like to check whether a particular value exists in a dictionary, we can make use of it as
shown below –
>>> 3491 in tel_dir.values() #output is True
The in operator behaves differently in case of lists and dictionaries as explained hereunder:
When the in operator is used to search for a value in a list, a linear search is performed
internally. That is, each element in the list is checked one by one sequentially. This is
considered to be expensive in view of the total time taken to process.
If there are 1000 items in the list, and the element we are searching for
is in the last position (or does not exist), then before yielding the result of the search (True or
False), we would have done 1000 comparisons.
In other words, linear search requires n comparisons in the worst case for an input of n elements.
When the in operator is used on a dictionary, the key is looked up in a hash table. Hashing
techniques have an average time complexity of O(1) for basic operations like insertion, deletion and
searching, independent of the number of items.
Hence, the in operator applied on keys of dictionaries works better compared to that on lists.
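The cost of a sequential scan can be made visible by counting comparisons. The function linear_search below is a hypothetical helper written for this sketch (it is not how Python implements in internally, only an equivalent procedure); the dictionary squares shows that in checks keys, not values.

```python
def linear_search(items, target):
    """Return (found, comparisons) for a sequential scan of items."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons

numbers = list(range(1000))
print(linear_search(numbers, 999))   # (True, 1000)
print(linear_search(numbers, -1))    # (False, 1000)

squares = {n: n * n for n in numbers}
print(999 in squares)                # True, located through the key's hash
print(998001 in squares)             # False: 998001 is a value, not a key
print(998001 in squares.values())    # True
```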
Dictionary as a Set of Counters
Assume that we need to count the frequency of alphabets in a given string. There are different
methods to do it –
Create 26 variables, one for each alphabet. Traverse the given string and increment the
corresponding counter when an alphabet is found.
Create a list with 26 elements (all zero in the beginning), one for each alphabet. Traverse
the given string and increment the corresponding indexed position in the list when an alphabet is
found.
Create a dictionary with characters as keys and counters as values. When we find a character
for the first time, we add the item to the dictionary. From the next time onwards, we increment the
value of the existing item.
Each of the above methods performs the same task, but the logic of implementation is
different. Here, we will see the implementation using a dictionary.
s=input("Enter a string:") #read a string
d=dict() #create empty dictionary
for ch in s: #traverse through string
    if ch not in d: #if new character found
        d[ch]=1 #initialize counter to 1
    else: #otherwise, increment counter
        d[ch]+=1
print(d) #display the dictionary
A sample run would be –
Enter a string: Hello World
{'H': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'W': 1, 'r': 1, 'd': 1}
It can be observed from the output that a dictionary is created here with characters as keys and
frequencies as values. Note that, here we have computed a histogram of counters.
Dictionary in Python has a method called as get(), which takes key and a default value as two
arguments. If key is found in the dictionary, then the get() function returns corresponding value,
otherwise it returns default value.
For example,
> tel_dir={'Tom': 3491, 'Jerry':8135, 'Mickey':1253}
> print(tel_dir.get('Jerry',0))
8135
> print(tel_dir.get('Donald',0))
0
In the above example, when the get() function is given 'Jerry' as argument, it returns the
corresponding value, as 'Jerry' is found in tel_dir.
Whereas, when get() is used with 'Donald' as key, the default value 0 (which is provided by us) is
returned.
The function get() can be used effectively for calculating the frequency of alphabets in a string.
s=input("Enter a string:")
d=dict()
for ch in s:
    d[ch]=d.get(ch,0)+1
print(d)
In the above program, for every character ch in a given string, we try to retrieve a value.
When ch is found in d, its value is retrieved, 1 is added to it, and the result is stored back.
If ch is not found, 0 is taken as the default and then 1 is added to it.
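The same counting logic can be wrapped in a function so that it can be tried without keyboard input. The function name char_histogram and the sample word are assumptions made for this sketch.

```python
def char_histogram(s):
    """Count how many times each character occurs in s, using get()."""
    d = dict()
    for ch in s:
        d[ch] = d.get(ch, 0) + 1   # 0 is the default for a new character
    return d

print(char_histogram("banana"))   # {'b': 1, 'a': 3, 'n': 2}
```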
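If we want to print keys and values separately, we can traverse the dictionary with a for loop,
which iterates over its keys –
tel_dir={'Tom': 3491, 'Jerry':8135, 'Mickey':1253}
for k in tel_dir:
    print(k, tel_dir[k])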
Output would be –
Tom 3491
Jerry 8135
Mickey 1253
Note that, while accessing items from a dictionary, the keys may not be in alphabetical order. If we
want to print the keys in alphabetical order, we need to make a list of the keys, and then sort that list.
ls=list(tel_dir.keys())
ls.sort()
print("Dictionary elements in alphabetical order:")
for k in ls:
    print(k, tel_dir[k])
Jerry 8135
Mickey 1253
Tom 3491
Note: The key-value pairs of a dictionary can be accessed together with the help of the method items()
as shown –
tel_dir={'Tom': 3412, 'Jerry': 6781, 'Mickey': 1294}
for k,v in tel_dir.items():
    print(k,v)
Tom 3412
Jerry 6781
Mickey 1294
The usage of comma-separated list k,v here is internally a tuple (another data structure in Python,
which will be discussed later).
Now, we need to count the frequency of each word in a text file. So, we need an
outer loop for iterating over the entire file, and an inner loop for traversing each line of the file.
Then in every line, we count the occurrences of a word, as we did before for a character.
The program is given below –
fname=input("Enter file name:")
try:
    fhand=open(fname)
except:
    print("File cannot be opened")
    exit()
d=dict()
for line in fhand:
    for word in line.split():
        d[word]=d.get(word,0)+1
print(d)
The output of this program when the input file is myfile.txt would be –
The punctuation marks like comma, full stop, question mark etc. are also considered as a
part of the word and stored in the dictionary. This means, when a particular word appears in a
file with and without a punctuation mark, there will be multiple entries for that word.
The words 'how' and 'How' are treated as separate words in the above example because of
uppercase and lowercase letters.
While solving problems on text analysis, machine learning, data analysis etc. such kinds of
treatment of words lead to unexpected results. So, we need to be careful in parsing the text and we
should try to eliminate punctuation marks, ignoring the case etc. The procedure is discussed in the
next section.
The string module of Python provides a list of all punctuation marks as shown:
> import string
> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
The str class has a method maketrans() which returns a translation table usable for another
method translate().
Consider the following syntax to understand it more clearly:
line.translate(str.maketrans(fromstr, tostr, deletestr))
The above statement replaces the characters in fromstr with the character in the same position in
tostr and deletes all characters that are in deletestr.
The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.
Using these functions, we will re-write the program for finding the frequency of words in a file.
import string
fname=input("Enter file name:")
try:
    fhand=open(fname)
except:
    print("File cannot be opened")
    exit()
d=dict()
for line in fhand:
    line=line.rstrip()
    line=line.translate(line.maketrans('','',string.punctuation))
    line=line.lower()
    for word in line.split():
        d[word]=d.get(word,0)+1
print(d)
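To see the effect of removing punctuation and ignoring case without preparing a file, the same steps can be applied to a list of lines held in memory. The function name word_counts and the sample text are assumptions made for this sketch.

```python
import string

def word_counts(lines):
    """Count words after stripping punctuation and lowering the case."""
    d = dict()
    table = str.maketrans('', '', string.punctuation)  # delete punctuation
    for line in lines:
        line = line.rstrip().translate(table).lower()
        for word in line.split():
            d[word] = d.get(word, 0) + 1
    return d

sample = ["Hello, how are you?\n", "How are things? Hello!\n"]  # assumed text
print(word_counts(sample))
# {'hello': 2, 'how': 2, 'are': 2, 'you': 1, 'things': 1}
```

Here 'Hello,' and 'Hello!' are counted as the same word, and so are 'how' and 'How'.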
Debugging
When we are working with big datasets (like file containing thousands of pages), it is difficult to
debug by printing and checking the data by hand. So, we can follow any of the following procedures
for easy debugging of the large datasets –
Scale down the input: If possible, reduce the size of the dataset. For example if the program reads a
text file, start with just first 10 lines or with the smallest example you can find. You can either edit
the files themselves, or modify the program so it reads only the first n lines. If there is an error, you
can reduce n to the smallest value that manifests the error, and then increase it gradually as you
correct the errors.
Check summaries and types: Instead of printing and checking the entire dataset, consider printing
summaries of the data: for example, the number of items in a dictionary or the total of a list of
numbers. A common cause of runtime errors is a value that is not the right type. For debugging this
kind of error, it is often enough to print the type of a value.
Write self-checks: Sometimes you can write code to check for errors automatically. For example, if
you are computing the average of a list of numbers, you could check that the result is not greater than the
largest element in the list or less than the smallest. This is called a sanity check because it detects results that
are “completely illogical”. Another kind of check compares the results of two different computations to see if
they are consistent. This is called a consistency check.
Pretty print the output: Formatting debugging output can make it easier to spot an error.
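The ideas of checking summaries and writing self-checks can be sketched as below. The function average and the sample data are assumptions made for this illustration; the assert statement is one possible way to express a sanity check.

```python
def average(numbers):
    """Compute the mean and verify that it lies within the data range."""
    result = sum(numbers) / len(numbers)
    # sanity check: a mean can never fall outside [min, max]
    assert min(numbers) <= result <= max(numbers), "average out of range"
    return result

data = [3, 6, -2, 8, 10]
print(type(data), len(data))   # a summary instead of the full dataset
print(average(data))           # 5.0
```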
3.3 TUPLES
A tuple is a sequence of items, similar to lists.
The values stored in the tuple can be of any type and they are indexed using integers.
Unlike lists, tuples are immutable. That is, values within tuples cannot be modified/reassigned.
Tuples are comparable and hashable objects.
A tuple can be created in Python as a comma separated list of items – may or may not be enclosed
within parentheses.
>>> print(t)
>>> print(t1)
For example,
>>> x=(3) #trying to have a tuple with single item
>>> print(x)
3 #observe, no parenthesis found
>>> type(x)
<class 'int'> #not a tuple, it is integer!!
Thus, to have a tuple with a single item, we must include a comma after the item. That is,
>>> t=3, #or use the statement t=(3,)
>>> type(t) #now this is a tuple
<class 'tuple'>
An empty tuple can be created either using a pair of parentheses or using the function tuple(), as
below –
> t1=()
> type(t1)
<class 'tuple'>
> t2=tuple()
> type(t2)
<class 'tuple'>
If we provide an argument of type sequence (a list, a string or a tuple) to the function tuple(),
then a tuple with the elements of the given sequence will be created:
> t=tuple('Hello')
> print(t)
('H', 'e', 'l', 'l', 'o')
> t=tuple([3,[12,5],'Hi'])
> print(t)
(3, [12, 5], 'Hi')
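If a tuple itself is given as the argument to tuple(), no copy is made –
> t=('Mango', 34, 'hi')
> t1=tuple(t)
> print(t1)
('Mango', 34, 'hi')
> t is t1
True
Note that, in the above example, both t and t1 are referring to the same object in memory. That
is, t1 is a reference to t. (This is the behavior of CPython; it is safe because tuples are immutable.)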
Elements in a tuple can be extracted using square brackets with the help of indices.
Similarly, slicing can also be applied to extract the required number of items from a tuple.
> t=('Mango', 'Banana', 'Apple')
> print(t[1:])
('Banana', 'Apple')
> print(t[-1])
Apple
Modifying a value in a tuple generates an error, because tuples are immutable –
> t[0]='Kiwi'
TypeError: 'tuple' object does not support item assignment
We wanted to replace 'Mango' by 'Kiwi', which did not work using assignment.
But, a tuple can be replaced with another tuple involving the required modifications –
> t=('Kiwi',)+t[1:]
> print(t)
('Kiwi', 'Banana', 'Apple')
Comparing Tuples
Tuples can be compared using operators like >, <, >=, == etc.
For example, when we need to check equality between two tuple objects, the first item in the first
tuple is compared with the first item in the second tuple.
If they are the same, the 2nd items are compared.
The check continues till either a mismatch is found or the items get over.
> (3,4)==(3,4)
True
The meaning of < and > for tuples is not exactly less than and greater than; instead, it means
comes before and comes after.
Hence in such cases, the comparison stops at the first pair of items that differ, and the remaining
items are not considered.
> (1,2,3)<(1,2,5)
True
> (3,4)<(5,2)
True
When we use a relational operator on tuples containing non-comparable types, a TypeError will
be thrown.
> (1,'hi')<('hello','world')
TypeError: '<' not supported between instances of 'int' and 'str'
The sort() method works on a similar pattern: it sorts primarily by the first element; in case
of a tie, it sorts on the second element and so on. This feature is used in a pattern known as DSU –
Decorate a sequence by building a list of tuples with one or more sort keys preceding the elements
from the sequence,
Sort the list of tuples using the Python built-in sort(), and
Undecorate by extracting the sorted elements of the sequence.
Consider a program for sorting the words in a sentence from longest to shortest, which illustrates the
DSU pattern.
txt = 'Ram and Seeta went to forest with Lakshman'
words = txt.split()
t = list()
for word in words:
    t.append((len(word), word))
print('List of tuples:', t)
t.sort(reverse=True)
res = list()
for length, word in t:
    res.append(word)
print('The sorted list:', res)
The sorted list: ['Lakshman', 'forest', 'Seeta', 'with', 'went', 'and', 'Ram', 'to']
In the above program, we have split the sentence into a list of words.
Then, a tuple containing length of the word and the word itself are created and are appended to a
list.
Observe the output of this list – it is a list of tuples. Then we are sorting this list in descending
order.
Now for sorting, length of the word is considered, because it is a first element in the tuple.
At the end, we extract length and word in the list, and create another list containing only the
words and print it.
Tuple Assignment
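A tuple can appear on the LHS of the assignment operator, which allows us to assign values to
multiple variables at a time.
>>> x,y=10,20
>>> print(x) #prints 10
>>> print(y) #prints 20
When we have a list of items, they can be extracted and stored into multiple variables as below –
>>> ls=["hello", "world"]
>>> x,y=ls
>>> print(x) #prints hello
>>> print(y) #prints world
This code is internally the same as –
x= ls[0]
y= ls[1]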
The best known example of assignment of tuples is swapping two values as below –
> a=10
> b=20
> a, b = b, a
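In the above example, the statement a, b = b, a is treated by Python as follows: the LHS is a tuple of
variables, and the RHS is a tuple of expressions.
The expressions on the RHS are evaluated first and then assigned to the respective variables on the LHS.
The number of variables on the LHS must match the number of values on the RHS, otherwise an error is thrown.
>>> a, b=10,20,5
ValueError: too many values to unpack (expected 2)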
While doing assignment of multiple variables, the RHS can be any type of sequence like list,
string or tuple. Following example extracts user name and domain from an email ID.
> email='[email protected]'
> usrName, domain = email.split('@')
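A runnable sketch of the same idea is given below. The address used here is a hypothetical placeholder chosen for illustration, not the one from the original notes.

```python
email = 'student@example.com'        # hypothetical address for illustration
usrName, domain = email.split('@')   # split() returns a list of two strings

print(usrName)   # student
print(domain)    # example.com
```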
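The method items() of a dictionary gives its key-value pairs as tuples, so a for loop can traverse
them with two iteration variables as below –
d={'Tom': 1292, 'Jerry': 3501, 'Donald': 8913}
for key, val in d.items():
    print(val,key)
#output is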
1292 Tom
3501 Jerry
8913 Donald
This loop has two iteration variables because items() returns a view of (key, value) tuples.
And key, val is a tuple assignment that successively takes each of the key-value pairs in
the dictionary.
For each iteration through the loop, both key and val are advanced to the next key-value pair in
the dictionary, in the order in which the keys were inserted (Python 3.7 onwards).
Once we get a key-value pair, we can create a list of tuples and sort them:
ls=list()
for key, val in d.items():
    ls.append((val,key)) #observe inner parentheses
print("List of tuples:",ls)
ls.sort(reverse=True)
In the above program, we are extracting key, val pair from the dictionary and appending it to the
list ls.
While appending, we are putting inner parentheses to make sure that each pair is treated as a tuple.
The sorting would happen based on the telephone number (val), but not on name (key), as first
element in tuple is telephone number (val).
If the word is not there in dictionary, treat that word as a key, and initialize its value as 1. If that word
already there in dictionary, increment the value.
Once all the lines in a file are iterated, you will have a dictionary containing distinct words and their
frequency. Now, take a list and append each key-value (word- frequency) pair into it.
Sort the list in descending order and display only 10 (or any number of) elements from the list to
get most frequent words.
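import string
fhand = open('test.txt')
counts = dict()
for line in fhand:
    line = line.translate(str.maketrans('', '', string.punctuation))
    line = line.lower()
    for word in line.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
lst = list()
for key, val in list(counts.items()):
    lst.append((val, key))
lst.sort(reverse=True)
for key, val in lst[:10]:
    print(key, val)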
Run the above program on any text file of your choice and observe the output.
For example, we may need to create a telephone directory in which the key is a (first name,
last name) pair and the value is the telephone number.
telDir={}
for i in range(len(number)):
    telDir[names[i]]=number[i]
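A self-contained sketch of this directory is given below. The contents of names and number are hypothetical, chosen only so that the loop has something to work on.

```python
# Hypothetical names and numbers (illustrative data only).
names = [('Tom', 'Cat'), ('Jerry', 'Mouse'), ('Donald', 'Duck')]
number = [3561, 4014, 9813]

telDir = {}
for i in range(len(number)):
    telDir[names[i]] = number[i]   # each key is a (first, last) tuple

# Iterating over the dictionary yields the tuple keys, which unpack directly.
for first, last in telDir:
    print(first, last, telDir[first, last])
```

A list such as ['Tom', 'Cat'] could not be used as the key here, because lists are mutable and therefore not hashable.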
Still, due to their differences in behavior and ability, we need to understand the pros and cons of
each of them before deciding which one to use in a program.
1. Strings are more limited than other sequences like lists and tuples, because the elements of
a string must be characters. Moreover, strings are immutable. Hence, if we need to modify the
characters in a sequence, it is better to use a list of characters than a string.
2. As lists are mutable, they are used more often than tuples. But, in some situations as given
below, tuples are preferable.
a. When a function returns several values in a return statement, it is syntactically simpler
to return a tuple than a list.
b. When a dictionary key must be a sequence of elements, we must use an immutable
type like a string or a tuple.
3. As tuples are immutable, they have no methods like sort() and reverse(), which modify a list in
place. But, Python provides the built-in functions sorted() and reversed(), which take any sequence
as an argument: sorted() returns a new sorted list, and reversed() returns an iterator over the
elements in reverse order.
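The points in the list above can be tried out with a short sketch. The variable names and values here are illustrative assumptions.

```python
t = (3, 1, 2)

# sorted() accepts any sequence and always returns a new list.
ascending = sorted(t)

# reversed() returns an iterator; wrap it in tuple() to get a tuple back.
backwards = tuple(reversed(t))

# Strings are immutable, so edit a list of characters and join it again.
chars = list('hello')
chars[0] = 'j'
word = ''.join(chars)

print(ascending, backwards, word)
```

The original tuple t is left unchanged by both calls.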
Debugging
Lists, Dictionaries and Tuples are basically data structures.
In real-world programming, we may require compound data structures like lists of tuples,
dictionaries containing tuples and lists etc.
But, these compound data structures are prone to shape errors – that is, errors caused when a data
structure has the wrong type, size, composition etc.
For example, when your code is expecting a list containing a single integer, but is given a plain
integer, there will be an error.
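A shape error of this kind can be reproduced with a small sketch. The function total is a hypothetical helper written only for this illustration.

```python
def total(values):
    """Add up the numbers in a list."""
    result = 0
    for v in values:
        result += v
    return result

print(total([5]))        # a list holding a single integer works

try:
    total(5)             # a plain integer has the wrong shape
except TypeError as err:
    message = str(err)
    print('Shape error:', message)
```

Printing type(values) at the top of such a function is a quick way to confirm what shape actually arrived.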
When debugging a program, following are a few things a programmer can try –
Reading: Examine your code, read it back to yourself, and check that it says what you meant to say.
Running: Experiment by making changes and running different versions. Often if you display
the right thing at the right place in the program, the problem becomes obvious, but sometimes
you have to spend some time to build scaffolding.
Ruminating: Take some time to think! What kind of error is it: syntax, runtime, semantic?
What information can you get from the error messages, or from the output of the program?
What kind of error could cause the problem you're seeing? What did you change last,
before the problem appeared?
Retreating: At some point, the best thing to do is back off, undoing recent changes, until you get
back to a program that works and that you understand. Then you can start rebuilding.
We have done such tasks earlier using string slicing and string methods like split(), find() etc.
As the task of searching and extracting is very common, Python provides a powerful library for
regular expressions, the re module, to handle these tasks elegantly.
Though regular expressions have a rather complicated syntax, they provide a compact and efficient
way of searching for patterns.
Regular expressions are themselves little programs to search and parse strings.
The re module has a search() function, which is used to find a particular pattern within a
string.
import re
fhand = open('myfile.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('how', line):
        print(line)
When run on the file myfile.txt discussed in previous chapters, the above program prints every line
that contains the word how, because search() looks for its pattern anywhere in the line.
One can observe that the above program is not much different from a program that uses the find()
method of strings. But, regular expressions make use of special characters with specific meanings.
In the following example, we make use of the caret (^) symbol, which indicates the beginning of the line.
import re
hand = open('myfile.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^how', line):
        print(line)
Here, we have searched for lines which start with the string how.
Again, this program does not make full use of regular expressions, because it could have been
written using the string method startswith(). Hence, in the next section, we will understand the
true usage of regular expressions.
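The equivalence mentioned above can be checked on a few lines held in memory. This is a sketch only: the list lines is a hypothetical stand-in for the contents of a file.

```python
import re

# Hypothetical lines standing in for the contents of a file.
lines = ['hello, how are you?', 'how about you?', 'I am doing fine.']

# Lines that START with 'how': regular expression versus string method.
with_regex = [ln for ln in lines if re.search('^how', ln)]
with_method = [ln for ln in lines if ln.startswith('how')]

# Lines that CONTAIN 'how' anywhere: regular expression versus find().
contains_regex = [ln for ln in lines if re.search('how', ln)]
contains_find = [ln for ln in lines if ln.find('how') != -1]

print(with_regex, contains_regex)
```

For such fixed strings the two approaches select the same lines; the difference appears once the pattern uses meta characters.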
Some examples for quick and easy understanding of regular expressions are given in the next
Table.
Consider the following example, where the regular expression searches for lines which start
with I, followed by any two characters (each dot stands for any one character), followed by the
character m.
import re
fhand = open('myfile.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('^I..m', line):
        print(line)
I am doing fine.
Note that the regular expression ^I..m not only matches 'I am', but can also match 'Isdm',
'I*3m' and so on.
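The strings mentioned above can be tested directly. This is a small sketch; the two extra non-matching strings are assumptions added for contrast.

```python
import re

candidates = ['I am doing fine.', 'Isdm', 'I*3m', 'I m', 'An Iffm']

# Keep only the strings that start with I, then any two characters, then m.
matched = [s for s in candidates if re.search('^I..m', s)]
print(matched)
```

'I m' fails because only one character separates I and m, and 'An Iffm' fails because the I is not at the beginning.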
In the previous program, we knew that there are exactly two characters between I and m. Hence,
we could use two dots.
But, when we don't know the exact number of characters between two characters (or strings),
we can use the dot and + symbols together.
import re
hand = open('myfile.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^h.+u', line):
        print(line)
It indicates that the line should start with h and contain a u somewhere later, with one or more
characters of any kind (dot and +) in between. The line itself need not end with u.
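A quick check of this pattern on a few hypothetical strings (all of them assumptions made for illustration) shows what it accepts and rejects:

```python
import re

candidates = ['hello, how are you?', 'hu', 'hxu', 'how is it', 'oh, you']

# h at the start, at least one character of any kind, then a u.
matched = [s for s in candidates if re.search('^h.+u', s)]
print(matched)
```

'hu' is rejected because + demands at least one character between h and u.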
A few examples:
To understand the behavior of a few basic meta characters, we will see some examples.
The file used for these examples is mbox-short.txt which can be downloaded from –
https://www.py4e.com/code3/mbox-short.txt
Pattern to extract lines starting with the word From (or from) and ending with edu:
import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    pattern = '^[Ff]rom.*edu$'
    if re.search(pattern, line):
        print(line)
Here the pattern given for regular expression indicates that the line should start with either From
or from. Then there may be 0 or more characters, and later the line should end with edu.
Pattern to extract lines ending with a digit: replace the pattern with the following string; the
rest of the program remains the same.
pattern = '[0-9]$'
Using Not:
pattern = '^[^a-z0-9]+'
Here, the first ^ indicates that the match must begin at the start of the line. The ^ inside the
square brackets negates the set, so it matches any single character that is not listed in the
brackets. Hence, the whole meaning is: the line must start with one or more characters other than
lowercase letters and digits. In other words, the line should not start with a lowercase letter or a digit.
Here, the line should start with a capital letter, followed by 0 or more characters, and must end
with a digit.
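The character-class patterns discussed above can be compared side by side on a few sample strings. This is a sketch under stated assumptions: the strings in samples are hypothetical, and the pattern '^[A-Z].*[0-9]$' is an assumed rendering of the "capital letter ... digit" description, since that pattern is not shown in the text.

```python
import re

# Hypothetical sample strings (not taken from mbox-short.txt).
samples = ['From someone at uni.edu', 'from uni.edu', 'Room 42',
           '9 lives', 'abc', '#tag 7']

from_edu = [s for s in samples if re.search('^[Ff]rom.*edu$', s)]
ends_with_digit = [s for s in samples if re.search('[0-9]$', s)]
not_lower_or_digit = [s for s in samples if re.search('^[^a-z0-9]+', s)]
# Assumed pattern for "starts with a capital letter, ends with a digit".
capital_to_digit = [s for s in samples if re.search('^[A-Z].*[0-9]$', s)]

print(from_edu, ends_with_digit, not_lower_or_digit, capital_to_digit)
```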
Consider an example of extracting anything that looks like an email address from any line.
import re
lst = re.findall(r'\S+@\S+', s)
print(lst)
['[email protected]', '[email protected]']
Here, the pattern indicates at least one non-whitespace character (\S) before @ and at least one
non-whitespace character after @.
Hence, it will not match @2pm, because there is a white-space before @.
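A runnable version of this idea is sketched below. The sentence s and both addresses are hypothetical placeholders, not the ones used in the original example.

```python
import re

# Hypothetical sentence; the addresses are placeholders.
s = 'A message from alice@example.com to bob@example.org about meeting @2pm'

# r'...' is a raw string, so the backslash in \S reaches re unchanged.
lst = re.findall(r'\S+@\S+', s)
print(lst)
```

The token @2pm is skipped because the character before @ is a space, so \S+ has nothing to match there.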
Now, we can write a complete program to extract all email-ids from the file.
import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    # raw string, so that \S is passed to re unchanged
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)
Here, the condition len(x) > 0 is checked because we want to print only the results for lines which
contain an email-ID. If a line has no match for the given pattern, the findall() function returns
an empty list. The length of an empty list is zero, and hence we print only the lists with length
greater than 0.
['[email protected]']
['<[email protected]>']
['<[email protected]>']
['<[email protected]>;']
['<[email protected]>;']
['<[email protected]>;']
['apache@localhost)']
……………………………….
………………………………..
Note that, apart from the email-IDs themselves, the output contains additional characters (<, >, ;
etc) attached to the extracted pattern. To remove them, refine the pattern. That is, we want the
email-ID to start with a letter or a digit, and to end with a letter. Hence, the statement
would be –
x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
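The effect of the refinement can be seen by running both patterns on the same input. This is a sketch only; the three lines are hypothetical and merely imitate the kind of decoration (<, >, ;, parentheses) seen in the output above.

```python
import re

# Hypothetical lines imitating mail-header text.
lines = ['Return-Path: <alice@example.com>',
         'for <bob@example.org>; Sat, 5 Jan',
         '(envelope-from carol@localhost)']

loose = [re.findall(r'\S+@\S+', ln) for ln in lines]
strict = [re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', ln) for ln in lines]

print(loose)
print(strict)
```

The refined pattern works because \S* is greedy but gives characters back until the final [a-zA-Z] can match a letter.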
As another example, the file mbox-short.txt contains lines like –
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
The line should start with X-, followed by 0 or more characters. Then, we need a colon and a
white-space, which are written as they are.
Then there must be a number made up of one or more digits, with or without a decimal point. Note
that inside square brackets the dot is not a meta character; it matches a literal dot. The pattern
for the regular expression would be –
^X-.*: [0-9.]+
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^X-.*: [0-9.]+', line):
        print(line)
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
……………………………………………………
……………………………………………………
Assume that, we want only the numbers (representing confidence, probability etc) in the above
output.
We could use the split() function on the extracted string. But, it is better to refine the regular
expression. To do so, we need the help of parentheses.
When we add parentheses to a regular expression, they are ignored when matching the string. But
when we are using findall(), parentheses indicate that, while we want the whole expression to
match, we are only interested in extracting the portion of the substring that falls inside the
parentheses.
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X-\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)
Because of the parentheses around [0-9.]+, findall() still matches the whole pattern starting with
X-, but extracts only the numeric portion. Now, the output would be –
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
…………………
………………..
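The difference made by the parentheses can be seen by applying both forms of the pattern to the same lines. This is a minimal sketch: the first two lines are the sample lines shown earlier, the third is a hypothetical non-matching line, and the variable names are assumptions.

```python
import re

lines = ['X-DSPAM-Confidence: 0.8475',
         'X-DSPAM-Probability: 0.0000',
         'Subject: not a match']

whole = []
numbers = []
for line in lines:
    # Without parentheses findall() returns the entire matched text.
    whole += re.findall(r'^X-\S*: [0-9.]+', line)
    # With parentheses it returns only the captured part.
    numbers += re.findall(r'^X-\S*: ([0-9.]+)', line)

# The captured strings can then be converted for arithmetic.
values = [float(n) for n in numbers]
print(whole, numbers, values)
```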
Another example of similar form: The file mbox-short.txt contains lines like –
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
We may be interested in extracting only the revision numbers mentioned at the end of these
lines. Then, we can write the statement –
x = re.findall('^Details:.*rev=([0-9.]+)', line)
The regex here indicates that the line must start with Details:, followed by any characters, then
rev= and then digits.
As we want only those digits, we put parentheses around that portion of the expression.
Note that the expression [0-9.]+ is greedy: it keeps grabbing digits (or dots) until it finds some
other character, so it can capture a number of any length.
The output of above regular expression is a set of revision numbers as given below –
['39772']
['39771']
['39770']
['39769']
………………………
………………………
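The greedy behaviour can be observed on the sample line shown above. In this sketch the second pattern, which captures a single digit, is an assumption added only for contrast.

```python
import re

line = 'Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772'

# [0-9.]+ keeps consuming digits until a different character (or the end).
x = re.findall('^Details:.*rev=([0-9.]+)', line)

# Without the +, the class matches exactly one character.
one_digit = re.findall('rev=([0-9])', line)

print(x, one_digit)
```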
Consider another example – we may be interested in knowing time of a day of each email. The
file mbox-short.txt has lines like –
Here, we would like to extract only the hour 09. That is, we would like only the two digits
representing the hour. Hence, we need to modify our expression as –
x = re.findall('^From .* ([0-9][0-9]):', line)
Here, [0-9][0-9] indicates that exactly two digits should appear.
The alternative way of writing this would be -
x = re.findall('^From .* ([0-9]{2}):', line)
The number 2 within curly braces indicates that the preceding match should appear exactly
two times.
['18']
['16']
['15']
…………………
…………………
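Both ways of asking for two digits can be compared on one line. This is a sketch: the line below is hypothetical, written in the general shape of a "From " line, and its address is a placeholder.

```python
import re

# Hypothetical line; the address is a placeholder.
line = 'From someone@example.com Sat Jan  5 09:14:16 2008'

two_classes = re.findall('^From .* ([0-9][0-9]):', line)
with_count = re.findall('^From .* ([0-9]{2}):', line)

print(two_classes, with_count)
```

Only the hour is captured, because it is the only pair of digits that is preceded by a space and followed by a colon.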
Escape Character
As we have discussed till now, characters like dot, plus, question mark, asterisk, dollar etc.
are meta characters in regular expressions. To match such a character literally, we prefix it
with a backslash (\).
For example,
import re
y = re.findall(r'\$[0-9.]+',x)
Output:
['$10.00']
Here, we want to extract only the price $10.00. As the $ symbol is a meta character, we need to
put \ before it.
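A complete version of this example is sketched below. The sentence assigned to x is a hypothetical stand-in, and the unescaped pattern is added only to show what goes wrong without the backslash.

```python
import re

# Hypothetical sentence containing a price.
x = 'We just received $10.00 for cookies.'

# Escaped: \$ matches a literal dollar sign.
y = re.findall(r'\$[0-9.]+', x)

# Unescaped: $ means "end of the string", so no digits can follow it.
z = re.findall('$[0-9.]+', x)

print(y, z)
```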
There is a command-line program built into Unix called grep (the name comes from the ed editor
command g/re/p, "globally search for a regular expression and print") that behaves similarly to
the search() function.
[email protected] From:
From: [email protected]
Note that some versions of the grep command do not support the non-blank character \S, hence we
use the more portable [^ ], indicating a character that is not a space.