ANL252 Python for Data Analytics Guide
ANL252 Python for Data Analytics Guide
No part of this material may be reproduced in any form or by any means without
permission in writing from the Educational Media & Resources, Singapore University of
Social Sciences.
ISBN 9789814873925
Social Sciences.
Release V1.2.0
Table of Contents
Course Guide
1. Welcome.................................................................................................................. CG-2
Overview................................................................................................................... SU1-3
Summary................................................................................................................. SU1-51
........................................................................................................................... SU1-
i
Table of Contents
Overview................................................................................................................... SU2-3
Summary................................................................................................................. SU2-61
........................................................................................................................... SU2-
Overview................................................................................................................... SU3-3
Summary................................................................................................................. SU3-28
References............................................................................................................... SU3-39
Overview................................................................................................................... SU4-3
ii
Table of Contents
Summary................................................................................................................. SU4-55
References............................................................................................................... SU4-65
Overview................................................................................................................... SU5-3
Summary................................................................................................................. SU5-49
References............................................................................................................... SU5-55
Overview................................................................................................................... SU6-3
iii
Table of Contents
Summary................................................................................................................. SU6-38
References............................................................................................................... SU6-44
iv
List of Tables
List of Tables
Table 2.3 Some common packages for data analytics in Python......................... SU2-59
Table 4.1 Most common data file formats and the corresponding reader in
pandas............................................................................................................................. SU4-5
v
List of Tables
Table 6.1 List of some available modes of the open() function........................... SU6-6
vi
List of Figures
List of Figures
Figure 1.3 JupyterLab input cell for Python code entry......................................... SU1-6
Figure 1.4 JupyterLab shows the output after the code is run.............................. SU1-6
Figure 1.14 Syntax error caused by quotation marks within a string................. SU1-21
vii
List of Figures
Figure 1.25 Repeating the same code for two individuals to enter..................... SU1-38
Figure 1.30 Using while-loop for user input if error occurs.............................. SU1-48
Figure 1.31 The complete example programme of Study Unit 1......................... SU1-50
viii
List of Figures
Figure 2.16 Replacing more values than the specified indices............................ SU2-18
Figure 2.17 Replacing less values than the specified indices............................... SU2-18
Figure 2.20 Merging two lists by keeping them as lists in the new one............. SU2-21
Figure 2.22 Printing list elements line by line to the screen................................. SU2-23
Figure 2.23 Printing elements of multiple lists using formatted string.............. SU2-24
ix
List of Figures
Figure 2.30 Store keys and values from a dictionary in lists................................ SU2-33
Figure 2.31 Print all the keys from a dictionary line by line................................ SU2-34
Figure 2.32 Print all the keys and values from a dictionary line by line............ SU2-34
Figure 2.34 Print items from a dictionary after converting it to a list................. SU2-36
Figure 2.36 Add items to every dictionary key and convert values to lists....... SU2-37
Figure 2.37 Print items from a dictionary after converting to lists using for-
loops.............................................................................................................................. SU2-37
Figure 2.40 Merging two dictionaries using the v3.5+ option............................. SU2-40
Figure 2.41 Merging two dictionaries using the v3.9+ option............................. SU2-40
x
List of Figures
Figure 4.3 Printing the head and tail of the imported dataset by
display().................................................................................................................. SU4-10
Figure 4.5 Select a single column from a DataFrame and save it as NumPy
array.............................................................................................................................. SU4-13
Figure 4.6 Select rows from a DataFrame by random row indices..................... SU4-14
xi
List of Figures
Figure 4.13 Selecting cells using row and column labels..................................... SU4-21
Figure 4.14 Selecting cells by row indices and column labels............................. SU4-22
Figure 4.15 Selecting cells by row labels and column indices............................. SU4-22
Figure 4.19 Concatenating two datasets with different rows but identical
variables........................................................................................................................ SU4-27
Figure 4.25 Counting the number of missing values in each variable................ SU4-37
xii
List of Figures
Figure 4.36 Computing the mean of all numeric columns for grouped
data................................................................................................................................ SU4-53
Figure 4.37 Computing the mean of selected columns for grouped data........... SU4-53
Figure 5.2 Reducing the number of categories for a categorical variable........... SU5-10
xiii
List of Figures
Figure 5.16 Load packages and prepare dataset for tree plot.............................. SU5-45
Figure 6.4 Print the records in a .csv file to the screen line by line..................... SU6-11
xiv
List of Figures
Figure 6.15 Convert the result of an SQL query to a pandas DataFrame........... SU6-21
Figure 6.17 Select only analytics students from the table students................. SU6-26
Figure 6.18 Select only analytics students with ID number between 5 and
10.................................................................................................................................... SU6-27
Figure 6.19 Select only analytics students with ID number not between 5 and
10.................................................................................................................................... SU6-27
Figure 6.20 Select analytics students not from Singapore or China.................... SU6-28
Figure 6.21 Select analytics students with first names that start with “M”........ SU6-28
Figure 6.22 Select analytics students with first names that contain “ar”............ SU6-28
Figure 6.23 Select analytics students with first names that end with “s”........... SU6-29
Figure 6.24 Select analytics students with first names that start with “M” and
end with “l”................................................................................................................. SU6-29
Figure 6.25 Select analytics students with last names that have the pattern
“Ta_”.............................................................................................................................. SU6-29
xv
List of Figures
Figure 6.30 Select certain columns of a table by a variable list string................. SU6-32
xvi
List of Lesson Recordings
xvii
List of Lesson Recordings
xviii
Course
Guide
1. Welcome
Welcome to the course ANL252 Python for Data Analytics, a 5 credit unit (CU) course.
This StudyGuide will be your personal learning resource to take you through the course
learning journey. The guide is divided into two main sections – the Course Guide and
Study Units.
The Course Guide describes the structure for the entire course and provides you with an
overview of the Study Units. It serves as a roadmap of the different learning components
within the course. This Course Guide contains important information regarding the
course learning outcomes, learning materials and resources, assessment breakdown and
additional course information.
i
https://d2jifwt31jjehd.cloudfront.net/ANL252/IntroVideo/ANL252_Intro_Video.mp4
CG-2
ANL252 Course Guide
The course provides foundational knowledge and skills of Python programming, which
enables students to develop programs for data preparation, data management, and data
visualisation to carry out data analytics tasks such as clustering, decision tree, etc. Students
also acquire skills to explore and find patterns in datasets using ready-to-use Python codes
that can be modified to suit individual needs. Furthermore, this course introduces the
application of SQL for querying data from database within any Python code for data
analytics purposes. Since this course is designed to help students with little prior exposure
to programming, it will focus on breadth rather than depth.
Course Structure
This course is a 5-credit unit course.
There are six Study Units in this course. The following provides an overview of each Study
Unit.
This unit takes the first steps in Python programming, including variables, data types,
operators, formatted printing, and user input. It also introduces the conditional statement
and loops, two types of control flow that change the behaviour of a program dynamically.
This unit establishes three compound built-in data types in Python: tuples, lists, and
dictionaries. Compound data structures organise and store data in a way that they
can be accessed and worked with efficiently. The type of compound data also defines
the relationship between the data and the operations that can be performed on them.
Furthermore, this unit covers the application of functions, methods, packages, and
modules and how they can be integrated in the program.
CG-3
ANL252 Course Guide
This study unit introduces two Python packages: NumPy and matplotlib. NumPy is
the fundamental package for efficient scientific computing with Python. The students
first learn to create and subset NumPy arrays, before getting on to generate statistics
on the data stored in an array by the integrated NumPy functions. Furthermore, this
unit illustrates the use of the matplotlib package for data visualisation, including the
functionality for plotting and customising basic charts of data analytics.
This unit establishes the pandas DataFrame as the key data structure for analytics in
Python. It demonstrates the process of creating Python programs for importing data
from external sources, indexing, and querying data from a DataFrame. It also focuses on
merging multiple DataFrames efficiently, identifying and dealing with missing data and
outliers, sorting, and grouping data.
This unit covers the application of k-means clustering and decision trees in Python
programming using the scikit-learn library. The specific preparation of DataFrames due
to the different requirements of the analytics algorithms is discussed in the first step,
followed by the implementation of the two techniques in Python programs, including
parameter settings, presentation, and visualisation of the results.
This unit describes the usage of SQL to query data from databases in Python programs.
The students learn basic information about SQL as a programming language and the
related SQLite3 package. Students will also learn about how to create flexible SQL
statements for data queries such as selecting data and merging tables using Python
programming skills.
CG-4
ANL252 Course Guide
3. Learning Outcomes
CG-5
ANL252 Course Guide
4. Learning Material
To complete the course, you will need the following learning material(s):
Required Textbook(s)
Shaw, Z. A. (2017). Learn python 3 the hard way. Addison-Wesley Professional.
If you are enrolled into this course, you will be able to access the eTextbooks here:
To launch eTextbook, you need a VitalSource account which can be created via
Canvas (iBookStore), using your SUSS email address. Access to adopted eTextbook is
restricted by enrolment to this course.
Website(s):
The Python documentation. https://docs.python.org/
CG-6
ANL252 Course Guide
CG-7
ANL252 Course Guide
5. Assessment Overview
The overall assessment weighting for Day & Evening course is as follows:
Pre-Class Quiz 02 2%
Participation 6%
TOTAL 100%
The overall assessment weighting for the full online classes for this course is as follows:
Pre-Class Quiz 2 2%
CG-8
ANL252 Course Guide
Discussion 10%
Participation 6%
TOTAL 100%
To be sure of a pass result you need to achieve scores of 40% in each component. Your
overall rank score is the weighted average of both components.
For detailed information on the Course grading policy, please refer to The Student
Handbook (‘Award of Grades’ section under Assessment and Examination Regulations).
The Student Handbook is available from the Student Portal.
Activities for the purpose of self-learning are present in each study unit. These learning
activities are meant to enable you to assess your understanding and achievement of the
learning outcomes. The type of activities can be in the form of Formative Assessment,
CG-9
ANL252 Course Guide
CG-10
ANL252 Course Guide
6. Course Schedule
To pace yourself and monitor your study progress, pay special attention to your
Course Schedule. It contains study-unit-related activities including Assignments, Self-
Assessments, and Examinations. Please refer to the Course Timetable on the Student
Portal for the most current Course Schedule.
Note: Always make it a point to check the Student Portal for announcements and
updates.
CG-11
ANL252 Course Guide
7. Learning Mode
The learning approach for this course is structured along the following lines:
a. Self-study guided by the study guide units. Independent study will require at
least 3 hours per week.
b. Working on assignments, either individually or in groups.
c. Classroom Seminars.
StudyGuide
You may be viewing the interactive StudyGuide, which is the mobile-friendly version of
the StudyGuide. The StudyGuide is developed to enhance your learning experience with
interactive learning activities and engaging multimedia. You will be able to personalise
your learning with digital bookmarking, note-taking, and highlighting of texts if your
reader supports these features.
Flexible learning—learning at your own pace, space, and time—is a hallmark at SUSS,
and we strongly encourage you to engage your instructor and fellow students in online
discussion forums. Sharing of ideas through meaningful debates will help broaden your
perspective and crystallise your thinking.
Academic Integrity
As a student of SUSS, you are expected to adhere to the academic standards stipulated
in the Student Handbook, which contains important information regarding academic
policies, academic integrity, and course administration. It is your responsibility to read
and understand the information outlined in the Student Handbook prior to embarking on
the course.
CG-12
1
Study
Unit
Introduction to Python
Programming
ANL252 Introduction to Python Programming
Learning Outcomes
SU1-2
ANL252 Introduction to Python Programming
Overview
This study unit introduces the Python programming environment and the writing of
Python programmes with some foundation elements. We will also learn how to create
different types of variables and how to assign values to them for further operations. Since
input and output belong to the core of any computer programme, we will learn how
to create user input and construct formatted strings for printing as well. Also, we will
cover the construction of Boolean expressions as conditional statements to control the
behaviour of the program. Finally, we will find out how to create finite loops to repeat
routine instructions iteratively.
SU1-3
ANL252 Introduction to Python Programming
Lesson Recording
Introduction to JupyterLab
SU1-4
ANL252 Introduction to Python Programming
If you encounter installation issues, you may also start by using the JupyterLab at: https://
jupyter.org/try-jupyter/lab/. Visit this page and the following screen will appear. You
can start a new notebook by clicking on the Pyolite icon (more instructions will follow
later).
SU1-5
ANL252 Introduction to Python Programming
Figure 1.4 JupyterLab shows the output after the code is run
SU1-6
ANL252 Introduction to Python Programming
Once we see Figure 1.3, we can type in our Python code in the cell and let the Python
interpreter execute it by pressing the play button (c.f. Figure 1.4), once the syntax is
completed. The output of the programme will then be printed below the input cell.
SU1-7
ANL252 Introduction to Python Programming
Lesson Recording
The following Python arithmetic operators are available for mathematical calculations:
SU1-8
ANL252 Introduction to Python Programming
SU1-9
ANL252 Introduction to Python Programming
As with normal mathematics, the exponent operator has higher priority than the operators
of multiplication or division which, in turn, will be calculated prior to the addition or
subtraction operators. Furthermore, we can add parentheses to our equation to indicate
that those terms within a parenthesis should have the highest priority in the calculation.
Note that mathematical functions such as the square root, the logarithm, the exponential,
or the trigonometrical functions are not included in basic Python. If we want to include
these functions in our calculation, we will need to import packages such as “math” or
“NumPy” in our code. We will discuss how to import and call functions from external
libraries or packages in Study Unit 2.
There are other operators in Basic Python such as relational operators, logical operators,
etc. We will also discuss these at a later stage of this study unit.
SU1-10
ANL252 Introduction to Python Programming
Read
python/python_basic_operators.htm
2.2 Variables
In most situations, we wish to write programmes that help us automate routine operations
without adjusting our programmes according to the actual needs. For instance, we may
not always want to add 2 and 7 together. Instead, we would prefer to let the computer add
up any pair of arbitrary numbers for us, and we can choose these numbers depending on
the situation. As a result, we would like to keep our programme as general as possible by
using variables instead.
variable = value
Remember that it is important to put the variable left of the equal sign (=) and the value
right of it. If we switched their positions, it would be equivalent to assigning a name to
a number. This would result in a syntax error, and Python will stop executing the rest of
the programme at once.
SU1-11
ANL252 Introduction to Python Programming
In Python, the name of a variable can be short (e.g., x, y, z) or more descriptive (e.g., age,
carname, total_volume). But there are certain rules which we must follow when we
create our variable names, namely:
myvar = 10
my_var = 10
my_var = 10
myVar = 10
MYVAR = 10
_myvar2 = 10
2myvar = 10
my-var = 10
my var = 10
SU1-12
ANL252 Introduction to Python Programming
Once values are assigned to variables, we can use them for any arithmetic operations for
numeric values (refer to Chapter 2.1).
Example (Students’ score): Suppose we have the exam scores of two students, 30 and
65, and we would like to store them in two variables, score1 and score2, for some
mathematical operations. Subsequently, we can conduct arithmetic operations with
these variables.
Read
Read the following section of the textbook on examples of creating and using variables
in Python:
SU1-13
ANL252 Introduction to Python Programming
SU1-14
ANL252 Introduction to Python Programming
Boolean (Bool) The value of a Boolean variable can be either True e = True
or False.
In the following, we will use the general term expression for variables or when they are
linked with operators. For instance, a + b is an expression and not a variable, unless we
define c = a + b in our programme where c is then a new variable. However, we would
rather refer to expressions in our programme directly since we do not always define new
variables for calculation steps in between.
To check the type of a variable, we can use the type() function on any variable in our
program.
type(variable_name)
Python will then print the variable type such as “int” (for integer), “float” (for float), or
“str” (for character string) to the screen.
Example (Cont’d): In Figure 1.7, we assign 30 and 65 to the two variables, score1
and score2, respectively. We can use the type() function to check their variable
type.
SU1-15
ANL252 Introduction to Python Programming
We can see that Python returns <class 'int'> as the screen output. The
information we are enquiring is put within the single quotation marks like int in this
case. As a result, we can see that both score1 and score2 are integer variables.
Read
For more information about the different operations on numeric data types, read:
SU1-16
ANL252 Introduction to Python Programming
Lesson Recording
3.1 Printing
Writing programmes not only automates routine operations by the computer but also
shows the use the results, information, or messages while the Python programme is
running. We can use the print() function to generate screen output for the user to read.
print("My String")
All we need to do here is to put the text within a pair of quotation marks and pack
everything inside the print() function. While the programme is being executed, Python
will extract the content within the quotation marks and print it onto the screen.
The print function is not only limited to printing pre-defined strings. We often wish to
print out the result of a calculation, as shown in Figure 1.9, or the value of a variable, or
the result of a calculation based on variables as well.
SU1-17
ANL252 Introduction to Python Programming
Example (Cont’d): Suppose we would like Python to print the exam score of student1,
score1, and the sum of the two scores, score1 + score2, onto the screen.
The print-command is almost identical to the one for normal printing. The only
difference here is to put an “f” before the open quotation mark. Subsequently, we can
place the variables or expressions that we would like to print anywhere within the text
and wrap it within a pair of curly brackets {}.
SU1-18
ANL252 Introduction to Python Programming
Example (Cont’d): Suppose we would like to print the scores of the two students as
well as their sum in a sentence such as “Our scores are 30 and 65. The total score is
95”. We can use the following code to create this screen output.
Unlike the entire argument within the quotation marks, every expression written inside
the curly brackets of a formatted string will be evaluated before being printed onto the
screen. In Figure 1.10, the expressions score1, score2, and score1 + score2 will be
evaluated first. That is, Python will execute the print() function with the value assigned
to these expressions and not with the expressions as part of the string.
Example (Cont’d): If the variables or expressions were not put inside some curly
brackets as a formatted string, the expressions score1, score2, and score1 +
score2 would be treated as ordinary strings and printed just as how they were
written. And if the “f” were forgotten at the beginning of the argument in the
print() function, Python would interpret the missing “f” as an instruction to print
the entire text within the quotation mark without evaluating the expressions in the
curly brackets first. Since the curly brackets are part of the string in the print()
function, they will be printed as well.
SU1-19
ANL252 Introduction to Python Programming
We can also use the .format() method for string formatting. Note that the .format()
method only takes one expression in its argument.
For the .format() method, we need to place the curly brackets at the position within
the string where we would like to print our expression.
Example (Cont’d): We can print the total score of the two students at the end of a
statement such as “The total score is 95.” by using the .format() method.
Figure 1.13 shows us how the printing string can be extended to obtain the same
output as in Figure 1.10. But the syntax is much longer here.
Nevertheless, the .format() method can also be useful if we have one variable to
be printed at the end of our statement.
SU1-20
ANL252 Introduction to Python Programming
Read
Read the following two sections of the textbook on printing formatted strings using
the .format() method:
In Figure 1.14, the string in the print() function ends with the second quotation mark.
Everything after it will be interpreted as part of the code. Since the word “alarming” is
neither a Python command nor a variable, Python simply interprets it as an erroneous
syntax. One way to avoid this error is to use single quotation marks for either the citation
quote or the string definition.
SU1-21
ANL252 Introduction to Python Programming
Another way is to use the escape sequence \" within the string instead of switching
between single and double quotation marks.
Escape sequences are also useful when line breaks should be inserted within a string. By
adding the escape sequence “\n” at the position within the string, the subsequent part of
the string will be printed in the next line of the output screen.
Example (Cont’d): Now, we would like to print the first and the second sentences in
Figure 1.10 in two separate lines. However, we would create a syntax error if we just
placed a line break in our Python script.
Figure 1.17 illustrates that Python treats such a line break within a string as a syntax
error. The reason is that the string in the print() function must be closed by a
SU1-22
ANL252 Introduction to Python Programming
quotation mark in the same line. Instead of closing the first line directly and starting a
new print() function in the second line to solve this problem clumsily, we can add
an escape sequence \n into the string:
The following list contains some useful escape sequences available in Python:
SU1-23
ANL252 Introduction to Python Programming
line2
(Source: https://www.quackit.com/python/reference/python_3_escape_sequences.cfm)
Read
Read the following two sections of the textbook on using escape sequence in printing
formatted strings as in Figure 1.17:
3.4 Input
While a programme script is being executed, it requires values to be assigned to the
variables to proceed in its instructions. So far, we have discussed the possibility to assign
values to the variables in the script. That means, the values are fixed when the programme
SU1-24
ANL252 Introduction to Python Programming
starts to run. However, in most of the cases, those values are unknown and can only
be assigned while the programme is running, mostly based on the input of the user. In
Python, we can use the input() function to ask the user to enter the value for a variable.
The whole syntax will be put on the right-hand side of an equal sign so that Python can
assign the user input to the variable that is defined on the left-hand side of the same equal
sign.
Unlike the print() function, Python requires the user to type in something and then
press ENTER to complete the execution of the input() function. Same as the print()
function, we can instruct Python to print a string to the screen within the input()
function. Usually, this string should be a question and/or some instructions to inform the
user what they should input here. Furthermore, we are also allowed to mix the assigned
values of some variables with the instruction text to become a formatted string that will
be printed on the screen for the subsequent input.
Example (Cont’d): Instead of pre-assigning values to the variables, we will ask the
user to enter his/her name and his/her score. Subsequently, we will print out his/
her score by addressing his/her name and embed his/her score in a sentence such as
“Your score is …”.
SU1-25
ANL252 Introduction to Python Programming
In Python, the value assigned by the user within an input() function will be stored as
string. If the input should be an integer or a number with a floating point, we can convert
the input using:
The functions int() and float() are used to convert a string variable into an integer
or a float variable. (Conversely, the str() function may be used to convert an integer or
a float variable into a string variable.)
Example (Cont’d): Since the score of a student must be an integer within 0 and 100,
we can convert it to an integer by embedding the input() function within an int()
function. At the same time, we add a new question to ask the student for his/her
CGPA and convert it to a float variable.
The syntax introduced above is to put the input() function within the int() and
float() functions and construct the instruction within a single line. Nevertheless, we
SU1-26
ANL252 Introduction to Python Programming
can also separate these commands into two lines without changing the behaviour of the
programme:
These lines are certainly applicable to the float() function as well. It is noteworthy
that if the user enters a value that is not a number, the int() or float() functions will
interpret it as a syntax error and stop executing the code immediately. It is therefore a good
programming habit to build in certain control mechanism for any user input command in
our code. We will discuss the construction of such a control mechanism in Chapter 5 of
this study unit.
Read
Read the following two sections of the textbook on getting the user input:
SU1-27
ANL252 Introduction to Python Programming
Chapter 4: If-elif-else-Conditions
Lesson Recording
If-elif-else-Conditions in Python
1 == 1 True
3 > 2 True
SU1-28
ANL252 Introduction to Python Programming
0 <= -5 False
In the above examples, the first two operations are obviously True since they correspond
to the mathematical relationship between the left-hand and the right-hand sides of the
equations. Note that if we want to check whether two expressions are identical, we will
have to use the double equal sign (==) instead of the ordinary equal sign (=) since the
single equal sign is used to assign a value to a variable. So, if we wrote 1 = 1 instead of
1 == 1, a syntax error would be returned since Python would interpret our intention to
assign a value to a number, which we know is not allowed from Chapter 2.2.
<> True if values of two operands are not (10 <> 20) is True.
equal (Similar to the != operator).
> True if the value of left operand is greater (10 > 20) is False.
than the value of right operand.
SU1-29
ANL252 Introduction to Python Programming
< True if the value of left operand is less (10 < 20) is True.
than the value of right operand.
>= True if the value of left operand is (10 >= 20) is False.
greater than or equal to the value of right
operand.
<= True if the value of left operand is (10 <= 20) is True.
less than or equal to the value of right
operand.
A Boolean expression can also be a combination of multiple relational operations that are
connected by the logical operators. Below is a list of logical operators in Python.
and If both the operands are True, then the (10 > 0 and 20 >
condition becomes True. 0) is True.
not Used to reverse the logical state of its not(10 > 0 or 20 >
operand. 0) is False.
SU1-30
ANL252 Introduction to Python Programming
The use of a single logical operation is usually quite straightforward as the operators are
designed in a way that it simply matches our spoken language. However, it could become
quite confusing if we combine these operators in a Boolean expression. For instance, the
following Boolean expressions are equivalent:
These are just a few examples and can be extended endlessly. It is of utmost importance to
be familiar with how to create Boolean expressions using relational and logical operators.
Any failure in combining these operators could lead to unexpected behaviour of our
programme. The only and most effective way here is to practise them with Python since
we can check on the result directly.
Read
SU1-31
ANL252 Introduction to Python Programming
if condition:
instructions
In the if-condition, Python will execute the syntaxes in the instructions if the condition is
True. However, if the condition is False, Python will simply skip these lines and proceed
with the subsequent code lines. Note that it is mandatory to put the colon directly behind
the condition, and the instructions must be indented so that Python can interpret them as
part of the if-block.
SU1-32
ANL252 Introduction to Python Programming
Example (Cont’d): If the score of a student is below 40, we will show a message on
the screen to tell him/her that he/she failed in the exam.
Figure 1.21 illustrates what Python does if the condition is True. On the other hand,
if a student scores more than 40, nothing will be printed based on the Python script.
Figure 1.22 shows how Python skips all the instructions in the if-block since the
condition is False.
If we intend to let Python execute another set of instructions if the condition is False,
and not just skip the if-block, we can add an else-statement to the if-block:
SU1-33
ANL252 Introduction to Python Programming
if condition:
instructions 1
else:
instructions 2
Same as the if-condition, we must add a colon to the else-statement and the instructions
following it must be indented as well.
Example (Cont’d): If the score of a student is below 40, we will show a message on the
screen to tell him/her that he/she had failed in the exam. Otherwise, we will show a
message to tell him/her that he/she passed.
Figure 1.23 shows that if the condition is False, Python will execute those
instructions following the else-statement.
If the construction of the condition allows more than two outcomes, we may need a third
or fourth if-blocks, etc. In this case, we can use the if-elif-else-block:
if condition 1:
instructions 1
elif condition 2:
instructions 2
else:
SU1-34
ANL252 Introduction to Python Programming
instructions 3
In the example in Figure 1.23, the programme is only constructed to separate students
into two categories: pass and fail. It will then print the statement to the user accordingly.
Suppose we also give grades to evaluate the performance of the students, we can
categorise the scores using if-conditions.
In the example shown in Figure 1.24, the else-statement has also been omitted since all
possible outcomes of the variable score have been covered by the if-block. Nevertheless,
we can also use the else-statement instead of the whole elif-condition for grade A if we
are confident to do so. Just be cautious that in this case, if certain possibilities were not
covered, no instructions would be carried out from the if-block, and the behaviour of the
subsequent part of the programme may be affected.
Example (Cont’d): We will print the grade to the student according to his/her exam
score. If a student scores between 80 and 100, his/her grade will be A; and if his/her
score is between 80 and 60, s/he will get a B; a score between 50 and 60 is equivalent
SU1-35
ANL252 Introduction to Python Programming
to grade C; grade D will be given if a student scores between 40 and 50 and any score
below 40 belongs to grade F.
Lastly, the print() function is not indented here. As a result, Python interprets it as
the part of the code that should be executed after the entire if-block and not as part
of the instructions following the last condition (elif score1 >= 80 and score1
<= 100:).
Read
Read the following three sections of the textbook on conditional statements for control
flow:
Exercise 29 What If
SU1-36
ANL252 Introduction to Python Programming
Chapter 5: Loops
Lesson Recording
Loops in Python
5.1 While-Loops
In the student score example, we constructed a programme in which the name and the
mark of one student can be entered. In the early stage of our example, we had scores of
two students. If we had to enter their names and then assign a grade to each of them, we
would have to repeat the codes in Chapters 2, 3, and 4 twice.
SU1-37
ANL252 Introduction to Python Programming
Figure 1.25 Repeating the same code for two individuals to enter
Usually, we have more than two students in one class. In order not to expand our codes
endlessly and make it clumsy and unreadable, we can construct a loop in our programme
to repeat the instructions that will be applied many times. The first type of loop that we
will introduce here is the while-loop.
while conditions:
instructions
The condition is a Boolean expression which controls whether the loop will continue to
run for a new iteration or not. If the condition is True, Python will go on to execute the
SU1-38
ANL252 Introduction to Python Programming
instructions that are written with indentation and after the colon behind the conditions.
The number of loops can be infinite and will be repeated as long as the condition is True.
As a result, it is extremely important to ensure that a while-loop will be terminated at
some stage by fulfilling the exit condition.
Before the loop starts, we initiate a counter variable called i with the value 0. This
counter will increase by 1 in each iteration. The while-loop is set to continue as long
as i has not reached 3 yet. At the beginning of the last iteration, i should be 2 and
will become 3 when Python executes the instructions within the while-loop. At the
end of this iteration, Python will go back to the first line of the while-loop, and since
SU1-39
ANL252 Introduction to Python Programming
the condition i < 3 is no longer True, the programme will exit the loop as designed.
Therefore, the most crucial command in the whole loop is i = i + 1. Without this
line, i can never reach 3, which is the exit condition, after three iterations. Instead, it
will always stay at 0, the initial value defined before the loop has started.
Read
5.2 For-Loops
Another type of loop is the for-loop, which are constructed differently. While we need a
True-condition for a while-loop to continue to iterate, we need a list for the running of
the for-loops. We will discuss the construction and properties of a list in detail in Study
Unit 2. Here, we will first learn to generate a simple list of consecutive integers by the
range() function.
The start value can be any integer as long as it is smaller than the end value. Note that
the end value is not included in the list. In other words, the list will end at end – 1. The
generated list of numbers will be assigned to the variable defined on the left-hand side of
the syntax. For the for-loops, we do not need to store the generated list in a variable first.
Instead, we can use the range() function in the for-statement directly.
SU1-40
ANL252 Introduction to Python Programming
The for-command must end with a colon, followed by the instructions that should be
carried out in each iteration. These instructions must be written with indentation. The
counter variable will do the counting for the iterations, starting from the start value in
the first iteration and running through the entire integer list. Once the counter reaches end
– 1, Python will execute the instructions for the last time and then exit the loop.
SU1-41
ANL252 Introduction to Python Programming
Example (Cont’d): We will carry out the previous task using the for-loop.
The range() function generates a list containing the values 0, 1, and 2, since 3 will not
be included in the list by definition. The code in Figure 1.27 also shows that the counter
variable i is already integrated in the for-statement, and an explicit instruction to
increase it by one in each iteration is not required at all.
SU1-42
ANL252 Introduction to Python Programming
Read
In our previous example, suppose we would like to quit the entire programme after
entering the first student’s data due to some reasons, although the programme allows us to
enter the data for up to three students. It would be reasonable to have a syntax that allows
us to break from the loop in a “clean” manner instead of shutting down the computer
entirely. The command for such situation is break.
while conditions:
…
if conditions:
break
SU1-43
ANL252 Introduction to Python Programming
Usually, break is used together with an if-condition, since we would only want to break
from a loop under some circumstances and not in general.
Example (Cont’d): Suppose we let the user quit the programme by entering -1 for
his/her score now. All we need to do is to add an if-condition after the input()
statement where s/he can enter the score.
The break command is built after the user is asked to enter his/her exam score but
before the grade is assigned. Basically, the programme will still work normally if the
break command is put after the whole if-elif-block. By breaking from the loop
SU1-44
ANL252 Introduction to Python Programming
before a chunk of codes that will have no further influence on the execution of the
programme can shorten the running time and make it faster.
Note that we can also apply the break command on our example in Figure 1.26, where
we use the while-loops for the same task instead.
Before we start to construct while-loops for user inputs, we must learn how Python
handles errors. In Chapter 3, we mention that if we apply the int() function to convert
a string variable that contains a non-numeric value to an integer, the programme will
be interrupted due to value error, and the user will have to restart the programme in
PowerShell. This can be very annoying if the user only makes a small mistake in one of the
input fields but needs to re-type all the inputs because of the programme interruption.
To avoid Python from stopping to execute the program by force, we can use the try:
syntax.
SU1-45
ANL252 Introduction to Python Programming
In Python, there are many built-in exceptions. Table 1.7 provides some common ones.
Exception Description
NameError Raised when a local or global name is not found. This applies only
to unqualified names. The associated value is an error message that
includes the name that could not be found.
If the user is asked to enter a numeric value such as an exam score, but enters a string
instead, we can use ValueError as our exception.
SU1-46
ANL252 Introduction to Python Programming
In the first run, we type in a string “test” where we should enter a numeric value for
the score. Python detects an error since the syntax tries to use the int() function to
convert character strings to integers which would usually cause the programme to
stop running abruptly. From the output, we can see that Python prints the warning
message we put in the print() function to the screen and ends the programme
“properly” as if no error has occurred. In the second run, we enter a numeric value as
required. Python detects no error and simply skips the except-block.
After defining the try-block to instruct Python on how to handle errors, we can construct
a while-loop around it. As condition for the loop to continue iterating is when a Boolean
variable that indicates a valid input does not change from False to True. Hence, if this
Boolean variable is True, the programme will break from the loop.
SU1-47
ANL252 Introduction to Python Programming
Example (Cont’d): Now we put the entire try-block within a while-loop. The
while-loop will stop iterating once the input for score is numeric. If it is non-
numeric, the user will see a warning message and he will also be asked to re-enter his
exam score. The whole procedure will last until the input is numeric.
We initiate a Boolean variable called valid_input before the while-loop starts. The
initial value of this variable is False, which we also use as the condition for the while-
loop to continue to iterate. In the try-block, we add an else-statement for the case
that the input is correct, and the follow-up instruction here is to change the value
of valid_input to True so that the while-loop stops. Note that we can use the
break command here as well. In the output screen, we can see that if the input is non-
numeric, the programme will print a warning message in a new line and then ask the
user to re-enter the score until the input is valid.
SU1-48
ANL252 Introduction to Python Programming
It is possible or even desirable to set a maximum number of input trials in order not to
have the programme running endlessly. A counter variable can be added to the loop, and
the programme will exit the loop after the maximum allowed number of iterations has
been reached.
SU1-49
ANL252 Introduction to Python Programming
Example (Cont’d): The complete programme of this study unit containing all the
techniques we have learned is given in the following figure.
SU1-50
ANL252 Introduction to Python Programming
Summary
We have learned the basics of writing and executing Python programmes. We have also
been introduced to the various variable types and some operators that can be applied to
them. We have then discussed how to generate screen output and how to let the user enter
answers and assign them as values to the variables. Furthermore, we have covered the
construction of conditional statements to dynamically change the programme behaviour,
if necessary. Finally, we have also come to know the use of loops to repeat routine tasks
for an endless number of times within a programme.
SU1-51
ANL252 Introduction to Python Programming
Formative Assessment
1. From which directory (folder) can you run your Python script that is saved with
the .py extension?
a. From the folder where Python is installed
b. From the folder where the operating system is installed
c. From the folder where I have saved my Python script
d. From any arbitrary folder
SU1-52
ANL252 Introduction to Python Programming
9. Which values will be printed on the screen given the following Python code?
counter = 0
while counter <= 3:
print(counter)
counter = counter + 1
SU1-53
ANL252 Introduction to Python Programming
a. 0, 1, 2, 3
b. 1, 2, 3, 4
c. 0, 1, 2, 3, 4
d. 0, 2, 3
SU1-54
ANL252 Introduction to Python Programming
Formative Assessment
1. From which directory (folder) can you run your Python script that is saved with
the .py extension?
Incorrect. Python would not be able to find your Python script there unless
you have saved it in the Python programme folder, which is rather unlikely.
Incorrect. Python would not be able to find your Python script there unless
you have saved it in the system folder, which is also rather unlikely.
Correct. You must change to the folder where you have saved your Python
script so that Python can find your file.
Incorrect. Python would not be able to find your Python script unless you
are accidentally in the folder where you had saved your Python script.
Incorrect. The syntax is correct since the curly bracket and its content will be
treated as part of the printing string.
b. 0 = 0
SU1-55
ANL252 Introduction to Python Programming
c. 0 == 1
d. y = a / int(b)
Incorrect. This is a valid syntax since we can carry out a division in which the
denominator is being converted to an integer.
a. iamavariablelol
Correct. This variable name is valid since it starts with a character and
contains no invalid character.
b. :)iamavariable
Incorrect. This variable name is invalid since it starts with a colon and
contains invalid characters such as closing round bracket.
c. 007imavariable
d. i-am-a-variable-lol
a. -1
Incorrect. The int() function does not round down a negative number.
SU1-56
ANL252 Introduction to Python Programming
b. 0.5
c. 0
Correct. The int() function will take away all the decimal places.
d. 1
Incorrect. The int() function does not round down a negative number and
then convert it to absolute number.
Incorrect. The .format() method does not format the font of a string.
Incorrect. The .format() method does not replace the curly bracket by a
round bracket in a string unless the code explicitly requires Python to do so.
d. It replaces the curly bracket within a string by the value of the expression in
the .format() method.
a. -5 <= 0
SU1-57
ANL252 Introduction to Python Programming
b. 10 ** 2 == 100
Incorrect. Since 10 ** 2 = 100, the left-hand side and the right-hand side
are equal.
d. 18 / 3 > 1 + 4
b. not a or not b
c. not a and b
d. a or b
SU1-58
ANL252 Introduction to Python Programming
d. Behind each if-, elif-, and else-statement must be a colon before the
instruction block starts.
9. Which values will be printed on the screen given the following Python code?
counter = 0
while counter <= 3:
print(counter)
counter = counter + 1
a. 0, 1, 2, 3
Correct. Since the print() function comes before the increment of the
counter, the counter will be printed starting from its initial value 0 and
goes until 3 with an increase of 1 in each iteration.
b. 1, 2, 3, 4
SU1-59
ANL252 Introduction to Python Programming
Incorrect. Since the print() function comes before the increment of the
counter, the counter will be printed starting from its initial value 0 and not
1.
c. 0, 1, 2, 3, 4
Incorrect. Since the loop will only continue to run if the value in counter is
smaller or equal to 3, 4 cannot be printed based on this code.
d. 0, 2, 3
Incorrect. Since the increment of the counter can only be 1 for each iteration
of the loop, a jump from 0 to 2 is impossible based on this code.
b. We must write a line to increase the counter by one within the loop.
Incorrect. We only need to write a line to increase the counter by one within
a while-loop.
Incorrect. A for-loop does not need any exit condition since it runs through
a finite list. Once the list comes to an end, Python will exit the loop.
d. We can use the range() function to generate a list of numbers as the index
of the for-loop iterations.
Correct. Every for-loop needs a list for it to run through. One type of list
is a list of integers that can serve as the index for the for-loop iterations,
and we can generate such a list by the range() function.
SU1-60
ANL252 Introduction to Python Programming
library/exceptions.html
library/stdtypes.html#numeric-types-int-float-complex
downloads/
www.quackit.com/python/reference/python_3_escape_sequences.cfm
python/python_basic_operators.htm
python/python_variables.asp
SU1-61
ANL252 Introduction to Python Programming
SU1-62
2
Study
Unit
Learning Outcomes
SU2-2
ANL252 Data Types and Functions
Overview
Python provides numerous compound data types to group a collection of values together.
Compound data structures organise and store data in a way that they can be accessed
and worked with efficiently. These structures also define the relationship between the
data and the operations that can be performed on them. We will learn in this study unit
how to create, when to use, and what operations to perform on the most common Python
compound data structures: tuples, lists, and dictionaries. Furthermore, we will also learn
about what functions and methods are and how they can be integrated in our programme.
Lastly, we will also deal with Python packages and modules which are, together with
functions and methods, very useful for re-using and extending our tools for performing
specific tasks in Python programming.
SU2-3
ANL252 Data Types and Functions
1.1 Tuples
Lesson Recording
Python Tuples
Python has several types of compound data for our use. One of these data types is the
tuple. A tuple is a collection of values written as comma-separated items between a
pair of round brackets, similar to vectors in mathematics. Unlike vectors, tuples are not
specifically designed for mathematical operations. The items, or elements in a tuple can
be numeric, string, or a mixture of both.
Note that tuples are immutable – that is, we are not allowed to modify the values stored in
them once they are defined. Nevertheless, they are more efficient in terms of performance
SU2-4
ANL252 Data Types and Functions
and memory use. Tuples are useful in situations where we want to share the data with
other users without granting them permission to edit the content.
Example (Students’ score, cont’d): We assign the names of two students to a tuple
called names, the scores of the same students to a tuple scores, and a mixture of the
names and scores to all_data.
We can assign elements to a tuple with or without writing them in parentheses in our
code as long as they are separated by commas. Nevertheless, when printing them on
the screen, they will be wrapped by a pair of round brackets. Furthermore, the tuple
all_data, which consists of two strings and two integers, shows us that tuples can
be indeed a mixture of strings, integers, and floats.
SU2-5
ANL252 Data Types and Functions
tuple_subset = tuple_name[start:end]
It is important to recall that the index end will not be included in the sub-setting procedure.
Example (Cont’d): We would now like to extract the first element from the tuple
names and store it in a variable called name1, the second element in name2, as well
as the entire data of the first student including his/her name and score from the tuple
all_data in a new tuple called student1_data.
If we subset a single element from a tuple, it will become a variable of the type that
corresponds to the type of the extracted data. In this case, the variables name1 and
name2 are strings, since “Peter” and “Mary” are stored as string in the original tuple
names. And if we subset multiple elements from a tuple, the result will be a tuple
as well, and the data type of each element will also be taken over from the original
tuple.
In the above example, the tuple student1_data should be a subset of the tuple
all_data with the element indexed 0:2, where index 2 is excluded. In other words,
SU2-6
ANL252 Data Types and Functions
it only contains the elements 0 and 1, which are the first and second elements: Peter
and 72.
We can also use negative indices to access the elements of a tuple starting from the last
element. That is, the index -1 indicates the last element, -2 the second last, etc.
Example (Cont’d): We would now like to extract the last element from the tuple names
and store it in a variable called name1, the second last element in name2, as well as
the entire data of the second student including his name and score from the tuple
all_data in a new tuple called student2_data.
Everything seems alright in the above output except that the last element of the tuple
student2_data, the value 86, is missing. The reason is that the last index is never
included. In our example, the sub-setting indices we wish to have are -2 and -1. But
since Python does not include -1, we only receive student2_data[-2], which is
“Mary” in this case. To overcome this dilemma, we need to leave the end index blank
after the colon. Python will interpret it as “take all indices until the end”.
SU2-7
ANL252 Data Types and Functions
This is the correct output that we wanted to obtain originally. Note that the “open
end” indexing also works for positive indices.
Example (Cont’d): Suppose we would like to change the first element in the tuple
names from “Peter” to “John”. We will receive an error message as result.
SU2-8
ANL252 Data Types and Functions
We will now try to add the name “John” to the tuple names. Since names has 2
elements, the index of the last element must be 1. Hence, we add the new element to
names by referring to a new index, namely 2.
Nevertheless, we are allowed to concatenate two tuples into a single tuple by connecting
them with a “+” sign.
Example (Cont’d): Suppose we would like to concatenate the two tuples (names and
Scores) and name the new one newtuple.
The concatenation of tuples works in the same way as it works for strings. As a result,
we can re-attempt to add a new name "John" to the tuple names. The only thing we
need to change is to put "John" in a tuple first.
SU2-9
ANL252 Data Types and Functions
To put the new name “John” into a tuple is indeed tricky since Python would not
recognise syntaxes such as ("John") or "John" as a tuple. The reason here is that
all elements in a tuple must be separated by commas. As a result, we put a comma
behind "John" and leave the next element blank to tell Python that this is a tuple of
length 1. We can see that after concatenating newname with names, Python will ignore
the blank element in newname and append "John" as the only element in newname
to "Peter" and "Mary", the original elements in names.
tuple_length = len(tuple_name)
The length is also often used as an index to subset a tuple or control the indexing in our
code so that we cannot refer to indices that go beyond the largest index of a tuple.
SU2-10
ANL252 Data Types and Functions
Example (Cont’d): We use the len() function to determine the length of the tuple
all_data. We will then use it as an index to subset the last element of a tuple.
SU2-11
ANL252 Data Types and Functions
In Study Unit 1, we learned how to use the range() function to generate a list for the
for-loop to run over it. Generally, if there are tuples, lists, or dictionaries already created
and existing while the programme is running, we can use the for-loop directly by putting
the name of the tuple, list, or dictionary in the for-statement.
SU2-12
ANL252 Data Types and Functions
Example (Cont’d): Now we would like to print out each element of the tuple
all_data onto the screen. The name of the counter here is records.
1.2 Lists
Lesson Recording
Python Lists
SU2-13
ANL252 Data Types and Functions
As mentioned before, we must wrap the data in a list by square brackets. However, unlike
a tuple which we may omit the round brackets when defining it in our code as long as
the data are separated by commas, we must define a list with the square brackets in the
program. If we omit the brackets, Python will interpret the data as a tuple.
Same as tuples, lists may contain any type of values: floats, integer, Booleans, strings, or
more advanced Python types like lists. The last one is indeed a very interesting property
of the Python compound data type since we can namely put a list within a list, or a tuple
in a dictionary, and so on.
Example (Cont’d): We define two lists, names and scores, to store the data of the
students in the class.
Example (Cont’d): Here we will subset the lists that we have defined above into four
new lists and variables respectively.
SU2-14
ANL252 Data Types and Functions
In the first line (line 1 and 2 are not counted here), name1 = names[0], we simply
extract the first element of names and store it in name1. Since we have only extracted
one element, it will not be stored as a new list. Instead, it will become a variable, which
we can see from the screen output that the value "Peter" is not wrapped by a pair of
square brackets. And the type of new variable will be the same as the type of extracted
element, which is a string variable in this case.
In the second line, name2 = names[1:], we extract multiple elements from the
original list and save them as a new list called name2. Recall that if we use open-end
indexing – that is, we leave the value behind the colon blank – Python will take all
elements from the original list starting from the starting value until the end. Here,
it indicates that Python should extract the elements with the indices 1 and 2 from
names.
In the third line, score1 = scores[0:2], the first and second elements from
scores will be extracted and saved into a new list score1. We should always be
aware that the last index is never included in the indexing. As a result, only the
elements 72 and 86, and not 35 are extracted and saved into score1.
SU2-15
ANL252 Data Types and Functions
In the fifth line, score3 = scores[-2:-1], we extract the element with the indices
-2:-1 from scores. Since the last index is not included in the sub-setting, the only
relevant index here is -2. As a result, we should obtain the same result as the fourth
line. However, there is one significant difference between this line and the fourth line:
the new object here is still a list instead of a variable. In other words, if we want to
create a new list with a single element from another list, we will have to use multiple
indexing.
If we only intend to edit the items in the list, then the length of the indices on the left-hand
side (which is the number of items to be modified) must be the same as the length of the
list with the new values. If the list of new values is longer than the indices, items will be
added to the list on the left-hand side. On the other hand, if the list of indices is shorter,
items will be removed from the original list.
SU2-16
ANL252 Data Types and Functions
Example (Cont’d): Suppose the name of the third student is not John, but Jon, and we
would like to change it now. Once again, the third item has index 2.
Figure 2.14 shows how the third element has been replaced by a new value. Now,
suppose that the lecturer must deduct two marks from the students who scored 72
and 86 after rechecking their exam papers. In other words, we need to modify the first
two elements of the list scores.
In the second line of the output, the first value has been replaced by 70 and the second
one by 84. However, if our list [70, 84], does not only contain two values but three,
the output will become:
SU2-17
ANL252 Data Types and Functions
Python will not treat it as a syntax error. Instead, it will replace the values of the indices
0 and 1 by the first two new values, 70 and 84, add a new value to it (namely 60),
before returning to the rest of the original list. As a result, the list scores now has
four instead of three elements.
Let us turn to the opposite situation and assume that the indices for modification are
longer than the list of new values.
Same as before, Python will not treat the code here as a syntax error as well. Instead,
it will replace the values of the indices 0 and 1 by the only new value, 70, and then
return to the rest of the original list, which is 35, the last value. As a result, the list
scores now has two instead of three elements.
In the above examples, we can see that lists can be extended or shrunk with the
replacement of certain items in the original list by a longer or shorter list of new values.
SU2-18
ANL252 Data Types and Functions
Basically, Python uses the addition operation “+” to concatenate objects such as strings,
tuples, or lists, rather than carrying out arithmetic addition, except for numeric values.
Concatenating lists is a recommended step in Python programming if the nature, i.e.,
content and type, of the lists is identical.
Example (Cont’d): Suppose we have two classes for the same course. The student
names of these classes are stored in the lists class1 and class2 respectively.
Equivalently, the exam scores are stored in the lists scores1 and scores2. Now, we
would like to concatenate class1 and class2 into a new list called names and
scores1 and scores2 into a new list called scores.
Even if the nature of the two lists is not identical, it is sometimes still quite convenient to
store their data in one list for later use.
SU2-19
ANL252 Data Types and Functions
Example (Cont’d): Suppose we would like to concatenate the two lists – names and
scores – into a new list called combined_list in order to store the data of the entire
class in a single compound data source.
So, the new list consists of ten elements combined from the two lists names and
scores. Apparently, concatenating lists with different types of data, such as strings
and integers in this example, is allowed in Python.
Figure 2.19 shows a straightforward concatenation of two lists in Python. However, such
a combination of the two lists will lead to difficulty in distinguishing the original nature
such as meanings and types of the elements. For instance, in the above example, suppose
the first item in the list scores is the exam score of the first student in names – that
is, Peter’s exam score is 72. We will not be able to assign the score to the corresponding
name unless we know that each score always belongs to the name that is five positions
before. If we keep in mind that the length and contents of the concatenated list may change
every now and then, the meaning and source of each element will become more and more
untraceable with time.
Another way to solve this problem is to merge two lists into a new list without combining
the elements together. Instead, each element of the new list is a list and not a single value.
SU2-20
ANL252 Data Types and Functions
To merge two lists into a new list while keeping them as “list elements”, we cannot use
the addition operator as introduced before. Instead, we define the new list by putting the
list names instead of some values as the elements.
The advantage of this merging technique is that we will always know that index 0 of
the new list refers to the list names and index 1 to the list scores, if the new list is a
combination of two known lists in a fixed sequence. As a result, we will be able to trace
the origin and meaning of the data in the new list.
Example (Cont’d): Suppose we would like to put the two lists’ names and scores
into a new list called class_data and keep them as lists in the new list.
Figure 2.20 Merging two lists by keeping them as lists in the new one
In contrast to the concatenation technique where all elements of the new list are just
combined straightforwardly, we can see from Figure 2.20 that the elements of the list
names are wrapped by each one pair of outer and inner square brackets to indicate
that they belong to a list that in turn is an element of the new list class_data. The
same can be observed for the elements of scores too.
Note that this merging technique is not limited to merging lists. We can also merge two
tuples into a list and keep their types as tuple in the new list as well.
SU2-21
ANL252 Data Types and Functions
To access a single element in the merged list, we need to use the double index operator []
since the single index operator would return one of the original lists to us.
Example (Cont’d): Now, we extract the first element as well as the first name and
score from the merged list class_data and print them to the screen.
Here, we put in the print() function for class_data[0], which is the first element
of class_data. This element also happens to be the original list names. Then we try
to extract the first element from the first list within class_data by using the double
index operator class_data[0][0]. As a result, Python extracts the first element
in class_data first, which corresponds to the list with the names elements. From
there, Python extracts the element with the index 0, which is “Peter” in this case. The
same has also been carried out with class_data[1][0], which is the first element
of the second list. The resulting element is 72 here.
It is noteworthy that the indexing technique introduced in the above example also works
for multiple indexing.
SU2-22
ANL252 Data Types and Functions
of a list can be accessed by their indices, we can easily use loops to subset, print, and/or
modify them.
Furthermore, we learned how to generate a list of integers to serve as sort of a counter for
the iterations of the for-loops. Now, after being familiarised with the concept of lists, we
do not always need the range() function to create these integers for us. Instead, we can
simply use any available list as our counter. Nevertheless, the range() function can still
be very useful in some situations.
Example (Cont’d): Suppose we would like to print all the student names of the two
classes on the screen.
In this example, we print the elements of the list names line by line to the screen. In
each iteration, Python will assign the value of the current list element to the counter
variable i, and the print() function will print the value stored in i to the screen.
Suppose we would like to modify the screen output and use string formatting to print
the score and the student’s name in the same line by linking them up using more
natural language.
SU2-23
ANL252 Data Types and Functions
Since we want to run through two lists, names and scores, and print out their
corresponding elements, we cannot use one of the lists as our counter in the for-
loop. Thus, we need to use the range() function to generate a list of integers as
our counter. Most conveniently, the integers should be exactly corresponding to the
indices so that we can subset our lists within the for-loop directly. The obvious start
index is 0, and the end index would be the number of items in our lists minus one.
Since the range() function does not include the end value of the range in the result,
we can therefore simply take the length of our list for this purpose. Here, we can apply
the len() function to determine the length of names, and then store the result in the
variable listlen, which will in turn be taken as the end of our integer list for the
range() function.
Within the for-loop, we will have to instruct Python to print the element of names
and scores with the index i for each iteration. And we can subset them by
names[i] and scores[i], respectively.
In the above example, we need to extract the information from two separate lists. In
Chapter 1.2.4, we learned how to merge two lists into one single list where the original
lists are kept as list elements in the new one, which makes the coding easier since we only
need to work with one list. Instead of using the syntax we learned there, we can also use
loops to carry out simple or complicated merging of lists.
SU2-24
ANL252 Data Types and Functions
Example (Cont’d): Suppose we would like to extract the first student’s name and score
from the two lists and put these data into a small list. We then do the same for the
second student, etc. Eventually, all these small lists will be merged into a single list.
On line 12, we create a temporary list to store the pair of values that we have extracted
from names and scores. This temporary list templist will then be appended to
finallist, our target list, in line 13. This line is essential for the whole process as
we would like to eventually merge all the information into a single list.
However, it is important that we define finallist as an empty list before the for-
loop starts. To define an empty list, we simply assign a pair of square brackets [] with
no content to our target list. If we did not define our target list first, line 13 would
produce a syntax error as Python would not know what finallist is.
Furthermore, the instructions inside a loop are identical in every iteration unless we
add an if-condition in it. If we defined finallist as an empty list inside the for-
loop instead, the list would be re-initialised and become empty in each iteration.
And if we simply have finallist = templist in line13, the complete content of
finallist would be replaced in each iteration. The final output would then only
consist of the data of the last student. It is therefore important to accumulate data in
every iteration by appending templist to finallist.
SU2-25
ANL252 Data Types and Functions
This arrangement of finallist is perhaps the most natural way to store the data.
In each small list, we have the data of each student. In other words, if we extract an
element of the list finalist, we will have all the data of the one corresponding student,
which can be quite convenient when dealing with these data further.
After creating finallist, we can use a for-loop to run through it. Within the for-
loop, Python will extract one element from finallist in each iteration and store it
in the variable i. Note that i is not a counter variable here. It is more like a temporary
storage for the current element of the list that is being run through. Since each element
of finallist is a list itself with two elements, we can subset them by i[0] and i[1]
respectively. From the construction concept of each of these small lists we know that
the first element is the student’s name, and the second element is the corresponding
exam score. As a result, all we need to do is to put the subsets at the right place of the
formatted string for screen output.
SU2-26
ANL252 Data Types and Functions
Example (Cont’d): The user is now asked to enter the student’s name and score, and
s/he can stop entering by pressing ENTER for either the name or the score. After the
entering process, the list will be printed on the screen for checking.
In this example, we use several elements that we have learned in this and the
previous study units to provide a clear input environment to the user and the suitable
functionality for data storage in a list.
SU2-27
ANL252 Data Types and Functions
• The programme begins with the definition of an empty list called finallist
that will be used to store the data eventually. This is the same step as in line 9
of Figure 2.25.
• We initiate two variables before the while-loop starts: the counter variable i,
which is set to 1 initially, and the Boolean variable proceed which controls
whether the while-loop should continue with the next iteration or not.
• The while-loop will continue to run as long as proceed is True. Its value will
only change from True to False if the user’s input of either the name or the
score is an empty string "" – i.e., ENTER.
• We print an instruction to tell the user what to do before the loop starts.
This instruction can also be placed within the while-loop. In that case, the
instruction will be printed on the screen in every iteration.
• We start the printing string in line 4 with an escape sequence “\n” to create a
blank line between the system prompt and our instruction. It is visually more
comfortable when the texts are not put too close together. We have built in
several escape sequences for line break in the programme to create blank lines
for the same purpose.
• In line 6, we include the value of the counter variable in the formatted string to
show the number of the student whose data the user is about to enter.
• The if-condition in line 7 checks whether the user’s input for the student’s
name is an ENTER key or not. If it is the case, the programme will switch
proceed to False and break from the while-loop by the command break. For
the mechanism of breaking from the while-loop, the line proceed = False is
insufficient since the remaining instructions within the while-loop will still be
carried out, although the user intends to stop the entering process immediately.
As a result, we need to add the break-command to it. Nevertheless, for the
sake of programming “cleanliness”, we also add the line to change the value of
proceed to False.
• Starting from line 10, we implement a while-loop controlled by a Boolean
variable called valid_input to check whether the user’s input of the score is
SU2-28
ANL252 Data Types and Functions
numeric or not. This part of the code is mostly taken over from Chapter 5.4 in
Study Unit 1. Here the user is allowed to press ENTER – i.e., an empty string
to quit the entering process. As a result, we cannot convert the user’s input to
float or integer in the same line as the input() function, or else the if-condition
in line 13 could become invalid.
• For the exit mechanism, both syntaxes proceed = False and break are not
redundant. The break command is used here to break out from the while-
loop that controls the numeric input, and not the outer while-loop for entering
data of multiple observations. The change of value for proceed from True to
False prevents the merging of list in lines 22-25. Hence, the programme will
jump to the end of the while-loop and break from it due to proceed = False.
• The conversion of the variable std_score to a numeric value will be put in
the try-block to prevent Python from stopping the programme due to the
occurrence of an error. If the input of std_score is indeed a number (but
stored as string temporarily), the control variable valid_input will turn
to True and the while-loop will stop iterating. However, if the input is a
character string (with exception of an empty string), Python will print the
warning message that is written in the except-block, and the while-loop will
start a new iteration.
• In the formatted string of the input() function to enquire for the exam score
in line 12, we embed the student’s name that has just been entered by the user
in line 6. This can prevent users from entering the score of another student due
to the visual confusion caused by the mass of information on the screen.
• The same mechanism to store the data in templist and then to append them
to finallist as in Figure 2.25 is employed here in line 23 and line 24. But the
whole procedure will only be carried out if proceed = True. That is, if the
inputs of both the name and the score are not empty strings.
• We increase our counter variable by 1 after all the data for the ith student has
been typed in and confirmed by the user.
SU2-29
ANL252 Data Types and Functions
• After the entire entering process, the programme will print the entire list to
the screen for checking purpose. However, this will only be executed if the
user has at least entered the entire set of data for one student. This control is
implemented in case the user has already quit the entering process during the
first iteration and prevents us from printing an empty list. We have also built
in several escape sequences for line breaks here to enhance the visual comfort
of the output.
Read
For details and examples on lists and on operations applicable to lists, read the
following official Python documentation:
Python Software Foundation. (2023d). Built-in types: Sequence types – list, tuple, range.
https://docs.python.org/3/library/stdtypes.html#sequence-types-list-
tuple-range
Also, read the following section of the textbook on looping over elements of lists:
Read the following section of the textbook on accessing, adding, removing, and joining
elements of lists:
SU2-30
ANL252 Data Types and Functions
Lesson Recording
Python Dictionaries
1.3 Dictionaries
Unlike lists and tuples, which are indexed by a range of numbers, dictionaries are indexed
by keys which are usually strings or numbers. As a result, we use keys to subset a
dictionary instead of indices.
value = dictionary_name["key"]
Hence, we need to use the keys to extract values from the dictionary.
Example (Cont’d): Previously, we store the name and score of a student in a small list
and merge all these “sub-lists” into a big list subsequently. Now, we save our data of
class1 in a dictionary instead. In the following, we will use the student names as
the key and their scores as the value.
SU2-31
ANL252 Data Types and Functions
To extract the score of Peter, we can put the key, "Peter", in the index operator [].
Note that if the key that we would like to refer to is a string, we will have to put it
within a pair of quotation marks.
The advantage of dictionary is that we can choose a much suitable description for the
value and use it as the key. The programmers or users do not need to rely on their memory
or understanding of the code to trace back the nature and type of the values in a tuple or
a list like our examples in Chapter 1.2.4 and Chapter 1.2.5.
dictionary_name.keys()
Equivalently, the .values() method can extract all values from a dictionary.
SU2-32
ANL252 Data Types and Functions
dictionary_name.values()
Method is like a function to carry out certain actions on the object before the dot (.). We
will have detailed discussion on methods and functions in later chapters. Here, the object
is a dictionary, and the methods are .keys() and .values(). Applying these syntaxes
to any dictionary, Python can extract all the keys and values from it.
Example (Cont’d): Suppose we would like to obtain all the keys and values separately
from the dictionary class1 defined in Figure 2.27.
The method .keys() returns an object called dict_keys() with the keys in our
dictionary, and the method .values() returns an object called dict_values()
with the values. However, we cannot work with these keys and values yet since we
cannot extract them from the dict_keys() and dict_values() objects directly.
We need to transform them to list by a function called list() first.
SU2-33
ANL252 Data Types and Functions
The keys and values are now stored in their corresponding lists, recognisable by the
square brackets around them. From the techniques introduced in Chapter 1.2, we can
print the keys line by line now.
Figure 2.31 Print all the keys from a dictionary line by line
We can apply the same technique to print the values line by line too. Nevertheless, we
can also use the index operator [] on the original list class1 to print the keys and the
corresponding values. To extract the keys individually, we need to initiate a for-loop
that iterates through the list containing the dictionary keys, which is class_keys
here. In each iteration within the for-loop, one key of the list will be stored in the
variable i. The value can then be extracted by class1[i].
Figure 2.32 Print all the keys and values from a dictionary line by line
We put two objects separated by comma in the print() function. This is a command
to tell Python to print these objects in the same line to the screen. And the values
assigned in these objects are separated by an empty space between them.
If we want to extract all keys and values from a dictionary in the same step, we can apply
the .items() method on a dictionary.
SU2-34
ANL252 Data Types and Functions
dictionary_name.items()
Same as the .keys() and .values() methods, the result returned by Python from the
.items() method is not an object that can be accessed directly. Nevertheless, it can be
converted to a list of keys and values by the list() function.
Example (Cont’d): We apply the .items() method on class1 to obtain all the keys
and values of the dictionary.
The first print() function prints the dict_items() object returned from the
.items() method to the screen. This object contains every pair of keys and values
from the dictionary stored in a tuple. In the second print() function, we convert the
dict_items() object to a list. The printed object is a list with each pair of keys and
values stored in a tuple as its elements.
The list created from a dict_items() object contains tuples that have two elements in
each of them: the keys and the values. We can easily use a for-loop and the index operator
to print the contents. Here, we introduce an extension of the for-loop so that we can omit
the indices when referring to the keys and values.
SU2-35
ANL252 Data Types and Functions
Example (Cont’d): Now we use the double storage variables in a for-loop to print
out the keys and values of our dict_items() object after being converted to a list.
In line 2, we extract the dict_items() object and convert it directly to a list named
class_items. The two storage variables of the for-loops are key and val here.
And in the print() function, we use a formatted string to print out the two values
in a normal sentence for the user to read.
dictionary_name["key"] = value
SU2-36
ANL252 Data Types and Functions
Example (Cont’d): Suppose the score of Peter was 70 instead of 72. We can simply
change it by assigning a new value to the key “Peter”.
There was then a second exam and each of our three students has a new score in
addition to the scores that are already stored in the dictionary. Below is one rather
trivial approach to carry out this task.
Figure 2.36 Add items to every dictionary key and convert values to lists
The above approach replaces the original values in the dictionary with some lists. Each
list contains two scores for each student. Therefore, we must include the first score of
each student in the list as well, which is not quite elegant in terms of programming
conciseness. Another possibility is to use a for-loop to convert the value of each key
and append the new score to it.
Figure 2.37 Print items from a dictionary after converting to lists using for-loops
SU2-37
ANL252 Data Types and Functions
within the for-loops much easier to handle. The list that is used for the for-loop to
iterate is the key list of the dictionary class1. In each iteration, the variable i will
store one key from the list. Since both dictionaries have identical keys, we can use
the dictionary sub-setting technique on both dictionaries by putting i in the index
operator. Subsequently, we merge the extracted values from both dictionaries to one
list, and then assign it to the corresponding key of class1. Eventually, we print out
the edited dictionary for checking purpose.
del object
Note that the object in this syntax does not only refer to a key in a dictionary. It can be any
Python object such as a variable, a list, a tuple, a dictionary, etc. Once an object is deleted,
it will no longer be available in the programme.
Example (Cont’d): Suppose the key “John” was a typo when entering the data. It
should be “Jon” instead.
SU2-38
ANL252 Data Types and Functions
First, we assign the value of “John” in the dictionary class1 to a new key called
“Jon” of the same dictionary. Subsequently, we delete the key “John” and its value as
an entire object from the running program.
On the other hand, if we want to add a new key to a dictionary, we can simply assign a
value to a new dictionary key.
Basically, the syntax to add a new key is just the same as the syntax to edit an existing key.
The only difference is that the key put in the index operator must be a new one, if we wish
to add a new item.
Example (Cont’d): Suppose we would like to add a new student, Michael, who scores
60 in the exam, to our dictionary.
The first syntax is a new option available from Python version 3.9 onwards. (Recall from
Study Unit 1 that this study guide is written based on Python version 3.9.) It uses the
SU2-39
ANL252 Data Types and Functions
“Bitwise Or” operator, which is a vertical line “|”, to merge two dictionaries. The second
syntax is an option available from Python version 3.5 onwards.
Example (Cont’d): Suppose we have a second dictionary with the exam results of a
second class, and we would like to merge these two dictionaries together.
SU2-40
ANL252 Data Types and Functions
Read
Read the following section of the textbook on creating and manipulating dictionaries:
Read the following official Python documentation for more details and examples on
dictionaries:
SU2-41
ANL252 Data Types and Functions
Lesson Recording
The following table contains all the built-in functions in alphabetical order.
SU2-42
ANL252 Data Types and Functions
__import__()
Some of the listed functions are rather straightforward such as abs(), sum(), round()
, etc. There are also some that are quite unclear in terms of their functionality or area
of use such as frozenset() or staticmethod() just by looking at their names. You
can visit the website https://docs.python.org/3/library/functions.html to get a detailed
explanation on how to integrate and apply all these functions in Python programs.
SU2-43
ANL252 Data Types and Functions
Example (Cont’d): Suppose we would like to summarise the combined exam results
of class1 and class2 by calculating their mean, maximum, and minimum.
In the first three lines, we repeat the syntaxes from Chapter 1.3.2 to merge two
dictionaries into a new one called all_classes. Subsequently, all the exam scores
are extracted by the .values() method from all_classes. In line 5, we use the
sum() function to add up all the numbers stored in scores and divide the result
by the number of elements in scores, determined by the len() function, to obtain
the mean exam scores. The mean will then be rounded to two decimal places by the
round() function when embedding it in a formatted string for printing purpose. In
the last two lines, we determine the highest and lowest exam scores by the max() and
min() functions and include them in the formatted strings for the print() function
to print them onto the screen.
You may wonder why some rather basic functions such as a function for the calculation
of the mean is missing in the above list. Some of those functions may be included in some
common packages that will be introduced in Chapter 4. Some of them could be built-in
methods instead, which will be introduced in the next section.
Read
Read the following website for more details and examples on Python methods for
strings:
SU2-44
ANL252 Data Types and Functions
python_ref_string.asp
SU2-45
ANL252 Data Types and Functions
SU2-46
ANL252 Data Types and Functions
SU2-47
ANL252 Data Types and Functions
SU2-48
ANL252 Data Types and Functions
SU2-49
ANL252 Data Types and Functions
SU2-50
ANL252 Data Types and Functions
Read
Read the following website for more details and examples on Python methods for
strings:
python_ref_string.asp
Read the following website for more details and examples on Python methods for
lists:
python/python_ref_list.asp
Read the following website for more details and examples on Python methods for
dictionaries:
python/python_ref_dictionary.asp
Read the following website for more details and examples on Python methods for
tuples:
python_ref_tuple.asp
SU2-51
ANL252 Data Types and Functions
Lesson Recording
In Chapter 2, we are introduced to some built-in functions and methods that are already
included in the Python programming environment. Functions and methods help us to
carry out routine tasks which would require us to write very lengthy code to achieve
the same functionality if we were to create the programme by ourselves. Nevertheless,
sometimes we can also write our own functions that suit our own needs.
A user-defined function (we will call it function in the following due to simplicity) can
be viewed as a separate part of the code that will not be interpreted by Python until it is
called from the main programme. Usually, a function consists of four parts:
SU2-52
ANL252 Data Types and Functions
Same as for-loops or if-conditions, the def-statement must end with a colon, and
all the follow-up instructions and codes that belong to the function must be indented.
Subsequently, the function can be called in the main programme by integrating the
function_name at an appropriate place.
… (Main program)
y = function_name(argument1 = object1, argument2 = object2,
…)
print(y)
… (Continue with main program)
We use a function to carry out a certain process. The objects object1 and object2 are
the corresponding input for argument1 and argument2 to the function. It is essential
for Python that object1 and object2 are already defined somewhere in the previous
part of the main programme. The output object (or value) from the function will then be
assigned to the variable y.
Caution! It is tempting for beginners to “outsource” some parts of the main programme
and make them separate functions for the sake of “programme cleanliness”. Though
the code may look more structured at first sight, the debugging process can be quite
challenging if the whole programme is jumping between the main programme and the
functions. The rules of thumb for where user-defined functions are appropriate would be:
1. if the same routine, probably with different arguments, appears more than once
in the main programme
2. if several functions should be combined into one which will then be used in the
main programme on multiple occasions
3. if a function can really increase the efficiency of the main programme
SU2-53
ANL252 Data Types and Functions
In these cases, functions can really simplify the programme code and the debugging of it
since there are less chances for syntax or logical errors.
The second function sort_dictkeys(dicts) sorts the dictionary keys directly. The
dictionary object is not a specific dictionary from the main programme, but a variable
called dicts that is used as the argument in the definition of the function and is also
only used within the function. The main programme will then pass on a dictionary
to sort_dictkeys and store the dictionary in dicts once it starts to process the
instructions inside the function. It is noteworthy to mention that we can call another
function within a function like in line 5, where we call list_dictkeys(dicts) to
generate the list of dictionary keys for further process.
The main programme does not become significantly shorter but simpler, in terms of
the number of functions and methods involved.
SU2-54
ANL252 Data Types and Functions
Read
Read the following three exercises of the textbook for more details and examples on
user-defined functions:
SU2-55
ANL252 Data Types and Functions
Lesson Recording
Since the students are using Anaconda for this course, the content below is for the students’
information only.
The packages of the standard library are already installed in the Python environment.
A library is a collection of codes for us to perform specific tasks without writing our
own code. But before we can use the modules in our programme, we need to import the
package or a specific module of the package first.
In the first syntax, we import the whole package into our programme. The alias is a
name that is used to refer to that particular package from thereon in our programme.
It is advantageous to use a package alias if it has a very long name. Note that the “as
SU2-56
ANL252 Data Types and Functions
package_alias” part is optional in the import syntax. If the original package name is
preferred, this part can be omitted.
The second syntax imports a particular module from a package. The alias here is the
referral name of the module that we will use in our programme, and not the package.
Once again, the alias part is optional and can be omitted.
“pip” or “pip3” refers to the installer program that Python uses for installing external
packages. Basically, “pip3” is a newer version of “pip”. In most of the cases, we can use
either one for our installation.
For instance, the package “numpy” will be needed in the next few study units and we
would like to install it for our class preparation:
SU2-57
ANL252 Data Types and Functions
Once ENTER is pressed, Python will download the package installation file and install it.
The message “Successfully installed xxx” will appear on the screen once the installation
has been completed.
Another package that we will need in the next study units is “matplotlib”. This time we
install it with pip3.
SU2-58
ANL252 Data Types and Functions
In the same installation process, some other packages are downloaded and installed as
well. This is because “matplotlib” needs some modules of these packages so that it can
work.
In Python, there are some packages that are used quite commonly for data analytics. A
couple of them will also be covered in this study guide.
SU2-59
ANL252 Data Types and Functions
SU2-60
ANL252 Data Types and Functions
Summary
In this unit, we have learned three types of built-in compound data structures of Python:
tuples, lists, and dictionaries. We discussed the major differences among these objects and
the typical applications of them. Their creation and modification were also explained and
demonstrated in detail. The most important issue here is the technique for sub-setting and
indexing the elements in these objects. We were also introduced to functions, methods,
and packages in Python. While some of them are built-in – i.e., they are already included
in the Python environment – others can be user-defined or installed from external sources.
Based on examples, we have been shown how built-in functions and methods, user-
defined functions, as well as standard or external libraries can be applied to our Python
programmes.
SU2-61
ANL252 Data Types and Functions
Formative Assessment
a = 1, 4, 9, 16
a[2] = 3
print(a)
a. 1, 3, 9, 16
b. 1, 4, 3, 16
c. 1, 4, [3, 9], 16
d. Syntax Error
a = []
for i in range(0, 5):
a = a + [i ** 2]
print(a)
a. [0, 1, 2, 3, 4]
b. [0, 1, 4, 9, 16, 25]
c. [0, 1, 4, 9, 16]
d. [0, 1, 2, 3, 4, 5]
SU2-62
ANL252 Data Types and Functions
b.
d = {“north”: 2, “south”, -2}
dkeys = list(d.keys())
print(d[0])
c.
d = {“north”: 2, “south”, -2}
dkeys = list(d.items())
print(dkeys[0])
d.
d = {“north”: 2, “south”, -2}
dkeys = list(d.values())
print(dkeys[0])
SU2-63
ANL252 Data Types and Functions
c. .replace()
d. .strip()
9. Which of the following statements is correct when using alias for importing package/
module?
a. We can use both the original name and alias to refer to the package/module in
our programme.
b. An alias is optional and can be omitted if we are comfortable to work with the
original package/module name.
c. An alias must be shorter than the original package/module name.
d. If only a single module from a package is imported, the alias refers to the
package and not the module.
SU2-64
ANL252 Data Types and Functions
SU2-65
ANL252 Data Types and Functions
Formative Assessment
1. What is the output of the following programme?
a = 1, 4, 9, 16
a[2] = 3
print(a)
a. 1, 3, 9, 16
b. 1, 4, 3, 16
c. 1, 4, [3, 9], 16
d. Syntax Error
a = []
for i in range(0, 5):
a = a + [i ** 2]
SU2-66
ANL252 Data Types and Functions
print(a)
a. [0, 1, 2, 3, 4]
Incorrect. In each iteration, i square will be put into a list and then appended
to a. The values of i are 0, 1, 2, 3, 4.
Incorrect. Though i square is taken here, the last value is out of range since
the for-loop will stop running at i = 4.
c. [0, 1, 4, 9, 16]
d. [0, 1, 2, 3, 4, 5]
Incorrect. The last value is out of range, and the values in a are i and not i
square.
a.
d = {“north”: 2, “south”, -2}
dkeys = list(d.keys())
print(dkeys[0])
b.
d = {“north”: 2, “south”, -2}
dkeys = list(d.keys())
SU2-67
ANL252 Data Types and Functions
print(d[0])
Incorrect. In the print() function, it refers to the object d and not dkeys.
And since d is dictionary, we can only access its element by the keys and not
the indices.
c.
d = {“north”: 2, “south”, -2}
dkeys = list(d.items())
print(dkeys[0])
Incorrect. The .items() method is used to extract the keys and values of
a dictionary and store each pair of them in a tuple. dkeys[0] will return
(“north”, 2) as its result.
d.
d = {“north”: 2, “south”, -2}
dkeys = list(d.values())
print(dkeys[0])
Incorrect. The value of a dictionary can also be tuples, lists, or other object
types.
SU2-68
ANL252 Data Types and Functions
a. .get()
SU2-69
ANL252 Data Types and Functions
b. .lower()
c. .replace()
d. .strip()
Incorrect. It is a method for string variables too. It removes all empty spaces
at the beginning and at the end of a string.
a. Function name
b. Arguments
Incorrect. Arguments are input from the main programme to the function
and they are optional. A function can still be carried out without arguments
provided.
c. Loops
Incorrect. A function does not need loops for its functionality. Its instructions
can contain loops, but it would still work without them.
d. Return value
Incorrect. A function can also be carried out even if it does not return any
value to the main programme.
SU2-70
ANL252 Data Types and Functions
Incorrect. It is a good habit to implement functions when they can really carry
out recurrent tasks and return some objects/values to the main programme
based on some input arguments originated from the previous parts of the
main programme.
9. Which of the following statements is correct when using alias for importing package/
module?
SU2-71
ANL252 Data Types and Functions
a. We can use both the original name and alias to refer to the package/module
in our programme.
Incorrect. There is no guideline for the length of the alias. It can even be longer
than the original package/module name.
d. If only a single module from a package is imported, the alias refers to the
package and not the module.
Incorrect. The alias refers to the module and not the package in the “from …
import … as …” syntax.
SU2-72
ANL252 Data Types and Functions
library/functions.html
library/stdtypes.html#sequence-types-list-tuple-range
docs.python.org/3/library/stdtypes.html#typesmapping
Python Software Foundation. (2023d). Built-in types: Sequence types – list, tuple,
range.https://docs.python.org/3/library/stdtypes.html#sequence-types-list-
tuple-range
www.w3schools.com/python/python_ref_dictionary.asp
www.w3schools.com/python/python_ref_list.asp
python/python_ref_string.asp
python/python_ref_tuple.asp
SU2-73
ANL252 Data Types and Functions
SU2-74
3
Study
Unit
Learning Outcomes
SU3-2
ANL252 Arrays and Plots
Overview
This study unit introduces two Python packages: NumPy and matplotlib. NumPy is the
fundamental package for efficient scientific computing with Python. We will learn how to
create NumPy arrays and how to use indexing and Boolean masks for sub-setting NumPy
arrays. We will also learn the NumPy functions to generate statistics on the data stored
in an array. Furthermore, we will learn how to use the “matplotlib.pyplot” sub-package
for data visualisation purpose. In particular, the functionalities available for plotting and
customising basic charts of data analytics will also be the focus of this study unit.
SU3-3
ANL252 Arrays and Plots
Lesson Recording
In Study Unit 2, we were introduced to various types of compound data such as tuples,
lists, and dictionaries. We also discussed in detail how to modify a list or store different
types of elements in a list. Despite being able to group lists or tuples in a superordinate list,
these types of compound data are basically still one-dimensional. Recall in one part of our
exam score example in the previous study units, the list of data to be analysed can consist
of either sub-lists with individual student scores from different subjects or the scores of
all students in one subject in each one of them. As a result, the data in this example are in
fact two-dimensional: the student dimension and the subject dimension.
Hence, lists and dictionaries are no longer sufficient to store multi-dimensional data for
analysis, and arrays should be used instead. Nevertheless, we can also replace lists and
dictionaries with arrays when it comes to one-dimensional data. Note that the shape of
a Python array must be rectangular – that is, the number of values in each row and each
column must be identical, and all values must be entirely of the same type (typically
numeric values or strings). Therefore, an array is not equivalent to a dataset in the
conventional sense since missing values and mixed data types are common features of
usual datasets. In Python, we can work with arrays using the “numpy” package.
SU3-4
ANL252 Arrays and Plots
import numpy as np
Recall that we can import a package with an alias, which will then be used as our reference
to the package in the further part of our programme. And for “numpy”, the most common
alias used in literature or online references is “np”.
In the above syntax, the array() function was attached to the np. prefix, which is
required if NumPy is imported using the import-statement. The prefix should be omitted
if we use the from … import … statement to import NumPy instead.
The data assigned to array_name in the array() function are originally stored in
various regular Python lists. Each list corresponds to a row of the array and the total
number of rows is therefore equal to the number of lists included in the array() function.
Additionally, the number of elements in each list must be identical. If the lists have
different lengths, an error message will appear. Note that it is compulsory to wrap all the
lists, separated by commas, in a pair of outer square brackets - again before putting them
into the array() function.
Example (Student score, cont’d): Suppose we have three lists and each of them
contains one student’s exam scores of three different courses. The three lists are [72,
SU3-5
ANL252 Arrays and Plots
73, 53], [86, 83, 90] and [35, 42, 51]. And we would like to store all these
scores in a two-dimensional array.
From the output of the array, we can see that each list in the array() function
corresponds to one row in the array, and not one column. Furthermore, NumPy uses
the outer square bracket to indicate the start and end of the array, and the inner square
brackets are used to wrap the data in each row.
Due to the mechanism of the array() function, it is crucial to make sure what features
of the data are represented by the columns and the rows before creating the arrays.
• Using index: Suppose we would like to get the second value of the first row in an
array. We can extract the element by array_name[0, 1], where array_name can
be any arbitrary name of an array. The index here starts with 0, same as Python List.
Recall that if we intend to do multiple indexing like start:end, the end index will
SU3-6
ANL252 Arrays and Plots
not be included in our array subset. Furthermore, negative indexing and open-end
indexing are also allowed here. Check on Chapter 1 of Study Unit 2 for more details
regarding indexing.
• Using Boolean masking: Suppose we want to get all values larger than 80. The
first step is to use array_name > 80 to produce a Boolean mask. The result is
a NumPy array with Boolean elements: True if the corresponding value is above
80; False if it is below. Subsequently, we can use the Boolean mask inside a pair
of square brackets to do sub-setting. Only those elements above 80, for which the
corresponding Boolean mask is True, are selected. If, for instance, there are two
values above 80, we will end up with a NumPy array with two values.
Since arrays can be multi-dimensional (ndarray), we can certainly subset every dimension
of it by using multi-dimensional indexing. The following syntax is used for subsetting
two-dimensional NumPy arrays:
array_name[row_index, column_index]
Basically, the usage of the index operator here is just like sub-setting a one-dimensional
Python list. The only difference is the two sets of indices in it: one for row indexing, and
the other one for column indexing. The resulting subset of the array can be a single value,
a row, a column, or an array with less rows and/or less columns. For multi-dimensional
NumPy arrays, the order of the indices in the index operator must follow the sequence
of the axes or dimensions, in an array. For instance, axis one of a two-dimensional array
refers to the rows and axis two to the columns.
Example (Cont’d): The array exam_scores created in Figure 3.1 contains data of
individual students in the row and data of each subject in the column. Suppose we
would now like to extract all the exam scores of the second subject.
SU3-7
ANL252 Arrays and Plots
To subset a column from a two-dimensional array, we must indicate indices for both
the rows and the columns, or else an error message would appear. The column index
is clearly 1 here since we intend to extract the second column of the array. The row
index must be multiple indexing since we would like to access the entire column. As
a result, open-end indexing starting from index 0 is the most appropriate way here to
access the elements of row 1 to row 3 of column 2.
In the next step, we would like to extract the exam scores of the first two students in
the last two subjects.
Here, we use negative indexing for sub-setting the column. Since our array has three
columns, -2 is the index of the second last column. We leave the end index here open
to instruct Python to “take every index until the end” which, in this case, is the last
column of the array exam_score.
SU3-8
ANL252 Arrays and Plots
Assuming that the passing mark is 40, we would now like to subset all the failed
exams. That is, we extract all exam marks below 40.
If we ask Python to compare an array with a numeric value, we will obtain a Boolean
mask as mentioned before. Comparing the Boolean values in Figure 3.2 with the array
created in Figure 3.4, the only True value found here is the score of the third student
in subject 1. To subset exam_score using the Boolean mask, we need to put the
condition exam_scores < 40 within a pair of square brackets.
The result is the only value in exam_scores that is smaller than 40, namely 35, which
is the score of the third student in the first subject. And it is also the only True value
in the Boolean mask.
We can also check various properties of an array using the following NumPy functions
and methods:
type(array_name)
array_name.ndim
array_name.shape
array_name.size
array_name.dtype
SU3-9
ANL252 Arrays and Plots
The type() function indicates the type of our array while the .ndim method returns
the array’s number of dimensions, which is usually 2 in our case. The .shape method
provides the number of rows and columns of the given array and the .size method
calculates the total number of elements in an array. The .dtype method shows us the
type of data contained in the array.
The array type returned from the type() function is a Python type output <class
'numpy.ndarray'>, where ndarray stands for n-dimensional array.
Another remarkable output is the values returned from the .size() method. In fact,
the .size() method returns a tuple (row_number, column_number). We subset
the corresponding result in the formatted string by using the index operator [].
SU3-10
ANL252 Arrays and Plots
The type of data returned by the .dtype() method is int32, a specific NumPy
integer type that fixes the length of an integer variable at 32 bytes. The usual integer
variable of Python has no fixed length and its type is simply called int.
Read
Refer to the three links below for more details and examples on the methods “shape”,
“ndim” and “size” of NumPy arrays:
docs.scipy.org/doc/numpy/reference/generated/
numpy.ndarray.shape.html#numpy.ndarray.shape
docs.scipy.org/doc/numpy/reference/generated/
numpy.ndarray.ndim.html#numpy.ndarray.ndim
numpy/reference/generated/numpy.ndarray.size.html#numpy.ndarray.size
Each function deals with specific types of variables. For instance, mathematical functions
such as log() and sqrt() can only be applied on arrays with numeric values, whereas
strip() and upper() are functions specifically designed for arrays with only strings in
them.
SU3-11
ANL252 Arrays and Plots
Function Description
Statistics
SU3-12
ANL252 Arrays and Plots
Function Description
Rounding
Arithmetic
SU3-13
ANL252 Arrays and Plots
Function Description
Trigonometric
Random Sampling
SU3-14
ANL252 Arrays and Plots
Function Description
SU3-15
ANL252 Arrays and Plots
Function Description
Example (Cont’d): Suppose we would like to compute some statistics of each subject’s
exam scores such as the mean, standard deviation, maximum, minimum, etc. We will
also have to round off all these statistics to 2 decimal digits.
SU3-16
ANL252 Arrays and Plots
For the statistical functions in NumPy, we usually need to specify the axis argument
in it. If axis = 0, the column statistics will be calculated. If, however, axis = 1, the
row statistics will be returned to us by the functions. Since we intend to compute the
statistics for each subject’s data, which are recorded in the columns of exam_scores,
we shall specify axis = 0 in this case.
For the round() function, we specify the argument named decimals to 2. Recall
from Chapter 3 of Study Unit 2 that it is a good programming habit to specify the
argument names of a function explicitly when assigning a value to it. It is therefore
important to always check the available arguments of a function carefully before using
it in the program.
Read
Refer to the three links below for more details and examples on the functions mean(),
median() and std() of the NumPy package:
SU3-17
ANL252 Arrays and Plots
reference/generated/numpy.mean.html#numpy.mean
reference/generated/numpy.median.html#numpy.median
reference/generated/numpy.std.html#numpy.std
Refer to the link below for more details and examples of functions provided by the
NumPy package:
Geeks for Geeks. (2021, June 28). N-dimensional array(ndarray) in numpy. https://
www.geeksforgeeks.org/numpy-ndarray/
SU3-18
ANL252 Arrays and Plots
Lesson Recording
We use the alias “plt” here which is also commonly found in literatures and websites,
since it is short and has a clear reference to “pyplot”. Note that the above syntax has only
instructed Python to import the sub-package ”pyplot”. All the other functions and sub-
packages of matplotlib are not imported.
The sub-package “matplotlib.pyplot” provides many different plot types. In Chapter 3.2,
we will discuss the most common ones: histogram, bar charts, and scatter plots. We can
customise our plots by changing colours, shapes, labels, axes, etc., according to our own
needs and taste. In the following, we will introduce some basic plotting techniques based
on a line plot.
SU3-19
ANL252 Arrays and Plots
To label the axes, we can use the xlabel() and ylabel() functions.
We can also set a title to the current plot using the title() function:
Another useful plot customisation is to define the text and location of the labels on each
tick of the x-axis and y-axis.
SU3-20
ANL252 Arrays and Plots
A list of labels assigned to the argument labels will be plotted on the locations defined
with the argument ticks. We can also rotate the labels in case they fit optically better to
the plot if they are slanted. A numeric value representing the degree of rotation can be
assigned to the argument rotation.
Python will wait for the show() function to actually display all figures.
plt.show()
Note: Since our programmes in this study unit are constructed and run in JupyterLab, the
plots will anyway be displayed if we put all the syntaxes for plotting in one cell so that they
can be executed in the same run. As a result, we will not actually need the show() function
in the last step. However, this function is the final instruction to display the figures if we
run the plotting syntaxes in the original Python programme.
Example (cont’d): We would like to plot the CGPA development of a student in the
last 4 semesters. The data are [3.2, 3.3, 3.4, 3.1].
SU3-21
ANL252 Arrays and Plots
In the first line, we instruct Python to create a line chart with a red line and black circle
markers. The data of the x-axis should be the semester number and the values on the
y-axis are the CGPA. We then name the axes “Semester” and “CGPA” respectively.
The location of the ticks’ labels on the x-axis must be 1, 2, 3, and 4 because these are
the only data of the x-axis. Since the CGPA usually lies within the interval of 0 and
4, we can create an integer list from 0 to 4 as our ticks’ labels on the y-axis. We know
from Chapter 5 of Study Unit 1 that we can use the range() function to generate
such a list. Here, we integrate it within the plt.yticks() function. In the final step,
we add a title to the line plot to highlight the topic of our chart using the title()
function. With the plt.show() function, we let Python generate the plot.
SU3-22
ANL252 Arrays and Plots
Read
Refer to the three links below for more details and examples on the functions plot()
of the “matplotlib.pyplot” package:
matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
The hist() function has many arguments to control the histogram layout. The above
introduction of the function only includes the most common ones. For instance, we can
decide how many bins (bars) and which range of the values it should contain. We can
also choose to have a histogram with horizontal bars by changing the orientation
argument. With the arguments such as rwidth (which represents the width of the bars),
align (with which we can position the bars between two ticks or on top of a tick), and
color, we can format the bars according to our needs.
Example (cont’d): Suppose the exam scores of the two subjects (taken by the same
students) are now completely available, and we would like to generate a histogram
for subject 1 to look at the distribution.
SU3-23
ANL252 Arrays and Plots
First, we import both the “NumPy” and “matplotlib” packages, and then we create
two lists with the exam scores of each of the subjects. Eventually, we create a NumPy
array that contains both lists as its elements.
In the first line, we instruct Python to create a histogram based on the scores of the first
subject’s examination, which are stored in the first row of the array exam_scores.
We set the range argument to be between 0 and 100 to ensure that extreme categories
such as 0-10 or 90-100 marks are also included in the chart, although their frequencies
could be 0. The number of bins is fixed at 10 here so that we gain an accurate image
of the distribution. The width of the bars is reduced from 1 to 0.8 so that they are not
SU3-24
ANL252 Arrays and Plots
touching each other, and they are placed between two ticks on the x-axis (align =
"mid") to indicate the score range each bar represents.
The axes are named “Scores” and “Frequencies”, according to their nature. A title is
also given to the histogram, which is labelled “Exam Marks Distribution”. The ticks
on the x-axis are placed with a gap of 10 marks between 0 and 100. We extended the
range to 105 since the right end is not included by the range() function.
Scatter plots are often used to study the relationship between two variables, which is
usually referred to as their correlation. The values of the first variable are plotted in the
x-axis and the values of the second variable in the y-axis. If the data dots are scattered
around the 45 degrees line of the chart, we can conclude that these variables are correlated
with each other.
SU3-25
ANL252 Arrays and Plots
Same as the hist() function, we only list out some of the most common arguments
of the scatter() function here. For instance, we can change the colour and style
of the markers, assign another colour to the markers’ edge, and adjust its width for
more sophisticated visualisation. You can refer to https://matplotlib.org/api/_as_gen/
matplotlib.pyplot.scatter.html for more available arguments to control the scatter plot
layout.
First, we instruct Python to create a scatter plot based on the array exam_scores.
The first row of the array contains the data of the x-axis, and the data in the second
row are values of the y-axis. The markers should be red circles and have black edges.
The axes are also named accordingly: “Subject 1” for the x-axis and “Subject 2” for the
y-axis. The title of the scatter plot is “Correlation between the Exam Scores of Subject
1 and Subject 2”. The ticks on the x-axis and y-axis are placed with a gap of 10 marks
between 0 and 100. We extended the range to 105 to include 100 in our chart since the
right end is not included by the range() function.
SU3-26
ANL252 Arrays and Plots
Read
For more details and examples on the hist() function of the “matplotlib.pyplot”
package, refer to:
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html
For more details and examples on the scatter() function of the “matplotlib.pyplot”
package, refer to:
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html
SU3-27
ANL252 Arrays and Plots
Summary
We have discussed two Python packages in this study unit: namely, NumPy and
matplotlib. They are the fundamental packages for efficient scientific computing and data
visualisation with Python respectively. We have learned the basics of the two packages
such as sub-setting and some functions to operate on NumPy arrays, and some functions
of matplotlib for plotting and customising basic charts for analytics such as line chart,
histogram, and scatter plot.
SU3-28
ANL252 Arrays and Plots
Formative Assessment
a.
[[1, 2, -1]
[0, 3, -2]]
a. [-1 -2]
b.
[[-1]
[-2]]
3. Which values will remain in the output, based on the following code?
SU3-29
ANL252 Arrays and Plots
a. [[0, 3, -2]]
b. [[3, -2]]
c. [[3]]
d. [[1, -1]]
5. Which of the following NumPy functions does not have element-wise effects?
a. cos()
b. exp()
c. fix()
d. var()
SU3-30
ANL252 Arrays and Plots
d. linestyle
10. Where will the bars of a histogram be placed if we set align = "left"?
a. On top of the lower boundary tick
b. On top of the upper boundary tick
c. Between the ticks of the upper and lower boundaries
d. To the left of the y-axis label
SU3-31
ANL252 Arrays and Plots
Formative Assessment
1. What is the output of the following programme?
a.
[[1, 2, -1]
[0, 3, -2]]
Incorrect. Since a pair of square brackets to wrap up both the lists within the
array() function is missing, it is not a valid programme.
Incorrect. Since the user intends to create an array with two rows. Hence, the
result cannot be a Python list with all the elements in it.
Incorrect. The user intends to create an array with two rows, the result cannot
be a Python list with two sub-lists in it.
d. Error message
Correct. Since a pair of square brackets to wrap up both the lists within
the array() function is missing, it is not a valid programme and an error
message will appear.
SU3-32
ANL252 Arrays and Plots
a. [-1 -2]
b.
[[-1]
[-2]]
3. Which values will remain in the output, based on the following code?
SU3-33
ANL252 Arrays and Plots
a. [[0, 3, -2]]
Incorrect. The only column index in this sub-setting is -2, which is the second
last column of the array. As a result, we cannot have three values remaining
in the output.
b. [[3, -2]]
Incorrect. The only column index in this sub-setting is -2, which is the second
last column of the array. As a result, we cannot have two values remaining
in the output.
c. [[3]]
Correct. The only column index in this sub-setting is -2, which is the second
last column of the array. Since the row index is 1, indicating the second row
of the array, the output here should be the value positioning in the second
column of the second row, which is 3.
d. [[1, -1]]
Incorrect. Since the row index is 1, indicating the second row of the array, the
output here cannot contain any value of the first row.
Correct. The .shape method returns the number of rows and the number
of columns as a tuple with two elements.
SU3-34
ANL252 Arrays and Plots
5. Which of the following NumPy functions does not have element-wise effects?
a. cos()
b. exp()
c. fix()
Incorrect. The fix() function rounds each element of a numeric array to the
nearest integer towards zero.
d. var()
Correct. The var() function calculates the variance of each row, each
column, or the entire array.
a. machine learning
b. web development
SU3-35
ANL252 Arrays and Plots
c. data visualisation
d. database management
a. range
Correct. The plt.plot() function does not have the range argument.
b. marker
Incorrect. The marker argument controls the marker style of a line plot.
c. color
Incorrect. The color argument controls the line colour of a line plot.
d. linestyle
Incorrect. The linestyle argument controls the line style of a line plot.
Incorrect. The plot cannot be rotated generally. For some types of plot such as
histogram or bar chart, their bars can be presented vertically or horizontally.
But the plot cannot be rotated completely.
Correct. We can rotate the labels of the ticks if we want them slanted.
SU3-36
ANL252 Arrays and Plots
Incorrect. It does not control the layout of the axis labels at all.
Incorrect. It does not control the layout of the main title at all.
10. Where will the bars of a histogram be placed if we set align = "left"?
SU3-37
ANL252 Arrays and Plots
Incorrect. If they were to be placed between the ticks of the upper and lower
boundaries, we should have set align = “mid”.
Incorrect. The align argument only controls how the histogram bars are
placed in relation to the ticks on the x-axis.
SU3-38
ANL252 Arrays and Plots
References
Geeks for Geeks. (2021, June 28.). N-dimensional array(ndarray) in numpy. https://
www.geeksforgeeks.org/numpy-ndarray/
stable/
reference/generated/numpy.mean.html#numpy.mean
reference/generated/numpy.median.html#numpy.median
reference/generated/numpy.ndarray.ndim.html#numpy.ndarray.ndim
reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape
reference/generated/numpy.ndarray.size.html#numpy.ndarray.size
matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html
matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html
matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
SU3-39
ANL252 Arrays and Plots
SU3-40
4
Study
Unit
Data Management
ANL252 Data Management
Learning Outcomes
SU4-2
ANL252 Data Management
Overview
This unit will introduce the key data structure for analytics in Python: the pandas
DataFrame. We will learn to develop Python programmes to import data from external
sources and convert them to DataFrames, and then to index and query these structures.
We will then deepen our understanding of the pandas package by learning its efficient
functionality on merging multiple DataFrames, identifying and dealing with missing data
and outliers, sorting, grouping and transforming data, as well as discretising numeric
variables to bins.
SU4-3
ANL252 Data Management
Lesson Recording
In the previous study units, we have learned the basic techniques of Python programming.
In this and the next study units, we will discuss in detail how Python can be used for data
management and data analytics.
The most common package for data management in Python is “pandas”. After installing
pandas using pip, we can import it in our programme by the following syntax:
import pandas as pd
Here, we use the alias pd to refer to the pandas package in our programs.
To start working with pandas, we need to have Python compatible datasets. Data
circulating in organisations or on the Internet are mostly saved as text files or worksheets.
Text editors, spreadsheets, and data management apps are popular tools for opening and
working with them. Pandas provides the same possibilities. The first step here is to load a
dataset in the Python environment and open it in the format of pandas. Suppose we have
a dataset from an external source saved as a .csv text file; we can import it by the following
pandas function:
DataFrame_name = pd.read_csv("csv_file_name.csv")
The content stored in the file “csv_file_name.csv” will then be assigned to the pandas
dataset object, or DataFrame, named DataFrame_name. The function read_csv() is
SU4-4
ANL252 Data Management
called a reader since it reads in specific format of data files and converts them to pandas
DataFrame.
Similar to the functions in NumPy or matplotlib, the read_csv() function has more
arguments than we list out here. We can adjust the execution of the reader to the
specifications of the .csv file with these arguments. For instance, we can specify the
character string of the delimiter, the row number in which the header is stored, the
path of the .csv file, etc. You can refer to https://pandas.pydata.org/docs/user_guide/
io.html#io-read-csv-table for further details.
Since .csv is not the only common file format of data files, pandas also allows the import
of other file formats such as Excel spreadsheets, SPSS data, or Stata data into Python by
providing the functions listed in the following table.
Table 4.1 Most common data file formats and the corresponding reader in pandas
SU4-5
ANL252 Data Management
We can then use the .head() method to display the first five rows of the imported dataset.
It is important to check whether the data has been accurately imported.
DataFrame_name.head()
Alternatively, we can also use the conventional print() function to display the whole
DataFrame. Nevertheless, this can be quite frustrating if the dataset contains many rows
and columns, and the output does not fit to the window properly. Another way to print
the whole DataFrame is to use the display() function or omit the function completely
and simply execute a syntax with only the name of the DataFrame.
SU4-6
ANL252 Data Management
1. age: the age of an individual. Its value can be any integer greater than 0.
2. workclass: a general term to represent the employment status of an
individual. Its value can be Private, Self-emp-not-inc, Self-emp-
inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-
worked.
3. fnlwgt: final weight. In other words, this is the number of people the entry
represents. Its value can be any integer greater than 0.
4. education: the highest level of education achieved by an individual. Its
value can be Bachelors, Some-college, 11th, HS-grad, Prof-school,
Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th,
10th, Doctorate, 5th-6th, Preschool.
5. educational-num: the highest level of education achieved in numerical form.
Its value can be any Integer greater than 0.
6. marital-status: marital status of an individual. Married-civ-spouse
corresponds to a civilian spouse while Married-AF-spouse is a spouse
in the Armed Forces. Its value can be Married-civ-spouse, Divorced,
Never-married, Separated, Widowed, Married-spouse-absent,
Married-AF-spouse.
7. occupation: the general type of occupation of an individual. Its value
can be Tech-support, Craft-repair, Other-service, Sales, Exec-
managerial, Prof-specialty, Handlers-cleaners, Machine-op-
SU4-7
ANL252 Data Management
Since the original data file is saved in .csv format, we can use the read_csv()
function to import the dataset to Python.
SU4-8
ANL252 Data Management
If we printed the census dataset using the print() function, the output would not
fit the window at all.
As shown in Figure 4.2, the DataFrame will be truncated anyway. Eventually, only the
head and the tail of the dataset will be displayed due to lack of space. As a result, it is
more advisable to use the pd.head() than print() for control purpose.
SU4-9
ANL252 Data Management
Figure 4.3 Printing the head and tail of the imported dataset by display()
The display() function prints out the first five and last five observations of the
DataFrame. The font is smaller than the one used by the print() function, and we
are therefore able to see all columns side-by-side and without line break. Furthermore,
it also shows the number of rows and columns in the DataFrame, which can be quite
helpful in some cases.
SU4-10
ANL252 Data Management
Read
Refer to the link below for more details and examples on the read_csv() function
of the pandas package:
pandas. (2023a). IO tools (text, CSV, HDF5, …). NumFOCUS, Inc. https://
pandas.pydata.org/docs/user_guide/io.html
Read the following website for more information regarding the US Census data,
including the explanation of the variable names and other useful information about
the data:
Department of Computer Science. (1996, October 8). The adult dataset. University of
SU4-11
ANL252 Data Management
Lesson Recording
Similar to Python lists or NumPy arrays, we can access a pandas DataFrame by using the
index operator []. In this chapter, we will introduce three ways to subset rows, columns,
or elements of a DataFrame.
Note that each variable name must be placed within a pair of quotation marks since it is
treated as a string in this case. If we simply want to access one column, we can omit the
creation of the list and put the variable name as a string inside the index operator directly.
SU4-12
ANL252 Data Management
Suppose we would like to select the target variable "income" alone and save the
column as a NumPy array for further calculation.
Figure 4.5 Select a single column from a DataFrame and save it as NumPy array
We can see that no list will be needed within the index operator if only a single column
is selected. After selecting the column, the resulting subset of the pandas DataFrame
will then be converted to an NumPy array by the np.array() function.
SU4-13
ANL252 Data Management
beginning of every row by pandas. It starts with 0 and ends with the number of rows in
the DataFrame minus one. As a result, rows can be queried by the numeric index position,
starting at 0, using the DataFrame attribute iloc.
DataFrame_name.iloc[start:end]
The indices in the index operator do not need to be consecutive integers. It can be any
integers within the range 0 and number of rows in the DataFrame – 1. But these integers
must be put in a list first if there are more than one of them. If we want to select a single
row instead, we can simply put one index in the index operator.
Example (Cont’d): Sometimes, we may need to inspect a dataset after running some
programmes to adjust it for further analytics purposes. We have been introduced
to the .head() method to print out the first five rows of the dataset. We can also
randomly pick a few rows from the DataFrame for our inspection. Here, we need
NumPy to draw random indices to select the rows for us.
SU4-14
ANL252 Data Management
In the first line of our programme, we determine the total number of rows in
the dataset census. Note that the method .shape() is also applicable to pandas
DataFrames and it returns a tuple (Total Row Number, Total Column Number)
to us. As a result, we can refer to the first element of the tuple as the total number
of rows in census. In the second line, we draw a total of 10 random integers from
the interval 0 and nrow, the number of rows in census. Note that nrow as the upper
boundary is not included in the drawing process at all. The integers drawn by the
random.randint() function will then be assigned to the object named randrow.
And these will then be used as the list of rows that we select from census.
Before we can select rows based on their index labels, we need to first create the index
labels by the method .set_index().
The parameter key can be either a single variable name (column label), a single array of
the same length as the calling DataFrame, or a list containing an arbitrary combination
of variable names and arrays. The argument inplace controls whether the DataFrame
should be modified in place or a new DataFrame should be created. If it is True, the
changes will take place in the original DataFrame.
SU4-15
ANL252 Data Management
Example (Cont’d): Suppose we would like to group the US Census data by the
occupation of the observations.
If we use one of the columns as the row index of our DataFrame, that column will not
be a regular part of the dataset anymore. As you can see from Figure 4.7, the variable
occupation has disappeared from the dataset.
Suppose we would like to change our row index from the occupation to the age group
of each observation. In the first step, we need to remove the occupation as our row
index by the .reset_index() method. It reverses the effect of the .set_index()
method and removes the current row index and converts it back to a column in the
DataFrame.
SU4-16
ANL252 Data Management
Then we re-code the age in each row to our target grouping and store it in a new array.
For observations less than 30 years, the group will be labelled as “Age <30”, whereas
the label of those between 30 and 59 years old is “Age 30-59”, and the rest receives the
label “Age 60+”. Subsequently, we assign this array as the new row index.
We first create a list named agegroup that contains the grouping of each
observation’s age. The list will then be converted into a NumPy array and used as the
row index subsequently.
The DataFrame rows can be queried by the row index labels using the .loc attribute.
We can see that selecting rows from a DataFrame by the .loc attribute works in a very
similar fashion to the column selection. The row labels must be indicated as strings and
put in a list if we want to select more than one of them. If we just want to select rows of a
single label, we can put the label as a string in the .loc attribute directly.
SU4-17
ANL252 Data Management
Example (Cont’d): Suppose we would like to select all the observations that are 30
years of age or younger from the US Census data.
And if we want to select the youngest and oldest age groups from census, we will
need to put the row labels in a list.
SU4-18
ANL252 Data Management
Read
Refer to the links below for more details and examples on the attributes
.set_index() and .reset_index() of the pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.reset_index.html
Refer to the links below for more details and examples on the methods .loc() and
.iloc() of the pandas package:
https://pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.loc.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.iloc.html
Below is a syntax that uses only the row and column indices for the cell selection.
SU4-19
ANL252 Data Management
DataFrame_name.iloc[row_start:row_end, col_start:col_end]
We can also select cells by referring to the corresponding row and column labels.
DataFrame_name.loc[["row_labels"], ["col_labels"]]
If we want to select the rows by index but the columns by labels, we can use the index
operator and .iloc attribute together.
DataFrame_name[["col_labels"]].iloc[row_start:row_end]
But if we want to select the columns by index but the rows by labels, we need to use both
the .loc and .iloc attributes.
DataFrame_name.loc[["row_labels"]].iloc[:,
col_start:col_end]
While putting the row labels in the .loc attribute, we need to be aware that the .iloc
attribute requires both the row and column indices. Since we do not intend to select the
rows by index, we can use the open-end index 0: or simply : here.
Example (Cont’d): Suppose we would like to select the first five observations of the
first five variables.
SU4-20
ANL252 Data Management
Such a selection only makes sense when we exactly know the positions of the variables
and observations. More common is the selection of cells based on labels. Suppose
we would like to select the observed values in workclass and income for all
observations younger than 30 and older than 60 years old.
Nevertheless, this selection method only works if the row labels are set and we can
refer to them in the .loc attribute. While column labels usually correspond to the
variable names, row labels are not necessarily used as row index in most of the
datasets. The row positions are therefore more useful in selecting cells in a DataFrame.
Suppose we would now like to select the observed values of workclass and income
from the first 5 rows.
SU4-21
ANL252 Data Management
If we want to select the observed values of the first two columns from the observations
younger than 30 years old, we will have to use the .iloc and .loc attributes at the
same time.
DataFrame_name[Condition]
SU4-22
ANL252 Data Management
We can also create more complex queries by using bitwise logical operators to chain
several conditions together.
The bitwise logical operators are similar to the logical operators. Instead of writing
and/or, we use & (bitwise and), | (bitwise or), or ~ (bitwise not) to combine our
conditions in the DataFrame queries. We can also add the bitwise not operator to the
above syntax if we want to negate any condition.
We need the bitwise logical operators here because we are actually creating a Boolean
mask for each condition within the index operator []. If there are two conditions, two
Boolean masks will be compared element-wise by the bitwise operator. The result of this
comparison is in turn a Boolean mask as well.
Example (Cont’d): Suppose we would like to select those observations that work more
than 40 hours per week.
SU4-23
ANL252 Data Management
In the next query, we would like to select female respondents from the DataFrame
that work more than 40 hours per week.
SU4-24
ANL252 Data Management
In the last query, we want to select female or non-white respondents who work more
than 40 hours per week.
SU4-25
ANL252 Data Management
In the first parentheses, we create a Boolean mask for observations where gender
is equal to "Female". In the second parentheses, the Boolean mask is created for
observations where race is not "White". These two masks are compared by the |
(bitwise or) operator. The resulting Boolean mask will then be compared by a Boolean
mask where the values in hours-per-week are larger than 40.
SU4-26
ANL252 Data Management
Lesson Recording
Figure 4.19 Concatenating two datasets with different rows but identical variables
In Python, we can use the .append() method to merge two DataFrames with identical
variables into one.
DataFrame_name.append(other = [OtherDataFrames])
SU4-27
ANL252 Data Management
The parameter other is used for the specification of those DataFrames to be appended
to DataFrame_name eventually. If we only have one DataFrame to be assigned to the
parameter other, we simply put its name without quotation marks behind other =. In
the case of specifying multiple DataFrames to the parameter other, we need to put their
names in a list.
Example (Cont’d): To study the different income groups in the US census data, the
data analysts decided to split the dataset into two. Observations with income “<50K”
will be saved in a new dataset named “census_low” and those with income “>50K” are
now saved in “census_high”. Now, after the datasets have been cleaned and studied
separately, both datasets should be merged again for some joint analyses.
Note that we can also apply the .append() method on census_low directly.
However, we must be very sure that we no longer need census_low with its original
data since there is no way to retrieve its original content after the appending process,
unless we can import the original dataset from an external source again. Furthermore,
if we appended census_high to census_low directly and re-ran the same code
SU4-28
ANL252 Data Management
Logically, the observations in the merged DataFrame do not follow the same order as
census since it does not play any role in this appending process at all.
Read
Refer to the link below for more details and examples on the .append() method of
the pandas package:
https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/
pandas.DataFrame.append.html
1. DataFrame A
SU4-29
ANL252 Data Management
Variable 1 Variable 2
1 Value 11 Value 12
2 Value 21 Value 22
3 Value 31 Value 32
2. DataFrame B
Variable 1 Variable 3
1 Value 11 Value 13
2 Value 21 Value 23
4 Value 41 Value 43
The inner join will return the rows where DataFrame A have matching keys in DataFrame
B. In this case, let Variable 1 be the matching key. Then the inner join resulting table should
be as follows:
Example: Suppose we have the two DataFrames (df1 and df2) regarding customers,
shops, and products, as demonstrated in the following two figures:
SU4-30
ANL252 Data Management
Then we inner/natural join the two tables based on the common variable
Customer_id, the DataFrame will keep all the rows that matched and all the
variables with the argument how=’inner’.
Read
Refer to the one link below for more details and examples on merging DataFrames
using the pandas package:
pandas. (2023b). Merge, join, concatenate and compare. NumFOCUS, Inc. https://
pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
SU4-31
ANL252 Data Management
Lesson Recording
In empirical studies, it often occurs that an observed value of a variable is missing. There
are many reasons for missing data: defective measurement tools, withdrawal from the
study, refusal of responses to sensitive questions, etc. In Python, there is the NoneType to
indicate missing data. Different packages have different ways to display a missing value.
For instance, pandas uses a special floating-point value for missing values, and NumPy
uses NaN which stands for “Not a Number”.
Missing data are not desirable for data analytics since they cannot be included in
constructing models, forecasting, etc. Statistical estimation of parameters can be biased.
In pandas, when we use statistical functions on DataFrames, missing values are typically
ignored by these functions. As a result, the execution of the code will not be interrupted,
but the computation of these functions could be due to unequal underlying sample sizes
for each variable.
SU4-32
ANL252 Data Management
certain strings to missing values directly while the data are being converted to pandas
DataFrame.
DataFrame_name = pd.read_csv("csv_file_name.csv",
na_values = "na_string",
na_filer = True/False)
The default value of the na_filter parameter is True. In this case, pandas will convert
all white spaces "" to NaN. However, there could be situations where white space is an
actual value of interest and not a missing value. The filter should then be turned off and
the value would be False.
With the parameter na_values, we can declare certain strings from our DataFrame to be
recognised as missing values. By default, strings like "", "#N/A", "#N/A N/A", "#NA",
"-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null" are treated as missing values and do not need
to be specified explicitly with this parameter.
Example (Cont’d): From Figure 4.1 and Figure 4.2, we can recognise that question
marks are used to indicate missing values in the US Adult Census dataset. Suppose
we would like to declare every cell that contains a question mark solely as a missing
value.
SU4-33
ANL252 Data Management
Figure 4.24 Declaring specific strings as missing values while importing data
In the read_csv() function, we specify a single question mark as string that should
be identified as a missing value in the census DataFrame. In Figure 4.24, we can
see that the values for workclass and occupation in the fifth row are now NaN,
whereas in Figure 4.1, they were simply "?".
Read
Refer to the link below for more details and examples on the parameters associated to
missing values in the read_csv() function of the pandas package:
pandas-docs/stable/reference/api/pandas.read_csv.html
SU4-34
ANL252 Data Management
If the number of missing values (aka NaNs) in a column is more than zero, we have then
identified the variable that has missing values. Such a variable with missing values may
introduce bias into our analytics tasks. And if the number of missing values in a variable
is large, the variable may not have sufficient data for reliable analysis.
DataFrame_name.isnull().sum(axis = 0)
DataFrame_name.isnull().sum(axis = 1)
The above syntax is in fact a Boolean masking. It contains two methods of the pandas
package. The .isnull() method instructs Python to check every cell of the DataFrame
and then return True if it is an NaN. Subsequently, Python should return the sum of each
row or each column of the Boolean mask. If the parameter axis = 0, the values in a
column are added up. And if axis = 1, we will get the sum of the row instead. The default
axis here is 0. Since True is usually represented by 1 and False by 0 when converting
a Boolean variable to a numeric value, the sum of a row or a column with only Boolean
values will therefore be the same as counting the occurrence of True in it.
If our intention is just to check the existence of missing values, we can use the .any()
method instead. The .any() method will return True if at least one of the elements in
the array returned by the .isnull() method is True.
DataFrame_name.isnull().any(axis = 0)
SU4-35
ANL252 Data Management
DataFrame_name.isnull().any(axis = 1)
We can retrieve the indices of the rows or columns with missing data by applying the
.index method on the resulting object from the syntax above.
object_name = DataFrame_name.isnull().any()
object_name[object_name == True].index
Counting the NaNs in columns has a different meaning than counting them in rows. When
we count the number of NaNs in columns, we are checking on the existence of missing
values in each variable. If they exist, we may need different approaches to adjust the
data for different types of variables. For instance, if they exist in a numeric variable, we
can replace the missing values by zero or by the mean of the variables. If a text variable
contains missing data, we may add a response category such as “no reply” to it. We can
also choose to neglect them if the variable is irrelevant for our analyses of the data.
By counting the NaNs in rows, however, we intend to identify those observations with
missing values in at least one of the variables. Depending on the analyses and the
importance of the observation, we can choose to delete the observation or to apply the
appropriate data adjustments to the affected columns.
Example (Cont’d): From Figure 4.25, we can identify two missing values in
workclass and occupation for the fifth observation. Now we would like to find
out whether there are more missing values in these two and other variables.
SU4-36
ANL252 Data Management
The output shows that workclass, occupation, and native-country are the
three variables with missing data. Their proportions of missing data are 2,799 (5.7%),
2,809 (5.7%), and 857 (1.8%) out of 48,842 observations, respectively. We need to
further study these observations to conclude on the adjustment to be applied to the
missing values. For this purpose, we shall look for all the observations, or their row
indices, with at least one missing value.
SU4-37
ANL252 Data Management
The output generated in Figure 4.26 is rather unsatisfactory since we only see the
returned Boolean value from the chained methods .isnull().any(axis = 1).
Furthermore, it is only useful to us if the rows with missing values are selected from
the DataFrame. As a result, we need to filter census with the results above.
First, we save the output generated in Figure 4.27 as an object named missrow. After
that, we use a Boolean mask to select only those “True” observations from missrow,
and the .index method will then return the corresponding row indices to us for
selection. Eventually, we apply these row indices to subset census. If we wish to
work on this subset of the DataFrame further, we can also assign it to an object in the
second line of our code.
From Figure 4.27, we note that there are a total of 3620 rows with at least one missing
value in workclass, occupation, and native-country. We can now study the
data of these observations and decide on the adjustment measures subsequently.
SU4-38
ANL252 Data Management
Read
Refer to the link below for more details and examples on the .isnull() method of
the pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.isnull.html
Refer to the link below for more details and examples on the .sum() method of the
pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.sum.html
Refer to the link below for more details and examples on the .any() method of the
pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.any.html
Refer to the link below for the index() method of the pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.index.html
SU4-39
ANL252 Data Management
To delete an entire row with missing values from the DataFrame, we have two options:
the .drop() and .dropna() methods.
With the .drop() method, we can delete an entire row or column by specifying the
corresponding indices resulting from the localisation methods introduced in Chapter 4.2.
The parameter axis indicates whether rows (0) or columns (1) should be dropped.
The .dropna() method combines the localisation and removal of rows or columns with
missing data in a single function. Its usage is rather convenient since we can omit using
the .isnull().any() and .index() methods before dropping the corresponding
observations or variables.
The drawback of the .dropna() method is the equal treatment for all missing values
throughout the entire dataset. As mentioned in the previous chapters, we have multiple
ways to adjust missing data for different types of variables. Depending on the observed
SU4-40
ANL252 Data Management
values of other variables, we may also want to keep some of the rows with missing data
while deleting others.
Example (Cont’d): Now we would like to remove all the rows with missing values
using the .drop() method.
Figure 4.28 Dropping observations with missing values using .drop() method
For the missrow object, the indices assigned to the index parameter in the .drop()
method are determined by the same syntax used in the index operator for missrow
in the same figure.
SU4-41
ANL252 Data Management
Figure 4.29 Dropping observations with missing values using .dropna() method
Another possibility in dealing with missing values is to replace them with a pre-
defined value. The most common values used for such purpose are 0 or the variable
mean. Some literature also suggests more sophisticated approaches such as interpolation,
extrapolation, or estimation. In Python, the pandas package facilitates replacement of
missing values by the .fillna() method.
DataFrame_name.fillna(value = repl_value)
DataFrame_name["column_label"].fillna(value = repl_value)
Basically, if we apply the .fillna() method on the entire DataFrame, it will replace all
missing values that Python could find with the value specified in the parameter. But if
we specify a column in the DataFrame and attach the .fillna() method to it, only the
missing values found in the corresponding variable will be replaced. By doing this, we
can treat missing data in various variable types differently.
SU4-42
ANL252 Data Management
Example (Cont’d): Suppose we decide to simply replace all missing values in the
DataFrame census by 0.
The output shows that all the missing data (NaN) in row 5 are now replaced by 0.
Nevertheless, it looks rather odd to have a value 0 in the variables workclass and
occupation. As a result, instead of replacing them by 0, we would rather replace
them with the string “Unknown”.
SU4-43
ANL252 Data Management
Read
Refer to the link below for more details and examples on the .drop() method of the
pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.drop.html
Refer to the link below for more details and examples on the .dropna() method of
the pandas package:
SU4-44
ANL252 Data Management
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.dropna.html
Refer to the link below for more details and examples on the .fillna() method of
the pandas package:
pandas.DataFrame.fillna.html
Basically, we can use statistics such as the inter-quartile range (IQR) to detect the existence
of outliers in a variable. Furthermore, visualisation like boxplots or histogram can also be
useful to examine the distribution of the variables.
In Chapter 2.2 of Study Unit 3, we learned how to use the sub-package matplotlib.pyplot
to draw a histogram in Python. The boxplot() function from the same sub-package
facilitates the creation of boxplots for outlier detection.
To compute the inter-quartile range, we can use the .quantile() method to determine
the first and third quartiles of the variable.
DataFrame_name["column_label"].quantile(q = quantile)
SU4-45
ANL252 Data Management
With the parameter q, which is a value between 0 and 1, we can define the quantile of the
distribution that the .quantile() method should return to us. Once the 0.25 and 0.75
quantiles of the target variable are obtained, the inter-quartile range iqr can be computed
by iqr = q3 – q1. An observation y is considered as outlier if y < q1 – 1.5 * iqr
or y > q3 + 1.5 * iqr.
The usual practice in dealing with outliers is to remove them from the dataset. In Python, it
suffices to keep observations that do not contain outliers in the target variable. The syntax
below generates a subset of rows that do not fulfil the above outlier condition.
Note that DF represents the DataFrame_Name and Col is the column_label. The
condition left from the bitwise or operator “|” selects all observations with values in
“Col“ smaller than q1 – 1.5 * iqr whereas the condition to the right selects those
observations larger than q3 + 1.5 * iqr. Nevertheless, this would be the combined
condition to select all the outliers. To invert the selection, we need to put the bitwise
not operator “~” before the entire condition, which must then be placed within a pair of
parentheses.
SU4-46
ANL252 Data Management
Based on the results in Figure 4.32, half of the sample works between 40 and 45 hours
weekly on average. The corresponding upper and lower thresholds to differentiate
outliers from “normal” data are 32.5 and 52.5, respectively.
In the next step, we can select those outlier observations for checking before dropping
them from the DataFrame eventually.
Before dropping those outlier observations from the DataFrame, we shall study them
more carefully. For instance, the observation with row index 48829 works for 60 hours
SU4-47
ANL252 Data Management
per week, which is much higher than the third quartile of the data. But according
to the variable workclass, s/he is self-employed. From this perspective, his/her
average weekly working hours seem sensible. Hence, this observation could be useful
for further analyses.
Nevertheless, to show how the syntax works, we will still drop all the outlier
observations from census that fulfil the above criteria with the following
programme.
SU4-48
ANL252 Data Management
Read
Refer to the link below for more details and examples on the .quantile() method
of the pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.
quantile.html
SU4-49
ANL252 Data Management
Lesson Recording
DataFrame_name.sort_values(by = [List_of_var_names],
ascending)
We can provide a list of variable names to the parameter based on which the DataFrame
will be sorted. The sorting hierarchy among these variables drops with the increasing
index in the list. If we set the parameter ascending to True, the values of the variables
given in the parameter will be sorted in the ascending order, and they will be sorted in the
descending order if it is False.
SU4-50
ANL252 Data Management
Example (Cont’d): Suppose we would like to study the relationship between the
individuals’ income, their educational level (which is represented by the numeric
variable educational-num in the DataFrame census), as well as their age. The
value in educational-num increases with the educational level of the individual.
For this purpose, we will sort the DataFrame census first by educational-num
in the descending order and then by age in the ascending order. That is, we will
see observations with the highest educational level first, while observations with the
lowest educational level will appear at the end. And in each educational level, we will
first find the youngest individual; and the oldest individual will be put as last in the
group.
In the above syntax, we used two variables for the sorting process. As mentioned, the
sorting hierarchy decreases with the index of the variable name in the list. That is, the
DataFrame will be first sorted by education-num, followed by age. As a result, we
also need a list of two Boolean values to instruct Python on how each of the variables
SU4-51
ANL252 Data Management
Read
Refer to the link below for more details and examples on the sort_values()
function of the pandas package:
https://pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.sort_values.html
DataFrame_name.groupby(by = [List_of_Labels]).anymethod()
With the parameter by we can specify a list of column labels, or variable names, based
on which the grouping should be conducted. These variables must be categorical so that
the number of groups is finite and limited. Attached to the .groupby() method can be
any method that we would like to apply on the grouped data. The list of such functions or
methods can be found in Table 3.1 of Study Unit 3 since the NumPy functions or methods
are also applicable to pandas DataFrames.
SU4-52
ANL252 Data Management
Example (Cont’d): Suppose we would like to compute the mean of the number of
working hours, capital gain and capital loss as well as the age for each age group
created in Figure 4.36.
Figure 4.36 Computing the mean of all numeric columns for grouped data
The .mean() method of pandas selects all columns of type integer or float and
compute their means in each age group. As a result, we obtain the group means of
fnlwgt and educational-num as well. To select only the relevant variables for the
group mean calculation, we can subset the census DataFrame first in the above syntax.
Figure 4.37 Computing the mean of selected columns for grouped data
As a result, the average age of the youngest group is 23.4, while those between 39
and 59 is 42.2 and those who are 60 and older is 66.5. Individuals between the age
of 30 and 59 have to work an average of over 43 hours per week while the average
working hours for youngsters and seniors are 36 and 34 hours a week respectively.
Furthermore, seniors at the age of 60 or above have on average the highest capital
gain and loss in comparison to the other two groups.
SU4-53
ANL252 Data Management
Read
Refer to the link below for more details and examples on the functions groupby() of
the pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.groupby.html
SU4-54
ANL252 Data Management
Summary
In this unit, we have seen how Python can be used to manipulate, clean, and query data
using the pandas package. Querying the DataFrame structures can be done in different
ways, such as using the .iloc or .loc attributes for row-based querying or using the
square brackets on the object itself for column-based querying. We also saw that one
can query the DataFrame through Boolean masking. Furthermore, we also came across
situations where we had to use the .append() method and the concat() function to
merge multiple DataFrames with different shapes into one. We then explored how to
detect and replace missing values as well as outliers in a DataFrame. We also discussed
how DataFrames could be modified for further analyses such as sorting and grouping
data.
SU4-55
ANL252 Data Management
Formative Assessment
2. Which of the following values can be used with .iloc to select rows from a
DataFrame?
a. ["var_label1", "var_label2"]
b. [-4:-1]
c. ["0", "1", "2"]
d. DataFrame_name["var_label1" == "good"]
a.
A B
1 2
3 4
b.
A B
3 4
SU4-56
ANL252 Data Management
1 2
c.
A B
1 3
2 4
d.
A B
3 1
4 2
a. A and B
b. B and C
c. Only B
d. A, B, and C
SU4-57
ANL252 Data Management
a. X
b. W
c. Z
d. Y
SU4-58
ANL252 Data Management
Formative Assessment
1. What is not a function/method to display a DataFrame?
a. print()
b. .head()
Incorrect. We can use the .head() method to print out the first five rows of
a DataFrame.
c. show()
d. display()
Incorrect. We can use the display() function to print out the first five and
the last five rows of a DataFrame.
2. Which of the following values can be used with .iloc to select rows from a
DataFrame?
a. ["var_label1", "var_label2"]
b. [-4:-1]
SU4-59
ANL252 Data Management
Incorrect. The values in the index operator for .iloc must be numeric and
not label strings.
d. DataFrame_name["var_label1" == "good"]
Incorrect. The arguments in the index operator for .iloc must be numeric
values and not Boolean expressions.
a.
A B
1 2
3 4
b.
A B
3 4
1 2
Incorrect. The .append() method appends df2 to df1 and not vice versa.
c.
A B
1 3
SU4-60
ANL252 Data Management
2 4
d.
A B
3 1
4 2
a. A and B
Incorrect. Since it is an outer join, the resulting DataFrame should contain all
available columns across the original DataFrames and not only those from
df.
b. B and C
Incorrect. Since it is an outer join, the resulting DataFrame should contain all
available columns across the original DataFrames and not only those from
df2.
SU4-61
ANL252 Data Management
c. Only B
Incorrect. Since it is an outer join and not inner join, the resulting DataFrame
should contain all available columns across the original DataFrames and not
only the common ones.
d. A, B, and C
Correct. Since it is an outer join, the resulting DataFrame should contain all
available columns across the original DataFrames, which are A, B, and C.
a. na_values
b. na_filter
c. na_drop
d. na_omit
a. .drop()
b. .dropna()
SU4-62
ANL252 Data Management
c. .nareplace()
d. .fillna()
a. X
Incorrect. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W should be the lowest in the hierarchy.
b. W
Correct. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W is the lowest in the sorting hierarchy.
c. Z
Incorrect. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W should be the lowest in the hierarchy.
d. Y
Incorrect. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W should be the lowest in the hierarchy.
SU4-63
ANL252 Data Management
d. The values in the grouping variable can be identical to the row indices.
Incorrect. In the case where the row indices are values of a categorical
variable, this categorical variable can perfectly be used as a grouping variable
since its number of possible values is limited.
SU4-64
ANL252 Data Management
References
DataScience Made Simple. (2023). Join in pandas: Merge DataFrames (Inner, Outer, Right,
merge-data-frames-pandas-python/
Department of Computer Science. (1996). The adult dataset. The University of Toronto.
adult/adultDetail.html
pandas. (2023a). IO tools (text, CSV, HDF5, …). NumFOCUS, Inc. https://
pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table
pandas. (2023b). Merge, join, concatenate and compare. NumFOCUS, Inc. https://
pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
docs/stable/reference/api/pandas.concat.html
docs/stable/reference/api/pandas.cut.html
pandas-docs/stable/reference/api/pandas.DataFrame.any.html
pandas.pydata.org/pandas-docs/version/1.3/reference/api/
pandas.DataFrame.append.html
pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.dropna.html
SU4-65
ANL252 Data Management
pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.groupby.html
pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
pandas-docs/stable/reference/api/pandas.DataFrame.index.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.isnull.html
pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.quantile.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.reset_index.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.sort_values.html
SU4-66
ANL252 Data Management
pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
docs/stable/reference/api/pandas.read_csv.html
SU4-67
ANL252 Data Management
SU4-68
5
Study
Unit
Learning Outcomes
SU5-2
ANL252 Data Analytics in Python
Overview
In this unit, we will discuss the implementation of two analytics techniques in Python:
k-means clustering and decision trees. All these functions, modules, and algorithms
can be found in the scikit-learn library. The scikit-learn library is a machine-learning
library written for the Python programming language. It features various modules such
as classification, clustering, regression, etc. We will first develop Python programmes to
prepare the DataFrame for the different requirements of these sub-packages. We will then
learn how the two analytics techniques can be carried out by Python programmes and
how their results can be extracted and presented.
SU5-3
ANL252 Data Analytics in Python
Lesson Recording
Introduction to Scikit-Learn
It has become very common to carry out tasks for data analytics, statistical modelling,
or machine learning in Python recently. And the trend is rising. In fact, many Python
packages have been developed in these areas. One of the most common libraries used for
these purposes is scikit-learn, a free machine-learning library written for Python. In this
study unit, we will write our code with the scikit-learn functions in JupyterLab.
One reason that scikit-learn has become one of the most common machine-learning
libraries for programming is its broad applicability and functionality. It features various
algorithms for classification, regression, clustering, etc. In machine learning, programmes
are constructed with parameters such that they can “learn” from newly fed data. That
is, they can automatically adjust and improve their behaviour according to the new
“knowledge”. Below is a table of the most common algorithms that are available in scikit-
learn.
Supervised learning
SU5-4
ANL252 Data Analytics in Python
Unsupervised learning
Besides machine learning algorithms, scikit-learn also provides modules for model
selection, visualisation, data transformation, as well as example datasets. The website
https://scikit-learn.org/stable/user_guide.html contains many details of the library.
Same as NumPy, matplotlib, and pandas, we can simply use pip (the package installer of
Python) to download and install scikit-learn.
After installing scikit-learn using pip, we can import it into our programme.
import sklearn
Note that it is sklearn and not scikit-learn that refers to the scikit-learn library
in the Python programmes. Nevertheless, since the library is extraordinarily extensive,
programmers usually do not import the entire library. Instead, the common practice
is to load the required algorithm or only its “estimator” object. For instance, if linear
SU5-5
ANL252 Data Analytics in Python
regression models are required for the analytics task, we can import the estimator
LinearRegression from the module linear_model.
Since each module has its own estimators, functions, etc., it is important to refer to the
official websites for the correct spelling, including the cases of the names. It is not unusual
that we need to load a couple of them for a single analytics task. It is therefore important
to put sufficient comments in the programme to explain the purpose and use of each
imported module.
In this study unit, we will demonstrate two scikit-learn algorithms – k-means clustering
and decision trees – to show how the library and Python, in general, can be applied in
data analytics. But before we can apply these algorithms, we need to prepare the data
according to the requirements of each of the algorithms. The preparation process will be
discussed in the next section.
Example: We begin to develop the code using the scikit-learn algorithms and to import
the corresponding packages or modules.
SU5-6
ANL252 Data Analytics in Python
In the first box, the packages introduced in the two previous study units, pandas,
NumPy, and matplotlib are imported. We will need them to manage DataFrames,
convert slices of DataFrames to multi-dimensional arrays, and construct plots to
illustrate and evaluate the model results. In the second box, we import modules from
scikit-learn that we need for the pre-processing and transformation of the DataFrames
for model constructions. For instance, with the train_test_split function, we can
instruct Python to split arrays into random training and testing subsets for evaluating
the estimator performance. Furthermore, the module metrics includes functions to
compute metrics and distances for the evaluation of classification performance. The
functions in the preprocessing module such as scaling, centring, normalisation,
etc. are used to prepare DataFrames for the scikit-learn algorithms. In the last two
boxes, we import the modules of k-means clustering (KMeans) and decision trees
(tree). In addition to the KMeans module, we also import the PCA module from the
decomposition sub-package for dimension reduction, which will be helpful to plot
multivariate data as a two-dimensional chart.
Read
Refer to the link below for more details on the installation of the scikit-learn package:
learn.org/stable/install.html
Refer to the link below for more details and examples on the cluster module of the
scikit-learn package for applying K-Means clustering:
modules/clustering.html
SU5-7
ANL252 Data Analytics in Python
Refer to the link below for more details and examples on the tree module of the
scikit-learn package for constructing decision trees:
scikit learn. (2023). 1.10 decision trees. scikit learn 1.2.2. https://scikit-learn.org/
stable/modules/tree.html
Refer to the link below for more details and examples on the metric module of the
scikit-learn package for metrics and distance computations:
modules/classes.html?highlight=metrics#module-sklearn.metrics
Refer to the link below for more details and examples on the preprocessing module
of the scikit-learn package for data preparation techniques such as scaling, centring,
normalisation, etc.:
highlight=preprocessing#module-sklearn.preprocessing
Lesson Recording
In Figure 5.1, various modules for data preparation have been imported. At the same time,
we have also imported packages that we have already worked with in the previous study
units: NumPy, pandas, matplotlib. In fact, the scikit-learn algorithms work hand-in-hand
with these packages. In SU4, we have already covered some basic data preparation tasks
SU5-8
ANL252 Data Analytics in Python
such as missing data handling. In this section, we will further need to combine NumPy,
pandas and scikit- learn to prepare datasets to meet the requirements of every scikit-learn
module.
Basically, the process of category reduction is to put observations from similar categories
into a new category. For instance, if the country names are categories of a categorical
variable, we can group them by their continents. If a categorical variable contains the
models of a certain product, we can group them by their brands or their main features.
The similarity of the categories is essential here for not losing too much information.
DataFrame_Name["column_label"].replace(to_replace, value)
Example: Suppose we have a pandas DataFrame with one column (i.e., ‘fruits’), as
shown in the first block of code below. In the DataFrame, the category ‘small_apple’
is redundant, which can be converted to the category ‘apple’ with the second block of
code below:
SU5-9
ANL252 Data Analytics in Python
Read
Refer to the link below for more details and examples on the .replace() method of
the pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.replace.html
SU5-10
ANL252 Data Analytics in Python
The column labels to be renamed must be put as keys of a dictionary that will be assigned
to the parameter columns in the .rename() method. The values of the dictionary will
then be the new labels of the corresponding columns.
Example: Suppose we have a pandas DataFrame with one column (i.e., ‘fruits’), as
shown in the first block of code below, we can rename the column as ‘three_fruits’
with the second block of code below:
Read
Refer to the link below for more details and examples on the .rename() method of
the pandas package:
SU5-11
ANL252 Data Analytics in Python
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.rename.html
The parameter drop_first is used to instruct Python to take the first category as
the reference level and remove it from the resulting DataFrame. The reason to define a
reference level for categorical variables and to remove it from the DataFrame is to avoid
linear dependence on the data matrix, which causes error in the calculation. The default
setting here is drop_first = False. In this case, all dummy variables will remain in the
resulting DataFrame. Since most of the modules in scikit-learn have their own algorithms
to deal with this issue, we can keep this setting without causing error in the estimation
process.
SU5-12
ANL252 Data Analytics in Python
Example: Suppose we have a pandas DataFrame with one categorical column (i.e.,
fruits), as shown in the first block of code below, we can convert it to a DataFrame
with three dummy variables, namely, apple, banana, and orange:
Read
Refer to the link below for more details and examples on the get_dummies()
function of the pandas package:
pandas-docs/stable/reference/api/pandas.get_dummies.html
SU5-13
ANL252 Data Analytics in Python
could be optional for some algorithms. The parameter X is the design matrix that contains
all independent variables, and Y is the vector of the target variable. Both X and Y can be
NumPy arrays or pandas DataFrames. As a result, we need to extract the independent
variables as a matrix and the dependent variable as a vector from the original DataFrame.
The procedure here is rather straightforward. We simply select the column that represents
our dependent variable and save it as a new object.
y = DataFrame_name["target_var"]
Similarly, the matrix of the independent variables can be selected in the same manner.
Note that it is required to wrap the names of the independent variables in a list (square
brackets) first before putting them in the index operator [].
If the DataFrame only contains the independent variables and the target variable, we can
also simply drop the target variable from the original DataFrame to obtain X.
X = DataFrame_name.drop("target_var")
If the target variable is categorical and has been transformed to various dummy variables,
the names of the dummy variables must be put in a list when passing them to the .drop()
method:
Y = DataFrame_name[Y_dummyvar[1]]
X = DataFrame_name.drop(Y_dummyvar, axis=1)
SU5-14
ANL252 Data Analytics in Python
In the first line above, we create Y_dummyvar,which denotes a list of the target dummy
variables y1 and y2. However, we only need one of the dummy variables for modelling
purposes since they are highly correlated. In other words, when y1 is equal to 1, y2 is
equal to 0; and vice versa. In the second line above, the column with the label that matches
the second item in Y_dummyvar will be selected from the DataFrame. Lastly, in the third
line above, we also drop the target variable from the DataFrame to have the independent
variables X.
Treat missing values Remove the records with Please refer to SU4 for
missing values; replace the more details.
missing values
SU5-15
ANL252 Data Analytics in Python
1.2.6 Regression
Withstanding that we are able to extract the target variable and independent variables
from Section 1.2.7, now we would like to illustrate more about building simple regression
models on a data frame. Suppose we have a DataFrame for analysis, which contains the
independent variables X and the target variable Y.
SU5-16
ANL252 Data Analytics in Python
where a indicates the coefficient for the independent variable or the slope of the line, b
The linear regression model is fitted based on the least square method, i.e., to minimise
the sum of squared errors (SSE) over all observations i:
Thereby, the oblique line in Figure 5.5 is also named as the least square regression line.
The performance of the linear regression model has been commonly evaluated with the
R2 value, which can be calculated by:
SU5-17
ANL252 Data Analytics in Python
where the SST stands for the sum of square deviations of the target variable Y from its
own mean. The value of R2 ranges from 0 (worst model performance/goodness of fit) to
1 (worst model performance/goodness of fit).
When there are more than one independent variables, we may further represent the linear
regression model as:
The linear regression formulations and R2 method for performance evaluation can be
imported from sklearn package with the following codes:
From here, we are able to create the linear regression object and the fit the linear regression
mode:
lr_object = LinearRegression()
model = lr_object.fit(X, Y)
Lastly, we can evaluate the performance of the linear regression model in terms of R2 with
the ‘.score‘ method:
SU5-18
ANL252 Data Analytics in Python
model.score(X, Y)
# of Employees Profit
1 20000
2 10000
3 30000
4 40000
5 50000
6 70000
Then we create the array for analysis, including the dependent variable ‘study_hours’
and the target variable ‘score’. The ‘.reshape()’ method is used to ensure the
independent variable ‘study_hours’ has two dimensions, i.e., (12, 1).
SU5-19
ANL252 Data Analytics in Python
Then we build the linear regression model with the sklearn package:
With the linear regression model, we can further calculate the predicted target variable
values (score) for each data point:
Lastly, we are able to plot both the scatter plot of the study hour – test score data points
and the linear regression line in the same figure:
SU5-20
ANL252 Data Analytics in Python
Other than linear regression, sklearn also allows for other types of commonly used
regression techniques such as logistic regression, generalised linear models, etc., for
which students may refer to the links at the end of this section for more information.
Read
Refer to the link below for more details and examples on the LinearRegression
function of the linear_model in the scikit-learn package:
developers. https://scikit-learn.org/stable/modules/generated/
sklearn.linear_model.LinearRegression.html
Refer to the link below for more details and examples on the
sklearn.linear_model of the scikit-learn package:
SU5-21
ANL252 Data Analytics in Python
scikit-learn.org/stable/modules/classes.html#module-
sklearn.linear_modelsklearn.linear_model
SU5-22
ANL252 Data Analytics in Python
Chapter 2: Clustering
Lesson Recording
Different from some other clustering algorithms in which the number of clusters will only
emerge during the grouping process, the K-Means method requires the number of clusters
to be specified before the algorithm starts. The clusters are characterised by their centroids,
which can be interpreted as the centre of an area in a two-dimensional space and are hence
the average of all the observations within a cluster. As the name of the algorithm suggests,
there should be K different means (centroids) and they should be explored during the
clustering process.
SU5-23
ANL252 Data Analytics in Python
Note that the distance mentioned in 2) refers to the Euclidean distance in general. The
Euclidean distance between an object and a cluster centroid is measured by the sum of
the squared differences between the values of some selected clustering criteria, which are
usually some input variables of the object, and the values of the same clustering criteria
of the centroid.
Subsequent to the clustering process, it is important to make sure that the resulting clusters
really create some insights. To interpret the clusters, the characteristics of each cluster
should be explored by looking at the summary statistics (e.g., mean, min, max) of the
clustering criteria. A good clustering solution should allow us to describe the profile of
each cluster clearly.
In addition, there are objective measures for evaluating the quality of clustering solutions:
cohesion, separation, and parsimony. Cohesion measures the similarity of the objects in
a cluster. This value should be small because these objects should be similar. Separation,
on the other hand, measures how dissimilar the clusters are, and this value should
be high. Here, we can apply the Silhouette coefficient since it combines both cohesion
and separation. Briefly speaking, the Silhouette coefficient is a value between -1 and 1
that measures the relationship between the intra-cluster distances and nearest cluster
distances. The mean of the individual Silhouette coefficients will be computed for every
SU5-24
ANL252 Data Analytics in Python
clustering solution for evaluation. A high and positive average Silhouette coefficient
suggests appropriate and useful clustering solution. On the contrary, negative Silhouette
coefficient indicates a rather undesirable clustering result.
First, we need to initiate the KMeans estimator and adjust the estimation parameters
according to our needs.
SU5-25
ANL252 Data Analytics in Python
SU5-26
ANL252 Data Analytics in Python
Here are some explanations regarding some items mentioned in Table 5.3:
SU5-27
ANL252 Data Analytics in Python
The parameter X is a prepared DataFrame based on which the clusters are constructed.
With sample_weight,we can pre-specify the weights for each observation in X. If it is
set to None, which is also the default here, all observations will be assigned equal weight.
The fitted estimator of the K-Means algorithm is saved in km_fit_Object.
To obtain and view the clusters of each data point, we can use the following syntax:
print(km_Object.labels_)
Example (Cont’d): Here we give an example on the popular iris dataset in machine
learning which contains 150 records of iris flowers from three distinct species (setosa,
versicolor, and virginica). The dataset comprises four attributes, namely, sepal length,
sepal width, petal length, and petal width. We start by loading the dataset from
scikit-learn, and the related package for plotting. Since K-Means clustering is an
unsupervised machine learning algorithm, where we do not need a target variable for
clustering purpose.
SU5-28
ANL252 Data Analytics in Python
Next, we specify the number of clusters (3), initialise the KMeans estimator and fit the
KMeans clustering model.
Since we already have the KMeans clustering model now, we can obtain the clusters
with kmeans.labels_.
Then we can visualise the clusters with the matplotlib package regarding the first two
features in the iris dataset, as shown below:
Read
Refer to the link below for more details and examples on the KMeans estimator and
fit()functions of the cluster module in the scikit-learn package:
SU5-29
ANL252 Data Analytics in Python
learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
SU5-30
ANL252 Data Analytics in Python
Lesson Recording
Decision trees are among the most common data mining methods which split a set (or
subset) of observations to reach certain decision points based on some criteria eventually.
Each decision point in the tree is also called a node and represents a subset of the sample
based on which the decision tree is created. Nodes that are split from a superordinate node
are called the child node while the origin node is called the parent node. A child node
with no further subdivisions or splitting is called a leaf node.
Since each observation in the sample will be assigned to one of the nodes eventually,
decision tree is a classification technique to separate a sample into multiple classes. The
decision tree algorithm predicts the individual classification based on the values of some
input variables and calculates the predicted value of the target variable at the same time.
These rules of decision form the resulting model which can then be illustrated by a tree-like
structure graphically. This structure convenes the interpretation of the modelling result.
We can also use the decision tree model to understand the relationship between the target
variable and the input variable. In fact, the decision tree handles complex relationships
such as non-linearity and interaction rather well. Note that not all input variables are of
the same importance in the classification process. Their hierarchy in the decision rules
reflects in the decision tree in which input variables appear higher up are more important.
As mentioned, nodes without further splitting are called the leaf nodes. Their value is
the prediction of the target variable for those observations classified in the corresponding
SU5-31
ANL252 Data Analytics in Python
nodes. If the target variable is categorical, the value of the leaf node will be the mode, the
most frequent class. And it will be the mean of the data in that node if the decision tree is
predicting a numeric variable.
Over the years, there have been many algorithms for splitting the nodes and hence the
construction of the tree developed, proposed, and implemented in software packages.
The most common ones are CHAID (chi-square automatic interaction detection), C5.0
(a proprietary algorithm) and CART (classification and regression tree). In Python, the
estimator DecisionTreeClassifier of the scikit-learn package uses an optimised
version of the CART algorithm.
As the name suggests, the CART algorithm can create both classification trees and
regression trees. While regression trees estimate the values of a continuous target variable,
classification trees predict the outcome of a categorical target variable. In other words,
CART is applicable on almost every type of output variable.
In CART, every parent node only has two child nodes. That is, CART will only split the
tree into two sub-samples at every decision point. And the calculation of the split is based
on the input variables that are also used to predict the target output. As the split process
advances, the sample will be divided into more and more, smaller and smaller subsets.
These subsets will also become more and more homogeneous. The whole splitting process
will be terminated once certain stopping criteria are fulfilled.
The homogeneity of each subset reflects the split quality from a parent node to its child
nodes. In classification tree, the homogeneity is measured by Gini and entropy. Roughly
speaking, both Gini and entropy measure the impurity of a node, but with different
theoretical background. By comparing the impurity decrease across all possible splits in
all input variables, the split with the highest reduction of impurity will be chosen. In
scikit-learn, both Gini and entropy are options of the criterion parameter in the
DecisionTreeClassifier.
In a regression tree, the impurity is measured by the sum of squared error (SSE). The SSE
is the total deviance of each observation from the sample mean. For each potential split,
SU5-32
ANL252 Data Analytics in Python
CART computes the SSE for each child node and the split with the lowest sum of SSE
across all child nodes will be chosen. Since the mean of the target variable of the leaf node
is equal to the predicted value of the target variable, splits with low SSE have child nodes
that contain data whose target values are close to the mean value.
One possibility to stop the CART algorithm is when the impurity improvement of a new
split drops below a certain pre-defined threshold. For instance, if a node has reached a
rather low Gini or SSE respectively, meaning that the parent node itself is already rather
homogeneous, another split from the node would only create homogeneous child nodes.
In this case, this split is not necessary since it does not decrease the impurity significantly.
Nevertheless, choosing the right thresholds to stop the split algorithm is not as
straightforward as it seems. If the thresholds are high, the resulting tree could be
oversimplified as splits become more difficult. Low thresholds, on the other hand, could
lead to over-complicated trees that are difficult to interpret and deploy.
Another possibility to stop the algorithm is when the tree has attained a pre-specified
depth. The depth of the tree refers to the number of splits in it. This method can simply
control the size of the tree without oversimplifying or overcomplicating it. One last
stop criterion is to set a lower bound of observations in the nodes. Once the number
of observations in all nodes has reached the bound, a new split from any node would
only create child nodes that contain less observations than the lower bound allows. As a
result, the lower bound blocks the algorithm from carrying out another split and the entire
process ends.
One way to evaluate the performance of a decision tree is to examine the precision of its
prediction. In other words, the predicted values of the target variable will be compared
with the observed data. For classification trees, we can use the confusion matrix in which
the correct and incorrect classifications are summarised. The larger the proportion of
observations for which the predicted and observed classifications are identical, the more
accurate is the decision tree model. For regression trees, the Root-Mean-Square-Error
(RMSE) is usually used to measure the prediction accuracy of the model. Basically, the
RMSE is kind of an average deviance of all the predicted values from their observed
SU5-33
ANL252 Data Analytics in Python
counterparts. The lower such a deviance is, the closer are the predictions to the actual
values, and the better is the model.
tree_Object = sklearn.tree.DecisionTreeClassifier(criterion
= "gini", splitter = "best", max_depth = None,
min_samples_split = 2, min_samples_leaf = 1,
min_weight_fraction_leaf = 0.0, max_features =
None, random_state = None, max_leaf_nodes =
None, min_impurity_decrease = 0.0,
min_impurity_split = None, class_weight = None)
tree_Object = sklearn.tree.DecisionTreeRegressor(criterion
= "mse", splitter = "best", max_depth = None,
min_samples_split = 2, min_samples_leaf = 1,
min_weight_fraction_leaf = 0.0, max_features =
None, random_state = None, max_leaf_nodes =
None, min_impurity_decrease = 0.0,
min_impurity_split = None)
SU5-34
ANL252 Data Analytics in Python
From the above syntaxes, we can see the main differences between the estimators
DecisionTreeClassifier and DecisionTreeRegressor are the values of the
parameter criterion and the availability of the parameter class_weight. The
following table provides description and explanation of the parameters.
SU5-35
ANL252 Data Analytics in Python
If integer, consider
min_samples_split as the
minimum number.
If float, min_samples_split
is a fraction and
ceil(min_samples_split
* n_samples) are the
minimum number of samples
for each split.
SU5-36
ANL252 Data Analytics in Python
If integer, consider
min_samples_leaf as the
minimum number.
If float, min_samples_leaf
is a fraction and
ceil(min_samples_leaf *
n_samples) are the minimum
number of samples for each
node.
SU5-37
ANL252 Data Analytics in Python
If integer, consider
max_features features at each
split.
If float, max_features
is a fraction and
int(max_features *
n_features) features are
considered at each split.
If "auto", max_features =
sqrt(n_features)
If "sqrt",max_features =
sqrt(n_features)
If "log2",max_features =
log2(n_features).
If None,max_features =
n_features.
SU5-38
ANL252 Data Analytics in Python
SU5-39
ANL252 Data Analytics in Python
SU5-40
ANL252 Data Analytics in Python
tree_fit_Object = tree_Object.fit(X, Y,
sample_weight = None)
Read
Refer to the link below for more details and examples on the
DecisionTreeClassifier estimator of the tree module in the scikit-learn
package:
SU5-41
ANL252 Data Analytics in Python
developers. https://scikit-learn.org/stable/modules/generated/
sklearn.tree.DecisionTreeClassifier.html
Refer to the link below for more details and examples on the
DecisionTreeRegressor estimator of the tree module in the scikit-learn
package:
developers. https://scikit-learn.org/stable/modules/generated/
sklearn.tree.DecisionTreeRegressor.html
Lesson Recording
The most important tool to understand and evaluate a decision tree is the plot of the tree
itself. The tree module of the scikit-learn package provides the plot_tree() function to
generate such a graph conveniently.
SU5-42
ANL252 Data Analytics in Python
SU5-43
ANL252 Data Analytics in Python
SU5-44
ANL252 Data Analytics in Python
The plot_tree() function can also be combined with the matplotlib options such as the
plot size, the borderline settings, etc.
Example (Iris): Here we also give an example on the popular iris dataset and start by
loading the dataset from scikit-learn, and the related package for plotting. Y and X
denote the target variables and independent variables, respectively.
Figure 5.16 Load packages and prepare dataset for tree plot
SU5-45
ANL252 Data Analytics in Python
SU5-46
ANL252 Data Analytics in Python
From the figure above, we can see that the first split of the tree is generated by the
feature “petal width”. The “petal width <=0.8” sub-samples will be assigned to the
child node on the left and those “petal width > 0.8” observations to the child node
on the right. In the next step, the node on the right-hand side is further split by the
“petal width <=1.75”. The split of the tree will then continue until certain stop criteria
are fulfilled. Since we limit the tree depth down to the second split level in our chart,
nodes of the further split levels will only be displayed in grey boxes with no proper
information in it.
Read
Refer to the link below for more details and examples on the plot_tree() functions
of the tree module in the scikit-learn package:
SU5-47
ANL252 Data Analytics in Python
learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
SU5-48
ANL252 Data Analytics in Python
Summary
In this unit, we have seen how Python can be used to carry out analytics tasks based on two
techniques: k-means clustering and decision trees. One of the most common packages in
Python for data analytics and machine learning algorithms is scikit-learn. In scikit-learn,
the analytics algorithm is called an estimator, and its parameters need to be calibrated
before the fitting process is carried out. Once the calibration step is completed, we can
apply the model on prepared DataFrames.
However, the available DataFrames and their contents are usually not in the format and
shape that the scikit-learn algorithms require in the first place. Therefore, we need to
prepare the dataset accordingly. In this unit, we have learned how to remove missing data,
reduce categories, discretise numeric variables, select and rename variables, transform
data, partition data into training and testing datasets, and extracting dependent and
independent variables from the original DataFrame. Subsequently, the application of K-
Means clustering and decision trees have been demonstrated.
SU5-49
ANL252 Data Analytics in Python
Formative Assessment
SU5-50
ANL252 Data Analytics in Python
SU5-51
ANL252 Data Analytics in Python
Formative Assessment
1. What is an estimator in scikit-learn?
Incorrect. The estimator is the entire algorithm and not just the parameters
in it.
a. .get_dummies()
b. normalize()
c. .deletecat()
SU5-52
ANL252 Data Analytics in Python
d. .replace()
Incorrect. The random_state parameter does not control the range from
which the random numbers are drawn.
b. It creates a variable in your DataFrame to store all random numbers that have
been drawn since the first run of the programme.
d. It draws the same “random numbers” in every run to make the results
reproducible.
SU5-53
ANL252 Data Analytics in Python
Incorrect. We can indeed stop the split of a decision tree if the tree has reached
a certain depth.
Incorrect. If the sub-sample in any child node is smaller than the minimum
required number of observations after a new split, the estimator will
terminate the tree construction before the split.
SU5-54
ANL252 Data Analytics in Python
References
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.add_suffix.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.rename.html
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.replace.html
pandas-docs/stable/reference/api/pandas.get_dummies.html
stable/modules/clustering.html
stable/modules/tree.html
stable/modules/classes.html#module-sklearn.linear_modelsklearn.linear_model
learn.org/stable/install.html
learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
SU5-55
ANL252 Data Analytics in Python
developers. https://scikit-learn.org/stable/modules/generated/
sklearn.linear_model.LinearRegression.html
learn.org/stable/modules/classes.html#module-sklearn.linear_model
learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
highlight=preprocessing#module-sklearn.preprocessing
learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
developers. https://scikit-learn.org/stable/modules/generated/
sklearn.tree.DecisionTreeClassifier.html
developers. https://scikit-learn.org/stable/modules/generated/
sklearn.tree.DecisionTreeRegressor.html
learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
user_guide.html
SU5-56
6
Study
Unit
Learning Outcomes
SU6-2
ANL252 Basic SQL in Python
Overview
In this unit, we will discuss how to use Python to execute SQL commands for database
management. We will first introduce SQLite3, a standard Python package that allows
programmers to use the Python environment to send SQL statements to the database. We
will also discuss how we can generate SQL statements using Python programmes for data
query and to present their outputs as a pandas DataFrame.
SU6-3
ANL252 Basic SQL in Python
Lesson Recording
Introduction to SQL
Many analytics software have integrated SQL as part of their tools. In Python, the package
sqlite3 provides the possibility to embed SQL codes into Python programmes to facilitate
connections to databases and to query the data in them. Since sqlite3 belongs to the built-
in packages of Python, no installation using pip is required.
Same as scikit-learn, the analytics package introduced in the previous study unit, sqlite3
works hand-in-hand with the pandas packages since both of them are designed for data
management. As a result, we can convert the output table of SQL queries to pandas
DataFrames and vice versa anytime.
SU6-4
ANL252 Basic SQL in Python
Example (Students’ score, cont’d): In this study unit, we will return to our students’
examination score example from Study Units 1 & 2 and manage a database with a table
containing the personal information of the students and another table in which their
scores of two examinations are stored. Before we can create a database and the tables
in it, we need to import the packages sqlite3 and pandas into our Python programme
first.
Unlike other packages, the commands in sqlite3 must be executed through a cursor
object after the connection between Python and the database has been initiated. As a
result, we do not necessarily need an alias to abbreviate sqlite3 for more convenience
when writing our programme.
Read
Refer to the link below for more details on the sqlite3 package of Python:
Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite
databases. https://docs.python.org/3/library/sqlite3.html
SU6-5
ANL252 Basic SQL in Python
etc. has a module to convert text files into its own data file format. In other words, text
files are highly compatible and therefore a good medium of data storage. Furthermore,
their file sizes are usually comparatively small since they only contain the data and the
delimiters, and no other information such as cell formatting. As a result, when exchange of
data is required within an organisation or between data providers and clients, they are the
most suitable format since they would not use up as much upload and download volume
or time as other data file formats.
In Study Unit 1, we developed a programme to let users enter the examination scores of
a course. However, the programme only allows data entry but provides no mechanism
to save the data in an external file. In other words, once we quit the programme, all the
data we have entered will not be accessible or found anymore. Here, we will introduce
the Python commands to write data into a .csv file as well as to read the data from it.
In Python, we need to open an external text file with the open() function first before we
can write data into it or read data from it.
The with statement is usually used in combination with the open() function. The data
stored in the text file called file_name will then be stored in the file_object for
further processing. With the parameter mode, we can choose the operations that we are
permitted to carry out with the file. Here is a list of some of the available modes:
Character Description
SU6-6
ANL252 Basic SQL in Python
Character Description
"x" open for exclusive creation; failing if the file already exists
"a" open for writing, appending to the end of the file if it exists
Once a file is opened in a particular mode, we are only allowed to execute the permitted
operations until it is re-opened. For instance, if we have opened a file in reading mode,
Python will block us from writing, appending, or updating any content of the file and
return an error message to us if our code intends it to do so. The difference between both
the writing modes "w" and "a" is that "w" overwrites the entire original content in the file
by our new entries while "a" appends the new entries to the end of the file while keeping
the original content.
Note that "r" or "a" mode may be combined with the "+" mode. In both cases, we give
permission for the file to be read and updated. The main difference here is that if we open
the file in the "r+" mode, Python will be able to read the entire existing content from the
beginning of the file and write the new entries at the end of the file. But if the file is opened
in the "a+" mode, Python will not be able to read the entire existing content and is only
able to just append the updates at the end of the file. Furthermore, if the text file does not
exist, it will be created in the "a+" mode, while Python will return an error in the "r+"
mode.
After the user has finished entering the data of one record, Python should write this record
to the text file with the following syntax.
SU6-7
ANL252 Basic SQL in Python
file_object.write(data_row)
The name of file_object must be identical with the one defined in the with open()
statement. The object data_row is a string in which the entered data is stored. After all
the data has been saved to the file_object (Python will transfer the data to the real
text file in the background), we need to close the file properly to release its access to other
parties.
file_object.close()
Though closing the external files may not affect the programme flow directly, it is still
important and a good programming habit to do so at the end of the code.
Example (Cont’d): The following programme creates an interface to let the users enter
the personal information of the students including their first and last names, gender,
birthday, nationality, and study programme. The input mechanism is the same as
those introduced in Study Unit 1. For simplicity, we have omitted some of the control
mechanisms to prevent invalid inputs here. This is essential and extremely important
for a programme to be executed correctly. After entering the data of one student,
Python will convert the entries to a comma separated string and store it as a new line
into the file “Student_DB.csv” for later use.
SU6-8
ANL252 Basic SQL in Python
In the first line, we store the file name in a variable called target_file in case we
need it multiple times in the later part of our programme. By choosing the "r+" mode
in the with open command in the second line, we instruct Python to open the file
for reading and writing, and the existing contents of target_file should be stored
in the object called write_to_file. The option of reading the stored data in the
file would be particularly useful if we needed the number of existing records. This
statistic is particularly helpful for two tasks: i) if the file is empty, we will need to add
a line of column headers to the record which we also include in lines 4 to 6, ii) we can
use the current number of records as the ID number of a new student. We would not
have been able to execute these tasks if we had opened target_file in the "a+"
mode instead.
Since .csv text files are comma separated, we need to insert a comma between each
column when concatenating the data to a string before storing it to target_file.
It is also important to put the escape sequence \n at the end of the string so that the
next record will be stored in a new line. Figure 6.3 shows how the interface looks like
when we run our programme in JupyterLab.
SU6-9
ANL252 Basic SQL in Python
To read the existing entries of a text file, we can use a for-loop to go through them line
by line and print the content to the screen.
Once again, the file_object must be identical with the one defined in the with
open() statement. Note that Python is able to separate the records in the text file by
recognising the line break (or escape sequence \n) at the end of every line. As a result, the
for-loop will automatically run through all the lines and store the content in the variable
line, which will then be printed to the screen by the print() function.
Example (Cont’d): Figure 6.4 shows our programme to print the records stored in
“Student_DB.csv” to the screen line by line.
SU6-10
ANL252 Basic SQL in Python
Figure 6.4 Print the records in a .csv file to the screen line by line
Same as the input programme, we store the file name in a variable called
source_file in case we still need it in the later part of our programme.
Subsequently, we open the file in the "r" mode and store the data in the object
read_from_file. The advantage of opening the file in reading mode is that the data
stored in the .csv file is safe from accidental update or removal. The data will then be
printed to the screen line by line with the subsequent for-loop.
Read
Refer to the link below for more details and examples on the open() function:
docs.python.org/3/library/functions.html#open
Refer to the link below for more details and examples on reading and writing files in
Python:
SU6-11
ANL252 Basic SQL in Python
docs.python.org/3/tutorial/inputoutput.html
connection_object = sqlite3.connect("database_name")
If the database already exists, connect() will simply create a connection between the
two platforms and let the user gain access to the existing database. On the other hand, if
the database is new, connect() will create a new database with the name used in the
string "database_name" and link it with Python directly. Once the connection has been
established, we can create SQL syntaxes as strings or string variables in Python, and then
send these string objects to SQL for execution. The .cursor() method creates a cursor
object to take over this task.
cursor_object = connection_object.cursor()
SU6-12
ANL252 Basic SQL in Python
To create a table with imported data from a .csv file, we need to first read in the file as
a pandas database in Python with the pandas.read_csv() function, which we have
introduced in Chapter 1 of Study Unit 4, and then send the data object to the database by
the .to_sql() method of the pandas package.
data_object = pandas.read_csv("csv_file_name.csv")
data_object.to_sql("table_name", connection_object,
if_exists)
Example (Cont’d): In the first step, we establish a Python connection called conn to
the database named "StudentsDB.db" and create a cursor object cur for later use.
In the next step, we import the student data created with the programme in Figure
6.2 as a pandas DataFrame called students.
SU6-13
ANL252 Basic SQL in Python
Subsequently, we can create a new table in the database called students from this
pandas DataFrame. Since we have read in the entire dataset from the .csv file, we can
ask Python to replace the existing table if it exists.
The parameter index instructs Python to write the row index as a column with
column name index_label in the table. Since the default value here is True, we
need to specify it in the .to_sql() method if we do not wish to include this column.
With the cursor object being created, we can execute the SQL commands by sending them
as strings through the cursor object with the .execute() method.
cursor_object.execute("SQL_command_string")
Note that SQL is a separated programming language for database management and its
commands are therefore not the same as those in Python. Furthermore, SQL commands
SU6-14
ANL252 Basic SQL in Python
are not case sensitive and should end with a semi-colon (this is usually optional, but
sometimes the semi-colons are useful to separate the commands that are sent to SQL for
execution at the same time).
To select a table from the database, we can send a SELECT statement to SQL.
The asterisk (*) in the SELECT statement is to instruct SQL to take all columns from the
table. Once the query has been carried out, we can print one record of the result to the
screen by the .fetchone() method.
cursor_object.fetchone()
If we would like Python to print all records from the query result to the screen, we can
use the .fetchall() method instead.
cursor_object.fetchall()
Note that once we have applied the .fetchone() or .fetchall() methods, the data
records in the query result are literally fetched and no longer available. In other words, if
we re-apply the .fetchone() or .fetchall() methods, we will see either no records
or some of them missing. If we wish to select and check out the same table again, we will
have to redo the query.
SU6-15
ANL252 Basic SQL in Python
Example (Cont’d): In Figure 6.8, we use the .execute() method to send the SELECT
statement to SQL for selecting the table students from the database.
In the final step, we fetch all the records in the cursor object cur for printing.
From Figure 6.10, we can see that the first record has already been fetched in Figure
6.9 and is therefore not included in the output.
If we re-fetch the records from the query output after applying fetchall() once
before, SQL will return an empty object to us.
SU6-16
ANL252 Basic SQL in Python
Read
Refer to the link below for more details and examples on the to_sql() function of
the pandas package:
pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.to_sql.html
Refer to the link below for more details and examples on the connect(), .cursor(),
.execute(), .fetchone(), and .fetchall() functions and methods of the
sqlite3 package:
Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite
databases. https://docs.python.org/3/library/sqlite3.html
SU6-17
ANL252 Basic SQL in Python
Lesson Recording
The SELECT statement also allows us to select some of the variables from the table
instead of all of them. But in some cases, we may not even know the variables that
the table contains or how their names are correctly spelt. In this case, we can use the
.description attribute to extract the variable names from the last queried table.
cursor_object.description
Note that .description is an attribute and not a method. As a result, there are no
brackets and arguments behind it. The returned object is a collection of tuples where the
first item of each tuple is the column name, and the last six items are None.
SU6-18
ANL252 Basic SQL in Python
Example (Cont’d): In Figure 6.12, we will use the .description attribute to extract
the column names of the table students.
Unfortunately, the returned object does not have a particularly useful form for further
usage. Since we only need the first item of each tuple, we can run a for-loop within a
list to extract it.
While the for-loop is running through the cur.description object, the current
tuple is stored in the object column from which the item with the index 0 will then
be put in the list named cols.
From the previous Chapter, we learnt that we could fetch the records from a table for
further processes such as printing. However, the data are stored in tuples and when they
are printed, we do not see them as tables such as a pandas DataFrame. Furthermore, once
they are fetched, our programme no longer has access to the queried table. As a result, it
may be desirable to store the result of the query in a pandas DataFrame. In fact, pandas
provides the method .from_records() for this purpose.
SU6-19
ANL252 Basic SQL in Python
query_object = pd.DataFrame.from_records
(data = cursor_object.fetchall(), columns)
Example (Cont’d): We can now start a new data query by selecting the entire table
students and then storing the result of the query in a pandas DataFrame. The query
step is almost identical to the syntax in Figure 6.8. The only difference here is that
the .execute() method will not carry out our SQL command directly, but a string
variable called sql_select in which the SELECT statement will store the queried
data for future use.
In the next step, we can generate the data by applying the .fetchall() method on
the cursor object and converting it to a pandas DataFrame. We will also specify the
list cols which was generated in Figure 6.13 as our column names here.
SU6-20
ANL252 Basic SQL in Python
Figure 6.15 shows the table students after being converted to a pandas DataFrame,
a format that has become well-known to us from the previous study units.
Read
Refer to the link below for more details and examples on the .description attribute
of the sqlite3 package:
Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite
databases. https://docs.python.org/3/library/sqlite3.html#module-sqlite3
Refer to the link below for more details and examples on the .from_records()
function of the pandas package:
docs/stable/reference/api/pandas.DataFrame.from_records.html
Refer to the links below for more details and examples on the SELECT statement of
SQL:
SU6-21
ANL252 Basic SQL in Python
www.w3schools.com/sql/sql_select.asp
Note that if we intend to sort the table by multiple variables, we will need to separate
their names by commas. The sequence of the variables in this list also reflects the sorting
hierarchy. That is, the data are sorted by the first variable in the list initially. Those tied
records are then sorted by the second variable, and so on. If the data must be sorted in the
descending order by a particular variable, we must specify the DESC option behind the
variable name. Since the default value here is ASC, we can omit this option for a variable
if the data should be sorted in its ascending order.
SU6-22
ANL252 Basic SQL in Python
Example (Cont’d): Suppose we would like to sort the students by their age and
nationality in the table students. Since the table only contains their birthdays, stored
in the variable Birthday, we need to sort the records in the descending order so that
their age will be ordered naturally.
SU6-23
ANL252 Basic SQL in Python
Read
Refer to the links below for more details and examples on the ORDER BY option of
SQL:
www.w3schools.com/sql/sql_orderby.asp
order-by/
In the above syntax, the selection criterion is presented in its simplest form: records will
only be selected if one of the variables is equal to a certain value. We can also construct
other criteria by using the operators listed in the following table.
SU6-24
ANL252 Basic SQL in Python
Operator Description
= Equal
Same as the if-command in Python, we can also link multiple criteria in one statement
using the AND, OR and NOT operators.
Sometimes, we would rather not obtain records that contain missing values in one or more
variables from a query. In this case, we need to add the IS NOT NULL syntax to the WHERE
clause.
SU6-25
ANL252 Basic SQL in Python
If the statement were written without the NOT operator, SQL would return all records with
missing values in the variable var_name to us.
Figure 6.17 Select only analytics students from the table students
If the value is a string such as “Analytics” in the above syntax, we will need to put it
in a pair of quotation marks. And it is important here to pay attention to when and
where single or double quotation marks should be used.
Suppose we would like to narrow down our query to only analytics students with ID
numbers between 5 and 10.
SU6-26
ANL252 Basic SQL in Python
Figure 6.18 Select only analytics students with ID number between 5 and 10
Suppose these are not the students that we are actually looking for, and we would like
to select the other analytics students. All we need to modify in the above syntax is to
change the operator from BETWEEN to NOT BETWEEN for the ID variable.
Figure 6.19 Select only analytics students with ID number not between 5 and 10
In the next query, we would like to select all analytics students who are not from
Singapore or China. With the IN operator, we can specify the two values “Singapore”
and “China” that we are searching for in the column Nationality. Note that these
values must be put in a pair of round brackets. Finally, the NOT operator should be
added to the syntax IN ('Singapore', 'China') to negate it.
SU6-27
ANL252 Basic SQL in Python
Now, all analytics students whose first names start with an “M” should be selected.
Here, we can use the LIKE operator and the value is 'M%'. The (%) sign is a wildcard
that represents zero, one, or multiple characters. In other words, our value is an M
followed by a string of arbitrary length.
Figure 6.21 Select analytics students with first names that start with “M”
Instead of searching for first names starting with a particular letter, we can also
select records with first names that contain a pre-defined string, which is “ar” in the
following example.
Figure 6.22 Select analytics students with first names that contain “ar”
SU6-28
ANL252 Basic SQL in Python
Same as our code in Figure 6.22, we can use the (%) wildcard within the string to
specify the position of our pre-defined string. Unsurprisingly, we can also select first
names that end with a certain character, which is “s” here.
Figure 6.23 Select analytics students with first names that end with “s”
If we wish to select first names that start and end with some pre-defined characters,
we can put the wildcard (%) between the start and the end characters. In the following
example, we would like to select students with first names that start with an “M” and
end with an “l”.
Figure 6.24 Select analytics students with first names that start with “M” and end with “l”
The other type of wildcard for the LIKE operator is the underscore sign (_), which
represents a single character. In the following example, we would like to select all
students whose last names start with “Ta” and followed by exactly one character.
Figure 6.25 Select analytics students with last names that have the pattern “Ta_”
SU6-29
ANL252 Basic SQL in Python
The syntax in Figure 6.25 will result in selecting all last names with three characters
that start with “Ta”. In other words, last names such as “Tang” will not be included.
In Figure 6.26, cases with missing value in Birthday will be selected. This step is
usually helpful to let data analysts study the cases with missing values first and then
decide whether to remove them from the dataset.
If we simply want to select all students whose birthday records are not missing, we
will just need to replace IS NULL by IS NOT NULL in the SELECT statement.
SU6-30
ANL252 Basic SQL in Python
In SQL, we can also select particular columns from a table in the data query. The asterisk
(*) in the SELECT statement should be replaced by a list of selected variables in this case.
Example (Cont’d): In Figure 6.28, we select the columns ID, LastName, FirstName,
Nationality and Program from the table students. At the same time, we are
only interested in students outside the analytics programme. However, these records
should also be sorted by their nationalities.
Figure 6.28 Select data of non-analytics students from certain columns of a table
SU6-31
ANL252 Basic SQL in Python
In the first line, we define the tuple named sel_cols with the names of the selected
variables in it. In the second line, the .join() method is applied to ", ", the
separator between the variable names in the output string. Basically, the .join()
method runs through all the items in sel_cols and adds them with ", " (except for
the last item for which the separator is not necessary) to the string sel_cols_str
one after another.
In the next step, we first split the SELECT statement in Figure 6.28 into three parts:
"Select ", the variable list, and the rest. The original variable list can then be
replaced by the string variable sel_cols_str. Subsequently, we can concatenate
these three partitions of the string into one by “adding” them together as shown in
Figure 6.30. Furthermore, we can also use the variable list sel_cols as the value for
the parameter columns in the from_records() function.
Hereafter, we can obtain different columns of the table by just changing the variable
names in sel_cols_str and re-run the code in Figure 6.30. Note that the same
technique can also be applied to the other parts of the SELECT statement.
SU6-32
ANL252 Basic SQL in Python
Read
Refer to the links below for more details and examples on the WHERE clause of SQL:
www.w3schools.com/sql/sql_where.asp
Refer to the link below for more details and examples on the AND, OR, NOT operators
of SQL:
W3 schools. (2023c). SQL AND, OR and NOT operators. Refsnes Data. https://
www.w3schools.com/sql/sql_and_or.asp
Refer to the links below for more details and examples on the IN operator of SQL:
sql/sql_in.asp
Refer to the links below for more details and examples on the BETWEEN operator of
SQL:
www.w3schools.com/sql/sql_between.asp
between/
Refer to the links below for more details and examples on the LIKE operator of SQL:
sql_like.asp
SU6-33
ANL252 Basic SQL in Python
Refer to the link below for more details and examples on the wildcard characters in
SQL:
sql/sql_wildcards.asp
Refer to the links below for more details and examples on NULL values in SQL:
www.w3schools.com/sql/sql_null_values.asp
null/
Refer to the link below for more details and examples on the .join() methods:
library/stdtypes.html#str.join
SU6-34
ANL252 Basic SQL in Python
Lesson Recording
The tables in a database are usually connected in some way. For instance, in the database
of a bank, there may be a table with the personal records of all the customer relationship
managers and another table with the records of all customers, including their transaction
records and the names of their relationship managers. With these data, the bank can query
on the sales records of each manager within a certain period. In this case, we are joining
tables from a database to gain cross-table information.
The NATURAL JOIN clause is used within the SELECT statement. It selects only records
of table1 that can be matched by records in table2. The rows of table1 or table2 for
which SQL cannot find any matches in the opposite table will be dropped from the query.
SQL compares the values of each matching variable from the two tables specified by the
user. Usually, these variables should represent the same feature in both tables such as the
employee number or customer ID. Note that it is also possible to extend the match to
multiple pairs of variables, if necessary. Note that only one copy of each common column
will be kept.
SU6-35
ANL252 Basic SQL in Python
Example: In the following, we will create a new table called product_info which
contains the names of the products that customers wish to buy. The table will then
be joined with the table shop_info, where the customers are willing to visit. The
original datasets are stored in .csv files that include the following information:
SU6-36
ANL252 Basic SQL in Python
Finally, we perform NATURAL JOIN using the syntax mentioned at the beginning of this
Chapter.
Note that there are also other more complicated table merging methods such as cross join
and outer join which are beyond the scope of this course.
Read
Refer to the links below for more details and examples on the INNER JOIN clause of
SQL:
sql_join_inner.asp
inner-join/
SU6-37
ANL252 Basic SQL in Python
Summary
In this study unit, we first learned how to write Python programmes to store data entered
by a user to a .csv text file. After the sqlite3 package was introduced, we were able to
connect Python with the databases and convert external data sources saved as .csv text
files to database tables by SQL. With the SELECT statement, we could execute different
types of data query such as sorting and filtering data. Here, we have used Python
programming to generate SELECT statements flexibly and to convert the query output
to pandas DataFrames for better presentation in the Python environment. We have also
come across the method for joining tables of a database, i.e., natural join. The output of
these methods could vary strongly since they select the records differently.
SU6-38
ANL252 Basic SQL in Python
Formative Assessment
1. Which of the following modes of the open() function does not allow you to write to
the source file?
a. "a"
b. "r"
c. "r+"
d. "w"
3. After querying a SQL table with 50 observations, we use fetchone() to check on the
outcome of one record for merely one time. How many records will remain available
in the query output after that?
a. 0
b. 1
c. 49
d. 50
4. Table A has 50 records and Table B has 70. A total of 40 records could be matched
based on certain conditions. What would be the number of records in the output table
if we merged A and B by inner join and cross join respectively?
a. 40 and 3500
b. 50 and 3500
c. 40 and 50
SU6-39
ANL252 Basic SQL in Python
d. 50 and 70
5. Which of the following is true regarding SQL NATURAL JOIN of two tables?
a. The join condition needs to be specified explicitly.
b. The columns in the two tables are automatically matched.
c. In the case that there is no match, all rows from both tables will be returned.
d. In the case that there is no match, all columns from both tables will be returned.
SU6-40
ANL252 Basic SQL in Python
Formative Assessment
1. Which of the following modes of the open() function does not allow you to write to
the source file?
a. "a"
Incorrect. We can append new records to the file in the appending mode.
b. "r"
Correct. We can only extract data from the file in the reading mode.
c. "r+"
Incorrect. We can add new content to the file in the updating mode.
d. "w"
Correct. We send our SQL commands through the cursor object from
Python to SQL.
SU6-41
ANL252 Basic SQL in Python
Incorrect. The cursor object does not specify the table in the database that we
are working on.
3. After querying a SQL table with 50 observations, we use fetchone() to check on the
outcome of one record for merely one time. How many records will remain available
in the query output after that?
a. 0
Incorrect. The fetchone() function only fetches one record from the query
output. So, there must be more than 0 records remaining.
b. 1
Incorrect. The fetchone() function only fetches one record from the query
output. So, there must be more than 1 record remaining.
c. 49
Correct. Since the fetchone() function only fetches one record from the
query output, there must be 49 records remaining.
d. 50
Incorrect. The fetchone() function fetches one record from the query
output anyway. So, there must be less than 50 records remaining.
4. Table A has 50 records and Table B has 70. A total of 40 records could be matched
based on certain conditions. What would be the number of records in the output table
if we merged A and B by inner join and cross join respectively?
a. 40 and 3500
Correct. Inner join creates the intersection set of both tables (i.e., 40) and
cross join returns the cartesian product of both tables (i.e., 3500).
b. 50 and 3500
SU6-42
ANL252 Basic SQL in Python
c. 40 and 50
Incorrect. Inner join creates the intersection set of both tables (i.e., 40), but
only if A were left joined by B, the number of records of A would return (i.e.,
50).
d. 50 and 70
5. Which of the following is true regarding SQL NATURAL JOIN of two tables?
Correct. In SQL, the NATURAL JOIN can automatically match two tables.
c. In the case that there is no match, all rows from both tables will be returned.
d. In the case that there is no match, all columns from both tables will be
returned.
SU6-43
ANL252 Basic SQL in Python
References
Geeks for geeks. (n.d.). Difference between Natural join and Inner Join in SQL. https://
www.geeksforgeeks.org/difference-between-natural-join-and-inner-join-in-sql/
https://pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.from_records.html
pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
tutorial/inputoutput.html#tut-files
library/functions.html#open
https://docs.python.org/3/library/stdtypes.html#str.join
Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite databases.
https://docs.python.org/3/library/sqlite3.html
Python Software Foundation. (2023e). sqlite3 — DB-API 2.0 interface for SQLite databases:
objects
sqlite3.html#sqlite3.Cursor.description
alter-table/
SU6-44
ANL252 Basic SQL in Python
between/
table/
www.sqlitetutorial.net/sqlite-cross-join/
drop-table/
by/
join/
by/
rename-column/
SU6-45
ANL252 Basic SQL in Python
character-functions/lower.php
www.w3schools.com/python/ref_list_remove.asp
sql/sql_alter.asp
W3 schools. (2023c). SQL AND, OR and NOT operators. Refsnes Data. https://
www.w3schools.com/sql/sql_and_or.asp
www.w3schools.com/sql/sql_between.asp
W3 schools. (2023e). SQL COUNT(), AVG() and SUM() functions. Refsnes Data. https://
www.w3schools.com/sql/sql_count_avg_sum.asp
www.w3schools.com/sql/sql_create_table.asp
www.w3schools.com/sql/sql_ref_delete.asp
www.w3schools.com/sql/sql_drop_table.asp
www.w3schools.com/sql/sql_groupby.asp
sql/sql_having.asp
sql_in.asp
www.w3schools.com/sql/sql_join_inner.asp
SU6-46
ANL252 Basic SQL in Python
www.w3schools.com/sql/sql_insert.asp
www.w3schools.com/sql/sql_join_left.asp
W3 schools. (2023p). SQL MIN() and MAX() functions. Refsnes Data. https://
www.w3schools.com/sql/sql_min_max.asp
sql/sql_null_values.asp
www.w3schools.com/sql/sql_orderby.asp
www.w3schools.com/sql/sql_select.asp
www.w3schools.com/sql/sql_update.asp
sql/sql_where.asp
sql_wildcards.asp
www.w3schools.com/sql/sql_dates.asp
SU6-47
ANL252 Basic SQL in Python
SU6-48