0% found this document useful (0 votes)
338 views382 pages

ANL252 Python for Data Analytics Guide

This document provides an overview of the course guide and study units for the ANL252 Python for Data Analytics course. It includes information such as the course description and aims, learning outcomes, assessment overview, and course schedule. The course is divided into 6 study units covering topics such as Python programming fundamentals, data types and functions, data management with NumPy and Pandas, data analytics using Scikit-Learn, and basic SQL in Python. Each study unit includes chapters on key concepts and a formative assessment at the end to help students apply their learning.

Uploaded by

Shirley Loh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
338 views382 pages

ANL252 Python for Data Analytics Guide

This document provides an overview of the course guide and study units for the ANL252 Python for Data Analytics course. It includes information such as the course description and aims, learning outcomes, assessment overview, and course schedule. The course is divided into 6 study units covering topics such as Python programming fundamentals, data types and functions, data management with NumPy and Pandas, data analytics using Scikit-Learn, and basic SQL in Python. Each study unit includes chapters on key concepts and a formative assessment at the end to help students apply their learning.

Uploaded by

Shirley Loh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 382

Course Development Team

Head of Programme : Assoc Prof James Tan Swee Chuan


Course Developer(s) : Dr Karl Wu Ka Yui
: Dr Zhu Siying
Technical Writer : Maybel Heng, EMR
Video Production : Mohd Jufrie Bin Ramli, EMR

© 2023 Singapore University of Social Sciences. All rights reserved.

No part of this material may be reproduced in any form or by any means without
permission in writing from the Educational Media & Resources, Singapore University of
Social Sciences.

ISBN 9789814873925

Educational Media & Resources


Singapore University of Social Sciences
463 Clementi Road
Singapore 599494

How to cite this Study Guide (APA):


Wu, K. Y., & Zhu, S. (2023). ANL252 Python for data analytics. Singapore University of

Social Sciences.

Release V1.2.0

Build S1.0.15, T1.7.2


Table of Contents

Table of Contents

Course Guide
1. Welcome.................................................................................................................. CG-2

2. Course Description and Aims............................................................................ CG-3

3. Learning Outcomes.............................................................................................. CG-5

4. Learning Material................................................................................................. CG-6

5. Assessment Overview.......................................................................................... CG-8

6. Course Schedule.................................................................................................. CG-11

7. Learning Mode.................................................................................................... CG-12

Study Unit 1: Introduction to Python Programming


Learning Outcomes................................................................................................. SU1-2

Overview................................................................................................................... SU1-3

Chapter 1: Python Programming Environment.................................................. SU1-4

Chapter 2: Basic Arithmetic and Variables.......................................................... SU1-8

Chapter 3: Print and Input................................................................................... SU1-17

Chapter 4: If-elif-else-Conditions........................................................................ SU1-28

Chapter 5: Loops.................................................................................................... SU1-37

Summary................................................................................................................. SU1-51

Formative Assessment.......................................................................................... SU1-52

........................................................................................................................... SU1-

i
Table of Contents

Study Unit 2: Data Types and Functions


Learning Outcomes................................................................................................. SU2-2

Overview................................................................................................................... SU2-3

Chapter 1: Tuples, Lists, Dictionaries................................................................... SU2-4

Chapter 2: Integrated Methods and Functions................................................. SU2-42

Chapter 3: User-defined Functions..................................................................... SU2-52

Chapter 4: Modules, Packages and Libraries.................................................... SU2-56

Summary................................................................................................................. SU2-61

Formative Assessment.......................................................................................... SU2-62

........................................................................................................................... SU2-

Study Unit 3: Arrays and Plots


Learning Outcomes................................................................................................. SU3-2

Overview................................................................................................................... SU3-3

Chapter 1: Array Management with NumPy...................................................... SU3-4

Chapter 2: Plotting With Matplotlib................................................................... SU3-19

Summary................................................................................................................. SU3-28

Formative Assessment.......................................................................................... SU3-29

References............................................................................................................... SU3-39

Study Unit 4: Data Management


Learning Outcomes................................................................................................. SU4-2

Overview................................................................................................................... SU4-3

ii
Table of Contents

Chapter 1: Import Data........................................................................................... SU4-4

Chapter 2: Data Selection..................................................................................... SU4-12

Chapter 3: Merge DataFrames............................................................................. SU4-27

Chapter 4: Missing Data and Outliers................................................................ SU4-32

Chapter 5: Data Modification............................................................................... SU4-50

Summary................................................................................................................. SU4-55

Formative Assessment.......................................................................................... SU4-56

References............................................................................................................... SU4-65

Study Unit 5: Data Analytics in Python


Learning Outcomes................................................................................................. SU5-2

Overview................................................................................................................... SU5-3

Chapter 1: Introduction to Scikit-Learn................................................................ SU5-4

Chapter 2: Clustering............................................................................................ SU5-23

Chapter 3: Decision Trees..................................................................................... SU5-31

Summary................................................................................................................. SU5-49

Formative Assessment.......................................................................................... SU5-50

References............................................................................................................... SU5-55

Study Unit 6: Basic SQL in Python


Learning Outcomes................................................................................................. SU6-2

Overview................................................................................................................... SU6-3

Chapter 1: Introduction to SQL and SQLite3...................................................... SU6-4

Chapter 2: Data Query.......................................................................................... SU6-18

iii
Table of Contents

Chapter 3: Joining Tables...................................................................................... SU6-35

Summary................................................................................................................. SU6-38

Formative Assessment.......................................................................................... SU6-39

References............................................................................................................... SU6-44

iv
List of Tables

List of Tables

Table 1.1 Python arithmetic operators...................................................................... SU1-9

Table 1.2 Types of variables in Python................................................................... SU1-14

Table 1.3 List of some escape sequences available in Python.............................. SU1-23

Table 1.4 Examples of relational operation............................................................ SU1-28

Table 1.5 List of relational operators....................................................................... SU1-29

Table 1.6 List of logical operators............................................................................ SU1-30

Table 1.7 List of some common exceptions............................................................ SU1-46

Table 2.1 Built-in functions of Python..................................................................... SU2-42

Table 2.2 Built-in methods of Python...................................................................... SU2-45

Table 2.3 Some common packages for data analytics in Python......................... SU2-59

Table 3.1 Most common NumPy functions............................................................ SU3-12

Table 4.1 Most common data file formats and the corresponding reader in
pandas............................................................................................................................. SU4-5

Table 5.1 Most common algorithms available in scikit-learn................................ SU5-4

Table 5.2 Other data preparation techniques......................................................... SU5-15

Table 5.3 Parameters of the KMeans estimator...................................................... SU5-26

Table 5.4 Parameters of the DecisionTreeClassifier and


DecisionTreeRegressor estimators.................................................................. SU5-35

Table 5.5 Parameters of the plot_tree function...................................................... SU5-43

v
List of Tables

Table 6.1 List of some available modes of the open() function........................... SU6-6

Table 6.2 Operators in the WHERE clause................................................................ SU6-25

vi
List of Figures

List of Figures

Figure 1.1 Starting Jupyter Notebook........................................................................ SU1-5

Figure 1.2 JupyterLab interface.................................................................................. SU1-6

Figure 1.3 JupyterLab input cell for Python code entry......................................... SU1-6

Figure 1.4 JupyterLab shows the output after the code is run.............................. SU1-6

Figure 1.5 Simple calculations with Python............................................................. SU1-8

Figure 1.6 Assigning values to variables for operation........................................ SU1-13

Figure 1.7 Checking variable type........................................................................... SU1-16

Figure 1.8 Simple use of the print() function......................................................... SU1-17

Figure 1.9 Printing variables and their calculations.............................................. SU1-18

Figure 1.10 Printing formatted string...................................................................... SU1-19

Figure 1.11 Output of incomplete string formatting syntax................................ SU1-20

Figure 1.12 Simple usage of the .format() method.......................................... SU1-20

Figure 1.13 Multiple usage of .format() method.............................................. SU1-20

Figure 1.14 Syntax error caused by quotation marks within a string................. SU1-21

Figure 1.15 Printing quotation marks within a string........................................... SU1-22

Figure 1.16 Printing quotation marks using escape sequence............................. SU1-22

Figure 1.17 Erroneous line breaks within a string................................................. SU1-22

Figure 1.18 Line breaks created by escape sequence \n....................................... SU1-23

vii
List of Figures

Figure 1.19 Example of using input() function.................................................. SU1-25

Figure 1.20 Convert Input Value to Integer and Float.......................................... SU1-26

Figure 1.21 if-statement example with True condition..................................... SU1-33

Figure 1.22 if-statement example with False condition................................... SU1-33

Figure 1.23 if-else-statement example................................................................ SU1-34

Figure 1.24 Example of if-elif-else-statement................................................ SU1-36

Figure 1.25 Repeating the same code for two individuals to enter..................... SU1-38

Figure 1.26 Entering names and scores repeatedly by while-loops.................. SU1-39

Figure 1.27 Entering names and scores repeatedly by for-Loops...................... SU1-42

Figure 1.28 Breaking from loops using break...................................................... SU1-44

Figure 1.29 Using exception for error handling..................................................... SU1-47

Figure 1.30 Using while-loop for user input if error occurs.............................. SU1-48

Figure 1.31 The complete example programme of Study Unit 1......................... SU1-50

Figure 2.1 Creating tuples in Python......................................................................... SU2-5

Figure 2.2 Sub-setting tuples...................................................................................... SU2-6

Figure 2.3 Sub-setting of tuples with negative indices........................................... SU2-7

Figure 2.4 Sub-setting tuples with “open end” indexing....................................... SU2-8

Figure 2.5 Erroneous modification of tuples............................................................ SU2-8

Figure 2.6 Erroneous adding of elements to tuples................................................ SU2-9

Figure 2.7 Concatenation of two tuples.................................................................... SU2-9

Figure 2.8 Adding elements to tuples by concatenation...................................... SU2-10

viii
List of Figures

Figure 2.9 Length of tuples and index out of range.............................................. SU2-11

Figure 2.10 Sub-setting the last element of tuples................................................. SU2-11

Figure 2.11 Printing tuple elements by for-loops................................................ SU2-13

Figure 2.12 Creating lists in Python........................................................................ SU2-14

Figure 2.13 Sub-setting lists...................................................................................... SU2-15

Figure 2.14 Editing single item in a list.................................................................. SU2-17

Figure 2.15 Editing multiple items in a list............................................................ SU2-17

Figure 2.16 Replacing more values than the specified indices............................ SU2-18

Figure 2.17 Replacing less values than the specified indices............................... SU2-18

Figure 2.18 Concatenating lists................................................................................. SU2-19

Figure 2.19 Concatenating lists with different data types.................................... SU2-20

Figure 2.20 Merging two lists by keeping them as lists in the new one............. SU2-21

Figure 2.21 Extracting elements from a merged list.............................................. SU2-22

Figure 2.22 Printing list elements line by line to the screen................................. SU2-23

Figure 2.23 Printing elements of multiple lists using formatted string.............. SU2-24

Figure 2.24 Rearranging data storage in lists......................................................... SU2-25

Figure 2.25 Printing data of a list using formatted string.................................... SU2-26

Figure 2.26 User’s input to a list.............................................................................. SU2-27

Figure 2.27 Creating a dictionary in Python.......................................................... SU2-32

Figure 2.28 Extracting value from a dictionary..................................................... SU2-32

Figure 2.29 Extracting keys and values from a dictionary................................... SU2-33

ix
List of Figures

Figure 2.30 Store keys and values from a dictionary in lists................................ SU2-33

Figure 2.31 Print all the keys from a dictionary line by line................................ SU2-34

Figure 2.32 Print all the keys and values from a dictionary line by line............ SU2-34

Figure 2.33 Extracting items from a dictionary...................................................... SU2-35

Figure 2.34 Print items from a dictionary after converting it to a list................. SU2-36

Figure 2.35 Change one value in a dictionary....................................................... SU2-37

Figure 2.36 Add items to every dictionary key and convert values to lists....... SU2-37

Figure 2.37 Print items from a dictionary after converting to lists using for-
loops.............................................................................................................................. SU2-37

Figure 2.38 Changing a key of a dictionary........................................................... SU2-38

Figure 2.39 Adding a new key to a dictionary...................................................... SU2-39

Figure 2.40 Merging two dictionaries using the v3.5+ option............................. SU2-40

Figure 2.41 Merging two dictionaries using the v3.9+ option............................. SU2-40

Figure 2.42 Computing statistics of a dictionary’s values.................................... SU2-44

Figure 2.43 Defining functions................................................................................. SU2-54

Figure 2.44 Sorting dictionary using user-defined functions............................... SU2-55

Figure 2.45 Installing package “numpy” with pip................................................ SU2-58

Figure 2.46 Installing package “matplotlib” with pip3......................................... SU2-58

Figure 3.1 Creating an array with NumPy............................................................... SU3-6

Figure 3.2 Sub-setting a column from a NumPy array........................................... SU3-8

Figure 3.3 Sub-setting a sub-array from a NumPy array....................................... SU3-8

x
List of Figures

Figure 3.4 Creating a Boolean mask of a NumPy array......................................... SU3-9

Figure 3.5 Sub-setting a NumPy array based on a Boolean mask......................... SU3-9

Figure 3.6 Extracting information on the characteristics of a NumPy


array.............................................................................................................................. SU3-10

Figure 3.7 Calculate column statistics of a NumPy array.................................... SU3-17

Figure 3.8 Importing “matplotlib.pyplot” into the programme.......................... SU3-21

Figure 3.9 Programme for creating a line plot....................................................... SU3-22

Figure 3.10 Line plot generated by “matplotlib.pyplot”...................................... SU3-22

Figure 3.11 Creating data array for subsequent plotting...................................... SU3-24

Figure 3.12 Programme for creating a histogram.................................................. SU3-24

Figure 3.13 Histogram generated by “matplotlib.pyplot”................................... SU3-25

Figure 3.14 Programme for creating a scatter plot................................................ SU3-26

Figure 3.15 Scatter plot generated by “matplotlib.pyplot”.................................. SU3-27

Figure 4.1 Importing data with pandas.................................................................... SU4-9

Figure 4.2 Printing entire imported dataset............................................................. SU4-9

Figure 4.3 Printing the head and tail of the imported dataset by
display().................................................................................................................. SU4-10

Figure 4.4 Select columns from a DataFrame by variable names........................ SU4-13

Figure 4.5 Select a single column from a DataFrame and save it as NumPy
array.............................................................................................................................. SU4-13

Figure 4.6 Select rows from a DataFrame by random row indices..................... SU4-14

Figure 4.7 Setting the values of a column as row index....................................... SU4-16

xi
List of Figures

Figure 4.8 Resetting row index of a DataFrame.................................................... SU4-16

Figure 4.9 Setting a NumPy array as row index................................................... SU4-17

Figure 4.10 Selecting rows by single row label...................................................... SU4-18

Figure 4.11 Selecting rows by multiple row labels................................................ SU4-18

Figure 4.12 Selecting cells using indices................................................................. SU4-21

Figure 4.13 Selecting cells using row and column labels..................................... SU4-21

Figure 4.14 Selecting cells by row indices and column labels............................. SU4-22

Figure 4.15 Selecting cells by row labels and column indices............................. SU4-22

Figure 4.16 Selecting cells by a Boolean mask....................................................... SU4-24

Figure 4.17 Selecting cells by chaining two Boolean masks................................. SU4-25

Figure 4.18 Selecting cells by chaining multiple Boolean masks......................... SU4-25

Figure 4.19 Concatenating two datasets with different rows but identical
variables........................................................................................................................ SU4-27

Figure 4.20 Appending two DataFrames................................................................ SU4-28

Figure 4.21 df1 on Customer ID and Product........................................................ SU4-30

Figure 4.22 df2 on Customer ID and Shop............................................................. SU4-31

Figure 4.23 DataFrame after inner join................................................................... SU4-31

Figure 4.24 Declaring specific strings as missing values while importing


data................................................................................................................................ SU4-34

Figure 4.25 Counting the number of missing values in each variable................ SU4-37

Figure 4.26 Identifying observations with missing values................................... SU4-37

xii
List of Figures

Figure 4.27 Selecting observations with missing values from a


DataFrame.................................................................................................................... SU4-38

Figure 4.28 Dropping observations with missing values using .drop()


method.......................................................................................................................... SU4-41

Figure 4.29 Dropping observations with missing values using .dropna()


method.......................................................................................................................... SU4-42

Figure 4.30 Replacing missing values by 0 in the entire DataFrame.................. SU4-43

Figure 4.31 Replacing missing values by “Unknown” in specific columns....... SU4-44

Figure 4.32 Computing criteria for outlier detection in a numeric


variable......................................................................................................................... SU4-47

Figure 4.33 Selected outlier observations from a DataFrame.............................. SU4-47

Figure 4.34 Selected non-outlier observations from a DataFrame...................... SU4-48

Figure 4.35 Sorting a DataFrames by two variables.............................................. SU4-51

Figure 4.36 Computing the mean of all numeric columns for grouped
data................................................................................................................................ SU4-53

Figure 4.37 Computing the mean of selected columns for grouped data........... SU4-53

Figure 5.1 Importing modules and functions from scikit-learn............................. SU5-6

Figure 5.2 Reducing the number of categories for a categorical variable........... SU5-10

Figure 5.3 Renaming selected variable.................................................................... SU5-11

Figure 5.4 Creating dummy variables from categorical variables....................... SU5-13

Figure 5.5 Linear regression (two-dimensional).................................................... SU5-17

Figure 5.6 Import packages for linear regression.................................................. SU5-19

xiii
List of Figures

Figure 5.7 NumPy array data preparation............................................................. SU5-20

Figure 5.8 Construct linear regression model........................................................ SU5-20

Figure 5.9 Calculate the predicted value................................................................ SU5-20

Figure 5.10 Linear regression model plot............................................................... SU5-21

Figure 5.11 Import packages for KMeans clustering............................................. SU5-28

Figure 5.12 Load iris dataset..................................................................................... SU5-28

Figure 5.13 Construct KMeans clustering model.................................................. SU5-29

Figure 5.14 Obtain the clusters................................................................................. SU5-29

Figure 5.15 KMeans cluster plot............................................................................... SU5-29

Figure 5.16 Load packages and prepare dataset for tree plot.............................. SU5-45

Figure 5.17 Fit and plot decision tree model.......................................................... SU5-46

Figure 5.18 Decision tree plot................................................................................... SU5-47

Figure 6.1 Importing sqlite3 and pandas.................................................................. SU6-5

Figure 6.2 Programme to input data into a .csv File............................................... SU6-9

Figure 6.3 Interface to input data in JupyterLab................................................... SU6-10

Figure 6.4 Print the records in a .csv file to the screen line by line..................... SU6-11

Figure 6.5 Establish a connection between Python and a database.................... SU6-13

Figure 6.6 Import a.csv file dataset as a pandas DataFrame................................ SU6-14

Figure 6.7 Export pandas DataFrame to database by SQL................................... SU6-14

Figure 6.8 Select a table from the Database............................................................ SU6-16

Figure 6.9 Fetch one record for printing................................................................. SU6-16

xiv
List of Figures

Figure 6.10 Fetch all records for printing............................................................... SU6-16

Figure 6.11 Re-fetch records after applying fetchone() or


fetchall()................................................................................................................ SU6-17

Figure 6.12 Extract column names of a table by the .description


attribute......................................................................................................................... SU6-19

Figure 6.13 Generate variable list of a table........................................................... SU6-19

Figure 6.14 Execute an SQL statement stored in a string variable...................... SU6-20

Figure 6.15 Convert the result of an SQL query to a pandas DataFrame........... SU6-21

Figure 6.16 Sort students by their birthdays and nationalities............................ SU6-23

Figure 6.17 Select only analytics students from the table students................. SU6-26

Figure 6.18 Select only analytics students with ID number between 5 and
10.................................................................................................................................... SU6-27

Figure 6.19 Select only analytics students with ID number not between 5 and
10.................................................................................................................................... SU6-27

Figure 6.20 Select analytics students not from Singapore or China.................... SU6-28

Figure 6.21 Select analytics students with first names that start with “M”........ SU6-28

Figure 6.22 Select analytics students with first names that contain “ar”............ SU6-28

Figure 6.23 Select analytics students with first names that end with “s”........... SU6-29

Figure 6.24 Select analytics students with first names that start with “M” and
end with “l”................................................................................................................. SU6-29

Figure 6.25 Select analytics students with last names that have the pattern
“Ta_”.............................................................................................................................. SU6-29

Figure 6.26 Select students with no birthday record............................................. SU6-30

xv
List of Figures

Figure 6.27 Select students with non-missing birthday records.......................... SU6-30

Figure 6.28 Select data of non-analytics students from certain columns of a


table............................................................................................................................... SU6-31

Figure 6.29 Concatenate items of a tuple to a string............................................. SU6-32

Figure 6.30 Select certain columns of a table by a variable list string................. SU6-32

Figure 6.31 Information about products and customers...................................... SU6-36

Figure 6.32 Information about shops and customers............................................ SU6-36

Figure 6.33 Read .csv files......................................................................................... SU6-36

Figure 6.34 Connect Python to SQL database........................................................ SU6-37

Figure 6.35 Natural join of two tables..................................................................... SU6-37

xvi
List of Lesson Recordings

List of Lesson Recordings

Introduction to JupyterLab.......................................................................................... SU1-4

Basic Arithmetic and Variables in Python................................................................ SU1-8

Print and Input in Python......................................................................................... SU1-17

If-elif-else-Conditions in Python.............................................................................. SU1-28

Loops in Python.......................................................................................................... SU1-37

Python Tuples................................................................................................................ SU2-4

Python Lists................................................................................................................. SU2-13

Python Dictionaries.................................................................................................... SU2-31

Integrated Methods and Functions in Python....................................................... SU2-42

User-defined Functions in Python........................................................................... SU2-52

Modules, Packages and Libraries in Python.......................................................... SU2-56

Array Management with NumPy.............................................................................. SU3-4

Plotting with matplotlib............................................................................................ SU3-19

Import Data in pandas................................................................................................. SU4-4

Data Selection in pandas........................................................................................... SU4-12

Merge DataFrames in pandas................................................................................... SU4-27

Missing Data and Outliers in pandas DataFrames............................................... SU4-32

Data Modification in pandas.................................................................................... SU4-50

xvii
List of Lesson Recordings

Introduction to Scikit-Learn........................................................................................ SU5-4

Data Preparation for Analytics Algorithms of scikit-learn (1/2)........................... SU5-8

Data Preparation for Analytics Algorithms of scikit-learn (2/2)........................... SU5-8

Introduction and Fitting of K-Means Clustering by scikit-learn......................... SU5-23

Introduction and Fitting of Decision Trees by scikit-learn................................... SU5-31

Evaluate Decision Trees by scikit-learn................................................................... SU5-42

Introduction to SQL...................................................................................................... SU6-4

Data Query with SQL................................................................................................ SU6-18

Join Tables with SQL.................................................................................................. SU6-35

xviii
Course
Guide

Python for Data Analytics


ANL252 Course Guide

1. Welcome

Presenter: Dr Karl Wu Ka Yui

This streaming video requires Internet connection. Access it via Wi-Fi to


avoid incurring data charges on your personal mobile plan.

Click here to watch the video. i

Click here for the transcript.

Welcome to the course ANL252 Python for Data Analytics, a 5 credit unit (CU) course.

This StudyGuide will be your personal learning resource to take you through the course
learning journey. The guide is divided into two main sections – the Course Guide and
Study Units.

The Course Guide describes the structure for the entire course and provides you with an
overview of the Study Units. It serves as a roadmap of the different learning components
within the course. This Course Guide contains important information regarding the
course learning outcomes, learning materials and resources, assessment breakdown and
additional course information.

i
https://d2jifwt31jjehd.cloudfront.net/ANL252/IntroVideo/ANL252_Intro_Video.mp4

CG-2
ANL252 Course Guide

2. Course Description and Aims

The course provides foundational knowledge and skills of Python programming, which
enables students to develop programs for data preparation, data management, and data
visualisation to carry out data analytics tasks such as clustering, decision tree, etc. Students
also acquire skills to explore and find patterns in datasets using ready-to-use Python codes
that can be modified to suit individual needs. Furthermore, this course introduces the
application of SQL for querying data from database within any Python code for data
analytics purposes. Since this course is designed to help students with little prior exposure
to programming, it will focus on breadth rather than depth.

Course Structure
This course is a 5-credit unit course.

There are six Study Units in this course. The following provides an overview of each Study
Unit.

Study Unit 1 – Introduction to Python Programming

This unit takes the first steps in Python programming, including variables, data types,
operators, formatted printing, and user input. It also introduces the conditional statement
and loops, two types of control flow that change the behaviour of a program dynamically.

Study Unit 2 – Data Types and Functions

This unit establishes three compound built-in data types in Python: tuples, lists, and
dictionaries. Compound data structures organise and store data in a way that they
can be accessed and worked with efficiently. The type of compound data also defines
the relationship between the data and the operations that can be performed on them.
Furthermore, this unit covers the application of functions, methods, packages, and
modules and how they can be integrated in the program.

CG-3
ANL252 Course Guide

Study Unit 3 – Arrays and Plots

This study unit introduces two Python packages: NumPy and matplotlib. NumPy is
the fundamental package for efficient scientific computing with Python. The students
first learn to create and subset NumPy arrays, before getting on to generate statistics
on the data stored in an array by the integrated NumPy functions. Furthermore, this
unit illustrates the use of the matplotlib package for data visualisation, including the
functionality for plotting and customising basic charts of data analytics.

Study Unit 4 – Data Management

This unit establishes the pandas DataFrame as the key data structure for analytics in
Python. It demonstrates the process of creating Python programs for importing data
from external sources, indexing, and querying data from a DataFrame. It also focuses on
merging multiple DataFrames efficiently, identifying and dealing with missing data and
outliers, sorting, and grouping data.

Study Unit 5 – Data Analytics in Python

This unit covers the application of k-means clustering and decision trees in Python
programming using the scikit-learn library. The specific preparation of DataFrames due
to the different requirements of the analytics algorithms is discussed in the first step,
followed by the implementation of the two techniques in Python programs, including
parameter settings, presentation, and visualisation of the results.

Study Unit 6 – Basic SQL in Python

This unit describes the usage of SQL to query data from databases in Python programs.
The students learn basic information about SQL as a programming language and the
related SQLite3 package. Students will also learn about how to create flexible SQL
statements for data queries such as selecting data and merging tables using Python
programming skills.

CG-4
ANL252 Course Guide

3. Learning Outcomes

Knowledge & Understanding (Theory Component)

By the end of this course, you will be able to:

• Differentiate the various aspects of Python programming.


• Discuss how Python manages packages, modules, functions, etc.
• Explain the operations on arrays and datasets.

Key Skills (Practical Component)

By the end of this course, you will be able to:

• Design Python programmes for performing data analytics.


• Employ logic control flows in Python programmes.
• Prepare data for analysis using Python programming.
• Analyse data using appropriate tools and techniques with Python programming.

CG-5
ANL252 Course Guide

4. Learning Material

To complete the course, you will need the following learning material(s):

Required Textbook(s)
Shaw, Z. A. (2017). Learn python 3 the hard way. Addison-Wesley Professional.

If you are enrolled into this course, you will be able to access the eTextbooks here:

To launch eTextbook, you need a VitalSource account which can be created via
Canvas (iBookStore), using your SUSS email address. Access to adopted eTextbook is
restricted by enrolment to this course.

Recommended Study Material(s) for Learning Activities (Optional)

Website(s):
The Python documentation. https://docs.python.org/

JupyterLab Documentation. https://jupyterlab.readthedocs.io/en/stable/

NumPy user guide. https://numpy.org/

matplotlib pyplot overview. https://matplotlib.org/stable/api/pyplot_summary.html

pandas documentation. https://pandas.pydata.org/pandas-docs/stable/index.html

CG-6
ANL252 Course Guide

scikit learn API reference. https://scikit-learn.org/stable/modules/classes.html

Learn Python. tutorialspoint. https://www.tutorialspoint.com/python/

Python Tutorial, SQL Tutorial. https://www.w3schools.com/

SQLite Tutorial. https://www.sqlitetutorial.net/

CG-7
ANL252 Course Guide

5. Assessment Overview

The overall assessment weighting for Day & Evening course is as follows:

Assessment Description Weight Allocation

Overall Continuous Pre-Course Quiz 01 2%


Assessment Score
Pre-Class Quiz 01 2%

Pre-Class Quiz 02 2%

Tutor-Marked Assignment 18%

Group-Based Assignment 20%

Participation 6%

Overall Examinable Score End-of-Course 50%


Assessment

TOTAL 100%

The overall assessment weighting for the full online classes for this course is as follows:

Assessment Description Weight Allocation

Overall Continuous Pre-Course Quiz 2%


Assessment Score
Pre-Class Quiz 1 2%

Pre-Class Quiz 2 2%

CG-8
ANL252 Course Guide

Assessment Description Weight Allocation

Discussion 10%

Participation 6%

Tutor-marked Assignment 18%

Group-based Assignment 10%

Overall Examinable Score End-of-Course 50%


Assessment

TOTAL 100%

SUSS’s assessment strategy consists of two components, Overall Continuous Assessment


(OCAS) and Overall Examinable Component (OES) that make up the overall course
assessment score. Both components will be equally weighted: 50% OCAS and 50% OES.

a. OCAS: All the components will constitute 100% of OCAS.


b. OES: The Examination is 100% of this component.

To be sure of a pass result you need to achieve scores of 40% in each component. Your
overall rank score is the weighted average of both components.

For detailed information on the Course grading policy, please refer to The Student
Handbook (‘Award of Grades’ section under Assessment and Examination Regulations).
The Student Handbook is available from the Student Portal.

Non-graded Learning Activities:

Activities for the purpose of self-learning are present in each study unit. These learning
activities are meant to enable you to assess your understanding and achievement of the
learning outcomes. The type of activities can be in the form of Formative Assessment,

CG-9
ANL252 Course Guide

Quiz, Review Questions, Application-Based Questions or similar. You are expected to


complete the suggested activities either independently and/or in groups.

CG-10
ANL252 Course Guide

6. Course Schedule

To pace yourself and monitor your study progress, pay special attention to your
Course Schedule. It contains study-unit-related activities including Assignments, Self-
Assessments, and Examinations. Please refer to the Course Timetable on the Student
Portal for the most current Course Schedule.

Note: Always make it a point to check the Student Portal for announcements and
updates.

CG-11
ANL252 Course Guide

7. Learning Mode

The learning approach for this course is structured along the following lines:

a. Self-study guided by the study guide units. Independent study will require at
least 3 hours per week.
b. Working on assignments, either individually or in groups.
c. Classroom Seminars.

StudyGuide

You may be viewing the interactive StudyGuide, which is the mobile-friendly version of
the StudyGuide. The StudyGuide is developed to enhance your learning experience with
interactive learning activities and engaging multimedia. You will be able to personalise
your learning with digital bookmarking, note-taking, and highlighting of texts if your
reader supports these features.

Interaction with Instructor and Fellow Students

Flexible learning—learning at your own pace, space, and time—is a hallmark at SUSS,
and we strongly encourage you to engage your instructor and fellow students in online
discussion forums. Sharing of ideas through meaningful debates will help broaden your
perspective and crystallise your thinking.

Academic Integrity

As a student of SUSS, you are expected to adhere to the academic standards stipulated
in the Student Handbook, which contains important information regarding academic
policies, academic integrity, and course administration. It is your responsibility to read
and understand the information outlined in the Student Handbook prior to embarking on
the course.

CG-12
1
Study
Unit

Introduction to Python
Programming
ANL252 Introduction to Python Programming

Learning Outcomes

By the end of this unit, you should be able to:

1. Differentiate the various aspects of Python programming


2. Employ logic control flows in Python programmes

SU1-2
ANL252 Introduction to Python Programming

Overview

This study unit introduces the Python programming environment and the writing of
Python programmes with some foundation elements. We will also learn how to create
different types of variables and how to assign values to them for further operations. Since
input and output belong to the core of any computer programme, we will learn how
to create user input and construct formatted strings for printing as well. Also, we will
cover the construction of Boolean expressions as conditional statements to control the
behaviour of the program. Finally, we will find out how to create finite loops to repeat
routine instructions iteratively.

SU1-3
ANL252 Introduction to Python Programming

Chapter 1: Python Programming Environment

Lesson Recording

Introduction to JupyterLab

1.1 Installation of Python and Anaconda


In this course, we use Jupyter Notebook/Lab as the environment is more user friendly. In
particular, you should install the Anaconda distribution at https://www.anaconda.com/.
(This installation comes with Jupyter Lab, Jupyter Notebook, Spyder, R Studio, etc.) To
install, use the default installation option. When done, choose the Jupyter Notebook
(anaconda3) from Window’s Start menu as shown in Figure 1.1. Because Anaconda comes
with most of the packages we need in ANL252, you will no longer need to use the pip
install commands to install different packages.

SU1-4
ANL252 Introduction to Python Programming

Figure 1.1 Starting Jupyter Notebook

If you encounter installation issues, you may also start by using the JupyterLab at: https://
jupyter.org/try-jupyter/lab/. Visit this page and the following screen will appear. You
can start a new notebook by clicking on the Pyolite icon (more instructions will follow
later).

SU1-5
ANL252 Introduction to Python Programming

Figure 1.2 JupyterLab interface

1.2 Writing and Executing Python Programmes


We can start writing our Python programme on JupyterLab web interface by clicking the
Pyolite icon under the Notebook section.

Figure 1.3 JupyterLab input cell for Python code entry

Figure 1.4 JupyterLab shows the output after the code is run

SU1-6
ANL252 Introduction to Python Programming

Once we see Figure 1.3, we can type in our Python code in the cell and let the Python
interpreter execute it by pressing the play button (c.f. Figure 1.4), once the syntax is
completed. The output of the programme will then be printed below the input cell.

SU1-7
ANL252 Introduction to Python Programming

Chapter 2: Basic Arithmetic and Variables

Lesson Recording

Basic Arithmetic and Variables in Python

2.1 Arithmetic Operators


Before we start writing more sophisticated programs, we shall first go one step backwards
and familiarise ourselves with the most original function of a computer: calculation.
Python can be powerful in many ways, but we can also use it for very trivial tasks such
as adding two numbers together. In Figure 1.5, we instruct Python to carry out a simple
addition 2 + 7 for us. Similarly, we can also command Python to do other basic arithmetic
operations.

Figure 1.5 Simple calculations with Python

The following Python arithmetic operators are available for mathematical calculations:

SU1-8
ANL252 Introduction to Python Programming

Table 1.1 Python arithmetic operators

Operator Function Example

+ Addition Adds values on either 10 + 20 = 30


side of the operator

- Subtraction Subtracts right-hand 10 – 20 = -10


operand from left-hand
operand

* Multiplication Multiplies values 10 * 20 = 200


on either side of the
operator

/ Division Divides left-hand 20 / 10 = 2


operand by right-hand
operand

% Modulus Divides left-hand 20 % 10 = 0


operand by right-hand
operand and returns
remainder

** Exponent Performs exponential 10 ** 20 =


(power) calculation on 10,000,000,000,000,
operators 000,000

// Floor Division The division of 9 // 2 = 4


operands where the
9.0 // 2.0 = 4.0
result is the quotient
-11 // 3 = -4
in which the digits

SU1-9
ANL252 Introduction to Python Programming

Operator Function Example

after the decimal point -11.0 // 3 = -4.0


are removed. But if
one of the operands is
negative, the result is
floored – i.e., rounded
away from zero
(towards negative
infinity).

(Source: Tutorials Point, 2023, https://www.tutorialspoint.com/python/python_basic_operators.htm)

As with normal mathematics, the exponent operator has higher priority than the operators
of multiplication or division which, in turn, will be calculated prior to the addition or
subtraction operators. Furthermore, we can add parentheses to our equation to indicate
that those terms within a parenthesis should have the highest priority in the calculation.

Note that mathematical functions such as the square root, the logarithm, the exponential,
or the trigonometrical functions are not included in basic Python. If we want to include
these functions in our calculation, we will need to import packages such as “math” or
“NumPy” in our code. We will discuss how to import and call functions from external
libraries or packages in Study Unit 2.

There are other operators in Basic Python such as relational operators, logical operators,
etc. We will also discuss these at a later stage of this study unit.

SU1-10
ANL252 Introduction to Python Programming

Read

For more about basic operators in Python, read:

Tutorials Point. (2023). Python – Basic operators. https://www.tutorialspoint.com/

python/python_basic_operators.htm

2.2 Variables
In most situations, we wish to write programmes that help us automate routine operations
without adjusting our programmes according to the actual needs. For instance, we may
not always want to add 2 and 7 together. Instead, we would prefer to let the computer add
up any pair of arbitrary numbers for us, and we can choose these numbers depending on
the situation. As a result, we would like to keep our programme as general as possible by
using variables instead.

In Python, we define a variable by its name, which is an arbitrary combination of


characters (A-Z, a-z), underscores (_) and numbers (0-9). Subsequently, we assign a value
to the variable and let Python operate with it. We can change the value of the defined
variable at any stage of our programme.

To assign a value to a variable, all we need to do is:

variable = value

Remember that it is important to put the variable left of the equal sign (=) and the value
right of it. If we switched their positions, it would be equivalent to assigning a name to
a number. This would result in a syntax error, and Python will stop executing the rest of
the programme at once.

SU1-11
ANL252 Introduction to Python Programming

In Python, the name of a variable can be short (e.g., x, y, z) or more descriptive (e.g., age,
carname, total_volume). But there are certain rules which we must follow when we
create our variable names, namely:

• A variable name must start with a letter or an underscore (_)


• A variable name cannot start with a number
• A variable name can only contain alpha-numeric characters and underscores (i.e.,
A-z, 0-9, and _)
• Variable names are case-sensitive (age, Age, and AGE are three different variables).

(Source: W3 schools, 2023, https://www.w3schools.com/python/


python_variables.asp)

Here are some examples of valid variable names:

myvar = 10

my_var = 10

my_var = 10

myVar = 10

MYVAR = 10

_myvar2 = 10

Here are some examples of invalid variable names:

2myvar = 10

my-var = 10

my var = 10

SU1-12
ANL252 Introduction to Python Programming

Once values are assigned to variables, we can use them for any arithmetic operations for
numeric values (refer to Chapter 2.1).

Example (Students’ score): Suppose we have the exam scores of two students, 30 and
65, and we would like to store them in two variables, score1 and score2, for some
mathematical operations. Subsequently, we can conduct arithmetic operations with
these variables.

Figure 1.6 Assigning values to variables for operation

Read

Read the following section of the textbook on examples of creating and using variables
in Python:

Exercise 4 Variables and Names

SU1-13
ANL252 Introduction to Python Programming

2.3 Types of Variable and Expressions


In Python, there are different types of variables that we can work with. In the previous
section, we assigned numeric values to variables which make them numeric variables.
Nevertheless, there are also different types of numeric variables. Here are some main types
of variable available in Python:

Table 1.2 Types of variables in Python

Type Description Example

Integer The value of an integer variable must be an integer , a = 5


a value without decimal point. It can be both
positive and negative.

Float The value of a float variable can be an arbitrary b = 10.5


numeric value with a floating point.

String The value of a string variable can contain any c = "John"


letters in both cases, special characters as well as
d = "Tan"
numbers. Note that if numbers are assigned to a
string variable, no mathematical operations can be
applied to it.

To assign a value to a string variable, the value


must be written between a pair of quotation marks
(it can be single or double quotation mark, but it
must be consistent for the same value).

Furthermore, two strings can be concatenated by c + d =


being “added up”. "JohnTan"

SU1-14
ANL252 Introduction to Python Programming

Type Description Example

Boolean (Bool) The value of a Boolean variable can be either True e = True
or False.

In the following, we will use the general term expression for variables or when they are
linked with operators. For instance, a + b is an expression and not a variable, unless we
define c = a + b in our programme where c is then a new variable. However, we would
rather refer to expressions in our programme directly since we do not always define new
variables for calculation steps in between.

To check the type of a variable, we can use the type() function on any variable in our
program.

type(variable_name)

Python will then print the variable type such as “int” (for integer), “float” (for float), or
“str” (for character string) to the screen.

Example (Cont’d): In Figure 1.7, we assign 30 and 65 to the two variables, score1
and score2, respectively. We can use the type() function to check their variable
type.

SU1-15
ANL252 Introduction to Python Programming

Figure 1.7 Checking variable type

We can see that Python returns <class 'int'> as the screen output. The
information we are enquiring is put within the single quotation marks like int in this
case. As a result, we can see that both score1 and score2 are integer variables.

Read

For more information about the different operations on numeric data types, read:

Python Software Foundation. (2023b). Built-in types. https://docs.python.org/3/


library/stdtypes.html

SU1-16
ANL252 Introduction to Python Programming

Chapter 3: Print and Input

Lesson Recording

Print and Input in Python

3.1 Printing
Writing programmes not only automates routine operations by the computer but also
shows the use the results, information, or messages while the Python programme is
running. We can use the print() function to generate screen output for the user to read.

print("My String")

All we need to do here is to put the text within a pair of quotation marks and pack
everything inside the print() function. While the programme is being executed, Python
will extract the content within the quotation marks and print it onto the screen.

Figure 1.8 Simple use of the print() function

The print function is not only limited to printing pre-defined strings. We often wish to
print out the result of a calculation, as shown in Figure 1.9, or the value of a variable, or
the result of a calculation based on variables as well.

SU1-17
ANL252 Introduction to Python Programming

Example (Cont’d): Suppose we would like Python to print the exam score of student1,
score1, and the sum of the two scores, score1 + score2, onto the screen.

Figure 1.9 Printing variables and their calculations

3.2 String Formatting


In the last command of Figure 1.9, we asked Python to first execute the mathematical
operation score1 + score2, and then print it out to the user. Basically, there is nothing
wrong with this screen output; it only looks a bit “cold” and not interactive. To make such
an output look more like a statement addressing the user, we can mix a statement with
variables or expressions in our printing string to become a formatted string for printing.

print(f"My String {expression1} {expression2} …")

The print-command is almost identical to the one for normal printing. The only
difference here is to put an “f” before the open quotation mark. Subsequently, we can
place the variables or expressions that we would like to print anywhere within the text
and wrap it within a pair of curly brackets {}.

SU1-18
ANL252 Introduction to Python Programming

Example (Cont’d): Suppose we would like to print the scores of the two students as
well as their sum in a sentence such as “Our scores are 30 and 65. The total score is
95”. We can use the following code to create this screen output.

Figure 1.10 Printing formatted string

Unlike the entire argument within the quotation marks, every expression written inside
the curly brackets of a formatted string will be evaluated before being printed onto the
screen. In Figure 1.10, the expressions score1, score2, and score1 + score2 will be
evaluated first. That is, Python will execute the print() function with the value assigned
to these expressions and not with the expressions as part of the string.

Example (Cont’d): If the variables or expressions were not put inside some curly
brackets as a formatted string, the expressions score1, score2, and score1 +
score2 would be treated as ordinary strings and printed just as how they were
written. And if the “f” were forgotten at the beginning of the argument in the
print() function, Python would interpret the missing “f” as an instruction to print
the entire text within the quotation mark without evaluating the expressions in the
curly brackets first. Since the curly brackets are part of the string in the print()
function, they will be printed as well.

SU1-19
ANL252 Introduction to Python Programming

Figure 1.11 Output of incomplete string formatting syntax

We can also use the .format() method for string formatting. Note that the .format()
method only takes one expression in its argument.

print("My String {}".format(expression))

For the .format() method, we need to place the curly brackets at the position within
the string where we would like to print our expression.

Example (Cont’d): We can print the total score of the two students at the end of a
statement such as “The total score is 95.” by using the .format() method.

Figure 1.12 Simple usage of the .format() method

Figure 1.13 shows us how the printing string can be extended to obtain the same
output as in Figure 1.10. But the syntax is much longer here.

Figure 1.13 Multiple usage of .format() method

Nevertheless, the .format() method can also be useful if we have one variable to
be printed at the end of our statement.

SU1-20
ANL252 Introduction to Python Programming

Read

Read the following section of the textbook on printing formatted strings:

Exercise 5 More Variables and Printing

Read the following two sections of the textbook on printing formatted strings using
the .format() method:

Exercise 6 Strings and Text

Exercise 7 More Printing

3.3 Escape Sequences


Escape sequences are used to print special characters that are invisible such as ENTER, or
characters that can cause syntax error such as single (') or double quotation marks (").
Suppose we would like to include a quote within a string for the screen output.

Figure 1.14 Syntax error caused by quotation marks within a string

In Figure 1.14, the string in the print() function ends with the second quotation mark.
Everything after it will be interpreted as part of the code. Since the word “alarming” is
neither a Python command nor a variable, Python simply interprets it as an erroneous
syntax. One way to avoid this error is to use single quotation marks for either the citation
quote or the string definition.

SU1-21
ANL252 Introduction to Python Programming

Figure 1.15 Printing quotation marks within a string

Another way is to use the escape sequence \" within the string instead of switching
between single and double quotation marks.

Figure 1.16 Printing quotation marks using escape sequence

Escape sequences are also useful when line breaks should be inserted within a string. By
adding the escape sequence “\n” at the position within the string, the subsequent part of
the string will be printed in the next line of the output screen.

Example (Cont’d): Now, we would like to print the first and the second sentences in
Figure 1.10 in two separate lines. However, we would create a syntax error if we just
placed a line break in our Python script.

Figure 1.17 Erroneous line breaks within a string

Figure 1.17 illustrates that Python treats such a line break within a string as a syntax
error. The reason is that the string in the print() function must be closed by a

SU1-22
ANL252 Introduction to Python Programming

quotation mark in the same line. Instead of closing the first line directly and starting a
new print() function in the second line to solve this problem clumsily, we can add
an escape sequence \n into the string:

Figure 1.18 Line breaks created by escape sequence \n

The following list contains some useful escape sequences available in Python:

Table 1.3 List of some escape sequences available in Python

Escape Sequences Description Example

\newline Backslash and newline >>> print("line1 \


ignored Line2")
line1 line2

\\ Backslash (\) >>> print("\\")


\

\' Single quote (') >>> print("\'")


'

\" Double quote (") >>> print("\"")


"

\n ENTER or line break >>> print("line1 \n


line2")
line1

SU1-23
ANL252 Introduction to Python Programming

Escape Sequences Description Example

line2

\b Backspace (BS) >>> print("line1 \b


line2")
line1 line2

\t Horizontal Tab (TAB) >>> print("line1 \t


line2")
line1 line2

\v Vertical Tab (VT) >>> print("line1


\vline2")
line1
line2

(Source: https://www.quackit.com/python/reference/python_3_escape_sequences.cfm)

Read

Read the following two sections of the textbook on using escape sequence in printing
formatted strings as in Figure 1.17:

Exercise 9 Printing, Printing, Printing

Exercise 10 What Was That?

3.4 Input
While a programme script is being executed, it requires values to be assigned to the
variables to proceed in its instructions. So far, we have discussed the possibility to assign
values to the variables in the script. That means, the values are fixed when the programme

SU1-24
ANL252 Introduction to Python Programming

starts to run. However, in most of the cases, those values are unknown and can only
be assigned while the programme is running, mostly based on the input of the user. In
Python, we can use the input() function to ask the user to enter the value for a variable.

variable = input("My String")

The whole syntax will be put on the right-hand side of an equal sign so that Python can
assign the user input to the variable that is defined on the left-hand side of the same equal
sign.

Unlike the print() function, Python requires the user to type in something and then
press ENTER to complete the execution of the input() function. Same as the print()
function, we can instruct Python to print a string to the screen within the input()
function. Usually, this string should be a question and/or some instructions to inform the
user what they should input here. Furthermore, we are also allowed to mix the assigned
values of some variables with the instruction text to become a formatted string that will
be printed on the screen for the subsequent input.

Example (Cont’d): Instead of pre-assigning values to the variables, we will ask the
user to enter his/her name and his/her score. Subsequently, we will print out his/
her score by addressing his/her name and embed his/her score in a sentence such as
“Your score is …”.

Figure 1.19 Example of using input() function

SU1-25
ANL252 Introduction to Python Programming

In Python, the value assigned by the user within an input() function will be stored as
string. If the input should be an integer or a number with a floating point, we can convert
the input using:

variable = int(input("My String"))


variable = float(input("My String"))

The functions int() and float() are used to convert a string variable into an integer
or a float variable. (Conversely, the str() function may be used to convert an integer or
a float variable into a string variable.)

Example (Cont’d): Since the score of a student must be an integer within 0 and 100,
we can convert it to an integer by embedding the input() function within an int()
function. At the same time, we add a new question to ask the student for his/her
CGPA and convert it to a float variable.

Figure 1.20 Convert Input Value to Integer and Float

The syntax introduced above is to put the input() function within the int() and
float() functions and construct the instruction within a single line. Nevertheless, we

SU1-26
ANL252 Introduction to Python Programming

can also separate these commands into two lines without changing the behaviour of the
programme:

variable = input("My String")


variable = int(variable)

These lines are certainly applicable to the float() function as well. It is noteworthy
that if the user enters a value that is not a number, the int() or float() functions will
interpret it as a syntax error and stop executing the code immediately. It is therefore a good
programming habit to build in certain control mechanism for any user input command in
our code. We will discuss the construction of such a control mechanism in Chapter 5 of
this study unit.

Read

Read the following two sections of the textbook on getting the user input:

Exercise 11 Asking Questions

Exercise 12 Prompting People

SU1-27
ANL252 Introduction to Python Programming

Chapter 4: If-elif-else-Conditions

Lesson Recording

If-elif-else-Conditions in Python

4.1 Boolean Expressions


To automate a routine by a computer programme, we usually need to let the programme
“decide” what to execute in the next step based on some conditions. For instance, the user
can choose to stay or quit the programme after certain operations have been completed.

Before introducing the if-elif-else-conditions that Python uses to decide how


the program should behave after a certain stage of the code, we need to get ourselves
familiarised with the Boolean expressions first. As introduced in Chapter 2.3, Boolean
variables have only two possible values: True and False. So, the basic concept of the
conditional control flow is to first evaluate whether a Boolean expression is True, and
then carry out either set of instructions, depending on the evaluation.

A Boolean expression can be the result of a single relational operation, or a combination


of multiple relational operations linked by logical operators. Here are some relational
operation examples:

Table 1.4 Examples of relational operation

Relational Operation Result

1 == 1 True

3 > 2 True

SU1-28
ANL252 Introduction to Python Programming

Relational Operation Result

0 <= -5 False

a + b < 10 False if a + b >= 10

In the above examples, the first two operations are obviously True since they correspond
to the mathematical relationship between the left-hand and the right-hand sides of the
equations. Note that if we want to check whether two expressions are identical, we will
have to use the double equal sign (==) instead of the ordinary equal sign (=) since the
single equal sign is used to assign a value to a variable. So, if we wrote 1 = 1 instead of
1 == 1, a syntax error would be returned since Python would interpret our intention to
assign a value to a number, which we know is not allowed from Chapter 2.2.

Below is a list of relational operators that we can use in Python.

Table 1.5 List of relational operators

Operator Description Example

== True if the values of two operands are (10 == 20) is False.


equal.

!= True if values of two operands are not (10 != 20) is True.


equal.

<> True if values of two operands are not (10 <> 20) is True.
equal (Similar to the != operator).

> True if the value of left operand is greater (10 > 20) is False.
than the value of right operand.

SU1-29
ANL252 Introduction to Python Programming

Operator Description Example

< True if the value of left operand is less (10 < 20) is True.
than the value of right operand.

>= True if the value of left operand is (10 >= 20) is False.
greater than or equal to the value of right
operand.

<= True if the value of left operand is (10 <= 20) is True.
less than or equal to the value of right
operand.

(Source: Tutorials Point, 2023, https://www.tutorialspoint.com/python/python_basic_operators.htm)

A Boolean expression can also be a combination of multiple relational operations that are
connected by the logical operators. Below is a list of logical operators in Python.

Table 1.6 List of logical operators

Operator Description Example

and If both the operands are True, then the (10 > 0 and 20 >
condition becomes True. 0) is True.

or If any of the two operands are non-zero, (10 > 0 or 20 > 0)


then the condition becomes True. is True.

not Used to reverse the logical state of its not(10 > 0 or 20 >
operand. 0) is False.

(Source: Tutorials Point, 2023, https://www.tutorialspoint.com/python/python_basic_operators.htm)

SU1-30
ANL252 Introduction to Python Programming

The use of a single logical operation is usually quite straightforward as the operators are
designed in a way that it simply matches our spoken language. However, it could become
quite confusing if we combine these operators in a Boolean expression. For instance, the
following Boolean expressions are equivalent:

(a or b) and c <==> (a and c) or (b and c)

(a and b) or c <==> (a or c) and (b or c)

not(a or b) <==> (not a) and (not b)

not(a and b) <==> (not a) or (not b)

These are just a few examples and can be extended endlessly. It is of utmost importance to
be familiar with how to create Boolean expressions using relational and logical operators.
Any failure in combining these operators could lead to unexpected behaviour of our
programme. The only and most effective way here is to practise them with Python since
we can check on the result directly.

Read

Read the following two sections of the textbook on Boolean expressions:

Exercise 27 Memorising Logic

Exercise 28 Boolean Practice

SU1-31
ANL252 Introduction to Python Programming

4.2 Conditional Statements


The result of a Boolean expression can serve as the condition that changes the behaviour
of a programme dynamically by embedding it in an if-conditional statement:

if condition:
instructions

In the if-condition, Python will execute the syntaxes in the instructions if the condition is
True. However, if the condition is False, Python will simply skip these lines and proceed
with the subsequent code lines. Note that it is mandatory to put the colon directly behind
the condition, and the instructions must be indented so that Python can interpret them as
part of the if-block.

SU1-32
ANL252 Introduction to Python Programming

Example (Cont’d): If the score of a student is below 40, we will show a message on
the screen to tell him/her that he/she failed in the exam.

Figure 1.21 if-statement example with True condition

Figure 1.21 illustrates what Python does if the condition is True. On the other hand,
if a student scores more than 40, nothing will be printed based on the Python script.

Figure 1.22 if-statement example with False condition

Figure 1.22 shows how Python skips all the instructions in the if-block since the
condition is False.

If we intend to let Python execute another set of instructions if the condition is False,
and not just skip the if-block, we can add an else-statement to the if-block:

SU1-33
ANL252 Introduction to Python Programming

if condition:
instructions 1
else:
instructions 2

Same as the if-condition, we must add a colon to the else-statement and the instructions
following it must be indented as well.

Example (Cont’d): If the score of a student is below 40, we will show a message on the
screen to tell him/her that he/she had failed in the exam. Otherwise, we will show a
message to tell him/her that he/she passed.

Figure 1.23 if-else-statement example

Figure 1.23 shows that if the condition is False, Python will execute those
instructions following the else-statement.

If the construction of the condition allows more than two outcomes, we may need a third
or fourth if-blocks, etc. In this case, we can use the if-elif-else-block:

if condition 1:
instructions 1
elif condition 2:
instructions 2
else:

SU1-34
ANL252 Introduction to Python Programming

instructions 3

Note that an if-elif-else-block does not necessarily need an else-statement. But we


should ensure that the conditions being checked by the if-statement and the elif-
statements must cover all possible outcomes, unless we are certain that only those
possibilities are being uncovered which do not need any instructions to follow up.

In the example in Figure 1.23, the programme is only constructed to separate students
into two categories: pass and fail. It will then print the statement to the user accordingly.
Suppose we also give grades to evaluate the performance of the students, we can
categorise the scores using if-conditions.

If we construct an if-block to categorise a numeric variable, we should ensure that the


Boolean expressions do not overlap. For instance, if the A grade is assigned when the score
is between 80 and 100, then 80 should not be included in the condition for obtaining the B
grade. The logical operator and should be used to indicate the interval for each category
since both conditions – namely, that the value of the numeric variable must be larger than
the lower bound as well as smaller than the upper bound of the interval – must be fulfilled
simultaneously.

In the example shown in Figure 1.24, the else-statement has also been omitted since all
possible outcomes of the variable score have been covered by the if-block. Nevertheless,
we can also use the else-statement instead of the whole elif-condition for grade A if we
are confident to do so. Just be cautious that in this case, if certain possibilities were not
covered, no instructions would be carried out from the if-block, and the behaviour of the
subsequent part of the programme may be affected.

Example (Cont’d): We will print the grade to the student according to his/her exam
score. If a student scores between 80 and 100, his/her grade will be A; and if his/her
score is between 80 and 60, s/he will get a B; a score between 50 and 60 is equivalent

SU1-35
ANL252 Introduction to Python Programming

to grade C; grade D will be given if a student scores between 40 and 50 and any score
below 40 belongs to grade F.

Figure 1.24 Example of if-elif-else-statement

Lastly, the print() function is not indented here. As a result, Python interprets it as
the part of the code that should be executed after the entire if-block and not as part
of the instructions following the last condition (elif score1 >= 80 and score1
<= 100:).

Read

Read the following three sections of the textbook on conditional statements for control
flow:

Exercise 29 What If

Exercise 30 Else and If

SU1-36
ANL252 Introduction to Python Programming

Chapter 5: Loops

Lesson Recording

Loops in Python

5.1 While-Loops
In the student score example, we constructed a programme in which the name and the
mark of one student can be entered. In the early stage of our example, we had scores of
two students. If we had to enter their names and then assign a grade to each of them, we
would have to repeat the codes in Chapters 2, 3, and 4 twice.

SU1-37
ANL252 Introduction to Python Programming

Figure 1.25 Repeating the same code for two individuals to enter

Usually, we have more than two students in one class. In order not to expand our codes
endlessly and make it clumsy and unreadable, we can construct a loop in our programme
to repeat the instructions that will be applied many times. The first type of loop that we
will introduce here is the while-loop.

while conditions:
instructions

The condition is a Boolean expression which controls whether the loop will continue to
run for a new iteration or not. If the condition is True, Python will go on to execute the

SU1-38
ANL252 Introduction to Python Programming

instructions that are written with indentation and after the colon behind the conditions.
The number of loops can be infinite and will be repeated as long as the condition is True.
As a result, it is extremely important to ensure that a while-loop will be terminated at
some stage by fulfilling the exit condition.

Example (Cont’d): We will repeat the previous task for 3 students.

Figure 1.26 Entering names and scores repeatedly by while-loops

Before the loop starts, we initiate a counter variable called i with the value 0. This
counter will increase by 1 in each iteration. The while-loop is set to continue as long
as i has not reached 3 yet. At the beginning of the last iteration, i should be 2 and
will become 3 when Python executes the instructions within the while-loop. At the
end of this iteration, Python will go back to the first line of the while-loop, and since

SU1-39
ANL252 Introduction to Python Programming

the condition i < 3 is no longer True, the programme will exit the loop as designed.
Therefore, the most crucial command in the whole loop is i = i + 1. Without this
line, i can never reach 3, which is the exit condition, after three iterations. Instead, it
will always stay at 0, the initial value defined before the loop has started.

Read

Read the following section of the textbook on while loops:

Exercise 33 While Loops

5.2 For-Loops
Another type of loop is the for-loop, which are constructed differently. While we need a
True-condition for a while-loop to continue to iterate, we need a list for the running of
the for-loops. We will discuss the construction and properties of a list in detail in Study
Unit 2. Here, we will first learn to generate a simple list of consecutive integers by the
range() function.

variable = range(start, end)

The start value can be any integer as long as it is smaller than the end value. Note that
the end value is not included in the list. In other words, the list will end at end – 1. The
generated list of numbers will be assigned to the variable defined on the left-hand side of
the syntax. For the for-loops, we do not need to store the generated list in a variable first.
Instead, we can use the range() function in the for-statement directly.

SU1-40
ANL252 Introduction to Python Programming

for counter in range(start, end):


instructions

The for-command must end with a colon, followed by the instructions that should be
carried out in each iteration. These instructions must be written with indentation. The
counter variable will do the counting for the iterations, starting from the start value in
the first iteration and running through the entire integer list. Once the counter reaches end
– 1, Python will execute the instructions for the last time and then exit the loop.

SU1-41
ANL252 Introduction to Python Programming

Example (Cont’d): We will carry out the previous task using the for-loop.

Figure 1.27 Entering names and scores repeatedly by for-Loops

The range() function generates a list containing the values 0, 1, and 2, since 3 will not
be included in the list by definition. The code in Figure 1.27 also shows that the counter
variable i is already integrated in the for-statement, and an explicit instruction to
increase it by one in each iteration is not required at all.

SU1-42
ANL252 Introduction to Python Programming

Read

Read the following section of the textbook on while-loops:

Exercise 32 Loops and Lists

5.3 Breaking from Loops


Though we usually have a clear exit condition or a finite list to guarantee a loop to end
at a certain point of the programme, we may still be confronted with situations where we
would like to interrupt the loop and continue with the subsequent programme.

In our previous example, suppose we would like to quit the entire programme after
entering the first student’s data due to some reasons, although the programme allows us to
enter the data for up to three students. It would be reasonable to have a syntax that allows
us to break from the loop in a “clean” manner instead of shutting down the computer
entirely. The command for such situation is break.

for counter in range(start, end):



if conditions:
break

Equivalently, the break command also works within a while-loop.

while conditions:

if conditions:
break

SU1-43
ANL252 Introduction to Python Programming

Usually, break is used together with an if-condition, since we would only want to break
from a loop under some circumstances and not in general.

Example (Cont’d): Suppose we let the user quit the programme by entering -1 for
his/her score now. All we need to do is to add an if-condition after the input()
statement where s/he can enter the score.

Figure 1.28 Breaking from loops using break

The break command is built after the user is asked to enter his/her exam score but
before the grade is assigned. Basically, the programme will still work normally if the
break command is put after the whole if-elif-block. By breaking from the loop

SU1-44
ANL252 Introduction to Python Programming

before a chunk of codes that will have no further influence on the execution of the
programme can shorten the running time and make it faster.

Note that we can also apply the break command on our example in Figure 1.26, where
we use the while-loops for the same task instead.

5.4 Error Handling in Input


Another common use of loops is to control the user input following an input()
statement. For instance, it could happen that the user accidentally types in a letter instead
of a number for the exam score. In this case, we would like the user to redo the input until
it is a number. As a result, we can put the input() statement within a while-loop and
only break from it when the input of the user is valid.

Before we start to construct while-loops for user inputs, we must learn how Python
handles errors. In Chapter 3, we mention that if we apply the int() function to convert
a string variable that contains a non-numeric value to an integer, the programme will
be interrupted due to value error, and the user will have to restart the programme in
PowerShell. This can be very annoying if the user only makes a small mistake in one of the
input fields but needs to re-type all the inputs because of the programme interruption.

To avoid Python from stopping to execute the program by force, we can use the try:
syntax.

try: #The try-block lets us test a block of code for errors.


instructions
except exception: #The except-block lets us handle the
error.
instructions
else: #The else-block carries out instructions if no error
occurs (optional).
instructions
finally: #The finally-block executes instructions
regardless of the result of the try- and except blocks
(optional).
instructions

SU1-45
ANL252 Introduction to Python Programming

The try-except-block is an important instrument in Python to handle errors. Basically,


we can put any syntax in the try-block if we think error can occur in those syntaxes. The
except-block tells Python to continue with the programme except for the occurrence of
an error, or the occurrence of a specific error that we declare under exception. If an
error indeed occurs, Python will carry out the instructions written after the colon behind
the except-statement, instead of stopping the programme entirely. The else-block and
finally-block are optional and can be used if we want certain instructions to be carried
out if no error occurs or for finalising a try-block.

In Python, there are many built-in exceptions. Table 1.7 provides some common ones.

Table 1.7 List of some common exceptions

Exception Description

NameError Raised when a local or global name is not found. This applies only
to unqualified names. The associated value is an error message that
includes the name that could not be found.

TypeError Raised when an operation or function is applied to an object of


inappropriate type. The associated value is a string that gives details
about the type mismatch.

ValueError Raised when an operation or function receives an argument that has


the right type but an inappropriate value, and the situation is not
described by a more precise exception.

(Source: Python Software Foundation, 2023, https://docs.python.org/3/library/exceptions.html)

If the user is asked to enter a numeric value such as an exam score, but enters a string
instead, we can use ValueError as our exception.

SU1-46
ANL252 Introduction to Python Programming

Example (Cont’d): We implement a try-block to test the validity of score input by


the user. If it is a string, the programme will warn the user of an erroneous input.

Figure 1.29 Using exception for error handling

In the first run, we type in a string “test” where we should enter a numeric value for
the score. Python detects an error since the syntax tries to use the int() function to
convert character strings to integers which would usually cause the programme to
stop running abruptly. From the output, we can see that Python prints the warning
message we put in the print() function to the screen and ends the programme
“properly” as if no error has occurred. In the second run, we enter a numeric value as
required. Python detects no error and simply skips the except-block.

After defining the try-block to instruct Python on how to handle errors, we can construct
a while-loop around it. As condition for the loop to continue iterating is when a Boolean
variable that indicates a valid input does not change from False to True. Hence, if this
Boolean variable is True, the programme will break from the loop.

SU1-47
ANL252 Introduction to Python Programming

Example (Cont’d): Now we put the entire try-block within a while-loop. The
while-loop will stop iterating once the input for score is numeric. If it is non-
numeric, the user will see a warning message and he will also be asked to re-enter his
exam score. The whole procedure will last until the input is numeric.

Figure 1.30 Using while-loop for user input if error occurs

We initiate a Boolean variable called valid_input before the while-loop starts. The
initial value of this variable is False, which we also use as the condition for the while-
loop to continue to iterate. In the try-block, we add an else-statement for the case
that the input is correct, and the follow-up instruction here is to change the value
of valid_input to True so that the while-loop stops. Note that we can use the
break command here as well. In the output screen, we can see that if the input is non-
numeric, the programme will print a warning message in a new line and then ask the
user to re-enter the score until the input is valid.

SU1-48
ANL252 Introduction to Python Programming

It is possible or even desirable to set a maximum number of input trials in order not to
have the programme running endlessly. A counter variable can be added to the loop, and
the programme will exit the loop after the maximum allowed number of iterations has
been reached.

SU1-49
ANL252 Introduction to Python Programming

Example (Cont’d): The complete programme of this study unit containing all the
techniques we have learned is given in the following figure.

Figure 1.31 The complete example programme of Study Unit 1

SU1-50
ANL252 Introduction to Python Programming

Summary

We have learned the basics of writing and executing Python programmes. We have also
been introduced to the various variable types and some operators that can be applied to
them. We have then discussed how to generate screen output and how to let the user enter
answers and assign them as values to the variables. Furthermore, we have covered the
construction of conditional statements to dynamically change the programme behaviour,
if necessary. Finally, we have also come to know the use of loops to repeat routine tasks
for an endless number of times within a programme.

SU1-51
ANL252 Introduction to Python Programming

Formative Assessment

1. From which directory (folder) can you run your Python script that is saved with
the .py extension?
a. From the folder where Python is installed
b. From the folder where the operating system is installed
c. From the folder where I have saved my Python script
d. From any arbitrary folder

2. Which line of code is not a valid Python syntax?


a. print("This is {a} wrong syntax!")
b. 0 = 0
c. 0 == 1
d. y = a / int(b)

3. Which of the following is a valid Python variable name?


a. iamavariablelol
b. :)iamavariable
c. 007imavariable
d. i-am-a-variable-lol

4. What is the value of int(-0.5) in Python?


a. -1
b. 0.5
c. 0
d. 1

5. What does the .format() method do?


a. It formats the font of the string in a print() function.

SU1-52
ANL252 Introduction to Python Programming

b. It replaces the expression in a curly bracket within a string by its value.


c. It replaces the curly bracket by a round bracket in a string.
d. It replaces the curly bracket within a string by the value of the expression in
the .format() method.

6. Which of the following Boolean expression is false?


a. -5 <= 0
b. 10 ** 2 == 100
c. int(-0.5) < int(-0.2)
d. 18 / 3 > 1 + 4

7. Which Boolean expression is equivalent to not(a and b)?


a. not a and not b
b. not a or not b
c. not a and b
d. a or b

8. Which statement is correct regarding the if-elif-else statement block?


a. The if-elif-else statement must end with an end-statement.
b. There must be an else-statement in every if-elif-else statement block.
c. If-elif-else statement block can also start with an elif- or else-
statement.
d. Behind each if-, elif-, and else-statement must be a colon before the
instruction block starts.

9. Which values will be printed on the screen given the following Python code?

counter = 0
while counter <= 3:
print(counter)
counter = counter + 1

SU1-53
ANL252 Introduction to Python Programming

a. 0, 1, 2, 3
b. 1, 2, 3, 4
c. 0, 1, 2, 3, 4
d. 0, 2, 3

10. Which of the following is correct about for-loops in Python?


a. We need to initiate a counter before the loop starts.
b. We must write a line to increase the counter by one within the loop.
c. We need an exit condition for the loop.
d. We can use the range() function to generate a list of numbers as the index of
the for-loop iterations.

SU1-54
ANL252 Introduction to Python Programming

Solutions or Suggested Answers

Formative Assessment
1. From which directory (folder) can you run your Python script that is saved with
the .py extension?

a. From the folder where Python is installed

Incorrect. Python would not be able to find your Python script there unless
you have saved it in the Python programme folder, which is rather unlikely.

b. From the folder where the operating system is installed

Incorrect. Python would not be able to find your Python script there unless
you have saved it in the system folder, which is also rather unlikely.

c. From the folder where I have saved my Python script

Correct. You must change to the folder where you have saved your Python
script so that Python can find your file.

d. From any arbitrary folder

Incorrect. Python would not be able to find your Python script unless you
are accidentally in the folder where you had saved your Python script.

2. Which line of code is not a valid Python syntax?

a. print("This is {a} wrong syntax!")

Incorrect. The syntax is correct since the curly bracket and its content will be
treated as part of the printing string.

b. 0 = 0

SU1-55
ANL252 Introduction to Python Programming

Correct. The left-hand side of a value-assignment syntax must be a variable


name. A number there is invalid since Python interprets it as assigning a
value to a number.

c. 0 == 1

Incorrect. This syntax is valid since it is a Boolean expression.

d. y = a / int(b)

Incorrect. This is a valid syntax since we can carry out a division in which the
denominator is being converted to an integer.

3. Which of the following is a valid Python variable name?

a. iamavariablelol

Correct. This variable name is valid since it starts with a character and
contains no invalid character.

b. :)iamavariable

Incorrect. This variable name is invalid since it starts with a colon and
contains invalid characters such as closing round bracket.

c. 007imavariable

Incorrect. This variable name is invalid since it starts with a number.

d. i-am-a-variable-lol

Incorrect. This variable name contains invalid characters such as hyphen.

4. What is the value of int(-0.5) in Python?

a. -1

Incorrect. The int() function does not round down a negative number.

SU1-56
ANL252 Introduction to Python Programming

b. 0.5

Incorrect. The int() function is not used to convert a value to absolute


number.

c. 0

Correct. The int() function will take away all the decimal places.

d. 1

Incorrect. The int() function does not round down a negative number and
then convert it to absolute number.

5. What does the .format() method do?

a. It formats the font of the string in a print() function.

Incorrect. The .format() method does not format the font of a string.

b. It replaces the expression in a curly bracket within a string by its value.

Incorrect. This would be done by the f-instruction in the print() function


for string formatting.

c. It replaces the curly bracket by a round bracket in a string.

Incorrect. The .format() method does not replace the curly bracket by a
round bracket in a string unless the code explicitly requires Python to do so.

d. It replaces the curly bracket within a string by the value of the expression in
the .format() method.

Correct. The value of the expression in the .format() method will be


used to replace the curly bracket within the printing string.

6. Which of the following Boolean expression is false?

a. -5 <= 0

SU1-57
ANL252 Introduction to Python Programming

Incorrect. The expression is true since it requires -5 to be smaller or equal to 0.

b. 10 ** 2 == 100

Incorrect. Since 10 ** 2 = 100, the left-hand side and the right-hand side
are equal.

c. int(-0.5) < int(-0.2)

Correct. Since int(-0.5) is 0 and int(-0.2) is also 0, 0 < 0 is a


false relation.

d. 18 / 3 > 1 + 4

Incorrect. Since 18 / 3 = 6 and 1 + 4 = 5, 6 > 5 is a true relation.

7. Which Boolean expression is equivalent to not(a and b)?

a. not a and not b

Incorrect. not a and not b is equivalent to not(a or b)

b. not a or not b

Correct. not a or not b is equivalent to not(a and b)

c. not a and b

Incorrect. not a and b is equivalent to not(a or not b)

d. a or b

Incorrect. a or b is equivalent to not(not a and not b)

8. Which statement is correct regarding the if-elif-else statement block?

a. The if-elif-else statement must end with an end-statement.

SU1-58
ANL252 Introduction to Python Programming

Incorrect. No end-statement is needed for an if-elif-else statement


block.

b. There must be an else-statement in every if-elif-else statement block.

Incorrect. An if-elif-else statement block does not necessarily require


an else-statement.

c. If-elif-else statement block can also start with an elif- or else-


statement.

Incorrect. An if-elif-else statement block must start with an if-


statement.

d. Behind each if-, elif-, and else-statement must be a colon before the
instruction block starts.

Correct. Behind every if-, elif-, and else-statement must be a colon.

9. Which values will be printed on the screen given the following Python code?

counter = 0
while counter <= 3:
print(counter)
counter = counter + 1

a. 0, 1, 2, 3

Correct. Since the print() function comes before the increment of the
counter, the counter will be printed starting from its initial value 0 and
goes until 3 with an increase of 1 in each iteration.

b. 1, 2, 3, 4

SU1-59
ANL252 Introduction to Python Programming

Incorrect. Since the print() function comes before the increment of the
counter, the counter will be printed starting from its initial value 0 and not
1.

c. 0, 1, 2, 3, 4

Incorrect. Since the loop will only continue to run if the value in counter is
smaller or equal to 3, 4 cannot be printed based on this code.

d. 0, 2, 3

Incorrect. Since the increment of the counter can only be 1 for each iteration
of the loop, a jump from 0 to 2 is impossible based on this code.

10. Which of the following is correct about for-loops in Python?

a. We need to initiate a counter before the loop starts.

Incorrect. We only need to initiate a counter before a while-loop starts.

b. We must write a line to increase the counter by one within the loop.

Incorrect. We only need to write a line to increase the counter by one within
a while-loop.

c. We need an exit condition for the loop.

Incorrect. A for-loop does not need any exit condition since it runs through
a finite list. Once the list comes to an end, Python will exit the loop.

d. We can use the range() function to generate a list of numbers as the index
of the for-loop iterations.

Correct. Every for-loop needs a list for it to run through. One type of list
is a list of integers that can serve as the index for the for-loop iterations,
and we can generate such a list by the range() function.

SU1-60
ANL252 Introduction to Python Programming

Python Software Foundation. (2023a). Built-in exceptions. https://docs.python.org/3/

library/exceptions.html

Python Software Foundation. (2023b). Built-in types. https://docs.python.org/3/

library/stdtypes.html#numeric-types-int-float-complex

Python Software Foundation. (2023c). Download python. https://www.python.org/

downloads/

Python Tutorial. (2023). Python 3 escape sequences. quackit.com. https://

www.quackit.com/python/reference/python_3_escape_sequences.cfm

Shaw, Z. A. (2017). Learn Python 3 the hard way. Addison-Wesley Professional.

Tutorials Point. (2023). Python – Basic operators. https://www.tutorialspoint.com/

python/python_basic_operators.htm

W3 schools. (2023). Python variables. Refsnes Data. https://www.w3schools.com/

python/python_variables.asp

SU1-61
ANL252 Introduction to Python Programming

SU1-62
2
Study
Unit

Data Types and Functions


ANL252 Data Types and Functions

Learning Outcomes

By the end of this unit, you should be able to:

1. Differentiate the various types of compound data structures in Python


2. Discuss how Python manages packages, modules, functions, etc.

SU2-2
ANL252 Data Types and Functions

Overview

Python provides numerous compound data types to group a collection of values together.
Compound data structures organise and store data in a way that they can be accessed
and worked with efficiently. These structures also define the relationship between the
data and the operations that can be performed on them. We will learn in this study unit
how to create, when to use, and what operations to perform on the most common Python
compound data structures: tuples, lists, and dictionaries. Furthermore, we will also learn
about what functions and methods are and how they can be integrated in our programme.
Lastly, we will also deal with Python packages and modules which are, together with
functions and methods, very useful for re-using and extending our tools for performing
specific tasks in Python programming.

SU2-3
ANL252 Data Types and Functions

Chapter 1: Tuples, Lists, Dictionaries

1.1 Tuples

Lesson Recording

Python Tuples

1.1.1 Defining Tuples


In Study Unit 1, we have learned that we could store values in a variable for later
operations in our code. The type of variable depends on whether we want to store numeric
values or character strings in it. More often, we work with multiple data points of the same
nature such as the names of all students in a class. It would be inconvenient to create a
new Python variable for each data point like, e.g., student1, student2, etc. What we
can do instead is to store all data points in some kind of compound data structure, for
example, a tuple to store all students’ names.

Python has several types of compound data for our use. One of these data types is the
tuple. A tuple is a collection of values written as comma-separated items between a
pair of round brackets, similar to vectors in mathematics. Unlike vectors, tuples are not
specifically designed for mathematical operations. The items, or elements in a tuple can
be numeric, string, or a mixture of both.

tuple_name = (element1, element2, …)

Note that tuples are immutable – that is, we are not allowed to modify the values stored in
them once they are defined. Nevertheless, they are more efficient in terms of performance

SU2-4
ANL252 Data Types and Functions

and memory use. Tuples are useful in situations where we want to share the data with
other users without granting them permission to edit the content.

Example (Students’ score, cont’d): We assign the names of two students to a tuple
called names, the scores of the same students to a tuple scores, and a mixture of the
names and scores to all_data.

Figure 2.1 Creating tuples in Python

We can assign elements to a tuple with or without writing them in parentheses in our
code as long as they are separated by commas. Nevertheless, when printing them on
the screen, they will be wrapped by a pair of round brackets. Furthermore, the tuple
all_data, which consists of two strings and two integers, shows us that tuples can
be indeed a mixture of strings, integers, and floats.

1.1.2 Sub-Setting Tuples and Indexing


To access one specific element in a tuple, we need to use the index operator []. In Python,
the index of any compound data type (i.e., tuples, lists, and dictionaries) begins with 0
and ends with the total number of elements minus 1. In other words, the index for the first
element in a tuple is 0, the index for the second element is 1, and the third one is 2, and so
on. If we want to subset more than a single element from a tuple, we can put the start and
end indices in the index operator and connect them with a colon.

SU2-5
ANL252 Data Types and Functions

tuple_subset = tuple_name[start:end]

It is important to recall that the index end will not be included in the sub-setting procedure.

Example (Cont’d): We would now like to extract the first element from the tuple
names and store it in a variable called name1, the second element in name2, as well
as the entire data of the first student including his/her name and score from the tuple
all_data in a new tuple called student1_data.

Figure 2.2 Sub-setting tuples

If we subset a single element from a tuple, it will become a variable of the type that
corresponds to the type of the extracted data. In this case, the variables name1 and
name2 are strings, since “Peter” and “Mary” are stored as string in the original tuple
names. And if we subset multiple elements from a tuple, the result will be a tuple
as well, and the data type of each element will also be taken over from the original
tuple.

In the above example, the tuple student1_data should be a subset of the tuple
all_data with the element indexed 0:2, where index 2 is excluded. In other words,

SU2-6
ANL252 Data Types and Functions

it only contains the elements 0 and 1, which are the first and second elements: Peter
and 72.

We can also use negative indices to access the elements of a tuple starting from the last
element. That is, the index -1 indicates the last element, -2 the second last, etc.

Example (Cont’d): We would now like to extract the last element from the tuple names
and store it in a variable called name1, the second last element in name2, as well as
the entire data of the second student including his name and score from the tuple
all_data in a new tuple called student2_data.

Figure 2.3 Sub-setting of tuples with negative indices

Everything seems alright in the above output except that the last element of the tuple
student2_data, the value 86, is missing. The reason is that the last index is never
included. In our example, the sub-setting indices we wish to have are -2 and -1. But
since Python does not include -1, we only receive student2_data[-2], which is
“Mary” in this case. To overcome this dilemma, we need to leave the end index blank
after the colon. Python will interpret it as “take all indices until the end”.

SU2-7
ANL252 Data Types and Functions

Figure 2.4 Sub-setting tuples with “open end” indexing

This is the correct output that we wanted to obtain originally. Note that the “open
end” indexing also works for positive indices.

1.1.3 Concatenating Tuples


Though we can access the tuple elements by using the index operator, we are neither
allowed to change the values of it nor to add a new value to an existing tuple.

Example (Cont’d): Suppose we would like to change the first element in the tuple
names from “Peter” to “John”. We will receive an error message as result.

Figure 2.5 Erroneous modification of tuples

SU2-8
ANL252 Data Types and Functions

We will now try to add the name “John” to the tuple names. Since names has 2
elements, the index of the last element must be 1. Hence, we add the new element to
names by referring to a new index, namely 2.

Figure 2.6 Erroneous adding of elements to tuples

Unsurprisingly, we also receive an error message here.

Nevertheless, we are allowed to concatenate two tuples into a single tuple by connecting
them with a “+” sign.

Example (Cont’d): Suppose we would like to concatenate the two tuples (names and
Scores) and name the new one newtuple.

Figure 2.7 Concatenation of two tuples

The concatenation of tuples works in the same way as it works for strings. As a result,
we can re-attempt to add a new name "John" to the tuple names. The only thing we
need to change is to put "John" in a tuple first.

SU2-9
ANL252 Data Types and Functions

Figure 2.8 Adding elements to tuples by concatenation

To put the new name “John” into a tuple is indeed tricky since Python would not
recognise syntaxes such as ("John") or "John" as a tuple. The reason here is that
all elements in a tuple must be separated by commas. As a result, we put a comma
behind "John" and leave the next element blank to tell Python that this is a tuple of
length 1. We can see that after concatenating newname with names, Python will ignore
the blank element in newname and append "John" as the only element in newname
to "Peter" and "Mary", the original elements in names.

1.1.4 Length of Tuples


When dealing with tuples, lists, or dictionaries, the function len() can be useful since it
will return the length – i.e., the number of elements of such an object.

tuple_length = len(tuple_name)

The length is also often used as an index to subset a tuple or control the indexing in our
code so that we cannot refer to indices that go beyond the largest index of a tuple.

SU2-10
ANL252 Data Types and Functions

Example (Cont’d): We use the len() function to determine the length of the tuple
all_data. We will then use it as an index to subset the last element of a tuple.

Figure 2.9 Length of tuples and index out of range

We receive an error message here since all_data[data_length] is equal to


all_data[4]. Recall that the last index of a tuple is the length of the tuple minus 1.
As a result, we can reach our original goal by subtracting 1 from the data_length,
which is equal to 4, for indexing.

Figure 2.10 Sub-setting the last element of tuples

SU2-11
ANL252 Data Types and Functions

By using the index data_length - 1, we are now accessing the element


all_data[3], which is the last element of the tuple all_data and 86 in this case.

1.1.5 For-Loops and Tuples


One advantage of tuples, or other compound data types, is that we can access and extract
their elements by using the for-loop iteratively.

In Study Unit 1, we learned how to use the range() function to generate a list for the
for-loop to run over it. Generally, if there are tuples, lists, or dictionaries already created
and existing while the programme is running, we can use the for-loop directly by putting
the name of the tuple, list, or dictionary in the for-statement.

for counter in tuple_name:


instructions

In the above syntax, tuple_name can certainly be replaced by any list_name or


dictionary_name.

SU2-12
ANL252 Data Types and Functions

Example (Cont’d): Now we would like to print out each element of the tuple
all_data onto the screen. The name of the counter here is records.

Figure 2.11 Printing tuple elements by for-loops

1.2 Lists

Lesson Recording

Python Lists

1.2.1 Creating Lists


Another type of compound data type in Python is the list. A list is just like a tuple but
with two main differences:

i. The data are comma-separated items wrapped by a pair of square brackets.


ii. The content of a list is modifiable.

We can construct a Python list in a similar fashion like a tuple.

list_name = [element1, element2, …]

SU2-13
ANL252 Data Types and Functions

As mentioned before, we must wrap the data in a list by square brackets. However, unlike
a tuple which we may omit the round brackets when defining it in our code as long as
the data are separated by commas, we must define a list with the square brackets in the
program. If we omit the brackets, Python will interpret the data as a tuple.

Same as tuples, lists may contain any type of values: floats, integer, Booleans, strings, or
more advanced Python types like lists. The last one is indeed a very interesting property
of the Python compound data type since we can namely put a list within a list, or a tuple
in a dictionary, and so on.

Example (Cont’d): We define two lists, names and scores, to store the data of the
students in the class.

Figure 2.12 Creating lists in Python

1.2.2 Sub-Setting Lists


We can also access elements of a list by the index operator. All the indexing and sub-setting
techniques introduced in Chapter 1.1.2 for tuples are applicable to lists.

Example (Cont’d): Here we will subset the lists that we have defined above into four
new lists and variables respectively.

SU2-14
ANL252 Data Types and Functions

Figure 2.13 Sub-setting lists

In the first line (line 1 and 2 are not counted here), name1 = names[0], we simply
extract the first element of names and store it in name1. Since we have only extracted
one element, it will not be stored as a new list. Instead, it will become a variable, which
we can see from the screen output that the value "Peter" is not wrapped by a pair of
square brackets. And the type of new variable will be the same as the type of extracted
element, which is a string variable in this case.

In the second line, name2 = names[1:], we extract multiple elements from the
original list and save them as a new list called name2. Recall that if we use open-end
indexing – that is, we leave the value behind the colon blank – Python will take all
elements from the original list starting from the starting value until the end. Here,
it indicates that Python should extract the elements with the indices 1 and 2 from
names.

In the third line, score1 = scores[0:2], the first and second elements from
scores will be extracted and saved into a new list score1. We should always be
aware that the last index is never included in the indexing. As a result, only the
elements 72 and 86, and not 35 are extracted and saved into score1.

SU2-15
ANL252 Data Types and Functions

In the fourth line, score2 = scores[-2], we use negative indexing to extract


element starting from the last element of scores. The value -2 indicates that we
would like to extract the second last element, which is 86 in this case. The resulting
object here is indeed a numeric variable since we only extract one numeric value from
the original list.

In the fifth line, score3 = scores[-2:-1], we extract the element with the indices
-2:-1 from scores. Since the last index is not included in the sub-setting, the only
relevant index here is -2. As a result, we should obtain the same result as the fourth
line. However, there is one significant difference between this line and the fourth line:
the new object here is still a list instead of a variable. In other words, if we want to
create a new list with a single element from another list, we will have to use multiple
indexing.

1.2.3 Modifying Lists


Different from tuples, we can change the content of a list or add elements to it. To edit
specific items of a list, we simply assign new values to those items after sub-setting them.

list_name[index] = new value

We can also use multiple indexing to edit multiple items.

list_name[start index:end index] = [list with new values]

If we only intend to edit the items in the list, then the length of the indices on the left-hand
side (which is the number of items to be modified) must be the same as the length of the
list with the new values. If the list of new values is longer than the indices, items will be
added to the list on the left-hand side. On the other hand, if the list of indices is shorter,
items will be removed from the original list.

SU2-16
ANL252 Data Types and Functions

Example (Cont’d): Suppose the name of the third student is not John, but Jon, and we
would like to change it now. Once again, the third item has index 2.

Figure 2.14 Editing single item in a list

Figure 2.14 shows how the third element has been replaced by a new value. Now,
suppose that the lecturer must deduct two marks from the students who scored 72
and 86 after rechecking their exam papers. In other words, we need to modify the first
two elements of the list scores.

Figure 2.15 Editing multiple items in a list

In the second line of the output, the first value has been replaced by 70 and the second
one by 84. However, if our list [70, 84], does not only contain two values but three,
the output will become:

SU2-17
ANL252 Data Types and Functions

Figure 2.16 Replacing more values than the specified indices

Python will not treat it as a syntax error. Instead, it will replace the values of the indices
0 and 1 by the first two new values, 70 and 84, add a new value to it (namely 60),
before returning to the rest of the original list. As a result, the list scores now has
four instead of three elements.

Let us turn to the opposite situation and assume that the indices for modification are
longer than the list of new values.

Figure 2.17 Replacing less values than the specified indices

Same as before, Python will not treat the code here as a syntax error as well. Instead,
it will replace the values of the indices 0 and 1 by the only new value, 70, and then
return to the rest of the original list, which is 35, the last value. As a result, the list
scores now has two instead of three elements.

In the above examples, we can see that lists can be extended or shrunk with the
replacement of certain items in the original list by a longer or shorter list of new values.

SU2-18
ANL252 Data Types and Functions

1.2.4 Concatenating and Merging Lists


Same as tuples, we can concatenate multiple lists into one by “adding” them together.

combined_list = list1 + list2 + …

Basically, Python uses the addition operation “+” to concatenate objects such as strings,
tuples, or lists, rather than carrying out arithmetic addition, except for numeric values.
Concatenating lists is a recommended step in Python programming if the nature, i.e.,
content and type, of the lists is identical.

Example (Cont’d): Suppose we have two classes for the same course. The student
names of these classes are stored in the lists class1 and class2 respectively.
Equivalently, the exam scores are stored in the lists scores1 and scores2. Now, we
would like to concatenate class1 and class2 into a new list called names and
scores1 and scores2 into a new list called scores.

Figure 2.18 Concatenating lists

Even if the nature of the two lists is not identical, it is sometimes still quite convenient to
store their data in one list for later use.

SU2-19
ANL252 Data Types and Functions

Example (Cont’d): Suppose we would like to concatenate the two lists – names and
scores – into a new list called combined_list in order to store the data of the entire
class in a single compound data source.

Figure 2.19 Concatenating lists with different data types

So, the new list consists of ten elements combined from the two lists names and
scores. Apparently, concatenating lists with different types of data, such as strings
and integers in this example, is allowed in Python.

Figure 2.19 shows a straightforward concatenation of two lists in Python. However, such
a combination of the two lists will lead to difficulty in distinguishing the original nature
such as meanings and types of the elements. For instance, in the above example, suppose
the first item in the list scores is the exam score of the first student in names – that
is, Peter’s exam score is 72. We will not be able to assign the score to the corresponding
name unless we know that each score always belongs to the name that is five positions
before. If we keep in mind that the length and contents of the concatenated list may change
every now and then, the meaning and source of each element will become more and more
untraceable with time.

Another way to solve this problem is to merge two lists into a new list without combining
the elements together. Instead, each element of the new list is a list and not a single value.

SU2-20
ANL252 Data Types and Functions

To merge two lists into a new list while keeping them as “list elements”, we cannot use
the addition operator as introduced before. Instead, we define the new list by putting the
list names instead of some values as the elements.

list_name = [list1, list2, …]

The advantage of this merging technique is that we will always know that index 0 of
the new list refers to the list names and index 1 to the list scores, if the new list is a
combination of two known lists in a fixed sequence. As a result, we will be able to trace
the origin and meaning of the data in the new list.

Example (Cont’d): Suppose we would like to put the two lists’ names and scores
into a new list called class_data and keep them as lists in the new list.

Figure 2.20 Merging two lists by keeping them as lists in the new one

In contrast to the concatenation technique where all elements of the new list are just
combined straightforwardly, we can see from Figure 2.20 that the elements of the list
names are wrapped by each one pair of outer and inner square brackets to indicate
that they belong to a list that in turn is an element of the new list class_data. The
same can be observed for the elements of scores too.

Note that this merging technique is not limited to merging lists. We can also merge two
tuples into a list and keep their types as tuple in the new list as well.

SU2-21
ANL252 Data Types and Functions

To access a single element in the merged list, we need to use the double index operator []
since the single index operator would return one of the original lists to us.

Example (Cont’d): Now, we extract the first element as well as the first name and
score from the merged list class_data and print them to the screen.

Figure 2.21 Extracting elements from a merged list

Here, we put in the print() function for class_data[0], which is the first element
of class_data. This element also happens to be the original list names. Then we try
to extract the first element from the first list within class_data by using the double
index operator class_data[0][0]. As a result, Python extracts the first element
in class_data first, which corresponds to the list with the names elements. From
there, Python extracts the element with the index 0, which is “Peter” in this case. The
same has also been carried out with class_data[1][0], which is the first element
of the second list. The resulting element is 72 here.

It is noteworthy that the indexing technique introduced in the above example also works
for multiple indexing.

1.2.5 Printing Lists


In the previous study unit, we learned to use loops to carry out iterative tasks. In fact,
loops can be very useful when working with lists. The reason is obvious: Since the items

SU2-22
ANL252 Data Types and Functions

of a list can be accessed by their indices, we can easily use loops to subset, print, and/or
modify them.

Furthermore, we learned how to generate a list of integers to serve as sort of a counter for
the iterations of the for-loops. Now, after being familiarised with the concept of lists, we
do not always need the range() function to create these integers for us. Instead, we can
simply use any available list as our counter. Nevertheless, the range() function can still
be very useful in some situations.

Example (Cont’d): Suppose we would like to print all the student names of the two
classes on the screen.

Figure 2.22 Printing list elements line by line to the screen

In this example, we print the elements of the list names line by line to the screen. In
each iteration, Python will assign the value of the current list element to the counter
variable i, and the print() function will print the value stored in i to the screen.

Suppose we would like to modify the screen output and use string formatting to print
the score and the student’s name in the same line by linking them up using more
natural language.

SU2-23
ANL252 Data Types and Functions

Figure 2.23 Printing elements of multiple lists using formatted string

Since we want to run through two lists, names and scores, and print out their
corresponding elements, we cannot use one of the lists as our counter in the for-
loop. Thus, we need to use the range() function to generate a list of integers as
our counter. Most conveniently, the integers should be exactly corresponding to the
indices so that we can subset our lists within the for-loop directly. The obvious start
index is 0, and the end index would be the number of items in our lists minus one.
Since the range() function does not include the end value of the range in the result,
we can therefore simply take the length of our list for this purpose. Here, we can apply
the len() function to determine the length of names, and then store the result in the
variable listlen, which will in turn be taken as the end of our integer list for the
range() function.

Within the for-loop, we will have to instruct Python to print the element of names
and scores with the index i for each iteration. And we can subset them by
names[i] and scores[i], respectively.

In the above example, we need to extract the information from two separate lists. In
Chapter 1.2.4, we learned how to merge two lists into one single list where the original
lists are kept as list elements in the new one, which makes the coding easier since we only
need to work with one list. Instead of using the syntax we learned there, we can also use
loops to carry out simple or complicated merging of lists.

SU2-24
ANL252 Data Types and Functions

Example (Cont’d): Suppose we would like to extract the first student’s name and score
from the two lists and put these data into a small list. We then do the same for the
second student, etc. Eventually, all these small lists will be merged into a single list.

Figure 2.24 Rearranging data storage in lists

On line 12, we create a temporary list to store the pair of values that we have extracted
from names and scores. This temporary list templist will then be appended to
finallist, our target list, in line 13. This line is essential for the whole process as
we would like to eventually merge all the information into a single list.

However, it is important that we define finallist as an empty list before the for-
loop starts. To define an empty list, we simply assign a pair of square brackets [] with
no content to our target list. If we did not define our target list first, line 13 would
produce a syntax error as Python would not know what finallist is.

Furthermore, the instructions inside a loop are identical in every iteration unless we
add an if-condition in it. If we defined finallist as an empty list inside the for-
loop instead, the list would be re-initialised and become empty in each iteration.
And if we simply have finallist = templist in line13, the complete content of
finallist would be replaced in each iteration. The final output would then only
consist of the data of the last student. It is therefore important to accumulate data in
every iteration by appending templist to finallist.

SU2-25
ANL252 Data Types and Functions

The square brackets wrapping templist in line 13 is important, since we want to


keep the data of each student as a small “sub-list” within the finallist. If the square
brackets are omitted, each element in the small list will be appended as an individual
element to finallist. Hence, it is important to be familiarised with how brackets
are appropriately placed in a syntax. The output can be quite different.

This arrangement of finallist is perhaps the most natural way to store the data.
In each small list, we have the data of each student. In other words, if we extract an
element of the list finalist, we will have all the data of the one corresponding student,
which can be quite convenient when dealing with these data further.

Figure 2.25 Printing data of a list using formatted string

After creating finallist, we can use a for-loop to run through it. Within the for-
loop, Python will extract one element from finallist in each iteration and store it
in the variable i. Note that i is not a counter variable here. It is more like a temporary
storage for the current element of the list that is being run through. Since each element
of finallist is a list itself with two elements, we can subset them by i[0] and i[1]
respectively. From the construction concept of each of these small lists we know that
the first element is the student’s name, and the second element is the corresponding
exam score. As a result, all we need to do is to put the subsets at the right place of the
formatted string for screen output.

SU2-26
ANL252 Data Types and Functions

1.2.6 Entering Data to Lists


So far, we have assigned pre-defined values to the lists in our code directly. But we can also
let the user enter his/her own data and store them into a list by the input() function.
If we do not limit the number of entries like in Study Unit 1, a while-loop with an
appropriate exit condition will be the right approach here.

Example (Cont’d): The user is now asked to enter the student’s name and score, and
s/he can stop entering by pressing ENTER for either the name or the score. After the
entering process, the list will be printed on the screen for checking.

Figure 2.26 User’s input to a list

In this example, we use several elements that we have learned in this and the
previous study units to provide a clear input environment to the user and the suitable
functionality for data storage in a list.

SU2-27
ANL252 Data Types and Functions

• The programme begins with the definition of an empty list called finallist
that will be used to store the data eventually. This is the same step as in line 9
of Figure 2.25.
• We initiate two variables before the while-loop starts: the counter variable i,
which is set to 1 initially, and the Boolean variable proceed which controls
whether the while-loop should continue with the next iteration or not.
• The while-loop will continue to run as long as proceed is True. Its value will
only change from True to False if the user’s input of either the name or the
score is an empty string "" – i.e., ENTER.
• We print an instruction to tell the user what to do before the loop starts.
This instruction can also be placed within the while-loop. In that case, the
instruction will be printed on the screen in every iteration.
• We start the printing string in line 4 with an escape sequence “\n” to create a
blank line between the system prompt and our instruction. It is visually more
comfortable when the texts are not put too close together. We have built in
several escape sequences for line break in the programme to create blank lines
for the same purpose.
• In line 6, we include the value of the counter variable in the formatted string to
show the number of the student whose data the user is about to enter.
• The if-condition in line 7 checks whether the user’s input for the student’s
name is an ENTER key or not. If it is the case, the programme will switch
proceed to False and break from the while-loop by the command break. For
the mechanism of breaking from the while-loop, the line proceed = False is
insufficient since the remaining instructions within the while-loop will still be
carried out, although the user intends to stop the entering process immediately.
As a result, we need to add the break-command to it. Nevertheless, for the
sake of programming “cleanliness”, we also add the line to change the value of
proceed to False.
• Starting from line 10, we implement a while-loop controlled by a Boolean
variable called valid_input to check whether the user’s input of the score is

SU2-28
ANL252 Data Types and Functions

numeric or not. This part of the code is mostly taken over from Chapter 5.4 in
Study Unit 1. Here the user is allowed to press ENTER – i.e., an empty string
to quit the entering process. As a result, we cannot convert the user’s input to
float or integer in the same line as the input() function, or else the if-condition
in line 13 could become invalid.
• For the exit mechanism, both syntaxes proceed = False and break are not
redundant. The break command is used here to break out from the while-
loop that controls the numeric input, and not the outer while-loop for entering
data of multiple observations. The change of value for proceed from True to
False prevents the merging of list in lines 22-25. Hence, the programme will
jump to the end of the while-loop and break from it due to proceed = False.
• The conversion of the variable std_score to a numeric value will be put in
the try-block to prevent Python from stopping the programme due to the
occurrence of an error. If the input of std_score is indeed a number (but
stored as string temporarily), the control variable valid_input will turn
to True and the while-loop will stop iterating. However, if the input is a
character string (with exception of an empty string), Python will print the
warning message that is written in the except-block, and the while-loop will
start a new iteration.
• In the formatted string of the input() function to enquire for the exam score
in line 12, we embed the student’s name that has just been entered by the user
in line 6. This can prevent users from entering the score of another student due
to the visual confusion caused by the mass of information on the screen.
• The same mechanism to store the data in templist and then to append them
to finallist as in Figure 2.25 is employed here in line 23 and line 24. But the
whole procedure will only be carried out if proceed = True. That is, if the
inputs of both the name and the score are not empty strings.
• We increase our counter variable by 1 after all the data for the ith student has
been typed in and confirmed by the user.

SU2-29
ANL252 Data Types and Functions

• After the entire entering process, the programme will print the entire list to
the screen for checking purpose. However, this will only be executed if the
user has at least entered the entire set of data for one student. This control is
implemented in case the user has already quit the entering process during the
first iteration and prevents us from printing an empty list. We have also built
in several escape sequences for line breaks here to enhance the visual comfort
of the output.

Read

For details and examples on lists and on operations applicable to lists, read the
following official Python documentation:

Python Software Foundation. (2023d). Built-in types: Sequence types – list, tuple, range.

https://docs.python.org/3/library/stdtypes.html#sequence-types-list-

tuple-range

Also, read the following section of the textbook on looping over elements of lists:

Exercise 32 Loops and Lists

Read the following section of the textbook on accessing, adding, removing, and joining
elements of lists:

Exercise 34 Accessing Elements of Lists

Exercise 38 Doing Things to Lists

SU2-30
ANL252 Data Types and Functions

Lesson Recording

Python Dictionaries

1.3 Dictionaries

1.3.1 Defining and Extracting Dictionaries Values


Another useful built-in data type is the dictionary. It is best to think of a dictionary as an
unordered set of key:value pairs, with the requirement that the keys are unique (within
one dictionary). The key:value pairs are separated by commas and wrapped in a pair
of braces.

dictionary_name = {"key1":value1, "key2":value2, …}

Unlike lists and tuples, which are indexed by a range of numbers, dictionaries are indexed
by keys which are usually strings or numbers. As a result, we use keys to subset a
dictionary instead of indices.

value = dictionary_name["key"]

Hence, we need to use the keys to extract values from the dictionary.

Example (Cont’d): Previously, we store the name and score of a student in a small list
and merge all these “sub-lists” into a big list subsequently. Now, we save our data of
class1 in a dictionary instead. In the following, we will use the student names as
the key and their scores as the value.

SU2-31
ANL252 Data Types and Functions

Figure 2.27 Creating a dictionary in Python

To extract the score of Peter, we can put the key, "Peter", in the index operator [].

Figure 2.28 Extracting value from a dictionary

Note that if the key that we would like to refer to is a string, we will have to put it
within a pair of quotation marks.

The advantage of dictionary is that we can choose a much suitable description for the
value and use it as the key. The programmers or users do not need to rely on their memory
or understanding of the code to trace back the nature and type of the values in a tuple or
a list like our examples in Chapter 1.2.4 and Chapter 1.2.5.

1.3.2 Printing Dictionaries


We can use for-loops to print out the items from a dictionary line by line. However, the
syntax is quite different from tuples and lists since each entry in a dictionary contains two
elements: key and value. On the one hand, we can use the key to extract the corresponding
value as demonstrated in Chapter 1.3.1, on the other hand, we cannot simply use the index
operator [] to refer to the keys. One possibility to get the keys of a dictionary is to use the
.keys() method:

dictionary_name.keys()

Equivalently, the .values() method can extract all values from a dictionary.

SU2-32
ANL252 Data Types and Functions

dictionary_name.values()

Method is like a function to carry out certain actions on the object before the dot (.). We
will have detailed discussion on methods and functions in later chapters. Here, the object
is a dictionary, and the methods are .keys() and .values(). Applying these syntaxes
to any dictionary, Python can extract all the keys and values from it.

Example (Cont’d): Suppose we would like to obtain all the keys and values separately
from the dictionary class1 defined in Figure 2.27.

Figure 2.29 Extracting keys and values from a dictionary

The method .keys() returns an object called dict_keys() with the keys in our
dictionary, and the method .values() returns an object called dict_values()
with the values. However, we cannot work with these keys and values yet since we
cannot extract them from the dict_keys() and dict_values() objects directly.
We need to transform them to list by a function called list() first.

Figure 2.30 Store keys and values from a dictionary in lists

SU2-33
ANL252 Data Types and Functions

The keys and values are now stored in their corresponding lists, recognisable by the
square brackets around them. From the techniques introduced in Chapter 1.2, we can
print the keys line by line now.

Figure 2.31 Print all the keys from a dictionary line by line

We can apply the same technique to print the values line by line too. Nevertheless, we
can also use the index operator [] on the original list class1 to print the keys and the
corresponding values. To extract the keys individually, we need to initiate a for-loop
that iterates through the list containing the dictionary keys, which is class_keys
here. In each iteration within the for-loop, one key of the list will be stored in the
variable i. The value can then be extracted by class1[i].

Figure 2.32 Print all the keys and values from a dictionary line by line

We put two objects separated by comma in the print() function. This is a command
to tell Python to print these objects in the same line to the screen. And the values
assigned in these objects are separated by an empty space between them.

If we want to extract all keys and values from a dictionary in the same step, we can apply
the .items() method on a dictionary.

SU2-34
ANL252 Data Types and Functions

dictionary_name.items()

Same as the .keys() and .values() methods, the result returned by Python from the
.items() method is not an object that can be accessed directly. Nevertheless, it can be
converted to a list of keys and values by the list() function.

Example (Cont’d): We apply the .items() method on class1 to obtain all the keys
and values of the dictionary.

Figure 2.33 Extracting items from a dictionary

The first print() function prints the dict_items() object returned from the
.items() method to the screen. This object contains every pair of keys and values
from the dictionary stored in a tuple. In the second print() function, we convert the
dict_items() object to a list. The printed object is a list with each pair of keys and
values stored in a tuple as its elements.

The list created from a dict_items() object contains tuples that have two elements in
each of them: the keys and the values. We can easily use a for-loop and the index operator
to print the contents. Here, we introduce an extension of the for-loop so that we can omit
the indices when referring to the keys and values.

for element1, element2 in list(dictionary_name.items()):


instructions

SU2-35
ANL252 Data Types and Functions

The syntax list(dictionary_name.items()) in the above for-loop can certainly


be replaced by any defined list created from a dictionary. The main difference between
the usual for-loop we know and the one we introduce here is that we use two temporary
storage variables instead of one. The mechanism is rather simple. In each iteration, Python
will extract one tuple from the list, and each element of the tuple will then be stored in
one of these storage variables. Since the keys are stored first in the tuples, element1 will
contain the key and element2 will contain the value. If we want to use these two values in
our instructions, we can simply refer to these variables without using the index operator.

Example (Cont’d): Now we use the double storage variables in a for-loop to print
out the keys and values of our dict_items() object after being converted to a list.

Figure 2.34 Print items from a dictionary after converting it to a list

In line 2, we extract the dict_items() object and convert it directly to a list named
class_items. The two storage variables of the for-loops are key and val here.
And in the print() function, we use a formatted string to print out the two values
in a normal sentence for the user to read.

1.3.3 Editing Dictionaries


We can change a value of a dictionary by assigning a new value to a certain key.

dictionary_name["key"] = value

The value can be a numeric value, a character string, a tuple, or a list.

SU2-36
ANL252 Data Types and Functions

Example (Cont’d): Suppose the score of Peter was 70 instead of 72. We can simply
change it by assigning a new value to the key “Peter”.

Figure 2.35 Change one value in a dictionary

There was then a second exam and each of our three students has a new score in
addition to the scores that are already stored in the dictionary. Below is one rather
trivial approach to carry out this task.

Figure 2.36 Add items to every dictionary key and convert values to lists

The above approach replaces the original values in the dictionary with some lists. Each
list contains two scores for each student. Therefore, we must include the first score of
each student in the list as well, which is not quite elegant in terms of programming
conciseness. Another possibility is to use a for-loop to convert the value of each key
and append the new score to it.

Figure 2.37 Print items from a dictionary after converting to lists using for-loops

In this approach, we defined a new dictionary named exam2_score to store the


scores of the second exam. We also use the students’ names as our dictionary key here.
Hence, class1 and exam2_score share the same keys which makes the mechanism

SU2-37
ANL252 Data Types and Functions

within the for-loops much easier to handle. The list that is used for the for-loop to
iterate is the key list of the dictionary class1. In each iteration, the variable i will
store one key from the list. Since both dictionaries have identical keys, we can use
the dictionary sub-setting technique on both dictionaries by putting i in the index
operator. Subsequently, we merge the extracted values from both dictionaries to one
list, and then assign it to the corresponding key of class1. Eventually, we print out
the edited dictionary for checking purpose.

While assigning new values to a key in a dictionary is rather straightforward, editing a


key in a dictionary is not a simple task in Python. Basically, the keys of a dictionary are
immutable and cannot be changed directly. But we can create a new key in a dictionary,
take over the values from the old key, and then delete the old key in the last step. To delete
a key in a dictionary, we can use the syntax del.

del object

Note that the object in this syntax does not only refer to a key in a dictionary. It can be any
Python object such as a variable, a list, a tuple, a dictionary, etc. Once an object is deleted,
it will no longer be available in the programme.

Example (Cont’d): Suppose the key “John” was a typo when entering the data. It
should be “Jon” instead.

Figure 2.38 Changing a key of a dictionary

SU2-38
ANL252 Data Types and Functions

First, we assign the value of “John” in the dictionary class1 to a new key called
“Jon” of the same dictionary. Subsequently, we delete the key “John” and its value as
an entire object from the running program.

On the other hand, if we want to add a new key to a dictionary, we can simply assign a
value to a new dictionary key.

dictionary["new key"] = value

Basically, the syntax to add a new key is just the same as the syntax to edit an existing key.
The only difference is that the key put in the index operator must be a new one, if we wish
to add a new item.

Example (Cont’d): Suppose we would like to add a new student, Michael, who scores
60 in the exam, to our dictionary.

Figure 2.39 Adding a new key to a dictionary

However, if we want to merge two dictionaries, Python offers two options:

new_dictionary = dictionary1 | dictionary2 #Version 3.9+


new_dictionary = {**dictionary1, **dictionary2} #Version
3.5+

The first syntax is a new option available from Python version 3.9 onwards. (Recall from
Study Unit 1 that this study guide is written based on Python version 3.9.) It uses the

SU2-39
ANL252 Data Types and Functions

“Bitwise Or” operator, which is a vertical line “|”, to merge two dictionaries. The second
syntax is an option available from Python version 3.5 onwards.

Example (Cont’d): Suppose we have a second dictionary with the exam results of a
second class, and we would like to merge these two dictionaries together.

Figure 2.40 Merging two dictionaries using the v3.5+ option

Figure 2.41 Merging two dictionaries using the v3.9+ option

As we can see, the two methods deliver the same result.

SU2-40
ANL252 Data Types and Functions

Read

Read the following section of the textbook on creating and manipulating dictionaries:

Exercise 39 Dictionaries, Oh Lovely Dictionaries

Read the following official Python documentation for more details and examples on
dictionaries:

Python Software Foundation. (2023c). Built-in types: Mapping types. https://


docs.python.org/3/library/stdtypes.html#typesmapping

SU2-41
ANL252 Data Types and Functions

Chapter 2: Integrated Methods and Functions

Lesson Recording

Integrated Methods and Functions in Python

2.1 Built-In Functions in Python


In the previous chapters, we have come across various built-in functions of Python such
as print(), input(), int(), etc. These functions are always available in the Python
environment. A function is a routine programme that processes values which are passed
on to it as an argument or a parameter, and returns some results to the user eventually. For
example, a float value is passed on to the int() function as an argument, Python then
removes all the decimal digits and returns an integer as result.

The following table contains all the built-in functions in alphabetical order.

Table 2.1 Built-in functions of Python

abs() all() any() ascii()

bin() bool() breakpoint() bytearray()

bytes() callable() chr() classmethod()

compile() complex() delattr() dict()

dir() divmod() enumerate() eval()

exec() filter() float() format()

SU2-42
ANL252 Data Types and Functions

frozenset() getattr() globals() hasattr()

hash() help() hex() id()

input() int() isinstance() issubclass()

iter() len() list() locals()

map() max() memoryview() min()

next() object() oct() open()

ord() pow() print() property()

range() repr() reversed() round()

set() setattr() slice() sorted()

staticmethod() str() sum() super()

tuple() type() vars() zip()

__import__()

(Source: Python Software Foundation, 2023, https://docs.python.org/3/library/functions.html)

Some of the listed functions are rather straightforward such as abs(), sum(), round()
, etc. There are also some that are quite unclear in terms of their functionality or area
of use such as frozenset() or staticmethod() just by looking at their names. You
can visit the website https://docs.python.org/3/library/functions.html to get a detailed
explanation on how to integrate and apply all these functions in Python programs.

SU2-43
ANL252 Data Types and Functions

Example (Cont’d): Suppose we would like to summarise the combined exam results
of class1 and class2 by calculating their mean, maximum, and minimum.

Figure 2.42 Computing statistics of a dictionary’s values

In the first three lines, we repeat the syntaxes from Chapter 1.3.2 to merge two
dictionaries into a new one called all_classes. Subsequently, all the exam scores
are extracted by the .values() method from all_classes. In line 5, we use the
sum() function to add up all the numbers stored in scores and divide the result
by the number of elements in scores, determined by the len() function, to obtain
the mean exam scores. The mean will then be rounded to two decimal places by the
round() function when embedding it in a formatted string for printing purpose. In
the last two lines, we determine the highest and lowest exam scores by the max() and
min() functions and include them in the formatted strings for the print() function
to print them onto the screen.

You may wonder why some rather basic functions such as a function for the calculation
of the mean is missing in the above list. Some of those functions may be included in some
common packages that will be introduced in Chapter 4. Some of them could be built-in
methods instead, which will be introduced in the next section.

Read

Read the following website for more details and examples on Python methods for
strings:

SU2-44
ANL252 Data Types and Functions

W3 schools. (2023c). Python string methods. https://www.w3schools.com/python/

python_ref_string.asp

2.2 Built-In Methods in Python


Some routines in Python are not supposed to be applied as functions in the Python
environment; instead, they are methods that can be applied to the objects they are attached
to. In Study Unit 1, we learned the method .format() for format printing. In this study
unit, we come across the .keys(), .values() and .items() methods that can be applied on
dictionaries to extract their keys and values. Same as functions, there are built-in methods
that are always available in the Python environment. These methods will return results to
the programme once they are applied on defined objects during runtime. However, each
method can only be applied to a certain object type. Below is a list of selected built-in
methods of Python.

Table 2.2 Built-in methods of Python

Method Description Applicable Object Type

append() Adds an element at the end List


of the list

capitalize() Converts the first character String


to upper case

casefold() Converts string into lower String


case

center() Returns a centred string String

clear() Removes all the elements List, Dictionary

SU2-45
ANL252 Data Types and Functions

Method Description Applicable Object Type

copy() Returns a copy List, Dictionary

count() Returns the number of String


times a specified value
occurs in a string

count() Returns the number List, Tuple


of elements with the
specified value

endswith() Returns True if the string String


ends with the specified
value

extend() Adds the elements of a list List


(or any iterable) to the end
of the current list

find() Searches the string for a String


specified value and returns
the position of where it
was found

fromkeys() Returns a dictionary with Dictionary


the specified keys and
value

get() Returns the value of the Dictionary


specified key

SU2-46
ANL252 Data Types and Functions

Method Description Applicable Object Type

index() Returns the position of List, Tuple


the first element with the
specified value

insert() Adds an element at the List


specified position

isalnum() Returns True if all String


characters in the string are
alphanumeric

isalpha() Returns True if all String


characters in the string are
in the alphabet

islower() Returns True if all String


characters in the string are
lower case

isnumeric() Returns True if all String


characters in the string are
numeric

isspace() Returns True if all String


characters in the string are
whitespaces

istitle() Returns True if the string String


follows the rules of a title

SU2-47
ANL252 Data Types and Functions

Method Description Applicable Object Type

isupper() Returns True if all String


characters in the string are
upper case

items() Returns a list containing a Dictionary


tuple for each key value
pair

join() Joins the elements of an String


iterable to the end of the
string

keys() Returns a list containing Dictionary


the dictionary's keys

lower() Converts a string into String


lower case

lstrip() Returns a left trim version String


of the string

pop() Removes the element at List, Dictionary


the specified position

popitem() Removes the last inserted Dictionary


key-value pair

remove() Removes the first item List


with the specified value

SU2-48
ANL252 Data Types and Functions

Method Description Applicable Object Type

replace() Returns a string where a String


specified value is replaced
with a specified value

reverse() Reverses the order of the List


list

rfind() Searches the string for a String


specified value and returns
the last position of where it
was found

rstrip() Returns a right trim String


version of the string

sort() Sorts the list List

split() Splits the string at the String


specified separator, and
returns a list

splitlines() Splits the string at line String


breaks and returns a list

startswith() Returns true if the string String


starts with the specified
value

strip() Returns a trimmed version String


of the string

SU2-49
ANL252 Data Types and Functions

Method Description Applicable Object Type

swapcase() Swaps cases: lower case String


becomes upper case and
vice versa

title() Converts the first character String


of each word to upper case

update() Updates the dictionary Dictionary


with the specified key-
value pairs

upper() Converts a string into String


upper case

values() Returns a list of all the Dictionary


values in the dictionary

zfill() Fills the string with a String


specified number of 0
values at the beginning

(Source: W3 schools, 2023, https://www.w3schools.com/python/default.asp)

The complete list of methods can be found on:

• W3 schools. (2023d). Python tuple methods.


https://www.w3schools.com/python/python_ref_tuple.asp
• W3 schools. (2023b). Python list/array methods.
https://www.w3schools.com/python/python_ref_list.asp
• W3 schools. (2023a). Python dictionary methods.
https://www.w3schools.com/python/python_ref_dictionary.asp

SU2-50
ANL252 Data Types and Functions

• W3 schools. (2023c). Python string methods.


https://‌www.w3schools.com/python/python_ref_string.asp

Read

Read the following website for more details and examples on Python methods for
strings:

W3 schools. (2023c). Python string methods. https://www.w3schools.com/python/

python_ref_string.asp

Read the following website for more details and examples on Python methods for
lists:

W3 schools. (2023b). Python list/array methods. https://www.w3schools.com/

python/python_ref_list.asp

Read the following website for more details and examples on Python methods for
dictionaries:

W3 schools. (2023a). Python dictionary methods. https://www.w3schools.com/

python/python_ref_dictionary.asp

Read the following website for more details and examples on Python methods for
tuples:

W3 schools. (2023d). Python tuple methods. https://www.w3schools.com/python/

python_ref_tuple.asp

SU2-51
ANL252 Data Types and Functions

Chapter 3: User-defined Functions

Lesson Recording

User-defined Functions in Python

In Chapter 2, we are introduced to some built-in functions and methods that are already
included in the Python programming environment. Functions and methods help us to
carry out routine tasks which would require us to write very lengthy code to achieve
the same functionality if we were to create the programme by ourselves. Nevertheless,
sometimes we can also write our own functions that suit our own needs.

A user-defined function (we will call it function in the following due to simplicity) can
be viewed as a separate part of the code that will not be interpreted by Python until it is
called from the main programme. Usually, a function consists of four parts:

1. the function name


2. some arguments, i.e., values or parameters the function needs for its processing.
This part is optional. If the function has no arguments, it will simply process all
the instructions without any input from the main program.
3. the instructions of how and what to process within the function.
4. an object (or a value) that should be returned to the main programme at the end
of the function. This is also optional. If no return object is specified, the main
programme will proceed without any output from the function.

In Python, the def-syntax indicates the definition of a function:

def function_name(argument1, argument2, …):


instructions
return object

SU2-52
ANL252 Data Types and Functions

Same as for-loops or if-conditions, the def-statement must end with a colon, and
all the follow-up instructions and codes that belong to the function must be indented.
Subsequently, the function can be called in the main programme by integrating the
function_name at an appropriate place.

… (Main program)
y = function_name(argument1 = object1, argument2 = object2,
…)
print(y)
… (Continue with main program)

We use a function to carry out a certain process. The objects object1 and object2 are
the corresponding input for argument1 and argument2 to the function. It is essential
for Python that object1 and object2 are already defined somewhere in the previous
part of the main programme. The output object (or value) from the function will then be
assigned to the variable y.

Caution! It is tempting for beginners to “outsource” some parts of the main programme
and make them separate functions for the sake of “programme cleanliness”. Though
the code may look more structured at first sight, the debugging process can be quite
challenging if the whole programme is jumping between the main programme and the
functions. The rules of thumb for where user-defined functions are appropriate would be:

1. if the same routine, probably with different arguments, appears more than once
in the main programme
2. if several functions should be combined into one which will then be used in the
main programme on multiple occasions
3. if a function can really increase the efficiency of the main programme

SU2-53
ANL252 Data Types and Functions

In these cases, functions can really simplify the programme code and the debugging of it
since there are less chances for syntax or logical errors.

Example (Cont’d): Here, we implement two user-defined functions:


list_dictkeys(dicts) and sort_dictkeys(dicts).

Figure 2.43 Defining functions

The first function list_dictkeys(dicts) is used to extract the keys from a


dictionary and convert them to a list in the same step. It has an argument dicts,
which is a dictionary variable to store the dictionary passed on by the main
programme to the function for the extraction and conversion process. This process
is written in the return syntax directly since we target on returning the list as the
output object of our function anyway. An additional step to first save the list in an
extra variable first and then to return the extra variable is therefore not necessary here.

The second function sort_dictkeys(dicts) sorts the dictionary keys directly. The
dictionary object is not a specific dictionary from the main programme, but a variable
called dicts that is used as the argument in the definition of the function and is also
only used within the function. The main programme will then pass on a dictionary
to sort_dictkeys and store the dictionary in dicts once it starts to process the
instructions inside the function. It is noteworthy to mention that we can call another
function within a function like in line 5, where we call list_dictkeys(dicts) to
generate the list of dictionary keys for further process.

The main programme does not become significantly shorter but simpler, in terms of
the number of functions and methods involved.

SU2-54
ANL252 Data Types and Functions

Figure 2.44 Sorting dictionary using user-defined functions

Read

Read the following three exercises of the textbook for more details and examples on
user-defined functions:

Exercise 18: Names, Variables, Code, Functions

Exercise 19: Functions and Variables

Exercise 21: Functions Can Return Something

SU2-55
ANL252 Data Types and Functions

Chapter 4: Modules, Packages and Libraries

Lesson Recording

Modules, Packages and Libraries in Python

Since the students are using Anaconda for this course, the content below is for the students’
information only.

4.1 Import a Standard Package


Besides built-in functions and methods as well as user-defined functions, Python also
provides packages which we can think of as a directory of Python scripts or the so-called
modules. These modules specify new functions, methods, and object types for solving
particular tasks. Packages are organised hierarchically. That means they may contain sub-
packages as well as regular modules themselves.

The packages of the standard library are already installed in the Python environment.
A library is a collection of codes for us to perform specific tasks without writing our
own code. But before we can use the modules in our programme, we need to import the
package or a specific module of the package first.

import package_name as package_alias


from package_name import module_name as module_alias

In the first syntax, we import the whole package into our programme. The alias is a
name that is used to refer to that particular package from thereon in our programme.
It is advantageous to use a package alias if it has a very long name. Note that the “as

SU2-56
ANL252 Data Types and Functions

package_alias” part is optional in the import syntax. If the original package name is
preferred, this part can be omitted.

The second syntax imports a particular module from a package. The alias here is the
referral name of the module that we will use in our programme, and not the package.
Once again, the alias part is optional and can be omitted.

4.2 Managing Packages with pip/pip3


There are many Python packages available from the Internet but are not yet installed
in the Python environment. To use those Python packages, we will first have to install
them on our system. Then we can import them into our programme in the same way as
the standard library. The simplest way to install such packages is to use PowerShell or
Command Prompt as well (or similar terminal apps from other operating systems). Once
we are prompted in the terminal window to give instructions to the operation system to
carry out, type in one of the following commands and then press ENTER:

> pip install package_name


> pip3 install package_name

“pip” or “pip3” refers to the installer program that Python uses for installing external
packages. Basically, “pip3” is a newer version of “pip”. In most of the cases, we can use
either one for our installation.

For instance, the package “numpy” will be needed in the next few study units and we
would like to install it for our class preparation:

SU2-57
ANL252 Data Types and Functions

Figure 2.45 Installing package “numpy” with pip

Once ENTER is pressed, Python will download the package installation file and install it.
The message “Successfully installed xxx” will appear on the screen once the installation
has been completed.

Another package that we will need in the next study units is “matplotlib”. This time we
install it with pip3.

Figure 2.46 Installing package “matplotlib” with pip3

SU2-58
ANL252 Data Types and Functions

In the same installation process, some other packages are downloaded and installed as
well. This is because “matplotlib” needs some modules of these packages so that it can
work.

If we wish to update/upgrade a package, the command that we need to type in our


terminal app will be:

> pip install package_name --upgrade


> pip3 install package_name --upgrade

And if we want to uninstall a package, the command will be:

> pip uninstall package_name


> pip3 uninstall package_name

In Python, there are some packages that are used quite commonly for data analytics. A
couple of them will also be covered in this study guide.

Table 2.3 Some common packages for data analytics in Python

Package Description In this Study


Guide (Y/N)?

matplotlib Creates data visualisation Y

numpy Manages multi-dimensional arrays Y

pandas Handles two-dimensional data tables Y

SU2-59
ANL252 Data Types and Functions

Package Description In this Study


Guide (Y/N)?

pendulum Provides complex coding for dates and N


times

requests Sends HTTP requests from Python code N

scikit-learn Provides tools for data analytics Y

scipy Carries out scientific and technical N


computations

sqlite3 Manages SQL database in Python Y

SU2-60
ANL252 Data Types and Functions

Summary

In this unit, we have learned three types of built-in compound data structures of Python:
tuples, lists, and dictionaries. We discussed the major differences among these objects and
the typical applications of them. Their creation and modification were also explained and
demonstrated in detail. The most important issue here is the technique for sub-setting and
indexing the elements in these objects. We were also introduced to functions, methods,
and packages in Python. While some of them are built-in – i.e., they are already included
in the Python environment – others can be user-defined or installed from external sources.
Based on examples, we have been shown how built-in functions and methods, user-
defined functions, as well as standard or external libraries can be applied to our Python
programmes.

SU2-61
ANL252 Data Types and Functions

Formative Assessment

1. What is the output of the following programme?

a = 1, 4, 9, 16
a[2] = 3
print(a)

a. 1, 3, 9, 16
b. 1, 4, 3, 16
c. 1, 4, [3, 9], 16
d. Syntax Error

2. What is the output of the following programme?

a = []
for i in range(0, 5):
a = a + [i ** 2]
print(a)

a. [0, 1, 2, 3, 4]
b. [0, 1, 4, 9, 16, 25]
c. [0, 1, 4, 9, 16]
d. [0, 1, 2, 3, 4, 5]

3. Which of the following syntaxes will have “north” as output?


a.
d = {“north”: 2, “south”, -2}
dkeys = list(d.keys())
print(dkeys[0])

SU2-62
ANL252 Data Types and Functions

b.
d = {“north”: 2, “south”, -2}
dkeys = list(d.keys())
print(d[0])

c.
d = {“north”: 2, “south”, -2}
dkeys = list(d.items())
print(dkeys[0])

d.
d = {“north”: 2, “south”, -2}
dkeys = list(d.values())
print(dkeys[0])

4. Which of the following statements is correct regarding the properties of dictionaries?


a. The values of a dictionary can only be integers, floats, and strings.
b. The keys of a dictionary cannot be modified.
c. The curly brackets around a dictionary can be omitted.
d. The elements of a dictionary must be separated by semi-colons.

5. What is a Python function?


a. A stand-alone Python programme
b. A dictionary with the module names of the standard library as keys
c. An object specifically designed for machine-learning operations
d. A Python routine code that is re-usable for a particular task

6. Which of the following methods does not apply to string variables?


a. .get()
b. .lower()

SU2-63
ANL252 Data Types and Functions

c. .replace()
d. .strip()

7. Which component is not optional in a user-defined function?


a. Function name
b. Arguments
c. Loops
d. Return value

8. What is not a good habit when implementing user-defined functions in a programme?


a. We should only implement user-defined functions whenever it is sensible to
combine multiple functions into one.
b. We should only implement user-defined functions when we need to carry out
the same routine repeatedly in our programme.
c. We should only implement user-defined functions when we can reduce our
main programme to some syntaxes merely for calling the functions.
d. We should only implement user-defined functions when we have recurrent
tasks that create certain output objects needed for the further parts of the main
programme and the creation of such objects requires some input arguments
from the previous parts of the main programme.

9. Which of the following statements is correct when using alias for importing package/
module?
a. We can use both the original name and alias to refer to the package/module in
our programme.
b. An alias is optional and can be omitted if we are comfortable to work with the
original package/module name.
c. An alias must be shorter than the original package/module name.
d. If only a single module from a package is imported, the alias refers to the
package and not the module.

SU2-64
ANL252 Data Types and Functions

10. Which command is used to install a new Python package?


a. > pip3 setup package_name
b. > pip3 install package_name --upgrade
c. > pip3 install package_name
d. > pip3 import package_name

SU2-65
ANL252 Data Types and Functions

Solutions or Suggested Answers

Formative Assessment
1. What is the output of the following programme?

a = 1, 4, 9, 16
a[2] = 3
print(a)

a. 1, 3, 9, 16

Incorrect. Since a is a tuple, it is immutable and we cannot assign a new value


to one of the elements in it.

b. 1, 4, 3, 16

Incorrect. Since a is a tuple, it is immutable and we cannot assign a new value


to one of the elements in it.

c. 1, 4, [3, 9], 16

Incorrect. Since a is a tuple, it is immutable and we cannot assign a new value


or change the type of the elements in it.

d. Syntax Error

Correct. Since a is a tuple, it is immutable. If we try to reassign a value to


one of the elements in it, we will get an error message from Python.

2. What is the output of the following programme?

a = []
for i in range(0, 5):
a = a + [i ** 2]

SU2-66
ANL252 Data Types and Functions

print(a)

a. [0, 1, 2, 3, 4]

Incorrect. In each iteration, i square will be put into a list and then appended
to a. The values of i are 0, 1, 2, 3, 4.

b. [0, 1, 4, 9, 16, 25]

Incorrect. Though i square is taken here, the last value is out of range since
the for-loop will stop running at i = 4.

c. [0, 1, 4, 9, 16]

Correct. The for-loop stops running at i = 4 and the values in a are i


square.

d. [0, 1, 2, 3, 4, 5]

Incorrect. The last value is out of range, and the values in a are i and not i
square.

3. Which of the following syntaxes will have “north” as output?

a.
d = {“north”: 2, “south”, -2}
dkeys = list(d.keys())
print(dkeys[0])

Correct. The .keys() method is used to extract the keys of a dictionary.


After converting the object to a list, we subset its item with the index 0,
which is the first key in this case: “north”.

b.
d = {“north”: 2, “south”, -2}
dkeys = list(d.keys())

SU2-67
ANL252 Data Types and Functions

print(d[0])

Incorrect. In the print() function, it refers to the object d and not dkeys.
And since d is dictionary, we can only access its element by the keys and not
the indices.

c.
d = {“north”: 2, “south”, -2}
dkeys = list(d.items())
print(dkeys[0])

Incorrect. The .items() method is used to extract the keys and values of
a dictionary and store each pair of them in a tuple. dkeys[0] will return
(“north”, 2) as its result.

d.
d = {“north”: 2, “south”, -2}
dkeys = list(d.values())
print(dkeys[0])

Incorrect. The .values() method is used to extract the values of a


dictionary. After converting the object to a list, we subset its item with the
index 0, which is the first value 2 in this case, and not the key “north”.

4. Which of the following statements is correct regarding the properties of dictionaries?

a. The values of a dictionary can only be integers, floats, and strings.

Incorrect. The value of a dictionary can also be tuples, lists, or other object
types.

b. The keys of a dictionary cannot be modified.

SU2-68
ANL252 Data Types and Functions

Correct. The keys of a dictionary cannot be modified. We can only modify


a specific key indirectly by adding a new pair of key and value to the
dictionary and deleting the old pair from it.

c. The curly brackets around a dictionary can be omitted.

Incorrect. A Python dictionary must be wrapped by a pair of curly brackets.

d. The elements of a dictionary must be separated by semi-colons.

Incorrect. The elements of a dictionary must be separated by commas.

5. What is a Python function?

a. A stand-alone Python programme

Incorrect. A function cannot be a stand-alone programme. We must have a


main programme to call a function for a specific task.

b. A dictionary with the module names of the standard library as keys

Incorrect. A function is a routine programme and not a dictionary.

c. An object specifically designed for machine-learning operations

Incorrect. A function is not an object, it is a routine programme.

d. A Python routine code that is re-usable for a particular task

Correct. A function is a chunk of code that performs a particular task and


can be called endlessly by the main programme.

6. Which of the following methods does not apply to string variables?

a. .get()

Correct. It is a method for dictionary. It extracts the value of a particular


dictionary key.

SU2-69
ANL252 Data Types and Functions

b. .lower()

Incorrect. It is a method for string variables. It converts a string to lower case.

c. .replace()

Incorrect. It is also a method for string variables. It replaces part of a string


by another string.

d. .strip()

Incorrect. It is a method for string variables too. It removes all empty spaces
at the beginning and at the end of a string.

7. Which component is not optional in a user-defined function?

a. Function name

Correct. A user-defined function must have a function name.

b. Arguments

Incorrect. Arguments are input from the main programme to the function
and they are optional. A function can still be carried out without arguments
provided.

c. Loops

Incorrect. A function does not need loops for its functionality. Its instructions
can contain loops, but it would still work without them.

d. Return value

Incorrect. A function can also be carried out even if it does not return any
value to the main programme.

8. What is not a good habit when implementing user-defined functions in a programme?

SU2-70
ANL252 Data Types and Functions

a. We should only implement user-defined functions whenever it is sensible to


combine multiple functions into one.

Incorrect. It is a good habit indeed. If multiple functions are applied to a


single object step-by-step, we can create a function and give a sensible name
to it to simplify our main programme.

b. We should only implement user-defined functions when we need to carry


out the same routine repeatedly in our programme.

Incorrect. It is recommended to use user-defined functions to replace


recurrent routines in the main programme.

c. We should only implement user-defined functions when we can reduce our


main programme to some syntaxes merely for calling the functions.

Correct. It is not a good habit to “outsource” chunks of code to various


functions just for the sake of simplifying the main programme. It usually
does not simplify the whole programme at all and actually makes the
debugging more difficult since we must jump between the functions on
checking the source of error.

d. We should only implement user-defined functions when we have recurrent


tasks that create certain output objects needed for the further parts of the
main programme and the creation of such objects requires some input
arguments from the previous parts of the main programme.

Incorrect. It is a good habit to implement functions when they can really carry
out recurrent tasks and return some objects/values to the main programme
based on some input arguments originated from the previous parts of the
main programme.

9. Which of the following statements is correct when using alias for importing package/
module?

SU2-71
ANL252 Data Types and Functions

a. We can use both the original name and alias to refer to the package/module
in our programme.

Incorrect. Once we have imported a package/module with an alias, we must


use the alias to refer to it in our programme.

b. An alias is optional and can be omitted if we are comfortable to work with


the original package/module name.

Correct. The alias part is not compulsory. If we have a concise original


package/module name, there is no reason to use an alias at all.

c. An alias must be shorter than the original package/module name.

Incorrect. There is no guideline for the length of the alias. It can even be longer
than the original package/module name.

d. If only a single module from a package is imported, the alias refers to the
package and not the module.

Incorrect. The alias refers to the module and not the package in the “from …
import … as …” syntax.

10. Which command is used to install a new Python package?

a. > pip3 setup package_name

Incorrect. It should be “install” and not “setup” following “pip3”.

b. > pip3 install package_name --upgrade

Incorrect. The “--upgrade”-option is only used for upgrading/updating


existing packages.

c. > pip3 install package_name

Correct. “install” should be following “pip3” for installation.

SU2-72
ANL252 Data Types and Functions

d. > pip3 import package_name

Incorrect. It should be “install” and not “import” following “pip3”.

Python Software Foundation. (2023a). Built-in functions. https:// docs.python.org/3/

library/functions.html

Python Software Foundation. (2023b). Built-in types. https://docs.python.org/3/

library/stdtypes.html#sequence-types-list-tuple-range

Python Software Foundation. (2023c). Built-in types: Mapping types. https://

docs.python.org/3/library/stdtypes.html#typesmapping

Python Software Foundation. (2023d). Built-in types: Sequence types – list, tuple,

range.https://docs.python.org/3/library/stdtypes.html#sequence-types-list-

tuple-range

Shaw, Z. A. (2017). Learn Python 3 the hard way. Addison-Wesley Professional.

W3 schools. (2023a). Python dictionary methods. Refsnes Data. https://

www.w3schools.com/python/python_ref_dictionary.asp

W3 schools. (2023b). Python list/array methods. Refsnes Data. https://

www.w3schools.com/python/python_ref_list.asp

W3 schools. (2023c). Python string methods. Refsnes Data. https://www.w3schools.com/

python/python_ref_string.asp

W3 schools. (2023d). Python tuple methods. Refsnes Data. https://www.w3schools.com/

python/python_ref_tuple.asp

SU2-73
ANL252 Data Types and Functions

SU2-74
3
Study
Unit

Arrays and Plots


ANL252 Arrays and Plots

Learning Outcomes

By the end of this unit, you should be able to:

1. Explain the operations on arrays


2. Analyse data using appropriate tools for data visualisation

SU3-2
ANL252 Arrays and Plots

Overview

This study unit introduces two Python packages: NumPy and matplotlib. NumPy is the
fundamental package for efficient scientific computing with Python. We will learn how to
create NumPy arrays and how to use indexing and Boolean masks for sub-setting NumPy
arrays. We will also learn the NumPy functions to generate statistics on the data stored
in an array. Furthermore, we will learn how to use the “matplotlib.pyplot” sub-package
for data visualisation purpose. In particular, the functionalities available for plotting and
customising basic charts of data analytics will also be the focus of this study unit.

SU3-3
ANL252 Arrays and Plots

Chapter 1: Array Management with NumPy

Lesson Recording

Array Management with NumPy

In Study Unit 2, we were introduced to various types of compound data such as tuples,
lists, and dictionaries. We also discussed in detail how to modify a list or store different
types of elements in a list. Despite being able to group lists or tuples in a superordinate list,
these types of compound data are basically still one-dimensional. Recall in one part of our
exam score example in the previous study units, the list of data to be analysed can consist
of either sub-lists with individual student scores from different subjects or the scores of
all students in one subject in each one of them. As a result, the data in this example are in
fact two-dimensional: the student dimension and the subject dimension.

Hence, lists and dictionaries are no longer sufficient to store multi-dimensional data for
analysis, and arrays should be used instead. Nevertheless, we can also replace lists and
dictionaries with arrays when it comes to one-dimensional data. Note that the shape of
a Python array must be rectangular – that is, the number of values in each row and each
column must be identical, and all values must be entirely of the same type (typically
numeric values or strings). Therefore, an array is not equivalent to a dataset in the
conventional sense since missing values and mixed data types are common features of
usual datasets. In Python, we can work with arrays using the “numpy” package.

1.1 Creating NumPy Arrays


First of all, we need to launch JupyterLab and start working with NumPy arrays after the
“numpy” package has been imported into our programme. The corresponding syntax in
Python is:

SU3-4
ANL252 Arrays and Plots

import numpy as np

Recall that we can import a package with an alias, which will then be used as our reference
to the package in the further part of our programme. And for “numpy”, the most common
alias used in literature or online references is “np”.

To create an array, we can use the array() function:

array_name = np.array([[list1_data1, list1_data2, …],


[list2_data1, list2_data2, …], …])

In the above syntax, the array() function was attached to the np. prefix, which is
required if NumPy is imported using the import-statement. The prefix should be omitted
if we use the from … import … statement to import NumPy instead.

The data assigned to array_name in the array() function are originally stored in
various regular Python lists. Each list corresponds to a row of the array and the total
number of rows is therefore equal to the number of lists included in the array() function.
Additionally, the number of elements in each list must be identical. If the lists have
different lengths, an error message will appear. Note that it is compulsory to wrap all the
lists, separated by commas, in a pair of outer square brackets - again before putting them
into the array() function.

In NumPy, an n-dimensional array is also called an “ndarray”. Each direction of an array is


called an axis. For instance, the rows or the columns are the two axes of a two-dimensional
array, just like the coordinate system.

Example (Student score, cont’d): Suppose we have three lists and each of them
contains one student’s exam scores of three different courses. The three lists are [72,

SU3-5
ANL252 Arrays and Plots

73, 53], [86, 83, 90] and [35, 42, 51]. And we would like to store all these
scores in a two-dimensional array.

Figure 3.1 Creating an array with NumPy

From the output of the array, we can see that each list in the array() function
corresponds to one row in the array, and not one column. Furthermore, NumPy uses
the outer square bracket to indicate the start and end of the array, and the inner square
brackets are used to wrap the data in each row.

Due to the mechanism of the array() function, it is crucial to make sure what features
of the data are represented by the columns and the rows before creating the arrays.

1.2 Sub-Setting Arrays


We can use the index operator to access certain elements of an array in the same way as
indexing tuples, lists, or dictionaries in Study Unit 2. From there, we can subset an array.
Nonetheless, there are two ways to subset an array using the index operator:

• Using index: Suppose we would like to get the second value of the first row in an
array. We can extract the element by array_name[0, 1], where array_name can
be any arbitrary name of an array. The index here starts with 0, same as Python List.
Recall that if we intend to do multiple indexing like start:end, the end index will

SU3-6
ANL252 Arrays and Plots

not be included in our array subset. Furthermore, negative indexing and open-end
indexing are also allowed here. Check on Chapter 1 of Study Unit 2 for more details
regarding indexing.
• Using Boolean masking: Suppose we want to get all values larger than 80. The
first step is to use array_name > 80 to produce a Boolean mask. The result is
a NumPy array with Boolean elements: True if the corresponding value is above
80; False if it is below. Subsequently, we can use the Boolean mask inside a pair
of square brackets to do sub-setting. Only those elements above 80, for which the
corresponding Boolean mask is True, are selected. If, for instance, there are two
values above 80, we will end up with a NumPy array with two values.

Since arrays can be multi-dimensional (ndarray), we can certainly subset every dimension
of it by using multi-dimensional indexing. The following syntax is used for subsetting
two-dimensional NumPy arrays:

array_name[row_index, column_index]

Basically, the usage of the index operator here is just like sub-setting a one-dimensional
Python list. The only difference is the two sets of indices in it: one for row indexing, and
the other one for column indexing. The resulting subset of the array can be a single value,
a row, a column, or an array with less rows and/or less columns. For multi-dimensional
NumPy arrays, the order of the indices in the index operator must follow the sequence
of the axes or dimensions, in an array. For instance, axis one of a two-dimensional array
refers to the rows and axis two to the columns.

Example (Cont’d): The array exam_scores created in Figure 3.1 contains data of
individual students in the row and data of each subject in the column. Suppose we
would now like to extract all the exam scores of the second subject.

SU3-7
ANL252 Arrays and Plots

Figure 3.2 Sub-setting a column from a NumPy array

To subset a column from a two-dimensional array, we must indicate indices for both
the rows and the columns, or else an error message would appear. The column index
is clearly 1 here since we intend to extract the second column of the array. The row
index must be multiple indexing since we would like to access the entire column. As
a result, open-end indexing starting from index 0 is the most appropriate way here to
access the elements of row 1 to row 3 of column 2.

In the next step, we would like to extract the exam scores of the first two students in
the last two subjects.

Figure 3.3 Sub-setting a sub-array from a NumPy array

Here, we use negative indexing for sub-setting the column. Since our array has three
columns, -2 is the index of the second last column. We leave the end index here open
to instruct Python to “take every index until the end” which, in this case, is the last
column of the array exam_score.

SU3-8
ANL252 Arrays and Plots

Assuming that the passing mark is 40, we would now like to subset all the failed
exams. That is, we extract all exam marks below 40.

Figure 3.4 Creating a Boolean mask of a NumPy array

If we ask Python to compare an array with a numeric value, we will obtain a Boolean
mask as mentioned before. Comparing the Boolean values in Figure 3.2 with the array
created in Figure 3.4, the only True value found here is the score of the third student
in subject 1. To subset exam_score using the Boolean mask, we need to put the
condition exam_scores < 40 within a pair of square brackets.

Figure 3.5 Sub-setting a NumPy array based on a Boolean mask

The result is the only value in exam_scores that is smaller than 40, namely 35, which
is the score of the third student in the first subject. And it is also the only True value
in the Boolean mask.

We can also check various properties of an array using the following NumPy functions
and methods:

type(array_name)
array_name.ndim
array_name.shape
array_name.size
array_name.dtype

SU3-9
ANL252 Arrays and Plots

The type() function indicates the type of our array while the .ndim method returns
the array’s number of dimensions, which is usually 2 in our case. The .shape method
provides the number of rows and columns of the given array and the .size method
calculates the total number of elements in an array. The .dtype method shows us the
type of data contained in the array.

Example (Cont’d): In the following, we extract all the information on the


characteristics of our array exam_scores.

Figure 3.6 Extracting information on the characteristics of a NumPy array

The array type returned from the type() function is a Python type output <class
'numpy.ndarray'>, where ndarray stands for n-dimensional array.

Another remarkable output is the values returned from the .size() method. In fact,
the .size() method returns a tuple (row_number, column_number). We subset
the corresponding result in the formatted string by using the index operator [].

SU3-10
ANL252 Arrays and Plots

The type of data returned by the .dtype() method is int32, a specific NumPy
integer type that fixes the length of an integer variable at 32 bytes. The usual integer
variable of Python has no fixed length and its type is simply called int.

Read

Refer to the three links below for more details and examples on the methods “shape”,
“ndim” and “size” of NumPy arrays:

NumPy Developers. (2022d). numpy.ndarray.shape. https://

docs.scipy.org/doc/numpy/reference/generated/

numpy.ndarray.shape.html#numpy.ndarray.shape

NumPy Developers. (2022c). numpy.ndarray.ndim. https://

docs.scipy.org/doc/numpy/reference/generated/

numpy.ndarray.ndim.html#‌numpy.ndarray.ndim

NumPy Developers. (2022e). numpy.ndarray.size. https://docs.scipy.org/doc/

numpy/reference/generated/numpy.ndarray.size.html#‌numpy.ndarray.size

1.3 Working With NumPy Arrays


The NumPy package does not only facilitate the creation and management of arrays, but
it also provides various functions for us to work with.

Each function deals with specific types of variables. For instance, mathematical functions
such as log() and sqrt() can only be applied on arrays with numeric values, whereas
strip() and upper() are functions specifically designed for arrays with only strings in
them.

Below is a table of some frequently used NumPy functions.

SU3-11
ANL252 Arrays and Plots

Table 3.1 Most common NumPy functions

Function Description

Array Information and Operations

count_nonzero() Counts the number of non-zero values in the array

extract() Return the elements of an array, given some


conditions

nanargmin() Return the indices of minimum values of rows or


columns, ignoring the missing values

nonzero() Return the indices of the elements that are non-zero

partition() Return a partitioned copy of an array

where() Return selected elements, depending on condition

Statistics

amin(), amax() Returns row or column minimum or maximum

percentile() Return row, column, or array percentile

mean() Return row, column, or array mean

median() Return row, column, or array median

nan_to_num() Replace NaN (missings) with zero and infinity with


large finite numbers

std() Return row, column, or array standard deviation

SU3-12
ANL252 Arrays and Plots

Function Description

var() Return row, column, or array variance

Rounding

ceil() Return ceiling

fix() Round to the nearest integer towards zero

floor() Return floor

round() Round to the given decimal places

trunc() Return truncated value

Exponents and Logarithms

exp() Return exponential

log() Return natural logarithm

log10() Return base-10 logarithm

log2() Return base-2 logarithm

Arithmetic

add() Add two arrays together

absolute() Calculate the absolute value

divide() Divide first array by second array. Non-zero


elements in second array required

SU3-13
ANL252 Arrays and Plots

Function Description

mod() Return remainder of division

multiply() Multiply two arrays with each other

sign() Indicate the sign of a number

sqrt() Return non-negative square-root

subtract() Subtract two arrays from one another

Trigonometric

arccos() Return inverse cosine

arcsin() Return inverse sine

arctan() Return inverse tangent

degrees() Convert angles from radians to degrees

cos() Return cosine

radians() Convert angles from degrees to radians

sin() Return sine

tan() Return tangent

Random Sampling

random.randint() Return an array of specified shape with random


integers in a given interval

SU3-14
ANL252 Arrays and Plots

Function Description

random.random_ sample() Return an array of specified shape with random


floats in the interval [0, 1)

random.ranf() Return an array of specified shape with random


floats in the interval [0, 1)

random.normal() Return an array of specified shape with random


floats from a normally distributed population

String Information and Operations

find() Return the lowest index of a substring in a given


string

islower() Check if all characters in the string are lower case

istitle() Check if a string is title-cased

isupper() Check if all characters in the string are upper case

startswith() Check if a string starts with a given prefix

String Information and Operations

capitalize() Convert the first character of a string to upper case

join() Join a sequence of elements to a single string by


given string separator

lower() Covert a given string to lower case

SU3-15
ANL252 Arrays and Plots

Function Description

strip() Remove all leading and trailing spaces from a


string

title() Convert the first character in each word of a string


to upper case

upper() Covert a given string to upper case

We must also be aware of whether the effect of a function is element-wise, row-wise,


column-wise, or array-wise. Having element-wise effects means that the function will be
applied to all elements of an array individually. Having effects on array means that the
function will be operating with all the array elements at the same time. Row-wise and
column-wise effects indicate that the function will take all the elements from the same row
or column into its operation. If an array has more than two dimensions, the row-wise and
column-wise effects will become axes effects. That is, the functions can be applied in each
direction of the array.

Example (Cont’d): Suppose we would like to compute some statistics of each subject’s
exam scores such as the mean, standard deviation, maximum, minimum, etc. We will
also have to round off all these statistics to 2 decimal digits.

SU3-16
ANL252 Arrays and Plots

Figure 3.7 Calculate column statistics of a NumPy array

For the statistical functions in NumPy, we usually need to specify the axis argument
in it. If axis = 0, the column statistics will be calculated. If, however, axis = 1, the
row statistics will be returned to us by the functions. Since we intend to compute the
statistics for each subject’s data, which are recorded in the columns of exam_scores,
we shall specify axis = 0 in this case.

For the round() function, we specify the argument named decimals to 2. Recall
from Chapter 3 of Study Unit 2 that it is a good programming habit to specify the
argument names of a function explicitly when assigning a value to it. It is therefore
important to always check the available arguments of a function carefully before using
it in the program.

Read

Refer to the three links below for more details and examples on the functions mean(),
median() and std() of the NumPy package:

SU3-17
ANL252 Arrays and Plots

NumPy Developers. (2022a). numpy.mean. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.mean.html#numpy.‌mean

NumPy Developers. (2022b). numpy.median. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.median.html#numpy.median

NumPy Developers. (2022f). numpy.std. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.std.html#numpy.std

Refer to the link below for more details and examples of functions provided by the
NumPy package:

Geeks for Geeks. (2021, June 28). N-dimensional array(ndarray) in numpy. https://

www.geeksforgeeks.org/numpy-ndarray/

SU3-18
ANL252 Arrays and Plots

Chapter 2: Plotting With Matplotlib

Lesson Recording

Plotting with matplotlib

2.1 Basic Plotting With Matplotlib


In this chapter, we will introduce some basic features of data visualisation in Python. The
most common visualisation package here is “matplotlib”. Its sub-package named “pyplot”
contains all the functions that we need in this Chapter.

To import the sub-package “matplotlib.pyplot”, we can use the following syntax:

import matplotlib.pyplot as plt

We use the alias “plt” here which is also commonly found in literatures and websites,
since it is short and has a clear reference to “pyplot”. Note that the above syntax has only
instructed Python to import the sub-package ”pyplot”. All the other functions and sub-
packages of matplotlib are not imported.

The sub-package “matplotlib.pyplot” provides many different plot types. In Chapter 3.2,
we will discuss the most common ones: histogram, bar charts, and scatter plots. We can
customise our plots by changing colours, shapes, labels, axes, etc., according to our own
needs and taste. In the following, we will introduce some basic plotting techniques based
on a line plot.

SU3-19
ANL252 Arrays and Plots

The following syntax creates a simple line plot:

plt.plot(x, y, color, linestyle, linewidth, marker,


markerfacecolor, markeredgecolor, markersize)

The argument x is a list of x-axis values. Correspondingly, y is a list of the y-axis


values. The argument color is a string to indicate the colour of the line to be plotted
such as “blue”, “red”, etc. The arguments linestyle and linewidth control whether
the line should be solid, dashed, dotted, etc., and how thin or thick it should be. We
can also choose the style of the marker of the data points on the line chart such as
point, circle, square, etc., with the marker argument. We can also fix the colour and
size of the marker with the arguments markerfacecolor, markeredgecolor and
markersize. There are actually more arguments available for the plot() function.
You can refer to https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html for all
available functions of “matplotlib”.

To label the axes, we can use the xlabel() and ylabel() functions.

plt.xlabel("My X-Label String")


plt.ylabel("My Y-Label String")

We can also set a title to the current plot using the title() function:

plt.title("My Plot Title")

Another useful plot customisation is to define the text and location of the labels on each
tick of the x-axis and y-axis.

SU3-20
ANL252 Arrays and Plots

plt.xticks(ticks, labels, rotation)


plt.yticks(ticks, labels, rotation)

A list of labels assigned to the argument labels will be plotted on the locations defined
with the argument ticks. We can also rotate the labels in case they fit optically better to
the plot if they are slanted. A numeric value representing the degree of rotation can be
assigned to the argument rotation.

Python will wait for the show() function to actually display all figures.

plt.show()

Note: Since our programmes in this study unit are constructed and run in JupyterLab, the
plots will anyway be displayed if we put all the syntaxes for plotting in one cell so that they
can be executed in the same run. As a result, we will not actually need the show() function
in the last step. However, this function is the final instruction to display the figures if we
run the plotting syntaxes in the original Python programme.

Example (cont’d): We would like to plot the CGPA development of a student in the
last 4 semesters. The data are [3.2, 3.3, 3.4, 3.1].

Figure 3.8 Importing “matplotlib.pyplot” into the programme

The following programme will generate the line chart required.

SU3-21
ANL252 Arrays and Plots

Figure 3.9 Programme for creating a line plot

In the first line, we instruct Python to create a line chart with a red line and black circle
markers. The data of the x-axis should be the semester number and the values on the
y-axis are the CGPA. We then name the axes “Semester” and “CGPA” respectively.
The location of the ticks’ labels on the x-axis must be 1, 2, 3, and 4 because these are
the only data of the x-axis. Since the CGPA usually lies within the interval of 0 and
4, we can create an integer list from 0 to 4 as our ticks’ labels on the y-axis. We know
from Chapter 5 of Study Unit 1 that we can use the range() function to generate
such a list. Here, we integrate it within the plt.yticks() function. In the final step,
we add a title to the line plot to highlight the topic of our chart using the title()
function. With the plt.show() function, we let Python generate the plot.

Figure 3.10 Line plot generated by


“matplotlib.pyplot”

SU3-22
ANL252 Arrays and Plots

Read

Refer to the three links below for more details and examples on the functions plot()
of the “matplotlib.pyplot” package:

matplotlib. (2023c). matplotlib.pyplot.plot. The Matplot development team. https://

matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html

2.2 Histograms and Scatter Plot


Histogram is another common type of plot in data analytics. It shows the distribution of
a variable by plotting the frequencies that certain ranges of value occur in a sample.

plt.hist(x, bins = None, range = None, align = "mid",


orientation = "vertical", rwidth = None, color =
None)

The hist() function has many arguments to control the histogram layout. The above
introduction of the function only includes the most common ones. For instance, we can
decide how many bins (bars) and which range of the values it should contain. We can
also choose to have a histogram with horizontal bars by changing the orientation
argument. With the arguments such as rwidth (which represents the width of the bars),
align (with which we can position the bars between two ticks or on top of a tick), and
color, we can format the bars according to our needs.

Example (cont’d): Suppose the exam scores of the two subjects (taken by the same
students) are now completely available, and we would like to generate a histogram
for subject 1 to look at the distribution.

SU3-23
ANL252 Arrays and Plots

Figure 3.11 Creating data array for subsequent plotting

First, we import both the “NumPy” and “matplotlib” packages, and then we create
two lists with the exam scores of each of the subjects. Eventually, we create a NumPy
array that contains both lists as its elements.

After that, the following programme is written to generate the histogram.

Figure 3.12 Programme for creating a histogram

In the first line, we instruct Python to create a histogram based on the scores of the first
subject’s examination, which are stored in the first row of the array exam_scores.
We set the range argument to be between 0 and 100 to ensure that extreme categories
such as 0-10 or 90-100 marks are also included in the chart, although their frequencies
could be 0. The number of bins is fixed at 10 here so that we gain an accurate image
of the distribution. The width of the bars is reduced from 1 to 0.8 so that they are not

SU3-24
ANL252 Arrays and Plots

touching each other, and they are placed between two ticks on the x-axis (align =
"mid") to indicate the score range each bar represents.

The axes are named “Scores” and “Frequencies”, according to their nature. A title is
also given to the histogram, which is labelled “Exam Marks Distribution”. The ticks
on the x-axis are placed with a gap of 10 marks between 0 and 100. We extended the
range to 105 since the right end is not included by the range() function.

Figure 3.13 Histogram generated by “matplotlib.pyplot”

Scatter plots are often used to study the relationship between two variables, which is
usually referred to as their correlation. The values of the first variable are plotted in the
x-axis and the values of the second variable in the y-axis. If the data dots are scattered
around the 45 degrees line of the chart, we can conclude that these variables are correlated
with each other.

To create a scatter plot, “matplotlib.pyplot” provides the following possibility:

plt.scatter(x, y, color = None, marker = None,


linewidths = None, edgecolors = None)

SU3-25
ANL252 Arrays and Plots

Same as the hist() function, we only list out some of the most common arguments
of the scatter() function here. For instance, we can change the colour and style
of the markers, assign another colour to the markers’ edge, and adjust its width for
more sophisticated visualisation. You can refer to https://matplotlib.org/api/_as_gen/
matplotlib.pyplot.scatter.html for more available arguments to control the scatter plot
layout.

Example (cont’d): Now, we investigate the correlation between the students’


performances in the two exams. The same set of data as in Figure 3.11 is used here to
generate a scatter plot by the following program.

Figure 3.14 Programme for creating a scatter plot

First, we instruct Python to create a scatter plot based on the array exam_scores.
The first row of the array contains the data of the x-axis, and the data in the second
row are values of the y-axis. The markers should be red circles and have black edges.
The axes are also named accordingly: “Subject 1” for the x-axis and “Subject 2” for the
y-axis. The title of the scatter plot is “Correlation between the Exam Scores of Subject
1 and Subject 2”. The ticks on the x-axis and y-axis are placed with a gap of 10 marks
between 0 and 100. We extended the range to 105 to include 100 in our chart since the
right end is not included by the range() function.

SU3-26
ANL252 Arrays and Plots

Figure 3.15 Scatter plot generated by


“matplotlib.pyplot”

Read

For more details and examples on the hist() function of the “matplotlib.pyplot”
package, refer to:

Matplotlib. (2023b). matplotlib.pyplot.hist. The Matplotlib development team.

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html

For more details and examples on the scatter() function of the “matplotlib.pyplot”
package, refer to:

Matplotlib. (2023a). matplotlib.pyplot.scatter. The Matplotlib development team.

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html

SU3-27
ANL252 Arrays and Plots

Summary

We have discussed two Python packages in this study unit: namely, NumPy and
matplotlib. They are the fundamental packages for efficient scientific computing and data
visualisation with Python respectively. We have learned the basics of the two packages
such as sub-setting and some functions to operate on NumPy arrays, and some functions
of matplotlib for plotting and customising basic charts for analytics such as line chart,
histogram, and scatter plot.

SU3-28
ANL252 Arrays and Plots

Formative Assessment

1. What is the output of the following programme?

a = np.array([1, 2, -1], [0, 3, -2])


print(a)

a.
[[1, 2, -1]
[0, 3, -2]]

b. [1, 2, -1, 0, 3, -2]


c. [[1, 2, -1], [0, 3, -2]]
d. Error message

2. What is the output of the following programme?

a = np.array([[1, 2, -1], [0, 3, -2]])


print(a[a < 0])

a. [-1 -2]
b.
[[-1]
[-2]]

c. [[], [-1, -2]]


d. [[, , -1], [, , -2]]

3. Which values will remain in the output, based on the following code?

SU3-29
ANL252 Arrays and Plots

a = np.array([[1, 2, -1], [0, 3, -2]])


print(a[1:, -2:-1])

a. [[0, 3, -2]]
b. [[3, -2]]
c. [[3]]
d. [[1, -1]]

4. What information does the NumPy method .shape provide?


a. The dimension number of an array
b. The number of rows and columns of an array
c. The total number of elements of an array
d. The type of data in an array

5. Which of the following NumPy functions does not have element-wise effects?
a. cos()
b. exp()
c. fix()
d. var()

6. What is Matplotlib package in Python used for?


a. machine learning
b. web development
c. data visualisation
d. database management

7. Which of the following is not an argument of the plt.plot() function?


a. range
b. marker
c. color

SU3-30
ANL252 Arrays and Plots

d. linestyle

8. What does the rotation argument of the plt.xticks() function control?


a. It controls the rotation of the plot.
b. It controls the rotation of the labels of the ticks.
c. It controls the rotation of the axis labels.
d. It controls the rotation of the main title.

9. What does the function plt.show() actually do?


a. It displays the most previous command of “matplotlib.pyplot”.
b. It sends all the figures to the connected printer for printing.
c. It displays all the figures on the screen.
d. It shows all the available arguments of the most previously executed function.

10. Where will the bars of a histogram be placed if we set align = "left"?
a. On top of the lower boundary tick
b. On top of the upper boundary tick
c. Between the ticks of the upper and lower boundaries
d. To the left of the y-axis label

SU3-31
ANL252 Arrays and Plots

Solutions or Suggested Answers

Formative Assessment
1. What is the output of the following programme?

a = np.array([1, 2, -1], [0, 3, -2])


print(a)

a.
[[1, 2, -1]
[0, 3, -2]]

Incorrect. Since a pair of square brackets to wrap up both the lists within the
array() function is missing, it is not a valid programme.

b. [1, 2, -1, 0, 3, -2]

Incorrect. Since the user intends to create an array with two rows. Hence, the
result cannot be a Python list with all the elements in it.

c. [[1, 2, -1], [0, 3, -2]]

Incorrect. The user intends to create an array with two rows, the result cannot
be a Python list with two sub-lists in it.

d. Error message

Correct. Since a pair of square brackets to wrap up both the lists within
the array() function is missing, it is not a valid programme and an error
message will appear.

SU3-32
ANL252 Arrays and Plots

2. What is the output of the following programme?

a = np.array([[1, 2, -1], [0, 3, -2]])


print(a[a < 0])

a. [-1 -2]

Correct. The sub-setting result in NumPy will be presented in a one-


dimensional row array.

b.
[[-1]
[-2]]

Incorrect. The sub-setting result in NumPy will not be presented as a column


array.

c. [[], [-1, -2]]

Incorrect. The sub-setting result in NumPy will not be presented in a


multidimensional array.

d. [[, , -1], [, , -2]]

Incorrect. The sub-setting result in NumPy will not be presented in a


multidimensional array and positions where the value was False in the
Boolean mask will not be kept in the resulting array.

3. Which values will remain in the output, based on the following code?

a = np.array([[1, 2, -1], [0, 3, -2]])


print(a[1:, -2:-1])

SU3-33
ANL252 Arrays and Plots

a. [[0, 3, -2]]

Incorrect. The only column index in this sub-setting is -2, which is the second
last column of the array. As a result, we cannot have three values remaining
in the output.

b. [[3, -2]]

Incorrect. The only column index in this sub-setting is -2, which is the second
last column of the array. As a result, we cannot have two values remaining
in the output.

c. [[3]]

Correct. The only column index in this sub-setting is -2, which is the second
last column of the array. Since the row index is 1, indicating the second row
of the array, the output here should be the value positioning in the second
column of the second row, which is 3.

d. [[1, -1]]

Incorrect. Since the row index is 1, indicating the second row of the array, the
output here cannot contain any value of the first row.

4. What information does the NumPy method .shape provide?

a. The dimension number of an array

Incorrect. The dimension number of an array can be extracted by the .ndim


method.

b. The number of rows and columns of an array

Correct. The .shape method returns the number of rows and the number
of columns as a tuple with two elements.

c. The total number of elements of an array

SU3-34
ANL252 Arrays and Plots

Incorrect. The total number of elements of an array can be extracted by the


.size method.

d. The type of data in an array

Incorrect. The type of data in an array can be extracted by the .dtype


method.

5. Which of the following NumPy functions does not have element-wise effects?

a. cos()

Incorrect. The cos() function calculates the cosine of each element of a


numeric array.

b. exp()

Incorrect. The exp() function calculates the exponential of each element of


a numeric array.

c. fix()

Incorrect. The fix() function rounds each element of a numeric array to the
nearest integer towards zero.

d. var()

Correct. The var() function calculates the variance of each row, each
column, or the entire array.

6. What is Matplotlib package in Python used for?

a. machine learning

Incorrect. The Matplotlib package is not used for machine learning.

b. web development

SU3-35
ANL252 Arrays and Plots

Incorrect. The Matplotlib package is not used for web development.

c. data visualisation

Correct. The Matplotlib package is used for data visualisation.

d. database management

Incorrect. The Matplotlib package is not used for database management.

7. Which of the following is not an argument of the plt.plot() function?

a. range

Correct. The plt.plot() function does not have the range argument.

b. marker

Incorrect. The marker argument controls the marker style of a line plot.

c. color

Incorrect. The color argument controls the line colour of a line plot.

d. linestyle

Incorrect. The linestyle argument controls the line style of a line plot.

8. What does the rotation argument of the plt.xticks() function control?

a. It controls the rotation of the plot.

Incorrect. The plot cannot be rotated generally. For some types of plot such as
histogram or bar chart, their bars can be presented vertically or horizontally.
But the plot cannot be rotated completely.

b. It controls the rotation of the labels of the ticks.

Correct. We can rotate the labels of the ticks if we want them slanted.

SU3-36
ANL252 Arrays and Plots

c. It controls the rotation of the axis labels.

Incorrect. It does not control the layout of the axis labels at all.

d. It controls the rotation of the main title.

Incorrect. It does not control the layout of the main title at all.

9. What does the function plt.show() actually do?

a. It displays the most previous command of “matplotlib.pyplot”.

Incorrect. It does not only display the most previous command of


“matplotlib.pyplot”.

b. It sends all the figures to the connected printer for printing.

Incorrect. It does not send anything to the printer for printing.

c. It displays all the figures on the screen.

Correct. It will display all the figures on the screen.

d. It shows all the available arguments of the most previously executed


function.

Incorrect. It is not a function to print out the arguments of a function at all.

10. Where will the bars of a histogram be placed if we set align = "left"?

a. On top of the lower boundary tick

Correct. They will be placed on top of the lower boundary tick.

b. On top of the upper boundary tick

Incorrect. If they were to be placed on top of the upper boundary tick, we


should have set align = “right”.

c. Between the ticks of the upper and lower boundaries

SU3-37
ANL252 Arrays and Plots

Incorrect. If they were to be placed between the ticks of the upper and lower
boundaries, we should have set align = “mid”.

d. To the left of the y-axis label

Incorrect. The align argument only controls how the histogram bars are
placed in relation to the ticks on the x-axis.

SU3-38
ANL252 Arrays and Plots

References

Geeks for Geeks. (2021, June 28.). N-dimensional array(ndarray) in numpy. https://

www.geeksforgeeks.org/numpy-ndarray/

Project Jupyter. (2018). JupyterLab documentation. https://jupyterlab.readthedocs.io/en/

stable/

NumPy Developers. (2022a). numpy.mean. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.mean.html#numpy.mean

NumPy Developers. (2022b). numpy.median. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.median.html#numpy.median

NumPy Developers. (2022c). numpy.ndarray.ndim. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.ndarray.ndim.html#numpy.ndarray.ndim

NumPy Developers. (2022d). numpy.ndarray.shape. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape

NumPy Developers. (2022e). numpy.ndarray.size. https://docs.scipy.org/doc/numpy/

reference/generated/numpy.ndarray.size.html#numpy.ndarray.size

NumPy Developers. (2022f). numpy.std. https://docs.scipy.org/doc/numpy/reference/


generated/numpy.std.html#numpy.std

matplotlib. (2023a). matplotlib.plot.scatter. The Matplotlib development team. https://

matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html

matplotlib. (2023b). matplotlib.pyploy.hist. The Matplotlib development team. https://

matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html

matplotlib. (2023c). matplotlib.pyplot.plot. The Matplotlib development team. https://

matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html

Project Jupyter. (2023). Jupyter. https://jupyter.org/

SU3-39
ANL252 Arrays and Plots

SU3-40
4
Study
Unit

Data Management
ANL252 Data Management

Learning Outcomes

By the end of this unit, you should be able to:

1. Explain the operations on datasets


2. Prepare data for analysis using Python programming

SU4-2
ANL252 Data Management

Overview

This unit will introduce the key data structure for analytics in Python: the pandas
DataFrame. We will learn to develop Python programmes to import data from external
sources and convert them to DataFrames, and then to index and query these structures.
We will then deepen our understanding of the pandas package by learning its efficient
functionality on merging multiple DataFrames, identifying and dealing with missing data
and outliers, sorting, grouping and transforming data, as well as discretising numeric
variables to bins.

SU4-3
ANL252 Data Management

Chapter 1: Import Data

Lesson Recording

Import Data in pandas

In the previous study units, we have learned the basic techniques of Python programming.
In this and the next study units, we will discuss in detail how Python can be used for data
management and data analytics.

The most common package for data management in Python is “pandas”. After installing
pandas using pip, we can import it in our programme by the following syntax:

import pandas as pd

Here, we use the alias pd to refer to the pandas package in our programs.

To start working with pandas, we need to have Python compatible datasets. Data
circulating in organisations or on the Internet are mostly saved as text files or worksheets.
Text editors, spreadsheets, and data management apps are popular tools for opening and
working with them. Pandas provides the same possibilities. The first step here is to load a
dataset in the Python environment and open it in the format of pandas. Suppose we have
a dataset from an external source saved as a .csv text file; we can import it by the following
pandas function:

DataFrame_name = pd.read_csv("csv_file_name.csv")

The content stored in the file “csv_file_name.csv” will then be assigned to the pandas
dataset object, or DataFrame, named DataFrame_name. The function read_csv() is

SU4-4
ANL252 Data Management

called a reader since it reads in specific format of data files and converts them to pandas
DataFrame.

Similar to the functions in NumPy or matplotlib, the read_csv() function has more
arguments than we list out here. We can adjust the execution of the reader to the
specifications of the .csv file with these arguments. For instance, we can specify the
character string of the delimiter, the row number in which the header is stored, the
path of the .csv file, etc. You can refer to https://pandas.pydata.org/docs/user_guide/
io.html#io-read-csv-table for further details.

Since .csv is not the only common file format of data files, pandas also allows the import
of other file formats such as Excel spreadsheets, SPSS data, or Stata data into Python by
providing the functions listed in the following table.

Table 4.1 Most common data file formats and the corresponding reader in pandas

Reader Format Type Data Description

read_csv() text CSV

read_html() text HTML

read_clipboard() text Local clipboard

read_excel() binary MS Excel

read_stata() binary Stata

read_sas() binary SAS

read_spss() binary SPSS

read_pickle() binary Python Pickle Format

read_sql() SQL SQL

SU4-5
ANL252 Data Management

Reader Format Type Data Description

read_gbq() SQL Google BigQuery

We can then use the .head() method to display the first five rows of the imported dataset.
It is important to check whether the data has been accurately imported.

DataFrame_name.head()

Alternatively, we can also use the conventional print() function to display the whole
DataFrame. Nevertheless, this can be quite frustrating if the dataset contains many rows
and columns, and the output does not fit to the window properly. Another way to print
the whole DataFrame is to use the display() function or omit the function completely
and simply execute a syntax with only the name of the DataFrame.

Note: When using NumPy ndarrays (n-dimensional arrays) to store multi-


dimensional data, a burden is placed on the programmer to specify the orientation
of the dataset because axes are considered equivalent. The meanings of rows and
columns of an array do not differ significantly and switching the roles of rows and
columns (i.e., transposing) to store the data does not change the nature of the array
at all. Moreover, the functionality of the NumPy functions will remain. For pandas
DataFrame, or datasets in general, the rows record individual observations while the
columns represent the features or variables of the data. Their roles usually do not
change throughout the entire analysis process. Thus, the axes lend more semantic
meaning to the data, and hence reduce the amount of mental effort required to code
the data transformation.

SU4-6
ANL252 Data Management

Example (Adult Census Data): The US Adult Census dataset is a repository


of 48,842 entries extracted from the 1994 US Census database. The dataset is
downloadable from https://www.kaggle.com/wenruliu/adult-income-dataset. It is
used to predict whether income would exceed $50,000 per year according to the
14 social-demographic attributes (for more, see http://www.cs.toronto.edu/~delve/
data/adult/adultDetail.html). Below is a list of the available variables in the dataset:

1. age: the age of an individual. Its value can be any integer greater than 0.
2. workclass: a general term to represent the employment status of an
individual. Its value can be Private, Self-emp-not-inc, Self-emp-
inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-
worked.
3. fnlwgt: final weight. In other words, this is the number of people the entry
represents. Its value can be any integer greater than 0.
4. education: the highest level of education achieved by an individual. Its
value can be Bachelors, Some-college, 11th, HS-grad, Prof-school,
Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th,
10th, Doctorate, 5th-6th, Preschool.
5. educational-num: the highest level of education achieved in numerical form.
Its value can be any Integer greater than 0.
6. marital-status: marital status of an individual. Married-civ-spouse
corresponds to a civilian spouse while Married-AF-spouse is a spouse
in the Armed Forces. Its value can be Married-civ-spouse, Divorced,
Never-married, Separated, Widowed, Married-spouse-absent,
Married-AF-spouse.
7. occupation: the general type of occupation of an individual. Its value
can be Tech-support, Craft-repair, Other-service, Sales, Exec-
managerial, Prof-specialty, Handlers-cleaners, Machine-op-

SU4-7
ANL252 Data Management

inspct, Adm-clerical, Farming-fishing, Transport-moving,


Priv-house-serv, Protective-serv, Armed-Forces.
8. relationship: represents what this individual is, relative to others. For
example, an individual could be a husband. Each entry only has one
relationship attribute. Its value can be Wife, Own-child, Husband, Not-
in-family, Other-relative, Unmarried.
9. race: Descriptions of an individual’s race. Its value can be White, Asian-
Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. gender: the biological gender of the individual. Its value can be Male,
Female.
11. capital-gain: capital gains for an individual. Its value can be any integer
greater than or equal to 0.
12. capital-loss: capital loss for an individual. Its value can be any integer greater
than or equal to 0.
13. hours-per-week: the hours an individual has reported to work per week. Its
value can be any positive real number or 0.
14. native-country: country of origin for an individual. Its value can
be United-States, Cambodia, England, Puerto-Rico, Canada,
Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece,
South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland,
Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-
Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary,
Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-
Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
15. income: whether an individual makes more than $50,000 annually. Its value
can be <=50k, >50k.

Since the original data file is saved in .csv format, we can use the read_csv()
function to import the dataset to Python.

SU4-8
ANL252 Data Management

Figure 4.1 Importing data with pandas

If we printed the census dataset using the print() function, the output would not
fit the window at all.

Figure 4.2 Printing entire imported dataset

As shown in Figure 4.2, the DataFrame will be truncated anyway. Eventually, only the
head and the tail of the dataset will be displayed due to lack of space. As a result, it is
more advisable to use the pd.head() than print() for control purpose.

Furthermore, Python provides the display() function.

SU4-9
ANL252 Data Management

Figure 4.3 Printing the head and tail of the imported dataset by display()

The display() function prints out the first five and last five observations of the
DataFrame. The font is smaller than the one used by the print() function, and we
are therefore able to see all columns side-by-side and without line break. Furthermore,
it also shows the number of rows and columns in the DataFrame, which can be quite
helpful in some cases.

SU4-10
ANL252 Data Management

Read

Refer to the link below for more details and examples on the read_csv() function
of the pandas package:

pandas. (2023a). IO tools (text, CSV, HDF5, …). NumFOCUS, Inc. https://

pandas.pydata.org/docs/user_guide/io.html

Read the following website for more information regarding the US Census data,
including the explanation of the variable names and other useful information about
the data:

Department of Computer Science. (1996, October 8). The adult dataset. University of

Toronto. http://www.cs.toronto.edu/~delve/data/adult/ adultDetail.html

SU4-11
ANL252 Data Management

Chapter 2: Data Selection

Lesson Recording

Data Selection in pandas

Similar to Python lists or NumPy arrays, we can access a pandas DataFrame by using the
index operator []. In this chapter, we will introduce three ways to subset rows, columns,
or elements of a DataFrame.

2.1 Selecting Columns by Variable Names


To select specific columns which represent the variables of a dataset, we can create a list
with the variable names (or labels) to be selected and then put it in the index operator.

DataFrame_name[["var_name1", "var_name2", …]]

Note that each variable name must be placed within a pair of quotation marks since it is
treated as a string in this case. If we simply want to access one column, we can omit the
creation of the list and put the variable name as a string inside the index operator directly.

SU4-12
ANL252 Data Management

Example (Cont’d): Now we would like to select the columns "marital-status",


"race", "gender" and "income" from the US Census dataset.

Figure 4.4 Select columns from a DataFrame by variable names

Suppose we would like to select the target variable "income" alone and save the
column as a NumPy array for further calculation.

Figure 4.5 Select a single column from a DataFrame and save it as NumPy array

We can see that no list will be needed within the index operator if only a single column
is selected. After selecting the column, the resulting subset of the pandas DataFrame
will then be converted to an NumPy array by the np.array() function.

2.2 Selecting Rows by Positions and Indices


Accessing rows requires different techniques from accessing columns. While we can use
the labels of the columns or variable names to select the columns we want, there are
usually no natural “observation names” that we can refer to when selecting rows from
a DataFrame. However, in Figure 4.1, we can see that a row index is provided at the

SU4-13
ANL252 Data Management

beginning of every row by pandas. It starts with 0 and ends with the number of rows in
the DataFrame minus one. As a result, rows can be queried by the numeric index position,
starting at 0, using the DataFrame attribute iloc.

DataFrame_name.iloc[start:end]

The indices in the index operator do not need to be consecutive integers. It can be any
integers within the range 0 and number of rows in the DataFrame – 1. But these integers
must be put in a list first if there are more than one of them. If we want to select a single
row instead, we can simply put one index in the index operator.

Example (Cont’d): Sometimes, we may need to inspect a dataset after running some
programmes to adjust it for further analytics purposes. We have been introduced
to the .head() method to print out the first five rows of the dataset. We can also
randomly pick a few rows from the DataFrame for our inspection. Here, we need
NumPy to draw random indices to select the rows for us.

Figure 4.6 Select rows from a DataFrame by random row indices

SU4-14
ANL252 Data Management

In the first line of our programme, we determine the total number of rows in
the dataset census. Note that the method .shape() is also applicable to pandas
DataFrames and it returns a tuple (Total Row Number, Total Column Number)
to us. As a result, we can refer to the first element of the tuple as the total number
of rows in census. In the second line, we draw a total of 10 random integers from
the interval 0 and nrow, the number of rows in census. Note that nrow as the upper
boundary is not included in the drawing process at all. The integers drawn by the
random.randint() function will then be assigned to the object named randrow.
And these will then be used as the list of rows that we select from census.

Before we can select rows based on their index labels, we need to first create the index
labels by the method .set_index().

DataFrame_name.set_index(key, inplace = True)

The parameter key can be either a single variable name (column label), a single array of
the same length as the calling DataFrame, or a list containing an arbitrary combination
of variable names and arrays. The argument inplace controls whether the DataFrame
should be modified in place or a new DataFrame should be created. If it is True, the
changes will take place in the original DataFrame.

SU4-15
ANL252 Data Management

Example (Cont’d): Suppose we would like to group the US Census data by the
occupation of the observations.

Figure 4.7 Setting the values of a column as row index

If we use one of the columns as the row index of our DataFrame, that column will not
be a regular part of the dataset anymore. As you can see from Figure 4.7, the variable
occupation has disappeared from the dataset.

Suppose we would like to change our row index from the occupation to the age group
of each observation. In the first step, we need to remove the occupation as our row
index by the .reset_index() method. It reverses the effect of the .set_index()
method and removes the current row index and converts it back to a column in the
DataFrame.

Figure 4.8 Resetting row index of a DataFrame

SU4-16
ANL252 Data Management

Then we re-code the age in each row to our target grouping and store it in a new array.
For observations less than 30 years, the group will be labelled as “Age <30”, whereas
the label of those between 30 and 59 years old is “Age 30-59”, and the rest receives the
label “Age 60+”. Subsequently, we assign this array as the new row index.

Figure 4.9 Setting a NumPy array as row index

We first create a list named agegroup that contains the grouping of each
observation’s age. The list will then be converted into a NumPy array and used as the
row index subsequently.

The DataFrame rows can be queried by the row index labels using the .loc attribute.

DataFrame_name.loc[["row_label1", "row_label2", …]]

We can see that selecting rows from a DataFrame by the .loc attribute works in a very
similar fashion to the column selection. The row labels must be indicated as strings and
put in a list if we want to select more than one of them. If we just want to select rows of a
single label, we can put the label as a string in the .loc attribute directly.

SU4-17
ANL252 Data Management

Example (Cont’d): Suppose we would like to select all the observations that are 30
years of age or younger from the US Census data.

Figure 4.10 Selecting rows by single row label

And if we want to select the youngest and oldest age groups from census, we will
need to put the row labels in a list.

Figure 4.11 Selecting rows by multiple row labels

SU4-18
ANL252 Data Management

Read

Refer to the links below for more details and examples on the attributes
.set_index() and .reset_index() of the pandas package:

pandas. (2023q). pandas.DataFrame.set_index. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.set_‌index.html

pandas. (2023p). pandas.DataFrame.reset_index. NumFOCUS, Inc.

https://pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.reset_index.html

Refer to the links below for more details and examples on the methods .loc() and
.iloc() of the pandas package:

pandas. (2023n). pandas.DataFrame.loc. NumFOCUS, Inc.

https://pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.loc.html

pandas. (2023k). pandas.DataFrame.iloc. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.iloc.html

2.3 Selecting Cells by Positions and Indices


To select elements in the DataFrame, we can specify both column and row labels in the
.loc attribute, or the positions in the .iloc attribute, or a combination of both.

Below is a syntax that uses only the row and column indices for the cell selection.

SU4-19
ANL252 Data Management

DataFrame_name.iloc[row_start:row_end, col_start:col_end]

We can also select cells by referring to the corresponding row and column labels.

DataFrame_name.loc[["row_labels"], ["col_labels"]]

If we want to select the rows by index but the columns by labels, we can use the index
operator and .iloc attribute together.

DataFrame_name[["col_labels"]].iloc[row_start:row_end]

But if we want to select the columns by index but the rows by labels, we need to use both
the .loc and .iloc attributes.

DataFrame_name.loc[["row_labels"]].iloc[:,
col_start:col_end]

While putting the row labels in the .loc attribute, we need to be aware that the .iloc
attribute requires both the row and column indices. Since we do not intend to select the
rows by index, we can use the open-end index 0: or simply : here.

Example (Cont’d): Suppose we would like to select the first five observations of the
first five variables.

SU4-20
ANL252 Data Management

Figure 4.12 Selecting cells using indices

Such a selection only makes sense when we exactly know the positions of the variables
and observations. More common is the selection of cells based on labels. Suppose
we would like to select the observed values in workclass and income for all
observations younger than 30 and older than 60 years old.

Figure 4.13 Selecting cells using row and column labels

Nevertheless, this selection method only works if the row labels are set and we can
refer to them in the .loc attribute. While column labels usually correspond to the
variable names, row labels are not necessarily used as row index in most of the
datasets. The row positions are therefore more useful in selecting cells in a DataFrame.
Suppose we would now like to select the observed values of workclass and income
from the first 5 rows.

SU4-21
ANL252 Data Management

Figure 4.14 Selecting cells by row indices and column labels

If we want to select the observed values of the first two columns from the observations
younger than 30 years old, we will have to use the .iloc and .loc attributes at the
same time.

Figure 4.15 Selecting cells by row labels and column indices

2.4 Selecting Cells by Boolean Masking


In Chapter 1.2 of Study Unit 3, we learned how to use Boolean mask to subset a NumPy
array. Here, we will apply the same technique to select cells from a DataFrame. A Boolean
mask is an array where each of the values is either True or False. The Boolean mask
array is overlaid on top of the data structure that we are querying. Any element aligned
with a True value will be selected, and any element aligned with a False value will not.

DataFrame_name[Condition]

SU4-22
ANL252 Data Management

We can also create more complex queries by using bitwise logical operators to chain
several conditions together.

DataFrame_name[(Condition1) &/| (Condition2) &/| …]

The bitwise logical operators are similar to the logical operators. Instead of writing
and/or, we use & (bitwise and), | (bitwise or), or ~ (bitwise not) to combine our
conditions in the DataFrame queries. We can also add the bitwise not operator to the
above syntax if we want to negate any condition.

We need the bitwise logical operators here because we are actually creating a Boolean
mask for each condition within the index operator []. If there are two conditions, two
Boolean masks will be compared element-wise by the bitwise operator. The result of this
comparison is in turn a Boolean mask as well.

Remember that each Boolean mask/condition needs to be encased in parentheses because


of the order of operations.

Example (Cont’d): Suppose we would like to select those observations that work more
than 40 hours per week.

SU4-23
ANL252 Data Management

Figure 4.16 Selecting cells by a Boolean mask

The above syntax is a combination of two instructions. First, we create a Boolean


mask to select only those observations where census["hours-per-week"] > 40
is True. Recall that census["hours-per-week"] is a syntax to select a specific
column from the DataFrame. So, the condition here is to tell Python to select the
column named hours-per-week first and then assign True to those cases where
the observed value is larger than 40. Those row indices where the Boolean mask is
True will then be selected from the DataFrame census by the index operator [].

In the next query, we would like to select female respondents from the DataFrame
that work more than 40 hours per week.

SU4-24
ANL252 Data Management

Figure 4.17 Selecting cells by chaining two Boolean masks

In the last query, we want to select female or non-white respondents who work more
than 40 hours per week.

Figure 4.18 Selecting cells by chaining multiple Boolean masks

SU4-25
ANL252 Data Management

In the first parentheses, we create a Boolean mask for observations where gender
is equal to "Female". In the second parentheses, the Boolean mask is created for
observations where race is not "White". These two masks are compared by the |
(bitwise or) operator. The resulting Boolean mask will then be compared by a Boolean
mask where the values in hours-per-week are larger than 40.

SU4-26
ANL252 Data Management

Chapter 3: Merge DataFrames

Lesson Recording

Merge DataFrames in pandas

3.1 Appending DataFrames by Rows


It often happens that newly collected data must be appended to an existing set of data
for analyses. To do so, the newly collected data must consist of the same variables as the
existing dataset. Merging these datasets here means appending rows 4, 5, and 6 to the
existing dataset which contains rows 1 to 3. The end-result is one dataset with six rows.

Figure 4.19 Concatenating two datasets with different rows but identical variables

In Python, we can use the .append() method to merge two DataFrames with identical
variables into one.

DataFrame_name.append(other = [OtherDataFrames])

SU4-27
ANL252 Data Management

The parameter other is used for the specification of those DataFrames to be appended
to DataFrame_name eventually. If we only have one DataFrame to be assigned to the
parameter other, we simply put its name without quotation marks behind other =. In
the case of specifying multiple DataFrames to the parameter other, we need to put their
names in a list.

Example (Cont’d): To study the different income groups in the US census data, the
data analysts decided to split the dataset into two. Observations with income “<50K”
will be saved in a new dataset named “census_low” and those with income “>50K” are
now saved in “census_high”. Now, after the datasets have been cleaned and studied
separately, both datasets should be merged again for some joint analyses.

Figure 4.20 Appending two DataFrames

Note that we can also apply the .append() method on census_low directly.
However, we must be very sure that we no longer need census_low with its original
data since there is no way to retrieve its original content after the appending process,
unless we can import the original dataset from an external source again. Furthermore,
if we appended census_high to census_low directly and re-ran the same code

SU4-28
ANL252 Data Management

out of whatever reasons, census_low would eventually contain the observations of


census_high double. Therefore, in Figure 4.20, we first copy census_low to a new
DataFrame named census_new. And the .append() method is only applied on
census_new. Re-running the same code would not create mess in any of the involved
DataFrames at all.

Logically, the observations in the merged DataFrame do not follow the same order as
census since it does not play any role in this appending process at all.

Read

Refer to the link below for more details and examples on the .append() method of
the pandas package:

https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/
pandas.DataFrame.append.html

3.2 Merging DataFrames with Different Shapes


The .append() method introduced in Chapter 3.1 is applicable for merging two
DataFrames with the same variables by rows. Nevertheless, there are other scenarios when
merging multiple datasets in general, where one of the most useful ways is natural join
(or inner join). The syntax for natural join/inner join is demonstrated as below:

pd.merge(df1, df2, on=‘matching_key’, how=‘inner’)

For example, if we have two DataFrames as follows:

1. DataFrame A

SU4-29
ANL252 Data Management

Variable 1 Variable 2

1 Value 11 Value 12

2 Value 21 Value 22

3 Value 31 Value 32

2. DataFrame B

Variable 1 Variable 3

1 Value 11 Value 13

2 Value 21 Value 23

4 Value 41 Value 43

The inner join will return the rows where DataFrame A have matching keys in DataFrame
B. In this case, let Variable 1 be the matching key. Then the inner join resulting table should
be as follows:

Variable 1 Variable 2 Variable 3

1 Value 11 Value 12 Value 13

2 Value 21 Value 22 Value 23

The following is a simple example on inner join/natural join of DataFrame:

Example: Suppose we have the two DataFrames (df1 and df2) regarding customers,
shops, and products, as demonstrated in the following two figures:

Figure 4.21 df1 on Customer ID and Product

SU4-30
ANL252 Data Management

Figure 4.22 df2 on Customer ID and Shop

Then we inner/natural join the two tables based on the common variable
Customer_id, the DataFrame will keep all the rows that matched and all the
variables with the argument how=’inner’.

Figure 4.23 DataFrame after inner join

Read

Refer to the one link below for more details and examples on merging DataFrames
using the pandas package:

pandas. (2023b). Merge, join, concatenate and compare. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

SU4-31
ANL252 Data Management

Chapter 4: Missing Data and Outliers

Lesson Recording

Missing Data and Outliers in pandas DataFrames

In empirical studies, it often occurs that an observed value of a variable is missing. There
are many reasons for missing data: defective measurement tools, withdrawal from the
study, refusal of responses to sensitive questions, etc. In Python, there is the NoneType to
indicate missing data. Different packages have different ways to display a missing value.
For instance, pandas uses a special floating-point value for missing values, and NumPy
uses NaN which stands for “Not a Number”.

Missing data are not desirable for data analytics since they cannot be included in
constructing models, forecasting, etc. Statistical estimation of parameters can be biased.
In pandas, when we use statistical functions on DataFrames, missing values are typically
ignored by these functions. As a result, the execution of the code will not be interrupted,
but the computation of these functions could be due to unequal underlying sample sizes
for each variable.

4.1 Identifying Missing Values


Most of the time, we must work with datasets provided from external sources, and missing
values can be referred to very differently. The reasons for such discrepancies could be
typing errors, the varying habit of the data collectors when entering missing values, or
the limitation of the software used for data entry, etc. In pandas, readers such as the
read_csv() function provide two parameters, na_filter and na_values, to convert

SU4-32
ANL252 Data Management

certain strings to missing values directly while the data are being converted to pandas
DataFrame.

DataFrame_name = pd.read_csv("csv_file_name.csv",
na_values = "na_string",
na_filer = True/False)

The default value of the na_filter parameter is True. In this case, pandas will convert
all white spaces "" to NaN. However, there could be situations where white space is an
actual value of interest and not a missing value. The filter should then be turned off and
the value would be False.

With the parameter na_values, we can declare certain strings from our DataFrame to be
recognised as missing values. By default, strings like "", "#N/A", "#N/A N/A", "#NA",
"-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null" are treated as missing values and do not need
to be specified explicitly with this parameter.

Example (Cont’d): From Figure 4.1 and Figure 4.2, we can recognise that question
marks are used to indicate missing values in the US Adult Census dataset. Suppose
we would like to declare every cell that contains a question mark solely as a missing
value.

SU4-33
ANL252 Data Management

Figure 4.24 Declaring specific strings as missing values while importing data

In the read_csv() function, we specify a single question mark as string that should
be identified as a missing value in the census DataFrame. In Figure 4.24, we can
see that the values for workclass and occupation in the fifth row are now NaN,
whereas in Figure 4.1, they were simply "?".

Read

Refer to the link below for more details and examples on the parameters associated to
missing values in the read_csv() function of the pandas package:

pandas. (2023t). pandas.read_csv. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.read_csv.html

SU4-34
ANL252 Data Management

4.2 Locating Missing Values


In Chapter 4.1, we learned that the parameters in the read_csv() function can instruct
Python to indicate missing values clearly with NaN in the DataFrame. Though they will
become uniquely identifiable, it is neither easy to locate their positions nor to detect their
existence, if the dataset contains many rows and columns. One way to find out their
existence and positions is to count the NaNs in each row and each column.

If the number of missing values (aka NaNs) in a column is more than zero, we have then
identified the variable that has missing values. Such a variable with missing values may
introduce bias into our analytics tasks. And if the number of missing values in a variable
is large, the variable may not have sufficient data for reliable analysis.

DataFrame_name.isnull().sum(axis = 0)
DataFrame_name.isnull().sum(axis = 1)

The above syntax is in fact a Boolean masking. It contains two methods of the pandas
package. The .isnull() method instructs Python to check every cell of the DataFrame
and then return True if it is an NaN. Subsequently, Python should return the sum of each
row or each column of the Boolean mask. If the parameter axis = 0, the values in a
column are added up. And if axis = 1, we will get the sum of the row instead. The default
axis here is 0. Since True is usually represented by 1 and False by 0 when converting
a Boolean variable to a numeric value, the sum of a row or a column with only Boolean
values will therefore be the same as counting the occurrence of True in it.

If our intention is just to check the existence of missing values, we can use the .any()
method instead. The .any() method will return True if at least one of the elements in
the array returned by the .isnull() method is True.

DataFrame_name.isnull().any(axis = 0)

SU4-35
ANL252 Data Management

DataFrame_name.isnull().any(axis = 1)

We can retrieve the indices of the rows or columns with missing data by applying the
.index method on the resulting object from the syntax above.

object_name = DataFrame_name.isnull().any()
object_name[object_name == True].index

Counting the NaNs in columns has a different meaning than counting them in rows. When
we count the number of NaNs in columns, we are checking on the existence of missing
values in each variable. If they exist, we may need different approaches to adjust the
data for different types of variables. For instance, if they exist in a numeric variable, we
can replace the missing values by zero or by the mean of the variables. If a text variable
contains missing data, we may add a response category such as “no reply” to it. We can
also choose to neglect them if the variable is irrelevant for our analyses of the data.

By counting the NaNs in rows, however, we intend to identify those observations with
missing values in at least one of the variables. Depending on the analyses and the
importance of the observation, we can choose to delete the observation or to apply the
appropriate data adjustments to the affected columns.

Example (Cont’d): From Figure 4.25, we can identify two missing values in
workclass and occupation for the fifth observation. Now we would like to find
out whether there are more missing values in these two and other variables.

SU4-36
ANL252 Data Management

Figure 4.25 Counting the number of missing values in each variable

The output shows that workclass, occupation, and native-country are the
three variables with missing data. Their proportions of missing data are 2,799 (5.7%),
2,809 (5.7%), and 857 (1.8%) out of 48,842 observations, respectively. We need to
further study these observations to conclude on the adjustment to be applied to the
missing values. For this purpose, we shall look for all the observations, or their row
indices, with at least one missing value.

Figure 4.26 Identifying observations with missing values

SU4-37
ANL252 Data Management

The output generated in Figure 4.26 is rather unsatisfactory since we only see the
returned Boolean value from the chained methods .isnull().any(axis = 1).
Furthermore, it is only useful to us if the rows with missing values are selected from
the DataFrame. As a result, we need to filter census with the results above.

Figure 4.27 Selecting observations with missing values from a DataFrame

First, we save the output generated in Figure 4.27 as an object named missrow. After
that, we use a Boolean mask to select only those “True” observations from missrow,
and the .index method will then return the corresponding row indices to us for
selection. Eventually, we apply these row indices to subset census. If we wish to
work on this subset of the DataFrame further, we can also assign it to an object in the
second line of our code.

From Figure 4.27, we note that there are a total of 3620 rows with at least one missing
value in workclass, occupation, and native-country. We can now study the
data of these observations and decide on the adjustment measures subsequently.

SU4-38
ANL252 Data Management

Read

Refer to the link below for more details and examples on the .isnull() method of
the pandas package:

Pandas. (2023). pandas.DataFrame.isnull. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.isnull.html

Refer to the link below for more details and examples on the .sum() method of the
pandas package:

pandas. (2023s). pandas.DataFrame.sum. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.sum.html

Refer to the link below for more details and examples on the .any() method of the
pandas package:

pandas. (2023e). pandas.DataFrame.any. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.any.‌html

Refer to the link below for the index() method of the pandas package:

Pandas. (2023l). pandas.DataFrame.index. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.‌index.html

SU4-39
ANL252 Data Management

4.3 Replacing Missing Values


After checking the existence of missing values in a DataFrame and locating them, we
should decide on how to deal with them. Usually, we can consider deleting the entire
observations, replacing them with other values, or simply ignoring them.

To delete an entire row with missing values from the DataFrame, we have two options:
the .drop() and .dropna() methods.

DataFrame_name.drop(axis = 0, index = [index1, index2, …])

With the .drop() method, we can delete an entire row or column by specifying the
corresponding indices resulting from the localisation methods introduced in Chapter 4.2.
The parameter axis indicates whether rows (0) or columns (1) should be dropped.

The .dropna() method combines the localisation and removal of rows or columns with
missing data in a single function. Its usage is rather convenient since we can omit using
the .isnull().any() and .index() methods before dropping the corresponding
observations or variables.

DataFrame_name.dropna(axis = 0, how = "any"/"all")

The axis parameter in pandas functions or methods should not be unfamiliar to


us anymore. With the how parameter, however, we can instruct Python to drop an
observation with only missing values in all variables (all), or to drop an observation with
at least one missing value in any variable (any).

The drawback of the .dropna() method is the equal treatment for all missing values
throughout the entire dataset. As mentioned in the previous chapters, we have multiple
ways to adjust missing data for different types of variables. Depending on the observed

SU4-40
ANL252 Data Management

values of other variables, we may also want to keep some of the rows with missing data
while deleting others.

Example (Cont’d): Now we would like to remove all the rows with missing values
using the .drop() method.

Figure 4.28 Dropping observations with missing values using .drop() method

For the missrow object, the indices assigned to the index parameter in the .drop()
method are determined by the same syntax used in the index operator for missrow
in the same figure.

We can obtain the same result using the .dropna() function.

SU4-41
ANL252 Data Management

Figure 4.29 Dropping observations with missing values using .dropna() method

Another possibility in dealing with missing values is to replace them with a pre-
defined value. The most common values used for such purpose are 0 or the variable
mean. Some literature also suggests more sophisticated approaches such as interpolation,
extrapolation, or estimation. In Python, the pandas package facilitates replacement of
missing values by the .fillna() method.

DataFrame_name.fillna(value = repl_value)
DataFrame_name["column_label"].fillna(value = repl_value)

Basically, if we apply the .fillna() method on the entire DataFrame, it will replace all
missing values that Python could find with the value specified in the parameter. But if
we specify a column in the DataFrame and attach the .fillna() method to it, only the
missing values found in the corresponding variable will be replaced. By doing this, we
can treat missing data in various variable types differently.

SU4-42
ANL252 Data Management

Example (Cont’d): Suppose we decide to simply replace all missing values in the
DataFrame census by 0.

Figure 4.30 Replacing missing values by 0 in the entire DataFrame

The output shows that all the missing data (NaN) in row 5 are now replaced by 0.
Nevertheless, it looks rather odd to have a value 0 in the variables workclass and
occupation. As a result, instead of replacing them by 0, we would rather replace
them with the string “Unknown”.

SU4-43
ANL252 Data Management

Figure 4.31 Replacing missing values by “Unknown” in specific columns

In the first line, we apply the same chained methods .isnull().any(axis = 0)


on the DataFrame census to detect the existence of missing values in each column. In
the second line, we select only those columns where the returned values from the first
line are True using the .index method. We then apply the .fillna() method to
replace the missing values by “Unknown”, which is the replacement string assigned
to the parameter value.

Read

Refer to the link below for more details and examples on the .drop() method of the
pandas package:

pandas. (2023g). pandas.DataFrame.drop. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.drop.html

Refer to the link below for more details and examples on the .dropna() method of
the pandas package:

SU4-44
ANL252 Data Management

pandas. (2023h). pandas.DataFrame.dropna. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.‌dropna.html

Refer to the link below for more details and examples on the .fillna() method of
the pandas package:

pandas. (2023i). pandas.DataFrame.fillna. NumFOCUS, Inc. https://


pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.fillna.‌html

4.4 Detecting and Removing Outliers


Beside missing data, outliers are data that may cause bias in the estimation of statistical
parameters and hence the goodness of fit of the models. Since biased estimates are
undesirable, it is important to identify them and undertake appropriate adjustments
before conducting any analysis.

Basically, we can use statistics such as the inter-quartile range (IQR) to detect the existence
of outliers in a variable. Furthermore, visualisation like boxplots or histogram can also be
useful to examine the distribution of the variables.

In Chapter 2.2 of Study Unit 3, we learned how to use the sub-package matplotlib.pyplot
to draw a histogram in Python. The boxplot() function from the same sub-package
facilitates the creation of boxplots for outlier detection.

To compute the inter-quartile range, we can use the .quantile() method to determine
the first and third quartiles of the variable.

DataFrame_name["column_label"].quantile(q = quantile)

SU4-45
ANL252 Data Management

With the parameter q, which is a value between 0 and 1, we can define the quantile of the
distribution that the .quantile() method should return to us. Once the 0.25 and 0.75
quantiles of the target variable are obtained, the inter-quartile range iqr can be computed
by iqr = q3 – q1. An observation y is considered as outlier if y < q1 – 1.5 * iqr
or y > q3 + 1.5 * iqr.

The usual practice in dealing with outliers is to remove them from the dataset. In Python, it
suffices to keep observations that do not contain outliers in the target variable. The syntax
below generates a subset of rows that do not fulfil the above outlier condition.

DF[~((DF["Col"] < q1 – 1.5 * iqr) | (DF["Col"] > q3 + 1.5 *


iqr))]

Note that DF represents the DataFrame_Name and Col is the column_label. The
condition left from the bitwise or operator “|” selects all observations with values in
“Col“ smaller than q1 – 1.5 * iqr whereas the condition to the right selects those
observations larger than q3 + 1.5 * iqr. Nevertheless, this would be the combined
condition to select all the outliers. To invert the selection, we need to put the bitwise
not operator “~” before the entire condition, which must then be placed within a pair of
parentheses.

Example (Cont’d): In the following, we will use the aforementioned inter-quartile


range rule to detect outliers in the variable hours-per-week - i.e., observations with
extraordinary high or low number of working hours.

SU4-46
ANL252 Data Management

Figure 4.32 Computing criteria for outlier detection in a numeric variable

Based on the results in Figure 4.32, half of the sample works between 40 and 45 hours
weekly on average. The corresponding upper and lower thresholds to differentiate
outliers from “normal” data are 32.5 and 52.5, respectively.

In the next step, we can select those outlier observations for checking before dropping
them from the DataFrame eventually.

Figure 4.33 Selected outlier observations from a DataFrame

Before dropping those outlier observations from the DataFrame, we shall study them
more carefully. For instance, the observation with row index 48829 works for 60 hours

SU4-47
ANL252 Data Management

per week, which is much higher than the third quartile of the data. But according
to the variable workclass, s/he is self-employed. From this perspective, his/her
average weekly working hours seem sensible. Hence, this observation could be useful
for further analyses.

Nevertheless, to show how the syntax works, we will still drop all the outlier
observations from census that fulfil the above criteria with the following
programme.

Figure 4.34 Selected non-outlier observations from a DataFrame

The construction of the syntax for selecting those non-outliers is rather


straightforward. All we need to do is to place a bitwise not operator “~” in front of
the entire selection condition for the outliers that are now wrapped up in a round
bracket.

SU4-48
ANL252 Data Management

Read

Refer to the link below for more details and examples on the .quantile() method
of the pandas package:

pandas. (2023o). pandas.DataFrame.quantile. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.

‌quantile.html

SU4-49
ANL252 Data Management

Chapter 5: Data Modification

Lesson Recording

Data Modification in pandas

5.1 Sorting Data


The order of the observations in a DataFrame is usually rather arbitrary and random.
It can be a result of the sequence in which the data were collected or recorded, or in
which they were merged. Sometimes, we may want to sort the data according to values of
some variables for better understanding. For instance, we may want to sort an employee
dataset by the rank of the employees in the organisation. In Python, the .sort_values()
method from the pandas package helps us to rearrange the order of the rows in a
DataFrame.

DataFrame_name.sort_values(by = [List_of_var_names],
ascending)

We can provide a list of variable names to the parameter based on which the DataFrame
will be sorted. The sorting hierarchy among these variables drops with the increasing
index in the list. If we set the parameter ascending to True, the values of the variables
given in the parameter will be sorted in the ascending order, and they will be sorted in the
descending order if it is False.

SU4-50
ANL252 Data Management

Example (Cont’d): Suppose we would like to study the relationship between the
individuals’ income, their educational level (which is represented by the numeric
variable educational-num in the DataFrame census), as well as their age. The
value in educational-num increases with the educational level of the individual.
For this purpose, we will sort the DataFrame census first by educational-num
in the descending order and then by age in the ascending order. That is, we will
see observations with the highest educational level first, while observations with the
lowest educational level will appear at the end. And in each educational level, we will
first find the youngest individual; and the oldest individual will be put as last in the
group.

Figure 4.35 Sorting a DataFrames by two variables

In the above syntax, we used two variables for the sorting process. As mentioned, the
sorting hierarchy decreases with the index of the variable name in the list. That is, the
DataFrame will be first sorted by education-num, followed by age. As a result, we
also need a list of two Boolean values to instruct Python on how each of the variables

SU4-51
ANL252 Data Management

should be sorted. Here, we ask Python to first sort education-num in descending


order (False) and then age in the ascending order (True).

Read

Refer to the link below for more details and examples on the sort_values()
function of the pandas package:

pandas. (2023r). pandas.DataFrame.sort_values. NumFOCUS, Inc.

https://pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.sort_‌values.html

5.2 Grouping Data


In data analytics, we often need to group the data by one or more variables and compute
the aggregated statistics of some other variables for each group. To group a DataFrame by
some variables in Python, we can use the .groupby() method of the pandas package.

DataFrame_name.groupby(by = [List_of_Labels]).anymethod()

With the parameter by we can specify a list of column labels, or variable names, based
on which the grouping should be conducted. These variables must be categorical so that
the number of groups is finite and limited. Attached to the .groupby() method can be
any method that we would like to apply on the grouped data. The list of such functions or
methods can be found in Table 3.1 of Study Unit 3 since the NumPy functions or methods
are also applicable to pandas DataFrames.

SU4-52
ANL252 Data Management

Example (Cont’d): Suppose we would like to compute the mean of the number of
working hours, capital gain and capital loss as well as the age for each age group
created in Figure 4.36.

Figure 4.36 Computing the mean of all numeric columns for grouped data

The .mean() method of pandas selects all columns of type integer or float and
compute their means in each age group. As a result, we obtain the group means of
fnlwgt and educational-num as well. To select only the relevant variables for the
group mean calculation, we can subset the census DataFrame first in the above syntax.

Figure 4.37 Computing the mean of selected columns for grouped data

As a result, the average age of the youngest group is 23.4, while those between 39
and 59 is 42.2 and those who are 60 and older is 66.5. Individuals between the age
of 30 and 59 have to work an average of over 43 hours per week while the average
working hours for youngsters and seniors are 36 and 34 hours a week respectively.
Furthermore, seniors at the age of 60 or above have on average the highest capital
gain and loss in comparison to the other two groups.

SU4-53
ANL252 Data Management

Read

Refer to the link below for more details and examples on the functions groupby() of
the pandas package:

pandas. (2023j). pandas.DataFrame.groupby. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.‌groupby.html

SU4-54
ANL252 Data Management

Summary

In this unit, we have seen how Python can be used to manipulate, clean, and query data
using the pandas package. Querying the DataFrame structures can be done in different
ways, such as using the .iloc or .loc attributes for row-based querying or using the
square brackets on the object itself for column-based querying. We also saw that one
can query the DataFrame through Boolean masking. Furthermore, we also came across
situations where we had to use the .append() method and the concat() function to
merge multiple DataFrames with different shapes into one. We then explored how to
detect and replace missing values as well as outliers in a DataFrame. We also discussed
how DataFrames could be modified for further analyses such as sorting and grouping
data.

SU4-55
ANL252 Data Management

Formative Assessment

1. What is not a function/method to display a DataFrame?


a. print()
b. .head()
c. show()
d. display()

2. Which of the following values can be used with .iloc to select rows from a
DataFrame?
a. ["var_label1", "var_label2"]
b. [-4:-1]
c. ["0", "1", "2"]
d. DataFrame_name["var_label1" == "good"]

3. What is the resulting DataFrame according to the following programme?

df = pd.DataFrame([[1, 2]], columns = list('AB'))


df2 = pd.DataFrame([[3, 4]], columns = list('AB'))
df.append(df2)

a.
A B

1 2

3 4

b.
A B

3 4

SU4-56
ANL252 Data Management

1 2

c.
A B

1 3

2 4

d.
A B

3 1

4 2

4. Which columns does dfnew contain, according to the following programme?

df = pd.DataFrame([[1, 2]], columns = list('AB'))


df2 = pd.DataFrame([[3, 4]], columns = list('BC'))
dfnew = pd.concat([df, df2], join = 'outer')

a. A and B
b. B and C
c. Only B
d. A, B, and C

5. Which argument of the pd.read_csv() function is used to declare specific strings


as missing values?
a. na_values
b. na_filter
c. na_drop
d. na_omit

SU4-57
ANL252 Data Management

6. Which method should be used to replace missing values by another value?


a. .drop()
b. .dropna()
c. .nareplace()
d. .fillna()

7. Which variable is the lowest in the sorting hierarchy?

df.sort_values(by = ['Z', 'Y', 'X', 'W'])

a. X
b. W
c. Z
d. Y

8. What is not recommended when applying the .groupby() method on a DataFrame?


a. The grouping variable should be categorical.
b. The grouping variable should be discretised before.
c. The grouping variable should be of type float.
d. The values in the grouping variable can be identical to the row indices.

SU4-58
ANL252 Data Management

Solutions or Suggested Answers

Formative Assessment
1. What is not a function/method to display a DataFrame?

a. print()

Incorrect. We can use the print() function to print out a DataFrame.

b. .head()

Incorrect. We can use the .head() method to print out the first five rows of
a DataFrame.

c. show()

Correct. show() is a function of matplotlib which is used to show the


created graphs.

d. display()

Incorrect. We can use the display() function to print out the first five and
the last five rows of a DataFrame.

2. Which of the following values can be used with .iloc to select rows from a
DataFrame?

a. ["var_label1", "var_label2"]

Incorrect. We can only apply .iloc on row positions.

b. [-4:-1]

Correct. -4:-1 are clearly indices representing the row positions to be


selected.

c. ["0", "1", "2"]

SU4-59
ANL252 Data Management

Incorrect. The values in the index operator for .iloc must be numeric and
not label strings.

d. DataFrame_name["var_label1" == "good"]

Incorrect. The arguments in the index operator for .iloc must be numeric
values and not Boolean expressions.

3. What is the resulting DataFrame according to the following programme?

df = pd.DataFrame([[1, 2]], columns = list('AB'))


df2 = pd.DataFrame([[3, 4]], columns = list('AB'))
df.append(df2)

a.
A B

1 2

3 4

Correct. The .append() method merges two DataFrames by row.

b.
A B

3 4

1 2

Incorrect. The .append() method appends df2 to df1 and not vice versa.

c.
A B

1 3

SU4-60
ANL252 Data Management

2 4

Incorrect. The .append() method does not merge DataFrames by column.


Besides, df and df2 are row DataFrames and not column DataFrames.

d.
A B

3 1

4 2

Incorrect. The .append() method does not merge DataFrames by column.


Besides, df and df2 are row DataFrames and not column DataFrames.
Furthermore, it should be df2 appended to df1 and not df1 to df2.

4. Which columns does dfnew contain, according to the following programme?

df = pd.DataFrame([[1, 2]], columns = list('AB'))


df2 = pd.DataFrame([[3, 4]], columns = list('BC'))
dfnew = pd.concat([df, df2], join = 'outer')

a. A and B

Incorrect. Since it is an outer join, the resulting DataFrame should contain all
available columns across the original DataFrames and not only those from
df.

b. B and C

Incorrect. Since it is an outer join, the resulting DataFrame should contain all
available columns across the original DataFrames and not only those from
df2.

SU4-61
ANL252 Data Management

c. Only B

Incorrect. Since it is an outer join and not inner join, the resulting DataFrame
should contain all available columns across the original DataFrames and not
only the common ones.

d. A, B, and C

Correct. Since it is an outer join, the resulting DataFrame should contain all
available columns across the original DataFrames, which are A, B, and C.

5. Which argument of the pd.read_csv() function is used to declare specific strings


as missing values?

a. na_values

Correct. We can specify a list of strings representing missing values in the


DataFrame with na_values.

b. na_filter

Incorrect. na_filter is used to convert white spaces to missing values.

c. na_drop

Incorrect. There is no na_drop parameter for the pd.read_csv() function.

d. na_omit

Incorrect. There is no na_omit parameter for the pd.read_csv() function.

6. Which method should be used to replace missing values by another value?

a. .drop()

Incorrect. The .drop() method is used to delete an entire row/column from


a DataFrame.

b. .dropna()

SU4-62
ANL252 Data Management

Incorrect. The .dropna() method is used to delete an entire row from a


DataFrame if it contains any missing value.

c. .nareplace()

Incorrect. There is no method called .nareplace() in the pandas package.

d. .fillna()

Correct. With the .fillna() method, we can replace missing values in a


DataFrame by a user-defined value.

7. Which variable is the lowest in the sorting hierarchy?

df.sort_values(by = ['Z', 'Y', 'X', 'W'])

a. X

Incorrect. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W should be the lowest in the hierarchy.

b. W

Correct. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W is the lowest in the sorting hierarchy.

c. Z

Incorrect. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W should be the lowest in the hierarchy.

d. Y

Incorrect. The sorting hierarchy decreases with the increase of the index in
the by list. As a result, variable W should be the lowest in the hierarchy.

8. What is not recommended when applying the .groupby() method on a DataFrame?

SU4-63
ANL252 Data Management

a. The grouping variable should be categorical.

Incorrect. The number of possible values of a grouping variable should be


limited. A categorical variable matches this criterion perfectly.

b. The grouping variable should be discretised before.

Incorrect. If the grouping variable is the result of a discretised numeric


variable, then its number of possible values has been limited, which matches
the criterion of a grouping variable perfectly.

c. The grouping variable should be of type float.

Correct. This is not recommended since the number of possible values in


a grouping variable should be limited. However, the number of possible
values of a float variable is infinite.

d. The values in the grouping variable can be identical to the row indices.

Incorrect. In the case where the row indices are values of a categorical
variable, this categorical variable can perfectly be used as a grouping variable
since its number of possible values is limited.

SU4-64
ANL252 Data Management

References

DataScience Made Simple. (2023). Join in pandas: Merge DataFrames (Inner, Outer, Right,

Left Join) in Pandas Python. https://www.datasciencemadesimple.com/join-

merge-data-frames-pandas-python/

Department of Computer Science. (1996). The adult dataset. The University of Toronto.

Kaggle. (n.d.). Adult income dataset. https://www.cs.toronto.edu/~delve/data/

adult/adultDetail.html

Kaggle Inc. https://kaggle.com/wenruliu/adult-income-dataset

pandas. (2023a). IO tools (text, CSV, HDF5, …). NumFOCUS, Inc. https://

pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table

pandas. (2023b). Merge, join, concatenate and compare. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

pandas. (2023c). pandas.concat. NumFOCUS, Inc. https://pandas.pydata.org/pandas-

docs/stable/reference/api/pandas.concat.html

pandas. (2023d). pandas.cut. NumFOCUS, Inc. https://pandas.pydata.org/pandas-

docs/stable/reference/api/pandas.cut.html

pandas. (2023e). pandas.DataFrame.any. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.any.html

pandas. (2023f). pandas.DataFrame.append. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/version/1.3/reference/api/

pandas.DataFrame.append.html

pandas. (2023g). pandas.DataFrame.drop. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

pandas. (2023h). pandas.DataFrame.dropna. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.dropna.html

SU4-65
ANL252 Data Management

pandas. (2023i). pandas.DataFrame.fillna. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

pandas. (2023j). pandas.DataFrame.groupby. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.groupby.html

pandas. (2023k). pandas.DataFrame.iloc. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

pandas. (2023l). pandas.DataFrame.index. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.index.html

pandas. (2023m). pandas.DataFrame.isnull. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.isnull.html

pandas. (2023n). pandas.DataFrame.loc. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

pandas. (2023o). pandas.DataFrame.quantile. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.quantile.html

pandas. (2023p). pandas.DataFrame.reset_index. NumFOCUS, Inc.

https://pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.reset_index.html

pandas. (2023q). pandas.DataFrame.set_index. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.set_index.html

pandas. (2023r). pandas.DataFrame.sort_values. NumFOCUS, Inc.

https://pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.sort_values.html

SU4-66
ANL252 Data Management

pandas. (2023s). pandas.DataFrame.sum. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.sum.html

pandas. (2023t). pandas.read_csv. NumFOCUS, Inc. https://pandas.pydata.org/pandas-

docs/stable/reference/api/pandas.read_csv.html

SU4-67
ANL252 Data Management

SU4-68
5
Study
Unit

Data Analytics in Python


ANL252 Data Analytics in Python

Learning Outcomes

By the end of this unit, you should be able to:

1. Design Python programmes for performing data analytics


2. Analyse data using appropriate tools for data mining

SU5-2
ANL252 Data Analytics in Python

Overview

In this unit, we will discuss the implementation of two analytics techniques in Python:
k-means clustering and decision trees. All these functions, modules, and algorithms
can be found in the scikit-learn library. The scikit-learn library is a machine-learning
library written for the Python programming language. It features various modules such
as classification, clustering, regression, etc. We will first develop Python programmes to
prepare the DataFrame for the different requirements of these sub-packages. We will then
learn how the two analytics techniques can be carried out by Python programmes and
how their results can be extracted and presented.

SU5-3
ANL252 Data Analytics in Python

Chapter 1: Introduction to Scikit-Learn

1.1 Installing and Importing Scikit-Learn

Lesson Recording

Introduction to Scikit-Learn

It has become very common to carry out tasks for data analytics, statistical modelling,
or machine learning in Python recently. And the trend is rising. In fact, many Python
packages have been developed in these areas. One of the most common libraries used for
these purposes is scikit-learn, a free machine-learning library written for Python. In this
study unit, we will write our code with the scikit-learn functions in JupyterLab.

One reason that scikit-learn has become one of the most common machine-learning
libraries for programming is its broad applicability and functionality. It features various
algorithms for classification, regression, clustering, etc. In machine learning, programmes
are constructed with parameters such that they can “learn” from newly fed data. That
is, they can automatically adjust and improve their behaviour according to the new
“knowledge”. Below is a table of the most common algorithms that are available in scikit-
learn.

Table 5.1 Most common algorithms available in scikit-learn

Supervised learning

Linear Models Gaussian Processes

Discriminant Analysis Cross decomposition

SU5-4
ANL252 Data Analytics in Python

Kernel ridge regression Decision Trees

Support Vector Machines Isotonic regression

Nearest Neighbours Neural network models (supervised)

Unsupervised learning

Gaussian mixture models Novelty and Outlier Detection

Clustering Density Estimation

Covariance estimation Neural network models (unsupervised)

Besides machine learning algorithms, scikit-learn also provides modules for model
selection, visualisation, data transformation, as well as example datasets. The website
https://scikit-learn.org/stable/user_guide.html contains many details of the library.

Same as NumPy, matplotlib, and pandas, we can simply use pip (the package installer of
Python) to download and install scikit-learn.

pip install scikit-learn

After installing scikit-learn using pip, we can import it into our programme.

import sklearn

Note that it is sklearn and not scikit-learn that refers to the scikit-learn library
in the Python programmes. Nevertheless, since the library is extraordinarily extensive,
programmers usually do not import the entire library. Instead, the common practice
is to load the required algorithm or only its “estimator” object. For instance, if linear

SU5-5
ANL252 Data Analytics in Python

regression models are required for the analytics task, we can import the estimator
LinearRegression from the module linear_model.

from sklearn.linear_model import LinearRegression

Since each module has its own estimators, functions, etc., it is important to refer to the
official websites for the correct spelling, including the cases of the names. It is not unusual
that we need to load a couple of them for a single analytics task. It is therefore important
to put sufficient comments in the programme to explain the purpose and use of each
imported module.

In this study unit, we will demonstrate two scikit-learn algorithms – k-means clustering
and decision trees – to show how the library and Python, in general, can be applied in
data analytics. But before we can apply these algorithms, we need to prepare the data
according to the requirements of each of the algorithms. The preparation process will be
discussed in the next section.

Example: We begin to develop the code using the scikit-learn algorithms and to import
the corresponding packages or modules.

Figure 5.1 Importing modules and functions from scikit-learn

SU5-6
ANL252 Data Analytics in Python

In the first box, the packages introduced in the two previous study units, pandas,
NumPy, and matplotlib are imported. We will need them to manage DataFrames,
convert slices of DataFrames to multi-dimensional arrays, and construct plots to
illustrate and evaluate the model results. In the second box, we import modules from
scikit-learn that we need for the pre-processing and transformation of the DataFrames
for model constructions. For instance, with the train_test_split function, we can
instruct Python to split arrays into random training and testing subsets for evaluating
the estimator performance. Furthermore, the module metrics includes functions to
compute metrics and distances for the evaluation of classification performance. The
functions in the preprocessing module such as scaling, centring, normalisation,
etc. are used to prepare DataFrames for the scikit-learn algorithms. In the last two
boxes, we import the modules of k-means clustering (KMeans) and decision trees
(tree). In addition to the KMeans module, we also import the PCA module from the
decomposition sub-package for dimension reduction, which will be helpful to plot
multivariate data as a two-dimensional chart.

Read

Refer to the link below for more details on the installation of the scikit-learn package:

scikit learn. (2023). Installing ski-kit learn. scikit-learn developers. https://scikit-

learn.org/stable/install.html

Refer to the link below for more details and examples on the cluster module of the
scikit-learn package for applying K-Means clustering:

scikit learn. (2023). 2.3 clustering. scikit learn 1.2.2. https://scikit-learn.org/stable/

modules/clustering.html

SU5-7
ANL252 Data Analytics in Python

Refer to the link below for more details and examples on the tree module of the
scikit-learn package for constructing decision trees:

scikit learn. (2023). 1.10 decision trees. scikit learn 1.2.2. https://scikit-learn.org/

stable/modules/tree.html

Refer to the link below for more details and examples on the metric module of the
scikit-learn package for metrics and distance computations:

Scikit learn. (2023). API reference. scikit learn 1.2.2. https://scikit-learn.org/stable/

modules/classes.html?highlight=metrics#module-sklearn.metrics

Refer to the link below for more details and examples on the preprocessing module
of the scikit-learn package for data preparation techniques such as scaling, centring,
normalisation, etc.:

scikit learn. (2023). sklearn.preprocessing: Preprocessing and normalization. scikit

learn 1.2.2. https://scikit-learn.org/stable/modules/classes.html?

highlight=preprocessing#‌module-sklearn.preprocessing

1.2 Data Preparation for Analytics Algorithms

Lesson Recording

Data Preparation for Analytics Algorithms of scikit-learn (1/2)

Data Preparation for Analytics Algorithms of scikit-learn (2/2)

In Figure 5.1, various modules for data preparation have been imported. At the same time,
we have also imported packages that we have already worked with in the previous study
units: NumPy, pandas, matplotlib. In fact, the scikit-learn algorithms work hand-in-hand
with these packages. In SU4, we have already covered some basic data preparation tasks

SU5-8
ANL252 Data Analytics in Python

such as missing data handling. In this section, we will further need to combine NumPy,
pandas and scikit- learn to prepare datasets to meet the requirements of every scikit-learn
module.

1.2.1 Reducing Number of Categories


If a DataFrame contains a large number of categorical variables, the required
computational effort of the analytics algorithm in the fitting process could be significant.
One possible solution here is to reduce the number of categories at the expense of
information loss. It is therefore a task for data analysts to balance this trade-off carefully.

Basically, the process of category reduction is to put observations from similar categories
into a new category. For instance, if the country names are categories of a categorical
variable, we can group them by their continents. If a categorical variable contains the
models of a certain product, we can group them by their brands or their main features.
The similarity of the categories is essential here for not losing too much information.

In terms of programming, all we need to do is to replace some category labels in a variable


by the .replace() method of the pandas package.

DataFrame_Name["column_label"].replace(to_replace, value)

The parameter to_replace can be a list or dictionary of category labels to be replaced


by the list or dictionary of new labels that is assigned to the value parameter.

Example: Suppose we have a pandas DataFrame with one column (i.e., ‘fruits’), as
shown in the first block of code below. In the DataFrame, the category ‘small_apple’
is redundant, which can be converted to the category ‘apple’ with the second block of
code below:

SU5-9
ANL252 Data Analytics in Python

Figure 5.2 Reducing the number of categories for a categorical variable

Read

Refer to the link below for more details and examples on the .replace() method of
the pandas package:

pandas. (2023a). pandas.DataFrame.replace. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.replace.html

1.2.2 Renaming Variables


If a dataset is originated from an external source, the given variable names may not
necessarily reflect the needs and ideas of the analyst. Sometimes, they can be lengthy and
make the result output visually appalling. In pandas, the .rename() method is used to
rename the variables in a DataFrame.

SU5-10
ANL252 Data Analytics in Python

DataFrame_name.rename(columns = {"oldvar": "newvar"})

The column labels to be renamed must be put as keys of a dictionary that will be assigned
to the parameter columns in the .rename() method. The values of the dictionary will
then be the new labels of the corresponding columns.

Example: Suppose we have a pandas DataFrame with one column (i.e., ‘fruits’), as
shown in the first block of code below, we can rename the column as ‘three_fruits’
with the second block of code below:

Figure 5.3 Renaming selected variable

Read

Refer to the link below for more details and examples on the .rename() method of
the pandas package:

SU5-11
ANL252 Data Analytics in Python

pandas. (2023c). pandas.DataFrame.rename. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.rename.html

1.2.3 Creating Dummy Variables


Categorical variables are usually converted to dummy variables before they can be
evaluated and included in the computation of the scikit-learn algorithms. Dummy
variables are binary variables that only have two values: 0 and 1. If an observation belongs
to a certain category, the corresponding dummy variable will be 1; otherwise, it is 0. Since
each category of a categorical variable will be transformed to a dummy variable, the
number of categories has indeed a direct impact on the number of dummy variables in the
final DataFrame used in the algorithm. As a result, it is important to keep the number of
categories at a rather low level. The syntaxes and procedures for category reduction have
been discussed in Chapter 1.2.2 and 1.2.3 of this study unit.

In Python, we can convert categorical variables to dummy variables using the


get_dummies() function of the pandas package.

DataFrame_name["column"] = pd.get_dummies(data, drop_first)

The parameter drop_first is used to instruct Python to take the first category as
the reference level and remove it from the resulting DataFrame. The reason to define a
reference level for categorical variables and to remove it from the DataFrame is to avoid
linear dependence on the data matrix, which causes error in the calculation. The default
setting here is drop_first = False. In this case, all dummy variables will remain in the
resulting DataFrame. Since most of the modules in scikit-learn have their own algorithms
to deal with this issue, we can keep this setting without causing error in the estimation
process.

SU5-12
ANL252 Data Analytics in Python

Example: Suppose we have a pandas DataFrame with one categorical column (i.e.,
fruits), as shown in the first block of code below, we can convert it to a DataFrame
with three dummy variables, namely, apple, banana, and orange:

Figure 5.4 Creating dummy variables from categorical variables

Read

Refer to the link below for more details and examples on the get_dummies()
function of the pandas package:

pandas. (2023e). pandas.get_dummies. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.get_dummies‌.html

1.2.4 Extracting Dependent and Independent Variables


In scikit-learn, the ultimate command of many algorithms to fit a model on a DataFrame is
the .fit() function. The .fit() function usually has two parameters: X and Y, where Y

SU5-13
ANL252 Data Analytics in Python

could be optional for some algorithms. The parameter X is the design matrix that contains
all independent variables, and Y is the vector of the target variable. Both X and Y can be
NumPy arrays or pandas DataFrames. As a result, we need to extract the independent
variables as a matrix and the dependent variable as a vector from the original DataFrame.

The procedure here is rather straightforward. We simply select the column that represents
our dependent variable and save it as a new object.

y = DataFrame_name["target_var"]

Similarly, the matrix of the independent variables can be selected in the same manner.

X = DataFrame_name[["X1", "X2", …]]

Note that it is required to wrap the names of the independent variables in a list (square
brackets) first before putting them in the index operator [].

If the DataFrame only contains the independent variables and the target variable, we can
also simply drop the target variable from the original DataFrame to obtain X.

X = DataFrame_name.drop("target_var")

If the target variable is categorical and has been transformed to various dummy variables,
the names of the dummy variables must be put in a list when passing them to the .drop()
method:

Y_dummyvar = [y1, y2]

Y = DataFrame_name[Y_dummyvar[1]]

X = DataFrame_name.drop(Y_dummyvar, axis=1)

SU5-14
ANL252 Data Analytics in Python

In the first line above, we create Y_dummyvar,which denotes a list of the target dummy
variables y1 and y2. However, we only need one of the dummy variables for modelling
purposes since they are highly correlated. In other words, when y1 is equal to 1, y2 is
equal to 0; and vice versa. In the second line above, the column with the label that matches
the second item in Y_dummyvar will be selected from the DataFrame. Lastly, in the third
line above, we also drop the target variable from the DataFrame to have the independent
variables X.

1.2.5 Other Data Preparation Techniques


There also exists other data preparation techniques with various purposes and application
fields, which are not limited to the content summarised in the table below.

Table 5.2 Other data preparation techniques

Purpose Common treatment Comment

Treat missing values Remove the records with Please refer to SU4 for
missing values; replace the more details.
missing values

Treat outliers Remove the outliers Please refer to SU4 for


more details.

Treat categorical values Create dummy variables; Need to be treated


one-hot encoding in algorithms that can
(Covered in this Chapter) only process numerical
variables (e.g., regression
and clustering). For
example, age categories
of {infant, young, adult,
elderly} are converted into
three dummy variables.

SU5-15
ANL252 Data Analytics in Python

Purpose Common treatment Comment

Young = {0, 1}, Adult = {0,


1}, Elderly = {0, 1}. And
when these three variables
are all zeros, it implies
Infant.

Treat numerical values Discretisation or binning To convert one numeric


variable (e.g., age) to
categorical values of
{infant, young, adult,
elderly}

Normalisation/ Scaling/transforming Convert data into a


standardisation variables standard scale for
easy comparison and
interpretation. Useful
when the data has different
units of measurement or
scales.

1.2.6 Regression
Withstanding that we are able to extract the target variable and independent variables
from Section 1.2.7, now we would like to illustrate more about building simple regression
models on a data frame. Suppose we have a DataFrame for analysis, which contains the
independent variables X and the target variable Y.

The figure below demonstrates a simple two-dimensional linear regression, where X in


the horizontal axis is used to predict the value of Y in the vertical axis:

SU5-16
ANL252 Data Analytics in Python

Figure 5.5 Linear regression (two-dimensional)

where a indicates the coefficient for the independent variable or the slope of the line, b

represents the intercept term, and represents the error term.

The linear regression model is fitted based on the least square method, i.e., to minimise
the sum of squared errors (SSE) over all observations i:

Thereby, the oblique line in Figure 5.5 is also named as the least square regression line.

The performance of the linear regression model has been commonly evaluated with the
R2 value, which can be calculated by:

SU5-17
ANL252 Data Analytics in Python

where the SST stands for the sum of square deviations of the target variable Y from its
own mean. The value of R2 ranges from 0 (worst model performance/goodness of fit) to
1 (worst model performance/goodness of fit).

When there are more than one independent variables, we may further represent the linear
regression model as:

where Y denotes the target/dependent variable; {x1, …, xn} indicates the n


independent variables; b and also represent the intercept term and the error term,
respectively.

The linear regression formulations and R2 method for performance evaluation can be
imported from sklearn package with the following codes:

from sklearn.linear_model import LinearRegression

From here, we are able to create the linear regression object and the fit the linear regression
mode:

lr_object = LinearRegression()
model = lr_object.fit(X, Y)

Lastly, we can evaluate the performance of the linear regression model in terms of R2 with
the ‘.score‘ method:

SU5-18
ANL252 Data Analytics in Python

model.score(X, Y)

Example: Suppose that the number of employees in a certain company is related to


its monthly profit. The number of employees and the monthly profit are summarised
as follows:

# of Employees Profit

1 20000

2 10000

3 30000

4 40000

5 50000

6 70000

We first import the related packages from sklearn:

Figure 5.6 Import packages for linear regression

Then we create the array for analysis, including the dependent variable ‘study_hours’
and the target variable ‘score’. The ‘.reshape()’ method is used to ensure the
independent variable ‘study_hours’ has two dimensions, i.e., (12, 1).

SU5-19
ANL252 Data Analytics in Python

Figure 5.7 NumPy array data preparation

Then we build the linear regression model with the sklearn package:

Figure 5.8 Construct linear regression model

With the linear regression model, we can further calculate the predicted target variable
values (score) for each data point:

Figure 5.9 Calculate the predicted value

Lastly, we are able to plot both the scatter plot of the study hour – test score data points
and the linear regression line in the same figure:

SU5-20
ANL252 Data Analytics in Python

Figure 5.10 Linear regression model plot

Other than linear regression, sklearn also allows for other types of commonly used
regression techniques such as logistic regression, generalised linear models, etc., for
which students may refer to the links at the end of this section for more information.

Read

Refer to the link below for more details and examples on the LinearRegression
function of the linear_model in the scikit-learn package:

scikit learn. (2023g). sklearn.linear_model.LinearRegression. scikit-learn

developers. https://scikit-learn.org/stable/modules/generated/

sklearn.linear_model.LinearRegression.html

Refer to the link below for more details and examples on the
sklearn.linear_model of the scikit-learn package:

SU5-21
ANL252 Data Analytics in Python

scikit learn. (2023c). API reference. scikit-learn developers. https://

scikit-learn.org/stable/modules/classes.html#module-

sklearn.linear_modelsklearn.linear_model

SU5-22
ANL252 Data Analytics in Python

Chapter 2: Clustering

2.1 Introduction of K-Means Clustering

Lesson Recording

Introduction and Fitting of K-Means Clustering by scikit-learn

Clustering is a multivariate analytics technique to group “similar” observations into finite


number of disjoint clusters. One of the most popular clustering algorithms is the K-
Means method. This technique is very efficient in clustering large data sets. The algorithm
here is to split the data into K groups with equal variance by minimising the variation
within the cluster. This variation is called the inertia or within-cluster sum-of-squares.
In other words, “members” in the same cluster should be as “similar” as possible, while
observations from different clusters should be most distinguishable.

Different from some other clustering algorithms in which the number of clusters will only
emerge during the grouping process, the K-Means method requires the number of clusters
to be specified before the algorithm starts. The clusters are characterised by their centroids,
which can be interpreted as the centre of an area in a two-dimensional space and are hence
the average of all the observations within a cluster. As the name of the algorithm suggests,
there should be K different means (centroids) and they should be explored during the
clustering process.

SU5-23
ANL252 Data Analytics in Python

The process of K-Means clustering can be described in five main steps:

1. K observations are randomly selected as initial cluster centroids where K is a pre-


defined positive integer.
2. Compute the distance of each object to the centroids.
3. Based on the distance computed, each object is assigned to the nearest centroid.
Objects assigned to the same centroid form a cluster. There will be K different
clusters.
4. For each cluster, recompute the centroid using the objects assigned to the cluster.
The iteration starts again from Step 2.
5. The iteration stops when the centroids remain unchanged, or a specified number
of iterations has been performed.

Note that the distance mentioned in 2) refers to the Euclidean distance in general. The
Euclidean distance between an object and a cluster centroid is measured by the sum of
the squared differences between the values of some selected clustering criteria, which are
usually some input variables of the object, and the values of the same clustering criteria
of the centroid.

Subsequent to the clustering process, it is important to make sure that the resulting clusters
really create some insights. To interpret the clusters, the characteristics of each cluster
should be explored by looking at the summary statistics (e.g., mean, min, max) of the
clustering criteria. A good clustering solution should allow us to describe the profile of
each cluster clearly.

In addition, there are objective measures for evaluating the quality of clustering solutions:
cohesion, separation, and parsimony. Cohesion measures the similarity of the objects in
a cluster. This value should be small because these objects should be similar. Separation,
on the other hand, measures how dissimilar the clusters are, and this value should
be high. Here, we can apply the Silhouette coefficient since it combines both cohesion
and separation. Briefly speaking, the Silhouette coefficient is a value between -1 and 1
that measures the relationship between the intra-cluster distances and nearest cluster
distances. The mean of the individual Silhouette coefficients will be computed for every

SU5-24
ANL252 Data Analytics in Python

clustering solution for evaluation. A high and positive average Silhouette coefficient
suggests appropriate and useful clustering solution. On the contrary, negative Silhouette
coefficient indicates a rather undesirable clustering result.

Furthermore, parsimony is another important criterion in clustering. As a result, we


prefer smaller number of clusters if the quality of the corresponding clustering solution is
satisfactory. Nevertheless, the number of clustering criteria should also be parsimonious,
so that the clustering solution can be interpreted conveniently.

2.2 Fitting K-Means Clustering by Scikit-Learn


In scikit-learn, all algorithms are controlled and executed by the so-called estimator.
We can adjust our parameters for the modelling process in the syntax of the estimator
declaration. In K-Means Clustering, the estimator is called KMeans. And it can be
imported from the cluster module of the sklearn package.

from sklearn.cluster import KMeans

First, we need to initiate the KMeans estimator and adjust the estimation parameters
according to our needs.

km_Object = sklearn.cluster.KMeans(n_clusters = 8, init =


"k-means++", n_init = 10, max_iter = 300,
tol = 0.0001, precompute_distances = "auto",
random_state = None)

The following table provides description and explanation of the parameters.

SU5-25
ANL252 Data Analytics in Python

Table 5.3 Parameters of the KMeans estimator

Parameter Value Type Description

n_clusters integer The number of clusters to form as well as


(Default: 8) the number of centroids to generate.

init (Default: "k-means++", Method for initialisation:


"k-means+ "random", callable,
"k-means++": selects initial cluster
+") array-like of shape
centres for k-mean clustering in a smart
way to speed up convergence.

"random": choose n_clusters


observations at random from the data as
initial centroids.

Array: An array with number of rows


equal to the n_clusters and number of
columns equal to the number of variables
that give the initial centroids.

Callable: It should take arguments X,


n_clusters and a random state and
return an initialisation.

n_init integer Number of times the k-means algorithm


(Default: 10) will be run with different centroid seeds.
The final results will be the best output
of n_init consecutive runs in terms of
inertia.

max_iter integer Maximum number of iterations of the k-


(Default: 300) means algorithm for a single run.

SU5-26
ANL252 Data Analytics in Python

Parameter Value Type Description

tol float Relative tolerance of the difference in


(Default: 1e-4) the cluster centres of two consecutive
iterations to declare convergence.

"auto", True, False Pre-compute distances (faster but takes


precompute_
distances more memory).

(Default: "auto": do not pre-compute distances


"auto") if n_samples * n_clusters > 12
million. This corresponds to about 100MB
overhead per job using double precision.

True: always pre-compute distances.

False: never pre-compute distances.

random_state integer, Determines random number generation


(Default: None) for centroid initialisation. Use an integer
RandomState instance,
to make the randomness deterministic.
None

(Source: scikit learn, 2023, https://scikit-learn.org/stable/modules/generated/


sklearn.cluster.KMeans.html)

Here are some explanations regarding some items mentioned in Table 5.3:

• A callable is a programme part that can be called such as a user-defined function


or a built-in method.
• Centroid seeds refer to the initial cluster centres.
• Inertia measures the within-cluster sum-of-squares and should be minimised in the
clustering process.

Next, we can apply the KMeans estimator on a prepared DataFrame.

SU5-27
ANL252 Data Analytics in Python

km_fit_Object = km_Object.fit(X, sample_weight = None)

The parameter X is a prepared DataFrame based on which the clusters are constructed.
With sample_weight,we can pre-specify the weights for each observation in X. If it is
set to None, which is also the default here, all observations will be assigned equal weight.
The fitted estimator of the K-Means algorithm is saved in km_fit_Object.

To obtain and view the clusters of each data point, we can use the following syntax:

print(km_Object.labels_)

Example (Cont’d): Here we give an example on the popular iris dataset in machine
learning which contains 150 records of iris flowers from three distinct species (setosa,
versicolor, and virginica). The dataset comprises four attributes, namely, sepal length,
sepal width, petal length, and petal width. We start by loading the dataset from
scikit-learn, and the related package for plotting. Since K-Means clustering is an
unsupervised machine learning algorithm, where we do not need a target variable for
clustering purpose.

Figure 5.11 Import packages for KMeans clustering

Then we load the iris dataset into our Python programme:

Figure 5.12 Load iris dataset

SU5-28
ANL252 Data Analytics in Python

Next, we specify the number of clusters (3), initialise the KMeans estimator and fit the
KMeans clustering model.

Figure 5.13 Construct KMeans clustering model

Since we already have the KMeans clustering model now, we can obtain the clusters
with kmeans.labels_.

Figure 5.14 Obtain the clusters

Then we can visualise the clusters with the matplotlib package regarding the first two
features in the iris dataset, as shown below:

Figure 5.15 KMeans cluster plot

Read

Refer to the link below for more details and examples on the KMeans estimator and
fit()functions of the cluster module in the scikit-learn package:

SU5-29
ANL252 Data Analytics in Python

scikit learn. (2023e). sklearn.cluster.KMeans. scikit-learn developers. https://scikit-

learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

SU5-30
ANL252 Data Analytics in Python

Chapter 3: Decision Trees

3.1 Introduction to Decision Trees

Lesson Recording

Introduction and Fitting of Decision Trees by scikit-learn

Decision trees are among the most common data mining methods which split a set (or
subset) of observations to reach certain decision points based on some criteria eventually.
Each decision point in the tree is also called a node and represents a subset of the sample
based on which the decision tree is created. Nodes that are split from a superordinate node
are called the child node while the origin node is called the parent node. A child node
with no further subdivisions or splitting is called a leaf node.

Since each observation in the sample will be assigned to one of the nodes eventually,
decision tree is a classification technique to separate a sample into multiple classes. The
decision tree algorithm predicts the individual classification based on the values of some
input variables and calculates the predicted value of the target variable at the same time.
These rules of decision form the resulting model which can then be illustrated by a tree-like
structure graphically. This structure convenes the interpretation of the modelling result.

We can also use the decision tree model to understand the relationship between the target
variable and the input variable. In fact, the decision tree handles complex relationships
such as non-linearity and interaction rather well. Note that not all input variables are of
the same importance in the classification process. Their hierarchy in the decision rules
reflects in the decision tree in which input variables appear higher up are more important.

As mentioned, nodes without further splitting are called the leaf nodes. Their value is
the prediction of the target variable for those observations classified in the corresponding

SU5-31
ANL252 Data Analytics in Python

nodes. If the target variable is categorical, the value of the leaf node will be the mode, the
most frequent class. And it will be the mean of the data in that node if the decision tree is
predicting a numeric variable.

Over the years, there have been many algorithms for splitting the nodes and hence the
construction of the tree developed, proposed, and implemented in software packages.
The most common ones are CHAID (chi-square automatic interaction detection), C5.0
(a proprietary algorithm) and CART (classification and regression tree). In Python, the
estimator DecisionTreeClassifier of the scikit-learn package uses an optimised
version of the CART algorithm.

As the name suggests, the CART algorithm can create both classification trees and
regression trees. While regression trees estimate the values of a continuous target variable,
classification trees predict the outcome of a categorical target variable. In other words,
CART is applicable on almost every type of output variable.

In CART, every parent node only has two child nodes. That is, CART will only split the
tree into two sub-samples at every decision point. And the calculation of the split is based
on the input variables that are also used to predict the target output. As the split process
advances, the sample will be divided into more and more, smaller and smaller subsets.
These subsets will also become more and more homogeneous. The whole splitting process
will be terminated once certain stopping criteria are fulfilled.

The homogeneity of each subset reflects the split quality from a parent node to its child
nodes. In classification tree, the homogeneity is measured by Gini and entropy. Roughly
speaking, both Gini and entropy measure the impurity of a node, but with different
theoretical background. By comparing the impurity decrease across all possible splits in
all input variables, the split with the highest reduction of impurity will be chosen. In
scikit-learn, both Gini and entropy are options of the criterion parameter in the
DecisionTreeClassifier.

In a regression tree, the impurity is measured by the sum of squared error (SSE). The SSE
is the total deviance of each observation from the sample mean. For each potential split,

SU5-32
ANL252 Data Analytics in Python

CART computes the SSE for each child node and the split with the lowest sum of SSE
across all child nodes will be chosen. Since the mean of the target variable of the leaf node
is equal to the predicted value of the target variable, splits with low SSE have child nodes
that contain data whose target values are close to the mean value.

One possibility to stop the CART algorithm is when the impurity improvement of a new
split drops below a certain pre-defined threshold. For instance, if a node has reached a
rather low Gini or SSE respectively, meaning that the parent node itself is already rather
homogeneous, another split from the node would only create homogeneous child nodes.
In this case, this split is not necessary since it does not decrease the impurity significantly.

Nevertheless, choosing the right thresholds to stop the split algorithm is not as
straightforward as it seems. If the thresholds are high, the resulting tree could be
oversimplified as splits become more difficult. Low thresholds, on the other hand, could
lead to over-complicated trees that are difficult to interpret and deploy.

Another possibility to stop the algorithm is when the tree has attained a pre-specified
depth. The depth of the tree refers to the number of splits in it. This method can simply
control the size of the tree without oversimplifying or overcomplicating it. One last
stop criterion is to set a lower bound of observations in the nodes. Once the number
of observations in all nodes has reached the bound, a new split from any node would
only create child nodes that contain less observations than the lower bound allows. As a
result, the lower bound blocks the algorithm from carrying out another split and the entire
process ends.

One way to evaluate the performance of a decision tree is to examine the precision of its
prediction. In other words, the predicted values of the target variable will be compared
with the observed data. For classification trees, we can use the confusion matrix in which
the correct and incorrect classifications are summarised. The larger the proportion of
observations for which the predicted and observed classifications are identical, the more
accurate is the decision tree model. For regression trees, the Root-Mean-Square-Error
(RMSE) is usually used to measure the prediction accuracy of the model. Basically, the
RMSE is kind of an average deviance of all the predicted values from their observed

SU5-33
ANL252 Data Analytics in Python

counterparts. The lower such a deviance is, the closer are the predictions to the actual
values, and the better is the model.

3.2 Fitting Decision Trees


Same as K-Means clustering, scikit-learn also has an estimator for decision trees model
which is integrated in the tree module. In order to be able to use all possible functions in
the module, the sklearn.tree module shall be imported in the first place.

from sklearn import tree

If we want to apply a classification tree in Python, we must initiate a


DecisionTreeClassifier estimator object. In contrast, if the data should be fitted by
a regression tree, a DecisionTreeRegressor estimator object should be declared.

tree_Object = sklearn.tree.DecisionTreeClassifier(criterion
= "gini", splitter = "best", max_depth = None,
min_samples_split = 2, min_samples_leaf = 1,
min_weight_fraction_leaf = 0.0, max_features =
None, random_state = None, max_leaf_nodes =
None, min_impurity_decrease = 0.0,
min_impurity_split = None, class_weight = None)

tree_Object = sklearn.tree.DecisionTreeRegressor(criterion
= "mse", splitter = "best", max_depth = None,
min_samples_split = 2, min_samples_leaf = 1,
min_weight_fraction_leaf = 0.0, max_features =
None, random_state = None, max_leaf_nodes =
None, min_impurity_decrease = 0.0,
min_impurity_split = None)

SU5-34
ANL252 Data Analytics in Python

From the above syntaxes, we can see the main differences between the estimators
DecisionTreeClassifier and DecisionTreeRegressor are the values of the
parameter criterion and the availability of the parameter class_weight. The
following table provides description and explanation of the parameters.

Table 5.4 Parameters of the DecisionTreeClassifier and DecisionTreeRegressor estimators

Parameter Value Type Description

criterion Classification: "gini", The function to measure


(Default: "entropy" the quality of a split.
"gini" For classification trees, the
Regression:"mse",
for classification, "mse" supported criteria are "gini"
"friedman_mse",
for regression) for the Gini impurity and
"mae", "poisson"
"entropy" for the information
gain.

For regression trees, the


supported criteria are "mse" for
the mean squared error which
is equal to variance reduction
as feature selection criterion,
"friedman_mse", which uses
mean squared error with
Friedman’s improvement score
for potential splits, "mae"
for the mean absolute error,
and "poisson" which uses

SU5-35
ANL252 Data Analytics in Python

Parameter Value Type Description

reduction in Poisson deviance to


find splits.

splitter "best", "random" The strategy used to choose the


(Default: "best") split at each node. Supported
strategies are "best" to choose
the best split and "random" to
choose the best random split.

max_depth integer The maximum depth of the


(Default: None) tree. If None, then nodes are
expanded until all leaves are
pure or until all leaves contain
less than min_samples_split
samples.

min_samples_split Integer or float The minimum number of


(Default: 2) samples required to split an
internal node:

If integer, consider
min_samples_split as the
minimum number.

If float, min_samples_split
is a fraction and
ceil(min_samples_split
* n_samples) are the
minimum number of samples
for each split.

SU5-36
ANL252 Data Analytics in Python

Parameter Value Type Description

min_samples_leaf Integer or float The minimum number of


(Default: 1) samples required to be at
a leaf node. A split point
at any depth will only be
considered if it leaves at least
min_samples_leaf training
samples in each of the left
and right branches. This may
have the effect of smoothing the
model, especially in regression.

If integer, consider
min_samples_leaf as the
minimum number.

If float, min_samples_leaf
is a fraction and
ceil(min_samples_leaf *
n_samples) are the minimum
number of samples for each
node.

float The minimum weighted fraction


min_weight_
fraction_leaf of the sum of weights (of all the
input samples) required to be at
(Default: 0.0)
a leaf node. Samples have equal
weight when sample_weight
is not provided.

SU5-37
ANL252 Data Analytics in Python

Parameter Value Type Description

max_features Integer, float, "auto", The number of features to


(Default: None) "sqrt", "log2" consider when looking for the
best split:

If integer, consider
max_features features at each
split.

If float, max_features
is a fraction and
int(max_features *
n_features) features are
considered at each split.

If "auto", max_features =
sqrt(n_features)

If "sqrt",max_features =
sqrt(n_features)

If "log2",max_features =
log2(n_features).

If None,max_features =
n_features.

Note: the search for a split does


not stop until at least one valid
partition of the node samples
is found, even if it requires
to effectively inspect more than
max_features features.

SU5-38
ANL252 Data Analytics in Python

Parameter Value Type Description

random_state integer, Controls the randomness of


(Default: None) the estimator. The features
RandomState instance,
are always randomly permuted
None
at each split even if
splitter is set to "best".
When max_features <
n_features, the algorithm
will select max_features at
random at each split before
finding the best split among
them. But the best-found split
may vary across different runs,
even if max_features =
n_features. That is the case,
if the improvement of the
criterion is identical for several
splits and one split has to be
selected at random. To obtain a
deterministic behaviour during
fitting, random_state has to be
fixed to an integer.

max_leaf_nodes integer Grow a tree with


(Default: None) max_leaf_nodes in best-first
fashion. Best nodes are defined
as relative reduction in impurity.

SU5-39
ANL252 Data Analytics in Python

Parameter Value Type Description

If None then there will be an


unlimited number of leaf nodes.

float A node will be split if this


min_impurity_
decrease split induces a decrease of the
impurity greater than or equal to
(Default: 0.0)
this value.

min_impurity_split float Threshold for early stopping in


(Default: 0) tree growth. A node will split
if its impurity is above the
threshold; otherwise, it is a leaf.

class_weight dictionary, list of Weights associated with classes


(Default: None) dictionaries, in the form {class_label:
weight}. If None, all classes
Note: Only available for "balanced"
are supposed to have weight
classification trees.
one. For multi-output, a list of
dictionaries can be provided in
the same order as the columns of
y.

Note that for multi-output,


weights should be defined for
each class of every column in its
own dictionary.

The "balanced" mode uses


the values of y to
automatically adjust weights

SU5-40
ANL252 Data Analytics in Python

Parameter Value Type Description

inversely proportional to class


frequencies in the input data.

For multi-output, the weights


of each column of y will be
multiplied.

(Source: scikit learn, 2023, https://scikit-learn.org/stable/modules/generated/


sklearn.tree.DecisionTreeClassifier.html, https://scikit-learn.org/stable/modules/generated/
sklearn.tree.DecisionTreeRegressor.html)

Next, we can apply the DecisionTreeClassifier and DecisionTreeRegressor


estimators on the prepared training DataFrames with the input variables X and the target
variable Y.

tree_fit_Object = tree_Object.fit(X, Y,
sample_weight = None)

For both the DecisionTreeClassifier and DecisionTreeRegressor estimators,


scikit-learn also facilitates the parameter sample_weight to specify individual weights
in X. If it is set to None (i.e., the default value), all observations will be assigned equal
weight. In addition, sample_weight will be multiplied with class_weight if it is
specified in the DecisionTreeClassifier estimator for classification trees.

Read

Refer to the link below for more details and examples on the
DecisionTreeClassifier estimator of the tree module in the scikit-learn
package:

SU5-41
ANL252 Data Analytics in Python

scikit learn. (2023m). sklearn.tree.DecisionTreeClassifier. scikit-learn

developers. https://scikit-learn.org/stable/modules/generated/

sklearn.tree.DecisionTreeClassifier.‌html

Refer to the link below for more details and examples on the
DecisionTreeRegressor estimator of the tree module in the scikit-learn
package:

scikit learn. (2023n). sklearn.tree.DecisionTreeRegressor. scikit-learn

developers. https://scikit-learn.org/stable/modules/generated/

sklearn.tree.DecisionTreeRegressor.‌html

3.3 Model Evaluation

Lesson Recording

Evaluate Decision Trees by scikit-learn

The most important tool to understand and evaluate a decision tree is the plot of the tree
itself. The tree module of the scikit-learn package provides the plot_tree() function to
generate such a graph conveniently.

tree.plot_tree(decision_tree, max_depth = None,


feature_names = None, class_names
= None, label = "all", filled =
False, impurity = True, node_ids =
False, proportion = False, rounded =
False, precision = 3, ax = None,
fontsize = None)

The following table provides description and explanation of the parameters.

SU5-42
ANL252 Data Analytics in Python

Table 5.5 Parameters of the plot_tree function

Parameter Value Type Description

decision_tree An decision tree The decision tree to be plotted.


(No default value) regressor or classifier
object

max_depth integer The maximum depth of the


(Default: None) representation. If None, the tree
is fully generated.

feature_names List of strings Names of each of the features


(Default: None) (variables). If None, generic
names will be used ("X[0]",
"X[1]", …).

class_names List of strings or Names of each of the target


(Default: None) Boolean classes in ascending numerical
order. Only relevant for
classification and not supported
for multi-output. If True, shows
a symbolic representation of the
class name.

label "all", "root", Whether to show informative


(Default: "all") "none" labels for impurity, etc. Options
include "all" to show at every
node, "root" to show only at
the top root node, or "none" to
not show at any node.

SU5-43
ANL252 Data Analytics in Python

Parameter Value Type Description

filled Boolean When set to True, paint


(Default: False) nodes to indicate majority class
for classification, extremity of
values for regression, or purity of
node for multi-output.

impurity Boolean When set to True, show the


(Default: True) impurity at each node.

node_ids Boolean When set to True, show the ID


(Default: False) number on each node.

proportion Boolean When set to True, change


(Default: False) the display of ”values” and/or
“samples” to be proportions and
percentages respectively.

rounded Boolean When set to True, draw node


(Default: False) boxes with rounded corners and
use Helvetica fonts instead of
Times-Roman.

precision integer Number of digits of precision for


(Default: 3) floating point in the values of
impurity, threshold, and value
attributes of each node.

SU5-44
ANL252 Data Analytics in Python

Parameter Value Type Description

ax matplotlib axis Axes to plot to. If None,


(Default: None) use current axis. Any previous
content is cleared.

fontsize integer Size of text font. If None,


(Default: None) determined automatically to fit
figure.

(Source: scikit learn, 2023, https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)

The plot_tree() function can also be combined with the matplotlib options such as the
plot size, the borderline settings, etc.

Example (Iris): Here we also give an example on the popular iris dataset and start by
loading the dataset from scikit-learn, and the related package for plotting. Y and X
denote the target variables and independent variables, respectively.

Figure 5.16 Load packages and prepare dataset for tree plot

We then build the decision tree with the tree.DecisionTreefClassifier(),


and fit the model. Finally, we plot the decision tree with tree.plot_tree()
method.

SU5-45
ANL252 Data Analytics in Python

Figure 5.17 Fit and plot decision tree model

For the parameter feature_names, we can assign the array


DataFrame.feature_names since it requires the labels of all the input variables,
including the dummy variables resulting from the categorical variables. The
parameter class_name requires the labels of the categories in the target variable,
which are now stored in the array DataFrame.target_names. We set the
fontsize parameter to 3 and omit the impurity statistics (impurity = False) in
each node so that the displayed information is still complete and readable. To visually
distinguish the nodes, we instruct Python to fill the node boxes with colours (filled
= True) while the corners do not need to be rounded (rounded = False). To
keep the decision tree chart within a certain size, we limit the depth of the tree to a
maximum of 2 levels (max_depth = 2).

SU5-46
ANL252 Data Analytics in Python

Figure 5.18 Decision tree plot

From the figure above, we can see that the first split of the tree is generated by the
feature “petal width”. The “petal width <=0.8” sub-samples will be assigned to the
child node on the left and those “petal width > 0.8” observations to the child node
on the right. In the next step, the node on the right-hand side is further split by the
“petal width <=1.75”. The split of the tree will then continue until certain stop criteria
are fulfilled. Since we limit the tree depth down to the second split level in our chart,
nodes of the further split levels will only be displayed in grey boxes with no proper
information in it.

Read

Refer to the link below for more details and examples on the plot_tree() functions
of the tree module in the scikit-learn package:

SU5-47
ANL252 Data Analytics in Python

scikit learn. (2023o). sklearn.tree.plot_tree. scikit-learn developers. https://scikit-

learn.org/stable/modules/generated/sklearn.tree.plot_tree.html

SU5-48
ANL252 Data Analytics in Python

Summary

In this unit, we have seen how Python can be used to carry out analytics tasks based on two
techniques: k-means clustering and decision trees. One of the most common packages in
Python for data analytics and machine learning algorithms is scikit-learn. In scikit-learn,
the analytics algorithm is called an estimator, and its parameters need to be calibrated
before the fitting process is carried out. Once the calibration step is completed, we can
apply the model on prepared DataFrames.

However, the available DataFrames and their contents are usually not in the format and
shape that the scikit-learn algorithms require in the first place. Therefore, we need to
prepare the dataset accordingly. In this unit, we have learned how to remove missing data,
reduce categories, discretise numeric variables, select and rename variables, transform
data, partition data into training and testing datasets, and extracting dependent and
independent variables from the original DataFrame. Subsequently, the application of K-
Means clustering and decision trees have been demonstrated.

SU5-49
ANL252 Data Analytics in Python

Formative Assessment

1. What is an estimator in scikit-learn?


a. It is the estimation algorithm of a data analytics/machine-learning module.
b. It is the function to fit a model.
c. It is a parameter to calibrate the model estimation.
d. It is a module of the scikit-learn package.

2. Which function or method can be used to reduce the number of categories in a


categorical variable?
a. .get_dummies()
b. normalize()
c. .deletecat()
d. .replace()

3. Whenever random numbers have to be drawn in scikit-learn, there is a parameter


named random_state included in the function. What does random_state actually
control?
a. It draws a random number from an interval [-1, 1].
b. It creates a variable in your DataFrame to store all random numbers that have
been drawn since the first run of the programme.
c. It controls the probability distribution of the random numbers.
d. It draws the same “random numbers” in every run to make the results
reproducible.

4. What is not a potential stop criterion in a decision tree construction process?


a. The tree depth
b. The number observations in the nodes
c. The impurity improvement

SU5-50
ANL252 Data Analytics in Python

d. The accuracy score

SU5-51
ANL252 Data Analytics in Python

Solutions or Suggested Answers

Formative Assessment
1. What is an estimator in scikit-learn?

a. It is the estimation algorithm of a data analytics/machine-learning module.

Correct. The estimator in scikit-learn contains the entire algorithm of an


analytics module.

b. It is the function to fit a model.

Incorrect. We need to apply the .fit() function on the estimator object to


fit a model.

c. It is a parameter to calibrate the model estimation.

Incorrect. The estimator is the entire algorithm and not just the parameters
in it.

d. It is a module of the scikit-learn package.

Incorrect. An estimator is part of a module of the scikit-learn package.

2. Which function or method can be used to reduce the number of categories in a


categorical variable?

a. .get_dummies()

Incorrect. The .get_dummies() method creates dummy variables from a


categorical variable.

b. normalize()

Incorrect. The .normalize() function normalises numeric variable.

c. .deletecat()

SU5-52
ANL252 Data Analytics in Python

Incorrect. There is no method called .deletecat() in the scikit-learn


package.

d. .replace()

Correct. With the .replace() method, we replace the labels of various


categories by one label, and the number of categories is hence reduced.

3. Whenever random numbers have to be drawn in scikit-learn, there is a parameter


named random_state included in the function. What does random_state actually
control?

a. It draws a random number from an interval [-1, 1].

Incorrect. The random_state parameter does not control the range from
which the random numbers are drawn.

b. It creates a variable in your DataFrame to store all random numbers that have
been drawn since the first run of the programme.

Incorrect. There is no storage or record of the random numbers drawn in


Python.

c. It controls the probability distribution of the random numbers.

Incorrect. The random_state parameter does not control the underlying


distribution of the random numbers at all.

d. It draws the same “random numbers” in every run to make the results
reproducible.

Correct. If we assign an integer to random_state, Python will produce


the same set of “random numbers” in every run. Different integers result
in different sets of “random numbers”.

4. What is not a potential stop criterion in a decision tree construction process?

SU5-53
ANL252 Data Analytics in Python

a. The tree depth

Incorrect. We can indeed stop the split of a decision tree if the tree has reached
a certain depth.

b. The number observations in the nodes

Incorrect. If the sub-sample in any child node is smaller than the minimum
required number of observations after a new split, the estimator will
terminate the tree construction before the split.

c. The impurity improvement

Incorrect. The impurity improvement is actually an important stop criterion


of a decision tree construction process. If new splits of a decision tree do not
improve its impurity, the algorithm will stop further splitting of nodes.

d. The accuracy score

Correct. The accuracy score is an indicator for the predictive performance


of a decision tree and not a stop criterion.

SU5-54
ANL252 Data Analytics in Python

References

pandas. (2023a). pandas.DataFrame.add_suffix. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.add_suffix.html

pandas. (2023b). pandas.DataFrame.astype. NumFOCUS, Inc. https://pandas.pydata.org/


pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

pandas. (2023c). pandas.DataFrame.rename. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.rename.html

pandas. (2023d). pandas.DataFrame.replace. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.replace.html

pandas. (2023e). pandas.get_dummies. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.get_dummies.html

scikit learn. (2023a). 2.3 clustering. scikit-learn developers. https://scikit-learn.org/

stable/modules/clustering.html

scikit learn. (2023b). 1.10 decision trees. scikit-learn developers. https://scikit-learn.org/

stable/modules/tree.html

scikit learn. (2023c). API reference. scikit-learn developers. https://scikit-learn.org/

stable/modules/classes.html#module-sklearn.linear_modelsklearn.linear_model

scikit learn. (2023d). Installing scikit-learn. scikit-learn developers. https://scikit-

learn.org/stable/install.html

scikit learn. (2023e). sklearn.cluster.KMeans. scikit-learn developers. https://scikit-

learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

scikit learn. (2023f). sklearn.decomposition.PCA. scikit-learn developers. https://scikit-

learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

SU5-55
ANL252 Data Analytics in Python

scikit learn. (2023g). sklearn.linear_model.LinearRegression. scikit-learn

developers. https://scikit-learn.org/stable/modules/generated/

sklearn.linear_model.LinearRegression.html

scikit learn. (2023h). sklearn.linear_model. scikit-learn developers. https://scikit-

learn.org/stable/modules/classes.html#module-sklearn.linear_model

scikit learn. (2023i). sklearn.metrics.accuracy_score. scikit-learn developers. https://scikit-

learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

scikit learn. (2023j). sklearn.preprocessing.normalize. scikit-learn developers. https://scikit-

learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html

scikit learn. (2023k). sklearn.preprocessing: Preprocessing and normalization. scikit-

learn developers. https://scikit-learn.org/stable/modules/classes.html?

highlight=preprocessing#module-sklearn.preprocessing

scikit learn. (2023l). sklearn.preprocessing.StandardScaler. scikit learn. https://scikit-

learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

scikit learn. (2023m). sklearn.tree.DecisionTreeClassifier. scikit-learn

developers. https://scikit-learn.org/stable/modules/generated/

sklearn.tree.DecisionTreeClassifier.html

scikit learn. (2023n). sklearn.tree.DecisionTreeRegressor. scikit-learn

developers. https://scikit-learn.org/stable/modules/generated/

sklearn.tree.DecisionTreeRegressor.html

scikit learn. (2023o). sklearn.tree.plot_tree. scikit-learn developers. https://scikit-

learn.org/stable/modules/generated/sklearn.tree.plot_tree.html

scikit learn. (2023p). User guide. scikit-learn developers. https://scikit-learn.org/stable/

user_guide.html

SU5-56
6
Study
Unit

Basic SQL in Python


ANL252 Basic SQL in Python

Learning Outcomes

By the end of this unit, you should be able to:

1. Explain the operations on tables in databases


2. Design Python programmes for data retrieval

SU6-2
ANL252 Basic SQL in Python

Overview

In this unit, we will discuss how to use Python to execute SQL commands for database
management. We will first introduce SQLite3, a standard Python package that allows
programmers to use the Python environment to send SQL statements to the database. We
will also discuss how we can generate SQL statements using Python programmes for data
query and to present their outputs as a pandas DataFrame.

SU6-3
ANL252 Basic SQL in Python

Chapter 1: Introduction to SQL and SQLite3

Lesson Recording

Introduction to SQL

1.1 Introduction to SQL


SQL (Structured Query Language) is the most common and popular programming
language designed for database management. There have been various versions of
SQL including procedural extensions released such as PSQL, T-SQL, SQL/PSM, etc.
Due to the large amount of internal and external data available nowadays, managing,
maintaining, and updating database have become compulsory for many organisations.
And the demand for SQL specialists has increased tremendously.

Many analytics software have integrated SQL as part of their tools. In Python, the package
sqlite3 provides the possibility to embed SQL codes into Python programmes to facilitate
connections to databases and to query the data in them. Since sqlite3 belongs to the built-
in packages of Python, no installation using pip is required.

Same as scikit-learn, the analytics package introduced in the previous study unit, sqlite3
works hand-in-hand with the pandas packages since both of them are designed for data
management. As a result, we can convert the output table of SQL queries to pandas
DataFrames and vice versa anytime.

SU6-4
ANL252 Basic SQL in Python

Example (Students’ score, cont’d): In this study unit, we will return to our students’
examination score example from Study Units 1 & 2 and manage a database with a table
containing the personal information of the students and another table in which their
scores of two examinations are stored. Before we can create a database and the tables
in it, we need to import the packages sqlite3 and pandas into our Python programme
first.

Figure 6.1 Importing sqlite3 and pandas

Unlike other packages, the commands in sqlite3 must be executed through a cursor
object after the connection between Python and the database has been initiated. As a
result, we do not necessarily need an alias to abbreviate sqlite3 for more convenience
when writing our programme.

Read

Refer to the link below for more details on the sqlite3 package of Python:

Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite

databases. https://docs.python.org/3/library/sqlite3.html

1.2 Creating Data Files in Python


Before creating a database and its tables by SQL respectively, we need to know the type
of files that data entered by users at runtime of a Python programme can be stored –
e.g., comma separated .csv text files. The advantage of storing data in an external file in
comparison to a pandas DataFrame is that almost every software such as Excel, SPSS, SAS,

SU6-5
ANL252 Basic SQL in Python

etc. has a module to convert text files into its own data file format. In other words, text
files are highly compatible and therefore a good medium of data storage. Furthermore,
their file sizes are usually comparatively small since they only contain the data and the
delimiters, and no other information such as cell formatting. As a result, when exchange of
data is required within an organisation or between data providers and clients, they are the
most suitable format since they would not use up as much upload and download volume
or time as other data file formats.

In Study Unit 1, we developed a programme to let users enter the examination scores of
a course. However, the programme only allows data entry but provides no mechanism
to save the data in an external file. In other words, once we quit the programme, all the
data we have entered will not be accessible or found anymore. Here, we will introduce
the Python commands to write data into a .csv file as well as to read the data from it.

In Python, we need to open an external text file with the open() function first before we
can write data into it or read data from it.

with open("file_name", mode = "r/x/w/a/+") as file_object:


instructions

The with statement is usually used in combination with the open() function. The data
stored in the text file called file_name will then be stored in the file_object for
further processing. With the parameter mode, we can choose the operations that we are
permitted to carry out with the file. Here is a list of some of the available modes:

Table 6.1 List of some available modes of the open() function

Character Description

"r" open for reading (default)

SU6-6
ANL252 Basic SQL in Python

Character Description

"x" open for exclusive creation; failing if the file already exists

"w" open for writing, truncating the file first

"a" open for writing, appending to the end of the file if it exists

"+" open for updating without truncation (reading and writing)

(Source: Python Software Foundation, 2023, https://docs.python.org/3/library/functions.html#open)

Once a file is opened in a particular mode, we are only allowed to execute the permitted
operations until it is re-opened. For instance, if we have opened a file in reading mode,
Python will block us from writing, appending, or updating any content of the file and
return an error message to us if our code intends it to do so. The difference between both
the writing modes "w" and "a" is that "w" overwrites the entire original content in the file
by our new entries while "a" appends the new entries to the end of the file while keeping
the original content.

Note that "r" or "a" mode may be combined with the "+" mode. In both cases, we give
permission for the file to be read and updated. The main difference here is that if we open
the file in the "r+" mode, Python will be able to read the entire existing content from the
beginning of the file and write the new entries at the end of the file. But if the file is opened
in the "a+" mode, Python will not be able to read the entire existing content and is only
able to just append the updates at the end of the file. Furthermore, if the text file does not
exist, it will be created in the "a+" mode, while Python will return an error in the "r+"
mode.

After the user has finished entering the data of one record, Python should write this record
to the text file with the following syntax.

SU6-7
ANL252 Basic SQL in Python

file_object.write(data_row)

The name of file_object must be identical with the one defined in the with open()
statement. The object data_row is a string in which the entered data is stored. After all
the data has been saved to the file_object (Python will transfer the data to the real
text file in the background), we need to close the file properly to release its access to other
parties.

file_object.close()

Though closing the external files may not affect the programme flow directly, it is still
important and a good programming habit to do so at the end of the code.

Example (Cont’d): The following programme creates an interface to let the users enter
the personal information of the students including their first and last names, gender,
birthday, nationality, and study programme. The input mechanism is the same as
those introduced in Study Unit 1. For simplicity, we have omitted some of the control
mechanisms to prevent invalid inputs here. This is essential and extremely important
for a programme to be executed correctly. After entering the data of one student,
Python will convert the entries to a comma separated string and store it as a new line
into the file “Student_DB.csv” for later use.

SU6-8
ANL252 Basic SQL in Python

Figure 6.2 Programme to input data into a .csv File

In the first line, we store the file name in a variable called target_file in case we
need it multiple times in the later part of our programme. By choosing the "r+" mode
in the with open command in the second line, we instruct Python to open the file
for reading and writing, and the existing contents of target_file should be stored
in the object called write_to_file. The option of reading the stored data in the
file would be particularly useful if we needed the number of existing records. This
statistic is particularly helpful for two tasks: i) if the file is empty, we will need to add
a line of column headers to the record which we also include in lines 4 to 6, ii) we can
use the current number of records as the ID number of a new student. We would not
have been able to execute these tasks if we had opened target_file in the "a+"
mode instead.

Since .csv text files are comma separated, we need to insert a comma between each
column when concatenating the data to a string before storing it to target_file.
It is also important to put the escape sequence \n at the end of the string so that the
next record will be stored in a new line. Figure 6.3 shows how the interface looks like
when we run our programme in JupyterLab.

SU6-9
ANL252 Basic SQL in Python

Figure 6.3 Interface to input data in JupyterLab

To read the existing entries of a text file, we can use a for-loop to go through them line
by line and print the content to the screen.

for line in file_object:


print(line)

Once again, the file_object must be identical with the one defined in the with
open() statement. Note that Python is able to separate the records in the text file by
recognising the line break (or escape sequence \n) at the end of every line. As a result, the
for-loop will automatically run through all the lines and store the content in the variable
line, which will then be printed to the screen by the print() function.

Example (Cont’d): Figure 6.4 shows our programme to print the records stored in
“Student_DB.csv” to the screen line by line.

SU6-10
ANL252 Basic SQL in Python

Figure 6.4 Print the records in a .csv file to the screen line by line

Same as the input programme, we store the file name in a variable called
source_file in case we still need it in the later part of our programme.
Subsequently, we open the file in the "r" mode and store the data in the object
read_from_file. The advantage of opening the file in reading mode is that the data
stored in the .csv file is safe from accidental update or removal. The data will then be
printed to the screen line by line with the subsequent for-loop.

Read

Refer to the link below for more details and examples on the open() function:

Python Software Foundation. (2023b). Built-in functions. https://

docs.python.org/3/library/functions.html#open

Refer to the link below for more details and examples on reading and writing files in
Python:

SU6-11
ANL252 Basic SQL in Python

Python Software Foundation. (2023a). 7 input and output. https://

docs.python.org/3/tutorial/inputoutput.html

1.3 Importing Data to SQL With SQLite3


A database is an organised collection of data, and the data are stored in tables. The data in
each table is a specific set of the records in the database such as a company’s record of its
employees, customers, sales, and suppliers. These tables are usually directly or indirectly
connected with each other but it is not a compulsory requirement for them to be put in the
same database. Furthermore, the tables have a similar structure as the pandas DataFrames:
their columns represent the features of the data, or variables, and the records are stored
in rows.

To work with databases in Python, we need to generate a “connection” between Python


and the databases first. We can use the connect() function of the sqlite3 package for this
purpose.

connection_object = sqlite3.connect("database_name")

If the database already exists, connect() will simply create a connection between the
two platforms and let the user gain access to the existing database. On the other hand, if
the database is new, connect() will create a new database with the name used in the
string "database_name" and link it with Python directly. Once the connection has been
established, we can create SQL syntaxes as strings or string variables in Python, and then
send these string objects to SQL for execution. The .cursor() method creates a cursor
object to take over this task.

cursor_object = connection_object.cursor()

SU6-12
ANL252 Basic SQL in Python

To create a table with imported data from a .csv file, we need to first read in the file as
a pandas database in Python with the pandas.read_csv() function, which we have
introduced in Chapter 1 of Study Unit 4, and then send the data object to the database by
the .to_sql() method of the pandas package.

data_object = pandas.read_csv("csv_file_name.csv")
data_object.to_sql("table_name", connection_object,
if_exists)

With the parameter if_exists, we can replace ("replace"), append ("append"), or


let Python create an error message ("fail") if the table already exists in the database.

Example (Cont’d): In the first step, we establish a Python connection called conn to
the database named "StudentsDB.db" and create a cursor object cur for later use.

Figure 6.5 Establish a connection between Python and a database

In the next step, we import the student data created with the programme in Figure
6.2 as a pandas DataFrame called students.

SU6-13
ANL252 Basic SQL in Python

Figure 6.6 Import a.csv file dataset as a pandas DataFrame

Subsequently, we can create a new table in the database called students from this
pandas DataFrame. Since we have read in the entire dataset from the .csv file, we can
ask Python to replace the existing table if it exists.

Figure 6.7 Export pandas DataFrame to database by SQL

The parameter index instructs Python to write the row index as a column with
column name index_label in the table. Since the default value here is True, we
need to specify it in the .to_sql() method if we do not wish to include this column.

With the cursor object being created, we can execute the SQL commands by sending them
as strings through the cursor object with the .execute() method.

cursor_object.execute("SQL_command_string")

Note that SQL is a separated programming language for database management and its
commands are therefore not the same as those in Python. Furthermore, SQL commands

SU6-14
ANL252 Basic SQL in Python

are not case sensitive and should end with a semi-colon (this is usually optional, but
sometimes the semi-colons are useful to separate the commands that are sent to SQL for
execution at the same time).

To select a table from the database, we can send a SELECT statement to SQL.

SELECT * FROM table_name;

The asterisk (*) in the SELECT statement is to instruct SQL to take all columns from the
table. Once the query has been carried out, we can print one record of the result to the
screen by the .fetchone() method.

cursor_object.fetchone()

If we would like Python to print all records from the query result to the screen, we can
use the .fetchall() method instead.

cursor_object.fetchall()

Note that once we have applied the .fetchone() or .fetchall() methods, the data
records in the query result are literally fetched and no longer available. In other words, if
we re-apply the .fetchone() or .fetchall() methods, we will see either no records
or some of them missing. If we wish to select and check out the same table again, we will
have to redo the query.

SU6-15
ANL252 Basic SQL in Python

Example (Cont’d): In Figure 6.8, we use the .execute() method to send the SELECT
statement to SQL for selecting the table students from the database.

Figure 6.8 Select a table from the Database

Subsequently, we print out one of the records for checking purpose.

Figure 6.9 Fetch one record for printing

In the final step, we fetch all the records in the cursor object cur for printing.

Figure 6.10 Fetch all records for printing

From Figure 6.10, we can see that the first record has already been fetched in Figure
6.9 and is therefore not included in the output.

If we re-fetch the records from the query output after applying fetchall() once
before, SQL will return an empty object to us.

SU6-16
ANL252 Basic SQL in Python

Figure 6.11 Re-fetch records after applying fetchone() or fetchall()

Read

Refer to the link below for more details and examples on the to_sql() function of
the pandas package:

pandas. (2023b). pandas.DataFrame.to_sql. NumFOCUS, Inc. https://

pandas.pydata.org/pandas-docs/stable/reference/api/

pandas.DataFrame.‌to_sql.html

Refer to the link below for more details and examples on the connect(), .cursor(),
.execute(), .fetchone(), and .fetchall() functions and methods of the
sqlite3 package:

Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite

databases. https://docs.python.org/3/library/sqlite3.html

SU6-17
ANL252 Basic SQL in Python

Chapter 2: Data Query

Lesson Recording

Data Query with SQL

2.1 Selecting Table


In Chapter 1, we have already introduced the SELECT statement of the standard SQL in
its simplest form for table selection. In the following, we will discuss some further options
provided by the SELECT statement to optimise our data query.

The SELECT statement also allows us to select some of the variables from the table
instead of all of them. But in some cases, we may not even know the variables that
the table contains or how their names are correctly spelt. In this case, we can use the
.description attribute to extract the variable names from the last queried table.

cursor_object.description

Note that .description is an attribute and not a method. As a result, there are no
brackets and arguments behind it. The returned object is a collection of tuples where the
first item of each tuple is the column name, and the last six items are None.

SU6-18
ANL252 Basic SQL in Python

Example (Cont’d): In Figure 6.12, we will use the .description attribute to extract
the column names of the table students.

Figure 6.12 Extract column names of a table by the .description attribute

Unfortunately, the returned object does not have a particularly useful form for further
usage. Since we only need the first item of each tuple, we can run a for-loop within a
list to extract it.

Figure 6.13 Generate variable list of a table

While the for-loop is running through the cur.description object, the current
tuple is stored in the object column from which the item with the index 0 will then
be put in the list named cols.

From the previous Chapter, we learnt that we could fetch the records from a table for
further processes such as printing. However, the data are stored in tuples and when they
are printed, we do not see them as tables such as a pandas DataFrame. Furthermore, once
they are fetched, our programme no longer has access to the queried table. As a result, it
may be desirable to store the result of the query in a pandas DataFrame. In fact, pandas
provides the method .from_records() for this purpose.

SU6-19
ANL252 Basic SQL in Python

query_object = pd.DataFrame.from_records
(data = cursor_object.fetchall(), columns)

The .from_records() method is actually created to convert structured or n-


dimensional record arrays to pandas DataFrames. Nevertheless, it suits our purpose
perfectly by passing the resulting object of the fetchall() function as the parameter
data to the .from_records() method. With the columns parameter, we can specify
our own column names of the output DataFrame. The default value here is None, and the
corresponding column names would be simply the column indices.

Example (Cont’d): We can now start a new data query by selecting the entire table
students and then storing the result of the query in a pandas DataFrame. The query
step is almost identical to the syntax in Figure 6.8. The only difference here is that
the .execute() method will not carry out our SQL command directly, but a string
variable called sql_select in which the SELECT statement will store the queried
data for future use.

Figure 6.14 Execute an SQL statement stored in a string variable

In the next step, we can generate the data by applying the .fetchall() method on
the cursor object and converting it to a pandas DataFrame. We will also specify the
list cols which was generated in Figure 6.13 as our column names here.

SU6-20
ANL252 Basic SQL in Python

Figure 6.15 Convert the result of an SQL query to a pandas DataFrame

Figure 6.15 shows the table students after being converted to a pandas DataFrame,
a format that has become well-known to us from the previous study units.

Read

Refer to the link below for more details and examples on the .description attribute
of the sqlite3 package:

Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite

databases. https://docs.python.org/3/library/sqlite3.html#module-sqlite3

Refer to the link below for more details and examples on the .from_records()
function of the pandas package:

pandas. (2023a). pandas.DataFrame.from_records. https://pandas.pydata.org/pandas-

docs/stable/reference/api/pandas.DataFrame.‌from_records.html

Refer to the links below for more details and examples on the SELECT statement of
SQL:

SU6-21
ANL252 Basic SQL in Python

W3 schools. (2023s). SQL SELECT statement. Refsnes Data. https://

www.w3schools.com/sql/sql_select.asp

SQLite Tutorial. (2022m). SQLite select. https://www.sqlitetutorial.net/sqlite-select/

2.2 Sorting Data


In SQL, we can add the keyword ORDER BY to the SELECT statement to sort the data of
a table by some of its variables in the ascending or descending orders.

SELECT * FROM table_name


ORDER BY var1_name, var2_name ASC|DESC;

Note that if we intend to sort the table by multiple variables, we will need to separate
their names by commas. The sequence of the variables in this list also reflects the sorting
hierarchy. That is, the data are sorted by the first variable in the list initially. Those tied
records are then sorted by the second variable, and so on. If the data must be sorted in the
descending order by a particular variable, we must specify the DESC option behind the
variable name. Since the default value here is ASC, we can omit this option for a variable
if the data should be sorted in its ascending order.

SU6-22
ANL252 Basic SQL in Python

Example (Cont’d): Suppose we would like to sort the students by their age and
nationality in the table students. Since the table only contains their birthdays, stored
in the variable Birthday, we need to sort the records in the descending order so that
their age will be ordered naturally.

Figure 6.16 Sort students by their birthdays and nationalities

The second sorting criterion, Nationality, is placed after a comma to separate it


from the other sorting criteria. The option ASC can be omitted here. From the result, we
can see that the students with ID number 9 and ID number 6 share the same birthday.
Since student number 9 is from Indonesia, which is alphabetically before Singapore,
this record is sorted in front of the student number 6. It is also noteworthy that the
birthday of the student number 8 is missing. As a result, it appears at the end of the
table, after sorting the students in the descending order of their birthdays.

SU6-23
ANL252 Basic SQL in Python

Read

Refer to the links below for more details and examples on the ORDER BY option of
SQL:

W3 schools. (2023r). SQL ORDER BY keyword. Refsnes Data. https://

www.w3schools.com/sql/sql_orderby.asp

SQLite Tutorial. (2022k). SQLite order by. https://www.sqlitetutorial.net/sqlite-

order-by/

2.3 Filtering Data


The main purpose of data query is to request the records of interest from the available
tables. Often, it is not the entire table that we are looking for. Instead, we would like to
have records that fulfil certain criteria. In SQL, we can use the WHERE clause in the SELECT
statement to filter the useful records for us.

SELECT * FROM table_name


WHERE var_name = value;

In the above syntax, the selection criterion is presented in its simplest form: records will
only be selected if one of the variables is equal to a certain value. We can also construct
other criteria by using the operators listed in the following table.

SU6-24
ANL252 Basic SQL in Python

Table 6.2 Operators in the WHERE clause

Operator Description

= Equal

> Greater than

< Less than

>= Greater than or equal

<= Less than or equal

<> Not equal (Note: In some SQL versions it may be written as !


=)

BETWEEN Between a certain range

LIKE Search for a pattern

IN To specify multiple possible values for a column

(Source: W3 schools, 2023, https://www.w3schools.com/sql/sql_where.asp)

Same as the if-command in Python, we can also link multiple criteria in one statement
using the AND, OR and NOT operators.

Sometimes, we would rather not obtain records that contain missing values in one or more
variables from a query. In this case, we need to add the IS NOT NULL syntax to the WHERE
clause.

SELECT * FROM table_name


WHERE var_name IS NOT NULL;

SU6-25
ANL252 Basic SQL in Python

If the statement were written without the NOT operator, SQL would return all records with
missing values in the variable var_name to us.

Example (Cont’d): Suppose we would like to select students of the analytics


programme from the table of students. The query can simply be carried out by
adding the criterion Program = 'Analytics' to the WHERE syntax.

Figure 6.17 Select only analytics students from the table students

If the value is a string such as “Analytics” in the above syntax, we will need to put it
in a pair of quotation marks. And it is important here to pay attention to when and
where single or double quotation marks should be used.

Suppose we would like to narrow down our query to only analytics students with ID
numbers between 5 and 10.

SU6-26
ANL252 Basic SQL in Python

Figure 6.18 Select only analytics students with ID number between 5 and 10

Suppose these are not the students that we are actually looking for, and we would like
to select the other analytics students. All we need to modify in the above syntax is to
change the operator from BETWEEN to NOT BETWEEN for the ID variable.

Figure 6.19 Select only analytics students with ID number not between 5 and 10

In the next query, we would like to select all analytics students who are not from
Singapore or China. With the IN operator, we can specify the two values “Singapore”
and “China” that we are searching for in the column Nationality. Note that these
values must be put in a pair of round brackets. Finally, the NOT operator should be
added to the syntax IN ('Singapore', 'China') to negate it.

SU6-27
ANL252 Basic SQL in Python

Figure 6.20 Select analytics students not from Singapore or China

Now, all analytics students whose first names start with an “M” should be selected.
Here, we can use the LIKE operator and the value is 'M%'. The (%) sign is a wildcard
that represents zero, one, or multiple characters. In other words, our value is an M
followed by a string of arbitrary length.

Figure 6.21 Select analytics students with first names that start with “M”

Instead of searching for first names starting with a particular letter, we can also
select records with first names that contain a pre-defined string, which is “ar” in the
following example.

Figure 6.22 Select analytics students with first names that contain “ar”

SU6-28
ANL252 Basic SQL in Python

Same as our code in Figure 6.22, we can use the (%) wildcard within the string to
specify the position of our pre-defined string. Unsurprisingly, we can also select first
names that end with a certain character, which is “s” here.

Figure 6.23 Select analytics students with first names that end with “s”

If we wish to select first names that start and end with some pre-defined characters,
we can put the wildcard (%) between the start and the end characters. In the following
example, we would like to select students with first names that start with an “M” and
end with an “l”.

Figure 6.24 Select analytics students with first names that start with “M” and end with “l”

The other type of wildcard for the LIKE operator is the underscore sign (_), which
represents a single character. In the following example, we would like to select all
students whose last names start with “Ta” and followed by exactly one character.

Figure 6.25 Select analytics students with last names that have the pattern “Ta_”

SU6-29
ANL252 Basic SQL in Python

The syntax in Figure 6.25 will result in selecting all last names with three characters
that start with “Ta”. In other words, last names such as “Tang” will not be included.

In Figure 6.26, cases with missing value in Birthday will be selected. This step is
usually helpful to let data analysts study the cases with missing values first and then
decide whether to remove them from the dataset.

Figure 6.26 Select students with no birthday record

If we simply want to select all students whose birthday records are not missing, we
will just need to replace IS NULL by IS NOT NULL in the SELECT statement.

Figure 6.27 Select students with non-missing birthday records

SU6-30
ANL252 Basic SQL in Python

In SQL, we can also select particular columns from a table in the data query. The asterisk
(*) in the SELECT statement should be replaced by a list of selected variables in this case.

SELECT var_name1, var_name2, …


FROM table_name WHERE criteria;

It is also possible to use Python programming to manipulate the SELECT statement as


string before sending it to SQL. That is, we can create our own variable list as string in our
Python programme first and then combine it with the rest of the statement. This approach
will give us the flexibility to generate different variable lists for data query depending on
the requirement of the situation.

Example (Cont’d): In Figure 6.28, we select the columns ID, LastName, FirstName,
Nationality and Program from the table students. At the same time, we are
only interested in students outside the analytics programme. However, these records
should also be sorted by their nationalities.

Figure 6.28 Select data of non-analytics students from certain columns of a table

The parameter columns in the from_records() function must be replaced by a


new variable list since the list cols which we have generated in Figure 6.13 contains
more column names than the table we query here. As a result, it is more practical to
store the names of the selected variable in a list or a tuple first, and then concatenate
them to a single string by the join() method.

SU6-31
ANL252 Basic SQL in Python

Figure 6.29 Concatenate items of a tuple to a string

In the first line, we define the tuple named sel_cols with the names of the selected
variables in it. In the second line, the .join() method is applied to ", ", the
separator between the variable names in the output string. Basically, the .join()
method runs through all the items in sel_cols and adds them with ", " (except for
the last item for which the separator is not necessary) to the string sel_cols_str
one after another.

In the next step, we first split the SELECT statement in Figure 6.28 into three parts:
"Select ", the variable list, and the rest. The original variable list can then be
replaced by the string variable sel_cols_str. Subsequently, we can concatenate
these three partitions of the string into one by “adding” them together as shown in
Figure 6.30. Furthermore, we can also use the variable list sel_cols as the value for
the parameter columns in the from_records() function.

Figure 6.30 Select certain columns of a table by a variable list string

Hereafter, we can obtain different columns of the table by just changing the variable
names in sel_cols_str and re-run the code in Figure 6.30. Note that the same
technique can also be applied to the other parts of the SELECT statement.

SU6-32
ANL252 Basic SQL in Python

Read

Refer to the links below for more details and examples on the WHERE clause of SQL:

W3 schools. (2023u). SQL WHERE clause. Refsnes Data. https://

www.w3schools.com/sql/sql_where.asp

SQLite Tutorial. (2022p). SQLite where. RefsnesData. https://


www.sqlitetutorial.net/sqlite-where/

Refer to the link below for more details and examples on the AND, OR, NOT operators
of SQL:

W3 schools. (2023c). SQL AND, OR and NOT operators. Refsnes Data. https://

www.w3schools.com/sql/sql_and_or.asp

Refer to the links below for more details and examples on the IN operator of SQL:

W3 schools. (2023k). SQL IN operator. Refsnes Data. https://www.w3schools.com/

sql/sql_in.asp

SQLite Tutorials. (2022i). SQLite IN. https://www.sqlitetutorial.net/sqlite-in/

Refer to the links below for more details and examples on the BETWEEN operator of
SQL:

W3 schools. (2023d). SQL BETWEEN operator. Refsnes Data. https://

www.w3schools.com/sql/sql_between.asp

SQLite Tutorials. (2022b). SQLite BETWEEN. https://www.sqlitetutorial.net/sqlite-

between/

Refer to the links below for more details and examples on the LIKE operator of SQL:

W3 schools. (2023o). SQL LIKE operator. https://www.w3schools.com/sql/

sql_like.asp

SU6-33
ANL252 Basic SQL in Python

SQLite Tutorials. (2022n). SQLite LIKE. https://www.sqlitetutorial.net/sqlite-like/

Refer to the link below for more details and examples on the wildcard characters in
SQL:

W3 schools. (2023v). SQL wildcards. Refsnes Data. https://www.w3schools.com/

sql/sql_wildcards.asp

Refer to the links below for more details and examples on NULL values in SQL:

W3 schools. (2023q). SQL NULL values. Refsnes Data. https://

www.w3schools.com/sql/sql_null_values.asp

SQLite Tutorial. (2022l). SQLite IS NULL. https://www.sqlitetutorial.net/sqlite-is-

null/

Refer to the link below for more details and examples on the .join() methods:

Python Software Foundation. (2023c). Built-in types. https://docs.python.org/3/

library/stdtypes.html#str.join

SU6-34
ANL252 Basic SQL in Python

Chapter 3: Joining Tables

Lesson Recording

Join Tables with SQL

The tables in a database are usually connected in some way. For instance, in the database
of a bank, there may be a table with the personal records of all the customer relationship
managers and another table with the records of all customers, including their transaction
records and the names of their relationship managers. With these data, the bank can query
on the sales records of each manager within a certain period. In this case, we are joining
tables from a database to gain cross-table information.

3.1 Natural Join


In SQL, to join two or more tables, one of the most popular ways is NATURAL JOIN, which
will be discussed in this section.

SELECT * FROM table1_name


NATURAL JOIN table2_name;

The NATURAL JOIN clause is used within the SELECT statement. It selects only records
of table1 that can be matched by records in table2. The rows of table1 or table2 for
which SQL cannot find any matches in the opposite table will be dropped from the query.
SQL compares the values of each matching variable from the two tables specified by the
user. Usually, these variables should represent the same feature in both tables such as the
employee number or customer ID. Note that it is also possible to extend the match to
multiple pairs of variables, if necessary. Note that only one copy of each common column
will be kept.

SU6-35
ANL252 Basic SQL in Python

Example: In the following, we will create a new table called product_info which
contains the names of the products that customers wish to buy. The table will then
be joined with the table shop_info, where the customers are willing to visit. The
original datasets are stored in .csv files that include the following information:

Figure 6.31 Information about


products and customers

Figure 6.32 Information about


shops and customers

We can use the pandas package to read the .csv files:

Figure 6.33 Read .csv files

Then we connect Python to SQL database with the .to_sql() function.

SU6-36
ANL252 Basic SQL in Python

Figure 6.34 Connect Python to SQL database

Finally, we perform NATURAL JOIN using the syntax mentioned at the beginning of this
Chapter.

Figure 6.35 Natural join of two tables

Note that there are also other more complicated table merging methods such as cross join
and outer join which are beyond the scope of this course.

Read

Refer to the links below for more details and examples on the INNER JOIN clause of
SQL:

W3 schools. (2023l). SQL INNER JOIN keyword. https://www.w3schools.com/sql/

sql_join_inner.asp

SQLite Tutorials. (2022j). SQL inner join. https://www.sqlitetutorial.net/sqlite-

inner-join/

SU6-37
ANL252 Basic SQL in Python

Summary

In this study unit, we first learned how to write Python programmes to store data entered
by a user to a .csv text file. After the sqlite3 package was introduced, we were able to
connect Python with the databases and convert external data sources saved as .csv text
files to database tables by SQL. With the SELECT statement, we could execute different
types of data query such as sorting and filtering data. Here, we have used Python
programming to generate SELECT statements flexibly and to convert the query output
to pandas DataFrames for better presentation in the Python environment. We have also
come across the method for joining tables of a database, i.e., natural join. The output of
these methods could vary strongly since they select the records differently.

SU6-38
ANL252 Basic SQL in Python

Formative Assessment

1. Which of the following modes of the open() function does not allow you to write to
the source file?
a. "a"
b. "r"
c. "r+"
d. "w"

2. What role does the cursor object play in Python?


a. It carries our commands from Python to SQL.
b. It connects Python with the database.
c. It is the database object in Python.
d. It defines the table in the database with direct access.

3. After querying a SQL table with 50 observations, we use fetchone() to check on the
outcome of one record for merely one time. How many records will remain available
in the query output after that?
a. 0
b. 1
c. 49
d. 50

4. Table A has 50 records and Table B has 70. A total of 40 records could be matched
based on certain conditions. What would be the number of records in the output table
if we merged A and B by inner join and cross join respectively?
a. 40 and 3500
b. 50 and 3500
c. 40 and 50

SU6-39
ANL252 Basic SQL in Python

d. 50 and 70

5. Which of the following is true regarding SQL NATURAL JOIN of two tables?
a. The join condition needs to be specified explicitly.
b. The columns in the two tables are automatically matched.
c. In the case that there is no match, all rows from both tables will be returned.
d. In the case that there is no match, all columns from both tables will be returned.

SU6-40
ANL252 Basic SQL in Python

Solutions or Suggested Answers

Formative Assessment
1. Which of the following modes of the open() function does not allow you to write to
the source file?

a. "a"

Incorrect. We can append new records to the file in the appending mode.

b. "r"

Correct. We can only extract data from the file in the reading mode.

c. "r+"

Incorrect. We can add new content to the file in the updating mode.

d. "w"

Incorrect. We can write to the file in the writing mode.

2. What role does the cursor object play in Python?

a. It carries our commands from Python to SQL.

Correct. We send our SQL commands through the cursor object from
Python to SQL.

b. It connects Python with the database.

Incorrect. The connection object connects Python to the database.

c. It is the database object in Python.

Incorrect. There is no direct database object in Python.

d. It defines the table in the database with direct access.

SU6-41
ANL252 Basic SQL in Python

Incorrect. The cursor object does not specify the table in the database that we
are working on.

3. After querying a SQL table with 50 observations, we use fetchone() to check on the
outcome of one record for merely one time. How many records will remain available
in the query output after that?

a. 0

Incorrect. The fetchone() function only fetches one record from the query
output. So, there must be more than 0 records remaining.

b. 1

Incorrect. The fetchone() function only fetches one record from the query
output. So, there must be more than 1 record remaining.

c. 49

Correct. Since the fetchone() function only fetches one record from the
query output, there must be 49 records remaining.

d. 50

Incorrect. The fetchone() function fetches one record from the query
output anyway. So, there must be less than 50 records remaining.

4. Table A has 50 records and Table B has 70. A total of 40 records could be matched
based on certain conditions. What would be the number of records in the output table
if we merged A and B by inner join and cross join respectively?

a. 40 and 3500

Correct. Inner join creates the intersection set of both tables (i.e., 40) and
cross join returns the cartesian product of both tables (i.e., 3500).

b. 50 and 3500

SU6-42
ANL252 Basic SQL in Python

Incorrect. Only if A were left joined by B, the number of records of A would


return (i.e., 50). Cross join returns the cartesian product of both tables (i.e.,
3500).

c. 40 and 50

Incorrect. Inner join creates the intersection set of both tables (i.e., 40), but
only if A were left joined by B, the number of records of A would return (i.e.,
50).

d. 50 and 70

Incorrect. Only if A were left joined by B, the number of records of A would


return (i.e., 50), and only if B were left joined by A, the number of records of
B would return (i.e., 70).

5. Which of the following is true regarding SQL NATURAL JOIN of two tables?

a. The join condition needs to be specified explicitly.

Incorrect. There is no need to specify the join condition explicitly.

b. The columns in the two tables are automatically matched.

Correct. In SQL, the NATURAL JOIN can automatically match two tables.

c. In the case that there is no match, all rows from both tables will be returned.

Incorrect. If there are no matched rows, then no row will be returned.

d. In the case that there is no match, all columns from both tables will be
returned.

Incorrect. NATURAL JOIN requires at least one pair of common column


names.

SU6-43
ANL252 Basic SQL in Python

References

Geeks for geeks. (n.d.). Difference between Natural join and Inner Join in SQL. https://

www.geeksforgeeks.org/difference-between-natural-join-and-inner-join-in-sql/

pandas. (2023a). pandas.DataFrame.from_records. NumFOCUS, Inc.

https://pandas.pydata.org/pandas-docs/stable/reference/api/
pandas.DataFrame.from_records.html

pandas. (2023b). pandas.DataFrame.to_sql. NumFOCUS, Inc. https://pandas.pydata.org/

pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

Python Software Foundation. (2023a). 7 input and output. https://docs.python.org/3/

tutorial/inputoutput.html#tut-files

Python Software Foundation. (2023b). Built-in functions. https://docs.python.org/3/

library/functions.html#open

Python Software Foundation. (2023c). Built-in types. Python Software Foundation.

https://docs.python.org/3/library/stdtypes.html#str.join

Python Software Foundation. (2023d). sqlite3 — DB-API 2.0 interface for SQLite databases.

https://docs.python.org/3/library/sqlite3.html

Python Software Foundation. (2023e). sqlite3 — DB-API 2.0 interface for SQLite databases:

Connection objects. https://docs.python.org/3/library/sqlite3.html#connection-

objects

Python Software Foundation. (2023f). sqlite3 — DB-API 2.0 interface for

SQLite databases: Cursor objects. https://docs.python.org/3/library/

sqlite3.html#sqlite3.Cursor.description

SQLite. (2023). Date and time functions. https://sqlite.org/lang_datefunc.html

SQLite Tutorial. (2022a). SQLite ALTER TABLE. https://www.sqlitetutorial.net/sqlite-

alter-table/

SU6-44
ANL252 Basic SQL in Python

SQLite Tutorial. (2022b). SQLite BETWEEN. https://www.sqlitetutorial.net/sqlite-

between/

SQLite Tutorial. (2022c). SQLite create table. https://www.sqlitetutorial.net/sqlite-create-

table/

SQLite Tutorial. (2022d). SQLite cross join. SQLite Tutorial. https://

www.sqlitetutorial.net/sqlite-cross-join/

SQLite Tutorial. (2022e). SQLite delete. https://www.sqlitetutorial.net/sqlite-delete/

SQLite Tutorial. (2022f). SQLite drop table. https://www.sqlitetutorial.net/sqlite-

drop-table/

SQLite Tutorial. (2022g). SQLite group by. https://www.sqlitetutorial.net/sqlite-group-

by/

SQLite Tutorial. (2022h). SQLite having. https://www.sqlitetutorial.net/sqlite-having/

SQLite Tutorial. (2022i). SQLite IN. https://www.sqlitetutorial.net/sqlite-in/

SQLite Tutorial. (2022j). SQLite inner join. https://www.sqlitetutorial.net/sqlite-inner-

join/

SQLite Tutorial. (2022k). SQLite insert. https://www.sqlitetutorial.net/sqlite-insert/

SQLite Tutorial. (2022l). SQLite IS NULL. https://www.sqlitetutorial.net/sqlite-is-null/

SQLite Tutorial. (2022m). SQLite left join. https://www.sqlitetutorial.net/sqlite-left-join/

SQLite Tutorial. (2022n). SQLite LIKE. https://www.sqlitetutorial.net/sqlite-like/

SQLite Tutorial. (2022k). SQLite order by. https://www.sqlitetutorial.net/sqlite-order-

by/

SQLite Tutorial. (2022l). SQLite rename column. https://www.sqlitetutorial.net/sqlite-

rename-column/

SQLite Tutorial. (2022m). SQLite select. https://www.sqlitetutorial.net/sqlite-select/

SQLite Tutorial. (2022n). SQLite union. https://www.sqlitetutorial.net/sqlite-union/

SQLite Tutorial. (2022o). SQLite update. https://www.sqlitetutorial.net/sqlite-update/

SQLite Tutorial. (2022p). SQLite where. https://www.sqlitetutorial.net/sqlite-where/

SU6-45
ANL252 Basic SQL in Python

W3 resource. (n.d.). SQL LOWER() function. https://www.w3resource.com/sql/

character-functions/lower.php

W3 schools. (2023a). Python list remove() method. Refsnes Data. https://

www.w3schools.com/python/ref_list_remove.asp

W3 schools. (2023b). SQL ALTER keyword. Refsnes Data. https://www.w3schools.com/

sql/sql_alter.asp

W3 schools. (2023c). SQL AND, OR and NOT operators. Refsnes Data. https://

www.w3schools.com/sql/sql_and_or.asp

W3 schools. (2023d). SQL BETWEEN operator. Refsnes Data. https://

www.w3schools.com/sql/sql_between.asp

W3 schools. (2023e). SQL COUNT(), AVG() and SUM() functions. Refsnes Data. https://

www.w3schools.com/sql/sql_count_avg_sum.asp

W3 schools. (2023f). SQL CREATE TABLE statement. Refsnes Data. https://

www.w3schools.com/sql/sql_create_table.asp

W3 schools. (2023g). SQL DELETE keyword. Refsnes Data. https://

www.w3schools.com/sql/sql_ref_delete.asp

W3 schools. (2023h). SQL DROP TABLE statement. Refsnes Data. https://

www.w3schools.com/sql/sql_drop_table.asp

W3 schools. (2023i). SQL GROUP BY statement. Refsnes Data. https://

www.w3schools.com/sql/sql_groupby.asp

W3 schools. (2023j). SQL HAVING clause. Refsnes Data. https://www.w3schools.com/

sql/sql_having.asp

W3 schools. (2023k). SQL IN operator. Refsnes Data. https://www.w3schools.com/sql/

sql_in.asp

W3 schools. (2023l). SQL INNER JOIN keyword. Refsnes Data. https://

www.w3schools.com/sql/sql_join_inner.asp

SU6-46
ANL252 Basic SQL in Python

W3 schools. (2023m). SQL INSERT INTO statement. Refsnes Data. https://

www.w3schools.com/sql/sql_insert.asp

W3 schools. (2023n). SQL LEFT JOIN keyword. Refsnes Data. https://

www.w3schools.com/sql/sql_join_left.asp

W3 schools. (2023o). SQL LIKE operator. https://www.w3schools.com/sql/sql_like.asp

W3 schools. (2023p). SQL MIN() and MAX() functions. Refsnes Data. https://

www.w3schools.com/sql/sql_min_max.asp

W3 schools. (2023q). SQL NULL values. Refsnes Data. https://www.w3schools.com/

sql/sql_null_values.asp

W3 schools. (2023r). SQL ORDER BY keyword. Refsnes Data. https://

www.w3schools.com/sql/sql_orderby.asp

W3 schools. (2023s). SQL SELECT statement. Refsnes Data. https://

www.w3schools.com/sql/sql_select.asp

W3 schools. (2023r). SQL UPDATE statement. Refsnes Data. https://

www.w3schools.com/sql/sql_update.asp

W3 schools. (2023u). SQL WHERE clause. Refsnes Data. https://www.w3schools.com/

sql/sql_where.asp

W3 schools. (2023v). SQL wildcards. Refsnes Data. https://www.w3schools.com/sql/

sql_wildcards.asp

W3 schools. (2023w). SQL working with dates. Refsnes Data. https://

www.w3schools.com/sql/sql_dates.asp

SU6-47
ANL252 Basic SQL in Python

SU6-48

You might also like