

Python for Data Science
Release 1.0.0

Veit Schiele

Jun 11, 2023


CONTENTS

1 Introduction
  1.1 Target groups
  1.2 Structure of the Python for Data Science tutorial
  1.3 Status
  1.4 Follow us
  1.5 Pull-Requests

2 Workspace
  2.1 IPython
  2.2 Jupyter
  2.3 NumPy
  2.4 pandas

3 Read, persist and provide data
  3.1 Open data
  3.2 pandas IO tools
  3.3 Serialisation formats
  3.4 Intake
  3.5 Requests
  3.6 Overview
  3.7 Geodata
  3.8 PostgreSQL
  3.9 NoSQL databases
  3.10 Application Programming Interface (API)
  3.11 Glossary

4 Data cleansing and validation
  4.1 Overview

5 Visualise data

6 Performance
  6.1 k-Means example
  6.2 Performance measurements
  6.3 Search for existing implementations
  6.4 Find anti-patterns
  6.5 Vectorisations with NumPy
  6.6 Special data structures
  6.7 Select compiler
  6.8 Task planner
  6.9 Multithreading, Multiprocessing and Async

7 Create a product
  7.1 Manage code with Git
  7.2 Manage data with DVC
  7.3 Reproduce environments
  7.4 Creating programme libraries and packages
  7.5 Document
  7.6 Licensing
  7.7 Citing
  7.8 Testing
  7.9 Logging
  7.10 Check and improve code quality and complexity
  7.11 Security

8 Create web applications

9 Index

This is a tutorial on Data Science with Python. This immediately raises the question: What is Data Science? The
term has become ubiquitous, but there is no single definition. Some even consider the term superfluous, because what
science does not have to do with data? Nevertheless, it seems to me that data science is more than just hype: scientific
data has become increasingly voluminous and often can no longer be adequately tapped with conventional mathematical
and statistical methods alone – additional hacking skills are needed. However, it is not a new field of knowledge that
you need to learn, but a set of skills that you can apply in your field. Whether you are analysing astronomical objects,
analysing machines, forecasting stock prices or working with data in other fields, the goal of this tutorial is to enable
you to solve tasks programmatically in your field.
This tutorial is not intended to be an introduction to Python or programming in general; for that there is the Python
basics tutorial. Instead, it is intended to show the Python data science stack – libraries such as IPython, NumPy, pandas,
Matplotlib and related tools – so that you can subsequently effectively analyse and visualise your data.

CHAPTER ONE

INTRODUCTION

1.1 Target groups

The target groups are diverse, from data scientists to data engineers and analysts to systems engineers. Their skills and
workflows are very different. However, one of the great strengths of Python for Data Science is that it allows these
different experts to work closely together in cross-functional teams.
Data scientists
explore data with different parameters and summarise the results.
Data engineers
check the quality of the code and make it more robust, efficient and scalable.
Data analysts
use the code provided by data engineers to systematically analyse the data.
System engineers
provide the research platform based on the JupyterHub on which the other roles can perform their work.
In this tutorial we address system engineers who want to build and run a platform based on Jupyter notebooks. We then
explain how this platform can be used effectively by data scientists, data engineers and analysts.

1.2 Structure of the Python for Data Science tutorial

From Chapter 2, the tutorial follows the prototype of a research project:


2. Workspace with the installation and configuration of IPython, Jupyter notebooks with nbextensions and ipywidgets.
3. Read, persist and provide data either through a REST API or directly from an HTML page.
4. Data cleansing and validation is a recurring task that involves removing or changing redundant, inconsistent or
incorrectly formatted data.
5. Visualise data has, with its many different possibilities, been moved to a separate tutorial.
6. Performance introduces ways to make your code run faster.
7. Create a product shows what is necessary to achieve reproducible results: not only reproducible environments
are needed, but also versioning of the source code and data. The source code should be packed into programme
libraries with documentation, licence(s), tests and logging. Finally, the chapter includes advice on improving
code quality and secure operation.
8. Create web applications can either generate dashboards from Jupyter notebooks or require more comprehensive application logic, such as demonstrated in Bokeh-Plots in Flask einbinden (embedding Bokeh plots in Flask), or provide data via a RESTful API.


1.3 Status

1.4 Follow us

• GitHub
• Mastodon

1.5 Pull-Requests

If you have suggestions for improvements and additions, I recommend that you create a fork of my GitHub repository and make your changes there. You are also welcome to make a pull request. If the changes contained therein are small and atomic, I'll be happy to look at your suggestions.

CHAPTER TWO

WORKSPACE

Setting up the workspace includes installing and configuring IPython and Jupyter with nbextensions and ipywidgets,
and NumPy.

2.1 IPython

IPython, or Interactive Python, was initially an advanced Python interpreter that has now grown into an extensive project
designed to provide tools for the entire life cycle of research computing. Today, IPython is not only an interactive
interface to Python, but also offers a number of useful syntactic additions for the language. In addition, IPython is
closely related to the Jupyter project.
See also:
• Miki Tebeka - IPython: The Productivity Booster

2.1.1 Start the IPython shell

You can easily start IPython in a console:

$ pipenv run ipython


Python 3.7.0 (default, Aug 22 2018, 15:22:29)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

Alternatively, you can use IPython in a Jupyter notebook. To do this, start the notebook server first:

$ pipenv run jupyter notebook


[I 17:35:02.419 NotebookApp] Serving notebooks from local directory: /Users/veit/cusy/
˓→trn/Python4DataScience

[I 17:35:02.419 NotebookApp] The Jupyter Notebook is running at:


[I 17:35:02.427 NotebookApp] http://localhost:8888/?
˓→token=72209334c2e325a68115902a63bd064db436c0c84aeced7f

[I 17:35:02.428 NotebookApp] Use Control-C to stop this server and shut down all kernels␣
˓→(twice to skip confirmation).

[C 17:35:02.497 NotebookApp]

The standard browser should then be opened with the specified URL. Often this is http://localhost:8888.
Now you can start a Python process in the browser by creating a new notebook.


2.1.2 IPython examples

Running Python code

Show Python version

[1]: import sys

sys.version_info

Show versions of Python packages

Most Python packages provide a __version__ method for this:

[2]: import pandas as pd

pd.__version__
[2]: '1.3.4'

Alternatively, you can use version from importlib_metadata:

[3]: from importlib_metadata import version

print(version('pandas'))
1.3.4

Information about the host operating system and the versions of installed Python packages

[4]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-30-generic
Version : #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.3.4
numpy : 1.21.3
pytz : 2021.1


dateutil : 2.8.2
pip : None
setuptools : 57.4.0
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.28.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : None
numba : None

Only use Python versions ≥ 3.8

[5]: import sys


assert sys.version_info[:2] >= (3, 8)


Shell commands

[6]: !python3 -V
Python 3.8.12

[7]: !python3 -m pip --version


pip 21.3.1 from /srv/jupyter/.local/share/virtualenvs/Python4DataScience-q37UAbtL/lib/
˓→python3.8/site-packages/pip (python 3.8)

Tab completion

. . . for objects with methods and attributes:

. . . and also for modules:

Note:
As you may have noticed with surprise, the __version__ method used above is not offered in the selection. IPython initially hides private methods and attributes that begin with underscores. However, they can still be completed with the Tab key if you first type an underscore. Alternatively, you can change this setting in the IPython configuration.
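For example, the relevant completer option could be set in ipython_config.py as follows (a small sketch; the value 0 means that names starting with underscores are always offered):

c.IPCompleter.omit__names = 0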
. . . for almost everything:


Displaying information about an object

With a question mark (?) you can display information about an object – for example, for the method np.mean with the following docstring:
[8]: import numpy as np

[9]: np.mean?

Signature:
np.mean(
a,
axis=None,
dtype=None,
out=None,
keepdims=<no value>,
*,
where=<no value>,
)
Docstring:
Compute the arithmetic mean along the specified axis.

Returns the average of the array elements. The average is taken over
the flattened array by default, otherwise over the specified axis.
`float64` intermediate and return values are used for integer inputs.

Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
axis : None or int or tuple of ints, optional
Axis or axes along which the means are computed. The default is to
compute the mean of the flattened array.

.. versionadded:: 1.7.0

If this is a tuple of ints, a mean is performed over multiple axes,


instead of a single axis or all the axes as before.
dtype : data-type, optional
Type to use in computing the mean. For integer inputs, the default
is `float64`; for floating point inputs, it is the same as the
input dtype.
out : ndarray, optional
Alternate output array in which to place the result. The default
is ``None``; if provided, it must have the same shape as the
expected output, but the type will be cast if necessary.
See :ref:`ufuncs-output-type` for more details.

keepdims : bool, optional


If this is set to True, the axes which are reduced are left
in the result as dimensions with size one. With this option,
the result will broadcast correctly against the input array.

If the default value is passed, then `keepdims` will not be


passed through to the `mean` method of sub-classes of
`ndarray`, however any non-default value will be. If the
sub-class' method does not implement `keepdims` any
exceptions will be raised.

where : array_like of bool, optional


Elements to include in the mean. See `~numpy.ufunc.reduce` for details.

.. versionadded:: 1.20.0

Returns
-------
m : ndarray, see dtype parameter above
If `out=None`, returns a new array containing the mean values,
otherwise a reference to the output array is returned.

See Also
--------
average : Weighted average
std, var, nanmean, nanstd, nanvar

Notes
-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.

Note that for floating-point input, the mean is computed using the
same precision the input has. Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below). Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.

By default, `float16` results are computed using `float32` intermediates


for extra precision.

Examples
--------
>>> a = np.array([[1, 2], [3, 4]])
>>> np.mean(a)
2.5
>>> np.mean(a, axis=0)
array([2., 3.])
>>> np.mean(a, axis=1)
array([1.5, 3.5])

In single precision, `mean` can be inaccurate:

>>> a = np.zeros((2, 512*512), dtype=np.float32)


>>> a[0, :] = 1.0
>>> a[1, :] = 0.1


>>> np.mean(a)
0.54999924

Computing the mean in float64 is more accurate:

>>> np.mean(a, dtype=np.float64)


0.55000000074505806 # may vary

Specifying a where argument:


>>> a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
>>> np.mean(a)
12.0
>>> np.mean(a, where=[[True], [False], [False]])
9.0
File: ~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-
˓→packages/numpy/core/fromnumeric.py

Type: function

By using ?? the source code of the function is also displayed, if this is possible:

[10]: np.mean??

Signature:
np.mean(
a,
axis=None,
dtype=None,
out=None,
keepdims=<no value>,
*,
where=<no value>,
)
Source:
@array_function_dispatch(_mean_dispatcher)
def mean(a, axis=None, dtype=None, out=None, keepdims=np._NoValue, *,
where=np._NoValue):
"""
Compute the arithmetic mean along the specified axis.

Returns the average of the array elements. The average is taken over
the flattened array by default, otherwise over the specified axis.
`float64` intermediate and return values are used for integer inputs.

Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
axis : None or int or tuple of ints, optional
Axis or axes along which the means are computed. The default is to
compute the mean of the flattened array.



.. versionadded:: 1.7.0

If this is a tuple of ints, a mean is performed over multiple axes,


instead of a single axis or all the axes as before.
dtype : data-type, optional
Type to use in computing the mean. For integer inputs, the default
is `float64`; for floating point inputs, it is the same as the
input dtype.
out : ndarray, optional
Alternate output array in which to place the result. The default
is ``None``; if provided, it must have the same shape as the
expected output, but the type will be cast if necessary.
See :ref:`ufuncs-output-type` for more details.

keepdims : bool, optional


If this is set to True, the axes which are reduced are left
in the result as dimensions with size one. With this option,
the result will broadcast correctly against the input array.

If the default value is passed, then `keepdims` will not be


passed through to the `mean` method of sub-classes of
`ndarray`, however any non-default value will be. If the
sub-class' method does not implement `keepdims` any
exceptions will be raised.

where : array_like of bool, optional


Elements to include in the mean. See `~numpy.ufunc.reduce` for details.

.. versionadded:: 1.20.0

Returns
-------
m : ndarray, see dtype parameter above
If `out=None`, returns a new array containing the mean values,
otherwise a reference to the output array is returned.

See Also
--------
average : Weighted average
std, var, nanmean, nanstd, nanvar

Notes
-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.

Note that for floating-point input, the mean is computed using the
same precision the input has. Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below). Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.



By default, `float16` results are computed using `float32` intermediates
for extra precision.

Examples
--------
>>> a = np.array([[1, 2], [3, 4]])
>>> np.mean(a)
2.5
>>> np.mean(a, axis=0)
array([2., 3.])
>>> np.mean(a, axis=1)
array([1.5, 3.5])

In single precision, `mean` can be inaccurate:

>>> a = np.zeros((2, 512*512), dtype=np.float32)


>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.54999924

Computing the mean in float64 is more accurate:

>>> np.mean(a, dtype=np.float64)


0.55000000074505806 # may vary

Specifying a where argument:


>>> a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
>>> np.mean(a)
12.0
>>> np.mean(a, where=[[True], [False], [False]])
9.0

"""
kwargs = {}
if keepdims is not np._NoValue:
kwargs['keepdims'] = keepdims
if where is not np._NoValue:
kwargs['where'] = where
if type(a) is not mu.ndarray:
try:
mean = a.mean
except AttributeError:
pass
else:
return mean(axis=axis, dtype=dtype, out=out, **kwargs)

return _methods._mean(a, axis=axis, dtype=dtype,


out=out, **kwargs)
File: ~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-
˓→packages/numpy/core/fromnumeric.py

Type: function


? can also be used to search in the IPython namespace. In doing so, a series of characters can be represented with the
wildcard (*). For example, to get a list of all functions in the top-level NumPy namespace that contain mean:

[11]: np.*mean*?

np.mean
np.nanmean

2.1.3 IPython magic

IPython not only enables Python to be used interactively, but also extends the Python syntax with so-called magic
commands, which are provided with the prefix %. They are designed to quickly and easily solve common data analysis
problems. A distinction is made between two different types of magic commands:
• line magics, denoted by a single % prefix, that run on a single input line
• cell magics which are preceded by a double symbol %% and which are executed within a notebook cell.
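As a quick illustration of the difference, the %time line magic times a single statement, while the %%time cell magic times a whole cell (a minimal sketch):

%time sum(range(100_000))

%%time
total = 0
for n in range(100_000):
    total += n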

Execute external code: %run

If you start developing larger code, you will likely be working in both IPython for interactive exploration and a text
editor to save code that you want to reuse. With the %run magic you can execute this code directly in your IPython
session.
Imagine you created a myscript.py file with the following content:

def square(x):
return x ** 2

for N in range(1, 4):


print(N, "squared is", square(N))

[1]: %run myscript.py


1 squared is 1
2 squared is 4
3 squared is 9

Note that after running this script, all of the functions defined in it will be available for use in your IPython session:

[2]: square(4)
[2]: 16

There are several ways you can improve the way your code runs. As usual, you can display the documentation in
IPython with %run?.
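%run also accepts options; for example, -t reports timing information for the script and -d runs it under the debugger (shown here only as a sketch with the file created above):

%run -t myscript.py
%run -d myscript.py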


Run timing code: %timeit

Another example of a Magic function is %timeit, which automatically determines the execution time of the following
one-line Python statement. So we can e.g. output the performance of a list comprehension with:

[3]: %timeit L = [n ** 2 for n in range(1000)]


3.17 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The advantage of %timeit is that short commands automatically run multiple runs to get more robust results. For multi-
line instructions, adding a second % character creates cell magic that can process multiple input lines. For example,
here is the equivalent construction using a for loop:

[4]: %%timeit
L = []
for n in range(1000):
L.append(n ** 2)
3.31 ms ± 855 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can see that the list comprehension is slightly faster here than its equivalent with a for loop. We then describe performance measurements and optimisations in more detail in Profiling.
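The number of loops and repetitions can also be set explicitly with the -n and -r options of %timeit, for example:

%timeit -n 100 -r 5 L = [n ** 2 for n in range(1000)]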

Execute code from other interpreters

IPython has a %%script script magic with which you can execute a cell in a subprocess of an interpreter on your
system, e.g. bash, ruby, perl, zsh, R etc. This can also be its own script that expects input in stdin. To do this,
simply pass a path or a shell command to the program that is specified in the %%script line. The rest of the cell is
executed by this script, capturing stdout and stderr from the subprocess and displaying them.

[1]: %%script python2


import sys
print 'Python %s' % sys.version
Python 2.7.15 (default, Oct 22 2018, 19:33:46)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]

[2]: %%script python3


import sys
print('Python: %s' % sys.version)
Python: 3.7.0 (default, Aug 22 2018, 15:22:29)
[Clang 8.0.0 (clang-800.0.42.1)]

[3]: %%ruby
puts "Ruby #{RUBY_VERSION}"
Ruby 2.3.7

[4]: %%bash
echo "$BASH"
/bin/bash

You can capture stdout and err from these sub-processes in Python variables:


[5]: %%bash --out output --err error


echo "stdout"
echo "stderr" >&2

[6]: print(error)
print(output)
stderr

stdout

Configure standard script magic

The list of aliases for the script magic is configurable. By default, some common interpreters can be used if necessary.
However, you can also specify your own interpreter in ipython_config.py:

c.ScriptMagics.scripts = ['R', 'pypy', 'myprogram']


c.ScriptMagics.script_paths = {'myprogram': '/path/to/myprogram'}

Help functions: ?, %magic and %lsmagic

Like normal Python functions, the IPython magic functions have docstrings that can be easily accessed. E.g. to read
the documentation of the %timeit magic, just type:

[5]: %timeit?

Documentation for other functions can be accessed in a similar manner. To access a general description of the %magic
functions available, including some examples, you can type:

[6]: %magic

For a quick list of all available magic functions, type:

[7]: %lsmagic
[7]: Available line magics:
%alias %alias_magic %autoawait %autocall %automagic %autosave %bookmark %cat %cd␣
˓→ %clear %colors %conda %config %connect_info %cp %debug %dhist %dirs %doctest_
˓→mode %ed %edit %env %gui %hist %history %killbgscripts %ldir %less %lf %lk
˓→%ll %load %load_ext %loadpy %logoff %logon %logstart %logstate %logstop %ls
˓→%lsmagic %lx %macro %magic %man %matplotlib %mkdir %more %mv %notebook %page␣
˓→ %pastebin %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %pip %popd %pprint
˓→%precision %prun %psearch %psource %pushd %pwd %pycat %pylab %qtconsole
˓→%quickref %recall %rehashx %reload_ext %rep %rerun %reset %reset_selective %rm␣
˓→ %rmdir %run %save %sc %set_env %store %sx %system %tb %time %timeit
˓→%unalias %unload_ext %who %who_ls %whos %xdel %xmode

Available cell magics:


%%! %%HTML %%SVG %%bash %%capture %%debug %%file %%html %%javascript %%js %
˓→%latex %%markdown %%perl %%prun %%pypy %%python %%python2 %%python3 %%ruby %
˓→%script %%sh %%svg %%sx %%system %%time %%timeit %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

You can also simply define your own magic functions. For more information, see Defining custom magics.
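A minimal sketch of such a self-defined magic, registered directly in a running IPython session (the magic name %greet is purely illustrative):

from IPython.core.magic import register_line_magic

@register_line_magic
def greet(line):
    """Trivial example magic that echoes its argument."""
    return f"Hello {line}!"

%greet world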

2.1.4 Shell commands in IPython

The IPython Notebook allows simple UNIX/Linux commands to be executed in a single input cell. There are no limits, but keep in mind that, in contrast to a regular UNIX/Linux shell, you have to start each shell command with a !, for example !ls for the command ls (see below for further explanations about the command). Furthermore, each shell command is executed in its own subshell. For this reason, the results of previous shell commands are not available to you.
To begin with, the command ls lists the files in the current working directory. The output is shown below the input cell and lists the single file shell.ipynb:

[1]: !ls
shell.ipynb

The command !pwd displays the path to the working directory:

[2]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/basics/ipython

The command !echo outputs the text given as a parameter to the echo command. The example below demonstrates how to print Hello world:

[3]: !echo "Hello world!"


Hello world!

Passing values to and from the shell

There is a clever way through which you can access the output of a UNIX/Linux command as a variable in Python.
Assign the output of a UNIX/Linux command to a variable as follows:

[4]: contents = !ls

Here the Python variable contents has been assigned the output of the command ls. As a result, contents is a list,
where each list element corresponds to a line in the output. With the print command you output the list contents:

[5]: print(contents)
['create-delete.ipynb', 'file-system.ipynb', 'grep-find.ipynb', 'index.rst', 'pipes-
˓→filters.ipynb', 'regex.ipynb', 'shell-variables.ipynb']

You will see the same result below when executing the pwd command. The current directory is stored in the variable
directory:

[6]: directory = !pwd


[7]: print(directory)
['/Users/veit/cusy/trn/Python4DataScience/docs/basics/ipython']
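Strictly speaking, the object returned by such an assignment is an IPython SList, which offers a few convenience methods on top of a normal list. A small sketch (grep and fields are part of IPython's SList API):

contents = !ls
print(contents.grep("ipynb"))   # only the lines containing "ipynb"
print(contents.fields(0))       # first whitespace-separated column of each line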

2.1.5 Unix shell

Any command on the command line will also work in Jupyter Notebooks if prefixed with !. The results can then
interact with the Jupyter namespace, see Passing values to and from the shell.

Navigate through files and directories

First let us find out where we are by running a command called pwd:

[1]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython/unix-shell

Here, the response is the IPython chapter of the Python for Data Science tutorial in my home directory /Users/veit.
On Windows the home directory will look like C:\Documents and Settings\veit or C:\Users\veit and on
Linux like /home/veit.
To see the contents of our directory, we can use ls:

[2]: !ls
create-delete.ipynb index.rst shell-variables.ipynb
file-system.ipynb pipes-filters.ipynb
grep-find.ipynb regex.ipynb

• a trailing / indicates a directory


• @ indicates a link
• * indicates an executable
Depending on your default options, the shell might also use colors to indicate whether an entry is a file or a directory.

ls options and arguments

[3]: !ls -F ../


debugging.ipynb extensions.rst magics.ipynb start.rst
display.ipynb importing.ipynb mypackage/ unix-shell/
dvc.list index.rst shell.ipynb
length.txt sorted-length.txt

ls is the command, with the option -F and the argument ../.


• Options either start with a single dash (-) or two dashes (--), and they change the behavior of a command.
• Arguments tell the command what to operate on.
• Options and arguments are sometimes also referred as parameters.
• Each part is separated by spaces.


• Also, capitalisation is important, e.g.


– ls -s will display the size of files and directories alongside the names,
– while ls -S will sort the files and directories by size.

[4]: !ls -s
total 70384
24 create-delete.ipynb 16 pipes-filters.ipynb
70288 file-system.ipynb 8 regex.ipynb
24 grep-find.ipynb 16 shell-variables.ipynb
8 index.rst

[5]: !ls -S
file-system.ipynb pipes-filters.ipynb index.rst
grep-find.ipynb shell-variables.ipynb
create-delete.ipynb regex.ipynb

Show all options and arguments

ls comes with a lot of other useful options. Using man you can print out the built-in manual page for the desired UNIX/Linux command:

[6]: !man ls

LS(1) BSD General Commands Manual LS(1)

NAME
ls -- list directory contents

SYNOPSIS
ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1%] [file ...]

DESCRIPTION
For each operand that names a file of a type other than directory, ls
displays its name as well as any requested, associated information. For
each operand that names a file of type directory, ls displays the names
of files contained within that directory, as well as any requested, asso-
ciated information.

If no operands are given, the contents of the current directory are dis-
played. If more than one operand is given, non-directory operands are
displayed first; directory and non-directory operands are sorted sepa-
rately and in lexicographical order.

The following options are available:

-@ Display extended attribute keys and sizes in long (-l) output.

-1 (The numeric digit ``one''.) Force output to be one entry per


line. This is the default when output is not to a terminal.

...
BSD May 19, 2002 BSD

Illegal options

If you try to use an option that isn't supported, the commands will usually print an error message, for example:

[7]: !ls -z
ls: illegal option -- z
usage: ls [-@ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1%] [file ...]

Hidden Files

With the -a option you can display all files:

[8]: !ls -a
. file-system.ipynb regex.ipynb
.. grep-find.ipynb shell-variables.ipynb
.ipynb_checkpoints index.rst
create-delete.ipynb pipes-filters.ipynb

In addition to the hidden directories .. and ., you may also see a directory called .ipynb_checkpoints. This directory usually contains snapshots of the Jupyter notebooks.

Show directory tree

The command tree lists the contents of directories in a tree-like format.

[9]: !tree
.
create-delete.ipynb
file-system.ipynb
grep-find.ipynb
index.rst
pipes-filters.ipynb
regex.ipynb
shell-variables.ipynb

0 directories, 7 files


Change directory

At first it may seem irritating to some that they cannot use !cd to change to another directory.

[10]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython/unix-shell

[11]: !cd ..

[12]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython/unix-shell

The reason for this is that Jupyter uses a temporary subshell. If you want to change to another directory permanently,
you have to use the magic command %cd.

[13]: %cd ..
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython

[14]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython

With the %automagic function, these can also be used without the preceding % character:

[17]: %automagic

Automagic is ON, % prefix IS NOT needed for line magics.

[18]: cd ..
/Users/veit/cusy/trn/Python4DataScience/docs/workspace

Absolute and relative Paths

[19]: cd .
/Users/veit/cusy/trn/Python4DataScience/docs/workspace

[20]: cd ../..
/Users/veit/cusy/trn/Python4DataScience

[21]: cd ..
/Users/veit/cusy/trn

[22]: cd /
/


[23]: cd
/Users/veit

[24]: cd ~
/Users/veit

[25]: cd /Users/veit
/Users/veit

Create, update and delete files and directories

Create a new directory tests and then check this with ls:

[1]: !mkdir tests

[2]: !ls
create-delete.ipynb importing.ipynb shell-variables.ipynb
debugging.ipynb index.rst shell.ipynb
display.ipynb length.txt sorted-length.txt
dvc.list magics.ipynb start.rst
extensions.rst mypackage tests
file-system.ipynb pipes-filters.ipynb
grep-find.ipynb regex.ipynb

Then we create the file test_file.txt in this directory.

[3]: !touch tests/test_file.txt

[4]: !ls tests


test_file.txt

Now we change the suffix of the file:

[5]: !mv tests/test_file.txt tests/test_file.py

[6]: !ls tests


test_file.py

Now we make a copy of this file:

[7]: !cp tests/test_file.py tests/test_file2.py

[8]: !ls tests


test_file.py test_file2.py

A directory with all the files it contains can also be copied recursively with the -r option:


[9]: !cp -r tests tests.bak

[10]: !ls tests.bak


test_file.py test_file2.py

Finally, we delete the directories tests and tests.bak again:

[11]: !rm -r tests tests.bak

[12]: !ls
create-delete.ipynb grep-find.ipynb pipes-filters.ipynb
debugging.ipynb importing.ipynb regex.ipynb
display.ipynb index.rst shell-variables.ipynb
dvc.list length.txt shell.ipynb
extensions.rst magics.ipynb sorted-length.txt
file-system.ipynb mypackage start.rst

Transferring files

wget

[13]: !wget https://dvc.org/deb/dvc.list


--2021-05-14 20:35:44-- https://dvc.org/deb/dvc.list
Auflösen des Hostnamens dvc.org (dvc.org)... 2606:4700:3036::6815:51cd, 2606:4700:3033::
˓→ac43:a44c, 172.67.164.76, ...

Verbindungsaufbau zu dvc.org (dvc.org)|2606:4700:3036::6815:51cd|:443 ... verbunden.


HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 303 See Other
Platz: https://s3-us-east-2.amazonaws.com/dvc-s3-repo/deb/dvc.list [folgend]
--2021-05-14 20:35:45-- https://s3-us-east-2.amazonaws.com/dvc-s3-repo/deb/dvc.list
Auflösen des Hostnamens s3-us-east-2.amazonaws.com (s3-us-east-2.amazonaws.com)... 52.
˓→219.97.209

Verbindungsaufbau zu s3-us-east-2.amazonaws.com (s3-us-east-2.amazonaws.com)|52.219.97.


˓→209|:443 ... verbunden.

HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 200 OK


Länge: 51 [binary/octet-stream]
Wird in »dvc.list.3« gespeichert.

dvc.list.3 100%[===================>] 51 --.-KB/s in 0s

2021-05-14 20:35:45 (1,68 MB/s) - »dvc.list.3« gespeichert [51/51]

• -r recursively crawls other files and directories
• -np avoids crawling to parent directories
• -D targets only the following domain name
• -nH avoids creating a subdirectory for the website's content
• -m mirrors with time stamping, infinite recursion depth, and preservation of FTP directory settings
• -q suppresses the output to the screen
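Combining these options, a hypothetical mirroring call might look like this (the URL is only a placeholder based on the example above):

!wget -m -np -nH -D dvc.org -q https://dvc.org/deb/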

cURL

Alternatively, you can use cURL, which supports a much larger range of protocols.

[14]: !curl -o dvc.list https://dvc.org/deb/dvc.list


% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 85 100 85 0 0 159 0 --:--:-- --:--:-- --:--:-- 159

Pipes and filters

ls shows all files and directories at this point.

[1]: !ls
create-delete.ipynb index.rst shell-variables.ipynb
file-system.ipynb pipes-filters.ipynb
grep-find.ipynb regex.ipynb

With *.rst we restrict the results to all files with the suffix .rst:

[2]: !ls *.rst


index.rst

We can also output only the number of lines, words and characters in these documents:

[3]: !wc *.rst


19 49 466 index.rst

Now we write the number of characters to the file length.txt and then output the text with cat:

[4]: !wc -m *.rst > length.txt

[5]: !cat length.txt


466 index.rst

We can also have the files sorted by the number of characters:

[6]: !sort -n length.txt


466 index.rst

[7]: !sort -n length.txt > sorted-length.txt

We can also overwrite the existing file:

[8]: !sort -n length.txt > length.txt

If we only want to know the total number of characters, i.e. only output the last line, we can do this with tail:


[9]: !tail -n 1 length.txt

> is used to overwrite a file while >> is used to append to a file.

[10]: !echo amount of characters >> length.txt

[11]: !cat length.txt


amount of characters

Pipe |

You can connect commands with a pipe (|). In the following one-liner, we want to display the number of lines for the file with the fewest lines:

[12]: !wc -l *.rst | sort -n | head


19 index.rst

If we want to display the first lines of the main text (without the first three lines for the title):

[13]: !cat index.rst | head -n 5 | tail -n 2


Any command on the command line will also work in Jupyter Notebooks if prefixed
with ``!``. The results can then interact with the Jupyter namespace, see

grep and find

grep

grep finds and prints lines in files that match a regular expression. In the following example, we search for the string
Python:

[1]: !grep Python ../index.rst


IPython
`IPython <https://ipython.org/>`_, or *Interactive Python*, was initially an
advanced Python interpreter that has now grown into an extensive project
Today, IPython is not only an interactive interface to Python, but also offers a
number of useful syntactic additions for the language. In addition, IPython is
* `Miki Tebeka - IPython: The Productivity Booster

The option -w limits the matches to the word boundaries so that IPython is ignored:

[2]: !grep -w Python ../index.rst


`IPython <https://ipython.org/>`_, or *Interactive Python*, was initially an
advanced Python interpreter that has now grown into an extensive project
Today, IPython is not only an interactive interface to Python, but also offers a

-n shows the line numbers that match:


[3]: !grep -n -w Python ../index.rst


4:`IPython <https://ipython.org/>`_, or *Interactive Python*, was initially an
5:advanced Python interpreter that has now grown into an extensive project
7:Today, IPython is not only an interactive interface to Python, but also offers a

-v inverts our search

[4]: !grep -n -v "^ " ../index.rst


1:IPython
2:=======
3:
4:`IPython <https://ipython.org/>`_, or *Interactive Python*, was initially an
5:advanced Python interpreter that has now grown into an extensive project
6:designed to provide tools for the entire life cycle of research computing.
7:Today, IPython is not only an interactive interface to Python, but also offers a
8:number of useful syntactic additions for the language. In addition, IPython is
9:closely related to the `Jupyter project <https://jupyter.org/>`_.
10:
11:.. seealso::
14:
15:.. toctree::
19:

grep has lots of other options. To find out what they are, you can type:

[5]: !grep --help


usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
[-e pattern] [-f file] [--binary-files=value] [--color=when]
[--context[=num]] [--directories=action] [--label] [--line-buffered]
[--null] [pattern] [file ...]

In the following example we use the -E option and put the pattern in quotes to prevent the shell from trying to interpret
it. The ^ in the pattern anchors the match to the start of the line and the . matches a single character.

[6]: !grep -n -E "^.Python" ../index.rst


1:IPython

find

find searches in the given directory – here the parent directory .. – whereby the search is restricted to directories with -type d.

[7]: !find .. -type d


..
../mypackage
../.ipynb_checkpoints
../unix-shell
../unix-shell/.ipynb_checkpoints

With -type f the search is restricted to files.


[8]: !find . -type f


./regex.ipynb
./index.rst
./create-delete.ipynb
./file-system.ipynb
./shell-variables.ipynb
./grep-find.ipynb
./pipes-filters.ipynb

With -mtime the search is limited to the last X days, in our example to the last day:

[9]: !find . -mtime -1


.
./regex.ipynb
./.ipynb_checkpoints
./.ipynb_checkpoints/grep-find-checkpoint.ipynb
./grep-find.ipynb
./pipes-filters.ipynb

With -name you can filter the search by name.

[10]: !find .. -name "*.rst"


../index.rst
../extensions.rst
../unix-shell/index.rst
../start.rst

Now we count the characters in the files with the suffix .rst:

[11]: !wc -c $(find .. -name "*.rst")


833 ../index.rst
2097 ../extensions.rst
466 ../unix-shell/index.rst
1138 ../start.rst
4534 insgesamt

It is also possible to search for a regular expression in these files:

[12]: !grep "ipython.org" $(find .. -name "*.rst")


../index.rst:`IPython <https://ipython.org/>`_, or *Interactive Python*, was initially an

Finally, we filter out all results whose path contains ipynb_checkpoints:

[13]: !find . -name "*.ipynb" | grep -v ipynb_checkpoints


./regex.ipynb
./create-delete.ipynb
./file-system.ipynb
./shell-variables.ipynb
./grep-find.ipynb
./pipes-filters.ipynb


Shell variables

Display of all shell variables

[1]: !set
BASH=/usr/local/bin/bash
BASHOPTS=checkwinsize:cmdhist:complete_fullquote:extquote:force_fignore:globasciiranges:
˓→hostcomplete:interactive_comments:progcomp:promptvars:sourcepath

BASH_ALIASES=()
BASH_ARGC=()
BASH_ARGV=()
BASH_CMDS=()
BASH_EXECUTION_STRING=set
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="5" [1]="0" [2]="16" [3]="1" [4]="release" [5]="x86_64-apple-darwin19.
˓→3.0")

BASH_VERSION='5.0.16(1)-release'
CLICOLOR=1
CPPFLAGS=-I/usr/local/opt/openblas/include
DIRSTACK=()
DISPLAY=/private/tmp/com.apple.launchd.WfrReZHn7e/org.macosforge.xquartz:0
...

Showing the value of a variable

[2]: !echo $HOME


/Users/veit

The path variable

It defines the shell’s search path, i.e., the list of directories that the shell looks in for runnable programs.

[1]: !echo $PATH


/Users/veit/.local/share/virtualenvs/Python4DataScience-G-MBNaSt/bin:/Users/veit/Library/
˓→Python/3.7/bin:/Users/veit/spack/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/

˓→Library/TeX/texbin:/usr/local/MacGPG2/bin:/opt/X11/bin:/Library/Apple/usr/bin

Creating and changing variables

Creating or overwriting variables

[3]: !export SPACK_ROOT=~/spack


Append additional specifications

[5]: !export PATH=/usr/local/opt/python@3.8/bin:$PATH

2.1.6 Show objects with display

IPython can display objects such as HTML, JSON, PNG, JPEG, SVG and LaTeX.
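For JSON, for example, there is a corresponding display class (a short sketch; in JupyterLab the object is shown as an interactive, collapsible tree):

from IPython.display import JSON

JSON({"library": "IPython", "formats": ["HTML", "JSON", "PNG", "SVG"]})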

Images

To display images (JPEG, PNG) in IPython and notebooks, you can use the Image class:

[1]: from IPython.display import Image


Image('https://www.python.org/images/python-logo.gif')
[1]: <IPython.core.display.Image object>

[2]: from IPython.display import SVG


SVG('https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg')
[2]:

Non-embedded images

• By default, image data is embedded:

  Image('img_url')

• However, if the url is given as kwarg, this is interpreted as a soft link:

  Image(url='img_url')

• embed can also be specified explicitly:

  Image(url='img_url', embed=True)

HTML

Python objects can declare HTML representations to be displayed in a notebook:

[3]: from IPython.display import HTML

[4]: %%html
<ul>
<li>foo</li>
<li>bar</li>
</ul>
<IPython.core.display.HTML object>


Javascript

With notebooks, objects can also declare a JavaScript representation. This enables e.g. data visualisations with
Javascript libraries like d3.js.

[5]: from IPython.display import Javascript


welcome = Javascript('alert("This is an example of a JavaScript alert displayed by IPython.")');

display(welcome)
<IPython.core.display.Javascript object>

For more extensive Javascript you can also use the %%javascript syntax.

LaTeX

IPython.display also has built-in support for displaying mathematical expressions set in LaTeX and rendered in the
browser with MathJax:

[6]: from IPython.display import Math


Math(r'F(k) = \int_{-\infty}^{\infty} f(x) e^{2\pi i k} dx')
[6]: $F(k) = \int_{-\infty}^{\infty} f(x) e^{2\pi i k}\, dx$

With the Latex class you have to specify the limits yourself. In this way, however, you can also use other LaTeX modes,
such as eqnarray:

[7]: from IPython.display import Latex


Latex(r"""\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t}␣
˓→& = \frac{4\pi}{c}\vec{\mathbf{j}} \\

\end{eqnarray}""")
[7]: $\nabla \times \vec{\mathbf{B}} - \frac{1}{c}\, \frac{\partial\vec{\mathbf{E}}}{\partial t} = \frac{4\pi}{c}\vec{\mathbf{j}}$   (2.1)

Audio

IPython also enables interactive work with sounds. With the display.Audio class you can create an audio control
that is embedded in the notebook. The interface is analogous to that of the Image class. All audio formats supported
by the browser can be used.

[8]: from IPython.display import Audio

In the following we will output the sine function of a NumPy array as an audio signal. The Audio class normalises and
codes the data and embeds the resulting audio in the notebook.


[9]: import numpy as np


f = 500.0
rate = 8000.0
L = 3
times = np.linspace(0, L, int(rate * L))  # the number of samples must be an integer
signal = np.sin(f * times)

Audio(data=signal, rate=rate)
[9]: <IPython.lib.display.Audio object>

Links to local files

IPython has built-in classes for generating links to local files. To do this, create a link to a single file with the FileLink
object:

[10]: from IPython.display import FileLink, FileLinks


FileLink('magics.ipynb')
[10]: /Users/veit/cusy/trn/Python4DataScience/docs/basics/ipython/magics.ipynb

Alternatively, you can generate a list with links to all files in a directory, e.g.:

[11]: FileLinks ('.')


[11]: ./
index.rst
profiling.ipynb
debugging.ipynb
mprun_demo.py
magics.ipynb
shell.ipynb
display.ipynb
myscript.py
start.rst

Display notebooks

[12]: import os, sys, types


import nbformat

[13]: from pygments import highlight


from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

from IPython.display import display, HTML

formatter = HtmlFormatter()
lexer = PythonLexer()



# publish the CSS for pygments highlighting
display(HTML("""
<style type='text/css'>
%s
</style>
""" % formatter.get_style_defs()
))
<IPython.core.display.HTML object>

[14]: def show_notebook(fname):


"""display a short summary of the cells of a notebook"""
nb = nbformat.read(fname, as_version=4)
html = []
for cell in nb.cells:
html.append("<h4>%s cell</h4>" % cell.cell_type)
if cell.cell_type == 'code':
html.append(highlight(cell.source, lexer, formatter))
else:
html.append("<pre>%s</pre>" % cell.source)
display(HTML('\n'.join(html)))

show_notebook(os.path.join("mypackage/foo.ipynb"))
<IPython.core.display.HTML object>

2.1.7 foo.ipynb

[1]: def bar():


return "bar"

[2]: def has_ip_syntax():


listing = !ls
return listing

[3]: def whatsmyname():


return __name__

2.1.8 Import notebooks

To be able to develop more modularly, the import of notebooks is necessary. However, since notebooks are not Python
files, they are not easy to import. Fortunately, Python provides some hooks for the import so that IPython notebooks
can eventually be imported.

[1]: import os, sys, types

[2]: import nbformat

from IPython import get_ipython


from IPython.core.interactiveshell import InteractiveShell


Import hooks usually have two objects:


• Module Loader that takes a module name (e.g. IPython.display) and returns a module
• Module Finder, which finds out if a module is present and tells Python which loader to use
But first, let's write a method that finds a notebook from its fully qualified name and an optional path. For example, mypackage.foo becomes mypackage/foo.ipynb, and Foo_Bar is also tried as Foo Bar if Foo_Bar.ipynb doesn't exist.

[3]: def find_notebook(fullname, path=None):


name = fullname.rsplit('.', 1)[-1]
if not path:
path = ['']
for d in path:
nb_path = os.path.join(d, name + ".ipynb")
if os.path.isfile(nb_path):
return nb_path
# let import Foo_Bar find "Foo Bar.ipynb"
nb_path = nb_path.replace("_", " ")
if os.path.isfile(nb_path):
return nb_path

Notebook Loader

The Notebook Loader does the following three steps:


1. Load the notebook document into memory
2. Create an empty module
3. Execute every cell in the module namespace
Because IPython cells can have an extended syntax, transform_cell converts each cell to pure Python code
before executing it.
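To see what this transformation does, you can apply it to a line of IPython syntax yourself (a small sketch; the exact output string may vary between IPython versions):

from IPython.core.interactiveshell import InteractiveShell

shell = InteractiveShell.instance()
# the !ls shell capture below is rewritten to plain Python
print(shell.input_transformer_manager.transform_cell("listing = !ls"))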

[4]: class NotebookLoader(object):


"""Module Loader for IPython Notebooks"""
def __init__(self, path=None):
self.shell = InteractiveShell.instance()
self.path = path

def load_module(self, fullname):


"""import a notebook as a module"""
path = find_notebook(fullname, self.path)

print ("importing notebook from %s" % path)

# load the notebook object


nb = nbformat.read(path, as_version=4)

# create the module and add it to sys.modules


# if name in sys.modules:
# return sys.modules[name]
mod = types.ModuleType(fullname)
mod.__file__ = path


mod.__loader__ = self
mod.__dict__['get_ipython'] = get_ipython
sys.modules[fullname] = mod

# extra work to ensure that magics that would affect the user_ns
# actually affect the notebook module's ns
save_user_ns = self.shell.user_ns
self.shell.user_ns = mod.__dict__

try:
for cell in nb.cells:
if cell.cell_type == 'code':
# transform the input to executable Python
code = self.shell.input_transformer_manager.transform_cell(cell.source)
# run the code in the module
exec(code, mod.__dict__)
finally:
self.shell.user_ns = save_user_ns
return mod

Notebook Finder

The Finder is a simple object that indicates whether a notebook can be imported based on its file name and that returns
the appropriate loader.

[5]: class NotebookFinder(object):


"""Module Finder finds the transformed IPython Notebook"""
def __init__(self):
self.loaders = {}

def find_module(self, fullname, path=None):


nb_path = find_notebook(fullname, path)
if not nb_path:
return

key = path
if path:
# lists aren’t hashable
key = os.path.sep.join(path)

if key not in self.loaders:


self.loaders[key] = NotebookLoader(path)
return self.loaders[key]


Register hook

Now we register NotebookFinder with sys.meta_path:

[6]: sys.meta_path.append(NotebookFinder())

Check

Now our notebook mypackage/foo.ipynb should be importable with:

[7]: from mypackage import foo


importing notebook from /Users/veit/cusy/trn/Python4DataScience/docs/basics/ipython/
˓→mypackage/foo.ipynb

Is the Python method bar being executed?

[8]: foo.bar()
[8]: 'bar'

. . . and the IPython syntax?

[9]: foo.has_ip_syntax()
[9]: ['debugging.ipynb',
'display.ipynb',
'importing.ipynb',
'index.rst',
'magics.ipynb',
'mprun_demo.py',
'mypackage',
'myscript.py',
'profiling.ipynb',
'shell.ipynb',
'show.ipynb',
'start.rst']

Reusable import hook

The import hook can also easily be executed in other notebooks with

[10]: %run importing.ipynb


importing notebook from /Users/veit/cusy/trn/Python4DataScience/docs/basics/ipython/
˓→mypackage/foo.ipynb


2.1.9 IPython extensions

IPython extensions are Python modules that change the behavior of the shell. They are identified by an importable
module name and are usually located in .ipython/extensions/.
Some important extensions are already included in IPython: autoreload and storemagic. You can find other extensions in the Extensions Index or on PyPI with the IPython tag.
See also:
• IPython extensions docs

Use extensions

The %load_ext magic can be used to load extensions while IPython is running.

%load_ext myextension

Alternatively, an extension can also be loaded each time IPython is started by listing it in the IPython configuration file:

c.InteractiveShellApp.extensions = [
'myextension'
]

If you haven’t created an IPython configuration file yet, you can do this with:

$ ipython profile create [profilename]

If no profile name is given, default is used. The file is usually created in ~/.ipython/profile_default/ and named depending on the purpose: ipython_config.py is used for all IPython commands, while ipython_notebook_config.py is only used for commands in IPython notebooks.

Writing IPython extensions

An IPython extension is an importable Python module that has special functions for loading and unloading:

def load_ipython_extension(ipython):
    # The `ipython` argument is the currently active `InteractiveShell`
    # instance, which can be used in any way. This allows you to register
    # new magics or aliases, for example.

def unload_ipython_extension(ipython):
    # If you want your extension to be unloadable, put that logic here.
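
For illustration, a minimal extension might look like the following sketch. The module name mymagics and the magic hello are invented for this example; register_magic_function is part of the InteractiveShell API:

# mymagics.py – a hypothetical extension module
def hello(line):
    """A minimal line magic: %hello NAME"""
    return "Hello {}!".format(line or "world")

def load_ipython_extension(ipython):
    # called by %load_ext mymagics: register `hello` as a line magic
    ipython.register_magic_function(hello, 'line', 'hello')

def unload_ipython_extension(ipython):
    # nothing to clean up in this sketch
    pass

Once the module is on the import path (for example in .ipython/extensions/), %load_ext mymagics would make %hello available in the session.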

See also:
• Defining custom magics


2.1.10 Debugging

IPython contains various tools for analysing faulty code, essentially exception reporting and the debugger.

Check exceptions with %xmode

If the execution of a Python script fails, an exception is usually thrown and relevant information about the cause of
the error is written to a traceback. With the %xmode magic function you can control the amount of information that is
displayed in IPython. Let’s look at the following code for this:

[1]: def func1(a, b):
         return a / b

     def func2(x):
         a = x
         b = x - 1
         return func1(a, b)

[2]: func2(1)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-2-7cb498ea7ed1> in <module>
----> 1 func2(1)

<ipython-input-1-586ccabd0db3> in func2(x)
5 a = x
6 b = x - 1
----> 7 return func1(a, b)

<ipython-input-1-586ccabd0db3> in func1(a, b)
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x

ZeroDivisionError: division by zero

Calling func2 leads to an error and the traceback shows exactly what happened: each line shows the context of each
step that ultimately led to the error. With the %xmode magic function (short for exception mode) we can control which
information should be displayed to us.
%xmode takes a single argument, the mode, and there are three options:
• Plain
• Context
• Verbose
The default setting is Context and outputs something like the one above. Plain is more compact and provides less
information:

[3]: %xmode Plain


func2(1)
Exception reporting mode: Plain
Traceback (most recent call last):



File "<ipython-input-3-2c0c65acd0a8>", line 2, in <module>
func2(1)

File "<ipython-input-1-586ccabd0db3>", line 7, in func2


return func1(a, b)

File "<ipython-input-1-586ccabd0db3>", line 2, in func1


return a / b

ZeroDivisionError: division by zero

The Verbose mode shows some additional information, including the arguments for any functions being called:

[4]: %xmode Verbose


func2(1)
Exception reporting mode: Verbose
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-4-180acea4108b> in <module>
1 get_ipython().run_line_magic('xmode', 'Verbose')
----> 2 func2(1)
global func2 = <function func2 at 0x106312a60>

<ipython-input-1-586ccabd0db3> in func2(x=1)
5 a = x
6 b = x - 1
global func1 = <function func1 at 0x1063129d8>
a = 1
----> 7 return func1(a, b)
b = 0

<ipython-input-1-586ccabd0db3> in func1(a=1, b=0)


1 def func1(a, b):
a = 1
----> 2 return a / b
b = 0
3
4 def func2(x):
5 a = x

ZeroDivisionError: division by zero

This additional information can help narrow down the reason for the exception. Conversely, however, the Verbose
mode can lead to extremely long tracebacks in the case of complex code, in which the essential points can hardly be
recognized.


Debugging with %debug

Debugging can help if an error cannot be found by reading a traceback. The Python standard for interactive debugging
is the Python debugger pdb. You can use it to navigate your way through the code line by line to see what is possibly
causing an error. The extended version for IPython is ipdb.
In IPython, the %debug magic command is perhaps the most convenient way to debug. If you call it after an exception
has been thrown, an interactive debug prompt opens automatically at the point of the exception. At the ipdb prompt
you can inspect the current state of the stack, examine the available variables and even run Python commands.
Let’s look at the last exception, then do some basic tasks:
[5]: %debug
> <ipython-input-1-586ccabd0db3>(2)func1()
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x

ipdb> print(a)
1
ipdb> print(b)
0
ipdb> quit

However, the interactive debugger does a lot more – we can also go up and down the stack and examine the values of
variables:

[6]: %debug
> <ipython-input-1-586ccabd0db3>(2)func1()
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x

ipdb> up
> <ipython-input-1-586ccabd0db3>(7)func2()
3
4 def func2(x):
5 a = x
6 b = x - 1
----> 7 return func1(a, b)

ipdb> print(x)
1
ipdb> up
> <ipython-input-4-180acea4108b>(2)<module>()
1 get_ipython().run_line_magic('xmode', 'Verbose')
----> 2 func2(1)

ipdb> down
> <ipython-input-1-586ccabd0db3>(7)func2()


3
4 def func2(x):
5 a = x
6 b = x - 1
----> 7 return func1(a, b)

ipdb> quit

This greatly simplifies the search for the function calls that led to the error.
If you want the debugger to start automatically whenever an exception is thrown, you can use the %pdb magic function to
enable this behavior:

[7]: %xmode Plain


%pdb on
func2(1)
Exception reporting mode: Plain
Automatic pdb calling has been turned ON
Traceback (most recent call last):

File "<ipython-input-7-f80f6b5cecf3>", line 3, in <module>


func2(1)

File "<ipython-input-1-586ccabd0db3>", line 7, in func2


return func1(a, b)

File "<ipython-input-1-586ccabd0db3>", line 2, in func1


return a / b

ZeroDivisionError: division by zero

> <ipython-input-1-586ccabd0db3>(2)func1()
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x

ipdb> print(b)
0
ipdb> quit

If you have a script that you want to run under the debugger from the start, you can do so with the command %run -d.
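
For example (assuming a script named myscript.py in the current working directory):

%run -d myscript.py

The debugger then stops at the beginning of the script, and from there you can step through it with the usual ipdb commands such as c(ontinue), n(ext) or s(tep).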


Essential commands of the ipdb

Command Description
list Show the current location in the file
h(elp) Display a list of commands or find help on a specific command
q(uit) Terminates the debugger and the program
c(ontinue) Exit the debugger, continue in the program
n(ext) Go to the next step in the program
<enter> Repeat the previous command
p(rint) Print variables
s(tep) Step into a subroutine
r(eturn) Return from a subroutine

Further information on the IPython debugger can be found at ipdb.

2.2 Jupyter

We have moved the Jupyter chapter to its own tutorial: Jupyter Tutorial.

2.3 NumPy

NumPy is the abbreviation for numeric Python. Many Python packages that provide scientific functions use NumPy’s
array objects as one of the standard interfaces for data exchange. In the following, I will give a brief overview of the
main functionality of NumPy:
• ndarray, an efficient multidimensional array that provides fast array-based operations, such as shuffling and
cleaning data, subgrouping and filtering, transformation and all other kinds of computations. There are also
flexible functions for broadcasting, i.e. evaluations of arrays of different sizes.
• Mathematical functions for fast operations on entire arrays of data, such as sorting, uniqueness and set operations;
conditional logic can be expressed as array expressions instead of loops with if-elif-else branches.
• Tools for reading and writing array data to disk and working with memory mapped files.
• Functions for linear algebra, random number generation and Fourier transform.
• A C API for connecting NumPy to libraries written in C, C++ or FORTRAN.

Note: This section introduces you to the basics of using NumPy arrays and should be sufficient to follow the rest of the
tutorial. For many data analytic applications, it is not necessary to have a deep understanding of NumPy, but mastering
array-oriented programming and thinking is an important step on the way to becoming a data scientist.

See also:
• Home
• Docs
• GitHub
• Tutorials


2.3.1 Introduction to NumPy

NumPy operations perform complex calculations on entire arrays without the need for Python for loops, which can be
slow for large sequences. NumPy’s speed is explained by its C-based algorithms, which avoid the overhead of Python
code. To give you an idea of the performance difference, we measure the difference between a NumPy array and a
Python list with a hundred thousand integers:

[1]: import numpy as np

[2]: myarray = np.arange(100000)


mylist = list(range(100000))

[3]: %time for _ in range(10): myarray2 = myarray ** 2


CPU times: user 0 ns, sys: 2.55 ms, total: 2.55 ms
Wall time: 2.28 ms

[4]: %time for _ in range(10): mylist2 = [x ** 2 for x in mylist]


CPU times: user 390 ms, sys: 25.5 ms, total: 416 ms
Wall time: 423 ms

2.3.2 ndarray – an N-dimensional array object

ndarray allows mathematical operations on whole blocks of data, using a syntax similar to the equivalent operations between
scalar elements. In NumPy, there are many different types for describing scalars, mostly based on types from the C
language and those compatible with Python.
See also:
• Array Scalars
Note:
Whenever this tutorial talks about array or ndarray, in most cases it refers to the ndarray object.

[1]: import numpy as np

[2]: py_list = [2020, 2021, 20222]


array_1d = np.array(py_list)

[3]: array_1d
[3]: array([ 2020, 2021, 20222])

Nested sequences, such as a list of lists of equal length, can be converted into a multidimensional array:

[4]: list_of_lists = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
array_2d = np.array(list_of_lists)

[5]: array_2d
[5]: array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])


Since list_of_lists was a list with three lists, the NumPy array array_2d has two dimensions whose shape is
derived from the data. With the attributes ndim and shape we can output the number of dimensions and the outline of
array_2d:

[6]: array_2d.ndim
[6]: 2

[7]: array_2d.shape
[7]: (3, 4)

To give you an idea of the syntax, I first create an array of random numbers with seven rows and three columns:

[9]: data = np.random.randn(7, 3)


data
[9]: array([[ 1.86803134, -0.43943077, -0.74740291],
[-0.58166534, -1.73057682, 0.7329245 ],
[ 0.57447586, 1.37888746, -1.12090191],
[-0.94361317, 1.08870533, -0.04399877],
[-0.4623089 , -1.35280962, 1.0035284 ],
[ 0.55217441, -0.91288685, -0.47402509],
[ 0.23595001, -0.46866188, 1.06009356]])

ndarray is a generic multidimensional container. Each array has a shape, a tuple that indicates the size of the
individual dimensions – above, shape returned the number of rows and columns of data.
In addition to np.array, there are a number of other functions for creating new arrays. zeros and ones, for example,
create arrays of zeros and ones, respectively, with a specific length or shape. empty creates an array without initialising
its values. To create a higher-dimensional array using these methods, pass a tuple for the shape:

[9]: np.zeros(4)
[9]: array([0., 0., 0., 0.])

[10]: np.ones((3,4))
[10]: array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])

[11]: np.empty((2,3,4))
[11]: array([[[1.02724706e-316, 0.00000000e+000, 6.90657088e-310,
6.90657147e-310],
[6.90657195e-310, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000],
[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000]],

[[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,


0.00000000e+000],
[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000],
[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000]]])


Note:
You may not safely assume that the np.empty function returns an array of zeros, as it returns uninitialised memory
and may therefore contain garbage values.
arange is an array-valued version of the built-in Python range function:

[12]: np.arange(4)
[12]: array([0, 1, 2, 3])

Other NumPy standard functions for creating arrays are:

Function            Description
array               converts input data (list, tuple, array or other sequence types) into an ndarray by either deriving a
                    dtype or explicitly specifying a dtype; by default, copies the input data into the array
asarray             converts the input to an ndarray, but does not copy if the input is already an ndarray
arange              like the Python built-in range, but returns an ndarray instead of a list
ones, ones_like     ones creates an array of 1s in the given shape and dtype; ones_like takes another array and creates
                    a ones array with the same shape and dtype
zeros, zeros_like   like ones and ones_like, but creates arrays with 0s instead
empty, empty_like   creates new arrays by allocating new memory, but does not fill them with values like ones and zeros
full, full_like     creates an array of the given shape and dtype, setting all values to the given fill value; full_like
                    takes another array and creates a filled array with the same shape and dtype
eye, identity       creates a square N × N identity matrix (1s on the diagonal and 0s elsewhere)

2.3.3 dtype

ndarray is a container for homogeneous data, i.e. all elements must be of the same type. Each array has a dtype, an
object that describes the data type of the array:

[8]: dt = data.dtype
dt
[8]: dtype('float64')

NumPy data types:


Type                              Type code      Description
int8, uint8                       i1, u1         Signed and unsigned 8-bit (1-byte) integer types
int16, uint16                     i2, u2         Signed and unsigned 16-bit (2-byte) integer types
int32, uint32                     i4, u4         Signed and unsigned 32-bit (4-byte) integer types
int64, uint64                     i8, u8         Signed and unsigned 64-bit (8-byte) integer types
float16                           f2             Standard floating point with half precision
float32                           f4 or f        Standard floating point with single precision; compatible with C float
float64                           f8 or d        Standard floating point with double precision; compatible with C double
                                                 and Python float object
complex64, complex128,            c8, c16, c32   Complex numbers represented by two 32-, 64- or 128-bit floating point
complex256                                       numbers respectively
bool                              ?              Boolean type that stores the values True and False
object                            O              Python object type; a value can be any Python object
string_                           S              ASCII string type with fixed length (1 byte per character); to create a
                                                 string type with length 7, for example, use S7; longer inputs are
                                                 truncated without warning
unicode_                          U              Unicode type with fixed length where the number of bytes is
                                                 platform-specific; uses the same specification semantics as string_,
                                                 e.g. U7
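
The type codes from the table can be used to set the dtype explicitly when creating an array – a small sketch with arbitrary values, in which all three variants request 64-bit floats:

np.array([0, 1, 2], dtype='f8')
np.array([0, 1, 2], dtype=np.float64)
np.arange(3, dtype='float64')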

Determine the size of a single array element in bytes with itemsize:

[9]: dt.itemsize
[9]: 8

Determine the name of the data type:

[10]: dt.name
[10]: 'float64'

Check data type:

[11]: dt.type is np.float64


[11]: True

Change data type with astype:

[12]: data_float32 = data.astype(np.float32)


data_float32
[12]: array([[ 0.33215025, 0.9821482 , 0.14965022],
[-0.5039629 , 0.79987854, -0.55183125],
[-0.9200971 , -0.746871 , 0.37547055],
[-1.1687789 , 1.6087633 , -1.6145438 ],
[ 0.14729111, 0.4293929 , -0.11391696],
[-0.9159697 , -0.6969758 , -0.36380735],
[-0.34818023, -0.9103621 , -0.47645596]], dtype=float32)


2.3.4 Arithmetic

Arrays allow you to perform batch operations on data without having to write for loops. This is called vectorisation in
NumPy. For all arithmetic operations between arrays of the same size, the operation is applied element by element:

[1]: import numpy as np

data = np.random.randn(7, 3)
data
[1]: array([[-0.53395033, -0.90655262, -1.47419874],
[-0.60691335, 0.12786859, 0.83478769],
[-0.24791073, 0.61712855, -1.36487227],
[-0.04708711, -1.45600876, -0.88333238],
[-0.60393798, -0.12695018, 1.17832759],
[-0.83320447, -0.30302264, 0.53249707],
[ 0.85006387, 0.83626062, -1.03637456]])

[2]: 1 / data
[2]: array([[ -1.87283338, -1.10307993, -0.67833459],
[ -1.64768166, 7.82052869, 1.19790937],
[ -4.03370998, 1.62040794, -0.73266929],
[-21.23723314, -0.68680906, -1.13207669],
[ -1.65579916, -7.87710608, 0.84866043],
[ -1.2001856 , -3.30008343, 1.87794462],
[ 1.17638219, 1.19579946, -0.96490211]])

[3]: data ** 2
[3]: array([[0.28510296, 0.82183765, 2.17326192],
[0.36834382, 0.01635038, 0.69687049],
[0.06145973, 0.38084765, 1.8628763 ],
[0.0022172 , 2.1199615 , 0.7802761 ],
[0.36474108, 0.01611635, 1.3884559 ],
[0.69422968, 0.09182272, 0.28355312],
[0.72260859, 0.69933183, 1.07407224]])

Comparison of two arrays:

[4]: data2 = np.random.randn(7, 3)


data > data2
[4]: array([[False, True, False],
[False, True, True],
[False, False, False],
[ True, False, False],
[False, True, True],
[False, False, True],
[False, False, False]])


2.3.5 Indexing and slicing

Indexing is the selection of a subset of your data or individual elements. This is very easy in one-dimensional arrays;
they behave similarly to Python lists:

[1]: import numpy as np

[2]: rng = np.random.default_rng()


data = rng.normal(size=(7, 3))
data
[2]: array([[ 0.50645754, -1.55827537, -0.55228066],
[-0.8255559 , 0.7174421 , -0.01235908],
[ 0.29963999, 0.92431092, 0.54994156],
[-0.13884297, -0.32815509, 0.53517827],
[ 0.55981128, 0.10765489, 2.09209898],
[-0.7775902 , -1.83407447, 0.97028733],
[ 1.5949525 , 0.03300061, 0.69442966]])

[3]: data[4]
[3]: array([0.55981128, 0.10765489, 2.09209898])

[4]: data[2:4]
[4]: array([[ 0.29963999, 0.92431092, 0.54994156],
[-0.13884297, -0.32815509, 0.53517827]])

[5]: data[2:4] = rng.normal(size=(2, 3))

[6]: data
[6]: array([[ 0.50645754, -1.55827537, -0.55228066],
[-0.8255559 , 0.7174421 , -0.01235908],
[ 1.69557114, 0.54106355, 1.73920426],
[ 0.7296357 , 2.13535052, 0.02169761],
[ 0.55981128, 0.10765489, 2.09209898],
[-0.7775902 , -1.83407447, 0.97028733],
[ 1.5949525 , 0.03300061, 0.69442966]])

Note:
Array slices differ from Python lists in that they are views of the original array. This means that the data is not copied
and that any changes to the view are reflected in the original array.
If you want to make a copy of a part of an ndarray, you can copy the array explicitly – for example with data[2:5].copy().
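
A quick sketch of this behaviour (the values are arbitrary):

arr = np.arange(10)
view = arr[2:5]          # a view – no data is copied
view[:] = 99             # modifying the view ...
arr                      # ... also changes the original array

part = arr[2:5].copy()
part[:] = 0              # the copy is independent – arr stays unchanged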
Slicing in this way always results in array views with the same number of dimensions. However, if you mix integer
indices and slices, you get slices with lower dimensions. For example, we can select the second row but only the first
two columns as follows:

[7]: data[1, :2]


[7]: array([-0.8255559, 0.7174421])

A colon means that the whole axis is taken, so you can also select higher dimensional axes:


[8]: data[:, :1]


[8]: array([[ 0.50645754],
[-0.8255559 ],
[ 1.69557114],
[ 0.7296357 ],
[ 0.55981128],
[-0.7775902 ],
[ 1.5949525 ]])

Boolean indexing

Let’s consider an example where we have some data in an array and an array of names with duplicates. I will use the
normal function in numpy.random.default_rng here to generate some random normally distributed data:

[9]: names = np.array(['Liam', 'Olivia', 'Noah', 'Liam', 'Noah', 'Olivia', 'Liam', 'Emma',
                       'Oliver', 'Ava'])

data = rng.normal(size=(10, 4))

[10]: names
[10]: array(['Liam', 'Olivia', 'Noah', 'Liam', 'Noah', 'Olivia', 'Liam', 'Emma',
'Oliver', 'Ava'], dtype='<U6')

[11]: data
[11]: array([[ 0.78849796, -1.93868484, 1.32082796, -0.10041207],
[ 0.14511387, 1.33143889, 0.32335453, -0.86912269],
[-1.33151508, 1.86170326, -0.07634186, -0.08733154],
[-1.30468553, -1.24239052, -0.82416291, 0.18988816],
[ 1.15544397, 1.95009689, -1.58486011, 0.5644875 ],
[-0.3425853 , 0.49727459, -1.13323998, 1.33729132],
[-2.86162133, -1.44457339, 2.484492 , -0.9733903 ],
[-1.04106218, -0.89449304, 0.50255481, 1.39502699],
[ 1.0782019 , -1.53786777, 0.72229253, 1.13447707],
[ 0.82987705, -0.17516968, -1.11954313, -0.02558353]])

Suppose each name corresponds to a row in the data array and we want to select all rows with the corresponding name
Liam. Like arithmetic operations, comparisons like == are vectorised with arrays. So comparing names with the string
Liam results in a Boolean array:

[12]: names == 'Liam'


[12]: array([ True, False, False, True, False, False, True, False, False,
False])

This Boolean array can be passed when indexing the array:

[13]: data[names == 'Liam']


[13]: array([[ 0.78849796, -1.93868484, 1.32082796, -0.10041207],
[-1.30468553, -1.24239052, -0.82416291, 0.18988816],
[-2.86162133, -1.44457339, 2.484492 , -0.9733903 ]])


Here, the Boolean array must have the same length as the array axis it indexes.
Note:
Selecting data from an array by Boolean indexing and assigning the result to a new variable always creates a copy of
the data, even if the returned array is unchanged.
In the following example, I select the rows where names == 'Liam' and also index the columns:

[14]: data[names == 'Liam', 2:]


[14]: array([[ 1.32082796, -0.10041207],
[-0.82416291, 0.18988816],
[ 2.484492 , -0.9733903 ]])

To select everything except Liam, you can either use != or negate the condition with ~:

[15]: names != 'Liam'


[15]: array([False, True, True, False, True, True, False, True, True,
True])

[16]: cond = names == 'Liam'


data[~cond]
[16]: array([[ 0.14511387, 1.33143889, 0.32335453, -0.86912269],
[-1.33151508, 1.86170326, -0.07634186, -0.08733154],
[ 1.15544397, 1.95009689, -1.58486011, 0.5644875 ],
[-0.3425853 , 0.49727459, -1.13323998, 1.33729132],
[-1.04106218, -0.89449304, 0.50255481, 1.39502699],
[ 1.0782019 , -1.53786777, 0.72229253, 1.13447707],
[ 0.82987705, -0.17516968, -1.11954313, -0.02558353]])

To combine several Boolean conditions – for example, to select two of the three names – you can use the Boolean arithmetic operators
& (and) and | (or).
Warning:
The Python keywords and and or do not work with Boolean arrays.

[17]: mask = (names == 'Liam') | (names == 'Olivia')

[18]: mask
[18]: array([ True, True, False, True, False, True, True, False, False,
False])

[19]: data[mask]
[19]: array([[ 0.78849796, -1.93868484, 1.32082796, -0.10041207],
[ 0.14511387, 1.33143889, 0.32335453, -0.86912269],
[-1.30468553, -1.24239052, -0.82416291, 0.18988816],
[-0.3425853 , 0.49727459, -1.13323998, 1.33729132],
[-2.86162133, -1.44457339, 2.484492 , -0.9733903 ]])


Integer Array Indexing

Integer array indexing allows you to select any elements in the array based on your N-dimensional index. Each integer
array represents a number of indices in that dimension.
See also:
• Integer array indexing
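
A short sketch of what this looks like (the values are chosen arbitrarily):

arr = np.arange(32).reshape((8, 4))

# select whole rows in a given order (negative indices count from the end)
arr[[4, 3, 0, 6]]
arr[[-3, -5, -7]]

# two index arrays select the single elements (1, 0), (5, 3), (7, 1) and (2, 2)
arr[[1, 5, 7, 2], [0, 3, 1, 2]]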

2.3.6 Transpose arrays and swap axes

Transposing is a special form of reshaping that also provides a view of the underlying data without copying anything.
Arrays have the transpose method and also the special T attribute:

[1]: import numpy as np

[2]: data = np.arange(16)

[3]: data
[3]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

[4]: reshaped_data = data.reshape((4, 4))

[5]: reshaped_data
[5]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])

[6]: reshaped_data.T
[6]: array([[ 0, 4, 8, 12],
[ 1, 5, 9, 13],
[ 2, 6, 10, 14],
[ 3, 7, 11, 15]])

numpy.dot returns the dot product of two arrays – for two-dimensional arrays this is the matrix product, for example:

[7]: np.dot(reshaped_data.T, reshaped_data)


[7]: array([[224, 248, 272, 296],
[248, 276, 304, 332],
[272, 304, 336, 368],
[296, 332, 368, 404]])

The @ infix operator is another way to perform matrix multiplication. It implements the semantics of the @ operator
introduced in Python 3.5 with PEP 465 and is an abbreviation of np.matmul.

[8]: data.T @ data


[8]: 1240

For higher dimensional arrays, transpose accepts a tuple of axis numbers to swap the axes:


[9]: array_3d = np.arange(16).reshape((2, 2, 4))

[10]: array_3d
[10]: array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],

[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])

[11]: array_3d.transpose((1, 0, 2))


[11]: array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],

[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])

Here the axes have been reordered with the second axis in first place, the first axis in second place and the last axis
unchanged.
ndarray also has a swapaxes method that takes a pair of axis numbers and swaps the specified axes to rearrange the
data:

[12]: array_3d.swapaxes(1, 2)
[12]: array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],

[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])

2.3.7 Universal functions (ufunc)

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. They can
be thought of as fast vectorised wrappers for simple functions that take one or more scalar values and produce one or
more scalar results.
Many ufuncs are simple element-wise transformations, such as sqrt or exp:

[1]: import numpy as np

data = np.arange(10)

[2]: data
[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

[3]: np.sqrt(data)


[3]: array([0. , 1. , 1.41421356, 1.73205081, 2. ,


2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])

[4]: np.exp(data)
[4]: array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])

These are called unary ufuncs. Others, such as add or maximum, take two arrays (binary ufuncs) and return
a single array as the result:

[5]: x = np.random.randn(8)

[6]: y = np.random.randn(8)

[7]: x
[7]: array([ 0.27787269, -0.58356193, -0.10184236, 1.35055943, -1.57973878,
1.14278195, -1.08386512, 0.09227202])

[8]: y
[8]: array([ 0.32170798, 0.7041202 , -0.22124585, -0.01662225, 0.20266652,
0.53028692, 1.09999517, -0.47938682])

[9]: np.maximum(x, y)
[9]: array([ 0.32170798, 0.7041202 , -0.10184236, 1.35055943, 0.20266652,
1.14278195, 1.09999517, 0.09227202])

Here numpy.maximum calculated the element-wise maximum of the elements in x and y.


Some ufuncs, such as modf, a vectorised version of the built-in Python divmod, return multiple arrays: the fractional
and integral parts of a floating-point array:

[10]: data = x * 5

[11]: data
[11]: array([ 1.38936344, -2.91780966, -0.50921179, 6.75279713, -7.89869391,
5.71390974, -5.4193256 , 0.46136012])

[12]: remainder, whole_part = np.modf(x)

[13]: remainder
[13]: array([ 0.27787269, -0.58356193, -0.10184236, 0.35055943, -0.57973878,
0.14278195, -0.08386512, 0.09227202])

[14]: whole_part
[14]: array([ 0., -0., -0., 1., -1., 1., -1., 0.])

Ufuncs accept an optional out argument that allows you to transfer your results to an existing array instead of creating
a new one:


[15]: out = np.zeros_like(data)

[16]: np.add(data, 1)
[16]: array([ 2.38936344, -1.91780966, 0.49078821, 7.75279713, -6.89869391,
6.71390974, -4.4193256 , 1.46136012])

[17]: np.add(data, 1, out=out)


[17]: array([ 2.38936344, -1.91780966, 0.49078821, 7.75279713, -6.89869391,
6.71390974, -4.4193256 , 1.46136012])

[18]: out
[18]: array([ 2.38936344, -1.91780966, 0.49078821, 7.75279713, -6.89869391,
6.71390974, -4.4193256 , 1.46136012])

Some unary ufuncs:

Function                            Description
abs, fabs                           calculates the absolute value element by element for integer, floating point
                                    or complex values
sqrt                                calculates the square root of each element (corresponds to data ** 0.5)
square                              calculates the square of each element (corresponds to data ** 2)
exp                                 calculates the exponential e^x of each element
log, log10, log2, log1p             natural logarithm (base e), log base 10, log base 2 and log(1 + x) respectively
sign                                calculates the sign of each element: 1 (positive), 0 (zero), or -1 (negative)
ceil                                calculates the upper limit of each element (i.e. the smallest integer greater
                                    than or equal to this number)
floor                               calculates the lower limit of each element (i.e. the largest integer less than
                                    or equal to each element)
rint                                rounds elements to the nearest integer, preserving the dtype
modf                                returns the fractional and integer parts of the array as separate arrays
isnan                               returns a Boolean array indicating whether each value is NaN (Not a Number)
isfinite, isinf                     returns a Boolean array indicating whether each element is finite (not inf,
                                    not NaN) or infinite, respectively
cos, cosh, sin, sinh, tan, tanh     regular and hyperbolic trigonometric functions
arccos, arccosh, arcsin, arcsinh,   inverse trigonometric functions
arctan, arctanh
logical_not                         calculates the truth value of not x element by element (corresponds to ~data)

Some binary universal functions:


Function                            Description
add                                 add corresponding elements in arrays
subtract                            subtracts elements in the second array from the first array
multiply                            multiply array elements
divide, floor_divide                divide or truncate the remainder
power                               raises elements in the first array to the powers specified in the second array
maximum, fmax                       element-wise maximum; fmax ignores NaN
minimum, fmin                       element-wise minimum; fmin ignores NaN
mod                                 element-wise modulus (remainder of the division)
copysign                            copies the sign of the values in the second argument to the values in the
                                    first argument
greater, greater_equal, less,       perform element-wise comparisons that result in a Boolean array (corresponds
less_equal, equal, not_equal        to the infix operators >, >=, <, <=, ==, !=)
logical_and                         calculates the element-wise truth value of the logical operation AND (&)
logical_or                          calculates the element-wise truth value of the logical operation OR (|)
logical_xor                         calculates the element-wise truth value of the logical operation XOR (^)

Note:
A complete overview of binary universal functions can be found in Universal functions (ufunc).

2.3.8 Array-oriented programming – vectorisation

Using NumPy arrays allows you to express many types of data processing tasks as concise array expressions that would
otherwise require writing for-loops. This practice of replacing loops with array expressions is also called vectorisation.
In general, vectorised array operations are significantly faster than their pure Python equivalents.

[1]: import numpy as np

First we create a NumPy array with one hundred thousand integers:

[2]: myarray = np.arange(100000)

Then we square all the elements in this array with numpy.square:

[3]: %time np.square(myarray)


CPU times: user 915 µs, sys: 169 µs, total: 1.08 ms
Wall time: 581 µs
[3]: array([ 0, 1, 4, ..., 9999400009, 9999600004,
9999800001])

For comparison, we now measure the time it takes to square the array with Python's ** operator:

[4]: %time for _ in range(10): myarray2 = myarray ** 2


CPU times: user 5.89 ms, sys: 5.16 ms, total: 11.1 ms
Wall time: 10.8 ms

And finally, we compare this with the time it takes to square all values of a Python list in a list comprehension:


[5]: mylist = list(range(100000))


%time for _ in range(10): mylist2 = [x ** 2 for x in mylist]
CPU times: user 535 ms, sys: 30.7 ms, total: 565 ms
Wall time: 570 ms

2.3.9 Conditional logic as array operations – where

The numpy.where function is a vectorised version of the ternary expression x if condition else y.


In the following example, we first create a Boolean array and two arrays with values:

[1]: import numpy as np

[2]: cond = ([False, True, False, True, False, False, False])


data1 = np.random.randn(1, 7)
data2 = np.random.randn(1, 7)

Now we want to take the values from data1 if the corresponding value in cond is True and otherwise take the value
from data2. With Python’s if-else, this could look like this:

[3]: result = [(x if c else y)


for x, y, c in zip(data1, data2, cond)]

result
[3]: [array([-0.97770921, 0.18774118, 0.02492351, -1.71622535, 0.07635826,
-0.78475485, 1.35754775])]

However, this has the following two problems:


• with large arrays the function will not be very fast
• this will not work with multidimensional arrays
With np.where you can work around these problems in a single function call:

[4]: result = np.where(cond, data1, data2)

result
[4]: array([[-0.97770921, 2.12735854, 0.02492351, 0.44203037, 0.07635826,
-0.78475485, 1.35754775]])

The second and third arguments of np.where do not have to be arrays; one or both can also be scalars. A typical
use of where in data analysis is to create a new array of values based on another array. Suppose you have a matrix of
randomly generated data and you want to replace all the negative values with positive values:

[5]: data = np.random.randn(4, 4)

data
[5]: array([[ 0.49903363, 0.67358076, 0.12044423, 0.61645566],
[-0.06063701, -0.39860593, -0.06098876, 1.50512719],
[ 1.04223193, -0.0546117 , 0.84049503, 0.36221393],
[-0.62065844, 0.80729481, -0.13285982, 0.79702012]])


[6]: data < 0


[6]: array([[False, False, False, False],
[ True, True, True, False],
[False, True, False, False],
[ True, False, True, False]])

[7]: np.where(data < 0, data * -1, data)


[7]: array([[0.49903363, 0.67358076, 0.12044423, 0.61645566],
[0.06063701, 0.39860593, 0.06098876, 1.50512719],
[1.04223193, 0.0546117 , 0.84049503, 0.36221393],
[0.62065844, 0.80729481, 0.13285982, 0.79702012]])
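
Since the second and third arguments can also be scalars, the same idea can be sketched even more compactly – for example, mapping every value to 1 or -1 depending on its sign, or replacing only the negative values with 0:

np.where(data > 0, 1, -1)
np.where(data < 0, 0, data)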

2.3.10 Mathematical and statistical methods

A number of mathematical functions that calculate statistics over an entire array or over the data along an axis are
accessible as methods of the array class. So you can use aggregations such as sum, mean and standard deviation by
either calling the array instance method or using the top-level NumPy function.
Below I generate some random data and calculate some aggregated statistics:

[1]: import numpy as np

data = np.random.randn(7, 3)

data
[1]: array([[ 0.1636264 , -0.07479277, -0.20255074],
[-0.74231862, 1.09947779, 0.78135583],
[ 0.63855308, 0.32602387, 0.61464571],
[ 0.23415688, -0.91457139, -0.31875483],
[-0.02505313, 1.25184632, 1.4875972 ],
[-0.20283564, 3.13198588, 1.74734662],
[ 0.93549941, -1.0149449 , -0.44490236]])

[2]: data.mean()
[2]: 0.40339955359802476

[3]: np.mean(data)
[3]: 0.40339955359802476

[4]: data.sum()
[4]: 8.47139062555852

Functions like mean and sum take an optional axis argument that computes the statistic over the specified axis,
resulting in an array with one dimension less:

[5]: data.mean(axis=0)
[5]: array([0.14308977, 0.54357497, 0.52353392])


[6]: data.sum(axis=0)
[6]: array([1.00162838, 3.8050248 , 3.66473745])

With data.mean(0), which is the same as data.mean(axis=0), the mean is calculated over the rows, while data.sum(0)
calculates the sum over the rows.
Other methods like cumsum and cumprod, however, do not aggregate but create a new array with the intermediate
results.
In multidimensional arrays, accumulation functions such as cumsum and cumprod return an array of the same size but
with the partial aggregates calculated along the specified axis:

[7]: data.cumsum()
[7]: array([ 0.1636264 , 0.08883362, -0.11371711, -0.85603574, 0.24344206,
1.02479789, 1.66335097, 1.98937484, 2.60402055, 2.83817743,
1.92360604, 1.60485122, 1.57979809, 2.83164441, 4.31924161,
4.11640597, 7.24839184, 8.99573847, 9.93123788, 8.91629298,
8.47139063])

[8]: data.cumprod()
[8]: array([ 1.63626397e-01, -1.22380721e-02, 2.47883053e-03, -1.84008206e-03,
-2.02312937e-03, -1.58078392e-03, -1.00941444e-03, -3.29093203e-04,
-2.02275727e-04, -4.73642533e-05, 4.33179909e-05, -1.38078186e-05,
3.45929093e-07, 4.33050061e-07, 6.44204058e-07, -1.30667543e-07,
-4.09248898e-07, -7.15099681e-07, -6.68975332e-07, 6.78973099e-07,
-3.02076731e-07])
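
To calculate the accumulation along a particular axis instead of over the flattened array, pass the axis number – a short sketch with the same data:

data.cumsum(axis=0)   # running sums down each column
data.cumprod(axis=1)  # running products across each row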

Basic statistical methods for arrays are:

Method Description
sum Sum of all elements in the array or along an axis.
mean Arithmetic mean; for arrays with length zero, NaN is returned.
std, var Standard deviation and variance respectively
min, max Minimum and maximum
argmin, argmax Indices of the minimum and maximum elements respectively
cumsum Cumulative sum of the elements, starting with 0
cumprod Cumulative product of the elements, starting with 1

2.3.11 Methods for Boolean arrays

Boolean values are treated as 1 (True) and 0 (False) in the methods above. Therefore, sum is often used
to count the True values in a Boolean array:

[1]: import numpy as np

[2]: data = np.random.randn(7, 3)

Number of positive values:


[3]: (data > 0).sum()


[3]: 10

Number of negative values:

[4]: (data < 0).sum()


[4]: 11

There are two additional methods, any and all, which are particularly useful for Boolean arrays:
• any checks whether one or more values in an array are true
• all checks whether each value is true

[5]: data2 = np.random.randn(7, 3)

bools = data > data2

bools
[5]: array([[ True, False, False],
[False, False, True],
[ True, True, False],
[False, True, False],
[ True, False, False],
[False, False, False],
[ True, True, True]])

[6]: bools.any()
[6]: True

[7]: bools.all()
[7]: False

2.3.12 Sort

Like Python's built-in lists, NumPy arrays can be sorted in place with the sort method. You can sort any one-
dimensional section of values in a multidimensional array in place along an axis by passing the axis number to sort:

[4]: import numpy as np

data = np.random.randn(7, 3)

data
[4]: array([[ 0.01030144, 0.78524872, 0.42930263],
[-1.51182764, 0.16973497, -0.6294608 ],
[-1.63687566, 2.89978519, -1.12282476],
[-0.23121837, 0.40594707, -0.1667278 ],
[ 1.57718982, -0.4898681 , -0.38596701],
[ 0.92491973, -0.69984905, -1.66848346],
[-0.87552462, -0.04699234, 1.33797794]])


[9]: data.sort(0)

data
[9]: array([[-1.63687566, -0.69984905, -1.66848346],
[-1.51182764, -0.4898681 , -1.12282476],
[-0.87552462, -0.04699234, -0.6294608 ],
[-0.23121837, 0.16973497, -0.38596701],
[ 0.01030144, 0.40594707, -0.1667278 ],
[ 0.92491973, 0.78524872, 0.42930263],
[ 1.57718982, 2.89978519, 1.33797794]])

np.sort, on the other hand, returns a sorted copy of an array instead of changing the array in place:

[7]: np.sort(data, axis=1)


[7]: array([[ 0.01030144, 0.42930263, 0.78524872],
[-1.51182764, -0.6294608 , 0.16973497],
[-1.63687566, -1.12282476, 2.89978519],
[-0.23121837, -0.1667278 , 0.40594707],
[-0.4898681 , -0.38596701, 1.57718982],
[-1.66848346, -0.69984905, 0.92491973],
[-0.87552462, -0.04699234, 1.33797794]])

[10]: data
[10]: array([[-1.63687566, -0.69984905, -1.66848346],
[-1.51182764, -0.4898681 , -1.12282476],
[-0.87552462, -0.04699234, -0.6294608 ],
[-0.23121837, 0.16973497, -0.38596701],
[ 0.01030144, 0.40594707, -0.1667278 ],
[ 0.92491973, 0.78524872, 0.42930263],
[ 1.57718982, 2.89978519, 1.33797794]])

2.3.13 unique and other set logic

NumPy has some basic set operations for one-dimensional ndarray. A commonly used one is numpy.unique, which
returns the sorted unique values in an array:

[1]: import numpy as np

names = np.array(['Liam', 'Olivia', 'Noah', 'Liam', 'Noah', 'Olivia', 'Liam', 'Emma',
                  'Oliver', 'Ava'])

[2]: np.unique(names)
[2]: array(['Ava', 'Emma', 'Liam', 'Noah', 'Oliver', 'Olivia'], dtype='<U6')

With numpy.in1d you can check whether the values of one one-dimensional array are contained in another array; a
Boolean array is returned:

[3]: np.in1d(names, ['Emma', 'Ava', 'Charlotte'])


[3]: array([False, False, False, False, False, False, False, True, False,
True])

Array set operations:

Method Description
unique(x) calculates the sorted, unique elements in x
intersect1d(x, y) calculates the sorted common elements x and y
union1d(x, y) calculates the sorted union of elements
in1d(x, y) computes a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y) sets the difference of the elements in x that are not contained in y
setxor1d(x, y) sets symmetric differences; elements contained in one of the arrays but not in both
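
For example, a short sketch with made-up arrays:

x = np.array([1, 2, 3, 4])
y = np.array([3, 4, 5, 6])

np.intersect1d(x, y)   # array([3, 4])
np.union1d(x, y)       # array([1, 2, 3, 4, 5, 6])
np.setdiff1d(x, y)     # array([1, 2])
np.setxor1d(x, y)      # array([1, 2, 5, 6])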

2.3.14 File input and output with arrays

NumPy is able to store data on disk in some text or binary formats and to load it from there. However, in this section I
only discuss NumPy's own binary format, as pandas or other tools are mostly used to load text or table data (see Read,
persist and provide data).
np.save and np.load are the two most important functions for efficiently saving and loading array data to disk. Arrays
are saved by default in an uncompressed raw binary format with the file extension .npy:

[1]: import numpy as np

data = np.random.randn(7, 3)

np.save('my_data', data)

If the file path does not already end in .npy, the extension is appended. The array on the hard disk can then be loaded
with np.load:

[2]: np.load('my_data.npy')
[2]: array([[ 0.94742917, -1.61998927, 0.58811464],
[ 0.76779587, -1.30799936, -0.23834223],
[-2.13912266, -0.49191404, -0.49316354],
[-0.62636589, -1.2023549 , 1.41655391],
[ 0.85182508, 0.09138964, 0.79240571],
[ 1.22517044, 0.9233269 , 0.44967536],
[-0.68252408, -0.00765624, 0.10032318]])

You can save multiple arrays in an uncompressed archive by using np.savez and passing the arrays as keyword argu-
ments:

[3]: np.savez('data_archive.npz', a=data, b=np.square(data))

[4]: archive = np.load('data_archive.npz')

archive['b']
[4]: array([[8.97622033e-01, 2.62436524e+00, 3.45878832e-01],
[5.89510504e-01, 1.71086232e+00, 5.68070195e-02],


[4.57584576e+00, 2.41979423e-01, 2.43210281e-01],
[3.92334231e-01, 1.44565730e+00, 2.00662497e+00],
[7.25605970e-01, 8.35206615e-03, 6.27906803e-01],
[1.50104262e+00, 8.52532560e-01, 2.02207929e-01],
[4.65839120e-01, 5.86179351e-05, 1.00647409e-02]])
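
If the data compresses well, np.savez_compressed can be used in the same way to write a compressed archive instead – a minimal sketch:

np.savez_compressed('compressed_archive.npz', a=data, b=np.square(data))
np.load('compressed_archive.npz')['a']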

2.4 pandas

pandas is a Python library for data analysis that has become very popular in recent years. On the website, pandas is
described thus:
„pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built
on top of the Python programming language.“
More specifically, pandas is an in-memory analysis tool that offers SQL-like constructs, as well as statistical and ana-
lytical tools. In doing so, pandas builds on Cython and NumPy, making it less memory intensive and faster than pure
Python code. Mostly pandas is used to
• replace Excel and Power BI
• implement an ETL process
• process CSV or JSON data
• prepare machine learning
See also:
• Home
• User guide
• API reference
• GitHub

2.4.1 Introduction to the data structures of pandas

To get started with pandas, you should first familiarise yourself with the two most important data structures Series and
DataFrame.

Series

A series is a one-dimensional array-like object containing a sequence of values (of similar types to the NumPy types)
and an associated array of data labels called an index. The simplest series is formed from just an array of data:

[1]: import numpy as np


import pandas as pd

[2]: rng = np.random.default_rng()


s = pd.Series(rng.normal(size=7))
s


[2]: 0 0.145710
1 1.445287
2 -0.631168
3 -0.158460
4 0.408335
5 1.203595
6 0.585688
dtype: float64

The string representation of an interactively displayed series shows the index on the left and the values on the right.
Since we have not specified an index for the data, a default index is created consisting of the integers 0 to N - 1
(where N is the length of the data). You can get the array representation and the index object of the series via their
pandas.Series.array and pandas.Series.index attributes respectively:

[3]: s.array
[3]: <PandasArray>
[0.14571022099248174, 1.4452870427787552, -0.63116828428809,
-0.1584603180747951, 0.40833540099850335, 1.2035948244521983,
0.5856883717533461]
Length: 7, dtype: float64

[4]: s.index
[4]: RangeIndex(start=0, stop=7, step=1)

Often you will want to create an index that identifies each data point with a label:

[5]: idx = pd.date_range("2022-01-31", periods=7)

s2 = pd.Series(rng.normal(size=7), index=idx)

[6]: s2
[6]: 2022-01-31 -0.879307
2022-02-01 1.491183
2022-02-02 -0.778551
2022-02-03 2.963484
2022-02-04 0.134357
2022-02-05 0.357678
2022-02-06 -1.124577
Freq: D, dtype: float64

See also:
• Time series / date functionality
Compared to NumPy arrays, you can use labels in the index if you want to select individual values or a group of values:

[7]: s2['2022-02-02']
[7]: -0.7785505322106326

[8]: s2[['2022-02-02', '2022-02-03', '2022-02-04']]


[8]: 2022-02-02 -0.778551


2022-02-03 2.963484
2022-02-04 0.134357
dtype: float64

Here ['2022-02-02', '2022-02-03', '2022-02-04'] is interpreted as a list of indices, even if it contains strings
instead of integers.
When using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication
or applying mathematical functions, the link between index and value is preserved:

[9]: s2[s2 > 0]


[9]: 2022-02-01 1.491183
2022-02-03 2.963484
2022-02-04 0.134357
2022-02-05 0.357678
dtype: float64

[10]: s2 ** 2
[10]: 2022-01-31 0.773180
2022-02-01 2.223628
2022-02-02 0.606141
2022-02-03 8.782240
2022-02-04 0.018052
2022-02-05 0.127933
2022-02-06 1.264674
Freq: D, dtype: float64

[11]: np.exp(s2)
[11]: 2022-01-31 0.415071
2022-02-01 4.442349
2022-02-02 0.459071
2022-02-03 19.365331
2022-02-04 1.143801
2022-02-05 1.430005
2022-02-06 0.324790
Freq: D, dtype: float64

You can also think of a series as a fixed-length, ordered dict, since it is a mapping of index values to data values. It
can be used in many contexts where you could use a dict:

[12]: '2022-02-02' in s2
[12]: True

[13]: '2022-02-09' in s2
[13]: False


Missing data

I will use NA and null synonymously to indicate missing data. The functions isna and notna in pandas should be
used to identify missing data:

[14]: pd.isna(s2)
[14]: 2022-01-31 False
2022-02-01 False
2022-02-02 False
2022-02-03 False
2022-02-04 False
2022-02-05 False
2022-02-06 False
Freq: D, dtype: bool

[15]: pd.notna(s2)
[15]: 2022-01-31 True
2022-02-01 True
2022-02-02 True
2022-02-03 True
2022-02-04 True
2022-02-05 True
2022-02-06 True
Freq: D, dtype: bool

Series also has these as instance methods:

[16]: s2.isna()
[16]: 2022-01-31 False
2022-02-01 False
2022-02-02 False
2022-02-03 False
2022-02-04 False
2022-02-05 False
2022-02-06 False
Freq: D, dtype: bool

Dealing with missing data is discussed in more detail in the section Managing missing data with pandas.
A useful feature of Series for many applications is the automatic alignment by index labels in arithmetic operations:

[17]: idx = pd.date_range("2022-02-01", periods=7)

s3 = pd.Series(rng.normal(size=7), index=idx)

[18]: s2, s3
[18]: (2022-01-31 -0.879307
2022-02-01 1.491183
2022-02-02 -0.778551
2022-02-03 2.963484
2022-02-04 0.134357


2022-02-05 0.357678
2022-02-06 -1.124577
Freq: D, dtype: float64,
2022-02-01 -0.533894
2022-02-02 -1.020880
2022-02-03 0.505049
2022-02-04 0.534989
2022-02-05 -0.164027
2022-02-06 0.967125
2022-02-07 0.427053
Freq: D, dtype: float64)

[19]: s2 + s3
[19]: 2022-01-31 NaN
2022-02-01 0.957289
2022-02-02 -1.799431
2022-02-03 3.468533
2022-02-04 0.669346
2022-02-05 0.193651
2022-02-06 -0.157452
2022-02-07 NaN
Freq: D, dtype: float64

If you have experience with SQL, this is similar to a JOIN operation.


Both the Series object itself and its index have a name attribute that can be integrated into other areas of the pandas
functionality:

[20]: s3.name = 'floats'


s3.index.name = 'date'

s3
[20]: date
2022-02-01 -0.533894
2022-02-02 -1.020880
2022-02-03 0.505049
2022-02-04 0.534989
2022-02-05 -0.164027
2022-02-06 0.967125
2022-02-07 0.427053
Freq: D, Name: floats, dtype: float64


DataFrame

A DataFrame represents a rectangular data table and contains an ordered, named collection of columns, each of which
can have a different value type. The DataFrame has both a row index and a column index.
Note:
Although a DataFrame is two-dimensional, you can also use it to represent higher-dimensional data in a table format
with hierarchical indexing using join, combine and Reshaping.

[21]: data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],


'Decimal': [0, 1, 2, 3, 4, 5],
'Octal': ['001', '002', '003', '004', '004', '005'],
'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E']}

df = pd.DataFrame(data)

df
[21]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E

For large DataFrames, the head method selects only the first five rows:

[22]: df.head()
[22]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D

You can also specify columns and their order:

[23]: pd.DataFrame(data, columns=['Code', 'Key'])


[23]: Code Key
0 U+0000 NUL
1 U+0001 Ctrl-A
2 U+0002 Ctrl-B
3 U+0003 Ctrl-C
4 U+0004 Ctrl-D
5 U+0005 Ctrl-E

If you want to pass a column that is not contained in the dict, it will appear without values in the result:

[24]: df2 = pd.DataFrame(data, columns=['Code', 'Decimal', 'Octal', 'Description', 'Key'])

df2


[24]: Code Decimal Octal Description Key


0 U+0000 0 001 NaN NUL
1 U+0001 1 002 NaN Ctrl-A
2 U+0002 2 003 NaN Ctrl-B
3 U+0003 3 004 NaN Ctrl-C
4 U+0004 4 004 NaN Ctrl-D
5 U+0005 5 005 NaN Ctrl-E

You can retrieve a column in a DataFrame with a dict-like notation:

[25]: df['Code']
[25]: 0 U+0000
1 U+0001
2 U+0002
3 U+0003
4 U+0004
5 U+0005
Name: Code, dtype: object

This way you can also make a column the index:

[26]: df2 = pd.DataFrame(data,


columns=['Decimal', 'Octal', 'Description', 'Key'],
index=df['Code'])

df2
[26]: Decimal Octal Description Key
Code
U+0000 0 001 NaN NUL
U+0001 1 002 NaN Ctrl-A
U+0002 2 003 NaN Ctrl-B
U+0003 3 004 NaN Ctrl-C
U+0004 4 004 NaN Ctrl-D
U+0005 5 005 NaN Ctrl-E

Rows can be retrieved by label with the pandas.DataFrame.loc attribute (and by integer position with pandas.DataFrame.iloc):

[27]: df2.loc['U+0001']
[27]: Decimal 1
Octal 002
Description NaN
Key Ctrl-A
Name: U+0001, dtype: object
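
The positional counterpart is iloc – a short sketch that selects the same row by its integer position:

df2.iloc[1]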

Column values can be changed by assignment. For example, a scalar value or an array of values could be assigned to
the empty Description column:

[28]: df2['Description'] = [
'Null character',
'Start of Heading',
'Start of Text',
'End-of-text character',


'End-of-transmission character',
'Enquiry character'
]

df2
[28]: Decimal Octal Description Key
Code
U+0000 0 001 Null character NUL
U+0001 1 002 Start of Heading Ctrl-A
U+0002 2 003 Start of Text Ctrl-B
U+0003 3 004 End-of-text character Ctrl-C
U+0004 4 004 End-of-transmission character Ctrl-D
U+0005 5 005 Enquiry character Ctrl-E

Assigning a non-existing column creates a new column.


Columns can be removed with pandas.DataFrame.drop and displayed with pandas.DataFrame.columns:

[29]: df3 = df2.drop(columns=['Decimal', 'Octal'])

[30]: df2.columns
[30]: Index(['Decimal', 'Octal', 'Description', 'Key'], dtype='object')

[31]: df3.columns
[31]: Index(['Description', 'Key'], dtype='object')

Another common form of data is a nested dict of dicts:

[32]: u = {
          'U+0006': {'Decimal': '6', 'Octal': '006', 'Description': 'Acknowledge character', 'Key': 'Ctrl-F'},
          'U+0007': {'Decimal': '7', 'Octal': '007', 'Description': 'Bell character', 'Key': 'Ctrl-G'},
      }

      df4 = pd.DataFrame(u)

      df4
[32]: U+0006 U+0007
Decimal 6 7
Octal 006 007
Description Acknowledge character Bell character
Key Ctrl-F Ctrl-G

You can transpose the DataFrame, i.e. swap the rows and columns, with a similar syntax to a NumPy array:

[33]: df4.T
[33]: Decimal Octal Description Key
U+0006 6 006 Acknowledge character Ctrl-F
U+0007 7 007 Bell character Ctrl-G


Warning:
Note that when transposing, the data types of the columns are discarded if the columns do not all have the same data
type, so when transposing and then transposing back, the previous type information may be lost. In this case, the
columns become arrays of pure Python objects.
The keys in the inner dicts are combined to form the index in the result. This is not the case when an explicit index is
specified:

[34]: df5 = pd.DataFrame(u, index=['Decimal', 'Octal', 'Key'])


df5
[34]: U+0006 U+0007
Decimal 6 7
Octal 006 007
Key Ctrl-F Ctrl-G

2.4.2 Converting Python data structures into pandas

Python data structures such as lists and arrays can be converted into pandas Series or DataFrames.

[1]: import numpy as np


import pandas as pd

Series

Python lists can easily be converted into pandas Series:

[2]: list1 = [-0.751442, 0.816935, -0.272546, -0.268295, -0.296728, 0.176255, -0.322612]

pd.Series(list1)
[2]: 0 -0.751442
1 0.816935
2 -0.272546
3 -0.268295
4 -0.296728
5 0.176255
6 -0.322612
dtype: float64

Multiple lists can also be easily converted into one pandas Series:

[3]: list2 = [-0.029608, -0.277982, 2.693057, -0.850817, 0.783868, -1.137835, -0.617132]

pd.Series(list1 + list2)
[3]: 0 -0.751442
1 0.816935
2 -0.272546
3 -0.268295
4 -0.296728
5 0.176255
6 -0.322612


7 -0.029608
8 -0.277982
9 2.693057
10 -0.850817
11 0.783868
12 -1.137835
13 -0.617132
dtype: float64

A list can also be passed as an index:

[4]: date = ['2022-01-31', '2022-02-01', '2022-02-02', '2022-02-03', '2022-02-04',
             '2022-02-05', '2022-02-06']

pd.Series(list1, index = date)


[4]: 2022-01-31 -0.751442
2022-02-01 0.816935
2022-02-02 -0.272546
2022-02-03 -0.268295
2022-02-04 -0.296728
2022-02-05 0.176255
2022-02-06 -0.322612
dtype: float64

With Python dictionaries you can pass not only values but also the corresponding keys to a pandas series:

[5]: dict1 = {
'2022-01-31': -0.751442,
'2022-02-01': 0.816935,
'2022-02-02': -0.272546,
'2022-02-03': -0.268295,
'2022-02-04': -0.296728,
'2022-02-05': 0.176255,
'2022-02-06': -0.322612
}

pd.Series(dict1)
[5]: 2022-01-31 -0.751442
2022-02-01 0.816935
2022-02-02 -0.272546
2022-02-03 -0.268295
2022-02-04 -0.296728
2022-02-05 0.176255
2022-02-06 -0.322612
dtype: float64

When you pass a dict, the index in the resulting pandas series takes into account the order of the keys in the dict.
With collections.ChainMap you can also turn several dicts into one pandas.Series.
First we define a second dict:


[6]: dict2 = {
'2022-02-07': -0.029608,
'2022-02-08': -0.277982,
'2022-02-09': 2.693057,
'2022-02-10': -0.850817,
'2022-02-11': 0.783868,
'2022-02-12': -1.137835,
'2022-02-13': -0.617132
}

[7]: from collections import ChainMap

pd.Series(ChainMap(dict1,dict2))
[7]: 2022-02-07 -0.029608
2022-02-08 -0.277982
2022-02-09 2.693057
2022-02-10 -0.850817
2022-02-11 0.783868
2022-02-12 -1.137835
2022-02-13 -0.617132
2022-01-31 -0.751442
2022-02-01 0.816935
2022-02-02 -0.272546
2022-02-03 -0.268295
2022-02-04 -0.296728
2022-02-05 0.176255
2022-02-06 -0.322612
dtype: float64

DataFrame

Lists of lists can be loaded into a pandas DataFrame with:

[8]: df = pd.DataFrame([list1, list2])


df
[8]: 0 1 2 3 4 5 6
0 -0.751442 0.816935 -0.272546 -0.268295 -0.296728 0.176255 -0.322612
1 -0.029608 -0.277982 2.693057 -0.850817 0.783868 -1.137835 -0.617132

You can also pass a list as the DataFrame index:

[9]: pd.DataFrame([list1, list2], index= ['2022-01-31', '2022-02-01'])


[9]: 0 1 2 3 4 5 \
2022-01-31 -0.751442 0.816935 -0.272546 -0.268295 -0.296728 0.176255
2022-02-01 -0.029608 -0.277982 2.693057 -0.850817 0.783868 -1.137835

6
2022-01-31 -0.322612
2022-02-01 -0.617132

A pandas DataFrame can be created from a dict with values in lists:


[10]: data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],


'Decimal': [0, 1, 2, 3, 4, 5],
'Octal': ['001', '002', '003', '004', '004', '005'],
'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E']}

[11]: pd.DataFrame(data)
[11]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E

Another common form of data is a nested dict of dicts:

[12]: data2 = {
'U+0006': {'Decimal': '6', 'Octal': '006', 'Key': 'Ctrl-F'},
'U+0007': {'Decimal': '7', 'Octal': '007', 'Key': 'Ctrl-G'},
}

df2 = pd.DataFrame(data2)

df2
[12]: U+0006 U+0007
Decimal 6 7
Octal 006 007
Key Ctrl-F Ctrl-G

Dicts of Series are treated in a similar way:

[13]: data3 = {
'U+0006': df2['U+0006'][2:],
'U+0007': df2['U+0007'][2:]
}

pd.DataFrame(data3)
[13]: U+0006 U+0007
Key Ctrl-F Ctrl-G

2.4.3 Indexing

Index objects

The index objects of pandas are responsible for the axis labels and other metadata, such as the axis name. Any array
or other sequence of labels you use when constructing a series or DataFrame is internally converted into an index:

[1]: import pandas as pd

obj = pd.Series(range(7), index=pd.date_range("2022-02-02", periods=7))


[2]: obj.index
[2]: DatetimeIndex(['2022-02-02', '2022-02-03', '2022-02-04', '2022-02-05',
'2022-02-06', '2022-02-07', '2022-02-08'],
dtype='datetime64[ns]', freq='D')

[3]: obj.index[3:]
[3]: DatetimeIndex(['2022-02-05', '2022-02-06', '2022-02-07', '2022-02-08'], dtype=
˓→'datetime64[ns]', freq='D')

Index objects are immutable and therefore cannot be changed by the user:

[4]: obj.index[1] = '2022-02-03'


---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_25675/1385405160.py in <module>
----> 1 obj.index[1] = '2022-02-03'

~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexes/base.py in __setitem__(self, key, value)

4583 @final
4584 def __setitem__(self, key, value):
-> 4585 raise TypeError("Index does not support mutable operations")
4586
4587 def __getitem__(self, key):

TypeError: Index does not support mutable operations

Immutability makes it safer to share index objects between data structures:

[5]: import numpy as np

labels = pd.Index(np.arange(3))

labels
[5]: Int64Index([0, 1, 2], dtype='int64')

[6]: rng = np.random.default_rng()


obj2 = pd.Series(rng.normal(size=3),index=labels)

[7]: obj2
[7]: 0 0.038721
1 1.012864
2 -1.562999
dtype: float64

[8]: obj2.index is labels


[8]: True

In addition to being array-like, an index also behaves like a fixed-size set:


[9]: data1 = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],


'Decimal': [0, 1, 2, 3, 4, 5],
'Octal': ['001', '002', '003', '004', '004', '005']}
df1 = pd.DataFrame(data1)

[10]: df1
[10]: Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005

[11]: df1.columns
[11]: Index(['Code', 'Decimal', 'Octal'], dtype='object')

[12]: 'Code' in df1.columns


[12]: True

[13]: 'Key' in df1.columns


[13]: False

Axis indices with duplicate labels

Unlike Python sets, a Pandas index can contain duplicate labels:

[14]: data2 = {'Code': ['U+0006', 'U+0007'],


'Decimal': [6, 7],
'Octal': ['006', '007']}
df2 = pd.DataFrame(data2)
df12 = df1.append(df2)

df12
[14]: Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
0 U+0006 6 006
1 U+0007 7 007

When selecting duplicate labels, all occurrences of the label in question are selected:

[15]: df12.loc[1]


[15]: Code Decimal Octal


1 U+0001 1 002
1 U+0007 7 007

[16]: df12.loc[2]
[16]: Code U+0002
Decimal 2
Octal 003
Name: 2, dtype: object

Data selection is one of the main points that behaves differently with duplicates. Indexing a label with multiple entries
results in a series, while single entries result in a scalar value. This can complicate your code because the output type
of indexing can vary depending on whether a label is repeated or not. In addition, many pandas functions, such as
reindex, require labels to be unique. You can use the is_unique property of the index to determine whether its
labels are unique or not:

[17]: df12.index.is_unique
[17]: False

To avoid duplicate labels, you can use ignore_index=True, for example:

[18]: df12 = df1.append(df2, ignore_index=True)

df12
[18]: Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
6 U+0006 6 006
7 U+0007 7 007

Some index methods and properties

Each index has a number of set logic methods and properties that answer other general questions about the data it
contains. The following are some useful methods and properties:


Method              Description
concat              concatenates additional index objects, creating a new index
Index.difference    calculates the difference of two sets as an index
Index.intersection  calculates the intersection
Index.union         calculates the union set
Index.isin          computes a boolean array indicating whether each value is contained in the passed collection
Index.delete        computes a new index by deleting the element at position i
Index.drop          computes a new index by deleting the passed values
Index.insert        computes a new index by inserting an element at position i
Index.is_monotonic  returns True if each element is greater than or equal to the previous element
Index.is_unique     returns True if the index does not contain duplicate values
Index.unique        calculates the array of unique values in the index
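
For illustration, a small sketch of some of these operations on two ad-hoc indices (the names idx1 and idx2 are only used here; the exact repr of the results may differ between pandas versions):

idx1 = pd.Index([0, 1, 2, 3])
idx2 = pd.Index([2, 3, 4])

idx1.union(idx2)         # Int64Index([0, 1, 2, 3, 4], dtype='int64')
idx1.intersection(idx2)  # Int64Index([2, 3], dtype='int64')
idx1.difference(idx2)    # Int64Index([0, 1], dtype='int64')
idx1.isin(idx2)          # array([False, False,  True,  True])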

Re-indexing with Index.reindex

An important method for Pandas objects is Index.reindex, which can be used to create a new object with rearranged
values that match the new index. Consider, for example:

[19]: obj = pd.Series(range(7), index=pd.date_range("2022-02-02", periods=7))

[20]: obj
[20]: 2022-02-02 0
2022-02-03 1
2022-02-04 2
2022-02-05 3
2022-02-06 4
2022-02-07 5
2022-02-08 6
Freq: D, dtype: int64

[21]: new_index = pd.date_range("2022-02-03", periods=7)

[22]: obj.reindex(new_index)
[22]: 2022-02-03 1.0
2022-02-04 2.0
2022-02-05 3.0
2022-02-06 4.0
2022-02-07 5.0
2022-02-08 6.0
2022-02-09 NaN
Freq: D, dtype: float64

Index.reindex creates a new index and re-indexes the DataFrame. By default, values in the new index for which
there are no corresponding records in the DataFrame become NaN.
For ordered data such as time series, it may be desirable to interpolate or fill values during reindexing. The method
option allows this with a method like ffill that fills the values forward:


[23]: obj.reindex(new_index, method='ffill')


[23]: 2022-02-03 1
2022-02-04 2
2022-02-05 3
2022-02-06 4
2022-02-07 5
2022-02-08 6
2022-02-09 6
Freq: D, dtype: int64

For a DataFrame, reindex can change either the (row) index, the columns or both. If only a sequence is passed, the
rows in the result are re-indexed:

[24]: df1.reindex(range(7))
[24]: Code Decimal Octal
0 U+0000 0.0 001
1 U+0001 1.0 002
2 U+0002 2.0 003
3 U+0003 3.0 004
4 U+0004 4.0 004
5 U+0005 5.0 005
6 NaN NaN NaN

The columns can be re-indexed with the keyword columns:

[25]: encoding = ['Octal', 'Code', 'Description']

df1.reindex(columns=encoding)
[25]: Octal Code Description
0 001 U+0000 NaN
1 002 U+0001 NaN
2 003 U+0002 NaN
3 004 U+0003 NaN
4 004 U+0004 NaN
5 005 U+0005 NaN


Arguments of the function Index.reindex

Argument    Description
labels      New sequence to be used as index. Can be an Index instance or any other sequence-like Python data
            structure. An Index is used exactly as it is, without being copied.
axis        The axis to re-index, either index (rows) or columns. The default is index. You can alternatively
            use reindex(index=new_labels) or reindex(columns=new_labels).
method      Interpolation method; ffill fills forwards, while bfill fills backwards.
fill_value  Substitute value to be used when missing data is introduced by re-indexing. Use
            fill_value='missing' (the default behaviour) if missing labels in the result should have null values.
limit       When filling forward or backward, the maximum number of elements to fill.
tolerance   When filling forward or backward, the maximum size of the gap to fill for inexact matches.
level       Match a simple index against a level of a MultiIndex; otherwise select a subset.
copy        If True, the underlying data is always copied, even if the new index matches the old index; if False,
            the data is not copied if the indices are equivalent.
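
For example, fill_value can be used instead of the NaN entries shown above (a sketch reusing obj and new_index from this section; the result is shown as comments):

obj.reindex(new_index, fill_value=0)
# 2022-02-03    1
# 2022-02-04    2
# 2022-02-05    3
# 2022-02-06    4
# 2022-02-07    5
# 2022-02-08    6
# 2022-02-09    0
# Freq: D, dtype: int64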

Rename axis indices

The axis labels can be converted by a function or mapping to create new, differently labelled objects. You can also
change the axes in place without creating a new data structure. Here is a simple example:

[26]: df3 = pd.DataFrame(np.arange(12).reshape((3, 4)),


index=['Deutsch', 'English', 'Français'],
columns=[1,2,3,4])

df3
[26]: 1 2 3 4
Deutsch 0 1 2 3
English 4 5 6 7
Français 8 9 10 11

Rename axis indices with Index.map

With Index.map, the axis labels can be transformed by a function to create new, differently labelled objects:

[27]: transform = lambda x: x[:2].upper()

df3.index.map(transform)
[27]: Index(['DE', 'EN', 'FR'], dtype='object')

You can assign to the index and thus change the DataFrame in place:

[28]: df3.index = df3.index.map(transform)

df3


[28]: 1 2 3 4
DE 0 1 2 3
EN 4 5 6 7
FR 8 9 10 11

Rename axis indices with Index.rename

If you want to create a converted version of your dataset without changing the original, you can use Index.rename:

[29]: df3.rename(index=str.lower)
[29]: 1 2 3 4
de 0 1 2 3
en 4 5 6 7
fr 8 9 10 11

In particular, Index.rename can be used in conjunction with a dict-like object that provides new values for a subset
of the axis labels:

[30]: df3.rename(index={'DE': 'BE', 'EN': 'DE', 'FR': 'EN'},


columns={1: 0, 2: 1, 3: 2, 4: 3},
inplace=True)

df3
[30]: 0 1 2 3
BE 0 1 2 3
DE 4 5 6 7
EN 8 9 10 11

Index.rename saves you from manually copying the DataFrame and assigning its index and column attributes. If you
want to change a data set in place, also pass inplace=True:

[31]: df3.rename(index={'DE': 'BE', 'EN': 'DE', 'FR': 'EN'},


columns={1: 0, 2: 1, 3: 2, 4: 3},
inplace=True)

df3
[31]: 0 0 1 2
BE 0 1 2 3
BE 4 5 6 7
DE 8 9 10 11


Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that allows you to have multiple index levels on one axis. This
gives you the opportunity to work with higher dimensional data in a lower dimensional form.
Let’s start with a simple example: Let’s create a series of lists as an index:

[32]: hits = pd.Series([83080,20336,11376,1228,468],


index=[['Jupyter Tutorial',
'Jupyter Tutorial',
'PyViz Tutorial',
'Python Basics',
'Python Basics'],
['de', 'en', 'de', 'de', 'en']])

hits
[32]: Jupyter Tutorial de 83080
en 20336
PyViz Tutorial de 11376
Python Basics de 1228
en 468
dtype: int64

What you see is a formatted view of a Series with a pandas.MultiIndex. The gaps in the index display mean that the
label from the line above applies.

[33]: hits.index
[33]: MultiIndex([('Jupyter Tutorial', 'de'),
('Jupyter Tutorial', 'en'),
( 'PyViz Tutorial', 'de'),
( 'Python Basics', 'de'),
( 'Python Basics', 'en')],
)

With a hierarchically indexed object, so-called partial indexing is possible, with which you can select subsets of the
data:

[34]: hits['Jupyter Tutorial']


[34]: de 83080
en 20336
dtype: int64

[35]: hits['Jupyter Tutorial':'Python Basics']


[35]: Jupyter Tutorial de 83080
en 20336
PyViz Tutorial de 11376
Python Basics de 1228
en 468
dtype: int64

[36]: hits.loc[['Jupyter Tutorial', 'Python Basics']]


[36]: Jupyter Tutorial de 83080


en 20336
Python Basics de 1228
en 468
dtype: int64

Selection is even possible from an inner level. In the following, I select all values with the key 'de' from the second
index level:

[37]: hits.loc[:, 'de']


[37]: Jupyter Tutorial 83080
PyViz Tutorial 11376
Python Basics 1228
dtype: int64

View vs. copy

In pandas, whether you get a view or a copy depends on the structure and data types of the original DataFrame – and
with it, whether changes made to a view are propagated back to the original DataFrame.
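
A minimal sketch of the difference (the DataFrame df_vc is only an example here; the exact behaviour depends on the pandas version and the internals of the DataFrame):

df_vc = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Chained indexing: the assignment may act on a temporary copy
# and typically raises a SettingWithCopyWarning.
df_vc[df_vc['a'] > 1]['b'] = 0

# A single loc call operates on df_vc itself.
df_vc.loc[df_vc['a'] > 1, 'b'] = 0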

stack and unstack

Hierarchical indexing plays an important role in data reshaping and group-based operations such as forming a pivot
table. For example, you can reorder this data into a DataFrame using the pandas.Series.unstack method:

[38]: hits.unstack()
[38]: de en
Jupyter Tutorial 83080.0 20336.0
PyViz Tutorial 11376.0 NaN
Python Basics 1228.0 468.0

The reverse operation of unstack is stack:

[39]: hits.unstack().stack()
[39]: Jupyter Tutorial de 83080.0
en 20336.0
PyViz Tutorial de 11376.0
Python Basics de 1228.0
en 468.0
dtype: float64

In a DataFrame, each axis can have a hierarchical index:

[40]: version_hits = [[19651,0,30134,0,33295,0],


[4722,1825,3497,2576,4009,3707],
[2573,0,4873,0,3930,0],
[525,0,427,0,276,0],
[157,0,85,0,226,0]]

df = pd.DataFrame(version_hits,
index=[['Jupyter Tutorial',
'Jupyter Tutorial',
'PyViz Tutorial',
'Python Basics',
'Python Basics'],
['de', 'en', 'de', 'de', 'en']],
columns=[['12/2021', '12/2021',
'01/2022', '01/2022',
'02/2022', '02/2022'],
['latest', 'stable',
'latest', 'stable',
'latest', 'stable']])

df
[40]: 12/2021 01/2022 02/2022
latest stable latest stable latest stable
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

The hierarchy levels can have names (as strings or any Python objects). If this is the case, they are displayed in the
console output:

[41]: df.index.names = ['Title', 'Language']


df.columns.names = ['Month', 'Version']

df
[41]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

Warning:
Be careful not to confuse the index names Month and Version with the row labels (the df.index values).

With partial column indexing you can select groups of columns in a similar way:

[42]: df['12/2021']
[42]: Version latest stable
Title Language
Jupyter Tutorial de 19651 0
en 4722 1825
PyViz Tutorial de 2573 0
Python Basics de 525 0
en 157 0


With MultiIndex.from_arrays, a MultiIndex can be created itself and then reused; the columns in the preceding
DataFrame with level names could be created in this way:

[43]: pd.MultiIndex.from_arrays([['Jupyter Tutorial',


'Jupyter Tutorial',
'PyViz Tutorial',
'Python Basics',
'Python Basics'],
['de', 'en', 'de', 'de', 'en']],
names=['Title', 'Language'])
[43]: MultiIndex([('Jupyter Tutorial', 'de'),
('Jupyter Tutorial', 'en'),
( 'PyViz Tutorial', 'de'),
( 'Python Basics', 'de'),
( 'Python Basics', 'en')],
names=['Title', 'Language'])

Rearranging and Sorting Levels

There may be times when you want to rearrange the order of the levels on an axis or sort the data by the values in
a particular level. The function DataFrame.swaplevel takes two level numbers or names and returns a new object in
which the levels are swapped (but the data remains unchanged):

[44]: df.swaplevel('Language', 'Title')


[44]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Language Title
de Jupyter Tutorial 19651 0 30134 0 33295 0
en Jupyter Tutorial 4722 1825 3497 2576 4009 3707
de PyViz Tutorial 2573 0 4873 0 3930 0
Python Basics 525 0 427 0 276 0
en Python Basics 157 0 85 0 226 0

DataFrame.sort_index, on the other hand, sorts the data only by the values in a single level. When swapping levels, it
is not uncommon to also use sort_index so that the result is lexicographically sorted by the specified level:

[45]: df.sort_index(level=0)
[45]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

However, the PyViz Tutorial will now be sorted before the Python Basics, as all upper case letters appear before lower
case letters in this sorting. To avoid this, you can use the following lambda function:

[46]: df.sort_index(level=0, key=lambda x: x.str.lower())


[46]: Month 12/2021 01/2022 02/2022


Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0
PyViz Tutorial de 2573 0 4873 0 3930 0

[47]: df.swaplevel(0, 1).sort_index(level=0)


[47]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Language Title
de Jupyter Tutorial 19651 0 30134 0 33295 0
PyViz Tutorial 2573 0 4873 0 3930 0
Python Basics 525 0 427 0 276 0
en Jupyter Tutorial 4722 1825 3497 2576 4009 3707
Python Basics 157 0 85 0 226 0

Note:
Data selection performance is much better for hierarchically indexed objects if the index is sorted lexicographically,
starting with the outermost level, i.e. the result of calling sort_index(level=0) or sort_index().

Summary statistics by level

Many descriptive and summary statistics for DataFrame and Series have a level option that allows you to specify the
level by which you can aggregate on a particular axis. Consider the DataFrame above; we can aggregate either the
rows or the columns by level as follows:

[48]: df.groupby(level='Language').sum()
[48]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Language
de 22749 0 35434 0 37501 0
en 4879 1825 3582 2576 4235 3707

[49]: df.groupby(level='Month', axis=1).sum()


[49]: Month 01/2022 02/2022 12/2021
Title Language
Jupyter Tutorial de 30134 33295 19651
en 6073 7716 6547
PyViz Tutorial de 4873 3930 2573
Python Basics de 427 276 525
en 85 226 157

Internally, pandas’ DataFrame.groupby machinery is used for this purpose, which is explained in more detail in Group
Operations.


Indexing with the columns of a DataFrame

It is not uncommon to use one or more columns of a DataFrame as a row index; alternatively, you can move the row
index into the columns of the DataFrame. Here is an example DataFrame:

[50]: data = [['Jupyter Tutorial', 'de', 19651,0,30134,0,33295,0],


['Jupyter Tutorial', 'en', 4722,1825,3497,2576,4009,3707],
['PyViz Tutorial', 'de', 2573,0,4873,0,3930,0],
['Python Basics', 'de', 525,0,427,0,276,0],
['Python Basics', 'en', 157,0,85,0,226,0]]

df = pd.DataFrame(data)

df
[50]: 0 1 2 3 4 5 6 7
0 Jupyter Tutorial de 19651 0 30134 0 33295 0
1 Jupyter Tutorial en 4722 1825 3497 2576 4009 3707
2 PyViz Tutorial de 2573 0 4873 0 3930 0
3 Python Basics de 525 0 427 0 276 0
4 Python Basics en 157 0 85 0 226 0

The function pandas.DataFrame.set_index creates a new DataFrame that uses one or more of its columns as an index:

[51]: df2 = df.set_index([0,1])

df2
[51]: 2 3 4 5 6 7
0 1
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

By default, the columns are removed from the DataFrame, but you can also leave them in by passing drop=False to
set_index:

[52]: df.set_index([0,1], drop=False)


[52]: 0 1 2 3 4 5 6 \
0 1
Jupyter Tutorial de Jupyter Tutorial de 19651 0 30134 0 33295
en Jupyter Tutorial en 4722 1825 3497 2576 4009
PyViz Tutorial de PyViz Tutorial de 2573 0 4873 0 3930
Python Basics de Python Basics de 525 0 427 0 276
en Python Basics en 157 0 85 0 226

7
0 1
Jupyter Tutorial de 0
en 3707
PyViz Tutorial de 0
Python Basics de 0
en 0


DataFrame.reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into
the columns:

[53]: df2.reset_index()
[53]: 0 1 2 3 4 5 6 7
0 Jupyter Tutorial de 19651 0 30134 0 33295 0
1 Jupyter Tutorial en 4722 1825 3497 2576 4009 3707
2 PyViz Tutorial de 2573 0 4873 0 3930 0
3 Python Basics de 525 0 427 0 276 0
4 Python Basics en 157 0 85 0 226 0

2.4.4 Date and Time

With pandas you can create Series with date and time information. In the following we will show common operations
with date data.
Note:
pandas supports dates stored in UTC values using the datetime64[ns] datatype. Local times from a single time zone
are also supported. Multiple time zones are supported by a pandas.Timestamp object. If you need to handle times from
multiple time zones, I would probably split the data by time zone and use a separate DataFrame or Series for each time
zone.
See also:
• Time series / date functionality

Loading UTC time data

[1]: import pandas as pd

dt = pd.date_range("2022-03-27", periods=6, freq="H")

dt
[1]: DatetimeIndex(['2022-03-27 00:00:00', '2022-03-27 01:00:00',
'2022-03-27 02:00:00', '2022-03-27 03:00:00',
'2022-03-27 04:00:00', '2022-03-27 05:00:00'],
dtype='datetime64[ns]', freq='H')

[2]: utc = pd.to_datetime(dt, utc=True)

utc
[2]: DatetimeIndex(['2022-03-27 00:00:00+00:00', '2022-03-27 01:00:00+00:00',
'2022-03-27 02:00:00+00:00', '2022-03-27 03:00:00+00:00',
'2022-03-27 04:00:00+00:00', '2022-03-27 05:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='H')

Note:
The type of the result dtype='datetime64[ns, UTC]' indicates that the data is stored as UTC.
Let’s convert this series to the time zone Europe/Berlin:


[3]: utc.tz_convert('Europe/Berlin')
[3]: DatetimeIndex(['2022-03-27 01:00:00+01:00', '2022-03-27 03:00:00+02:00',
'2022-03-27 04:00:00+02:00', '2022-03-27 05:00:00+02:00',
'2022-03-27 06:00:00+02:00', '2022-03-27 07:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')

Conversion of local time to UTC

[4]: local = utc.tz_convert('Europe/Berlin')

local.tz_convert('UTC')
[4]: DatetimeIndex(['2022-03-27 00:00:00+00:00', '2022-03-27 01:00:00+00:00',
'2022-03-27 02:00:00+00:00', '2022-03-27 03:00:00+00:00',
'2022-03-27 04:00:00+00:00', '2022-03-27 05:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='H')

Conversion to Unix time

If you have a Series with UTC or local time information, you can use this code to determine the seconds according
to Unix time:

[5]: uts = pd.to_datetime(dt).view(int) / 10**9

uts
[5]: array([1.6483392e+09, 1.6483428e+09, 1.6483464e+09, 1.6483500e+09,
1.6483536e+09, 1.6483572e+09])

To load the Unix time in UTC, you can proceed as follows:

[6]: (pd.to_datetime(uts, unit='s').tz_localize('UTC'))


[6]: DatetimeIndex(['2022-03-27 00:00:00+00:00', '2022-03-27 01:00:00+00:00',
'2022-03-27 02:00:00+00:00', '2022-03-27 03:00:00+00:00',
'2022-03-27 04:00:00+00:00', '2022-03-27 05:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)

Manipulation of dates

Convert to strings

With pandas.DatetimeIndex you have some possibilities to convert date and time into strings, for example into the name
of the weekday:

[7]: local.day_name(locale='en_GB.utf8')
[7]: Index(['Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday'], dtype='object')

You can find out which locales are available to you with locale -a:


[8]: !locale -a
C
C.UTF-8
de_AT.utf8
de_BE.utf8
de_CH.utf8
de_DE.utf8
de_IT.utf8
de_LI.utf8
de_LU.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX

Other attributes of DatetimeIndex that can be used to convert date and time into strings are:


Attribute                Description
year                     the year of the datetime
month                    the month as January=1, ..., December=12
day                      the day of the datetime
hour                     the hours of the datetime
minute                   the minutes of the datetime
second                   the seconds of the datetime
microsecond              the microseconds of the datetime
nanosecond               the nanoseconds of the datetime
date                     returns a NumPy array of Python datetime.date objects
time                     returns a NumPy array of datetime.time objects
timetz                   returns a NumPy array of datetime.time objects with timezone information
dayofyear, day_of_year   the ordinal day of the year
dayofweek                the day of the week with Monday=0 and Sunday=6
day_of_week              the day of the week with Monday=0 and Sunday=6
weekday                  the day of the week with Monday=0 and Sunday=6
quarter                  returns the quarter of the year
tz                       returns the time zone
freq                     returns the frequency object if it is set, otherwise None
freqstr                  returns the frequency object as a string if it is set, otherwise None
is_month_start           indicates whether the date is the first day of the month
is_month_end             indicates whether the date is the last day of the month
is_quarter_start         indicates whether the date is the first day of a quarter
is_quarter_end           indicates whether the date is the last day of a quarter
is_year_start            indicates whether the date is the first day of a year
is_year_end              indicates whether the date is the last day of a year
is_leap_year             Boolean indicator whether the date falls in a leap year
inferred_freq            tries to return a string representing a frequency determined by infer_freq
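
For example, applied to the local index from above (results as comments; the exact repr depends on the pandas version):

local.day_of_week     # Int64Index([6, 6, 6, 6, 6, 6], dtype='int64') – all Sundays
local.quarter         # Int64Index([1, 1, 1, 1, 1, 1], dtype='int64')
local.is_month_start  # array([False, False, False, False, False, False])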

However, there are also some methods with which you can convert the DatetimeIndex into strings, for example
strftime:

[9]: local.strftime('%d.%m.%Y')
[9]: Index(['27.03.2022', '27.03.2022', '27.03.2022', '27.03.2022', '27.03.2022',
'27.03.2022'],
dtype='object')

Note:
In strftime() and strptime() Format Codes you get an overview of the different formatting possibilities of strftime.
Other methods are:


Method          Description
normalize       converts times to midnight
strftime        converts to an Index of strings using the specified date format
snap            snaps the timestamps to the nearest occurring frequency
tz_convert      converts a tz-aware datetime array/index from one time zone to another
tz_localize     localises a tz-naive datetime array/index into a tz-aware datetime array/index
round           rounds the data to the nearest specified frequency
floor           rounds the data down to the specified frequency
ceil            rounds the data up to the specified frequency
to_period       converts the data to a PeriodArray/Index at a given frequency
to_perioddelta  calculates the TimedeltaArray of the difference between the index values and the index converted to
                a PeriodArray at the specified frequency
to_pydatetime   returns the datetime array/index as an ndarray of datetime.datetime objects
to_series       creates a Series with index and values corresponding to the index keys; useful with map for
                returning an indexer
to_frame        creates a DataFrame with a column containing the index
month_name      returns the month names of the DateTimeIndex with the specified locale
day_name        returns the day names of the DateTimeIndex with the specified locale
mean            returns the mean value of the array
std             returns the standard deviation of the sample across the requested axis
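
For example, again using the local index from above (results abbreviated in the comments):

local.normalize()   # sets all times to midnight, keeping the Europe/Berlin time zone
local.month_name()  # Index(['March', 'March', 'March', 'March', 'March', 'March'], dtype='object')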

2.4.5 Select and filter data

Indexing series (obj[...]) works analogously to indexing NumPy arrays, except that you can use index values of the
series instead of just integers. Here are some examples:

[1]: import numpy as np


import pandas as pd

[2]: idx = pd.date_range("2022-02-02", periods=7)


rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7), index=idx)

[3]: s
[3]: 2022-02-02 2.028039
2022-02-03 0.005695
2022-02-04 0.941828
2022-02-05 0.393235
2022-02-06 1.303863
2022-02-07 -2.023356
2022-02-08 0.893236
Freq: D, dtype: float64

[4]: s['2022-02-03']
[4]: 0.005695438572672797

[5]: s[1]


[5]: 0.005695438572672797

[6]: s[2:4]
[6]: 2022-02-04 0.941828
2022-02-05 0.393235
Freq: D, dtype: float64

[7]: s[['2022-02-04', '2022-02-03', '2022-02-02']]


[7]: 2022-02-04 0.941828
2022-02-03 0.005695
2022-02-02 2.028039
dtype: float64

[8]: s[[1, 3]]


[8]: 2022-02-03 0.005695
2022-02-05 0.393235
Freq: 2D, dtype: float64

[9]: s[s > 0]


[9]: 2022-02-02 2.028039
2022-02-03 0.005695
2022-02-04 0.941828
2022-02-05 0.393235
2022-02-06 1.303863
2022-02-08 0.893236
dtype: float64

While you can select data by label in this way, the preferred method for selecting index values is the loc operator:

[10]: s.loc[['2022-02-04', '2022-02-03', '2022-02-02']]


[10]: 2022-02-04 0.941828
2022-02-03 0.005695
2022-02-02 2.028039
dtype: float64

The reason for the preference for loc is the different treatment of integers when indexing with []. In regular []-based
indexing, integers are treated as labels if the index contains integers, so the behaviour varies depending on the data type
of the index. In our example, the expression s.loc[[3, 2, 1]] will fail because the index does not contain integers:

[11]: s.loc[[3, 2, 1]]


---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_25758/2224307995.py in <module>
----> 1 s.loc[[3, 2, 1]]

~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in __getitem__(self, key)

929
930 maybe_callable = com.apply_if_callable(key, self.obj)
--> 931 return self._getitem_axis(maybe_callable, axis=axis)
932
933 def _is_scalar_access(self, key: tuple):

~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _getitem_axis(self, key, axis)

1151 raise ValueError("Cannot index with multidimensional key")


1152
-> 1153 return self._getitem_iterable(key, axis=axis)
1154
1155 # nested tuple slicing

~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _getitem_iterable(self, key, axis)

1091
1092 # A collection of keys
-> 1093 keyarr, indexer = self._get_listlike_indexer(key, axis)
1094 return self.obj._reindex_with_indexers(
1095 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)

1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)


1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(

~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)

1372 if use_interval_msg:
1373 key = list(key)
-> 1374 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].
˓→unique())

KeyError: "None of [Int64Index([3, 2, 1], dtype='int64')] are in the [index]"

While the loc operator exclusively indexes labels, the iloc operator exclusively indexes with integers:

[12]: s.iloc[[3, 2, 1]]


[12]: 2022-02-05 0.393235
2022-02-04 0.941828
2022-02-03 0.005695
Freq: -1D, dtype: float64

You can also slice with labels, but this works differently from normal Python slicing because the endpoint is included:

[13]: s.loc['2022-02-03':'2022-02-04']
[13]: 2022-02-03 0.005695
2022-02-04 0.941828
Freq: D, dtype: float64

Setting with these methods changes the corresponding section of the row:

[14]: s.loc['2022-02-03':'2022-02-04'] = 0

s
[14]: 2022-02-02 2.028039
2022-02-03 0.000000
2022-02-04 0.000000
2022-02-05 0.393235
2022-02-06 1.303863
2022-02-07 -2.023356
2022-02-08 0.893236
Freq: D, dtype: float64

Indexing in a DataFrame is used to retrieve one or more columns with either a single value or a sequence:

[15]: data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],


'Decimal': [0, 1, 2, 3, 4, 5],
'Octal': ['001', '002', '003', '004', '004', '005'],
'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E']}

df = pd.DataFrame(data)
df = pd.DataFrame(data,
columns=['Decimal', 'Octal', 'Key'],
index=df['Code'])

df
[15]: Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B
U+0003 3 004 Ctrl-C
U+0004 4 004 Ctrl-D
U+0005 5 005 Ctrl-E

[16]: df['Key']
[16]: Code
U+0000 NUL
U+0001 Ctrl-A
U+0002 Ctrl-B
U+0003 Ctrl-C
U+0004 Ctrl-D
U+0005 Ctrl-E
Name: Key, dtype: object

[17]: df[['Decimal', 'Key']]


[17]: Decimal Key


Code
U+0000 0 NUL
U+0001 1 Ctrl-A
U+0002 2 Ctrl-B
U+0003 3 Ctrl-C
U+0004 4 Ctrl-D
U+0005 5 Ctrl-E

[18]: df[:2]
[18]: Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A

[19]: df[df['Decimal'] > 2]


[19]: Decimal Octal Key
Code
U+0003 3 004 Ctrl-C
U+0004 4 004 Ctrl-D
U+0005 5 005 Ctrl-E

The row selection syntax df[:2] is provided for convenience. Passing a single element or a list to the [] operator
selects columns.
Another use case is indexing with a Boolean Series generated by a scalar comparison, for example:
[20]: df['Decimal'] > 2
[20]: Code
U+0000 False
U+0001 False
U+0002 False
U+0003 True
U+0004 True
U+0005 True
Name: Decimal, dtype: bool

[21]: df[df['Decimal'] > 2] = 'NA'

df
[21]: Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B
U+0003 NA NA NA
U+0004 NA NA NA
U+0005 NA NA NA

Like Series, DataFrame has special operators loc and iloc for label-based and integer indexing respectively. Since
DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either
axis labels (loc) or integers (iloc).


[22]: df.loc['U+0002', ['Decimal', 'Key']]


[22]: Decimal 2
Key Ctrl-B
Name: U+0002, dtype: object

[23]: df.iloc[[2], [1, 2]]


[23]: Octal Key
Code
U+0002 003 Ctrl-B

[24]: df.iloc[[0, 1], [1, 2]]


[24]: Octal Key
Code
U+0000 001 NUL
U+0001 002 Ctrl-A

Both indexing functions work with slices in addition to individual labels or lists of labels:

[25]: df.loc[:'U+0003', 'Key']


[25]: Code
U+0000 NUL
U+0001 Ctrl-A
U+0002 Ctrl-B
U+0003 NA
Name: Key, dtype: object

[26]: df.iloc[:3, :3]


[26]: Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B

So there are many ways to select and rearrange the data contained in a pandas object. In the following, I put together a
brief summary of most of these possibilities for DataFrames:

Type Note
df[LABEL] selects a single column or a sequence of columns from the DataFrame
df.loc[LABEL] selects a single row or a subset of rows from the DataFrame by label
df.loc[:, LABEL] selects a single column or a subset of columns from the DataFrame by Label
df.loc[LABEL1, LABEL2] selects both rows and columns by label
df.iloc[INTEGER] selects a single row or a subset of rows from the DataFrame by integer position
df.iloc[INTEGER1, selects a single column or a subset of columns by integer position
INTEGER2]
df.at[LABEL1, LABEL2] selects a single value by row and column label
df.iat[INTEGER1, INTEGER2] selects a scalar value by row and column position (integers)
reindex NEW_INDEX selects rows or columns by label
get_value, set_value deprecated since version 0.21.0: use .at[] or .iat[] instead.
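
For example, at and iat access a single value of the df defined above (results as comments):

df.at['U+0001', 'Key']  # 'Ctrl-A'
df.iat[0, 2]            # 'NUL'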


2.4.6 Add, change and delete data

For many data sets, you may want to perform a transformation based on the values in an array, series or column in a
DataFrame. For this, we look at the first Unicode characters:

[1]: import numpy as np


import pandas as pd

[2]: df = pd.DataFrame({'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],


'Decimal': [0, 1, 2, 3, 4, 5],
'Octal': ['001', '002', '003', '004', '004', '005'],
'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E']})

df
[2]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E

Add data

Suppose you want to add a column where the characters are assigned to the C0 or C1 control code:

[3]: control_code = {
'u+0000': 'C0',
'u+0001': 'C0',
'u+0002': 'C0',
'u+0003': 'C0',
'u+0004': 'C0',
'u+0005': 'C0'
}

The map method for a Series accepts a function or a dict-like object that contains a mapping, but here we have a small
problem because some of the codes in control_code are lower case, whereas those in our DataFrame are not. Therefore,
we first need to convert each value to lower case using the str.lower method:

[4]: lowercased = df['Code'].str.lower()

lowercased
[4]: 0 u+0000
1 u+0001
2 u+0002
3 u+0003
4 u+0004
5 u+0005
Name: Code, dtype: object


[5]: df['Control code'] = lowercased.map(control_code)

df
[5]: Code Decimal Octal Key Control code
0 U+0000 0 001 NUL C0
1 U+0001 1 002 Ctrl-A C0
2 U+0002 2 003 Ctrl-B C0
3 U+0003 3 004 Ctrl-C C0
4 U+0004 4 004 Ctrl-D C0
5 U+0005 5 005 Ctrl-E C0

We could also have passed a function that does all the work:

[6]: df['Code'].map(lambda x: control_code[x.lower()])


[6]: 0 C0
1 C0
2 C0
3 C0
4 C0
5 C0
Name: Code, dtype: object

Using map is a convenient way to perform element-wise transformations and other data cleaning operations.

Change data

Note:
Replacing missing values is described in Managing missing data with pandas.

[7]: pd.Series(['Manpower', 'man-made']).str.replace('Man', 'Personal', regex=False)


[7]: 0 Personalpower
1 man-made
dtype: object

[8]: pd.Series(['Man-Power', 'man-made']).str.replace('[Mm]an', 'Personal', regex=True)


[8]: 0 Personal-Power
1 Personal-made
dtype: object

Note:
The Series.replace method differs from str.replace in that it matches and replaces whole values rather than substrings within each string.
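
A short sketch of the difference (the Series s is only an example here; results as comments):

s = pd.Series(['Manpower', 'man-made'])

# str.replace works on substrings within each element:
s.str.replace('man', 'woman', regex=False)
# 0      Manpower
# 1    woman-made
# dtype: object

# Series.replace matches and replaces whole values:
s.replace('man-made', 'woman-made')
# 0      Manpower
# 1    woman-made
# dtype: object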


Delete data

Deleting one or more entries from an axis is easy if you already have an index array or a list without these entries.
To delete duplicates, see Deduplicating data.
Since this can require a bit of set logic, the drop method returns a new object with the indicated values removed from an axis:

[9]: rng = np.random.default_rng()


s = pd.Series(rng.normal(size=7))

s
[9]: 0 1.310961
1 1.630214
2 -1.442352
3 -0.348273
4 -1.249406
5 1.195304
6 -0.893336
dtype: float64

[10]: new = s.drop(2)

new
[10]: 0 1.310961
1 1.630214
3 -0.348273
4 -1.249406
5 1.195304
6 -0.893336
dtype: float64

[11]: new = s.drop([2, 3])

new
[11]: 0 1.310961
1 1.630214
4 -1.249406
5 1.195304
6 -0.893336
dtype: float64

With DataFrames, index values can be deleted on both axes. To illustrate this, we first create an example DataFrame:

[12]: data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],


'Decimal': [0, 1, 2, 3, 4, 5],
'Octal': ['001', '002', '003', '004', '004', '005'],
'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E']}

df = pd.DataFrame(data)

df


[12]: Code Decimal Octal Key


0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E

[13]: df.drop([0, 1])


[13]: Code Decimal Octal Key
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E

You can also remove values from the columns by passing axis=1 or axis='columns':

[14]: df.drop('Decimal', axis=1)


[14]: Code Octal Key
0 U+0000 001 NUL
1 U+0001 002 Ctrl-A
2 U+0002 003 Ctrl-B
3 U+0003 004 Ctrl-C
4 U+0004 004 Ctrl-D
5 U+0005 005 Ctrl-E

Many functions such as drop that change the size or shape of a Series or DataFrame can manipulate an object in place
without returning a new object:

[15]: df.drop(0, inplace=True)

df
[15]: Code Decimal Octal Key
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E

Warning:
Be careful with the inplace option, as the dropped data will be irretrievably deleted.


2.4.7 Manipulation of strings

pandas offers the possibility to concisely apply Python’s string methods and regular expressions to whole arrays of data.
See also:
• string
• re

Vectorised string functions in pandas

Cleaning up a cluttered dataset for analysis often requires a lot of string manipulation. To make matters worse, a column
containing strings sometimes has missing data:

[1]: import pandas as pd


import numpy as np

addresses = {'Veit': np.nan, 'Veit Schiele': 'veit.schiele@cusy.io',
             'cusy GmbH': 'info@cusy.io'}
addresses = pd.Series(addresses)

addresses
[1]: Veit NaN
Veit Schiele veit.schiele@cusy.io
cusy GmbH info@cusy.io
dtype: object

[2]: addresses.isna()
[2]: Veit True
Veit Schiele False
cusy GmbH False
dtype: bool

You can apply string and regular expression methods to each value (by passing a lambda or another function) using
data.map, but this fails for NA values. To deal with this, Series has array-oriented methods for string operations that
skip over and propagate NA values. These are accessed via the str attribute of Series; for example, we could use
str.contains to check whether each email address contains veit:

[3]: addresses.str.contains('veit')
[3]: Veit NaN
Veit Schiele True
cusy GmbH False
dtype: object

Regular expressions can also be used, along with options such as IGNORECASE:

[4]: import re

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
matches = addresses.str.findall(pattern, flags=re.IGNORECASE).str[0]

matches


[4]: Veit NaN


Veit Schiele (veit.schiele, cusy, io)
cusy GmbH (info, cusy, io)
dtype: object

There are several ways to retrieve a vectorised element. Either use str.get or the index of str:

[5]: matches.str.get(1)
[5]: Veit NaN
Veit Schiele cusy
cusy GmbH cusy
dtype: object

Similarly, you can also cut strings with this syntax:

[6]: addresses.str[:5]
[6]: Veit NaN
Veit Schiele veit.
cusy GmbH info@
dtype: object

The pandas.Series.str.extract method returns the captured groups of a regular expression as a DataFrame:

[7]: addresses.str.extract(pattern, flags=re.IGNORECASE)


[7]: 0 1 2
Veit NaN NaN NaN
Veit Schiele veit.schiele cusy io
cusy GmbH info cusy io

More vectorised pandas string methods:


Method      Description
cat         concatenates strings element by element with optional delimiter
contains    returns a boolean array indicating whether each string contains a pattern/regex
count       counts occurrences of the pattern
extract     uses a regular expression with groups to extract one or more strings from a series of strings; the
            result is a DataFrame with one column per group
endswith    equivalent to x.endswith(pattern) for each element
startswith  equivalent to x.startswith(pattern) for each element
findall     computes the list of all occurrences of the pattern/regex for each string
get         indexes into each element (gets the i-th element)
isalnum     equivalent to the built-in str.isalnum
isalpha     equivalent to the built-in str.isalpha
isdecimal   equivalent to the built-in str.isdecimal
isdigit     equivalent to the built-in str.isdigit
islower     equivalent to the built-in str.islower
isnumeric   equivalent to the built-in str.isnumeric
isupper     equivalent to the built-in str.isupper
join        joins the strings in each element of the series with the passed separator character
len         calculates the length of each string
lower,      converts case; equivalent to x.lower() or x.upper() for each element
upper
match       uses re.match with the passed regular expression for each element, returning True or False if matched
pad         inserts spaces on the left, right or both sides of strings
center      equivalent to pad(side='both')
repeat      duplicates values (for example, s.str.repeat(3) is equivalent to x * 3 for each string)
replace     replaces the pattern/regex with another string
slice       slices each string in the series
split       splits strings using a delimiter or regular expression
strip       strips whitespace, including line breaks, from both sides
rstrip      strips whitespace on the right side
lstrip      strips whitespace on the left side
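
For example, a few of these methods applied to the addresses series from above (results as comments):

addresses.str.len()
# Veit             NaN
# Veit Schiele    20.0
# cusy GmbH       12.0
# dtype: float64

addresses.str.split('@').str.get(0)
# Veit                     NaN
# Veit Schiele    veit.schiele
# cusy GmbH               info
# dtype: object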

2.4.8 Arithmetic

An important function of pandas is the arithmetic behaviour for objects with different indices. When adding objects, if
the index pairs are not equal, the corresponding index in the result will be the union of the index pairs. For users with
database experience, this is comparable to an automatic outer join on the index labels. Let’s look at an example:

[1]: import numpy as np


import pandas as pd

rng = np.random.default_rng()
s1 = pd.Series(rng.normal(size=5))
s2 = pd.Series(rng.normal(size=7))

If you add these values, you get:

[2]: s1 + s2


[2]: 0 0.452985
1 -2.751479
2 1.170106
3 -1.583008
4 -0.507949
5 NaN
6 NaN
dtype: float64

The internal data alignment produces missing values at the label positions that do not overlap. Missing values are
then propagated in further arithmetic calculations.
For DataFrames, alignment is performed for both rows and columns:

[3]: df1 = pd.DataFrame(rng.normal(size=(5,3)))


df2 = pd.DataFrame(rng.normal(size=(7,2)))

When the two DataFrames are added together, the result is a DataFrame whose index and columns are the unions of
those in each of the DataFrames above:

[4]: df1 + df2


[4]: 0 1 2
0 -1.740413 0.566677 NaN
1 1.992954 -2.637117 NaN
2 -0.111516 -0.340413 NaN
3 0.418716 -0.847758 NaN
4 0.625907 1.675521 NaN
5 NaN NaN NaN
6 NaN NaN NaN

Since column 2 does not appear in both DataFrame objects, its values appear as missing in the result. The same applies
to the rows whose labels do not appear in both objects.

Arithmetic methods with fill values

In arithmetic operations between differently indexed objects, a special value (for example 0) can be useful if an axis
label is found in one object but not in the other. For this, the add method accepts the fill_value argument:

[5]: df12 = df1.add(df2, fill_value=0)

df12
[5]: 0 1 2
0 -1.740413 0.566677 0.715587
1 1.992954 -2.637117 0.888983
2 -0.111516 -0.340413 0.811355
3 0.418716 -0.847758 -0.233179
4 0.625907 1.675521 0.494883
5 1.780402 -0.349901 NaN
6 0.922128 -0.487242 NaN

In the following example, we set the two remaining NaN values to 0:


[6]: df12.iloc[[5,6], [2]] = 0

[7]: df12
[7]: 0 1 2
0 -1.740413 0.566677 0.715587
1 1.992954 -2.637117 0.888983
2 -0.111516 -0.340413 0.811355
3 0.418716 -0.847758 -0.233179
4 0.625907 1.675521 0.494883
5 1.780402 -0.349901 0.000000
6 0.922128 -0.487242 0.000000

Arithmetic methods

Method Description
add, radd methods for addition (+)
sub, rsub methods for subtraction (-)
div, rdiv methods for division (/)
floordiv, rfloordiv methods for floor division (//)
mul, rmul methods for multiplication (*)
pow, rpow methods for exponentiation (**)

The methods prefixed with r (for reverse) have their arguments swapped.
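
For example, the following two expressions are equivalent:

1 / df12
# is the same as
df12.rdiv(1)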

Operations between DataFrame and Series

As with NumPy arrays of different dimensions, the arithmetic between DataFrame and Series is also defined.

[8]: s1 + df12
[8]: 0 1 2 3 4
0 -2.533022 -1.635689 2.219348 NaN NaN
1 1.200346 -4.839483 2.392744 NaN NaN
2 -0.904124 -2.542779 2.315117 NaN NaN
3 -0.373892 -3.050124 1.270582 NaN NaN
4 -0.166702 -0.526845 1.998644 NaN NaN
5 0.987793 -2.552267 1.503761 NaN NaN
6 0.129520 -2.689608 1.503761 NaN NaN

If we add s1 and df12, the addition is performed once for each row. This is called broadcasting. By default, arithmetic
between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts
down the rows. If an index value is found neither in the columns of the DataFrame nor in the index of the Series, the
objects are re-indexed to form the union.
If you instead want to match on the rows and broadcast across the columns, you must use one of the arithmetic methods,
for example:

[9]: df12.add(s2, axis='index')


[9]: 0 1 2
0 -0.494820 1.812271 1.961180
1 1.443841 -3.186230 0.339870
2 -0.445171 -0.674068 0.477700
3 -0.380913 -1.647387 -1.032809
4 0.260930 1.310545 0.129906
5 2.116184 -0.014119 0.335782
6 1.471784 0.062413 0.549655

The axis you pass is the axis to align on. In this case, the row index of the DataFrame (axis='index' or
axis=0) is matched and broadcast over.

Function application and mapping

numpy.ufunc (element-wise array methods) also work with pandas objects:

[10]: np.abs(df12)
[10]: 0 1 2
0 1.740413 0.566677 0.715587
1 1.992954 2.637117 0.888983
2 0.111516 0.340413 0.811355
3 0.418716 0.847758 0.233179
4 0.625907 1.675521 0.494883
5 1.780402 0.349901 0.000000
6 0.922128 0.487242 0.000000

Another common operation is to apply a function to the one-dimensional arrays formed by each column or row. The
pandas.DataFrame.apply method does just that:

[11]: df12
[11]: 0 1 2
0 -1.740413 0.566677 0.715587
1 1.992954 -2.637117 0.888983
2 -0.111516 -0.340413 0.811355
3 0.418716 -0.847758 -0.233179
4 0.625907 1.675521 0.494883
5 1.780402 -0.349901 0.000000
6 0.922128 -0.487242 0.000000

[12]: f = lambda x: x.max() - x.min()

df12.apply(f)
[12]: 0 3.733368
1 4.312639
2 1.122163
dtype: float64

Here the function f, which calculates the difference between the maximum and minimum of a Series, is called once for
each column of the DataFrame. The result is a Series with the columns of the DataFrame as its index.
If you pass axis='columns' to apply, the function is called once per row instead:


[13]: df12.apply(f, axis='columns')


[13]: 0 2.456000
1 4.630072
2 1.151768
3 1.266474
4 1.180639
5 2.130302
6 1.409370
dtype: float64

Many of the most common array statistics (such as sum and mean) are DataFrame methods, so the use of apply is not
necessary.
The function passed to apply does not have to return a single value; it can also return a series with multiple values:

[14]: def f(x):


return pd.Series([x.min(), x.max()], index=['min', 'max'])

df12.apply(f)
[14]: 0 1 2
min -1.740413 -2.637117 -0.233179
max 1.992954 1.675521 0.888983

You can also use element-wise Python functions. Suppose you want to round each floating-point value in df12 to two
decimal places; you can do this with pandas.DataFrame.applymap:

[15]: f = lambda x: round(x, 2)

df12.applymap(f)
[15]: 0 1 2
0 -1.74 0.57 0.72
1 1.99 -2.64 0.89
2 -0.11 -0.34 0.81
3 0.42 -0.85 -0.23
4 0.63 1.68 0.49
5 1.78 -0.35 0.00
6 0.92 -0.49 0.00

The reason for the name applymap is that Series has a map method for applying an element-wise function:

[16]: df12[2].map(f)
[16]: 0 0.72
1 0.89
2 0.81
3 -0.23
4 0.49
5 0.00
6 0.00
Name: 2, dtype: float64


2.4.9 Descriptive statistics

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them fall into
the category of reductions or summary statistics, methods that extract a single value (such as the sum or mean) from
a series or set of values from the rows or columns of a DataFrame. Compared to similar methods found in NumPy
arrays, they also handle missing data.

[1]: import pandas as pd


import numpy as np

rng = np.random.default_rng()
df = pd.DataFrame(rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7))
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2
[1]: 0 1 2
2022-02-03 0.462527 -0.545698 0.044638
2022-02-04 1.400089 -0.237095 1.287454
2022-02-05 1.526841 0.926597 0.729344
2022-02-06 0.778086 1.439802 1.044385
2022-02-07 -0.190824 -1.726359 -0.426458
2022-02-08 -0.436732 0.899246 -0.906147
2022-02-09 NaN NaN NaN

Calling the pandas.DataFrame.sum method returns a series containing column totals:

[2]: df2.sum()
[2]: 0 3.539988
1 0.756492
2 1.773217
dtype: float64

Passing axis='columns' or axis=1 instead sums over the columns:

[3]: df2.sum(axis='columns')
[3]: 2022-02-03 -0.038534
2022-02-04 2.450449
2022-02-05 3.182782
2022-02-06 3.262273
2022-02-07 -2.343640
2022-02-08 -0.443633
2022-02-09 0.000000
Freq: D, dtype: float64

If an entire row or column contains all NA values, the sum is 0. This can be disabled with the skipna option:

[4]: df2.sum(axis='columns', skipna=False)


[4]: 2022-02-03 -0.038534
2022-02-04 2.450449
2022-02-05 3.182782
2022-02-06 3.262273
2022-02-07 -2.343640
2022-02-08 -0.443633
2022-02-09 NaN
Freq: D, dtype: float64

Some aggregations, such as mean, require at least one non-NaN value to produce a meaningful result:

[5]: df2.mean(axis='columns')
[5]: 2022-02-03 -0.012845
2022-02-04 0.816816
2022-02-05 1.060927
2022-02-06 1.087424
2022-02-07 -0.781213
2022-02-08 -0.147878
2022-02-09 NaN
Freq: D, dtype: float64

Options for reduction methods

Method Description
axis the axis of values to reduce: 0 for the rows of the DataFrame and 1 for the columns
skipna exclude missing values; by default True.
level reduce grouped by level if the axis is hierarchically indexed (MultiIndex)

Some methods, such as idxmin and idxmax, provide indirect statistics such as the index value at which the minimum
or maximum value is reached:

[6]: df2.idxmax()
[6]: 0 2022-02-05
1 2022-02-06
2 2022-02-04
dtype: datetime64[ns]

Other methods are accumulations:

[7]: df2.cumsum()
[7]: 0 1 2
2022-02-03 0.462527 -0.545698 0.044638
2022-02-04 1.862616 -0.782793 1.332092
2022-02-05 3.389457 0.143803 2.061436
2022-02-06 4.167544 1.583605 3.105822
2022-02-07 3.976720 -0.142754 2.679364
2022-02-08 3.539988 0.756492 1.773217
2022-02-09 NaN NaN NaN

Another type of method is neither reductions nor accumulations. describe is one such example that produces several
summary statistics in one go:


[8]: df2.describe()
[8]: 0 1 2
count 6.000000 6.000000 6.000000
mean 0.589998 0.126082 0.295536
std 0.806152 1.181456 0.867310
min -0.436732 -1.726359 -0.906147
25% -0.027486 -0.468548 -0.308684
50% 0.620307 0.331075 0.386991
75% 1.244589 0.919759 0.965625
max 1.526841 1.439802 1.287454

For non-numeric data, describe generates alternative summary statistics:

[9]: data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],


'Octal': ['001', '002', '003', '004', '004', '005']}
df3 = pd.DataFrame(data)

df3.describe()
[9]: Code Octal
count 6 6
unique 6 5
top U+0000 004
freq 1 2

Descriptive and summary statistics:

Method Description
count number of non-NA values
describe calculation of a set of summary statistics for series or each DataFrame column
min, max calculation of minimum and maximum values
argmin, calculation of the index points (integers) at which the minimum or maximum value was reached
argmax
idxmin, calculation of the index labels at which the minimum or maximum values were reached
idxmax
quantile calculation of the sample quantile in the range from 0 to 1
sum sum of the values
mean arithmetic mean of the values
median arithmetic median (50% quantile) of the values
mad mean absolute deviation from the mean value
prod product of all values
var sample variance of the values
std sample standard deviation of the values
skew sample skewness (third moment) of the values
kurt sample kurtosis (fourth moment) of the values
cumsum cumulative sum of the values
cummin, cumulated minimum and maximum of the values respectively
cummax
cumprod cumulated product of the values
diff calculation of the first arithmetic difference (useful for time series)
pct_change calculation of the percentage changes
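
For example, a few of these methods applied to df2 from above (results omitted):

df2.cummax()         # cumulative maximum per column
df2[0].diff()        # first difference to the previous row – useful for time series
df2[0].pct_change()  # percentage change from the previous row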


ydata-profiling

ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy,
but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report(),
which automatically generates a standardised report for understanding the data.

Installation

$ pipenv install ydata-profiling[notebook]
...
✓ Success!
Updated Pipfile.lock (cbc5f7)!
Installing dependencies from Pipfile.lock (cbc5f7)...
  80/80 — 00:02:26
...
$ pipenv run jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK

Example

[10]: from ydata_profiling import ProfileReport

profile = ProfileReport(df2, title="pandas Profiling Report")

profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]
VBox(children=(Tab(children=(Tab(children=(GridBox(children=(VBox(children=(GridspecLayout(children=(HTM

Configuration for large datasets

By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation
time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following
examples, we first read a larger data set into pandas:

[11]: titanic = pd.read_csv(
          "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
      )


1. Minimal mode

ydata-profiling contains a minimal configuration file config_minimal.yaml, in which the most expensive calculations
are turned off by default. This is the recommended starting point for larger data sets.

[12]: profile = ProfileReport(titanic, title="Minimal pandas Profiling Report", minimal=True)

profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]

Further details on settings and configuration can be found in Available settings.

2. Sample

An alternative option for very large data sets is to use only a part of them for the profiling report:

[13]: sample = titanic.sample(frac=0.05)

profile = ProfileReport(sample, title="Sample pandas Profiling Report")

profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]

3. Deactivate expensive calculations

To reduce the computational effort in large datasets, but still get some interesting information, some calculations can
be filtered only for certain columns:

[15]: profile = ProfileReport()

      profile.config.interactions.targets = ["Sex", "Age"]
      profile.df = titanic

      profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]
VBox(children=(Tab(children=(Tab(children=(GridBox(children=(VBox(children=(GridspecLayout(children=(HTM

The setting interactions.targets, can be changed via configuration files as well as via environment variables; see
Changing settings for details.


4. Concurrency

Currently, work is being done on a scalable Spark backend for ydata-profiling; see Spark Profiling Development.

2.4.10 Sorting and ranking

Sorting a data set by some criterion is another important built-in operation. Sorting lexicographically by row or column index
has already been described in the section Reordering and sorting from levels. In the following we look at sorting the values
with DataFrame.sort_values and Series.sort_values:

[1]: import numpy as np
     import pandas as pd

     rng = np.random.default_rng()
     s = pd.Series(rng.normal(size=7))

s.sort_index(ascending=False)
[1]: 6 0.807524
5 -0.045012
4 -0.276246
3 1.032609
2 1.067726
1 0.488613
0 1.324534
dtype: float64

All missing values are sorted to the end of the Series by default:

[2]: s = pd.Series(rng.normal(size=7))
s[s < 0] = np.nan

s.sort_values()
[2]: 5 0.186232
1 0.826051
6 1.649605
0 NaN
2 NaN
3 NaN
4 NaN
dtype: float64

With a DataFrame you can sort on both axes. With by you specify which column or row is to be sorted:

[3]: df = pd.DataFrame(rng.normal(size=(7, 3)))

df.sort_values(by=2, ascending=False)
[3]: 0 1 2
1 -0.013109 0.060716 1.837680
0 0.095855 -0.804874 1.201810
2 0.278646 -0.608821 0.498333
5 -0.680013 0.314085 0.382935
3 -0.368188 1.192103 -0.944575
6 1.097941 -0.207889 -0.974132
4 0.684861 1.651951 -2.388397

You can also sort rows with axis=1 and by:

[4]: df.sort_values(axis=1, by=[0,1], ascending=False)


[4]: 2 0 1
0 1.201810 0.095855 -0.804874
1 1.837680 -0.013109 0.060716
2 0.498333 0.278646 -0.608821
3 -0.944575 -0.368188 1.192103
4 -2.388397 0.684861 1.651951
5 0.382935 -0.680013 0.314085
6 -0.974132 1.097941 -0.207889

Ranking

DataFrame.rank and Series.rank assign ranks from one to the number of valid data points in an array:

[5]: df.rank()
[5]: 0 1 2
0 4.0 1.0 6.0
1 3.0 4.0 7.0
2 5.0 2.0 5.0
3 2.0 6.0 3.0
4 6.0 7.0 1.0
5 1.0 5.0 4.0
6 7.0 3.0 2.0

If ties occur in the ranking, the average rank is assigned to each entry in the tied group by default.

[6]: df2 = pd.concat([df, df[5:]])  # df.append was removed in pandas 2.0; concat duplicates the last two rows

df2.rank()
[6]: 0 1 2
0 5.0 1.0 8.0
1 4.0 5.0 9.0
2 6.0 2.0 7.0
3 3.0 8.0 4.0
4 7.0 9.0 1.0
5 1.5 6.5 5.5
6 8.5 3.5 2.5
5 1.5 6.5 5.5
6 8.5 3.5 2.5

The method='min' parameter, on the other hand, assigns the smallest rank to every entry in the group:

[7]: df2.rank(method='min')


[7]: 0 1 2
0 5.0 1.0 8.0
1 4.0 5.0 9.0
2 6.0 2.0 7.0
3 3.0 8.0 4.0
4 7.0 9.0 1.0
5 1.0 6.0 5.0
6 8.0 3.0 2.0
5 1.0 6.0 5.0
6 8.0 3.0 2.0

Other methods with rank

Method    Description
average   default: assign the average rank to each entry in the same group
min       uses the minimum rank for the whole group
max       uses the maximum rank for the whole group
first     assigns the ranks in the order in which the values appear in the data
dense     like method='min', but the ranks always increase by 1 between groups and not according to the number of equal items in a group
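
To illustrate the difference, here are two of these methods applied to df2 from above (illustrative only; the concrete ranks depend on the random data):

df2.rank(method='first')   # ties are resolved in the order in which the values appear
df2.rank(method='dense')   # like 'min', but ranks increase by 1 between groups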

2.4.11 Subdividing and categorising data

Continuous data is often divided into domains or otherwise grouped for analysis.
Suppose you have data on a group of people in a study that you want to divide into discrete age groups. For this, we
generate a DataFrame with 250 random ages between 0 and 98:

[1]: import pandas as pd
     import numpy as np

     ages = np.random.randint(0, 99, 250)
     df = pd.DataFrame({"Age": ages})

df
[1]: Age
0 29
1 91
2 45
3 16
4 22
.. ...
245 87
246 59
247 93
248 13
249 54

[250 rows x 1 columns]


Afterwards, pandas offers us a simple way to divide the results into ten ranges with pandas.cut. To get only whole
years, we additionally set precision=0:

[2]: cats = pd.cut(ages, 10, precision=0)

cats
[2]: [(20.0, 29.0], (88.0, 98.0], (39.0, 49.0], (10.0, 20.0], (20.0, 29.0], ..., (78.0, 88.0], (59.0, 69.0], (88.0, 98.0], (10.0, 20.0], (49.0, 59.0]]
Length: 250
Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... (59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]

With pandas.Categorical.categories you can display the categories:

[3]: cats.categories
[3]: IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0], (39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0], (78.0, 88.0], (88.0, 98.0]], dtype='interval[float64, right]')

… or even just a single category:

[4]: cats.categories[0]
[4]: Interval(-0.1, 10.0, closed='right')

With pandas.Categorical.codes you can display an array where for each value the corresponding category is shown:

[5]: cats.codes
[5]: array([2, 9, 4, 1, 2, 9, 7, 2, 7, 6, 0, 6, 0, 3, 3, 9, 2, 4, 0, 2, 0, 6,
4, 5, 5, 0, 7, 9, 5, 7, 1, 1, 4, 6, 0, 7, 9, 9, 3, 4, 9, 2, 5, 2,
4, 9, 8, 9, 2, 0, 9, 5, 4, 9, 3, 4, 3, 3, 8, 7, 2, 5, 3, 2, 2, 9,
3, 7, 9, 2, 8, 8, 8, 4, 6, 1, 4, 3, 4, 3, 4, 6, 8, 1, 7, 2, 0, 4,
1, 8, 3, 5, 2, 5, 9, 9, 1, 4, 7, 6, 5, 2, 9, 9, 5, 3, 1, 1, 7, 3,
3, 7, 8, 4, 6, 3, 0, 2, 6, 8, 4, 6, 6, 5, 0, 4, 3, 5, 4, 0, 1, 6,
4, 1, 4, 5, 3, 4, 9, 6, 6, 9, 9, 8, 4, 7, 1, 1, 4, 6, 9, 7, 5, 3,
3, 4, 2, 6, 5, 4, 3, 7, 7, 8, 5, 5, 4, 9, 1, 3, 5, 3, 3, 9, 0, 8,
7, 8, 9, 1, 5, 7, 8, 6, 0, 1, 4, 2, 2, 8, 7, 4, 2, 0, 6, 2, 7, 4,
1, 4, 9, 1, 7, 6, 0, 7, 7, 0, 9, 6, 8, 3, 4, 4, 2, 3, 5, 2, 6, 3,
6, 2, 2, 9, 8, 9, 7, 1, 9, 9, 0, 5, 0, 2, 0, 5, 2, 5, 5, 3, 9, 7,
3, 0, 0, 8, 6, 9, 1, 5], dtype=int8)

With value_counts we can now look at how the number is distributed among the individual areas:

[6]: pd.value_counts(cats)
[6]: (39.0, 49.0] 32
(88.0, 98.0] 32
(29.0, 39.0] 28
(20.0, 29.0] 27
(49.0, 59.0] 25
(69.0, 78.0] 24
(59.0, 69.0] 23
(-0.1, 10.0] 21
(10.0, 20.0] 20
(78.0, 88.0] 18
dtype: int64

It is striking that the age ranges do not contain an equal number of years: the ranges (20.0, 29.0] and (69.0, 78.0]
cover only 9 years each. This is because the ages only extend from 0 to 98:

[7]: df.min()
[7]: Age 0
dtype: int64

[8]: df.max()
[8]: Age 98
dtype: int64

With pandas.qcut, on the other hand, the set is divided into areas that are approximately the same size:

[9]: cats = pd.qcut(ages, 10, precision=0)

[10]: pd.value_counts(cats)
[10]: (12.0, 23.0] 29
(32.0, 41.0] 28
(78.0, 91.0] 28
(56.0, 68.0] 26
(-1.0, 12.0] 25
(48.0, 56.0] 24
(68.0, 78.0] 24
(41.0, 48.0] 23
(91.0, 98.0] 22
(23.0, 32.0] 21
dtype: int64

If we want to ensure that each age group actually includes exactly ten years, we can specify this directly with
pandas.Categorical:

[11]: age_groups = ["{0} - {1}".format(i, i + 9) for i in range(0, 99, 10)]

      cats = pd.Categorical(age_groups)

cats.categories
[11]: Index(['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59',
'60 - 69', '70 - 79', '80 - 89', '90 - 99'],
dtype='object')

For grouping we can now use pandas.cut. However, the number of labels must be one less than the number of edges:

[12]: df["Age group"] = pd.cut(df.Age, range(0, 101, 10), right=False, labels=cats)

df
[12]: Age Age group
0 29 20 - 29
1 91 90 - 99
2 45 40 - 49
3 16 10 - 19
4 22 20 - 29
.. ... ...
245 87 80 - 89
246 59 50 - 59
247 93 90 - 99
248 13 10 - 19
249 54 50 - 59

[250 rows x 2 columns]

2.4.12 Combining and merging data sets

Data contained in pandas objects can be combined in several ways:


• pandas.merge joins rows in DataFrames based on one or more keys. This function is familiar from SQL or other
relational databases, as it implements database join operations.
• pandas.concat concatenates or stacks objects along an axis.
• The instance methods pandas.DataFrame.combine_first or pandas.Series.combine_first allow overlapping data
to be joined.
• With pandas.merge_asof you can perform time series based window joins between DataFrame objects.
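
As a minimal sketch of pandas.concat and combine_first with two small, made-up Series (merge is covered in detail below):

import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['b', 'c'])

pd.concat([s1, s2])     # stacks both Series one below the other
s1.combine_first(s2)    # values from s1, gaps filled from s2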

Database-like DataFrame joins

Merge or join operations combine data sets by linking rows with one or more keys. These operations are especially
important in relational, SQL-based databases. The merge function in pandas is the main entry point for applying these
algorithms to your data.

[1]: import pandas as pd

[2]: encoding = pd.DataFrame({'Unicode': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],
                              'Decimal': [0, 1, 2, 3, 4, 5],
                              'Octal': ['000', '001', '002', '003', '004', '005'],
                              'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E']})

     update = pd.DataFrame({'Unicode': ['U+0003', 'U+0004', 'U+0005', 'U+0006', 'U+0007', 'U+0008', 'U+0009'],
                            'Decimal': [3, 4, 5, 6, 7, 8, 9],
                            'Octal': ['003', '004', '005', '006', '007', '008', '009'],
                            'Key': ['Ctrl-C', 'Ctrl-D', 'Ctrl-E', 'Ctrl-F', 'Ctrl-G', 'Ctrl-H', 'Ctrl-I']})

     encoding, update
[2]: ( Unicode Decimal Octal Key
0 U+0000 0 000 NUL
1 U+0001 1 001 Ctrl-A
2 U+0002 2 002 Ctrl-B
3 U+0003 3 003 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E,
Unicode Decimal Octal Key
0 U+0003 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E
3 U+0006 6 006 Ctrl-F
4 U+0007 7 007 Ctrl-G
5 U+0008 8 008 Ctrl-H
6 U+0009 9 009 Ctrl-I)

When we call merge with these objects, we get:

[3]: pd.merge(encoding, update)


[3]: Unicode Decimal Octal Key
0 U+0003 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E

By default, merge performs a so-called inner join; the keys in the result are the intersection or common set in both
tables.
Note:
I did not specify which column to merge over. If this information is not specified, merge will use the overlapping
column names as keys. However, it is good practice to specify this explicitly:

[4]: pd.merge(encoding, update, on='Unicode')


[4]: Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y
0 U+0003 3 003 Ctrl-C 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E 5 005 Ctrl-E

If the column names are different in each object, you can specify them separately. In the following example update2
gets the key U+ and not Unicode:

[5]: update2 = pd.DataFrame({'U+': ['U+0003', 'U+0004', 'U+0005', 'U+0006', 'U+0007', 'U+0008', 'U+0009'],
                             'Decimal': [3, 4, 5, 6, 7, 8, 9],
                             'Octal': ['003', '004', '005', '006', '007', '008', '009'],
                             'Key': ['Ctrl-C', 'Ctrl-D', 'Ctrl-E', 'Ctrl-F', 'Ctrl-G', 'Ctrl-H', 'Ctrl-I']})

     pd.merge(encoding, update2, left_on='Unicode', right_on='U+')


[5]: Unicode Decimal_x Octal_x Key_x U+ Decimal_y Octal_y Key_y
0 U+0003 3 003 Ctrl-C U+0003 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D U+0004 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E U+0005 5 005 Ctrl-E


However, you can use merge not only to perform an inner join, with which the keys in the result are the intersection or
common set in both tables. Other possible options are:

Option Behaviour
how='inner' uses only the key combinations observed in both tables
how='left' uses all key combinations found in the left table
how='right' uses all key combinations found in the right table
how='outer' uses all key combinations observed in both tables together

[6]: pd.merge(encoding, update, on='Unicode', how='left')


[6]: Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y
0 U+0000 0 000 NUL NaN NaN NaN
1 U+0001 1 001 Ctrl-A NaN NaN NaN
2 U+0002 2 002 Ctrl-B NaN NaN NaN
3 U+0003 3 003 Ctrl-C 3.0 003 Ctrl-C
4 U+0004 4 004 Ctrl-D 4.0 004 Ctrl-D
5 U+0005 5 005 Ctrl-E 5.0 005 Ctrl-E

[7]: pd.merge(encoding, update, on='Unicode', how='outer')


[7]: Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y
0 U+0000 0.0 000 NUL NaN NaN NaN
1 U+0001 1.0 001 Ctrl-A NaN NaN NaN
2 U+0002 2.0 002 Ctrl-B NaN NaN NaN
3 U+0003 3.0 003 Ctrl-C 3.0 003 Ctrl-C
4 U+0004 4.0 004 Ctrl-D 4.0 004 Ctrl-D
5 U+0005 5.0 005 Ctrl-E 5.0 005 Ctrl-E
6 U+0006 NaN NaN NaN 6.0 006 Ctrl-F
7 U+0007 NaN NaN NaN 7.0 007 Ctrl-G
8 U+0008 NaN NaN NaN 8.0 008 Ctrl-H
9 U+0009 NaN NaN NaN 9.0 009 Ctrl-I

The join type only affects the distinct key values that appear in the result.
To merge on multiple keys, you can pass a list of column names:

[8]: pd.merge(encoding, update, on=['Unicode', 'Decimal', 'Octal', 'Key'], how='outer')


[8]: Unicode Decimal Octal Key
0 U+0000 0 000 NUL
1 U+0001 1 001 Ctrl-A
2 U+0002 2 002 Ctrl-B
3 U+0003 3 003 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E
6 U+0006 6 006 Ctrl-F
7 U+0007 7 007 Ctrl-G
8 U+0008 8 008 Ctrl-H
9 U+0009 9 009 Ctrl-I


2.4.13 Group operations

By groupby is meant a process that involves one or more of the following steps:
• Split divides the data into groups according to certain criteria
• Apply applies a function independently to each group
• Combine combines the results in a data structure
In the first phase of the process, the data contained in a pandas object, be it a Series, a DataFrame or something else,
is split into groups based on one or more keys. The division is done on a particular axis of an object. For example, a
DataFrame can be grouped by its rows (axis=0) or its columns (axis=1). Then, a function is applied to each group
to create a new value. Finally, the results of all these function applications are combined in a result object. The shape
of the result object usually depends on what is done with the data.
Each grouping key can take many forms, and the keys do not all have to be of the same type:
• a list or array of values that have the same length as the axis being grouped
• a value that specifies a column name in a DataFrame
• a dict or series that is a correspondence between the values on the axis being grouped and the group names
• a function that is called on the axis index or the individual labels in the index
Note:
The latter three methods are shortcuts to create an array of values that will be used to divide the object.
Don’t worry if this all seems abstract. Throughout this chapter I will give many examples of all these methods. For
starters, here is a small table dataset as a DataFrame:

[1]: import pandas as pd
     import numpy as np

[2]: df = pd.DataFrame({'Title' : ['Jupyter Tutorial',
                                   'Jupyter Tutorial',
                                   'PyViz Tutorial',
                                   None,
                                   'Python Basics',
                                   'Python Basics'],
                        'Language' : ['de', 'en', 'de', None, 'de', 'en'],
                        '2021-12' : [19651,4722,2573,None,525,157],
                        '2022-01' : [30134,3497,4873,None,427,85],
                        '2022-02' : [33295,4009,3930,None,276,226]})

df
[2]: Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
2 PyViz Tutorial de 2573.0 4873.0 3930.0
3 None None NaN NaN NaN
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0

Suppose you want to calculate the sum of the column 2022-02 using the labels of Title. There are several ways to do this.
One is to access 2022-02 and call groupby with the column (a Series) in Title:


[3]: grouped = df['2022-02'].groupby(df['Title'])

grouped
[3]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fcfe83a6970>

This grouped variable is now a special SeriesGroupBy object. It has not yet calculated anything except some
intermediate data about the group key df['Title']. The idea is that this object has all the information needed to apply
an operation to each of the groups. For example, to calculate the group sums, we can call the sum method of the
GroupBy object:

[4]: grouped.sum()
[4]: Title
Jupyter Tutorial 37304.0
PyViz Tutorial 3930.0
Python Basics 502.0
Name: 2022-02, dtype: float64

Later I will explain more about what happens when you call .sum(). The important thing to note here is that the data
(a Series) has been aggregated by splitting it on the group key, creating a new Series that is now indexed by the
unique values in the Title column. The resulting index is named Title because groupby(df['Title']) did this.
If we had passed multiple arrays as a list instead, we would get something different:

[5]: sums = df['2021-12'].groupby([df['Language'], df['Title']]).sum()

sums
[5]: Language Title
de Jupyter Tutorial 19651.0
PyViz Tutorial 2573.0
Python Basics 525.0
en Jupyter Tutorial 4722.0
Python Basics 157.0
Name: 2021-12, dtype: float64

Here we have grouped the data based on two keys, and the resulting series now has a hierarchical index consisting of
the observed unique key pairs:

[6]: sums.unstack()
[6]: Title Jupyter Tutorial PyViz Tutorial Python Basics
Language
de 19651.0 2573.0 525.0
en 4722.0 NaN 157.0

Often the grouping information is in the same DataFrame as the data you want to edit. In this case, you can pass column
names (whether they are strings, numbers or other Python objects) as group keys:

[7]: df.groupby('Title').sum()
[7]: 2021-12 2022-01 2022-02
Title
Jupyter Tutorial 24373.0 33631.0 37304.0
PyViz Tutorial 2573.0 4873.0 3930.0
Python Basics 682.0 512.0 502.0


Here it is noticeable that the result does not contain a Language column. Since df['Language'] is not numeric
data, it is a so-called nuisance column and is therefore automatically excluded from the result. By default, all numeric
columns are aggregated.

[8]: df.groupby(['Title','Language']).sum()
[8]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 19651.0 30134.0 33295.0
en 4722.0 3497.0 4009.0
PyViz Tutorial de 2573.0 4873.0 3930.0
Python Basics de 525.0 427.0 276.0
en 157.0 85.0 226.0

Regardless of the goal of using groupby, a generally useful groupby method is size, which returns a series with the
group sizes:

[9]: df.groupby(['Language']).size()
[9]: Language
de 3
en 2
dtype: int64

Note:
All missing values in a group key are excluded from the result by default. This behaviour can be disabled
by passing dropna=False to groupby:

[10]: df.groupby('Language', dropna=False).size()


[10]: Language
de 3
en 2
NaN 1
dtype: int64

[11]: df.groupby(['Title', 'Language'], dropna=False).size()


[11]: Title Language
Jupyter Tutorial de 1
en 1
PyViz Tutorial de 1
Python Basics de 1
en 1
NaN NaN 1
dtype: int64


Iteration over groups

The object returned by groupby supports iteration and produces a sequence of 2-tuples containing the group name
along with the data packet. Consider the following:

[12]: for name, group in df.groupby('Title'):
          print(name)
          print(group)
Jupyter Tutorial
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
PyViz Tutorial
Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0
Python Basics
Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0

With multiple keys, the first element of the tuple is a tuple of key values:

[13]: for (i1, i2), group in df.groupby(['Title', 'Language']):
          print((i1, i2))
          print(group)
('Jupyter Tutorial', 'de')
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
('Jupyter Tutorial', 'en')
Title Language 2021-12 2022-01 2022-02
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
('PyViz Tutorial', 'de')
Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0
('Python Basics', 'de')
Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
('Python Basics', 'en')
Title Language 2021-12 2022-01 2022-02
5 Python Basics en 157.0 85.0 226.0

Next, we want to output a dict of the data as a one-liner:

[14]: books = dict(list(df.groupby('Title')))

books
[14]: {'Jupyter Tutorial': Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0,
'PyViz Tutorial': Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0,
'Python Basics': Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0}

By default, groupby groups on axis=0, but you can also group on any of the other axes. For example, we could group
the columns of our example df here by dtype as follows:

[15]: df.dtypes
[15]: Title object
Language object
2021-12 float64
2022-01 float64
2022-02 float64
dtype: object

[16]: grouped = df.groupby(df.dtypes, axis=1)

[17]: for dtype, group in grouped:
          print(dtype)
          print(group)
float64
2021-12 2022-01 2022-02
0 19651.0 30134.0 33295.0
1 4722.0 3497.0 4009.0
2 2573.0 4873.0 3930.0
3 NaN NaN NaN
4 525.0 427.0 276.0
5 157.0 85.0 226.0
object
Title Language
0 Jupyter Tutorial de
1 Jupyter Tutorial en
2 PyViz Tutorial de
3 None None
4 Python Basics de
5 Python Basics en

Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect
of subdividing columns for aggregation. This means that:

[18]: df.groupby('Title')['2021-12']
df.groupby('Title')[['2022-01']]
[18]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fcfc08cde50>

are simplified spellings for:

[19]: df['2021-12'].groupby(df['Title'])
df[['2022-01']].groupby(df['Title'])


[19]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fcfc08e8250>

Especially for large datasets, it may be desirable to aggregate only some columns. For example, to calculate the sum
for only the column 2022-01 in the previous dataset and get the result as a DataFrame, we could write:

[20]: df.groupby(['Title', 'Language'])[['2022-01']].sum()


[20]: 2022-01
Title Language
Jupyter Tutorial de 30134.0
en 3497.0
PyViz Tutorial de 4873.0
Python Basics de 427.0
en 85.0

The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped series
if only a single column name is passed as a scalar:

[21]: series_grouped = df.groupby(['Title', 'Language'])['2022-01']

series_grouped
[21]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fcfc08e8be0>

[22]: series_grouped.sum()
[22]: Title Language
Jupyter Tutorial de 30134.0
en 3497.0
PyViz Tutorial de 4873.0
Python Basics de 427.0
en 85.0
Name: 2022-01, dtype: float64

Grouping with dicts and series

Grouping information can also be in a form other than an array:

[23]: df.iloc[2:3, [2, 3]] = np.nan

Suppose I have a group correspondence for the columns and want to group the columns together by group:

[24]: mapping = {'2021-12': 'Dec 2021',
                 '2022-01': 'Jan 2022',
                 '2022-02': 'Feb 2022'}

Now an array could be constructed from this dict to pass to groupby, but instead we can just pass the dict:

[25]: by_column = df.groupby(mapping, axis=1)

by_column.sum()
[25]: Dec 2021 Feb 2022 Jan 2022
0 19651.0 33295.0 30134.0
1 4722.0 4009.0 3497.0
2 0.0 3930.0 0.0
3 0.0 0.0 0.0
4 525.0 276.0 427.0
5 157.0 226.0 85.0

The same functionality applies to Series:

[26]: map_series = pd.Series(mapping)

map_series
[26]: 2021-12 Dec 2021
2022-01 Jan 2022
2022-02 Feb 2022
dtype: object

[27]: df.groupby(map_series, axis=1).sum()


[27]: Dec 2021 Feb 2022 Jan 2022
0 19651.0 33295.0 30134.0
1 4722.0 4009.0 3497.0
2 0.0 3930.0 0.0
3 0.0 0.0 0.0
4 525.0 276.0 427.0
5 157.0 226.0 85.0

Grouping with Functions

Using Python functions is a more general method of defining a group assignment compared to a Dict or Series. Each
function passed as a group key is called once per index value, with the return values used as group names. Specifically,
consider the example DataFrame from the previous section, which contains the titles as index values. If you want to
group by the length of the names, you can calculate an array with the lengths of the strings, but it is easier to pass
the len function:

[28]: df = pd.DataFrame([[19651,30134,33295],
[4722,3497,4009],
[2573,4873,3930],
[525,427,276],
[157,85,226]],
index=['Jupyter Tutorial',
'Jupyter Tutorial',
'PyViz Tutorial',
'Python Basics',
'Python Basics'],
columns=['2021-12', '2022-01', '2022-02'])

[29]: df.groupby(len).count()
[29]: 2021-12 2022-01 2022-02
13 2 2 2
14 1 1 1
16 2 2 2


Mixing functions with arrays, dicts or series is no problem, as everything is converted internally into arrays:

[30]: languages = ['de', 'en', 'de', 'de', 'en']

[31]: df.groupby([len, languages]).count()


[31]: 2021-12 2022-01 2022-02
13 de 1 1 1
en 1 1 1
14 de 1 1 1
16 de 1 1 1
en 1 1 1

Grouping by index levels

A final practical feature for hierarchically indexed datasets is the ability to aggregate by one of the index levels of an
axis. Let’s look at an example:

[32]: version_hits = [[19651,0,30134,0,33295,0],
                      [4722,1825,3497,2576,4009,3707],
                      [2573,0,4873,0,3930,0],
                      [None,None,None,None,None,None],
                      [525,0,427,0,276,0],
                      [157,0,85,0,226,0]]

df = pd.DataFrame(version_hits,
index=[['Jupyter Tutorial',
'Jupyter Tutorial',
'PyViz Tutorial',
None,
'Python Basics',
'Python Basics'],
['de', 'en', 'de', None, 'de', 'en']],
columns=[['2021-12', '2021-12',
'2022-01', '2022-01',
'2022-02', '2022-02'],
['latest', 'stable',
'latest', 'stable',
'latest', 'stable']])

df.columns.names = ['Month', 'Version']

df
[32]: Month 2021-12 2022-01 2022-02
Version latest stable latest stable latest stable
Jupyter Tutorial de 19651.0 0.0 30134.0 0.0 33295.0 0.0
en 4722.0 1825.0 3497.0 2576.0 4009.0 3707.0
PyViz Tutorial de 2573.0 0.0 4873.0 0.0 3930.0 0.0
NaN NaN NaN NaN NaN NaN NaN NaN
Python Basics de 525.0 0.0 427.0 0.0 276.0 0.0
en 157.0 0.0 85.0 0.0 226.0 0.0


[33]: df.groupby(level='Month', axis=1).sum()


[33]: Month 2021-12 2022-01 2022-02
Jupyter Tutorial de 19651.0 30134.0 33295.0
en 6547.0 6073.0 7716.0
PyViz Tutorial de 2573.0 4873.0 3930.0
NaN NaN 0.0 0.0 0.0
Python Basics de 525.0 427.0 276.0
en 157.0 85.0 226.0

2.4.14 Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays. In the previous examples, several
of them were used, including count and sum. You may now be wondering what happens when you apply sum() to a
GroupBy object. Optimised implementations exist for many common aggregations, such as those in the following
table. However, they are not limited to this set of methods.
Function name    Description
any, all Returns True if one (or more) or all of the non-NA values are truthy
count Number of non-NA values
cummin, cummax Cumulative minimum and maximum of the non-NA values
cumsum Cumulative sum of the non-NA values
cumprod Cumulative product of non-NA values
first, last First and last non-NA values
mean Mean of the non-NA values
median Arithmetic median of the non-NA values
min, max Minimum and maximum of the non-NA values
nth Retrieval of the nth largest value
ohlc calculates the four open-high-low-close statistics for time series-like data
prod Product of the non-NA values
quantile calculates the sample quantile
rank Ordinal ranks of non-NA values, as when calling Series.rank
sum Sum of non-NA values
std, var Standard deviation and variance of the sample
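
A few of these optimised aggregations in action, with a small made-up DataFrame just for illustration:

import pandas as pd

hits = pd.DataFrame({'Title': ['A', 'A', 'B', 'B'],
                     'Hits': [10, 30, 20, 40]})
grouped_hits = hits.groupby('Title')['Hits']

grouped_hits.first()        # first non-NA value per group
grouped_hits.quantile(0.5)  # sample quantile per group
grouped_hits.cumsum()       # cumulative sum within each group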

You can use your own aggregations and also call any method that is also defined for the grouped object. For example,
the Series method nsmallest selects the smallest requested number of values from the data.
Although nsmallest is not explicitly implemented for GroupBy, we can still use it with a non-optimised
implementation. Internally, GroupBy decomposes the Series, calls nsmallest(n) for each piece and then merges these
results in the result object:

[1]: import pandas as pd
     import numpy as np

[2]: df = pd.DataFrame({'Title' : ['Jupyter Tutorial',
                                   'Jupyter Tutorial',
                                   'PyViz Tutorial',
                                   None,
                                   'Python Basics',
                                   'Python Basics'],
                        '2021-12' : [30134,6073,4873,None,427,95],
                        '2022-01' : [33295,7716,3930,None,276,226],
                        '2022-02' : [19651,6547,2573,None,525,157]})

     df
[2]: Title 2021-12 2022-01 2022-02
0 Jupyter Tutorial 30134.0 33295.0 19651.0
1 Jupyter Tutorial 6073.0 7716.0 6547.0
2 PyViz Tutorial 4873.0 3930.0 2573.0
3 None NaN NaN NaN
4 Python Basics 427.0 276.0 525.0
5 Python Basics 95.0 226.0 157.0

[3]: grouped = df.groupby('Title')

[4]: grouped['2022-01'].nsmallest(1)
[4]: Title
Jupyter Tutorial 1 7716.0
PyViz Tutorial 2 3930.0
Python Basics 5 226.0
Name: 2022-01, dtype: float64

To use a custom aggregation function, pass any function that aggregates an array to the aggregate or agg method:

[5]: def range(arr):
         # Note: this shadows Python's built-in range; it is reused below as an aggregation function.
         return arr.max() - arr.min()

     grouped.agg(range)
[5]: 2021-12 2022-01 2022-02
Title
Jupyter Tutorial 24061.0 25579.0 13104.0
PyViz Tutorial 0.0 0.0 0.0
Python Basics 332.0 50.0 368.0

You will find that some methods like describe also work, even though they are not strictly speaking aggregations:

[6]: grouped.describe()
[6]: 2021-12 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5
PyViz Tutorial 1.0 4873.0 NaN 4873.0 4873.00 4873.0
Python Basics 2.0 261.0 234.759451 95.0 178.00 261.0

2022-01 ... \
75% max count mean ... 75% max
Title ...
Jupyter Tutorial 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0
PyViz Tutorial 4873.00 4873.0 1.0 3930.0 ... 3930.00 3930.0
Python Basics 344.00 427.0 2.0 251.0 ... 263.50 276.0

2022-02 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0
PyViz Tutorial 1.0 2573.0 NaN 2573.0 2573.0 2573.0
Python Basics 2.0 341.0 260.215295 157.0 249.0 341.0

75% max
Title
Jupyter Tutorial 16375.0 19651.0
PyViz Tutorial 2573.0 2573.0
Python Basics 433.0 525.0

[3 rows x 24 columns]

Note:
Custom aggregation functions are generally much slower than the optimised functions in the table above. This is because
there is some extra work involved in creating the intermediate data sets for the group (function calls, reordering of data).
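
You can check this yourself, for example by comparing the optimised mean with an equivalent custom function (the actual timings depend on your machine):

%timeit grouped.agg('mean')              # optimised implementation
%timeit grouped.agg(lambda x: x.mean())  # custom function, called once per group and column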

Additional functions column by column

As we have already seen, aggregating a Series or all columns of a DataFrame is a matter of using aggregate (or
agg) with the desired function or calling a method such as mean or std. However, it is more common to aggregate
simultaneously with another function depending on the column or with multiple functions.

[7]: grouped.agg('mean')
[7]: 2021-12 2022-01 2022-02
Title
Jupyter Tutorial 18103.5 20505.5 13099.0
PyViz Tutorial 4873.0 3930.0 2573.0
Python Basics 261.0 251.0 341.0

If you pass a list of functions or function names instead, you will get back a DataFrame with column names from the
functions:

[8]: grouped.agg(['mean', 'std', range])


[8]: 2021-12 2022-01 \
mean std range mean std
Title
Jupyter Tutorial 18103.5 17013.696262 24061.0 20505.5 18087.084356
PyViz Tutorial 4873.0 NaN 0.0 3930.0 NaN
Python Basics 261.0 234.759451 332.0 251.0 35.355339

2022-02
range mean std range
Title
Jupyter Tutorial 25579.0 13099.0 9265.927261 13104.0
PyViz Tutorial 0.0 2573.0 NaN 0.0
Python Basics 50.0 341.0 260.215295 368.0

Here we have passed agg a list of aggregation functions to be evaluated independently for the data groups.
You don’t need to accept the names that GroupBy gives to the columns; in particular, lambda functions have the name
<lambda>, which makes them difficult to identify. When you pass a list of tuples, the first element of each tuple is used
as the column name in the DataFrame:

[9]: grouped.agg([('Mean', 'mean'), ('Standard deviation', 'std'), ('Range', range)])


[9]: 2021-12 2022-01 \
Mean Standard deviation Range Mean
Title
Jupyter Tutorial 18103.5 17013.696262 24061.0 20505.5
PyViz Tutorial 4873.0 NaN 0.0 3930.0
Python Basics 261.0 234.759451 332.0 251.0

2022-02 \
Standard deviation Range Mean Standard deviation
Title
Jupyter Tutorial 18087.084356 25579.0 13099.0 9265.927261
PyViz Tutorial NaN 0.0 2573.0 NaN
Python Basics 35.355339 50.0 341.0 260.215295

Range
Title
Jupyter Tutorial 13104.0
PyViz Tutorial 0.0
Python Basics 368.0

With a DataFrame, you have the option of specifying a list of functions to be applied to all columns or to different
functions per column. Let’s say we want to calculate the same three statistics for the columns:

[10]: stats = ['count', 'mean', 'max']

evaluations = grouped.agg(stats)

evaluations
[10]: 2021-12 2022-01 2022-02 \
count mean max count mean max count
Title
Jupyter Tutorial 2 18103.5 30134.0 2 20505.5 33295.0 2
PyViz Tutorial 1 4873.0 4873.0 1 3930.0 3930.0 1
Python Basics 2 261.0 427.0 2 251.0 276.0 2

mean max
Title
Jupyter Tutorial 13099.0 19651.0
PyViz Tutorial 2573.0 2573.0
Python Basics 341.0 525.0

As you can see, the resulting DataFrame has hierarchical columns, just as you would get if you aggregated each column
separately and used pandas.concat to join the results together, using the column names as key arguments:

[11]: evaluations['2021-12']
[11]: count mean max
Title
Jupyter Tutorial 2 18103.5 30134.0
PyViz Tutorial 1 4873.0 4873.0
Python Basics 2 261.0 427.0

As before, a list of tuples with user-defined names can be passed:

[12]: tuples = [('Mean', 'mean'), ('Variance', np.var)]

grouped[['2021-12', '2022-01']].agg(tuples)
[12]: 2021-12 2022-01
Mean Variance Mean Variance
Title
Jupyter Tutorial 18103.5 289465860.5 20505.5 327142620.5
PyViz Tutorial 4873.0 NaN 3930.0 NaN
Python Basics 261.0 55112.0 251.0 1250.0

If we now assume that potentially different functions are to be applied to one or more of the columns, we pass a dict
to agg that contains an assignment of column names to one of the function specifications:

[13]: grouped.agg({'2021-12' : 'mean', '2022-01' : np.var})


[13]: 2021-12 2022-01
Title
Jupyter Tutorial 18103.5 327142620.5
PyViz Tutorial 4873.0 NaN
Python Basics 261.0 1250.0

[14]: grouped.agg({'2021-12' : ['min', 'max', 'mean', 'std'],
                   '2022-01' : 'sum'})
[14]: 2021-12 2022-01
min max mean std sum
Title
Jupyter Tutorial 6073.0 30134.0 18103.5 17013.696262 41011.0
PyViz Tutorial 4873.0 4873.0 4873.0 NaN 3930.0
Python Basics 95.0 427.0 261.0 234.759451 502.0


Return aggregated data without row indices

In all the examples so far, the aggregated data is returned with an index. Since this is not always desired, you can
disable this behaviour in most cases by passing as_index=False to groupby:

[15]: grouped.agg([range], as_index=False).mean()


[15]: 2021-12 range 8131.000000
2022-01 range 8543.000000
2022-02 range 4490.666667
dtype: float64

Using as_index=False avoids some unnecessary calculations. Of course, it is always possible to get the result back
with an index by calling reset_index on the result.
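
A short sketch of both variants with the df from above:

df.groupby('Title', as_index=False)['2022-01'].sum()  # group keys as a regular column
df.groupby('Title')['2022-01'].sum().reset_index()    # same result via reset_index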

2.4.15 Apply

The most general GroupBy method is apply. It splits the object to be processed, calls the passed function on each part
and then tries to chain the parts together.
Suppose we want to select the five largest hit values by group. To do this, we first write a function that selects the
rows with the largest values in a particular column:

[1]: import pandas as pd
     import numpy as np

[2]: df = pd.DataFrame({'2021-12' : [30134,6073,4873,None,427,95],
                        '2022-01' : [33295,7716,3930,None,276,226],
                        '2022-02' : [19651,6547,2573,None,525,157]},
                       index=[['Jupyter Tutorial',
                               'Jupyter Tutorial',
                               'PyViz Tutorial',
                               'PyViz Tutorial',
                               'Python Basics',
                               'Python Basics'],
                              ['de', 'en', 'de', 'en', 'de', 'en']])
     df.index.names = ['Title', 'Language']

df
[2]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0

[3]: def top(df, n=5, column='2021-12'):
         return df.sort_values(by=column, ascending=False)[:n]

     top(df, n=3)


[3]: 2021-12 2022-01 2022-02


Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0

If we now group by titles, for example, and call apply with this function, we get the following:

[4]: grouped_titles = df.groupby('Title')

grouped_titles.apply(top)
[4]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0

What happened here? The top function is called for each row group of the DataFrame, and then the results are
concatenated with pandas.concat, labelling the parts with the group names. The result therefore has a hierarchical
index whose inner level contains index values from the original DataFrame.
If you pass a function to apply that takes other arguments or keywords, you can pass them after the function:

[5]: grouped_titles.apply(top, n=1)


[5]: 2021-12 2022-01 2022-02
Title Title Language
Jupyter Tutorial Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics Python Basics de 427.0 276.0 525.0

We have now seen the basic usage of apply. What happens inside the passed function is very versatile and up to you;
it only has to return a pandas object or a single value. In the following, we will therefore mainly show examples that
can give you ideas on how to solve various problems with groupby.
First, let’s look again at describe, called over the GroupBy object:

[6]: result = grouped_titles.describe()

result
[6]: 2021-12 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5
PyViz Tutorial 1.0 4873.0 NaN 4873.0 4873.00 4873.0
Python Basics 2.0 261.0 234.759451 95.0 178.00 261.0

2022-01 ... \
75% max count mean ... 75% max
Title ...
Jupyter Tutorial 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0
PyViz Tutorial 4873.00 4873.0 1.0 3930.0 ... 3930.00 3930.0
Python Basics 344.00 427.0 2.0 251.0 ... 263.50 276.0

2022-02 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0
PyViz Tutorial 1.0 2573.0 NaN 2573.0 2573.0 2573.0
Python Basics 2.0 341.0 260.215295 157.0 249.0 341.0

75% max
Title
Jupyter Tutorial 16375.0 19651.0
PyViz Tutorial 2573.0 2573.0
Python Basics 433.0 525.0

[3 rows x 24 columns]

When you call a method like describe within GroupBy, it is actually just an abbreviation for:

[7]: f = lambda x: x.describe()

     grouped_titles.apply(f)
[7]: 2021-12 2022-01 2022-02
Title
Jupyter Tutorial count 2.000000 2.000000 2.000000
mean 18103.500000 20505.500000 13099.000000
std 17013.696262 18087.084356 9265.927261
min 6073.000000 7716.000000 6547.000000
25% 12088.250000 14110.750000 9823.000000
50% 18103.500000 20505.500000 13099.000000
75% 24118.750000 26900.250000 16375.000000
max 30134.000000 33295.000000 19651.000000
PyViz Tutorial count 1.000000 1.000000 1.000000
mean 4873.000000 3930.000000 2573.000000
std NaN NaN NaN
min 4873.000000 3930.000000 2573.000000
25% 4873.000000 3930.000000 2573.000000
50% 4873.000000 3930.000000 2573.000000
75% 4873.000000 3930.000000 2573.000000
max 4873.000000 3930.000000 2573.000000
Python Basics count 2.000000 2.000000 2.000000
mean 261.000000 251.000000 341.000000
std 234.759451 35.355339 260.215295
min 95.000000 226.000000 157.000000
25% 178.000000 238.500000 249.000000
50% 261.000000 251.000000 341.000000
75% 344.000000 263.500000 433.000000
max 427.000000 276.000000 525.000000


Suppression of the group keys

In the previous examples, you saw that the resulting object has a hierarchical index formed by the group keys together
with the indices of the individual parts of the original object. You can disable this by passing group_keys=False to
groupby:

[8]: grouped_lang = df.groupby('Language', group_keys=False)

grouped_lang.apply(top)
[8]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
Jupyter Tutorial en 6073.0 7716.0 6547.0
Python Basics en 95.0 226.0 157.0
PyViz Tutorial en NaN NaN NaN

Quantile and bucket analysis

As described in discretisation and grouping, pandas has some tools, especially cut and qcut, to split data into buckets
with bins of your choice or by sample quantiles. Combine these functions with groupby and you can conveniently
perform bucket or quantile analysis on a dataset. Consider a simple random data set and a bucket categorisation of
equal length with cut:
[9]: rng = np.random.default_rng()
df2 = pd.DataFrame({'data1': rng.normal(size=1000),
'data2': rng.normal(size=1000)})

quartiles = pd.cut(df2.data1, 4)

quartiles[:10]
[9]: 0 (-1.469, 0.17]
1 (0.17, 1.809]
2 (-1.469, 0.17]
3 (0.17, 1.809]
4 (-3.115, -1.469]
5 (-1.469, 0.17]
6 (-3.115, -1.469]
7 (0.17, 1.809]
8 (-3.115, -1.469]
9 (-1.469, 0.17]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.115, -1.469] < (-1.469, 0.17] < (0.17, 1.
˓→809] < (1.809, 3.448]]

The category object returned by cut can be passed directly to groupby. So we could calculate a set of group statistics
for the quartiles as follows:

[10]: def stats(group):
          return pd.DataFrame(
              {'min': group.min(), 'max': group.max(),
               'count': group.count(), 'mean': group.mean()}
          )

      grouped_quart = df2.groupby(quartiles)

      grouped_quart.apply(stats)
[10]: min max count mean
data1
(-3.115, -1.469] data1 -3.108349 -1.471274 70 -1.896546
data2 -2.232356 2.035516 70 -0.021125
(-1.469, 0.17] data1 -1.467552 0.168617 500 -0.497780
data2 -2.631409 3.109543 500 0.012801
(0.17, 1.809] data1 0.171525 1.807236 400 0.767486
data2 -3.063169 2.523671 400 -0.084810
(1.809, 3.448] data1 1.814571 3.448008 30 2.270333
data2 -1.967468 2.997739 30 -0.363947

These were buckets of equal length; to calculate buckets of equal size based on sample quantiles, we can use qcut. I
pass labels=False to get only quantile numbers:
[11]: quartiles_samp = pd.qcut(df2.data1, 4, labels=False)

grouped_quart_samp = df2.groupby(quartiles_samp)

grouped_quart_samp.apply(stats)
[11]: min max count mean
data1
0 data1 -3.108349 -0.615744 250 -1.232959
data2 -2.631409 2.734051 250 0.000789
1 data1 -0.615207 0.028267 250 -0.320657
data2 -2.351885 3.109543 250 -0.026376
2 data1 0.034191 0.660205 250 0.309113
data2 -3.063169 2.523671 250 -0.090296
3 data1 0.661384 3.448008 250 1.218328
data2 -2.602676 2.997739 250 -0.043800

Populating data with group-specific values

When cleaning missing data, in some cases you will replace data observations with dropna, but in other cases you may
want to fill the null values (NA) with a fixed value or a value derived from the data. fillna is the right tool for this;
here, for example, I fill the null values with the mean:
[12]: s = pd.Series(rng.normal(size=8))
s[::3] = np.nan

s
[12]: 0 NaN
1 2.219988
2 -1.006390
3 NaN
4 0.244595
5 0.253043
6 NaN
7 -0.691286
dtype: float64

[13]: s.fillna(s.mean())
[13]: 0 0.203990
1 2.219988
2 -1.006390
3 0.203990
4 0.244595
5 0.253043
6 0.203990
7 -0.691286
dtype: float64

The sample data for my tutorials, divided into German and English editions, is already available in the df created above.
Suppose you want the fill value to vary by group. These values can be predefined, and since the groups have an internal
name attribute, you can use this with apply:

[14]: fill_values = {'de': 10632, 'en': 3469}

fill_func = lambda g: g.fillna(fill_values[g.name])

df.groupby('Language').apply(fill_func)
[14]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en 3469.0 3469.0 3469.0
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0

You can also group the data and use apply with a function that calls fillna for each data packet:

[15]: fill_mean = lambda g: g.fillna(g.mean())

df.groupby('Language').apply(fill_mean)
[15]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en 3084.0 3971.0 3352.0
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0


Group weighted average

Since operations between columns in a DataFrame or two Series are possible, we can calculate the group-weighted
average, for example:

[16]: df3 = pd.DataFrame({'category': ['de', 'de', 'de', 'de',
                                       'en', 'en', 'en', 'en'],
                          'data': np.random.randint(100000, size=8),
                          'weights': np.random.rand(8)})

df3
[16]: category data weights
0 de 43528 0.820894
1 de 55946 0.889425
2 de 75158 0.672583
3 de 22085 0.292674
4 en 79633 0.810052
5 en 44609 0.107397
6 en 45209 0.333182
7 en 19978 0.227543

The group average weighted by category would then be:

[17]: grouped_cat = df3.groupby('category')

      get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

grouped_cat.apply(get_wavg)
[17]: category
de 53261.553820
en 60146.114798
dtype: float64

Correlation

An interesting task could be to calculate a DataFrame consisting of the percentage changes.


For this purpose, we first create a function that calculates the pairwise correlation of the 2021-12 column with the
subsequent columns:

[18]: corr = lambda x: x.corrwith(x['2021-12'])

Next, we calculate the percentage change:

[19]: pcts = df.pct_change().dropna()

Finally, we group these percentage changes by language and apply the correlation function to each group:

[20]: grouped_lang = pcts.groupby('Language')

grouped_lang.apply(corr)


[20]: 2021-12 2022-01 2022-02


Language
de 1.0 1.000000 1.00000
en 1.0 0.699088 0.99781

[21]: grouped_lang.apply(lambda g: g['2021-12'].corr(g['2022-01']))


[21]: Language
de 1.000000
en 0.699088
dtype: float64

Performance problems with apply

Since the apply method typically acts on each individual value in a Series, the function is called once for each value.
If you have thousands of values, the function will be called thousands of times. This ignores the fast vectorisations of
pandas unless you are using NumPy functions and slow Python is used. For example, we previously grouped the data
by title and then called our top method with apply. Let’s measure the time for this:

[22]: %%timeit
grouped_titles.apply(top)
3.96 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can get the same result without applying by passing the DataFrame to our top method:

[23]: %%timeit
top(df)
212 µs ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This calculation is 18 times faster.

Optimising apply with Cython

It is not always easy to find an alternative for apply. However, numerical operations like our top method can be made
faster with Cython. To use Cython in Jupyter, we use the following IPython magic:

[24]: %load_ext Cython

Then we can define our top function with Cython:

[25]: %%cython
def top_cy(df, n=5, column='2021-12'):
return df.sort_values(by=column, ascending=False)[:n]

[26]: %%timeit
grouped_titles.apply(top_cy)
3.25 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We haven’t really gained much with this yet. Further optimisation possibilities would be to define the type in the Cython
code with cpdef. For this, however, we would have to modify our method, because then no DataFrame can be passed.
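
As a rough sketch of this idea (my own illustration, not part of the original example), the numerical core could be typed with cpdef so that it operates on a NumPy array instead of a DataFrame:

%%cython
cpdef double value_range(double[:] values):
    # assumes at least one value and no NaN in the array
    cdef Py_ssize_t i
    cdef double vmin = values[0]
    cdef double vmax = values[0]
    for i in range(1, values.shape[0]):
        if values[i] < vmin:
            vmin = values[i]
        if values[i] > vmax:
            vmax = values[i]
    return vmax - vmin

It could then be applied per group, for example with grouped_titles['2021-12'].agg(lambda s: value_range(s.dropna().to_numpy(dtype=float))).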


2.4.16 Pivot tables and crosstabs

A pivot table is a data summary tool often found in spreadsheet and other data analysis software. It summarises a table
of data by one or more keys and arranges the data in a rectangle, with some of the group keys along the rows and some
along the columns. Pivot tables in Python with pandas are made possible by the groupby function in combination with
reshaping operations using hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level
function pandas.pivot_table. pivot_table not only provides a convenient interface to groupby, but can also add
partial sums (margins).
Suppose we wanted to compute a table of group averages (the default aggregation type of pivot_table) ordered by
title and language in the rows:

[1]: import pandas as pd
     import numpy as np

[2]: df = pd.DataFrame({'Title': ['Jupyter Tutorial',
                                  'Jupyter Tutorial',
                                  'PyViz Tutorial',
                                  'PyViz Tutorial',
                                  'Python Basics',
                                  'Python Basics'],
                        'Language' : ['de', 'en', 'de', None, 'de', 'en'],
                        '2021-12' : [30134,6073,4873,None,427,95],
                        '2022-01' : [33295,7716,3930,None,276,226],
                        '2022-02' : [19651,6547,2573,None,525,157]})

df
[2]: Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 30134.0 33295.0 19651.0
1 Jupyter Tutorial en 6073.0 7716.0 6547.0
2 PyViz Tutorial de 4873.0 3930.0 2573.0
3 PyViz Tutorial None NaN NaN NaN
4 Python Basics de 427.0 276.0 525.0
5 Python Basics en 95.0 226.0 157.0

[3]: df.pivot_table(index=['Title', 'Language'])


[3]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0

This could also have been done directly with groupby.


Now let’s say we want to get the mean of hits of all languages per title for each individual month. For this I will enter
Title in the table columns and the months in the rows:

[4]: df.pivot_table(columns='Title')
[4]: Title Jupyter Tutorial PyViz Tutorial Python Basics
2021-12 18103.5 4873.0 261.0
2022-01 20505.5 3930.0 251.0
2022-02 13099.0 2573.0 341.0

Alternatively, we can keep the languages as columns and add the mean values by specifying margins=True:

[5]: df.pivot_table(columns=['Title', 'Language'], margins=True)


[5]: Title Jupyter Tutorial PyViz Tutorial \
Language de en All de All
2021-12 30134.0 6073.0 18103.5 4873.0 4873.0
2022-01 33295.0 7716.0 20505.5 3930.0 3930.0
2022-02 19651.0 6547.0 13099.0 2573.0 2573.0

Title Python Basics


Language de en All
2021-12 427.0 95.0 261.0
2022-01 276.0 226.0 251.0
2022-02 525.0 157.0 341.0

To use an aggregation function other than mean, pass it to the keyword argument aggfunc. With sum, for example,
you get the sum:

[6]: df.pivot_table(columns=['Title', 'Language'], aggfunc=sum, margins=True)


[6]: Title Jupyter Tutorial PyViz Tutorial \
Language de en All de All
2021-12 30134.0 6073.0 36207.0 4873.0 4873.0
2022-01 33295.0 7716.0 41011.0 3930.0 3930.0
2022-02 19651.0 6547.0 26198.0 2573.0 2573.0

Title Python Basics


Language de en All
2021-12 427.0 95.0 522.0
2022-01 276.0 226.0 502.0
2022-02 525.0 157.0 682.0

pivot_table options:

Function name    Description
values           column name(s) to aggregate; by default, all numeric columns are aggregated
index            column names or other group keys to be grouped in the rows of the resulting pivot table
columns          column names or other group keys to be grouped in the columns of the resulting pivot table
aggfunc          aggregation function or list of functions (by default mean); can be any function valid in a groupby context
fill_value       replaces missing values in the result table
dropna           if True, columns whose entries are all NA are ignored
margins          inserts row/column subtotals and grand totals (default: False)
margins_name     name used for row/column labels if margins=True is passed, default is All
observed         for categorical group keys, if True, only the observed category values are displayed in the keys and not all categories
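
For example, fill_value can be used to replace the missing combination of PyViz Tutorial and en with 0 (a minimal illustration with the df from above; the output is omitted here):

df.pivot_table(values='2021-12', index='Title', columns='Language',
               aggfunc='sum', fill_value=0, margins=True)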


Crosstabs

A crosstab is a special case of a pivot table that calculates the frequency of groups. For example, in the context of
an analysis o