Python for Data Science
Veit Schiele
Contents

1 Introduction
  1.1 Target groups
  1.2 Structure of the Python for Data Science tutorial
  1.3 Status
  1.4 Follow us
  1.5 Pull-Requests

2 Workspace
  2.1 IPython
  2.2 Jupyter
  2.3 NumPy
  2.4 pandas

6 Performance
  6.1 k-Means example
  6.2 Performance measurements
  6.3 Search for existing implementations
  6.4 Find anti-patterns
  6.5 Vectorisations with NumPy
  6.6 Special data structures
  6.7 Select compiler
  6.8 Task planner
  6.9 Multithreading, Multiprocessing and Async

7 Create a product
  7.1 Manage code with Git
  7.2 Manage data with DVC
  7.3 Reproduce environments
  7.4 Creating programme libraries and packages
  7.5 Document
  7.6 Licensing
  7.7 Citing
  7.8 Testing
  7.9 Logging
  7.10 Check and improve code quality and complexity
  7.11 Security

9 Index
This is a tutorial on Data Science with Python. This immediately raises the question: What is Data Science? The
term has become ubiquitous, but there is no single definition. Some even consider the term superfluous, because what
science does not have to do with data? Nevertheless, it seems to me that data science is more than just hype: scientific
data has become increasingly voluminous and often can no longer be adequately tapped with conventional mathematical
and statistical methods alone – additional hacking skills are needed. However, it is not a new field of knowledge that
you need to learn, but a set of skills that you can apply in your field. Whether you are analysing astronomical objects,
analysing machines, forecasting stock prices or working with data in other fields, the goal of this tutorial is to enable
you to solve tasks programmatically in your field.
This tutorial is not intended to be an introduction to Python or programming in general; for that there is the Python
basics tutorial. Instead, it is intended to show the Python data science stack – libraries such as IPython, NumPy, pandas,
Matplotlib and related tools – so that you can subsequently effectively analyse and visualise your data.
Chapter 1: Introduction
The target groups are diverse, from data scientists to data engineers and analysts to systems engineers. Their skills and
workflows are very different. However, one of the great strengths of Python for Data Science is that it allows these
different experts to work closely together in cross-functional teams.
Data scientists
explore data with different parameters and summarise the results.
Data engineers
check the quality of the code and make it more robust, efficient and scalable.
Data analysts
use the code provided by data engineers to systematically analyse the data.
System engineers
provide the research platform based on the JupyterHub on which the other roles can perform their work.
In this tutorial we address system engineers who want to build and run a platform based on Jupyter notebooks. We then
explain how this platform can be used effectively by data scientists, data engineers and analysts.
1.3 Status
1.4 Follow us
• GitHub
• Mastodon
1.5 Pull-Requests
If you have suggestions for improvements and additions, I recommend that you create a fork of my GitHub repository and make your changes there. You are also welcome to make a pull request. If the changes contained therein are small and atomic, I'll be happy to look at your suggestions.
Chapter 2: Workspace
Setting up the workspace includes installing and configuring IPython and Jupyter with nbextensions and ipywidgets,
and NumPy.
2.1 IPython
IPython, or Interactive Python, was initially an advanced Python interpreter that has now grown into an extensive project
designed to provide tools for the entire life cycle of research computing. Today, IPython is not only an interactive
interface to Python, but also offers a number of useful syntactic additions for the language. In addition, IPython is
closely related to the Jupyter project.
See also:
• Miki Tebeka - IPython: The Productivity Booster
You can use IPython directly in a terminal; after starting it, the interactive prompt awaits your input:

In [1]:
Alternatively, you can use IPython in a Jupyter notebook. To do this, start the notebook server first:
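A typical way to do this is the following command in a terminal (depending on your setup, possibly prefixed with pipenv run or executed inside a virtual environment):

$ jupyter notebook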
[I 17:35:02.428 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 17:35:02.497 NotebookApp]
[C 17:35:02.497 NotebookApp]
The standard browser should then be opened with the specified URL. Often this is http://localhost:8888.
Now you can start a Python process in the browser by creating a new notebook.
Check which versions of Python and pandas you are working with:

import sys
import pandas as pd

sys.version_info

[2]: pd.__version__
[2]: '1.3.4'

from importlib.metadata import version

print(version('pandas'))
1.3.4
Information about the host operating system and the versions of installed Python packages
[4]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-30-generic
Version : #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8
pandas : 1.3.4
numpy : 1.21.3
pytz : 2021.1
Shell commands
[6]: !python3 -V
Python 3.8.12
Tab completion
Note:
You may have noticed with surprise that the __version__ attribute used above is not offered in the completion list. IPython initially hides these private methods and attributes that begin with underscores. However, they can also be completed with the tab key if you first type an underscore. Alternatively, you can change this setting in the IPython configuration.
. . . for almost everything:
With a question mark (?) you can display information about an object, for example the np.mean function with the following docstring:
[8]: import numpy as np
[9]: np.mean?
Signature:
np.mean(
a,
axis=None,
dtype=None,
out=None,
keepdims=<no value>,
*,
where=<no value>,
)
Docstring:
Compute the arithmetic mean along the specified axis.
Returns the average of the array elements. The average is taken over
the flattened array by default, otherwise over the specified axis.
`float64` intermediate and return values are used for integer inputs.
Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
axis : None or int or tuple of ints, optional
Axis or axes along which the means are computed. The default is to
compute the mean of the flattened array.
.. versionadded:: 1.7.0
.. versionadded:: 1.20.0
Returns
-------
m : ndarray, see dtype parameter above
If `out=None`, returns a new array containing the mean values,
otherwise a reference to the output array is returned.
See Also
--------
average : Weighted average
std, var, nanmean, nanstd, nanvar
Notes
-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.
Note that for floating-point input, the mean is computed using the
same precision the input has. Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below). Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.
Examples
--------
>>> a = np.array([[1, 2], [3, 4]])
>>> np.mean(a)
2.5
>>> np.mean(a, axis=0)
array([2., 3.])
>>> np.mean(a, axis=1)
array([1.5, 3.5])
Type: function
By using ?? the source code of the function is also displayed, if this is possible:
[10]: np.mean??
Signature:
np.mean(
a,
axis=None,
dtype=None,
out=None,
keepdims=<no value>,
*,
where=<no value>,
)
Source:
@array_function_dispatch(_mean_dispatcher)
def mean(a, axis=None, dtype=None, out=None, keepdims=np._NoValue, *,
where=np._NoValue):
"""
Compute the arithmetic mean along the specified axis.
Returns the average of the array elements. The average is taken over
the flattened array by default, otherwise over the specified axis.
`float64` intermediate and return values are used for integer inputs.
Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
axis : None or int or tuple of ints, optional
Axis or axes along which the means are computed. The default is to
compute the mean of the flattened array.
.. versionadded:: 1.20.0
Returns
-------
m : ndarray, see dtype parameter above
If `out=None`, returns a new array containing the mean values,
otherwise a reference to the output array is returned.
See Also
--------
average : Weighted average
std, var, nanmean, nanstd, nanvar
Notes
-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.
Note that for floating-point input, the mean is computed using the
same precision the input has. Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below). Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.
Examples
--------
>>> a = np.array([[1, 2], [3, 4]])
>>> np.mean(a)
2.5
>>> np.mean(a, axis=0)
array([2., 3.])
>>> np.mean(a, axis=1)
array([1.5, 3.5])
"""
kwargs = {}
if keepdims is not np._NoValue:
kwargs['keepdims'] = keepdims
if where is not np._NoValue:
kwargs['where'] = where
if type(a) is not mu.ndarray:
try:
mean = a.mean
except AttributeError:
pass
else:
return mean(axis=axis, dtype=dtype, out=out, **kwargs)
Type: function
? can also be used to search in the IPython namespace. In doing so, a series of characters can be represented with the
wildcard (*). For example, to get a list of all functions in the top-level NumPy namespace that contain mean:
[11]: np.*mean*?
np.mean
np.nanmean
IPython not only enables Python to be used interactively, but also extends the Python syntax with so-called magic
commands, which are provided with the prefix %. They are designed to quickly and easily solve common data analysis
problems. A distinction is made between two different types of magic commands:
• line magics, denoted by a single % prefix, that run on a single input line
• cell magics which are preceded by a double symbol %% and which are executed within a notebook cell.
If you start developing larger code, you will likely be working in both IPython for interactive exploration and a text
editor to save code that you want to reuse. With the %run magic you can execute this code directly in your IPython
session.
Imagine you created a myscript.py file with the following content:
def square(x):
return x ** 2
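The script can then be executed in the IPython session, for example with:

%run myscript.py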
Note that after running this script, all of the functions defined in it will be available for use in your IPython session:
[2]: square(4)
[2]: 16
There are several ways you can improve the way your code runs. As usual, you can display the documentation in
IPython with %run?.
Another example of a Magic function is %timeit, which automatically determines the execution time of the following
one-line Python statement. So we can e.g. output the performance of a list comprehension with:
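For example (the exact statement measured here is an assumption, chosen to match the for loop shown below):

%timeit L = [n ** 2 for n in range(1000)]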
The advantage of %timeit is that short commands automatically run multiple runs to get more robust results. For multi-
line instructions, adding a second % character creates cell magic that can process multiple input lines. For example,
here is the equivalent construction using a for loop:
[4]: %%timeit
L = []
for n in range(1000):
L.append(n ** 2)
3.31 ms ± 855 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can immediately see that the list comprehension is about 10% faster than its equivalent with a for loop. We then
describe performance measurements and optimisations in more detail in Profiling.
IPython has a %%script script magic with which you can execute a cell in a subprocess of an interpreter on your
system, e.g. bash, ruby, perl, zsh, R etc. This can also be its own script that expects input in stdin. To do this,
simply pass a path or a shell command to the program that is specified in the %%script line. The rest of the cell is
executed by this script, capturing stdout or err from the subprocess and displaying it.
[3]: %%ruby
puts "Ruby #{RUBY_VERSION}"
Ruby 2.3.7
[4]: %%bash
echo "$BASH"
/bin/bash
You can capture stdout and err from these sub-processes in Python variables:
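The cell producing the output below could, for example, use the --out and --err options of the script magic:

%%bash --out output --err error
echo "stdout"
echo "stderr" >&2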
[6]: print(error)
print(output)
stderr
stdout
The list of aliases for the script magic is configurable. By default, some common interpreters can be used if necessary.
However, you can also specify your own interpreter in ipython_config.py:
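A sketch of such a configuration entry (the interpreter name and path here are only examples):

c.ScriptMagics.script_magics = ['R']
c.ScriptMagics.script_paths = {'R': '/usr/local/bin/R'}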
Like normal Python functions, the IPython magic functions have docstrings that can be easily accessed. E.g. to read
the documentation of the %timeit magic, just type:
[5]: %timeit?
Documentation for other functions can be accessed in a similar manner. To access a general description of the %magic
functions available, including some examples, you can type:
[6]: %magic
[7]: %lsmagic
[7]: Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd
%clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode
%ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll
%load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic
%lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin
%pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun
%psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx
%reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc
%set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls
%whos  %xdel  %xmode
You can also simply define your own magic functions. For more information, see Defining custom magics.
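As a small illustration, a line magic could be registered with IPython's magic API like this (the %hello magic is only an example):

from IPython.core.magic import register_line_magic

@register_line_magic
def hello(line):
    """Return a greeting for the given name."""
    # must be run inside an active IPython session
    return f"Hello {line or 'world'}!"

# afterwards %hello Data Science can be used in the session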
The IPython Notebook allows simple UNIX/Linux commands to be executed in a single input cell. In contrast to a regular UNIX/Linux shell, each shell command must be prefixed with a !, for example !ls for the command ls (see below for further explanations about the command). Furthermore, each shell command is executed in its own subshell, so the results of previous shell commands are not available to you.
To begin with, the command ls lists the files in the current working directory. The output is shown below the input
cell, and lists the single file shell.ipynb :
[1]: !ls
shell.ipynb
[2]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/basics/ipython
The command !echo outputs text given as parameter to the echo command. The example below demonstrates how to
print Hello world:
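For example:

!echo "Hello world"
Hello world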
There is a clever way through which you can access the output of a UNIX/Linux command as a variable in Python.
Assign the output of a UNIX/Linux command to a variable as follows:
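For example:

contents = !ls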
Here the Python variable contents has been assigned the output of the command ls. As a result, contents is a list,
where each list element corresponds to a line in the output. With the print command you output the list contents:
[5]: print(contents)
['create-delete.ipynb', 'file-system.ipynb', 'grep-find.ipynb', 'index.rst', 'pipes-
˓→filters.ipynb', 'regex.ipynb', 'shell-variables.ipynb']
You will see the same result below when executing the pwd command. The current directory is stored in the variable
directory:
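For example:

directory = !pwd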
[7]: print(directory)
['/Users/veit/cusy/trn/Python4DataScience/docs/basics/ipython']
Any command on the command line will also work in Jupyter Notebooks if prefixed with !. The results can then
interact with the Jupyter namespace, see Passing values to and from the shell.
First let us find out where we are by running a command called pwd:
[1]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython/unix-shell
Here, the response is the unix-shell directory of the IPython chapter of this tutorial below my home directory /Users/veit.
On Windows the home directory will look like C:\Documents and Settings\veit or C:\Users\veit and on
Linux like /home/veit.
To see the contents of our directory, we can use ls:
[2]: !ls
create-delete.ipynb index.rst shell-variables.ipynb
file-system.ipynb pipes-filters.ipynb
grep-find.ipynb regex.ipynb
[4]: !ls -s
total 70384
24 create-delete.ipynb 16 pipes-filters.ipynb
70288 file-system.ipynb 8 regex.ipynb
24 grep-find.ipynb 16 shell-variables.ipynb
8 index.rst
[5]: !ls -S
file-system.ipynb pipes-filters.ipynb index.rst
grep-find.ipynb shell-variables.ipynb
create-delete.ipynb regex.ipynb
ls comes with a lot of other useful options. Using man you can print out the built-in manual page for the desired
UNIX/Linux-command:
[6]: !man ls
NAME
ls -- list directory contents
SYNOPSIS
ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1%] [file ...]
DESCRIPTION
For each operand that names a file of a type other than directory, ls
displays its name as well as any requested, associated information. For
each operand that names a file of type directory, ls displays the names
of files contained within that directory, as well as any requested,
associated information.

If no operands are given, the contents of the current directory are
displayed. If more than one operand is given, non-directory operands are
displayed first; directory and non-directory operands are sorted
separately and in lexicographical order.
...
BSD May 19, 2002 BSD
Illegal options
If you try to use an option that isn’t supported, the commands will usually print an error message, e.g. for:
[7]: !ls -z
ls: illegal option -- z
usage: ls [-@ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1%] [file ...]
Hidden Files
[8]: !ls -a
. file-system.ipynb regex.ipynb
.. grep-find.ipynb shell-variables.ipynb
.ipynb_checkpoints index.rst
create-delete.ipynb pipes-filters.ipynb
In addition to the hidden directories .. and ., you may also see a directory called .ipynb_checkpoints. This directory usually contains snapshots of the Jupyter notebooks.
Show directory tree

The command tree lists the contents of directories in a tree-like format.
[9]: !tree
.
├── create-delete.ipynb
├── file-system.ipynb
├── grep-find.ipynb
├── index.rst
├── pipes-filters.ipynb
├── regex.ipynb
└── shell-variables.ipynb

0 directories, 7 files
Change directory
At first it may seem irritating to some that they cannot use !cd to change to another directory.
[10]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython/unix-shell
[11]: !cd ..
[12]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython/unix-shell
The reason for this is that Jupyter uses a temporary subshell. If you want to change to another directory permanently,
you have to use the magic command %cd.
[13]: %cd ..
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython
[14]: !pwd
/Users/veit/cusy/trn/Python4DataScience/docs/workspace/ipython
With the %automagic function, these can also be used without the preceding % character:
[17]: %automagic
[18]: cd ..
/Users/veit/cusy/trn/Python4DataScience/docs/workspace
[19]: cd .
/Users/veit/cusy/trn/Python4DataScience/docs/workspace
[20]: cd ../..
/Users/veit/cusy/trn/Python4DataScience
[21]: cd ..
/Users/veit/cusy/trn
[22]: cd /
/
[23]: cd
/Users/veit
[24]: cd ~
/Users/veit
[25]: cd /Users/veit
/Users/veit
Create a new directory tests and then check it with ls:
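For example (the directory name tests matches the listing below):

!mkdir tests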
[2]: !ls
create-delete.ipynb importing.ipynb shell-variables.ipynb
debugging.ipynb index.rst shell.ipynb
display.ipynb length.txt sorted-length.txt
dvc.list magics.ipynb start.rst
extensions.rst mypackage tests
file-system.ipynb pipes-filters.ipynb
grep-find.ipynb regex.ipynb
Removing a directory together with all the files it contains is also possible with the -r option:
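For example:

!rm -r tests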
[12]: !ls
create-delete.ipynb grep-find.ipynb pipes-filters.ipynb
debugging.ipynb importing.ipynb regex.ipynb
display.ipynb index.rst shell-variables.ipynb
dvc.list length.txt shell.ipynb
extensions.rst magics.ipynb sorted-length.txt
file-system.ipynb mypackage start.rst
Transferring files
wget
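With wget you can download files directly from a notebook cell, for example (the URL here is only a placeholder):

!wget https://example.com/data.csv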
cURL
Alternatively, you can use cURL, which supports a much larger range of protocols.
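For example (again with a placeholder URL; -o sets the name of the output file and -L follows redirects):

!curl -L -o data.csv https://example.com/data.csv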
[1]: !ls
create-delete.ipynb index.rst shell-variables.ipynb
file-system.ipynb pipes-filters.ipynb
grep-find.ipynb regex.ipynb
With *.rst we restrict the results to all files with the suffix .rst:
We can also output only the number of lines, words and characters in these documents:
Now we write the number of characters in the file length.txt and then output the text with cat:
If we only want to know the total number of characters, i.e. only output the last line, we can do this with tail:
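A sketch of these steps could look like this (the exact wc options are assumptions; -m counts characters):

!ls *.rst
!wc *.rst
!wc -m *.rst > length.txt
!cat length.txt
!tail -n 1 length.txt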
Pipe |
You can connect commands with a pipe (|). In the following one-liner, we want to display the number of characters
for the shortest file:
If we want to display the first lines of the main text (without the first three lines for the title):
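For example (the file names and line counts are assumptions):

!sort -n length.txt | head -n 1
!tail -n +4 index.rst | head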
grep
grep finds and prints lines in files that match a regular expression. In the following example, we search for the string
Python:
The option -w limits the matches to the word boundaries so that IPython is ignored:
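For example:

!grep Python *.rst
!grep -w Python *.rst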
grep has lots of other options. To find out what they are, you can type:
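For example:

!grep --help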
In the following example we use the -E option and put the pattern in quotes to prevent the shell from trying to interpret
it. The ^ in the pattern anchors the match to the start of the line and the . matches a single character.
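A sketch of such a search (the exact pattern is an assumption):

!grep -E "^.Python" *.rst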
find
find . searches in this directory whereby the search is restricted to directories with -type d.
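For example:

!find . -type d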
With -mtime the search is limited to the last X days, in our example to the last day:
Now we count the characters in the files with the suffix .rst:
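A sketch of these two steps (the exact options are assumptions):

!find . -mtime -1
!wc -m $(find . -name "*.rst")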
Shell variables
[1]: !set
BASH=/usr/local/bin/bash
BASHOPTS=checkwinsize:cmdhist:complete_fullquote:extquote:force_fignore:globasciiranges:
˓→hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
BASH_ALIASES=()
BASH_ARGC=()
BASH_ARGV=()
BASH_CMDS=()
BASH_EXECUTION_STRING=set
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="5" [1]="0" [2]="16" [3]="1" [4]="release" [5]="x86_64-apple-darwin19.
˓→3.0")
BASH_VERSION='5.0.16(1)-release'
CLICOLOR=1
CPPFLAGS=-I/usr/local/opt/openblas/include
DIRSTACK=()
DISPLAY=/private/tmp/com.apple.launchd.WfrReZHn7e/org.macosforge.xquartz:0
...
A particularly useful environment variable is PATH. It defines the shell's search path, i.e. the list of directories that the shell looks in for runnable programs, for example:

…:/Library/TeX/texbin:/usr/local/MacGPG2/bin:/opt/X11/bin:/Library/Apple/usr/bin
IPython can display objects such as HTML, JSON, PNG, JPEG, SVG and LaTeX.
Images
To display images (JPEG, PNG) in IPython and notebooks, you can use the Image class:
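For example (the file name is only a placeholder):

from IPython.display import Image

Image('python-logo.png')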
Non-embedded images

Image('img_url')
Image(url='img_url')
HTML
[4]: %%html
<ul>
<li>foo</li>
<li>bar</li>
</ul>
<IPython.core.display.HTML object>
Javascript
With notebooks, objects can also declare a JavaScript representation. This enables e.g. data visualisations with
Javascript libraries like d3.js.
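The welcome object used below could be created roughly like this (the JavaScript snippet is only an example):

from IPython.display import Javascript

welcome = Javascript('console.log("Welcome to the IPython display system!");')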
display(welcome)
<IPython.core.display.Javascript object>
For more extensive Javascript you can also use the %%javascript syntax.
LaTeX
IPython.display also has built-in support for displaying mathematical expressions set in LaTeX and rendered in the
browser with MathJax:
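For example (the formula is only an illustration):

from IPython.display import Math

Math(r'F(k) = \int_{-\infty}^{\infty} f(x) e^{2\pi i k} \mathrm{d}x')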
With the Latex class you have to specify the limits yourself. In this way, however, you can also use other LaTeX modes,
such as eqnarray:
[7]: Latex(r"""\begin{eqnarray}
     \nabla \times \vec{\mathbf{B}} &=& \frac{1}{c}\, \frac{\partial\vec{\mathbf{E}}}{\partial t} - \frac{4\pi}{c}\, \vec{\mathbf{j}}
     \end{eqnarray}""")

[7]: \nabla \times \vec{\mathbf{B}} = \frac{1}{c}\, \frac{\partial\vec{\mathbf{E}}}{\partial t} - \frac{4\pi}{c}\, \vec{\mathbf{j}} \qquad (2.1)
Audio
IPython also enables interactive work with sounds. With the display.Audio class you can create an audio control
that is embedded in the notebook. The interface is analogous to that of the Image class. All audio formats supported
by the browser can be used.
In the following we will output the sine function of a NumPy array as an audio signal. The Audio class normalises and
codes the data and embeds the resulting audio in the notebook.
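The signal used below could be created roughly like this (sampling rate, duration and frequency are arbitrary example values):

import numpy as np
from IPython.display import Audio

rate = 16000                                  # samples per second
duration = 0.25                               # seconds
t = np.linspace(0, duration, int(rate * duration))
signal = np.sin(2 * np.pi * 440 * t)          # 440 Hz sine tone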
Audio(data=signal, rate=rate)
[9]: <IPython.lib.display.Audio object>
IPython has built-in classes for generating links to local files. To do this, create a link to a single file with the FileLink
object:
Alternatively, you can generate a list with links to all files in a directory, for example:
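For example:

from IPython.display import FileLink, FileLinks

FileLink('index.rst')
FileLinks('.')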
Display notebooks
import os

from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

formatter = HtmlFormatter()
lexer = PythonLexer()
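A minimal version of the show_notebook helper used below, based on the formatter and lexer defined above, could look like this (it assumes nbformat and Pygments are installed):

import nbformat
from pygments import highlight
from IPython.display import HTML, display

def show_notebook(fname):
    """Render the cells of another notebook as syntax-highlighted HTML."""
    nb = nbformat.read(fname, as_version=4)
    html = [f"<style>{formatter.get_style_defs()}</style>"]
    for cell in nb.cells:
        if cell.cell_type == "code":
            html.append(highlight(cell.source, lexer, formatter))
        else:
            html.append(f"<pre>{cell.source}</pre>")
    display(HTML("".join(html)))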
show_notebook(os.path.join("mypackage/foo.ipynb"))
<IPython.core.display.HTML object>
2.1.7 foo.ipynb
To be able to develop more modularly, the import of notebooks is necessary. However, since notebooks are not Python
files, they are not easy to import. Fortunately, Python provides some hooks for the import so that IPython notebooks
can eventually be imported.
Notebook Loader
    # extra work to ensure that magics that would affect the user_ns
    # actually affect the notebook module's ns
    save_user_ns = self.shell.user_ns
    self.shell.user_ns = mod.__dict__

    try:
        for cell in nb.cells:
            if cell.cell_type == 'code':
                # transform the input to executable Python
                code = self.shell.input_transformer_manager.transform_cell(cell.source)
                # run the code in the module
                exec(code, mod.__dict__)
    finally:
        self.shell.user_ns = save_user_ns
    return mod
Notebook Finder
The Finder is a simple object that indicates whether a notebook can be imported based on its file name and that returns
the appropriate loader.
key = path
if path:
    # lists aren't hashable
    key = os.path.sep.join(path)
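These lines belong to the finder's find_module method; a fuller sketch of the class, modelled on the notebook-import example in the IPython documentation (find_notebook is a helper from that example that maps a module name to a notebook path), could look like this:

class NotebookFinder(object):
    """Module finder that locates Jupyter notebooks."""

    def __init__(self):
        self.loaders = {}

    def find_module(self, fullname, path=None):
        nb_path = find_notebook(fullname, path)
        if not nb_path:
            return

        key = path
        if path:
            # lists aren't hashable
            key = os.path.sep.join(path)

        if key not in self.loaders:
            self.loaders[key] = NotebookLoader(path)
        return self.loaders[key]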
Register hook
[6]: sys.meta_path.append(NotebookFinder())
Check
[8]: foo.bar()
[8]: 'bar'
[9]: foo.has_ip_syntax()
[9]: ['debugging.ipynb',
'display.ipynb',
'importing.ipynb',
'index.rst',
'magics.ipynb',
'mprun_demo.py',
'mypackage',
'myscript.py',
'profiling.ipynb',
'shell.ipynb',
'show.ipynb',
'start.rst']
The import hook can also easily be executed in other notebooks.
IPython extensions are Python modules that change the behavior of the shell. They are identified by an importable
module name and are usually located in .ipython/extensions/.
Some important extensions are already included in IPython: autoreload and storemagic. You can find other ex-
tensions in the Extensions Index or on PyPI with the IPython tag.
See also:
• IPython extensions docs
Use extensions
The %load_ext magic can be used to load extensions while IPython is running.
%load_ext myextension
Alternatively, an extension can also be loaded each time IPython is started by listing it in the IPython configuration file:
c.InteractiveShellApp.extensions = [
'myextension'
]
If you haven’t created an IPython configuration file yet, you can do this with:
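For example:

$ ipython profile create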
If no profile name is given, default is used. The file is usually created in ~/.ipython/profile_default/ and named depending on the purpose: ipython_config.py is used for all IPython commands, while ipython_notebook_config.py is only used for commands in IPython notebooks.
An IPython extension is an importable Python module that has special functions for loading and unloading:
def load_ipython_extension(ipython):
    # The `ipython` argument is the currently active `InteractiveShell`
    # instance, which can be used in any way. This allows you to register
    # new magics or aliases, for example.
    ...

def unload_ipython_extension(ipython):
    # If you want your extension to be unloadable, put that logic here.
    ...
See also:
• Defining custom magics
2.1.10 Debugging
IPython contains various tools to analyse faulty code, essentially the exception reporting and the debugger.
If the execution of a Python script fails, an exception is usually thrown and relevant information about the cause of
the error is written to a traceback. With the %xmode magic function you can control the amount of information that is
displayed in IPython. Let’s look at the following code for this:
def func1(a, b):
    return a / b

def func2(x):
    a = x
    b = x - 1
    return func1(a, b)
[2]: func2(1)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-2-7cb498ea7ed1> in <module>
----> 1 func2(1)
<ipython-input-1-586ccabd0db3> in func2(x)
5 a = x
6 b = x - 1
----> 7 return func1(a, b)
<ipython-input-1-586ccabd0db3> in func1(a, b)
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x
Calling func2 leads to an error and the traceback shows exactly what happened: each line shows the context of each
step that ultimately led to the error. With the %xmode magic function (short for exception mode) we can control which
information should be displayed to us.
%xmode takes a single argument, the mode, and there are three options: Plain, Context and Verbose.
The default setting is Context and outputs something like the one above. Plain is more compact and provides less
information:
The Verbose mode shows some additional information, including the arguments for any functions being called:
<ipython-input-1-586ccabd0db3> in func2(x=1)
5 a = x
6 b = x - 1
global func1 = <function func1 at 0x1063129d8>
a = 1
----> 7 return func1(a, b)
b = 0
This additional information can help narrow down the reason for the exception. Conversely, however, the Verbose
mode can lead to extremely long tracebacks in the case of complex code, in which the essential points can hardly be
recognized.
Debugging can help if an error cannot be found by reading a traceback. The Python standard for interactive debugging
is the Python debugger pdb. You can use it to navigate your way through the code line by line to see what is possibly
causing an error. The extended version for IPython is ipdb.
In IPython, the %debug magic command is perhaps the most convenient way to debug. If you call it after an exception
has been thrown, an interactive debug prompt will automatically open during the exception. Using the ipdb prompt,
you can examine the current status of the stack, examine the available variables and even run Python commands.
Let’s look at the last exception, then do some basic tasks:
[5]: %debug
> <ipython-input-1-586ccabd0db3>(2)func1()
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x
ipdb> print(a)
1
ipdb> print(b)
0
ipdb> quit
However, the interactive debugger does a lot more – we can also go up and down the stack and examine the values of
variables:
[6]: %debug
> <ipython-input-1-586ccabd0db3>(2)func1()
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x
ipdb> up
> <ipython-input-1-586ccabd0db3>(7)func2()
3
4 def func2(x):
5 a = x
6 b = x - 1
----> 7 return func1(a, b)
ipdb> print(x)
1
ipdb> up
> <ipython-input-4-180acea4108b>(2)<module>()
1 get_ipython().run_line_magic('xmode', 'Verbose')
----> 2 func2(1)
ipdb> down
> <ipython-input-1-586ccabd0db3>(7)func2()
ipdb> quit
This greatly simplifies the search for the function calls that led to the error.
If you want the debugger to start automatically when an exception is thrown, you can use the %pdb magic function to enable this behavior:
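For example, after switching %pdb on, calling func2(1) again drops straight into the debugger (a sketch of the two steps):

%pdb on
func2(1)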
> <ipython-input-1-586ccabd0db3>(2)func1()
1 def func1(a, b):
----> 2 return a / b
3
4 def func2(x):
5 a = x
ipdb> print(b)
0
ipdb> quit
If you have a script that you want to run in interactive mode from the start, you can do so with the command %run -d.
Command Description
list Show the current location in the file
h(elp) Display a list of commands or find help on a specific command
q(uit) Terminates the debugger and the program
c(ontinue) Exit the debugger, continue in the program
n(ext) Go to the next step in the program
<enter> Repeat the previous command
p(rint) Print variables
s(tep) Step into a subroutine
r(eturn) Return from a subroutine
2.2 Jupyter
We have moved the Jupyter chapter to its own tutorial: Jupyter Tutorial.
2.3 NumPy
NumPy is short for Numerical Python. Many Python packages that provide scientific functions use NumPy's
array objects as one of the standard interfaces for data exchange. In the following, I will give a brief overview of the
main functionality of NumPy:
• ndarray, an efficient multidimensional array that provides fast array-based operations, such as shuffling and
cleaning data, subgrouping and filtering, transformation and all other kinds of computations. There are also
flexible functions for broadcasting, i.e. evaluations of arrays of different sizes.
• Mathematical functions for fast operations on whole arrays of data, such as sorting, uniqueness and set operations. Conditional logic can be expressed as array expressions instead of loops with if-elif-else branches.
• Tools for reading and writing array data to disk and working with memory mapped files.
• Functions for linear algebra, random number generation and Fourier transform.
• A C API for connecting NumPy to libraries written in C, C++ or FORTRAN.
Note: This section introduces you to the basics of using NumPy arrays and should be sufficient to follow the rest of the
tutorial. For many data analytic applications, it is not necessary to have a deep understanding of NumPy, but mastering
array-oriented programming and thinking is an important step on the way to becoming a data scientist.
See also:
• Home
• Docs
• GitHub
• Tutorials
NumPy operations perform complex calculations on entire arrays without the need for Python for loops, which can be
slow for large sequences. NumPy’s speed is explained by its C-based algorithms, which avoid the overhead of Python
code. To give you an idea of the performance difference, we measure the difference between a NumPy array and a
Python list with a hundred thousand integers:
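The measurement itself could look something like this, using %timeit (the exact statements are an assumption):

import numpy as np

my_arr = np.arange(100_000)
my_list = list(range(100_000))

%timeit my_arr2 = my_arr * 2
%timeit my_list2 = [x * 2 for x in my_list]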
ndarray allows mathematical operations on whole blocks of data, using a syntax similar to the equivalent operations between scalar elements. In NumPy, there are many different types for describing scalars, mostly based on types from the C
language and those compatible with Python.
See also:
• Array Scalars
Note:
Whenever this tutorial talks about array or ndarray, in most cases it refers to the ndarray object.
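Based on the output below, array_1d was presumably created like this:

import numpy as np

array_1d = np.array([2020, 2021, 20222])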
[3]: array_1d
[3]: array([ 2020, 2021, 20222])
Nested sequences, such as a list of lists of equal length, can be converted into a multidimensional array:
[4]: list_of_lists = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
array_2d = np.array(list_of_lists)
[5]: array_2d
[5]: array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
Since list_of_lists was a list with three lists, the NumPy array array_2d has two dimensions whose shape is
derived from the data. With the attributes ndim and shape we can output the number of dimensions and the outline of
array_2d:
[6]: array_2d.ndim
[6]: 2
[7]: array_2d.shape
[7]: (3, 4)
To give you an idea of the syntax, I first create an array of random numbers with five columns and seven slices:
ndarray is a generic multidimensional container. Each array has a shape, a tuple, which indicates the size of the
individual dimensions. With shape, I can output the number of rows and columns in an array:
In addition to np.array, there are a number of other functions for creating new arrays. zeros and ones, for example,
create arrays of zeros and ones, respectively, with a specific length or shape. empty creates an array without initialising
its values to a specific value. To create a higher dimensional array using these methods, pass a tuple for the shape:
[9]: np.zeros(4)
[9]: array([0., 0., 0., 0.])
[10]: np.ones((3,4))
[10]: array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
[11]: np.empty((2,3,4))
[11]: array([[[1.02724706e-316, 0.00000000e+000, 6.90657088e-310,
6.90657147e-310],
[6.90657195e-310, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000],
[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000]],
Note:
You cannot safely assume that the np.empty function returns an array of zeros, as it returns uninitialised memory and may therefore contain garbage values.
arange is an array-valued version of the Built-in Python range function:
[12]: np.arange(4)
[12]: array([0, 1, 2, 3])
Function Description

array  converts input data (list, tuple, array or other sequence types) into an ndarray by either deriving a dtype or explicitly specifying a dtype; by default, copies the input data into the array
asarray  converts the input to an ndarray, but does not copy if the input is already an ndarray
arange  like Python built-in range, but returns an ndarray instead of a list
ones, ones_like  ones creates an array of 1s in the given form and dtype; ones_like takes another array and creates an ones array in the same form and dtype
zeros, zeros_like  like ones and ones_like, but creates arrays with 0s instead
empty, empty_like  creates new arrays by allocating new memory, but does not fill them with values like ones and zeros
full, full_like  creates an array of the given shape and dtype, setting all values to the given fill value; full_like takes another array and creates a filled array with the same shape and dtype
eye, identity  creates a square N × N identity matrix (1s on the diagonal and 0s elsewhere)
2.3.3 dtype
ndarray is a container for homogeneous data, i.e. all elements must be of the same type. Each array has a dtype, an
object that describes the data type of the array:
[8]: dt = data.dtype
dt
[8]: dtype('float64')
[9]: dt.itemsize
[9]: 8
[10]: dt.name
[10]: 'float64'
2.3.4 Arithmetic
Arrays allow you to perform batch operations on data without having to write for loops. This is called vectorisation in NumPy. For all arithmetic operations between arrays of the same size, the operation is applied element by element:
data = np.random.randn(7, 3)
data
[1]: array([[-0.53395033, -0.90655262, -1.47419874],
[-0.60691335, 0.12786859, 0.83478769],
[-0.24791073, 0.61712855, -1.36487227],
[-0.04708711, -1.45600876, -0.88333238],
[-0.60393798, -0.12695018, 1.17832759],
[-0.83320447, -0.30302264, 0.53249707],
[ 0.85006387, 0.83626062, -1.03637456]])
[2]: 1 / data
[2]: array([[ -1.87283338, -1.10307993, -0.67833459],
[ -1.64768166, 7.82052869, 1.19790937],
[ -4.03370998, 1.62040794, -0.73266929],
[-21.23723314, -0.68680906, -1.13207669],
[ -1.65579916, -7.87710608, 0.84866043],
[ -1.2001856 , -3.30008343, 1.87794462],
[ 1.17638219, 1.19579946, -0.96490211]])
[3]: data ** 2
[3]: array([[0.28510296, 0.82183765, 2.17326192],
[0.36834382, 0.01635038, 0.69687049],
[0.06145973, 0.38084765, 1.8628763 ],
[0.0022172 , 2.1199615 , 0.7802761 ],
[0.36474108, 0.01611635, 1.3884559 ],
[0.69422968, 0.09182272, 0.28355312],
[0.72260859, 0.69933183, 1.07407224]])
Indexing is the selection of a subset of your data or individual elements. This is very easy in one-dimensional arrays;
they behave similarly to Python lists:
[3]: data[4]
[3]: array([0.55981128, 0.10765489, 2.09209898])
[4]: data[2:4]
[4]: array([[ 0.29963999, 0.92431092, 0.54994156],
[-0.13884297, -0.32815509, 0.53517827]])
[6]: data
[6]: array([[ 0.50645754, -1.55827537, -0.55228066],
[-0.8255559 , 0.7174421 , -0.01235908],
[ 1.69557114, 0.54106355, 1.73920426],
[ 0.7296357 , 2.13535052, 0.02169761],
[ 0.55981128, 0.10765489, 2.09209898],
[-0.7775902 , -1.83407447, 0.97028733],
[ 1.5949525 , 0.03300061, 0.69442966]])
Note:
Array slices differ from Python lists in that they are views of the original array. This means that the data is not copied
and that any changes to the view are reflected in the original array.
If you want to make a copy of a part of an ndarray, you can copy the array explicitly – for example with data[2:5].copy().
Slicing in this way always results in array views with the same number of dimensions. However, if you mix integer
indices and slices, you get slices with lower dimensions. For example, we can select the second row but only the first
two columns as follows:
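For example, with the data array shown above:

data[1, :2]
# -> array([-0.8255559 ,  0.7174421 ])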
A colon means that the whole axis is taken, so you can also select higher dimensional axes:
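For example:

data[:2, :]    # the first two rows, all columns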
Boolean indexing
Let’s consider an example where we have some data in an array and an array of names with duplicates. I will use the
normal function in numpy.random.default_rng here to generate some random normally distributed data:
[9]: names = np.array(['Liam', 'Olivia', 'Noah', 'Liam', 'Noah', 'Olivia', 'Liam', 'Emma',
˓→'Oliver', 'Ava'])
[10]: names
[10]: array(['Liam', 'Olivia', 'Noah', 'Liam', 'Noah', 'Olivia', 'Liam', 'Emma',
'Oliver', 'Ava'], dtype='<U6')
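The data array shown below was presumably generated along these lines:

rng = np.random.default_rng()
data = rng.normal(size=(10, 4))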
[11]: data
[11]: array([[ 0.78849796, -1.93868484, 1.32082796, -0.10041207],
[ 0.14511387, 1.33143889, 0.32335453, -0.86912269],
[-1.33151508, 1.86170326, -0.07634186, -0.08733154],
[-1.30468553, -1.24239052, -0.82416291, 0.18988816],
[ 1.15544397, 1.95009689, -1.58486011, 0.5644875 ],
[-0.3425853 , 0.49727459, -1.13323998, 1.33729132],
[-2.86162133, -1.44457339, 2.484492 , -0.9733903 ],
[-1.04106218, -0.89449304, 0.50255481, 1.39502699],
[ 1.0782019 , -1.53786777, 0.72229253, 1.13447707],
[ 0.82987705, -0.17516968, -1.11954313, -0.02558353]])
Suppose each name corresponds to a row in the data array and we want to select all rows with the corresponding name
Liam. Like arithmetic operations, comparisons like == are vectorised with arrays. So comparing names with the string
Liam results in a Boolean array:
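For example:

names == 'Liam'
# -> array([ True, False, False,  True, False, False,  True, False, False, False])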
Here, the Boolean array must have the same length as the array axis it indexes.
Note:
Selecting data from an array by Boolean indexing and assigning the result to a new variable always creates a copy of
the data, even if the returned array is unchanged.
In the following example, I select the rows where names == 'Liam' and also index the columns:
To select everything except Liam, you can either use != or negate the condition with ~:
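A sketch of both selections (the column slice is only an example):

data[names == 'Liam', 2:]     # rows for Liam, columns from index 2 onwards
data[~(names == 'Liam')]      # all rows except those for Liam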
If you select two of the three names to combine several Boolean conditions, you can use the Boolean arithmetic operators
& (and) and | (or).
Warning:
The Python keywords and and or do not work with Boolean arrays.
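Based on the output below, the mask was presumably built like this:

mask = (names == 'Liam') | (names == 'Olivia')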
[18]: mask
[18]: array([ True, True, False, True, False, True, True, False, False,
False])
[19]: data[mask]
[19]: array([[ 0.78849796, -1.93868484, 1.32082796, -0.10041207],
[ 0.14511387, 1.33143889, 0.32335453, -0.86912269],
[-1.30468553, -1.24239052, -0.82416291, 0.18988816],
[-0.3425853 , 0.49727459, -1.13323998, 1.33729132],
[-2.86162133, -1.44457339, 2.484492 , -0.9733903 ]])
Integer array indexing allows you to select any elements in the array based on your N-dimensional index. Each integer
array represents a number of indices in that dimension.
See also:
• Integer array indexing
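A small example:

arr = np.arange(32).reshape((8, 4))
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
# -> array([ 4, 23, 29, 10])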
Transpose is a special form of reshaping that also provides a view of the underlying data without copying anything. Arrays have the transpose method and also the special T attribute:
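Based on the outputs below, data and reshaped_data were presumably created like this:

data = np.arange(16)
reshaped_data = data.reshape((4, 4))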
[3]: data
[3]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
[5]: reshaped_data
[5]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
[6]: reshaped_data.T
[6]: array([[ 0, 4, 8, 12],
[ 1, 5, 9, 13],
[ 2, 6, 10, 14],
[ 3, 7, 11, 15]])
The @ infix operator is another way to perform matrix multiplication. It implements the semantics of the @ operator
introduced in Python 3.5 with PEP 465 and is an abbreviation of np.matmul.
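For example:

reshaped_data.T @ reshaped_data    # equivalent to np.matmul(reshaped_data.T, reshaped_data)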
For higher dimensional arrays, transpose accepts a tuple of axis numbers to swap the axes:
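Based on the output below, array_3d was presumably created like this:

array_3d = np.arange(16).reshape((2, 2, 4))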
[10]: array_3d
[10]: array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
[11]: array_3d.transpose((1, 0, 2))
[11]: array([[[ 0,  1,  2,  3],
         [ 8,  9, 10, 11]],

        [[ 4,  5,  6,  7],
         [12, 13, 14, 15]]])
Here the axes have been reordered with the second axis in first place, the first axis in second place and the last axis
unchanged.
ndarray also has a swapaxes method that takes a pair of axis numbers and swaps the specified axes to rearrange the
data:
[12]: array_3d.swapaxes(1, 2)
[12]: array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. They can
be thought of as fast vectorised wrappers for simple functions that take one or more scalar values and produce one or
more scalar results.
Many ufuncs are simple element-wise transformations, such as sqrt or exp:
data = np.arange(10)
[2]: data
[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
[3]: np.sqrt(data)
[4]: np.exp(data)
[4]: array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
These are known as unary ufuncs. Others, such as add or maximum, take two arrays (binary ufuncs) and return a single array as the result:
[5]: x = np.random.randn(8)
[6]: y = np.random.randn(8)
[7]: x
[7]: array([ 0.27787269, -0.58356193, -0.10184236, 1.35055943, -1.57973878,
1.14278195, -1.08386512, 0.09227202])
[8]: y
[8]: array([ 0.32170798, 0.7041202 , -0.22124585, -0.01662225, 0.20266652,
0.53028692, 1.09999517, -0.47938682])
[9]: np.maximum(x, y)
[9]: array([ 0.32170798, 0.7041202 , -0.10184236, 1.35055943, 0.20266652,
1.14278195, 1.09999517, 0.09227202])
[10]: data = x * 5
[11]: data
[11]: array([ 1.38936344, -2.91780966, -0.50921179, 6.75279713, -7.89869391,
5.71390974, -5.4193256 , 0.46136012])
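The remainder and whole_part arrays shown below are presumably the result of np.modf, which splits an array into its fractional and integral parts:

remainder, whole_part = np.modf(x)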
[13]: remainder
[13]: array([ 0.27787269, -0.58356193, -0.10184236, 0.35055943, -0.57973878,
0.14278195, -0.08386512, 0.09227202])
[14]: whole_part
[14]: array([ 0., -0., -0., 1., -1., 1., -1., 0.])
Ufuncs accept an optional out argument that allows you to transfer your results to an existing array instead of creating
a new one:
[16]: np.add(data, 1)
[16]: array([ 2.38936344, -1.91780966, 0.49078821, 7.75279713, -6.89869391,
6.71390974, -4.4193256 , 1.46136012])
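The out array shown below is presumably filled like this:

out = np.zeros_like(data)
np.add(data, 1, out=out)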
[18]: out
[18]: array([ 2.38936344, -1.91780966, 0.49078821, 7.75279713, -6.89869391,
6.71390974, -4.4193256 , 1.46136012])
Function Description

abs, fabs  calculates the absolute value element by element for integer, floating point or complex values
sqrt  calculates the square root of each element (corresponds to data ** 0.5)
square  calculates the square of each element (corresponds to data ** 2)
exp  calculates the exponent e^x of each element
log, log10, log2, log1p  natural logarithm (base e), log base 10, log base 2 and log(1 + x) respectively
sign  calculates the sign of each element: 1 (positive), 0 (zero), or -1 (negative)
ceil  calculates the upper limit of each element (i.e. the smallest integer greater than or equal to this number)
floor  calculates the lower limit of each element (i.e. the largest integer less than or equal to each element)
rint  rounds elements to the nearest integer, preserving the dtype
modf  returns the fractional and integer parts of the array as separate arrays
isnan  returns a boolean array indicating whether each value is NaN (Not a Number)
isfinite, isinf  returns a boolean array indicating whether each element is finite (not-inf, not-NaN) or infinite, respectively
cos, cosh, sin, sinh, tan, tanh  regular and hyperbolic trigonometric functions
arccos, arccosh, arcsin, arcsinh, arctan, arctanh  inverse trigonometric functions
logical_not  calculates the truth value of not x element by element (corresponds to ~data)
Function Description

add  add corresponding elements in arrays
subtract  subtracts elements in the second array from the first array
multiply  multiply array elements
divide, floor_divide  divide or truncate the remainder
power  increases elements in the first array to the powers specified in the second array
maximum, fmax  element-wise maximum; fmax ignores NaN
minimum, fmin  element-wise minimum; fmin ignores NaN
mod  element-wise modulus (remainder of the division)
copysign  copies the sign of the values in the second argument to the values in the first argument
greater, greater_equal, less, less_equal, equal, not_equal  perform element-wise comparisons that result in a Boolean array (corresponds to the infix operators >, >=, <, <=, ==, !=)
logical_and  calculates the element-wise truth value of the logical operation AND (&)
logical_or  calculates the element-wise truth value of the logical operation OR (|)
logical_xor  calculates the element-wise truth value of the logical operation XOR (^)
Note:
A complete overview of binary universal functions can be found in Universal functions (ufunc).
Using NumPy arrays allows you to express many types of data processing tasks as concise array expressions that would
otherwise require writing for-loops. This practice of replacing loops with array expressions is also called vectorisation.
In general, vectorised array operations are significantly faster than their pure Python equivalents.
And finally, we compare the time with the calculation of the quadratic function of all values of a Python list:
Now we want to take the values from data1 if the corresponding value in cond is True and otherwise take the value
from data2. With Python’s if-else, this could look like this:
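A sketch of the two approaches (assuming data1, data2 and cond have been defined as described):

# pure Python: element-wise selection with a list comprehension
result = [(x if c else y) for x, y, c in zip(data1, data2, cond)]

# vectorised equivalent
result = np.where(cond, data1, data2)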
result
[3]: [array([-0.97770921, 0.18774118, 0.02492351, -1.71622535, 0.07635826,
-0.78475485, 1.35754775])]
result
[4]: array([[-0.97770921, 2.12735854, 0.02492351, 0.44203037, 0.07635826,
-0.78475485, 1.35754775]])
The second and third arguments of np.where do not have to be arrays; one or both can also be scalars. A typical
use of where in data analysis is to create a new array of values based on another array. Suppose you have a matrix of
randomly generated data and you want to make all the negative values positive values:
data
[5]: array([[ 0.49903363, 0.67358076, 0.12044423, 0.61645566],
[-0.06063701, -0.39860593, -0.06098876, 1.50512719],
[ 1.04223193, -0.0546117 , 0.84049503, 0.36221393],
[-0.62065844, 0.80729481, -0.13285982, 0.79702012]])
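For example:

np.where(data < 0, -data, data)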
A number of mathematical functions that calculate statistics over an entire array or over the data along an axis are
accessible as methods of the array class. So you can use aggregations such as sum, mean and standard deviation by
either calling the array instance method or using the top-level NumPy function.
Below I generate some random data and calculate some aggregated statistics:
data = np.random.randn(7, 3)
data
[1]: array([[ 0.1636264 , -0.07479277, -0.20255074],
[-0.74231862, 1.09947779, 0.78135583],
[ 0.63855308, 0.32602387, 0.61464571],
[ 0.23415688, -0.91457139, -0.31875483],
[-0.02505313, 1.25184632, 1.4875972 ],
[-0.20283564, 3.13198588, 1.74734662],
[ 0.93549941, -1.0149449 , -0.44490236]])
[2]: data.mean()
[2]: 0.40339955359802476
[3]: np.mean(data)
[3]: 0.40339955359802476
[4]: data.sum()
[4]: 8.47139062555852
Functions like mean and sum take an optional axis argument that calculates the statistic over the specified axis, resulting in an array with one less dimension:
[5]: data.mean(axis=0)
[5]: array([0.14308977, 0.54357497, 0.52353392])
[6]: data.sum(axis=0)
[6]: array([1.00162838, 3.8050248 , 3.66473745])
With data.mean(0), which is the same as data.mean(axis=0), the mean is calculated over the rows, while data.
sum(0) calculates the sum over the rows.
Other methods like cumsum and cumprod, however, do not aggregate but create a new array with the intermediate
results.
In multidimensional arrays, accumulation functions such as cumsum and cumprod return an array of the same size but
with the partial aggregates calculated along the specified axis:
[7]: data.cumsum()
[7]: array([ 0.1636264 , 0.08883362, -0.11371711, -0.85603574, 0.24344206,
1.02479789, 1.66335097, 1.98937484, 2.60402055, 2.83817743,
1.92360604, 1.60485122, 1.57979809, 2.83164441, 4.31924161,
4.11640597, 7.24839184, 8.99573847, 9.93123788, 8.91629298,
8.47139063])
[8]: data.cumprod()
[8]: array([ 1.63626397e-01, -1.22380721e-02, 2.47883053e-03, -1.84008206e-03,
-2.02312937e-03, -1.58078392e-03, -1.00941444e-03, -3.29093203e-04,
-2.02275727e-04, -4.73642533e-05, 4.33179909e-05, -1.38078186e-05,
3.45929093e-07, 4.33050061e-07, 6.44204058e-07, -1.30667543e-07,
-4.09248898e-07, -7.15099681e-07, -6.68975332e-07, 6.78973099e-07,
-3.02076731e-07])
Method Description
sum Sum of all elements in the array or along an axis.
mean Arithmetic mean; for arrays with length zero, NaN is returned.
std, var Standard deviation and variance respectively
min, max Minimum and maximum
argmin, argmax Indices of the minimum and maximum elements respectively
cumsum Cumulative sum of the elements, starting with 0
cumprod Cumulative product of the elements, starting with 1
Boolean values have been converted to 1 (True) and 0 (False) in the previous methods. Therefore, sum is often used
to count the True values in a Boolean array:
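The corresponding cell is not included in this excerpt; with a random array data as above, such a count might look like this:

(data > 0).sum()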
There are two additional methods, any and all, which are particularly useful for Boolean arrays:
• any checks whether one or more values in an array are true
• all checks whether each value is true
bools
[5]: array([[ True, False, False],
[False, False, True],
[ True, True, False],
[False, True, False],
[ True, False, False],
[False, False, False],
[ True, True, True]])
[6]: bools.any()
[6]: True
[7]: bools.all()
[7]: False
2.3.12 Sort
As with Python's lists, NumPy arrays can be sorted in place with the ndarray.sort method. You can sort any one-dimensional section of values in a multidimensional array in place along an axis by passing the axis number to sort:
data = np.random.randn(7, 3)
data
[4]: array([[ 0.01030144, 0.78524872, 0.42930263],
[-1.51182764, 0.16973497, -0.6294608 ],
[-1.63687566, 2.89978519, -1.12282476],
[-0.23121837, 0.40594707, -0.1667278 ],
[ 1.57718982, -0.4898681 , -0.38596701],
[ 0.92491973, -0.69984905, -1.66848346],
[-0.87552462, -0.04699234, 1.33797794]])
[9]: data.sort(0)
data
[9]: array([[-1.63687566, -0.69984905, -1.66848346],
[-1.51182764, -0.4898681 , -1.12282476],
[-0.87552462, -0.04699234, -0.6294608 ],
[-0.23121837, 0.16973497, -0.38596701],
[ 0.01030144, 0.40594707, -0.1667278 ],
[ 0.92491973, 0.78524872, 0.42930263],
[ 1.57718982, 2.89978519, 1.33797794]])
np.sort, on the other hand, returns a sorted copy of an array instead of changing the array in place:
[10]: data
[10]: array([[-1.63687566, -0.69984905, -1.66848346],
[-1.51182764, -0.4898681 , -1.12282476],
[-0.87552462, -0.04699234, -0.6294608 ],
[-0.23121837, 0.16973497, -0.38596701],
[ 0.01030144, 0.40594707, -0.1667278 ],
[ 0.92491973, 0.78524872, 0.42930263],
[ 1.57718982, 2.89978519, 1.33797794]])
NumPy has some basic set operations for one-dimensional ndarray. A commonly used one is numpy.unique, which
returns the sorted unique values in an array:
[2]: np.unique(names)
[2]: array(['Ava', 'Emma', 'Liam', 'Noah', 'Oliver', 'Olivia'], dtype='<U6')
With numpy.in1d you can check whether each value of a one-dimensional array is contained in another array; a Boolean array is returned:
[3]: array([False, False, False, False, False, False, False, True, False,
True])
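The call that produced the Boolean array above is not shown; as a sketch with hypothetical test values, it could look like this:

np.in1d(names, ['Liam', 'Olivia'])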
Method Description
unique(x) calculates the sorted, unique elements in x
intersect1d(x, y) calculates the sorted, common elements of x and y
union1d(x, y) calculates the sorted union of the elements
in1d(x, y) computes a Boolean array indicating whether each element of x is contained in y
setdiff1d(x, y) set difference: the elements in x that are not contained in y
setxor1d(x, y) symmetric set difference: the elements that are contained in one of the arrays, but not in both
NumPy can store data on disk in a number of text or binary formats and load it from there. However, in this section I only discuss NumPy's own binary format, as pandas or other tools are mostly used to load text or tabular data (see Read, persist and provide data).
np.save and np.load are the two most important functions for efficiently saving and loading array data to disk. Arrays
are saved by default in an uncompressed raw binary format with the file extension .npy:
data = np.random.randn(7, 3)
np.save('my_data', data)
If the file path does not already end in .npy, the extension is appended. The array on the hard disk can then be loaded
with np.load:
[2]: np.load('my_data.npy')
[2]: array([[ 0.94742917, -1.61998927, 0.58811464],
[ 0.76779587, -1.30799936, -0.23834223],
[-2.13912266, -0.49191404, -0.49316354],
[-0.62636589, -1.2023549 , 1.41655391],
[ 0.85182508, 0.09138964, 0.79240571],
[ 1.22517044, 0.9233269 , 0.44967536],
[-0.68252408, -0.00765624, 0.10032318]])
You can save multiple arrays in an uncompressed archive by using np.savez and passing the arrays as keyword arguments:
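The savez call itself is missing from this excerpt; a sketch with a hypothetical file name and a second, derived array could look like this (np.savez_compressed works the same way but writes a compressed archive):

np.savez('my_archive.npz', a=data, b=data ** 2)
archive = np.load('my_archive.npz')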
archive['b']
[4]: array([[8.97622033e-01, 2.62436524e+00, 3.45878832e-01],
[5.89510504e-01, 1.71086232e+00, 5.68070195e-02],
2.4 pandas
pandas is a Python library for data analysis that has become very popular in recent years. On the website, pandas is
described thus:
„pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built
on top of the Python programming language.“
More specifically, pandas is an in-memory analysis tool that offers SQL-like constructs, as well as statistical and ana-
lytical tools. In doing so, pandas builds on Cython and NumPy, making it less memory intensive and faster than pure
Python code. Mostly pandas is used to
• replace Excel and Power BI
• implement an ETL process
• process CSV or JSON data
• prepare data for machine learning
See also:
• Home
• User guide
• API reference
• GitHub
To get started with pandas, you should first familiarise yourself with the two most important data structures Series and
DataFrame.
Series
A series is a one-dimensional array-like object containing a sequence of values (of similar types to the NumPy types)
and an associated array of data labels called an index. The simplest series is formed from just an array of data:
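The construction cell is not part of this excerpt; assuming the usual imports of pandas as pd and NumPy as np, it presumably looks something like this:

rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
s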
[2]: 0 0.145710
1 1.445287
2 -0.631168
3 -0.158460
4 0.408335
5 1.203595
6 0.585688
dtype: float64
The string representation of an interactively displayed series shows the index on the left and the values on the right.
Since we have not specified an index for the data, a default index is created consisting of the integers 0 to N - 1
(where N is the length of the data). You can get the array representation and the index object of the series via their
pandas.Series.array and pandas.Series.index attributes respectively:
[3]: s.array
[3]: <PandasArray>
[0.14571022099248174, 1.4452870427787552, -0.63116828428809,
-0.1584603180747951, 0.40833540099850335, 1.2035948244521983,
0.5856883717533461]
Length: 7, dtype: float64
[4]: s.index
[4]: RangeIndex(start=0, stop=7, step=1)
Often you will want to create an index that identifies each data point with a label:
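The index idx used in the next cell is not defined in this excerpt; judging by the output, it is a daily date range, for example:

idx = pd.date_range('2022-01-31', periods=7)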
s2 = pd.Series(rng.normal(size=7), index=idx)
[6]: s2
[6]: 2022-01-31 -0.879307
2022-02-01 1.491183
2022-02-02 -0.778551
2022-02-03 2.963484
2022-02-04 0.134357
2022-02-05 0.357678
2022-02-06 -1.124577
Freq: D, dtype: float64
See also:
• Time series / date functionality
Compared to NumPy arrays, you can use labels in the index if you want to select individual values or a group of values:
[7]: s2['2022-02-02']
[7]: -0.7785505322106326
Here ['2022-02-02', '2022-02-03', '2022-02-04'] is interpreted as a list of indices, even if it contains strings
instead of integers.
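The corresponding cell is not shown; such a selection might look like this:

s2[['2022-02-02', '2022-02-03', '2022-02-04']]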
When using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication
or applying mathematical functions, the link between index and value is preserved:
[10]: s2 ** 2
[10]: 2022-01-31 0.773180
2022-02-01 2.223628
2022-02-02 0.606141
2022-02-03 8.782240
2022-02-04 0.018052
2022-02-05 0.127933
2022-02-06 1.264674
Freq: D, dtype: float64
[11]: np.exp(s2)
[11]: 2022-01-31 0.415071
2022-02-01 4.442349
2022-02-02 0.459071
2022-02-03 19.365331
2022-02-04 1.143801
2022-02-05 1.430005
2022-02-06 0.324790
Freq: D, dtype: float64
You can also think of a series as a fixed-length ordered dict, since it is an assignment of index values to data values. It
can be used in many contexts where you could use a dict:
[12]: '2022-02-02' in s2
[12]: True
[13]: '2022-02-09' in s2
[13]: False
Missing data
I will use NA and null synonymously to indicate missing data. The functions isna and notna in pandas should be
used to identify missing data:
[14]: pd.isna(s2)
[14]: 2022-01-31 False
2022-02-01 False
2022-02-02 False
2022-02-03 False
2022-02-04 False
2022-02-05 False
2022-02-06 False
Freq: D, dtype: bool
[15]: pd.notna(s2)
[15]: 2022-01-31 True
2022-02-01 True
2022-02-02 True
2022-02-03 True
2022-02-04 True
2022-02-05 True
2022-02-06 True
Freq: D, dtype: bool
[16]: s2.isna()
[16]: 2022-01-31 False
2022-02-01 False
2022-02-02 False
2022-02-03 False
2022-02-04 False
2022-02-05 False
2022-02-06 False
Freq: D, dtype: bool
Dealing with missing data is discussed in more detail in the section Managing missing data with pandas.
A useful feature of Series for many applications is the automatic alignment by index labels in arithmetic operations:
s3 = pd.Series(rng.normal(size=7), index=idx)
[18]: s2, s3
[18]: (2022-01-31 -0.879307
2022-02-01 1.491183
2022-02-02 -0.778551
2022-02-03 2.963484
2022-02-04 0.134357
[19]: s2 + s3
[19]: 2022-01-31 NaN
2022-02-01 0.957289
2022-02-02 -1.799431
2022-02-03 3.468533
2022-02-04 0.669346
2022-02-05 0.193651
2022-02-06 -0.157452
2022-02-07 NaN
Freq: D, dtype: float64
s3
[20]: date
2022-02-01 -0.533894
2022-02-02 -1.020880
2022-02-03 0.505049
2022-02-04 0.534989
2022-02-05 -0.164027
2022-02-06 0.967125
2022-02-07 0.427053
Freq: D, Name: floats, dtype: float64
DataFrame
A DataFrame represents a rectangular data table and contains an ordered, named collection of columns, each of which
can have a different value type. The DataFrame has both a row index and a column index.
Note:
Although a DataFrame is two-dimensional, you can also use it to represent higher-dimensional data in a table format
with hierarchical indexing using join, combine and Reshaping.
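The data dictionary used in the next cell is not included in this excerpt; judging by the output, it looks roughly like this:

data = {
    'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],
    'Decimal': [0, 1, 2, 3, 4, 5],
    'Octal': ['001', '002', '003', '004', '004', '005'],
    'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E'],
}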
df = pd.DataFrame(data)
df
[21]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E
For large DataFrames, the head method selects only the first five rows:
[22]: df.head()
[22]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
If you want to pass a column that is not contained in the dict, it will appear without values in the result:
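The cell creating df2 is not shown; based on the output below, it presumably passes an additional Description column and uses the Code column as index, for example:

df2 = pd.DataFrame(
    data, columns=['Decimal', 'Octal', 'Description', 'Key'], index=df['Code']
)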
df2
[25]: df['Code']
[25]: 0 U+0000
1 U+0001
2 U+0002
3 U+0003
4 U+0004
5 U+0005
Name: Code, dtype: object
df2
[26]: Decimal Octal Description Key
Code
U+0000 0 001 NaN NUL
U+0001 1 002 NaN Ctrl-A
U+0002 2 003 NaN Ctrl-B
U+0003 3 004 NaN Ctrl-C
U+0004 4 004 NaN Ctrl-D
U+0005 5 005 NaN Ctrl-E
[27]: df2.loc['U+0001']
[27]: Decimal 1
Octal 002
Description NaN
Key Ctrl-A
Name: U+0001, dtype: object
Column values can be changed by assignment. For example, a scalar value or an array of values could be assigned to
the empty Description column:
[28]: df2['Description'] = [
    'Null character',
    'Start of Heading',
    'Start of Text',
    'End-of-text character',
    'End-of-transmission character',
    'Enquiry character',
]
df2
[28]: Decimal Octal Description Key
Code
U+0000 0 001 Null character NUL
U+0001 1 002 Start of Heading Ctrl-A
U+0002 2 003 Start of Text Ctrl-B
U+0003 3 004 End-of-text character Ctrl-C
U+0004 4 004 End-of-transmission character Ctrl-D
U+0005 5 005 Enquiry character Ctrl-E
[30]: df2.columns
[30]: Index(['Decimal', 'Octal', 'Description', 'Key'], dtype='object')
[31]: df3.columns
[31]: Index(['Description', 'Key'], dtype='object')
[32]: u = {
    'U+0006': {'Decimal': '6', 'Octal': '006', 'Description': 'Acknowledge character', 'Key': 'Ctrl-F'},
    'U+0007': {'Decimal': '7', 'Octal': '007', 'Description': 'Bell character', 'Key': 'Ctrl-G'},
}
df4 = pd.DataFrame(u)
df4
[32]: U+0006 U+0007
Decimal 6 7
Octal 006 007
Description Acknowledge character Bell character
Key Ctrl-F Ctrl-G
You can transpose the DataFrame, i.e. swap the rows and columns, with a similar syntax to a NumPy array:
[33]: df4.T
[33]: Decimal Octal Description Key
U+0006 6 006 Acknowledge character Ctrl-F
U+0007 7 007 Bell character Ctrl-G
Warning:
Note that when transposing, the data types of the columns are discarded if the columns do not all have the same data
type, so when transposing and then transposing back, the previous type information may be lost. In this case, the
columns become arrays of pure Python objects.
The keys in the inner dicts are combined to form the index in the result. This is not the case when an explicit index is
specified:
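The corresponding cell is not included here; a sketch of passing an explicit index, in which case the inner dict keys are not used, could be:

pd.DataFrame(u, index=['Decimal', 'Octal', 'Key'])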
Python data structures such as lists and arrays can be converted into pandas Series or DataFrames.
Series
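The lists themselves are not part of this excerpt; judging by the outputs further down, they could contain, for example, the following values (pandas is assumed to be imported as pd):

list1 = [-0.751442, 0.816935, -0.272546, -0.268295, -0.296728, 0.176255, -0.322612]
list2 = [-0.029608, -0.277982, 2.693057, -0.850817, 0.783868, -1.137835, -0.617132]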
pd.Series(list1)
[2]: 0 -0.751442
1 0.816935
2 -0.272546
3 -0.268295
4 -0.296728
5 0.176255
6 -0.322612
dtype: float64
Multiple lists can also be easily converted into one pandas Series:
pd.Series(list1 + list2)
[3]: 0 -0.751442
1 0.816935
2 -0.272546
3 -0.268295
4 -0.296728
5 0.176255
6 -0.322612
With Python dictionaries you can pass not only values but also the corresponding keys to a pandas series:
[5]: dict1 = {
'2022-01-31': -0.751442,
'2022-02-01': 0.816935,
'2022-02-02': -0.272546,
'2022-02-03': -0.268295,
'2022-02-04': -0.296728,
'2022-02-05': 0.176255,
'2022-02-06': -0.322612
}
pd.Series(dict1)
[5]: 2022-01-31 -0.751442
2022-02-01 0.816935
2022-02-02 -0.272546
2022-02-03 -0.268295
2022-02-04 -0.296728
2022-02-05 0.176255
2022-02-06 -0.322612
dtype: float64
When you pass a dict, the index in the resulting pandas series takes into account the order of the keys in the dict.
With collections.ChainMap you can also turn several dicts into one pandas.Series.
First we define a second dict:
[6]: dict2 = {
'2022-02-07': -0.029608,
'2022-02-08': -0.277982,
'2022-02-09': 2.693057,
'2022-02-10': -0.850817,
'2022-02-11': 0.783868,
'2022-02-12': -1.137835,
'2022-02-13': -0.617132
}
pd.Series(ChainMap(dict1, dict2))
[7]: 2022-02-07 -0.029608
2022-02-08 -0.277982
2022-02-09 2.693057
2022-02-10 -0.850817
2022-02-11 0.783868
2022-02-12 -1.137835
2022-02-13 -0.617132
2022-01-31 -0.751442
2022-02-01 0.816935
2022-02-02 -0.272546
2022-02-03 -0.268295
2022-02-04 -0.296728
2022-02-05 0.176255
2022-02-06 -0.322612
dtype: float64
DataFrame
[11]: pd.DataFrame(data)
[11]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E
[12]: data2 = {
'U+0006': {'Decimal': '6', 'Octal': '006', 'Key': 'Ctrl-F'},
'U+0007': {'Decimal': '7', 'Octal': '007', 'Key': 'Ctrl-G'},
}
df2 = pd.DataFrame(data2)
df2
[12]: U+0006 U+0007
Decimal 6 7
Octal 006 007
Key Ctrl-F Ctrl-G
[13]: data3 = {
'U+0006': df2['U+0006'][2:],
'U+0007': df2['U+0007'][2:]
}
pd.DataFrame(data3)
[13]: U+0006 U+0007
Key Ctrl-F Ctrl-G
2.4.3 Indexing
Index objects
The index objects of pandas are responsible for the axis labels and other metadata, such as the axis name. Any array
or other sequence of labels you use when constructing a series or DataFrame is internally converted into an index:
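The series obj used below is not created in this excerpt; judging by the outputs, it could be defined like this:

obj = pd.Series(range(7), index=pd.date_range('2022-02-02', periods=7))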
[2]: obj.index
[2]: DatetimeIndex(['2022-02-02', '2022-02-03', '2022-02-04', '2022-02-05',
'2022-02-06', '2022-02-07', '2022-02-08'],
dtype='datetime64[ns]', freq='D')
[3]: obj.index[3:]
[3]: DatetimeIndex(['2022-02-05', '2022-02-06', '2022-02-07', '2022-02-08'], dtype=
˓→'datetime64[ns]', freq='D')
Index objects are immutable and therefore cannot be changed by the user:
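The failing assignment itself is not shown; an attempt of the following kind triggers the traceback excerpt below:

obj.index[1] = pd.Timestamp('2022-02-09')  # raises TypeError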
~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexes/base.py in __setitem__(self, key, value)
4583 @final
4584 def __setitem__(self, key, value):
-> 4585 raise TypeError("Index does not support mutable operations")
4586
4587 def __getitem__(self, key):
Immutability makes the sharing of index objects in data structures more secure:
labels = pd.Index(np.arange(3))
labels
[5]: Int64Index([0, 1, 2], dtype='int64')
[7]: obj2
[7]: 0 0.038721
1 1.012864
2 -1.562999
dtype: float64
[10]: df1
[10]: Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
[11]: df1.columns
[11]: Index(['Code', 'Decimal', 'Octal'], dtype='object')
df12
[14]: Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
0 U+0006 6 006
1 U+0007 7 007
When selecting duplicate labels, all occurrences of the label in question are selected:
[15]: df12.loc[1]
[16]: df12.loc[2]
[16]: Code U+0002
Decimal 2
Octal 003
Name: 2, dtype: object
Data selection is one of the main points that behaves differently with duplicates. Indexing a label with multiple entries
results in a series, while single entries result in a scalar value. This can complicate your code because the output type
of indexing can vary depending on whether a label is repeated or not. In addition, many pandas functions, such as
reindex, require labels to be unique. You can use the is_unique property of the index to determine whether its
labels are unique or not:
[17]: df12.index.is_unique
[17]: False
df12
[18]: Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
6 U+0006 6 006
7 U+0007 7 007
Each index has a number of set logic methods and properties that answer other general questions about the data it
contains. The following are some useful methods and properties:
Method Description
concat concatenates additional index objects, creating a new index
Index.difference calculates the difference of two sets as an index
Index.intersection calculates the intersection
Index.union calculates the union set
Index.isin computes a Boolean array indicating whether each value is contained in the passed collection
Index.delete computes a new index by deleting the element at position i
Index.drop computes a new index by deleting the passed values
Index.insert computes a new index by inserting an element at position i
Index.is_monotonic returns True if each element is greater than or equal to the previous element
Index.is_unique returns True if the index does not contain duplicate values
Index.unique calculates the array of unique values in the index
An important method for pandas objects is reindex, which can be used to create a new object with rearranged values that match a new index. Consider, for example:
[20]: obj
[20]: 2022-02-02 0
2022-02-03 1
2022-02-04 2
2022-02-05 3
2022-02-06 4
2022-02-07 5
2022-02-08 6
Freq: D, dtype: int64
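new_index is not defined in this excerpt; judging by the result below, it is a daily date range shifted by one day, for example:

new_index = pd.date_range('2022-02-03', periods=7)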
[22]: obj.reindex(new_index)
[22]: 2022-02-03 1.0
2022-02-04 2.0
2022-02-05 3.0
2022-02-06 4.0
2022-02-07 5.0
2022-02-08 6.0
2022-02-09 NaN
Freq: D, dtype: float64
reindex creates a new object with the data conformed to the new index. By default, values in the new index for which there are no corresponding records in the original object become NaN.
For ordered data such as time series, it may be desirable to interpolate or fill values during reindexing. The method
option allows this with a method like ffill that fills the values forward:
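The corresponding cell is missing from this excerpt; with the series above it might look like this:

obj.reindex(new_index, method='ffill')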
For a DataFrame, reindex can change either the (row) index, the columns or both. If only a sequence is passed, the
rows in the result are re-indexed:
[24]: df1.reindex(range(7))
[24]: Code Decimal Octal
0 U+0000 0.0 001
1 U+0001 1.0 002
2 U+0002 2.0 003
3 U+0003 3.0 004
4 U+0004 4.0 004
5 U+0005 5.0 005
6 NaN NaN NaN
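The column list encoding used in the next cell is not defined in this excerpt; judging by the output, it is:

encoding = ['Octal', 'Code', 'Description']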
df1.reindex(columns=encoding)
[25]: Octal Code Description
0 001 U+0000 NaN
1 002 U+0001 NaN
2 003 U+0002 NaN
3 004 U+0003 NaN
4 004 U+0004 NaN
5 005 U+0005 NaN
Argument Description
labels New sequence to be used as index. Can be an Index instance or another sequence-like Python data structure. An Index is used exactly as it is, without being copied.
axis The axis to reindex, either 'index' (rows) or 'columns'. The default is 'index'. You can alternatively use reindex(index=new_labels) or reindex(columns=new_labels).
method Interpolation (fill) method; 'ffill' fills forwards, while 'bfill' fills backwards.
fill_value Substitute value to be used when missing data is introduced by re-indexing; for example, use fill_value='missing' if the missing labels in the result should contain that value instead of NaN.
limit When filling forward or backward, the maximum number of elements to fill.
tolerance When filling forward or backward, the maximum size of the gap to be filled for inexact matches.
level Match simple index on a level of a MultiIndex; otherwise select a subset.
copy If True, the underlying data is always copied, even if the new index matches the old index; if False, the data is not copied if the indices are equivalent.
The axis labels can be converted by a function or mapping to create new, differently labelled objects. You can also
change the axes in place without creating a new data structure. Here is a simple example:
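The DataFrame df3 is not created in this excerpt; judging by the output, it could be defined like this:

df3 = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=['Deutsch', 'English', 'Français'],
    columns=[1, 2, 3, 4],
)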
df3
[26]: 1 2 3 4
Deutsch 0 1 2 3
English 4 5 6 7
Français 8 9 10 11
Such a transformation can be expressed with a function or Index.map:
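The transform function is not defined in this excerpt; judging by the result, it could, for example, keep the first two letters in upper case:

transform = lambda x: x[:2].upper()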
df3.index.map(transform)
[27]: Index(['DE', 'EN', 'FR'], dtype='object')
You can assign the index and change the DataFrame on the spot:
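The assignment itself is not shown; it presumably reads:

df3.index = df3.index.map(transform)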
df3
[28]: 1 2 3 4
DE 0 1 2 3
EN 4 5 6 7
FR 8 9 10 11
If you want to create a converted version of your dataset without changing the original, you can use rename:
[29]: df3.rename(index=str.lower)
[29]: 1 2 3 4
de 0 1 2 3
en 4 5 6 7
fr 8 9 10 11
In particular, rename can be used in conjunction with a dict-like object that provides new values for a subset of the axis labels:
df3
[30]: 0 1 2 3
BE 0 1 2 3
DE 4 5 6 7
EN 8 9 10 11
rename saves you from manually copying the DataFrame and assigning new index and columns attributes. If you want to change a data set in place, also pass inplace=True:
df3
[31]: 0 0 1 2
BE 0 1 2 3
BE 4 5 6 7
DE 8 9 10 11
Hierarchical Indexing
Hierarchical indexing is an important feature of pandas that allows you to have multiple index levels on one axis. This
gives you the opportunity to work with higher dimensional data in a lower dimensional form.
Let's start with a simple example and create a series with a list of lists as its index:
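The cell creating the series is not included in this excerpt; judging by the output, it could look like this:

hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        ['Jupyter Tutorial', 'Jupyter Tutorial', 'PyViz Tutorial', 'Python Basics', 'Python Basics'],
        ['de', 'en', 'de', 'de', 'en'],
    ],
)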
hits
[32]: Jupyter Tutorial de 83080
en 20336
PyViz Tutorial de 11376
Python Basics de 1228
en 468
dtype: int64
What you see is a condensed view of a series with a pandas.MultiIndex. The gaps in the index display mean that the label from the line above is to be used.
[33]: hits.index
[33]: MultiIndex([('Jupyter Tutorial', 'de'),
('Jupyter Tutorial', 'en'),
( 'PyViz Tutorial', 'de'),
( 'Python Basics', 'de'),
( 'Python Basics', 'en')],
)
With a hierarchically indexed object, so-called partial indexing is possible, with which you can select subsets of the
data:
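The corresponding cells are not shown; partial indexing could look like this, for example:

hits['Python Basics']
hits['Jupyter Tutorial':'PyViz Tutorial']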
The selection is even possible from an inner level. In the following I select all values with en in the second index level:
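The cell itself is missing from this excerpt; such a selection from the inner level could look like this:

hits.loc[:, 'en']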
In pandas, whether you get a view or a copy depends on the structure and data types of the original DataFrame; this also determines whether changes made to it are propagated back to the original DataFrame.
Hierarchical indexing plays an important role in data reshaping and group-based operations such as forming a pivot
table. For example, you can reorder this data into a DataFrame using the pandas.Series.unstack method:
[38]: hits.unstack()
[38]: de en
Jupyter Tutorial 83080.0 20336.0
PyViz Tutorial 11376.0 NaN
Python Basics 1228.0 468.0
[39]: hits.unstack().stack()
[39]: Jupyter Tutorial de 83080.0
en 20336.0
PyViz Tutorial de 11376.0
Python Basics de 1228.0
en 468.0
dtype: float64
df = pd.DataFrame(version_hits,
df
[40]: 12/2021 01/2022 02/2022
latest stable latest stable latest stable
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0
The hierarchy levels can have names (as strings or any Python objects). If this is the case, they are displayed in the
console output:
df
[41]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0
Warning:
Note that the index names Month and Version are not part of the row labels (the df.index values).
With the partial column indexing you can select column groups in a similar way:
[42]: df['12/2021']
[42]: Version latest stable
Title Language
Jupyter Tutorial de 19651 0
en 4722 1825
PyViz Tutorial de 2573 0
Python Basics de 525 0
en 157 0
With MultiIndex.from_arrays, a MultiIndex can be created itself and then reused; the columns in the preceding
DataFrame with level names could be created in this way:
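The corresponding cell is not shown; a sketch based on the column labels above could look like this:

pd.MultiIndex.from_arrays(
    [
        ['12/2021', '12/2021', '01/2022', '01/2022', '02/2022', '02/2022'],
        ['latest', 'stable', 'latest', 'stable', 'latest', 'stable'],
    ],
    names=['Month', 'Version'],
)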
There may be times when you want to rearrange the order of the levels on an axis or sort the data by the values in
a particular level. The function DataFrame.swaplevel takes two level numbers or names and returns a new object in
which the levels are swapped (but the data remains unchanged):
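The swaplevel call itself is not included here; with the DataFrame above it could be, for example:

df.swaplevel('Title', 'Language')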
DataFrame.sort_index, on the other hand, sorts the data only by the values in a single level. When swapping levels, it
is not uncommon to also use sort_index so that the result is lexicographically sorted by the specified level:
[45]: df.sort_index(level=0)
[45]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0
However, the PyViz Tutorial will now be sorted before the Python Basics, as all upper case letters appear before lower
case letters in this sorting. To avoid this, you can use the following lambda function:
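The cell with the lambda function is not shown; one possibility is to sort case-insensitively via the key argument of sort_index:

df.sort_index(level=0, key=lambda x: x.str.lower())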
Note:
Data selection performance is much better for hierarchically indexed objects if the index is sorted lexicographically,
starting with the outermost level, i.e. the result of calling sort_index(level=0) or sort_index().
Many descriptive and summary statistics for DataFrame and Series have a level option that allows you to specify the
level by which you can aggregate on a particular axis. Consider the DataFrame above; we can aggregate either the
rows or the columns by level as follows:
[48]: df.groupby(level='Language').sum()
[48]: Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Language
de 22749 0 35434 0 37501 0
en 4879 1825 3582 2576 4235 3707
Internally, pandas’ DataFrame.groupby machinery is used for this purpose, which is explained in more detail in Group
Operations.
It is not uncommon to use one or more columns of a DataFrame as a row index; alternatively, you can move the row
index into the columns of the DataFrame. Here is an example DataFrame:
df = pd.DataFrame(data)
df
[50]: 0 1 2 3 4 5 6 7
0 Jupyter Tutorial de 19651 0 30134 0 33295 0
1 Jupyter Tutorial en 4722 1825 3497 2576 4009 3707
2 PyViz Tutorial de 2573 0 4873 0 3930 0
3 Python Basics de 525 0 427 0 276 0
4 Python Basics en 157 0 85 0 226 0
The function pandas.DataFrame.set_index creates a new DataFrame that uses one or more of its columns as an index:
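The set_index call itself is not shown; judging by the result, it uses the first two columns:

df2 = df.set_index([0, 1])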
df2
[51]: 2 3 4 5 6 7
0 1
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0
By default, the columns are removed from the DataFrame, but you can also leave them in by passing drop=False to
set_index:
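The corresponding cell is missing; it presumably looks like this:

df.set_index([0, 1], drop=False)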
7
0 1
Jupyter Tutorial de 0
en 3707
PyViz Tutorial de 0
Python Basics de 0
en 0
DataFrame.reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into
the columns:
[53]: df2.reset_index()
[53]: 0 1 2 3 4 5 6 7
0 Jupyter Tutorial de 19651 0 30134 0 33295 0
1 Jupyter Tutorial en 4722 1825 3497 2576 4009 3707
2 PyViz Tutorial de 2573 0 4873 0 3930 0
3 Python Basics de 525 0 427 0 276 0
4 Python Basics en 157 0 85 0 226 0
With pandas you can create Series with date and time information. In the following we will show common operations
with date data.
Note:
pandas supports dates stored in UTC values using the datetime64[ns] datatype. Local times from a single time zone
are also supported. Multiple time zones are supported by a pandas.Timestamp object. If you need to handle times from
multiple time zones, I would probably split the data by time zone and use a separate DataFrame or Series for each time
zone.
See also:
• Time series / date functionality
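The cells creating dt and utc are not part of this excerpt; judging by the outputs below, they could look like this:

dt = pd.date_range('2022-03-27', periods=6, freq='H')
utc = dt.tz_localize('UTC')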
dt
[1]: DatetimeIndex(['2022-03-27 00:00:00', '2022-03-27 01:00:00',
'2022-03-27 02:00:00', '2022-03-27 03:00:00',
'2022-03-27 04:00:00', '2022-03-27 05:00:00'],
dtype='datetime64[ns]', freq='H')
utc
[2]: DatetimeIndex(['2022-03-27 00:00:00+00:00', '2022-03-27 01:00:00+00:00',
'2022-03-27 02:00:00+00:00', '2022-03-27 03:00:00+00:00',
'2022-03-27 04:00:00+00:00', '2022-03-27 05:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='H')
Note:
The type of the result dtype='datetime64[ns, UTC]' indicates that the data is stored as UTC.
Let’s convert this series to the time zone Europe/Berlin:
[3]: utc.tz_convert('Europe/Berlin')
[3]: DatetimeIndex(['2022-03-27 01:00:00+01:00', '2022-03-27 03:00:00+02:00',
'2022-03-27 04:00:00+02:00', '2022-03-27 05:00:00+02:00',
'2022-03-27 06:00:00+02:00', '2022-03-27 07:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')
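The name local used below is not assigned in this excerpt; it presumably refers to the converted index:

local = utc.tz_convert('Europe/Berlin')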
local.tz_convert('UTC')
[4]: DatetimeIndex(['2022-03-27 00:00:00+00:00', '2022-03-27 01:00:00+00:00',
'2022-03-27 02:00:00+00:00', '2022-03-27 03:00:00+00:00',
'2022-03-27 04:00:00+00:00', '2022-03-27 05:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='H')
If you have a Series with UTC or local time information, you can use this code to determine the seconds according
to Unix time:
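The cell itself is not shown; one common way to compute the Unix timestamps, which produces floating-point seconds like those below, is:

uts = (utc - pd.Timestamp('1970-01-01', tz='UTC')) / pd.Timedelta('1s')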
uts
[5]: array([1.6483392e+09, 1.6483428e+09, 1.6483464e+09, 1.6483500e+09,
1.6483536e+09, 1.6483572e+09])
Manipulation of dates
Convert to strings
With pandas.DatetimeIndex you have some possibilities to convert date and time into strings, for example into the name
of the weekday:
[7]: local.day_name(locale='en_GB.utf8')
[7]: Index(['Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday'], dtype='object')
You can find out which locale is available to you with locale -a:
[8]: !locale -a
C
C.UTF-8
de_AT.utf8
de_BE.utf8
de_CH.utf8
de_DE.utf8
de_IT.utf8
de_LI.utf8
de_LU.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Other attributes of DatetimeIndex that can be used to convert date and time into strings are:
Attribute Description
year the year of the datetime
month the month of the datetime (January=1, December=12)
day the day of the datetime
hour the hours of the datetime
minute the minutes of the datetime
second the seconds of the datetime
microsecond the microseconds of the datetime
nanosecond the nanoseconds of the datetime
date returns a NumPy array of Python datetime.date objects
time returns a NumPy array of datetime.time objects
timetz returns a NumPy array of datetime.time objects with timezone information
dayofyear, day_of_year the ordinal day of the year
dayofweek the day of the week with Monday (0) and Sunday (6)
day_of_week the day of the week with Monday (0) and Sunday (6)
weekday the day of the week with Monday (0) and Sunday (6)
quarter returns the quarter of the year
tz returns the time zone
freq returns the frequency object if it is set, otherwise None
freqstr returns the frequency object as a string if it is set, otherwise None
is_month_start indicates if the date is the first day of the month
is_month_end indicates whether the date is the last day of the month
is_quarter_start indicates whether the date is the first day of a quarter
is_quarter_end shows if the date is the last day of a quarter
is_year_start indicates whether the date is the first day of a year
is_year_end indicates whether the date is the last day of a year
is_leap_year Boolean indicator if the date falls in a leap year
inferred_freq tries to return a string representing a frequency determined by infer_freq
However, there are also some methods with which you can convert the DatetimeIndex into strings, for example
strftime:
[9]: local.strftime('%d.%m.%Y')
[9]: Index(['27.03.2022', '27.03.2022', '27.03.2022', '27.03.2022', '27.03.2022',
'27.03.2022'],
dtype='object')
Note:
In strftime() and strptime() Format Codes you get an overview of the different formatting possibilities of strftime.
Other methods are:
Method Description
normalize converts times to midnight
strftime converts to index using the specified date format
snap snaps the timestamp to the next occurring frequency
tz_convert convert a tz capable datetime array/index from one time zone to another
tz_localize localises tz-naive datetime array/index into tz-compatible datetime array/index
round rounds the data to the nearest specified frequency
floor rounds the data down to the specified frequency
ceil rounds the data up to the specified frequency
to_period converts the data to a PeriodArray/Index at a given frequency
to_perioddelta calculates the TimedeltaArray of the difference between the index values and the index converted to PeriodArray at the specified frequency
to_pydatetime returns the datetime array/index as an ndarray of datetime.datetime objects
to_series creates a series with index and values corresponding to index keys; useful with map for returning
an indexer
to_frame creates a DataFrame with a column containing the index
month_name returns the month names of the DateTimeIndex with the specified locale
day_name returns the day names of the DateTimeIndex with the specified locale
mean returns the mean value of the array
std returns the standard deviation of the sample across the requested axis
Indexing series (obj[...]) works analogously to indexing NumPy arrays, except that you can use index values of the
series instead of just integers. Here are some examples:
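The series s is not created in this excerpt; judging by the output, it could be defined like this:

rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7), index=pd.date_range('2022-02-02', periods=7))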
[3]: s
[3]: 2022-02-02 2.028039
2022-02-03 0.005695
2022-02-04 0.941828
2022-02-05 0.393235
2022-02-06 1.303863
2022-02-07 -2.023356
2022-02-08 0.893236
Freq: D, dtype: float64
[4]: s['2022-02-03']
[4]: 0.005695438572672797
[5]: s[1]
[5]: 0.005695438572672797
[6]: s[2:4]
[6]: 2022-02-04 0.941828
2022-02-05 0.393235
Freq: D, dtype: float64
While you can select data by label in this way, the preferred method for selecting index values is the loc operator:
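The corresponding cell is not shown; with loc the selection could look like this, for example:

s.loc['2022-02-04']
s.loc[['2022-02-03', '2022-02-05']]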
The reason for the preference for loc is the different treatment of integers when indexing with []. In regular []-based
indexing, integers are treated as labels if the index contains integers, so the behaviour varies depending on the data type
of the index. In our example, the expression s.loc[[3, 2, 1]] will fail because the index does not contain integers:
~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in __getitem__(self, key)
929
930 maybe_callable = com.apply_if_callable(key, self.obj)
~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _getitem_axis(self, key, axis)
~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1091
1092 # A collection of keys
-> 1093 keyarr, indexer = self._get_listlike_indexer(key, axis)
1094 return self.obj._reindex_with_indexers(
1095 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
~/spack/var/spack/environments/python-38/.spack-env/view/lib/python3.8/site-packages/
˓→pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1372 if use_interval_msg:
1373 key = list(key)
-> 1374 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].
˓→unique())
While the loc operator exclusively indexes labels, the iloc operator exclusively indexes with integers:
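The corresponding cell is missing from this excerpt; with iloc it could look like this:

s.iloc[1]
s.iloc[[3, 2, 1]]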
You can also slice with labels, but this works differently from normal Python slicing because the endpoint is included:
[13]: s.loc['2022-02-03':'2022-02-04']
[13]: 2022-02-03 0.005695
Setting with these methods changes the corresponding section of the series:
[14]: s.loc['2022-02-03':'2022-02-04'] = 0
s
[14]: 2022-02-02 2.028039
2022-02-03 0.000000
2022-02-04 0.000000
2022-02-05 0.393235
2022-02-06 1.303863
2022-02-07 -2.023356
2022-02-08 0.893236
Freq: D, dtype: float64
Indexing in a DataFrame is used to retrieve one or more columns with either a single value or a sequence:
df = pd.DataFrame(data)
df = pd.DataFrame(data,
columns=['Decimal', 'Octal', 'Key'],
index=df['Code'])
df
[15]: Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B
U+0003 3 004 Ctrl-C
U+0004 4 004 Ctrl-D
U+0005 5 005 Ctrl-E
[16]: df['Key']
[16]: Code
U+0000 NUL
U+0001 Ctrl-A
U+0002 Ctrl-B
U+0003 Ctrl-C
U+0004 Ctrl-D
U+0005 Ctrl-E
Name: Key, dtype: object
[18]: df[:2]
[18]: Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
The row selection syntax df[:2] is provided for convenience. Passing a single element or a list to the [] operator selects columns.
Another use case is indexing with a Boolean DataFrame, which is generated by a scalar comparison, for example:
[20]: df['Decimal'] > 2
[20]: Code
U+0000 False
U+0001 False
U+0002 False
U+0003 True
U+0004 True
U+0005 True
Name: Decimal, dtype: bool
df
[21]: Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B
U+0003 NA NA NA
U+0004 NA NA NA
U+0005 NA NA NA
Like Series, DataFrame has special operators loc and iloc for label-based and integer indexing respectively. Since
DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either
axis labels (loc) or integers (iloc).
Both indexing functions work with slices in addition to individual labels or lists of labels:
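The corresponding cells are not included here; with the DataFrame above, such selections could look like this:

df.loc[:'U+0002', 'Key']
df.iloc[:3, :2]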
So there are many ways to select and rearrange the data contained in a pandas object. In the following, I put together a
brief summary of most of these possibilities for DataFrames:
Type Note
df[LABEL] selects a single column or a sequence of columns from the DataFrame
df.loc[LABEL] selects a single row or a subset of rows from the DataFrame by label
df.loc[:, LABEL] selects a single column or a subset of columns from the DataFrame by label
df.loc[LABEL1, LABEL2] selects both rows and columns by label
df.iloc[INTEGER] selects a single row or a subset of rows from the DataFrame by integer position
df.iloc[INTEGER1, INTEGER2] selects a single column or a subset of columns by integer position
df.at[LABEL1, LABEL2] selects a single value by row and column label
df.iat[INTEGER1, INTEGER2] selects a scalar value by row and column position (integers)
reindex(NEW_INDEX) selects rows or columns by label
get_value, set_value deprecated since version 0.21.0; use .at[] or .iat[] instead
For many data sets, you may want to perform a transformation based on the values in an array, series or column in a
DataFrame. For this, we look at the first Unicode characters:
df
[2]: Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E
Add data
Suppose you want to add a column where the characters are assigned to the C0 or C1 control code:
[3]: control_code = {
'u+0000': 'C0',
'u+0001': 'C0',
'u+0002': 'C0',
'u+0003': 'C0',
'u+0004': 'C0',
'u+0005': 'C0'
}
The map method for a series accepts a function or dict-like object that contains an assignment, but here we have a small
problem because some of the codes in control_code are lower case, but not in our DataFrame. Therefore, we need
to convert each value to lower case using the str.lower method:
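The cell creating lowercased is not shown; it presumably reads:

lowercased = df['Code'].str.lower()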
lowercased
[4]: 0 u+0000
1 u+0001
2 u+0002
3 u+0003
4 u+0004
5 u+0005
Name: Code, dtype: object
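The mapping itself is not shown; judging by the result below, it is:

df['Control code'] = lowercased.map(control_code)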
df
[5]: Code Decimal Octal Key Control code
0 U+0000 0 001 NUL C0
1 U+0001 1 002 Ctrl-A C0
2 U+0002 2 003 Ctrl-B C0
3 U+0003 3 004 Ctrl-C C0
4 U+0004 4 004 Ctrl-D C0
5 U+0005 5 005 Ctrl-E C0
We could also have passed a function that does all the work:
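The corresponding cell is missing from this excerpt; such a function could look like this:

df['Control code'] = df['Code'].map(lambda x: control_code[x.lower()])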
Using map is a convenient way to perform element-wise transformations and other data cleaning operations.
Change data
Note:
Replacing missing values is described in Managing missing data with pandas.
Note:
The replace method differs from str.replace in that the latter replaces substrings element by element.
Delete data
Deleting one or more entries from an axis is easy if you already have an index array or a list without these entries.
To delete duplicates, see Deduplicating data.
As this can require a bit of munging and set logic, the drop method returns a new object with the indicated values removed from an axis:
s
[9]: 0 1.310961
1 1.630214
2 -1.442352
3 -0.348273
4 -1.249406
5 1.195304
6 -0.893336
dtype: float64
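The drop call is not shown; judging by the output, it removes the entry with index 2:

new = s.drop(2)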
new
[10]: 0 1.310961
1 1.630214
3 -0.348273
4 -1.249406
5 1.195304
6 -0.893336
dtype: float64
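The second drop call is also missing; it presumably removes two labels at once:

new = s.drop([2, 3])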
new
[11]: 0 1.310961
1 1.630214
4 -1.249406
5 1.195304
6 -0.893336
dtype: float64
With DataFrames, index values can be deleted on both axes. To illustrate this, we first create an example DataFrame:
df = pd.DataFrame(data)
df
You can also remove values from the columns by passing axis=1 or axis='columns':
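The corresponding cell is not included here; such a call could look like this:

df.drop('Decimal', axis='columns')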
Many functions such as drop, which change the size or shape of a Series or DataFrame, can also manipulate an object in place without returning a new object:
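The cell itself is not shown; judging by the result below, it drops the first row in place:

df.drop(0, inplace=True)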
df
[15]: Code Decimal Octal Key
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E
Warning:
Be careful with the inplace option, as the deleted data is irretrievably lost.
pandas offers the possibility to concisely apply Python’s string methods and regular expressions to whole arrays of data.
See also:
• string
• re
Cleaning up a cluttered dataset for analysis often requires a lot of string manipulation. To make matters worse, a column
containing strings sometimes has missing data:
addresses
[1]: Veit NaN
Veit Schiele [email protected]
cusy GmbH [email protected]
dtype: object
[2]: addresses.isna()
[2]: Veit True
Veit Schiele False
cusy GmbH False
dtype: bool
You can apply string and regular expression methods to any value (by passing a lambda or another function) using data.map, but this fails for NA values. To deal with this, Series has array-oriented methods for string operations that skip over and propagate NA values. These are accessed via the str attribute of Series; for example, we could use str.contains to check whether each email address contains veit:
[3]: addresses.str.contains('veit')
[3]: Veit NaN
Veit Schiele True
cusy GmbH False
dtype: object
Regular expressions can also be used, along with options such as IGNORECASE:
[4]: import re
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
matches = addresses.str.findall(pattern, flags=re.IGNORECASE).str[0]
matches
There are several ways to retrieve a vectorised element. Either use str.get or the index of str:
[5]: matches.str.get(1)
[5]: Veit NaN
Veit Schiele cusy
cusy GmbH cusy
dtype: object
[6]: addresses.str[:5]
[6]: Veit NaN
Veit Schiele veit.
cusy GmbH info@
dtype: object
The pandas.Series.str.extract method returns the captured groups of a regular expression as a DataFrame:
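The corresponding cell is not shown; with the pattern defined above it could look like this:

addresses.str.extract(pattern, flags=re.IGNORECASE)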
Method Description
cat concatenates strings element by element with optional delimiter
contains returns a Boolean array indicating whether each string contains a pattern/regex
count counts occurrences of the pattern
extract uses a regular expression with groups to extract one or more strings from a series of strings; the result is a DataFrame with one column per group
endswith equivalent to x.endswith(pattern) for each element
startswith equivalent to x.startswith(pattern) for each element
findall computes a list of all occurrences of the pattern/regex for each string
get retrieves the i-th element of each string
isalnum equivalent to the built-in str.isalnum
isalpha equivalent to the built-in str.isalpha
isdecimal equivalent to the built-in str.isdecimal
isdigit equivalent to the built-in str.isdigit
islower equivalent to the built-in str.islower
isnumeric equivalent to the built-in str.isnumeric
isupper equivalent to the built-in str.isupper
join joins the strings in each element of the series with the passed separator
len calculates the length of each string
lower, upper converts case; equivalent to x.lower() or x.upper() for each element
match uses re.match with the passed regular expression for each element, returning True or False if it matches
extract captures group elements (if any) by index from each string
pad inserts spaces on the left, right or both sides of strings
center equivalent to pad(side='both')
repeat duplicates values (for example, s.str.repeat(3) is equivalent to x * 3 for each string)
replace replaces pattern/regex with another string
slice slices each string in the series
split splits strings on a delimiter or regular expression
strip strips whitespace, including line breaks, from both sides
rstrip strips whitespace on the right side
lstrip strips whitespace on the left side
2.4.8 Arithmetic
An important function of pandas is the arithmetic behaviour for objects with different indices. When adding objects, if
the index pairs are not equal, the corresponding index in the result will be the union of the index pairs. For users with
database experience, this is comparable to an automatic outer join on the index labels. Let’s look at an example:
rng = np.random.default_rng()
s1 = pd.Series(rng.normal(size=5))
s2 = pd.Series(rng.normal(size=7))
[2]: s1 + s2
[2]: 0 0.452985
1 -2.751479
2 1.170106
3 -1.583008
4 -0.507949
5 NaN
6 NaN
dtype: float64
The internal data matching leads to missing values at the points of the labels that do not overlap. Missing values are
then passed on in further arithmetic calculations.
For DataFrames, alignment is performed for both rows and columns:
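The cells creating the two DataFrames are not part of this excerpt; a minimal sketch of the alignment behaviour, using the rng defined above, could be:

df1 = pd.DataFrame(rng.normal(size=(5, 3)))
df2 = pd.DataFrame(rng.normal(size=(7, 2)))
df1 + df2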
When the two DataFrames are added together, the result is a DataFrame whose index and columns are the unions of
those in each of the DataFrames above:
Since column 2 does not appear in both DataFrame objects, its values appear as missing in the result. The same applies
to the rows whose labels do not appear in both objects.
In arithmetic operations between differently indexed objects, a special value (for example 0) can be useful if an axis label is found in one object but not in the other. The add method can be passed the fill_value argument:
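The cells that build df12 and then fill its missing values are not included in this excerpt; the general pattern, for example with the two DataFrames from the sketch above, is:

df1.add(df2, fill_value=0)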
df12
[5]: 0 1 2
0 -1.740413 0.566677 0.715587
1 1.992954 -2.637117 0.888983
2 -0.111516 -0.340413 0.811355
3 0.418716 -0.847758 -0.233179
4 0.625907 1.675521 0.494883
5 1.780402 -0.349901 NaN
6 0.922128 -0.487242 NaN
[7]: df12
[7]: 0 1 2
0 -1.740413 0.566677 0.715587
1 1.992954 -2.637117 0.888983
2 -0.111516 -0.340413 0.811355
3 0.418716 -0.847758 -0.233179
4 0.625907 1.675521 0.494883
5 1.780402 -0.349901 0.000000
6 0.922128 -0.487242 0.000000
Arithmetic methods
Method Description
add, radd methods for addition (+)
sub, rsub methods for subtraction (-)
div, rdiv methods for division (/)
floordiv, rfloordiv methods for floor division (//)
mul, rmul methods for multiplication (*)
pow, rpow methods for exponentiation (**)
As with NumPy arrays of different dimensions, the arithmetic between DataFrame and Series is also defined.
[8]: s1 + df12
[8]: 0 1 2 3 4
0 -2.533022 -1.635689 2.219348 NaN NaN
1 1.200346 -4.839483 2.392744 NaN NaN
2 -0.904124 -2.542779 2.315117 NaN NaN
3 -0.373892 -3.050124 1.270582 NaN NaN
4 -0.166702 -0.526845 1.998644 NaN NaN
5 0.987793 -2.552267 1.503761 NaN NaN
6 0.129520 -2.689608 1.503761 NaN NaN
If we add s1 to df12, the addition is performed once for each row. This is called broadcasting. By default, the arithmetic between a DataFrame and a series matches the index of the series to the columns of the DataFrame, broadcasting down the rows.
If an index value is found neither in the columns of the DataFrame nor in the index of the series, the objects are
re-indexed to form the union:
If instead you want to transfer the columns and match the rows, you must use one of the arithmetic methods, for example:
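The exact call that produced the output below is not shown; the general pattern is to pass axis='index' to an arithmetic method, for example with the series s2 defined above:

df12.add(s2, axis='index')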
[9]: 0 1 2
0 -0.494820 1.812271 1.961180
1 1.443841 -3.186230 0.339870
2 -0.445171 -0.674068 0.477700
3 -0.380913 -1.647387 -1.032809
4 0.260930 1.310545 0.129906
5 2.116184 -0.014119 0.335782
6 1.471784 0.062413 0.549655
The axis number you pass is the axis to align on. In this case it means matching on the row index of the DataFrame (axis='index' or axis=0) and broadcasting across the columns.
[10]: np.abs(df12)
[10]: 0 1 2
0 1.740413 0.566677 0.715587
1 1.992954 2.637117 0.888983
2 0.111516 0.340413 0.811355
3 0.418716 0.847758 0.233179
4 0.625907 1.675521 0.494883
5 1.780402 0.349901 0.000000
6 0.922128 0.487242 0.000000
Another common operation is to apply a function to one-dimensional arrays on each column or row. The pandas.DataFrame.apply method does just that:
[11]: df12
[11]: 0 1 2
0 -1.740413 0.566677 0.715587
1 1.992954 -2.637117 0.888983
2 -0.111516 -0.340413 0.811355
3 0.418716 -0.847758 -0.233179
4 0.625907 1.675521 0.494883
5 1.780402 -0.349901 0.000000
6 0.922128 -0.487242 0.000000
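The function f is not defined in this excerpt; according to the explanation that follows, it computes the difference between the maximum and the minimum of a column:

f = lambda x: x.max() - x.min()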
df12.apply(f)
[12]: 0 3.733368
1 4.312639
2 1.122163
dtype: float64
Here the function f, which calculates the difference between the maximum and minimum of a row, is called once for
each column of the frame. The result is a row with the columns of the frame as index.
If you pass axis='columns' to apply, the function will be called once per row instead:
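The corresponding cell is not shown; it presumably reads:

df12.apply(f, axis='columns')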
Many of the most common array statistics (such as sum and mean) are DataFrame methods, so the use of apply is not
necessary.
The function passed to apply does not have to return a single value; it can also return a series with multiple values:
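The redefinition of f is not included here; judging by the output below, it returns a series with the minimum and maximum:

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])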
df12.apply(f)
[14]: 0 1 2
min -1.740413 -2.637117 -0.233179
max 1.992954 1.675521 0.888983
You can also use element-wise Python functions. Suppose you want to round each floating point value in df12 to two
decimal places, you can do this with pandas.DataFrame.applymap:
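The function f used in the next cell is not defined in this excerpt; for rounding to two decimal places it could be, for example:

f = lambda x: round(x, 2)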
df12.applymap(f)
[15]: 0 1 2
0 -1.74 0.57 0.72
1 1.99 -2.64 0.89
2 -0.11 -0.34 0.81
3 0.42 -0.85 -0.23
4 0.63 1.68 0.49
5 1.78 -0.35 0.00
6 0.92 -0.49 0.00
The reason for the name applymap is that Series has a map method for applying an element-wise function:
[16]: df12[2].map(f)
[16]: 0 0.72
1 0.89
2 0.81
3 -0.23
4 0.49
5 0.00
6 0.00
Name: 2, dtype: float64
pandas objects are equipped with a number of common mathematical and statistical methods. Most of them fall into
the category of reductions or summary statistics, methods that extract a single value (such as the sum or mean) from
a series or set of values from the rows or columns of a DataFrame. Compared to similar methods found in NumPy
arrays, they also handle missing data.
rng = np.random.default_rng()
df = pd.DataFrame(rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7))
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)
df2
[1]: 0 1 2
2022-02-03 0.462527 -0.545698 0.044638
2022-02-04 1.400089 -0.237095 1.287454
2022-02-05 1.526841 0.926597 0.729344
2022-02-06 0.778086 1.439802 1.044385
2022-02-07 -0.190824 -1.726359 -0.426458
2022-02-08 -0.436732 0.899246 -0.906147
2022-02-09 NaN NaN NaN
[2]: df2.sum()
[2]: 0 3.539988
1 0.756492
2 1.773217
dtype: float64
[3]: df2.sum(axis='columns')
[3]: 2022-02-03 -0.038534
2022-02-04 2.450449
2022-02-05 3.182782
2022-02-06 3.262273
2022-02-07 -2.343640
2022-02-08 -0.443633
2022-02-09 0.000000
Freq: D, dtype: float64
If an entire row or column contains all NA values, the sum is 0. This can be disabled with the skipna option:
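The corresponding cell is not shown; it presumably looks like this:

df2.sum(axis='columns', skipna=False)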
Some aggregations, such as mean, require at least one non-NaN value to obtain a valid result:
[5]: df2.mean(axis='columns')
[5]: 2022-02-03 -0.012845
2022-02-04 0.816816
2022-02-05 1.060927
2022-02-06 1.087424
2022-02-07 -0.781213
2022-02-08 -0.147878
2022-02-09 NaN
Freq: D, dtype: float64
Option Description
axis the axis of values to reduce: 0 for the rows of the DataFrame and 1 for the columns
skipna exclude missing values; by default True.
level reduce grouped by level if the axis is hierarchically indexed (MultiIndex)
Some methods, such as idxmin and idxmax, provide indirect statistics such as the index value at which the minimum
or maximum value is reached:
[6]: df2.idxmax()
[6]: 0 2022-02-05
1 2022-02-06
2 2022-02-04
dtype: datetime64[ns]
[7]: df2.cumsum()
[7]: 0 1 2
2022-02-03 0.462527 -0.545698 0.044638
2022-02-04 1.862616 -0.782793 1.332092
2022-02-05 3.389457 0.143803 2.061436
2022-02-06 4.167544 1.583605 3.105822
2022-02-07 3.976720 -0.142754 2.679364
2022-02-08 3.539988 0.756492 1.773217
2022-02-09 NaN NaN NaN
Another type of method is neither reductions nor accumulations. describe is one such example that produces several
summary statistics in one go:
[8]: df2.describe()
[8]: 0 1 2
count 6.000000 6.000000 6.000000
mean 0.589998 0.126082 0.295536
std 0.806152 1.181456 0.867310
min -0.436732 -1.726359 -0.906147
25% -0.027486 -0.468548 -0.308684
50% 0.620307 0.331075 0.386991
75% 1.244589 0.919759 0.965625
max 1.526841 1.439802 1.287454
For non-numerical data, describe produces alternative summary statistics:
df3.describe()
[9]: Code Octal
count 6 6
unique 6 5
top U+0000 004
freq 1 2
Method Description
count number of non-NA values
describe calculation of a set of summary statistics for series or each DataFrame column
min, max calculation of minimum and maximum values
argmin, argmax calculation of the index positions (integers) at which the minimum or maximum value was reached
idxmin, idxmax calculation of the index labels at which the minimum or maximum values were reached
quantile calculation of the sample quantile in the range from 0 to 1
sum sum of the values
mean arithmetic mean of the values
median arithmetic median (50% quantile) of the values
mad mean absolute deviation from the mean value
prod product of all values
var sample variance of the values
std sample standard deviation of the values
skew sample skewness (third moment) of the values
kurt sample kurtosis (fourth moment) of the values
cumsum cumulative sum of the values
cummin, cummax cumulated minimum and maximum of the values respectively
cumprod cumulated product of the values
diff calculation of the first arithmetic difference (useful for time series)
pct_change calculation of the percentage changes
ydata-profiling
ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy,
but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report(),
which automatically generates a standardised report for understanding the data.
Installation
...
$ pipenv run jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK
Example
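The cell that builds the report object is not reproduced here. A minimal sketch, assuming ydata-profiling is installed and df is an existing pandas DataFrame:

from ydata_profiling import ProfileReport

# create a profiling report for the DataFrame df
profile = ProfileReport(df, title='Profiling Report')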
profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]
By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation
time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following
examples, we first read a larger data set into pandas:
1. minimal mode
ydata-profiling contains a minimal configuration file config_minimal.yaml, in which the most expensive calculations
are turned off by default. This is the recommended starting point for larger data sets.
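A sketch of how such a report could be created, assuming the minimal keyword documented by ydata-profiling, which loads this configuration:

profile = ProfileReport(df, title='Profiling Report', minimal=True)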
profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]
2. Sample
An alternative option for very large data sets is to use only a part of them for the profiling report:
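A sketch, assuming pandas' DataFrame.sample is used to draw the subset (the sample size here is an arbitrary placeholder):

sample = df.sample(n=10_000)  # hypothetical sample size
profile = ProfileReport(sample, title='Profiling Report (sample)')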
profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]
3. Disable expensive computations
To reduce the computational effort for large datasets while still obtaining some interesting information, certain calculations can
be restricted to selected columns only:
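A sketch of restricting the interaction calculations to selected columns; the exact keyword arguments may differ between ydata-profiling versions, and the column name is a placeholder:

profile = ProfileReport(df, interactions={'targets': ['column_1']})  # hypothetical column name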
profile.to_widgets()
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render widgets: 0%| | 0/1 [00:00<?, ?it/s]
The setting interactions.targets can be changed via configuration files as well as via environment variables; see
Changing settings for details.
4. Concurrency
Currently, work is being done on a scalable Spark backend for ydata-profiling, see Spark Profiling Development.
Sorting a dataset by some criterion is another important built-in operation. Sorting lexicographically by row or column index
is already described in the section Reordering and sorting from levels. In the following, we look at sorting the values
with DataFrame.sort_values and Series.sort_values:
[1]: import numpy as np
import pandas as pd
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
s.sort_index(ascending=False)
[1]: 6 0.807524
5 -0.045012
4 -0.276246
3 1.032609
2 1.067726
1 0.488613
0 1.324534
dtype: float64
All missing values are sorted to the end of the Series by default:
[2]: s = pd.Series(rng.normal(size=7))
s[s < 0] = np.nan
s.sort_values()
[2]: 5 0.186232
1 0.826051
6 1.649605
0 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
With a DataFrame you can sort on both axes. With by, you specify which column or row to sort by:
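The cell that creates df is not shown; a plausible sketch, assuming a small random DataFrame with three columns (the exact shape is an assumption):

df = pd.DataFrame(rng.normal(size=(7, 3)))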
[3]: df.sort_values(by=2, ascending=False)
[3]: 0 1 2
1 -0.013109 0.060716 1.837680
0 0.095855 -0.804874 1.201810
2 0.278646 -0.608821 0.498333
5 -0.680013 0.314085 0.382935
Ranking
DataFrame.rank and Series.rank assign ranks from one to the number of valid data points in an array:
[5]: df.rank()
[5]: 0 1 2
0 4.0 1.0 6.0
1 3.0 4.0 7.0
2 5.0 2.0 5.0
3 2.0 6.0 3.0
4 6.0 7.0 1.0
5 1.0 5.0 4.0
6 7.0 3.0 2.0
If ties occur in the ranking, the average rank is assigned to each entry of the tie group by default.
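The construction of df2 is not shown; judging by the repeated index labels 5 and 6 in the output, it could have been built by appending the last two rows of df again, for example:

df2 = pd.concat([df, df[5:]])  # a sketch; creates deliberate ties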
[6]: df2.rank()
[6]: 0 1 2
0 5.0 1.0 8.0
1 4.0 5.0 9.0
2 6.0 2.0 7.0
3 3.0 8.0 4.0
4 7.0 9.0 1.0
5 1.5 6.5 5.5
6 8.5 3.5 2.5
5 1.5 6.5 5.5
6 8.5 3.5 2.5
The method='min' option, on the other hand, assigns the smallest rank of the group to each entry:
[7]: df2.rank(method='min')
[7]: 0 1 2
0 5.0 1.0 8.0
1 4.0 5.0 9.0
2 6.0 2.0 7.0
3 3.0 8.0 4.0
4 7.0 9.0 1.0
5 1.0 6.0 5.0
6 8.0 3.0 2.0
5 1.0 6.0 5.0
6 8.0 3.0 2.0
Method Description
average default: assign the average rank to each entry in the same group
min uses the minimum rank for the whole group
max uses the maximum rank for the whole group
first assigns the ranks in the order in which the values appear in the data
dense like method='min', but the ranks always increase by 1 between groups rather than by the number of equal items in a group
Continuous data is often discretised into bins or otherwise grouped for analysis.
Suppose you have data on a group of people in a study that you want to divide into discrete age groups. For this, we
generate a DataFrame with 250 entries between 0 and 99:
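The cell that generates this DataFrame is not shown; a minimal sketch, assuming a NumPy random generator rng as above:

df = pd.DataFrame({'Age': rng.integers(0, 100, 250)})  # integers from 0 to 99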
df
[1]: Age
0 29
1 91
2 45
3 16
4 22
.. ...
245 87
246 59
247 93
248 13
249 54
pandas then offers us a simple way to divide the ages into ten bins with pandas.cut. To get only whole
years, we additionally set precision=0:
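The call itself is not shown; based on the output below, it is presumably:

cats = pd.cut(df.Age, 10, precision=0)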
cats
[2]: [(20.0, 29.0], (88.0, 98.0], (39.0, 49.0], (10.0, 20.0], (20.0, 29.0], ..., (78.0, 88.0],
˓→ (59.0, 69.0], (88.0, 98.0], (10.0, 20.0], (49.0, 59.0]]
Length: 250
Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] <␣
˓→(29.0, 39.0] ... (59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]
[3]: cats.categories
[3]: IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0], (39.0, 49.0], (49.
˓→0, 59.0], (59.0, 69.0], (69.0, 78.0], (78.0, 88.0], (88.0, 98.0]], dtype=
˓→'interval[float64, right]')
[4]: cats.categories[0]
[4]: Interval(-0.1, 10.0, closed='right')
With pandas.Categorical.codes you can display an array that shows, for each value, the code of the category it falls into:
[5]: cats.codes
[5]: array([2, 9, 4, 1, 2, 9, 7, 2, 7, 6, 0, 6, 0, 3, 3, 9, 2, 4, 0, 2, 0, 6,
4, 5, 5, 0, 7, 9, 5, 7, 1, 1, 4, 6, 0, 7, 9, 9, 3, 4, 9, 2, 5, 2,
4, 9, 8, 9, 2, 0, 9, 5, 4, 9, 3, 4, 3, 3, 8, 7, 2, 5, 3, 2, 2, 9,
3, 7, 9, 2, 8, 8, 8, 4, 6, 1, 4, 3, 4, 3, 4, 6, 8, 1, 7, 2, 0, 4,
1, 8, 3, 5, 2, 5, 9, 9, 1, 4, 7, 6, 5, 2, 9, 9, 5, 3, 1, 1, 7, 3,
3, 7, 8, 4, 6, 3, 0, 2, 6, 8, 4, 6, 6, 5, 0, 4, 3, 5, 4, 0, 1, 6,
4, 1, 4, 5, 3, 4, 9, 6, 6, 9, 9, 8, 4, 7, 1, 1, 4, 6, 9, 7, 5, 3,
3, 4, 2, 6, 5, 4, 3, 7, 7, 8, 5, 5, 4, 9, 1, 3, 5, 3, 3, 9, 0, 8,
7, 8, 9, 1, 5, 7, 8, 6, 0, 1, 4, 2, 2, 8, 7, 4, 2, 0, 6, 2, 7, 4,
1, 4, 9, 1, 7, 6, 0, 7, 7, 0, 9, 6, 8, 3, 4, 4, 2, 3, 5, 2, 6, 3,
6, 2, 2, 9, 8, 9, 7, 1, 9, 9, 0, 5, 0, 2, 0, 5, 2, 5, 5, 3, 9, 7,
3, 0, 0, 8, 6, 9, 1, 5], dtype=int8)
With value_counts we can now look at how the values are distributed across the individual bins:
[6]: pd.value_counts(cats)
[6]: (39.0, 49.0] 32
(88.0, 98.0] 32
(29.0, 39.0] 28
(20.0, 29.0] 27
(49.0, 59.0] 25
(69.0, 78.0] 24
(59.0, 69.0] 23
(-0.1, 10.0] 21
(10.0, 20.0] 20
(78.0, 88.0] 18
dtype: int64
It is striking that the bins do not all span an equal number of years: the two ranges (20.0, 29.0] and (69.0, 78.0]
cover only 9 years each. This is because the ages only extend from 0 to 98:
[7]: df.min()
[7]: Age 0
dtype: int64
[8]: df.max()
[8]: Age 98
dtype: int64
With pandas.qcut, on the other hand, the data is divided into bins that each contain approximately the same number of values:
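The corresponding cell is not shown; a sketch that matches the output below:

cats = pd.qcut(df.Age, 10, precision=0)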
[10]: pd.value_counts(cats)
[10]: (12.0, 23.0] 29
(32.0, 41.0] 28
(78.0, 91.0] 28
(56.0, 68.0] 26
(-1.0, 12.0] 25
(48.0, 56.0] 24
(68.0, 78.0] 24
(41.0, 48.0] 23
(91.0, 98.0] 22
(23.0, 32.0] 21
dtype: int64
If we want to ensure that each age group actually spans exactly ten years, we can specify the labels directly with
pandas.Categorical:
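The cell that defines the labels is not shown; a sketch that produces the categories listed below:

cats = pd.Categorical([f'{i} - {i + 9}' for i in range(0, 100, 10)])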
cats.categories
[11]: Index(['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59',
'60 - 69', '70 - 79', '80 - 89', '90 - 99'],
dtype='object')
For grouping we can now use pandas.cut. However, the number of labels must be one less than the number of edges:
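A sketch of such a call, assuming left-closed ten-year bins from 0 to 100 and the labels defined above:

df['Age group'] = pd.cut(
    df.Age, bins=list(range(0, 101, 10)), right=False, labels=cats.categories
)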
df
[12]: Age Age group
0 29 20 - 29
1 91 90 - 99
Merge or join operations combine data sets by linking rows with one or more keys. These operations are especially
important in relational, SQL-based databases. The merge function in pandas is the main entry point for applying these
algorithms to your data.
[2]: encoding, update
[2]: ( Unicode Decimal Octal Key
0 U+0000 0 000 NUL
By default, merge performs a so-called inner join; the keys in the result are the intersection or common set in both
tables.
Note:
I did not specify which column to merge over. If this information is not specified, merge will use the overlapping
column names as keys. However, it is good practice to specify this explicitly:
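A sketch of an explicit merge, assuming Unicode is the shared key column of the two tables shown above:

pd.merge(encoding, update, on='Unicode')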
If the column names are different in each object, you can specify them separately. In the following example update2
gets the key U+ and not Unicode:
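A sketch of such a call, assuming the frames encoding and update2 mentioned in the text:

pd.merge(encoding, update2, left_on='Unicode', right_on='U+')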
However, merge is not limited to inner joins, in which the keys in the result are the intersection, i.e. the common set,
of both tables. Other possible options are:
Option Behaviour
how='inner' uses only the key combinations observed in both tables
how='left' uses all key combinations found in the left table
how='right' uses all key combinations found in the right table
how='outer' uses all key combinations observed in both tables together
The how option only determines which distinct key values appear in the result.
To join multiple keys, you can pass a list of column names:
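A sketch with purely hypothetical frame and column names:

pd.merge(left, right, on=['key1', 'key2'], how='outer')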
The term group by describes a process that involves one or more of the following steps:
• Split divides the data into groups according to certain criteria
• Apply applies a function independently to each group
• Combine combines the results in a data structure
In the first phase of the process, the data contained in a pandas object, be it a Series, a DataFrame or something else,
is split into groups based on one or more keys. The division is done on a particular axis of an object. For example, a
DataFrame can be grouped by its rows (axis=0) or its columns (axis=1). Then, a function is applied to each group
to create a new value. Finally, the results of all these function applications are combined in a result object. The shape
of the result object usually depends on what is done with the data.
Each grouping key can take many forms, and the keys do not all have to be of the same type:
• a list or array of values that have the same length as the axis being grouped
• a value that specifies a column name in a DataFrame
• a dict or series that is a correspondence between the values on the axis being grouped and the group names
• a function that is called on the axis index or the individual labels in the index
Note:
The latter three methods are shortcuts to create an array of values that will be used to divide the object.
Don’t worry if this all seems abstract. Throughout this chapter I will give many examples of all these methods. For
starters, here is a small tabular dataset as a DataFrame:
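The cell that creates this DataFrame is not reproduced here; a sketch reconstructed from the output below:

import pandas as pd

df = pd.DataFrame({
    'Title': ['Jupyter Tutorial', 'Jupyter Tutorial', 'PyViz Tutorial',
              None, 'Python Basics', 'Python Basics'],
    'Language': ['de', 'en', 'de', None, 'de', 'en'],
    '2021-12': [19651, 4722, 2573, None, 525, 157],
    '2022-01': [30134, 3497, 4873, None, 427, 85],
    '2022-02': [33295, 4009, 3930, None, 276, 226],
})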
df
[2]: Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
2 PyViz Tutorial de 2573.0 4873.0 3930.0
3 None None NaN NaN NaN
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0
Suppose you want to calculate the sum of the column 2022-02 using the labels from Title. There are several ways to do this.
One is to access the column 2022-02 and call groupby with the column (a Series) Title:
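The cell that creates grouped is not shown; based on the output below, it is presumably:

grouped = df['2022-02'].groupby(df['Title'])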
grouped
[3]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fcfe83a6970>
This grouped variable is now a special SeriesGroupBy object. It has not yet calculated anything except some intermediate
data about the group key df['Title']. The idea is that this object has all the information needed to apply
an operation to each of the groups. For example, to calculate the group sums, we can call the sum method of the
GroupBy object:
[4]: grouped.sum()
[4]: Title
Jupyter Tutorial 37304.0
PyViz Tutorial 3930.0
Python Basics 502.0
Name: 2022-02, dtype: float64
Later I will explain more about what happens when you call .sum(). The important thing to note here is that the data
(a Series) has been aggregated by splitting it across the group key, creating a new Series that is now indexed by the
unique values in the Title column. The resulting index is Title because groupby(df['Title']) did this.
If we had passed multiple arrays as a list instead, we would get something different:
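The cell that creates sums is not shown; a sketch that matches the output below:

sums = df['2021-12'].groupby([df['Language'], df['Title']]).sum()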
sums
[5]: Language Title
de Jupyter Tutorial 19651.0
PyViz Tutorial 2573.0
Python Basics 525.0
en Jupyter Tutorial 4722.0
Python Basics 157.0
Name: 2021-12, dtype: float64
Here we have grouped the data based on two keys, and the resulting series now has a hierarchical index consisting of
the observed unique key pairs:
[6]: sums.unstack()
[6]: Title Jupyter Tutorial PyViz Tutorial Python Basics
Language
de 19651.0 2573.0 525.0
en 4722.0 NaN 157.0
Often the grouping information is in the same DataFrame as the data you want to edit. In this case, you can pass column
names (whether they are strings, numbers or other Python objects) as group keys:
[7]: df.groupby('Title').sum()
[7]: 2021-12 2022-01 2022-02
Title
Jupyter Tutorial 24373.0 33631.0 37304.0
PyViz Tutorial 2573.0 4873.0 3930.0
Python Basics 682.0 512.0 502.0
Here it is noticeable that the result does not contain a Language column. Since df['Language'] is not numeric
data, it is a so-called nuisance column and is therefore automatically excluded from the result. By default, all numeric
columns are aggregated.
[8]: df.groupby(['Title','Language']).sum()
[8]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 19651.0 30134.0 33295.0
en 4722.0 3497.0 4009.0
PyViz Tutorial de 2573.0 4873.0 3930.0
Python Basics de 525.0 427.0 276.0
en 157.0 85.0 226.0
Regardless of the goal of using groupby, a generally useful groupby method is size, which returns a series with the
group sizes:
[9]: df.groupby(['Language']).size()
[9]: Language
de 3
en 2
dtype: int64
Note:
All missing values in a group key are excluded from the result by default. This behaviour can be disabled
by passing dropna=False to groupby:
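For example, the group sizes including the missing key would be obtained with:

df.groupby(['Language'], dropna=False).size()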
The object returned by groupby supports iteration and produces a sequence of 2-tuples containing the group name
along with the data packet. Consider the following:
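A minimal sketch of such an iteration over the Title groups:

for name, group in df.groupby('Title'):
    print(name)
    print(group)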
With multiple keys, the first element of the tuple is a tuple of key values:
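A corresponding sketch with two keys:

for (title, language), group in df.groupby(['Title', 'Language']):
    print((title, language))
    print(group)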
books
[14]: {'Jupyter Tutorial': Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0,
'PyViz Tutorial': Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0,
'Python Basics': Title Language 2021-12 2022-01 2022-02
By default, groupby groups on axis=0, but you can also group on any of the other axes. For example, we could group
the columns of our example df here by dtype as follows:
[15]: df.dtypes
[15]: Title object
Language object
2021-12 float64
2022-01 float64
2022-02 float64
dtype: object
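A sketch of grouping the columns by dtype; note that grouping on axis=1 is deprecated in recent pandas versions, where grouping the transposed DataFrame is an alternative:

grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    print(group)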
Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect
of subdividing columns for aggregation. This means that:
[18]: df.groupby('Title')['2021-12']
df.groupby('Title')[['2022-01']]
[18]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fcfc08cde50>
are syntactic sugar for:
[19]: df['2021-12'].groupby(df['Title'])
df[['2022-01']].groupby(df['Title'])
Especially for large datasets, it may be desirable to aggregate only some columns. For example, to calculate the sum
for only the column 2022-01 in the previous dataset and get the result as a DataFrame, we could write:
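A sketch of such a call, assuming the grouping by Title and Language used below:

df.groupby(['Title', 'Language'])[['2022-01']].sum()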
The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped series
if only a single column name is passed as a scalar:
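The cell that creates series_grouped is not shown; based on the output below, it is presumably:

series_grouped = df.groupby(['Title', 'Language'])['2022-01']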
series_grouped
[21]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fcfc08e8be0>
[22]: series_grouped.sum()
[22]: Title Language
Jupyter Tutorial de 30134.0
en 3497.0
PyViz Tutorial de 4873.0
Python Basics de 427.0
en 85.0
Name: 2022-01, dtype: float64
Suppose I have a group correspondence for the columns and want to group the columns together by group:
Now an array could be constructed from this dict to pass to groupby, but instead we can just pass the dict:
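The dict and the grouping call are not shown; a sketch reconstructed from the outputs below (again, grouping on axis=1 is deprecated in recent pandas versions):

mapping = {'2021-12': 'Dec 2021', '2022-01': 'Jan 2022', '2022-02': 'Feb 2022'}
by_column = df.groupby(mapping, axis=1)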
by_column.sum()
[25]: Dec 2021 Feb 2022 Jan 2022
0 19651.0 33295.0 30134.0
The same works with a Series, which can be viewed as a fixed-size mapping:
map_series
[26]: 2021-12 Dec 2021
2022-01 Jan 2022
2022-02 Feb 2022
dtype: object
Using Python functions is a more general way of defining a group assignment compared to a dict or Series. Each
function passed as a group key is called once per index value, and the return values are used as group names. Specifically,
consider the example DataFrame from the previous section, which contains the titles as index values. If you
want to group by the length of the names, you could compute an array with the lengths of the strings, but it is easier to
simply pass the len function:
[28]: df = pd.DataFrame([[19651,30134,33295],
[4722,3497,4009],
[2573,4873,3930],
[525,427,276],
[157,85,226]],
index=['Jupyter Tutorial',
'Jupyter Tutorial',
'PyViz Tutorial',
'Python Basics',
'Python Basics'],
columns=['2021-12', '2022-01', '2022-02'])
[29]: df.groupby(len).count()
[29]: 2021-12 2022-01 2022-02
13 2 2 2
14 1 1 1
16 2 2 2
Mixing functions with arrays, dicts or series is no problem, as everything is converted internally into arrays:
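A sketch with a hypothetical list of group labels alongside the len function:

languages = ['de', 'en', 'de', 'de', 'en']  # hypothetical labels, one per row
df.groupby([len, languages]).sum()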
A final practical feature for hierarchically indexed datasets is the ability to aggregate by one of the index levels of an
axis. Let’s look at an example:
[32]: # version_hits reconstructed from the output below
version_hits = [[19651.0, 0.0, 30134.0, 0.0, 33295.0, 0.0],
                [4722.0, 1825.0, 3497.0, 2576.0, 4009.0, 3707.0],
                [2573.0, 0.0, 4873.0, 0.0, 3930.0, 0.0],
                [None, None, None, None, None, None],
                [525.0, 0.0, 427.0, 0.0, 276.0, 0.0],
                [157.0, 0.0, 85.0, 0.0, 226.0, 0.0]]
df = pd.DataFrame(version_hits,
                  index=[['Jupyter Tutorial',
                          'Jupyter Tutorial',
                          'PyViz Tutorial',
                          None,
                          'Python Basics',
                          'Python Basics'],
                         ['de', 'en', 'de', None, 'de', 'en']],
                  columns=[['2021-12', '2021-12',
                            '2022-01', '2022-01',
                            '2022-02', '2022-02'],
                           ['latest', 'stable',
                            'latest', 'stable',
                            'latest', 'stable']])
df.columns.names = ['Month', 'Version']  # column level names as shown in the output
df
[32]: Month 2021-12 2022-01 2022-02
Version latest stable latest stable latest stable
Jupyter Tutorial de 19651.0 0.0 30134.0 0.0 33295.0 0.0
en 4722.0 1825.0 3497.0 2576.0 4009.0 3707.0
PyViz Tutorial de 2573.0 0.0 4873.0 0.0 3930.0 0.0
NaN NaN NaN NaN NaN NaN NaN NaN
Python Basics de 525.0 0.0 427.0 0.0 276.0 0.0
en 157.0 0.0 85.0 0.0 226.0 0.0
2.4.14 Aggregation
Aggregations refer to any data transformation that produces scalar values from arrays. In the previous examples, several
of them were used, including count and sum. You may now be wondering what happens when you apply sum() to a
GroupBy object. Optimised implementations exist for many common aggregations, such as those in the following
table; however, aggregations are not limited to this set of methods.
Function name Description
count number of non-NA values in the group
sum sum of the non-NA values
mean arithmetic mean of the non-NA values
median median of the non-NA values
std, var sample standard deviation and variance
min, max minimum and maximum of the non-NA values
prod product of the non-NA values
first, last first and last non-NA values
You can use your own aggregations and also call any method that is also defined on the grouped object. For example,
the Series method nsmallest selects the requested number of smallest values from the data.
Although nsmallest is not explicitly implemented for GroupBy, we can still use it with a non-optimised implementation.
Internally, GroupBy decomposes the Series, calls nsmallest(n) on each piece and then assembles these
results in the result object:
df
[2]: Title 2021-12 2022-01 2022-02
0 Jupyter Tutorial 30134.0 33295.0 19651.0
1 Jupyter Tutorial 6073.0 7716.0 6547.0
2 PyViz Tutorial 4873.0 3930.0 2573.0
3 None NaN NaN NaN
4 Python Basics 427.0 276.0 525.0
5 Python Basics 95.0 226.0 157.0
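The cell that creates grouped is not shown; based on the output below, it is presumably:

grouped = df.groupby('Title')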
[4]: grouped['2022-01'].nsmallest(1)
[4]: Title
Jupyter Tutorial 1 7716.0
PyViz Tutorial 2 3930.0
Python Basics 5 226.0
Name: 2022-01, dtype: float64
To use a custom aggregation function, pass any function that aggregates an array to the aggregate or agg method:
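The definition of range is not shown; judging by the output below (maximum minus minimum per group), it shadows the built-in name with a custom aggregation such as:

def range(arr):
    return arr.max() - arr.min()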
grouped.agg(range)
[5]: 2021-12 2022-01 2022-02
Title
Jupyter Tutorial 24061.0 25579.0 13104.0
PyViz Tutorial 0.0 0.0 0.0
Python Basics 332.0 50.0 368.0
You will find that some methods like describe also work, even though they are not strictly speaking aggregations:
[6]: grouped.describe()
[6]: 2021-12 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5
PyViz Tutorial 1.0 4873.0 NaN 4873.0 4873.00 4873.0
Python Basics 2.0 261.0 234.759451 95.0 178.00 261.0
2022-01 ... \
75% max count mean ... 75% max
Title ...
Jupyter Tutorial 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0
2022-02 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0
PyViz Tutorial 1.0 2573.0 NaN 2573.0 2573.0 2573.0
Python Basics 2.0 341.0 260.215295 157.0 249.0 341.0
75% max
Title
Jupyter Tutorial 16375.0 19651.0
PyViz Tutorial 2573.0 2573.0
Python Basics 433.0 525.0
[3 rows x 24 columns]
Note:
Custom aggregation functions are generally much slower than the optimised functions in the table above. This is because
there is some extra work involved in creating the intermediate data sets for the group (function calls, reordering of data).
As we have already seen, aggregating a Series or all columns of a DataFrame is a matter of using aggregate (or
agg) with the desired function, or of calling a method such as mean or std. However, you may want to aggregate
with a different function depending on the column, or with several functions at once.
[7]: grouped.agg('mean')
[7]: 2021-12 2022-01 2022-02
Title
Jupyter Tutorial 18103.5 20505.5 13099.0
PyViz Tutorial 4873.0 3930.0 2573.0
Python Basics 261.0 251.0 341.0
If you pass a list of functions or function names instead, you will get back a DataFrame with column names from the
functions:
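The call is not shown; a sketch that matches the column names in the output below:

grouped.agg([range, 'mean', 'std'])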
2022-02
range mean std range
Title
Here we have passed agg a list of aggregation functions to be evaluated independently for the data groups.
You don’t need to accept the names that GroupBy gives to the columns; in particular, lambda functions have the name
<lambda>, which makes them difficult to identify. When you pass a list of tuples, the first element of each tuple is used
as the column name in the DataFrame:
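A sketch that matches the column names in the output below:

grouped.agg([('Mean', 'mean'), ('Standard deviation', 'std'), ('Range', range)])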
2022-02 \
Standard deviation Range Mean Standard deviation
Title
Jupyter Tutorial 18087.084356 25579.0 13099.0 9265.927261
PyViz Tutorial NaN 0.0 2573.0 NaN
Python Basics 35.355339 50.0 341.0 260.215295
Range
Title
Jupyter Tutorial 13104.0
PyViz Tutorial 0.0
Python Basics 368.0
With a DataFrame, you have the option of specifying a list of functions to be applied to all columns or to different
functions per column. Let’s say we want to calculate the same three statistics for the columns:
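The definition of stats is not shown; based on the output below it is presumably:

stats = ['count', 'mean', 'max']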
evaluations = grouped.agg(stats)
evaluations
[10]: 2021-12 2022-01 2022-02 \
count mean max count mean max count
Title
Jupyter Tutorial 2 18103.5 30134.0 2 20505.5 33295.0 2
PyViz Tutorial 1 4873.0 4873.0 1 3930.0 3930.0 1
Python Basics 2 261.0 427.0 2 251.0 276.0 2
mean max
Title
Jupyter Tutorial 13099.0 19651.0
PyViz Tutorial 2573.0 2573.0
Python Basics 341.0 525.0
As you can see, the resulting DataFrame has hierarchical columns, just as you would get if you aggregated each column
separately and used pandas.concat to join the results together, using the column names as key arguments:
[11]: evaluations['2021-12']
[11]: count mean max
Title
Jupyter Tutorial 2 18103.5 30134.0
PyViz Tutorial 1 4873.0 4873.0
Python Basics 2 261.0 427.0
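The list tuples used in the next cell is not shown; a sketch that matches the output below:

tuples = [('Mean', 'mean'), ('Variance', 'var')]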
grouped[['2021-12', '2022-01']].agg(tuples)
[12]: 2021-12 2022-01
Mean Variance Mean Variance
Title
Jupyter Tutorial 18103.5 289465860.5 20505.5 327142620.5
PyViz Tutorial 4873.0 NaN 3930.0 NaN
Python Basics 261.0 55112.0 251.0 1250.0
If we now assume that potentially different functions are to be applied to one or more of the columns, we pass a dict
to agg that contains an assignment of column names to one of the function specifications:
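A sketch with a hypothetical assignment of columns to functions:

grouped.agg({'2021-12': 'mean', '2022-01': ['min', 'max'], '2022-02': range})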
In all the examples so far, the aggregated data is returned with an index. Since this is not always desired, you can
disable this behaviour in most cases by passing as_index=False to groupby:
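For example:

df.groupby('Title', as_index=False).sum()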
Using as_index=False avoids some unnecessary calculations. Of course, it is always possible
to get the result back with an index by calling reset_index on the result.
2.4.15 Apply
The most general GroupBy method is apply. It splits the object being processed into pieces, calls the passed function on
each piece and then tries to concatenate the pieces.
Suppose we want to select the five largest hit values by group. To do this, we first write a function that selects the
rows with the largest values in a particular column:
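The definition of top is not shown at this point; judging from the Cython version further below, it is:

def top(df, n=5, column='2021-12'):
    return df.sort_values(by=column, ascending=False)[:n]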
df
[2]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0
top(df, n=3)
If we now group by titles, for example, and call apply with this function, we get the following:
grouped_titles.apply(top)
[4]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0
What happened here? The top function is called on each row group of the DataFrame, and then the results are
concatenated with pandas.concat, labelling the pieces with the group names. The result therefore has a hierarchical
index whose inner level contains index values from the original DataFrame.
If you pass a function to apply that takes other arguments or keywords, you can pass them after the function:
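For example, to select only the largest value of the column 2022-01 per group:

grouped_titles.apply(top, n=1, column='2022-01')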
We have now seen the basic usage of apply. What happens inside the passed function is very versatile and up to you;
it only has to return a pandas object or a single value. In the following, we will therefore mainly show examples that
can give you ideas on how to solve various problems with groupby.
First, let’s look again at describe, called over the GroupBy object:
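The cell that creates result is not shown; based on the output below, it is presumably:

result = grouped_titles.describe()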
result
[6]: 2021-12 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5
PyViz Tutorial 1.0 4873.0 NaN 4873.0 4873.00 4873.0
Python Basics 2.0 261.0 234.759451 95.0 178.00 261.0
2022-01 ... \
75% max count mean ... 75% max
Title ...
Jupyter Tutorial 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0
2022-02 \
count mean std min 25% 50%
Title
Jupyter Tutorial 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0
PyViz Tutorial 1.0 2573.0 NaN 2573.0 2573.0 2573.0
Python Basics 2.0 341.0 260.215295 157.0 249.0 341.0
75% max
Title
Jupyter Tutorial 16375.0 19651.0
PyViz Tutorial 2573.0 2573.0
Python Basics 433.0 525.0
[3 rows x 24 columns]
When you call a method like describe within GroupBy, it is actually just an abbreviation for:
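A sketch of the equivalent apply call:

grouped_titles.apply(lambda x: x.describe())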
In the previous examples, you saw that the resulting object has a hierarchical index formed by the group keys together
with the indices of the individual parts of the original object. You can disable this by passing group_keys=False to
groupby:
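The cell that creates grouped_lang is not shown; based on the text and the output below, it is presumably:

grouped_lang = df.groupby('Language', group_keys=False)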
grouped_lang.apply(top)
[8]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
Jupyter Tutorial en 6073.0 7716.0 6547.0
Python Basics en 95.0 226.0 157.0
PyViz Tutorial en NaN NaN NaN
As described in discretisation and grouping, pandas has some tools, especially cut and qcut, to split data into buckets
with bins of your choice or by sample quantiles. Combine these functions with groupby and you can conveniently
perform bucket or quantile analysis on a dataset. Consider a simple random data set and a bucket categorisation of
equal length with cut:
[9]: rng = np.random.default_rng()
df2 = pd.DataFrame({'data1': rng.normal(size=1000),
'data2': rng.normal(size=1000)})
quartiles = pd.cut(df2.data1, 4)
quartiles[:10]
[9]: 0 (-1.469, 0.17]
1 (0.17, 1.809]
2 (-1.469, 0.17]
3 (0.17, 1.809]
4 (-3.115, -1.469]
5 (-1.469, 0.17]
6 (-3.115, -1.469]
7 (0.17, 1.809]
8 (-3.115, -1.469]
9 (-1.469, 0.17]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.115, -1.469] < (-1.469, 0.17] < (0.17, 1.
˓→809] < (1.809, 3.448]]
The category object returned by cut can be passed directly to groupby. So we could calculate a set of group statistics
for the quartiles as follows:
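The function stats is not shown; a sketch that matches the statistics in the output below:

def stats(group):
    return pd.DataFrame({'min': group.min(), 'max': group.max(),
                         'count': group.count(), 'mean': group.mean()})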
grouped_quart = df2.groupby(quartiles)
grouped_quart.apply(stats)
[10]: min max count mean
data1
(-3.115, -1.469] data1 -3.108349 -1.471274 70 -1.896546
data2 -2.232356 2.035516 70 -0.021125
(-1.469, 0.17] data1 -1.467552 0.168617 500 -0.497780
data2 -2.631409 3.109543 500 0.012801
(0.17, 1.809] data1 0.171525 1.807236 400 0.767486
data2 -3.063169 2.523671 400 -0.084810
(1.809, 3.448] data1 1.814571 3.448008 30 2.270333
data2 -1.967468 2.997739 30 -0.363947
These were buckets of equal length; to calculate buckets of equal size based on sample quantiles, we can use qcut. I
pass labels=False to get only quantile numbers:
[11]: quartiles_samp = pd.qcut(df2.data1, 4, labels=False)
grouped_quart_samp = df2.groupby(quartiles_samp)
grouped_quart_samp.apply(stats)
[11]: min max count mean
data1
0 data1 -3.108349 -0.615744 250 -1.232959
data2 -2.631409 2.734051 250 0.000789
1 data1 -0.615207 0.028267 250 -0.320657
data2 -2.351885 3.109543 250 -0.026376
2 data1 0.034191 0.660205 250 0.309113
data2 -3.063169 2.523671 250 -0.090296
3 data1 0.661384 3.448008 250 1.218328
data2 -2.602676 2.997739 250 -0.043800
When cleaning up missing data, you will sometimes remove observations with dropna, but at other times you may
want to fill the null values (NA) with a fixed value or a value derived from the data. fillna is the right tool for this;
here, for example, I fill the null values with the mean:
[12]: s = pd.Series(rng.normal(size=8))
s[::3] = np.nan
s
[12]: 0 NaN
1 2.219988
2 -1.006390
3 NaN
4 0.244595
5 0.253043
6 NaN
7 -0.691286
dtype: float64
[13]: s.fillna(s.mean())
[13]: 0 0.203990
1 2.219988
2 -1.006390
3 0.203990
4 0.244595
5 0.253043
6 0.203990
7 -0.691286
dtype: float64
Here are some sample data for my tutorials, divided into German and English editions:
Suppose you want the fill value to vary by group. These values can be predefined, and since the groups have an internal
name attribute, you can use this with apply:
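The fill values and the function fill_func are not shown; a sketch that matches the output below (the value for de is an arbitrary placeholder, since the de rows contain no NA here):

fill_values = {'de': 0, 'en': 3469}  # 'de' value is a placeholder

def fill_func(group):
    return group.fillna(fill_values[group.name])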
df.groupby('Language').apply(fill_func)
[14]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en 3469.0 3469.0 3469.0
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0
You can also group the data and use apply with a function that calls fillna for each data packet:
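The function fill_mean is not shown; based on the output below, it is presumably:

def fill_mean(group):
    return group.fillna(group.mean())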
df.groupby('Language').apply(fill_mean)
[15]: 2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en 3084.0 3971.0 3352.0
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0
Since operations between columns in a DataFrame or two Series are possible, we can calculate the group-weighted
average, for example:
df3
[16]: category data weights
0 de 43528 0.820894
1 de 55946 0.889425
2 de 75158 0.672583
3 de 22085 0.292674
4 en 79633 0.810052
5 en 44609 0.107397
6 en 45209 0.333182
7 en 19978 0.227543
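The grouping and the function get_wavg are not shown; a sketch that matches the output below:

grouped_cat = df3.groupby('category')

def get_wavg(group):
    return np.average(group['data'], weights=group['weights'])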
grouped_cat.apply(get_wavg)
[17]: category
de 53261.553820
en 60146.114798
dtype: float64
Correlation
Finally, we group these percentage changes by year, which can be extracted from each row label with a one-line function
that returns the year attribute of each date label:
grouped_lang.apply(corr)
Since the apply method typically acts on each individual value in a Series, the function is called once per value.
If you have thousands of values, the function is called thousands of times. This bypasses the fast vectorised operations of
pandas: unless the passed function itself uses NumPy, slow Python code is executed. For example, we previously grouped the data
by title and then called our top method with apply. Let's measure the time this takes:
[22]: %%timeit
grouped_titles.apply(top)
3.96 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can get the same result without applying by passing the DataFrame to our top method:
[23]: %%timeit
top(df)
212 µs ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It is not always easy to find an alternative to apply. However, numerical operations like our top method can be made
faster with Cython. To use Cython in Jupyter, we first load the extension (typically with %load_ext Cython) and can then use the following cell magic:
[25]: %%cython
def top_cy(df, n=5, column='2021-12'):
return df.sort_values(by=column, ascending=False)[:n]
[26]: %%timeit
grouped_titles.apply(top_cy)
3.25 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We haven’t really gained much with this yet. Further optimisation would be possible by declaring types in the Cython
code with cpdef. For this, however, we would have to modify our method, because then a DataFrame could no longer be passed.
A pivot table is a data summary tool often found in spreadsheet and other data analysis software. It summarises a table
of data by one or more keys and arranges the data in a rectangle, with some of the group keys along the rows and some
along the columns. Pivot tables in Python with pandas are made possible by the groupby function in combination with
reshaping operations using hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level
function pandas.pivot_table. pivot_table not only provides a convenient interface to groupby, but can also add
partial sums (margins).
Suppose we wanted to compute a table of group averages (the default aggregation type of pivot_table) ordered by
title and language in the rows:
df
[2]: Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 30134.0 33295.0 19651.0
1 Jupyter Tutorial en 6073.0 7716.0 6547.0
2 PyViz Tutorial de 4873.0 3930.0 2573.0
3 PyViz Tutorial None NaN NaN NaN
4 Python Basics de 427.0 276.0 525.0
5 Python Basics en 95.0 226.0 157.0
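The cell with the grouping by title and language in the rows is not shown; based on the text, it is presumably:

df.pivot_table(index=['Title', 'Language'])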
[4]: df.pivot_table(columns='Title')
[4]: Title Jupyter Tutorial PyViz Tutorial Python Basics
2021-12 18103.5 4873.0 261.0
2022-01 20505.5 3930.0 251.0
2022-02 13099.0 2573.0 341.0
Alternatively, we can keep the languages as columns and add the mean values by specifying margins=True:
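A sketch of such a call:

df.pivot_table(index='Title', columns='Language', margins=True)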
To use an aggregation function other than mean, pass it to the keyword argument aggfunc. With sum, for example,
you get the sum:
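For example:

df.pivot_table(index='Title', columns='Language', aggfunc='sum', margins=True)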
pivot_table options:
Option Description
values column name(s) to aggregate; by default, all numeric columns are aggregated
index column names or other group keys to be grouped in the rows of the resulting pivot table
columns column names or other group keys to be grouped in the columns of the resulting pivot table
aggfunc aggregation function or list of functions (by default mean); can be any function valid in a groupby context
fill_value replaces missing values in the result table
dropna if True, columns whose entries are all NA are ignored
margins inserts row/column subtotals and grand totals (default: False)
margins_name name used for the row/column labels if margins=True is passed; default is All
observed for categorical group keys, if True, only the observed category values are displayed in the keys and not all categories
Crosstabs
A crosstab is a special case of a pivot table that calculates the frequency of groups. For example, in the context of
an analysis o