Python Introduction 2020
Kevin Sheppard
University of Oxford
Solutions
Solutions for exercises and some extended examples are available on GitHub.
https://github.com/bashtage/python-for-econometrics-statistics-data-analysis
Introductory Course
A self-paced introductory course is available on GitHub in the course/introduction folder. Solutions are available
in the solutions/introduction folder.
https://github.com/bashtage/python-introduction/
Video Demonstrations
• Switched examples to prefer the context manager syntax to reflect best practices.
Notes to the Fourth Edition
• Removed references to NumPy’s matrix class and clarified that it should not be used.
• Verified that all code and examples work correctly against 2020 versions of modules. The notable packages
and their versions are:
• Expanded description of model classes and statistical tests in statsmodels that are most relevant for
econometrics. TODO
• Expanded the list of packages of interest to researchers working in statistics, econometrics and machine
learning. TODO
• Introduced f-Strings in Section 21.3.3 as the preferred way to format strings using modern Python.
• Added minimize as the preferred interface for non-linear function optimization in Chapter 20. TODO
• Python 2.7 support has been officially dropped, although most examples continue to work with 2.7. Do
not use Python 2.7 in 2019 for numerical code.
• Fixed direct download of FRED data due to API changes, thanks to Jesper Termansen.
• Thanks to Bill Tubbs for a detailed read and multiple typo reports.
• Tested all code on Python 3.6. Code has been tested against the current set of modules installed by conda
as of February 2018. The notable packages and their versions are:
– NumPy: 1.13
– Pandas: 0.22
Notes to the Third Edition
This edition includes the following changes from the second edition (August 2014).
• Python 3.5 is the default version of Python instead of 2.7. Python 3.5 (or newer) is well supported by
the Python packages required to analyze data and perform statistical analysis, and brings some new useful
features, such as a new operator for matrix multiplication (@).
• Removed distinction between integers and longs in built-in data types chapter. This distinction is only
relevant for Python 2.7.
• dot has been removed from most examples and replaced with @ to produce more readable code.
• Split Cython and Numba into separate chapters to highlight the improved capabilities of Numba.
• Verified all code working on current versions of core libraries using Python 3.5.
• pandas
• New chapter introducing statsmodels, a package that facilitates statistical analysis of data. statsmodels
includes regression analysis, Generalized Linear Models (GLM) and time-series analysis using ARIMA
models.
• Added diagnostic tools and a simple method to use external code in the Cython section.
• Added examples of joblib and IPython’s cluster to the chapter on running code in parallel.
• New chapter introducing object-oriented programming as a method to provide structure and organization
to related code.
• Added seaborn to the recommended package list, and included it by default in the graphics chapter.
• Based on experience teaching Python to economics students, the recommended installation has been
simplified by removing the suggestion to use virtual environments. The discussion of virtual environments
has been moved to the appendix.
• Changed the Anaconda install to use both create and install, which shows how to install additional
packages.
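The switch from dot to @ noted in the list above can be illustrated with a minimal sketch. The array names and values are illustrative assumptions and do not come from the notes.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[0.0, 1.0], [1.0, 0.0]])

old_style = np.dot(x, y)  # function-call form used in earlier editions
new_style = x @ y         # matrix multiplication operator (Python 3.5+)
print(new_style)
```

Both forms compute the same matrix product for 2-dimensional arrays; the operator form tends to read more like the underlying mathematics.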
This edition includes the following changes from the first edition (March 2012).
• New chapter on pandas. pandas provides a simple but powerful tool to manage data and perform preliminary
analysis. It also greatly simplifies importing and exporting data.
• Numba provides just-in-time compilation for numeric Python code which often produces large performance
gains when pure NumPy solutions are not available (e.g. looping code).
• Numerous typos
1 Introduction
1.1 Background
1.2 Conventions
1.3 Important Components of the Python Scientific Stack
1.4 Setup
1.5 Using Python
1.6 Exercises
1.A Additional Installation Issues
3 Arrays
3.1 Array
3.2 1-dimensional Arrays
3.3 2-dimensional Arrays
3.4 Multidimensional Arrays
3.5 Concatenation
3.6 Accessing Elements of an Array
3.7 Slicing and Memory Management
3.8 import and Modules
3.9 Calling Functions
3.10 Exercises
4 Basic Math
4.1 Operators
4.2 Broadcasting
4.3 Addition (+) and Subtraction (-)
4.4 Multiplication (*)
4.5 Matrix Multiplication (@)
4.6 Array and Matrix Division (/)
6 Special Arrays
6.1 Exercises
7 Array Functions
7.1 Shape Information and Transformation
7.2 Linear Algebra Functions
7.3 Views
7.4 Exercises
14 Graphics
14.1 seaborn
14.2 2D Plotting
14.3 Advanced 2D Plotting
14.4 3D Plotting
14.5 General Plotting Functions
14.6 Exporting Plots
14.7 Exercises
15 pandas
15.1 Data Structures
15.2 Statistical Functions
15.3 Time-series Data
15.4 Importing and Exporting Data
15.5 Graphics
15.6 Examples
29 Examples
29.1 Estimating the Parameters of a GARCH Model
29.2 Estimating the Risk Premia using Fama-MacBeth Regressions
29.3 Estimating the Risk Premia using GMM
29.4 Outputting LaTeX
Introduction
Solutions
Solutions for exercises and some extended examples are available on GitHub at
https://github.com/bashtage/python-for-econometrics-statistics-data-analysis.
1.1 Background
These notes are designed for someone new to statistical computing wishing to develop a set of skills necessary
to perform original research using Python. They should also be useful for students, researchers or practitioners
who require a versatile platform for econometrics, statistics or general numerical analysis (e.g. numeric
solutions to economic models or model simulation).
Python is a popular general-purpose programming language that is well suited to a wide range of problems.1
Recent developments have extended Python’s range of applicability to econometrics, statistics, and general
numerical analysis. Python – with the right set of add-ons – is comparable to domain-specific languages such
as R, MATLAB or Julia. If you are wondering whether you should bother with Python (or another language),
an incomplete list of considerations includes:
You might want to consider R if:
• You want to apply statistical methods. The statistics library of R is second to none, and R is clearly at the
forefront of new statistical algorithm development – meaning you are most likely to find that new(ish)
procedure in R.
• Free is important.
• Documentation and organization of modules are more important than the breadth of algorithms available.
• Performance is an important concern. MATLAB has optimizations, such as Just-in-Time (JIT) compilation
of loops, which is not automatically available in most other packages.
1 According to the ranking on http://www.tiobe.com/tiobe-index/, Python is the 5th most popular language.
http://langpop.corger.nl/ ranks Python as 4th or 5th.
1.2 Conventions
These notes will follow two conventions.
1. Code blocks will be used throughout.
"""A docstring
"""
2. When a code block contains >>>, this indicates that the command is running in an interactive IPython
session. Output will often appear after the console command, and will not be preceded by a command
indicator.
>>> x = 1.0
>>> x + 2
3.0
If the code block does not contain the console session indicator, the code contained in the block is
intended to be executed in a standalone Python file.
import numpy as np
x = np.array([1,2,3,4])
y = np.sum(x)
print(x)
print(y)
1.3.2 NumPy
NumPy provides a set of array data types which are essential for statistics, econometrics and data analysis.
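As a minimal sketch of what an array data type looks like in practice, the following creates two arrays with different element types and inspects them. The variable names and values are illustrative assumptions and do not come from the notes.

```python
import numpy as np

# A 1-dimensional array of 64-bit floats
x = np.array([1.0, 2.0, 3.0, 4.0])
# A 2-dimensional array of integers (2 rows, 3 columns)
z = np.array([[1, 2, 3], [4, 5, 6]])

print(x.dtype)   # element type of x
print(z.shape)   # dimensions of z
print(x.mean())  # arrays carry common statistical methods
```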
1.3.3 SciPy
SciPy contains a large number of routines needed for analysis of data. The most important include a wide range
of random number generators, linear algebra routines, and optimizers. SciPy depends on NumPy.
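The three groups of routines mentioned above can be sketched briefly. The matrices, the seed and the objective function below are illustrative assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Random number generation: draws from a standard normal distribution
draws = stats.norm.rvs(size=5, random_state=0)

# Linear algebra: solve the system A b = c for b
A = np.array([[2.0, 0.0], [0.0, 4.0]])
c = np.array([2.0, 8.0])
b = linalg.solve(A, c)

# Optimization: minimize a simple quadratic whose minimum is at 3
res = optimize.minimize(lambda p: (p[0] - 3.0) ** 2, x0=np.array([0.0]))

print(draws.shape, b, res.x)
```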
1.3.6 pandas
pandas provides high-performance data structures and is essential when working with data.
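A small, hypothetical example of the kind of data structure pandas provides is sketched here. The column names, dates and numbers are made up for illustration.

```python
import pandas as pd

# A small DataFrame with labelled columns and a date index
df = pd.DataFrame(
    {"price": [100.0, 101.5, 99.8], "volume": [10, 12, 9]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

# Column-wise summary and a simple derived series
mean_price = df["price"].mean()
returns = df["price"].pct_change()  # first value is missing by construction
print(df)
print(mean_price)
```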
1.3.7 statsmodels
statsmodels is pandas-aware and provides models used in the statistical analysis of data including linear regres-
sion, Generalized Linear Models (GLMs), and time-series models (e.g., ARIMA).
A number of modules are available to help with performance. These include Cython and Numba. Cython is a
Python module which facilitates using a Python-like language to write functions that can be compiled to native
(C code) Python extensions. Numba uses a method of just-in-time compilation to translate a subset of Python
to native code using Low-Level Virtual Machine (LLVM).
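A rough sketch of the just-in-time approach is given below, assuming Numba's njit decorator is available. The fallback that defines njit as a do-nothing decorator is an assumption added here so that the example still runs, without any speed-up, when Numba is not installed. The function running_sum is hypothetical.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    # Fallback (an assumption for this sketch): run as plain Python
    def njit(func):
        return func


@njit
def running_sum(x):
    """Cumulative sum written as an explicit loop, the style Numba targets."""
    total = 0.0
    out = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        total += x[i]
        out[i] = total
    return out


x = np.arange(5.0)
result = running_sum(x)
print(result)
```

With Numba installed, the first call triggers compilation, so any timing comparison should exclude that first call.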
1.4 Setup
The recommended method to install the Python scientific stack is to use Continuum Analytics’ Anaconda.
The appendix describes a more complex installation procedure with instructions for directly installing Python
and the required modules when it is not possible to install Anaconda.
Windows
Installation on Windows requires downloading the installer and running it. Anaconda comes in both Python
2.7 and 3.x flavors, and the latest Python 3.x is required. These instructions use ANACONDA to indicate
the Anaconda installation directory (e.g., the default is C:\Anaconda). Once the setup has completed, open a
PowerShell command prompt and run
cd ANACONDA\Scripts
conda init powershell
conda update conda
conda update anaconda
conda install html5lib seaborn jupyterlab
which will first ensure that Anaconda is up-to-date. conda install can be used later to install other packages
that may be of interest. Note that if Anaconda is installed into a directory other than the default, the full path
should not contain Unicode characters or spaces.
Notes
• Install for all users, which requires admin privileges. If these are not available, then choose the “Just
for me” option, but be aware that installing to a path that contains non-ASCII characters can cause
issues.
• Run conda init powershell to ensure that Anaconda commands can be run from the PowerShell
prompt.
• Register Anaconda as the system Python unless you have a specific reason not to (unlikely).
Linux and OS X
where x.y.z will depend on the version being installed and ISA will be either x86 or more likely x86_64.
Anaconda comes in both Python 2.7 and 3.x flavors, and the latest Python 3.x is required. The OS X installer is
available either as a GUI installer (pkg format) or as a bash installer which is installed in an identical manner to
the Linux installation. It is strongly recommended that anaconda/bin is prepended to the path. This can be
done by entering conda init bash, which updates the shell's startup file, and then restarting your terminal. Note
that other shells such as zsh are also supported, and can be initialized by replacing bash with the name of your
preferred shell.
After installation completes, execute
conda update conda
conda update anaconda
conda install html5lib seaborn jupyterlab
which will first ensure that Anaconda is up-to-date and then install some optional modules. conda install
can be used later to install other packages that may be of interest.
Notes
All instructions for OS X and Linux assume that conda init bash has been run. If this is not the case, it is
necessary to run
cd ANACONDA
cd bin
• Tab completion - After entering 1 or more characters, pressing the tab button will bring up a list of
functions, packages, and variables which match the typed text. If the list of matches is large, pressing tab
again allows the arrow keys to be used to browse and select a completion.
• “Magic” functions, which make tasks such as navigating the local file system (using %cd ~/directory/
or just cd ~/directory/ assuming that %automagic is on) or running other Python programs (using
run program.py) simple. Entering %magic inside an IPython session will produce a detailed
description of the available functions. Alternatively, %lsmagic produces a succinct list of available
magic commands. The most useful magic functions are
– cd - change directory
– edit filename - launch an editor to edit filename
– ls or ls pattern - list the contents of a directory
– run filename - run the Python file filename
– timeit - time the execution of a piece of code or function
– history - view commands recently run. When used with the -l switch, the history of previous sessions
can be viewed (e.g., history -l 100 will show the most recent 100 commands irrespective
of whether they were entered in the current IPython session or a previous one).
• Integrated help - When using the QtConsole, calling a function provides a view of the top of the help
function. For example, entering mean( will produce a view of the top 20 lines of its help text.
• Inline figures - Both the QtConsole and the notebook can also display figures inline, which produces a
tidy, self-contained environment. This can be enabled by entering %matplotlib inline in an IPython
session.
• The special variable _ contains the last result in the console, and so the most recent result can be saved
to a new variable using the syntax x = _.
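The magic functions listed above only work inside IPython. As a rough stand-alone counterpart to the timeit magic, the standard library's timeit module can be used in an ordinary Python file. The function being timed here is a hypothetical example.

```python
import timeit


def sum_of_squares(n):
    """Sum of squares of 0, 1, ..., n-1 using a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Run the call 1,000 times and report the total elapsed time in seconds
elapsed = timeit.timeit(lambda: sum_of_squares(100), number=1000)
print(f"{elapsed:.4f} seconds for 1,000 calls")
```

Inside IPython the same measurement would be a single line using the timeit magic, which also chooses the number of repetitions automatically.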
This single line launcher can be saved as filename.command where filename is a meaningful name (e.g. IPython-Terminal)
to create a launcher on OS X by entering the command
chmod 755 /FULL/PATH/TO/filename.command
and then using the command as the Command in the dialog that appears.
Windows (Anaconda)
To run IPython open PowerShell and enter IPython in the start menu. Starting IPython using the QtConsole
is similar and is simply called QtConsole in the start menu. Launching IPython from the start menu should
create a window similar to that in figure 1.1.
Next, run
in the terminal or command prompt to generate a file named jupyter_qtconsole_config.py. This file contains
settings that are useful for customizing the QtConsole window. A few recommended modifications are
c.ConsoleWidget.font_size = 12
c.ConsoleWidget.font_family = "Bitstream Vera Sans Mono"
c.JupyterWidget.syntax_style = "monokai"
These commands assume that the Bitstream Vera fonts have been locally installed, which are available from
http://ftp.gnome.org/pub/GNOME/sources/ttf-bitstream-vera/1.10/. Opening QtConsole
should create a window similar to that in figure 1.2 (although the appearance might differ if you
did not use the recommended configuration).
Once you have saved this file, open the console, navigate to the directory where you saved the file and enter python
firstprogram.py. Finally, run the program in IPython by first launching IPython, then using %cd to
change to the location of the program, and finally executing the program using %run firstprogram.py.
If everything was successfully installed, you should see something similar to figure 1.3.
3 Programs can also be run in the standard Python interpreter using the command:
exec(compile(open('filename.py').read(),'filename.py','exec'))
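The exec-based approach from the footnote can be sketched in a slightly longer form that uses context managers to close the file. The temporary file and its one-line content are stand-ins invented for this illustration, not the firstprogram.py described in the notes.

```python
import os
import tempfile

# Write a tiny stand-in program to a temporary file (hypothetical content)
source = "result = 1 + 2\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "firstprogram.py")
    with open(path, "w") as f:
        f.write(source)

    # Read, compile and execute the file in a fresh namespace
    namespace = {}
    with open(path) as f:
        code = compile(f.read(), path, "exec")
    exec(code, namespace)

print(namespace["result"])
```

Executing into a separate dictionary keeps the program's variables from overwriting names in the calling session, which the one-line version does not do.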
jupyter lab
This command will start the server and open the default browser which should be a modern version of Chrome
(preferable), Chromium, Firefox or Edge. If the default browser is Safari or Internet Explorer, the URL can
be copied and pasted into Chrome. The first screen that appears will look similar to figure 1.4, except that the
list of notebooks will be empty. Clicking on New Notebook will create a new notebook, which, after a bit of
typing, can be transformed to resemble figure 1.5. Notebooks can be imported by dragging and dropping and
exported from the menu inside a notebook.
Figure 1.3: A successful test that matplotlib, IPython, NumPy and SciPy were all correctly installed.
Figure 1.4: The default IPython Notebook screen showing two notebooks.
Figure 1.5: A jupyterlab notebook showing formatted markdown, LATEX math and cells containing code.
such as built-in consoles, code completion (or IntelliSense, for completing function names) and integrated
debugging. Discussion of IDEs is beyond the scope of these notes, although Spyder is a reasonable choice
(free, cross-platform). Visual Studio Code is an excellent alternative. My preferred IDE is PyCharm, which has
a community edition that is free for use (the professional edition is low cost for academics).
spyder
Spyder is an IDE specialized for use in scientific applications of Python rather than for general purpose application
development. This is both an advantage and a disadvantage when compared to a full featured IDE such as
PyCharm or VS Code. The main advantage is that many powerful but complex features are not integrated into
Spyder, and so the learning curve is much shallower. The disadvantage is similar - in more complex projects,
or if developing something that is not straight scientific Python, Spyder is less capable. However, netting these
two, Spyder is almost certainly the IDE to use when starting Python, and it is always relatively simple to migrate
to a sophisticated IDE if needed.
Spyder is started by entering spyder in the terminal or command prompt. A window similar to that in
figure 1.6 should appear. The main components are the editor (1), the object inspector (2), which will
dynamically show help for functions that are used in the editor, and the console (3). By default, Spyder opens a standard
Python console, although it also supports using the more powerful IPython console. The object inspector
window, by default, is grouped with a variable explorer, which shows the variables that are in memory and the
file explorer, which can be used to navigate the file system. The console is grouped with an IPython console
window (needs to be activated first using the Interpreters menu along the top edge), and the history log which
contains a list of commands executed. The buttons along the top edge facilitate saving code, running code and
debugging.
1.6 Exercises
1. Install Python.
3. Customize IPython QtConsole using a font or color scheme. More customization options can be found
by running ipython -h.
4. Explore tab completion in IPython by entering a<TAB> to see the list of functions which start with a and
are loaded by pylab. Next try i<TAB>, which will produce a list longer than the screen – press ESC to
exit the pager.
Python is whitespace sensitive and so indentation, either spaces or tabs, affects how Python interprets files. The
configuration files, e.g. ipython_config.py, are plain Python files and so are sensitive to whitespace. Introducing
whitespace before the start of a configuration option will produce an error, so ensure there is no whitespace
before active lines of a configuration.
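The effect of stray leading whitespace can be reproduced without touching a real configuration file. In this minimal sketch the option name font_size and the two strings are illustrative assumptions standing in for lines of a configuration file.

```python
# Two versions of the same one-line configuration file, as strings
good = "font_size = 12\n"
bad = "    font_size = 12\n"  # leading spaces before an active line

compile(good, "<config>", "exec")  # compiles without complaint

try:
    compile(bad, "<config>", "exec")
    error = None
except IndentationError as exc:
    error = exc  # Python rejects the unexpected indent

print(type(error).__name__)
```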
1.A Additional Installation Issues
Windows
Spaces in path
Unicode in path
Python does not always work well when a path contains Unicode characters, which might occur in a user
name. While this isn’t an issue for installing Python or Anaconda, it is an issue for IPython which looks
in c:\user\username\.ipython for configuration files. The solution is to define the HOME variable before
launching IPython to a path that has only ASCII characters.
mkdir c:\anaconda\ipython_config
set HOME=c:\anaconda\ipython_config
c:\Anaconda\Scripts\activate econometrics
ipython profile create econometrics
ipython --profile=econometrics
The set HOME=c:\anaconda\ipython_config can point to any path with directories containing only ASCII
characters, and can also be added to any batch file to achieve the same effect.
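Before choosing a directory for HOME or for an installation, a candidate path can be checked for the two issues discussed in this appendix. The helper is_safe_path is hypothetical and assumes Python 3.7 or newer, where str.isascii is available.

```python
def is_safe_path(path):
    """Return True if path has only ASCII characters and no spaces."""
    return path.isascii() and " " not in path


print(is_safe_path(r"c:\anaconda\ipython_config"))
print(is_safe_path(r"c:\Program Files\Anaconda"))
```

This only inspects the text of the path; it does not check that the directory exists or is writable.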
OS X
Installing Anaconda to the root of the partition
If the user account used is running as root, then Anaconda may install to /anaconda and not ~/anaconda by
default. Best practice is not to run as root, although in principle this is not a problem, and /anaconda can be
used in place of ~/anaconda in any of the instructions.