
pandas: powerful Python data analysis toolkit
Release 0.25.2

Wes McKinney & PyData Development Team

Oct 23, 2019



Date: Oct 23, 2019 Version: 0.25.2


Download documentation: PDF Version | Zipped HTML
Useful links: Binary Installers | Source Repository | Issues & Ideas | Q&A Support | Mailing List
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and
data analysis tools for the Python programming language.
See the overview for more detail about what’s in the library.

CHAPTER ONE

WHAT'S NEW IN 0.25.2 (OCTOBER 15, 2019)

These are the changes in pandas 0.25.2. See release for a full changelog including other versions of pandas.

Note: Pandas 0.25.2 adds compatibility for Python 3.8 (GH28147).

1.1 Bug fixes

1.1.1 Indexing

• Fix regression in DataFrame.reindex() not following the limit argument (GH28631); a short sketch of limit follows this list.


• Fix regression in RangeIndex.get_indexer() for decreasing RangeIndex where target values may be improperly identified as missing/present (GH28678)
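
A minimal sketch of the limit argument referenced above (values arbitrary; limit caps how many consecutive missing labels are filled):

import pandas as pd

s = pd.Series([10.0, 20.0], index=[0, 1])
# Forward-fill at most one consecutive new label when reindexing;
# positions beyond the limit are left as NaN.
s.reindex(range(5), method='ffill', limit=1)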

1.1.2 I/O

• Fix regression in notebook display where <th> tags were missing for DataFrame.index values (GH28204).
• Fix regression in to_csv() where writing a Series or DataFrame indexed by an IntervalIndex would incorrectly raise a TypeError (GH28210).
• Fix to_csv() with ExtensionArray with list-like values (GH28840).

1.1.3 Groupby/resample/rolling

• Bug incorrectly raising an IndexError when passing a list of quantiles to pandas.core.groupby.DataFrameGroupBy.quantile() (GH28113).
• Bug in pandas.core.groupby.GroupBy.shift(), pandas.core.groupby.GroupBy.bfill() and pandas.core.groupby.GroupBy.ffill() where timezone information would be dropped (GH19995, GH27992)

1.1.4 Other

• Compatibility with Python 3.8 in DataFrame.query() (GH27261)
• Fix to ensure that tab-completion in an IPython console does not raise warnings for deprecated attributes (GH27900).


1.2 Contributors

A total of 6 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Felix Divo +
• Jeremy Schendel
• Joris Van den Bossche
• MeeseeksMachine
• Tom Augspurger
• jbrockmendel



CHAPTER TWO

INSTALLATION

The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. This is the recommended installation method for most users.
Instructions for installing from source, PyPI, ActivePython, various Linux distributions, or a development
version are also provided.

2.1 Python version support

Officially Python 3.5.3 and above, 3.6, 3.7, and 3.8.

2.2 Installing pandas

2.2.1 Installing with Anaconda

Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make up
the SciPy stack (IPython, NumPy, Matplotlib, …) is with Anaconda, a cross-platform (Linux, Mac OS X,
Windows) Python distribution for data analytics and scientific computing.
After running the installer, the user will have access to pandas and the rest of the SciPy stack without
needing to install anything else, and without needing to wait for any software to be compiled.
Installation instructions for Anaconda can be found here.
A full list of the packages available as part of the Anaconda distribution can be found here.
Another advantage to installing Anaconda is that you don't need admin rights to install it. Anaconda can install in the user's home directory, which makes it trivial to delete Anaconda if you decide to uninstall (just delete that folder).

2.2.2 Installing with Miniconda

The previous section outlined how to get pandas installed as part of the Anaconda distribution. However, this approach means you will install well over one hundred packages and involves downloading an installer which is a few hundred megabytes in size.
If you want more control over which packages are installed, or have limited internet bandwidth, then installing pandas with Miniconda may be a better solution.


Conda is the package manager that the Anaconda distribution is built upon. It is a package manager that is both cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination).
Miniconda allows you to create a minimal self-contained Python installation, and then use the conda command to install additional packages.
First you will need Conda to be installed; downloading and running the Miniconda installer will do this for you. The installer can be found here.
The next step is to create a new conda environment. A conda environment is like a virtualenv that allows you to specify a specific version of Python and set of libraries. Run the following commands from a terminal window:

conda create -n name_of_my_env python

This will create a minimal environment with only Python installed in it. To put yourself inside this environment, run:

source activate name_of_my_env

On Windows the command is:

activate name_of_my_env

The final step required is to install pandas. This can be done with the following command:

conda install pandas

To install a specific pandas version:

conda install pandas=0.20.3

To install other packages, IPython for example:

conda install ipython

To install the full Anaconda distribution:

conda install anaconda

If you need packages that are available to pip but not conda, then install pip, and then use pip to install
those packages:

conda install pip


pip install django

2.2.3 Installing from PyPI

pandas can be installed via pip from PyPI.

pip install pandas

2.2.4 Installing with ActivePython

Installation instructions for ActivePython can be found here. Versions 2.7 and 3.5 include pandas.


2.2.5 Installing using your Linux distribution's package manager

The commands in this table will install pandas for Python 3 from your distribution. To install pandas for
Python 2, you may need to use the python-pandas package.

Distribution     Status                      Download / Repository Link   Install method
Debian           stable                      official Debian repository   sudo apt-get install python3-pandas
Debian & Ubuntu  unstable (latest packages)  NeuroDebian                  sudo apt-get install python3-pandas
Ubuntu           stable                      official Ubuntu repository   sudo apt-get install python3-pandas
OpenSuse         stable                      OpenSuse Repository          zypper in python3-pandas
Fedora           stable                      official Fedora repository   dnf install python3-pandas
Centos/RHEL      stable                      EPEL repository              yum install python3-pandas

However, the packages in the Linux package managers are often a few versions behind, so to get the newest version of pandas, it's recommended to install using the pip or conda methods described above.

2.2.6 Installing from source

See the contributing guide for complete instructions on building from the git source tree. Further, see creating
a development environment if you wish to create a pandas development environment.

2.3 Running the test suite

pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing.
To run it on your machine to verify that everything is working (and that you have all of the dependencies,
soft and hard, installed), make sure you have pytest >= 4.0.2 and Hypothesis >= 3.58, then run:

>>> pd.test()
running: pytest --skip-slow --skip-network C:\Users\TP\Anaconda3\envs\py36\lib\site-packages\pandas

============================= test session starts =============================


platform win32 -- Python 3.6.2, pytest-3.6.0, py-1.4.34, pluggy-0.4.0
rootdir: C:\Users\TP\Documents\Python\pandasdev\pandas, inifile: setup.cfg
collected 12145 items / 3 skipped

..................................................................S......
........S................................................................
.........................................................................

==================== 12130 passed, 12 skipped in 368.339 seconds =====================


2.4 Dependencies

Package          Minimum supported version
setuptools       24.2.0
NumPy            1.13.3
python-dateutil  2.6.1
pytz             2017.2
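
To check which versions of these (and the optional dependencies below) are installed in your environment, you can use pd.show_versions():

import pandas as pd

# Prints the versions of pandas and of its installed required and
# optional dependencies; useful when reporting issues.
pd.show_versions()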

2.4.1 Recommended dependencies

• numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well as smart
chunking and caching to achieve large speedups. If installed, must be Version 2.6.2 or higher.
• bottleneck: for accelerating certain types of nan evaluations. bottleneck uses specialized cython
routines to achieve large speedups. If installed, must be Version 1.2.1 or higher.

Note: You are highly encouraged to install these libraries, as they provide speed improvements, especially
when working with large data sets.

2.4.2 Optional dependencies

Pandas has many optional dependencies that are only used for specific methods. For example, pandas.read_hdf() requires the pytables package. If the optional dependency is not installed, pandas will raise an ImportError when the method requiring that dependency is called.
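
As a brief sketch of that behavior (the file name here is hypothetical, and pytables is assumed to be missing):

import pandas as pd

try:
    df = pd.read_hdf('store.h5', 'df')  # requires the optional pytables package
except ImportError as err:
    # The error is raised when the method is called, not when pandas is imported.
    print(err)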


Dependency      Minimum Version  Notes
BeautifulSoup4  4.6.0            HTML parser for read_html (see note)
Jinja2                           Conditional formatting with DataFrame.style
PyQt4                            Clipboard I/O
PyQt5                            Clipboard I/O
PyTables        3.4.2            HDF5-based reading / writing
SQLAlchemy      1.1.4            SQL support for databases other than sqlite
SciPy           0.19.0           Miscellaneous statistical functions
XLsxWriter      0.9.8            Excel writing
blosc                            Compression for msgpack
fastparquet     0.2.1            Parquet reading / writing
gcsfs           0.2.2            Google Cloud Storage access
html5lib                         HTML parser for read_html (see note)
lxml            3.8.0            HTML parser for read_html (see note)
matplotlib      2.2.2            Visualization
openpyxl        2.4.8            Reading / writing for xlsx files
pandas-gbq      0.8.0            Google Big Query access
psycopg2                         PostgreSQL engine for sqlalchemy
pyarrow         0.9.0            Parquet and feather reading / writing
pymysql         0.7.11           MySQL engine for sqlalchemy
pyreadstat                       SPSS files (.sav) reading
pytables        3.4.2            HDF5 reading / writing
qtpy                             Clipboard I/O
s3fs            0.0.8            Amazon S3 access
xarray          0.8.2            pandas-like API for N-dimensional data
xclip                            Clipboard I/O on linux
xlrd            1.1.0            Excel reading
xlwt            1.2.0            Excel writing
xsel                             Clipboard I/O on linux
zlib                             Compression for msgpack

Optional dependencies for parsing HTML

One of the following combinations of libraries is needed to use the top-level read_html() function:
Changed in version 0.23.0.
• BeautifulSoup4 and html5lib
• BeautifulSoup4 and lxml
• BeautifulSoup4 and html5lib and lxml
• Only lxml, although see HTML Table Parsing for reasons as to why you should probably not take this approach.

Warning:
• If you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will not work with only BeautifulSoup4 installed.
• You are highly encouraged to read HTML Table Parsing gotchas. It explains issues surrounding the installation and usage of the above three libraries.
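
Once a valid parser combination is installed, a small hedged example (the HTML string is made up; read_html() returns a list of DataFrames, one per table found):

import pandas as pd

html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
</table>
"""

tables = pd.read_html(html)  # parses with lxml / BeautifulSoup4 / html5lib
print(tables[0])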


CHAPTER THREE

GETTING STARTED


3.1 Package overview

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working
with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building
block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis / manipulation tool available
in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data actually need not be labeled at all to
be placed into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle
the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For
R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on
top of NumPy and is intended to integrate well within a scientific computing environment with many other
3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point
data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional
objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the
user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you
in computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for
both aggregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures
into DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets


• Intuitive merging and joining data sets


• Flexible reshaping and pivoting of data sets
• Hierarchical labeling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving
/ loading data from the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window
statistics, moving window linear regressions, date shifting and lagging, etc.
Many of these principles are here to address the shortcomings frequently experienced using other languages
/ scientific research environments. For data scientists, working with data is typically divided into multiple
stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into
a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.
Some other notes
• pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else, generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.
• pandas is a dependency of statsmodels, making it an important part of the statistical computing
ecosystem in Python.
• pandas has been used extensively in production in financial applications.

3.1.1 Data structures

Dimensions  Name       Description
1           Series     1D labeled homogeneously-typed array
2           DataFrame  General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Why more than one data structure?

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.
For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be
able to insert and remove objects from these containers in a dictionary-like fashion.
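
A brief sketch of this dictionary-like behavior (column names arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'])
df['C'] = df['A'] * 2   # insert a new Series into the DataFrame
del df['C']             # remove it again, dict-style
b = df.pop('B')         # pop removes a column and returns it as a Series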
Also, we would like sensible default behaviors for the common API functions which take into account the
typical orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-
dimensional data, a burden is placed on the user to consider the orientation of the data set when writing
functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters
for performance). In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a
particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the
amount of mental effort required to code up data transformations in downstream functions.
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows)
and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results
in more readable code:

for col in df.columns:
    series = df[col]
    # do something with series


3.1.2 Mutability and copying of data

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched.
In general we like to favor immutability where sensible.
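
A short sketch of these points (names arbitrary):

import pandas as pd

s = pd.Series([1, 2, 3])
s[0] = 10                     # value-mutable: contents can be altered

df = pd.DataFrame({'A': s})
df['B'] = s * 2               # size-mutable: columns can be inserted

result = df.sort_values('B')  # most methods return a new object;
                              # df itself is left untouched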

3.1.3 Getting support

The first stop for pandas issues and ideas is the GitHub issue tracker. If you have a general question, pandas community experts can answer through Stack Overflow.

3.1.4 Community

pandas is actively supported today by a community of like-minded individuals around the world who contribute their valuable time and energy to help make open source pandas possible. Thanks to all of our contributors.
If you’re interested in contributing, please visit the contributing guide.
pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as
a world-class open-source project, and makes it possible to donate to the project.

3.1.5 Project governance

The governance process that the pandas project has used informally since its inception in 2008 is formalized in the Project Governance documents. The documents clarify how decisions are made and how the various elements of our community interact, including the relationship between open source collaborative development and work that may be funded by for-profit or non-profit entities.
Wes McKinney is the Benevolent Dictator for Life (BDFL).

3.1.6 Development team

The list of the Core Team members and more detailed information can be found on the people’s page of the
governance repo.

3.1.7 Institutional partners

The information about current institutional partners can be found on the pandas website.

3.1.8 License

BSD 3-Clause License

Copyright (c) 2008-2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team

All rights reserved.


Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


3.2 10 minutes to pandas

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in
the Cookbook.
Customarily, we import as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

3.2.1 Object creation

See the Data Structure Intro section.


Creating a Series by passing a list of values, letting pandas create a default integer index:

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [5]: dates = pd.date_range('20130101', periods=6)

In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [8]: df
Out[8]:
A B C D
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037

Creating a DataFrame by passing a dict of objects that can be converted to series-like.

In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})

In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

The columns of the resulting DataFrame have different dtypes.

In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically
enabled. Here’s a subset of the attributes that will be completed:

In [12]: df2.<TAB>  # noqa: E225, E999
df2.A df2.bool
df2.abs df2.boxplot
df2.add df2.C
df2.add_prefix df2.clip
df2.add_suffix df2.clip_lower
df2.align df2.clip_upper
df2.all df2.columns
df2.any df2.combine
df2.append df2.combine_first
df2.apply df2.compound
df2.applymap df2.consolidate
df2.D

As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of
the attributes have been truncated for brevity.

3.2.2 Viewing data

See the Basics section.


Here is how to view the top and bottom rows of the frame:
In [13]: df.head()
Out[13]:
A B C D
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189

In [14]: df.tail(3)
Out[14]:
A B C D
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
Display the index, columns:
In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data.

In [17]: df.to_numpy()
Out[17]:
array([[ 1.36382983, 0.77377516, -2.2758147 , -0.73333194],
[ 1.77586181, -0.86722409, 1.21445884, 1.84623251],
[ 0.74724665, 0.1665589 , -0.69400832, -0.68641742],
[ 0.34770833, 0.06047343, -1.39385516, -1.24987716],
[ 0.57886924, 0.78250045, 0.04555881, -0.49018939],
[ 0.58749057, -0.67538866, -0.43329251, -0.58303716]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)

Note: DataFrame.to_numpy() does not include the index or column labels in the output.

describe() shows a quick statistic summary of your data:

In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.900168 0.040116 -0.589492 -0.316103
std 0.549803 0.698714 1.198627 1.091824
min 0.347708 -0.867224 -2.275815 -1.249877
25% 0.581025 -0.491423 -1.218893 -0.721603
50% 0.667369 0.113516 -0.563650 -0.634727
75% 1.209684 0.621971 -0.074154 -0.513401
max 1.775862 0.782500 1.214459 1.846233

Transposing your data:


In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 1.363830 1.775862 0.747247 0.347708 0.578869 0.587491
B 0.773775 -0.867224 0.166559 0.060473 0.782500 -0.675389
C -2.275815 1.214459 -0.694008 -1.393855 0.045559 -0.433293
D -0.733332 1.846233 -0.686417 -1.249877 -0.490189 -0.583037

Sorting by an axis:
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 -0.733332 -2.275815 0.773775 1.363830
2013-01-02 1.846233 1.214459 -0.867224 1.775862
2013-01-03 -0.686417 -0.694008 0.166559 0.747247
2013-01-04 -1.249877 -1.393855 0.060473 0.347708
2013-01-05 -0.490189 0.045559 0.782500 0.578869
2013-01-06 -0.583037 -0.433293 -0.675389 0.587491

Sorting by values:
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-05 0.578869 0.782500 0.045559 -0.490189

3.2.3 Selection

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.

See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

Getting

Selecting a single column, which yields a Series, equivalent to df.A:


In [23]: df['A']
Out[23]:
2013-01-01 1.363830
2013-01-02 1.775862
2013-01-03 0.747247
2013-01-04 0.347708
2013-01-05 0.578869
2013-01-06 0.587491
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.


In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417

In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877

Selection by label

See more in Selection by Label.


For getting a cross section using a label:

In [26]: df.loc[dates[0]]
Out[26]:
A 1.363830
B 0.773775
C -2.275815
D -0.733332
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [27]: df.loc[:, ['A', 'B']]
Out[27]:
A B
2013-01-01 1.363830 0.773775
2013-01-02 1.775862 -0.867224
2013-01-03 0.747247 0.166559
2013-01-04 0.347708 0.060473
2013-01-05 0.578869 0.782500
2013-01-06 0.587491 -0.675389

Showing label slicing, both endpoints are included:

In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
A B
2013-01-02 1.775862 -0.867224
2013-01-03 0.747247 0.166559
2013-01-04 0.347708 0.060473

Reduction in the dimensions of the returned object:

In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A 1.775862
B -0.867224
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

In [30]: df.loc[dates[0], 'A']
Out[30]: 1.3638298274671674

For getting fast access to a scalar (equivalent to the prior method):

In [31]: df.at[dates[0], 'A']
Out[31]: 1.3638298274671674

Selection by position

See more in Selection by Position.


Select via the position of the passed integers:

In [32]: df.iloc[3]
Out[32]:
A 0.347708
B 0.060473
C -1.393855
D -1.249877
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to numpy/python:

In [33]: df.iloc[3:5, 0:2]
Out[33]:
A B
2013-01-04 0.347708 0.060473
2013-01-05 0.578869 0.782500

By lists of integer position locations, similar to the numpy/python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
A C
2013-01-02 1.775862 1.214459
2013-01-03 0.747247 -0.694008
2013-01-05 0.578869 0.045559

For slicing rows explicitly:

In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417

For slicing columns explicitly:

In [36]: df.iloc[:, 1:3]
Out[36]:
B C
2013-01-01 0.773775 -2.275815
2013-01-02 -0.867224 1.214459
2013-01-03 0.166559 -0.694008
2013-01-04 0.060473 -1.393855
2013-01-05 0.782500 0.045559
2013-01-06 -0.675389 -0.433293

For getting a value explicitly:

In [37]: df.iloc[1, 1]
Out[37]: -0.8672240914412521

For getting fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1, 1]
Out[38]: -0.8672240914412521

Boolean indexing

Using a single column’s values to select data.

In [39]: df[df.A > 0]
Out[39]:
A B C D
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037

Selecting values from a DataFrame where a boolean condition is met.

In [40]: df[df > 0]
Out[40]:
A B C D
2013-01-01 1.363830 0.773775 NaN NaN
2013-01-02 1.775862 NaN 1.214459 1.846233
2013-01-03 0.747247 0.166559 NaN NaN
2013-01-04 0.347708 0.060473 NaN NaN
2013-01-05 0.578869 0.782500 0.045559 NaN
2013-01-06 0.587491 NaN NaN NaN


Using the isin() method for filtering:


In [41]: df2 = df.copy()

In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

In [43]: df2
Out[43]:
A B C D E
2013-01-01 1.363830 0.773775 -2.275815 -0.733332 one
2013-01-02 1.775862 -0.867224 1.214459 1.846233 one
2013-01-03 0.747247 0.166559 -0.694008 -0.686417 two
2013-01-04 0.347708 0.060473 -1.393855 -1.249877 three
2013-01-05 0.578869 0.782500 0.045559 -0.490189 four
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037 three

In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
A B C D E
2013-01-03 0.747247 0.166559 -0.694008 -0.686417 two
2013-01-05 0.578869 0.782500 0.045559 -0.490189 four

Setting

Setting a new column automatically aligns the data by the indexes.

In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))

In [46]: s1
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64

In [47]: df['F'] = s1

Setting values by label:

In [48]: df.at[dates[0], 'A'] = 0

Setting values by position:

In [49]: df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

In [50]: df.loc[:, 'D'] = np.array([5] * len(df))

The result of the prior setting operations.


In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -2.275815 5 NaN
2013-01-02 1.775862 -0.867224 1.214459 5 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0
2013-01-04 0.347708 0.060473 -1.393855 5 3.0
2013-01-05 0.578869 0.782500 0.045559 5 4.0
2013-01-06 0.587491 -0.675389 -0.433293 5 5.0

A where operation with setting.

In [52]: df2 = df.copy()

In [53]: df2[df2 > 0] = -df2

In [54]: df2
Out[54]:
A B C D F
2013-01-01 0.000000 0.000000 -2.275815 -5 NaN
2013-01-02 -1.775862 -0.867224 -1.214459 -5 -1.0
2013-01-03 -0.747247 -0.166559 -0.694008 -5 -2.0
2013-01-04 -0.347708 -0.060473 -1.393855 -5 -3.0
2013-01-05 -0.578869 -0.782500 -0.045559 -5 -4.0
2013-01-06 -0.587491 -0.675389 -0.433293 -5 -5.0

3.2.4 Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
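
A quick sketch of NaN being skipped (values arbitrary):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
s.sum()   # 3.0 -- the NaN is skipped rather than propagated
s.mean()  # 1.5 -- the mean is taken over the two non-NaN values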
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1

In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -2.275815 5 NaN 1.0
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0 NaN
2013-01-04 0.347708 0.060473 -1.393855 5 3.0 NaN

To drop any rows that have missing data.

In [58]: df1.dropna(how='any')
Out[58]:
A B C D F E
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0

Filling missing data.


In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -2.275815 5 5.0 1.0
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0 5.0
2013-01-04 0.347708 0.060473 -1.393855 5 3.0 5.0

To get the boolean mask where values are nan.

In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True

3.2.5 Operations

See the Basic section on Binary Ops.

Stats

Operations in general exclude missing data.


Performing a descriptive statistic:

In [61]: df.mean()
Out[61]:
A 0.672863
B -0.088847
C -0.589492
D 5.000000
F 3.000000
dtype: float64

Same operation on the other axis:

In [62]: df.mean(1)
Out[62]:
2013-01-01 0.681046
2013-01-02 1.624619
2013-01-03 1.443959
2013-01-04 1.402865
2013-01-05 2.081386
2013-01-06 1.895762
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.


In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis='index')
Out[65]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.252753 -0.833441 -1.694008 4.0 1.0
2013-01-04 -2.652292 -2.939527 -4.393855 2.0 0.0
2013-01-05 -4.421131 -4.217500 -4.954441 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN

Apply

Applying functions to the data:


In [66]: df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -2.275815 5 NaN
2013-01-02 1.775862 -0.867224 -1.061356 10 1.0
2013-01-03 2.523108 -0.700665 -1.755364 15 3.0
2013-01-04 2.870817 -0.640192 -3.149219 20 6.0
2013-01-05 3.449686 0.142309 -3.103661 25 10.0
2013-01-06 4.037177 -0.533080 -3.536953 30 15.0

In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A 1.775862
B 1.649725
C 3.490274
D 0.000000
F 4.000000
dtype: float64

Histogramming

See more at Histogramming and Discretization.


In [68]: s = pd.Series(np.random.randint(0, 7, size=10))

In [69]: s
Out[69]:
0 3
1 2
2 4
3 5
4 1
5 1
6 3
7 2
8 4
9 0
dtype: int64

In [70]: s.value_counts()
Out[70]:
4 2
3 2
2 2
1 2
5 1
0 1
dtype: int64

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate
on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses
regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.

In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
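
Continuing with s from above, a short sketch of the regular-expression default:

s.str.contains('^[AB]')           # the pattern is treated as a regex by default
s.str.contains('A', regex=False)  # opt out for a literal substring match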

3.2.6 Merge

Concat

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together with concat():


In [73]: df = pd.DataFrame(np.random.randn(10, 4))

In [74]: df
Out[74]:
0 1 2 3
0 2.215176 -1.315908 -0.317116 -0.842397
1 2.213106 -0.596732 0.409649 1.286960
2 1.006929 -1.120177 0.230672 0.537003
3 -2.502802 0.785754 -0.228391 0.063769
4 -0.469153 -0.620792 1.997251 -0.360238
5 1.116982 -3.165350 -1.410497 -0.466197
6 0.712124 0.581404 0.715763 1.044883
7 -2.015823 0.212701 0.841041 0.089602
8 -1.633435 0.535489 -1.146539 0.930138
9 -0.219756 -0.913111 -0.769806 -0.355672

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]

In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 2.215176 -1.315908 -0.317116 -0.842397
1 2.213106 -0.596732 0.409649 1.286960
2 1.006929 -1.120177 0.230672 0.537003
3 -2.502802 0.785754 -0.228391 0.063769
4 -0.469153 -0.620792 1.997251 -0.360238
5 1.116982 -3.165350 -1.410497 -0.466197
6 0.712124 0.581404 0.715763 1.044883
7 -2.015823 0.212701 0.841041 0.089602
8 -1.633435 0.535489 -1.146539 0.930138
9 -0.219756 -0.913111 -0.769806 -0.355672

Join

SQL style merges. See the Database style joining section.


In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2

In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5


In [81]: pd.merge(left, right, on='key')
Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example that can be given is:
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2

In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5

In [86]: pd.merge(left, right, on='key')
Out[86]:
key lval rval
0 foo 1 4
1 bar 2 5

Append

Append rows to a dataframe. See the Appending section.


In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

In [88]: df
Out[88]:
A B C D
0 0.292785 -0.036327 -0.303377 0.132313
1 -1.437443 -0.521953 0.693930 -0.556417
2 2.246185 -0.175934 1.702784 2.475101
3 0.040067 -0.987877 -0.741536 1.142627
4 -1.657556 -0.812830 0.587969 1.347117
5 0.433323 -0.045170 -0.694659 -0.954585
6 -1.258443 -0.380086 1.610889 0.841358
7 -0.660989 0.519592 -0.360741 -1.441424

In [89]: s = df.iloc[3]

In [90]: df.append(s, ignore_index=True)
Out[90]:
A B C D
0 0.292785 -0.036327 -0.303377 0.132313
1 -1.437443 -0.521953 0.693930 -0.556417
2 2.246185 -0.175934 1.702784 2.475101
3 0.040067 -0.987877 -0.741536 1.142627
4 -1.657556 -0.812830 0.587969 1.347117
5 0.433323 -0.045170 -0.694659 -0.954585
6 -1.258443 -0.380086 1.610889 0.841358
7 -0.660989 0.519592 -0.360741 -1.441424
8 0.040067 -0.987877 -0.741536 1.142627

3.2.7 Grouping

By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
See the Grouping section.

In [91]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})

In [92]: df
Out[92]:
A B C D
0 foo one -1.593566 1.125380
1 bar one 0.854244 1.315419
2 foo two 1.218171 1.511499
3 bar three -0.028648 0.726054
4 foo two 1.632905 0.664038
5 bar two -0.637469 -0.707051
6 foo one -0.512159 -1.576752
7 foo three 0.312766 0.831920

Grouping and then applying the sum() function to the resulting groups.

In [93]: df.groupby('A').sum()
Out[93]:
C D
A
bar 0.188127 1.334423
foo 1.058118 2.556085

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.


In [94]: df.groupby(['A', 'B']).sum()
Out[94]:
C D
A B
bar one 0.854244 1.315419
three -0.028648 0.726054
two -0.637469 -0.707051
foo one -2.105725 -0.451372
three 0.312766 0.831920
two 2.851077 2.175537

3.2.8 Reshaping

See the sections on Hierarchical Indexing and Reshaping.

Stack

In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))

In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [98]: df2 = df[:4]

In [99]: df2
Out[99]:
A B
first second
bar one 1.374920 0.502088
two -0.304574 -1.021713
baz one 0.713245 0.315322
two 0.591517 0.543840

The stack() method “compresses” a level in the DataFrame’s columns.

In [100]: stacked = df2.stack()

In [101]: stacked
Out[101]:
first second
bar one A 1.374920
B 0.502088
two A -0.304574
B -1.021713
baz one A 0.713245
B 0.315322
two A 0.591517
B 0.543840
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack()
is unstack(), which by default unstacks the last level:
In [102]: stacked.unstack()
Out[102]:
A B
first second
bar one 1.374920 0.502088
two -0.304574 -1.021713
baz one 0.713245 0.315322
two 0.591517 0.543840

In [103]: stacked.unstack(1)
Out[103]:
second one two
first
bar A 1.374920 -0.304574
B 0.502088 -1.021713
baz A 0.713245 0.591517
B 0.315322 0.543840

In [104]: stacked.unstack(0)
Out[104]:
first bar baz
second
one A 1.374920 0.713245
B 0.502088 0.315322
two A -0.304574 0.591517
B -1.021713 0.543840

Pivot tables

See the section on Pivot Tables.


In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
.....: 'B': ['A', 'B', 'C'] * 4,
.....: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
.....: 'D': np.random.randn(12),
.....: 'E': np.random.randn(12)})
.....:

In [106]: df
Out[106]:
A B C D E
0 one A foo -0.278531 0.004556
1 one B foo 0.940889 -2.481391
2 two C foo 1.256053 1.125308
3 three A bar 0.336610 0.521860
4 one B bar 1.391023 0.326947
5 one C bar 0.280735 0.468253
6 two A foo 0.766009 -0.302657
7 three B foo 0.752167 -0.107777
8 one C foo -1.228431 -0.397928
9 one A bar -0.423429 -0.479562
10 two B bar -0.846409 0.760181
11 three C bar 1.372452 -0.137678

We can produce pivot tables from this data very easily:

In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C bar foo
A B
one A -0.423429 -0.278531
B 1.391023 0.940889
C 0.280735 -1.228431
three A 0.336610 NaN
B NaN 0.752167
C 1.372452 NaN
two A NaN 0.766009
B -0.846409 NaN
C NaN 1.256053

3.2.9 Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 26405
Freq: 5T, dtype: int64

Time zone representation:

In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [113]: ts
Out[113]:
2012-03-06 1.106290
2012-03-07 -1.917986
2012-03-08 1.099233
2012-03-09 -1.180562
2012-03-10 -0.140211
Freq: D, dtype: float64

In [114]: ts_utc = ts.tz_localize('UTC')

In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 1.106290
2012-03-07 00:00:00+00:00 -1.917986
2012-03-08 00:00:00+00:00 1.099233
2012-03-09 00:00:00+00:00 -1.180562
2012-03-10 00:00:00+00:00 -0.140211
Freq: D, dtype: float64

Converting to another time zone:

In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 1.106290
2012-03-06 19:00:00-05:00 -1.917986
2012-03-07 19:00:00-05:00 1.099233
2012-03-08 19:00:00-05:00 -1.180562
2012-03-09 19:00:00-05:00 -0.140211
Freq: D, dtype: float64

Converting between time span representations:


In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [119]: ts
Out[119]:
2012-01-31 0.471911
2012-02-29 -1.370620
2012-03-31 -0.267019
2012-04-30 0.111729
2012-05-31 0.611299
Freq: M, dtype: float64

In [120]: ps = ts.to_period()

In [121]: ps
Out[121]:
2012-01 0.471911
2012-02 -1.370620
2012-03 -0.267019
2012-04 0.111729
2012-05 0.611299
Freq: M, dtype: float64

In [122]: ps.to_timestamp()
Out[122]:
2012-01-01 0.471911
2012-02-01 -1.370620
2012-03-01 -0.267019
2012-04-01 0.111729
2012-05-01 0.611299
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the
following example, we convert a quarterly frequency with year ending in November to 9am of the end of the
month following the quarter end:

In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [126]: ts.head()
Out[126]:
1990-03-01 09:00 -1.315293
1990-06-01 09:00 -0.487557
1990-09-01 09:00 -0.778802
1990-12-01 09:00 -0.221309
1991-03-01 09:00 0.174483
Freq: H, dtype: float64

3.2.10 Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the
API documentation.

In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],


.....: "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
.....:

Convert the raw grades to a categorical data type.

In [128]: df["grade"] = df["raw_grade"].astype("category")

In [129]: df["grade"]
Out[129]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!).


In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]

Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default).

In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",


.....: "good", "very good"])
.....:

In [132]: df["grade"]
Out[132]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Sorting is per order in the categories, not lexical order.

In [133]: df.sort_values(by="grade")
Out[133]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

Grouping by a categorical column also shows empty categories.

In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64

3.2.11 Plotting

See the Plotting docs.
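
The plt calls below assume the standard matplotlib import convention, which this excerpt does not show explicitly:

import matplotlib.pyplot as plt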

In [135]: ts = pd.Series(np.random.randn(1000),
.....: index=pd.date_range('1/1/2000', periods=1000))
.....:

In [136]: ts = ts.cumsum()

In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x127495750>

On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:

In [139]: df = df.cumsum()

In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>

In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x1277a0a90>

In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x1277a5310>


3.2.12 Getting data in/out

CSV

Writing to a csv file.

In [143]: df.to_csv('foo.csv')

Reading from a csv file.

In [144]: pd.read_csv('foo.csv')
Out[144]:
Unnamed: 0 A B C D
0 2000-01-01 0.714384 1.118691 0.214306 -0.209375
1 2000-01-02 1.099535 -0.094588 1.519586 1.103485
2 2000-01-03 2.212839 -0.425442 0.268937 1.276245
3 2000-01-04 2.695004 -2.043315 1.337847 1.970893
4 2000-01-05 2.669479 -2.612468 2.095337 2.351292
.. ... ... ... ... ...
995 2002-09-22 -32.014002 -33.223408 28.122550 55.528126
996 2002-09-23 -31.418475 -33.747331 27.812849 55.601151
997 2002-09-24 -32.112192 -33.797895 28.882067 55.136737
998 2002-09-25 -32.855575 -35.252159 29.933404 55.318486
999 2002-09-26 -33.897833 -36.558650 29.289489 55.887491

[1000 rows x 5 columns]

HDF5

Reading and writing to HDFStores.


Writing to a HDF5 Store.

In [145]: df.to_hdf('foo.h5', 'df')

Reading from a HDF5 Store.

In [146]: pd.read_hdf('foo.h5', 'df')
Out[146]:
A B C D
2000-01-01 0.714384 1.118691 0.214306 -0.209375
2000-01-02 1.099535 -0.094588 1.519586 1.103485
2000-01-03 2.212839 -0.425442 0.268937 1.276245
2000-01-04 2.695004 -2.043315 1.337847 1.970893
2000-01-05 2.669479 -2.612468 2.095337 2.351292
... ... ... ... ...
2002-09-22 -32.014002 -33.223408 28.122550 55.528126
2002-09-23 -31.418475 -33.747331 27.812849 55.601151
2002-09-24 -32.112192 -33.797895 28.882067 55.136737
2002-09-25 -32.855575 -35.252159 29.933404 55.318486
2002-09-26 -33.897833 -36.558650 29.289489 55.887491

[1000 rows x 4 columns]

Excel

Reading and writing to MS Excel.


Writing to an excel file.

In [147]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from an excel file.

In [148]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[148]:
Unnamed: 0 A B C D
0 2000-01-01 0.714384 1.118691 0.214306 -0.209375
1 2000-01-02 1.099535 -0.094588 1.519586 1.103485
2 2000-01-03 2.212839 -0.425442 0.268937 1.276245
3 2000-01-04 2.695004 -2.043315 1.337847 1.970893
4 2000-01-05 2.669479 -2.612468 2.095337 2.351292
.. ... ... ... ... ...
995 2002-09-22 -32.014002 -33.223408 28.122550 55.528126
996 2002-09-23 -31.418475 -33.747331 27.812849 55.601151
997 2002-09-24 -32.112192 -33.797895 28.882067 55.136737
998 2002-09-25 -32.855575 -35.252159 29.933404 55.318486
999 2002-09-26 -33.897833 -36.558650 29.289489 55.887491

[1000 rows x 5 columns]

3.2.13 Gotchas

If you are attempting to perform an operation, you might see an exception like:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See Comparisons for an explanation and what to do.
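
As a hedged sketch of the usual fixes (choose the reduction that matches your intent):

s = pd.Series([False, True, False])

if s.any():    # True if at least one element is True
    print("at least one was true")

if s.all():    # True only if every element is True
    print("all were true")

if s.empty:    # True if the Series has no elements
    print("series was empty")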


See Gotchas as well.

3.3 Essential basic functionality

Here we discuss a lot of the essential functionality common to the pandas data structures. Here’s how to
create some of the objects used in the examples from the previous section:

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])

3.3.1 Head and tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default
number of elements to display is five, but you may pass a custom number.
In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()
Out[5]:
0 -1.353625
1 0.446611
2 1.600544
3 -0.602700
4 0.096528
dtype: float64

In [6]: long_series.tail(3)
Out[6]:
997 -0.933084
998 -1.895481
999 -1.109266
dtype: float64

3.3.2 Attributes and underlying data

pandas objects have a number of attributes enabling you to access the metadata:
• shape: gives the axis dimensions of the object, consistent with ndarray
• Axis labels
– Series: index (only axis)
– DataFrame: index (rows) and columns
Note, these attributes can be safely assigned to!

In [7]: df[:2]
Out[7]:
A B C
2000-01-01 0.717501 -0.936966 0.634588
2000-01-02 -0.712106 -0.600773 0.676949

In [8]: df.columns = [x.lower() for x in df.columns]

In [9]: df
Out[9]:
a b c
2000-01-01 0.717501 -0.936966 0.634588
2000-01-02 -0.712106 -0.600773 0.676949
2000-01-03 1.578377 -0.108295 0.317850
2000-01-04 1.020407 -0.220174 0.719467
2000-01-05 0.423406 0.742788 -0.047665
2000-01-06 1.516605 -0.287212 -0.086036
2000-01-07 0.411242 0.988457 -1.187729
2000-01-08 -0.053553 0.930622 1.334235

Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual
data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However,
pandas and 3rd party libraries may extend NumPy’s type system to add support for custom arrays (see
dtypes).
To get the actual data inside an Index or Series, use the .array property:
In [10]: s.array
Out[10]:
<PandasArray>
[ -1.245150154666457, -1.1212737834640922, 0.14371758135167426,
1.6052453659926817, 0.49430022914548555]
Length: 5, dtype: float64


In [11]: s.index.array
Out[11]:
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object
array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas
uses them are a bit beyond the scope of this introduction. See dtypes for more.
If you know you need a NumPy array, use to_numpy() or numpy.asarray().
In [12]: s.to_numpy()
Out[12]: array([-1.24515015, -1.12127378, 0.14371758, 1.60524537, 0.49430023])

In [13]: np.asarray(s)
Out[13]: array([-1.24515015, -1.12127378, 0.14371758, 1.60524537, 0.49430023])
When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and
coercing values. See dtypes for more.
to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider
datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are
two possibly useful representations:
1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
2. A datetime64[ns] -dtype numpy.ndarray, where the values have been converted to UTC and the
timezone discarded
Timezones may be preserved with dtype=object

In [14]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [15]: ser.to_numpy(dtype=object)
Out[15]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
dtype=object)

Or thrown away with dtype='datetime64[ns]'

In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
dtype='datetime64[ns]')

Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has
a single data type for all the columns, DataFrame.to_numpy() will return the underlying data:

In [17]: df.to_numpy()
Out[17]:
array([[ 0.71750128, -0.93696585, 0.63458795],
[-0.71210567, -0.60077296, 0.67694886],
[ 1.57837673, -0.10829478, 0.31785043],
[ 1.02040735, -0.22017439, 0.71946677],
[ 0.42340608, 0.74278775, -0.04766533],
[ 1.51660469, -0.28721192, -0.08603582],
[ 0.41124237, 0.98845749, -1.18772875],
[-0.05355332, 0.93062244, 1.33423531]])

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and
the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s
columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis
labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to
accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype.
If there are only floats and integers, the resulting array will be of float dtype.
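
A quick sketch of that rule (the column names and values here are arbitrary): mixing integer and float columns yields a float array, while adding a string column forces object dtype.

>>> pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]}).to_numpy().dtype  # ints upcast to float
dtype('float64')
>>> pd.DataFrame({'i': [1, 2], 's': ['a', 'b']}).to_numpy().dtype  # strings force object
dtype('O')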

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a
Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we
recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:
1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy
array or the extension array. Series.array will always return an ExtensionArray, and will never
copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying
/ coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data
and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(),
being a method, makes it clearer that the returned NumPy array may not be a view on the same data
in the DataFrame.

3.3.3 Accelerated operations

pandas has support for accelerating certain types of binary numerical and boolean operations using the
numexpr and bottleneck libraries.
These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr
uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that
are especially fast when dealing with arrays that have nans.
Here is a sample (using 100 column x 100,000 row DataFrames):

Operation   0.11.0 (ms)   Prior Version (ms)   Ratio to Prior
df1 > df2   13.32         125.35               0.1063
df1 * df2   21.71         36.63                0.5928
df1 + df2   22.04         36.50                0.6039

You are highly encouraged to install both libraries. See the section Recommended Dependencies for more
installation info.
Both are enabled by default; you can control this by setting the options:
New in version 0.20.0.

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)


3.3.4 Flexible binary operations

With binary operations between pandas data structures, there are two key points of interest:
• Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
• Missing data in computations.
We will demonstrate how to manage these issues independently, though they can be handled simultaneously.

Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), … for
carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these
functions, you can match on either the index or columns via the axis keyword:
In [18]: df = pd.DataFrame({
....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
....:

In [19]: df
Out[19]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [20]: row = df.iloc[1]

In [21]: column = df['two']

In [22]: df.sub(row, axis='columns')


Out[22]:
one two three
a 0.498982 -1.938487 NaN
b 0.000000 0.000000 0.000000
c 1.510204 -1.355125 0.951019
d NaN -2.159182 0.491513

In [23]: df.sub(row, axis=1)


Out[23]:
one two three
a 0.498982 -1.938487 NaN
b 0.000000 0.000000 0.000000
c 1.510204 -1.355125 0.951019
d NaN -2.159182 0.491513

In [24]: df.sub(column, axis='index')


Out[24]:
one two three
a -0.495853 0.0 NaN
b -2.933322 0.0 -2.352317
c -0.067993 0.0 -0.046172
d NaN 0.0 0.298378

In [25]: df.sub(column, axis=0)


Out[25]:
one two three
a -0.495853 0.0 NaN
b -2.933322 0.0 -2.352317
c -0.067993 0.0 -0.046172
d NaN 0.0 0.298378
Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.

In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
   ....:                                          (1, 'c'), (2, 'a')],
   ....:                                         names=['first', 'second'])
   ....:

In [28]: dfmi.sub(column, axis=0, level='second')


Out[28]:
one two three
first second
1 a -0.495853 0.000000 NaN
b -2.933322 0.000000 -2.352317
c -0.067993 0.000000 -0.046172
2 a NaN -0.220695 0.077683

Series and Index also support the divmod() builtin. This function performs floor division and the modulo
operation at the same time, returning a two-tuple of the same type as the left-hand side. For example:
In [29]: s = pd.Series(np.arange(10))

In [30]: s
Out[30]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div
Out[32]:
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
dtype: int64

In [33]: rem
Out[33]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
9 0
dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')
We can also do elementwise divmod():
In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [40]: div
Out[40]:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64

In [41]: rem
Out[41]:
0 0
1 1
2 2
3 0
4 0
5 1
6 1
7 2
8 2
9 3
dtype: int64

Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value
to substitute when at most one of the values at a location is missing. For example, when adding two
DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in
which case the result will be NaN (you can later replace NaN with some other value using fillna if you
wish).
In [42]: df
Out[42]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [43]: df2
Out[43]:
one two three
a -0.466716 0.029137 1.000000
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [44]: df + df2
Out[44]:
one two three
a -0.933433 0.058274 NaN
b -1.931397 3.935247 -0.769386
c 1.089011 1.224997 1.132652
d NaN -0.383116 0.213640

In [45]: df.add(df2, fill_value=0)


Out[45]:
one two three
a -0.933433 0.058274 1.000000
b -1.931397 3.935247 -0.769386
c 1.089011 1.224997 1.132652
d NaN -0.383116 0.213640


Flexible comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is
analogous to the binary arithmetic operations described above:
In [46]: df.gt(df2)
Out[46]:
one two three
a False False False
b False False False
c False False False
d False False False

In [47]: df2.ne(df)
Out[47]:
one two three
a False False True
b False False False
c False False False
d True False False
These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool.
These boolean objects can be used in indexing operations, see the section on Boolean indexing.

Boolean reductions

You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean
result.
In [48]: (df > 0).all()
Out[48]:
one False
two False
three False
dtype: bool

In [49]: (df > 0).any()


Out[49]:
one True
two True
three True
dtype: bool
You can reduce to a final boolean value.

In [50]: (df > 0).any().any()


Out[50]: True

You can test if a pandas object is empty, via the empty property.
In [51]: df.empty
Out[51]: False

In [52]: pd.DataFrame(columns=list('ABC')).empty
Out[52]: True


To evaluate single-element pandas objects in a boolean context, use the method bool():
In [53]: pd.Series([True]).bool()
Out[53]: True

In [54]: pd.Series([False]).bool()
Out[54]: False

In [55]: pd.DataFrame([[True]]).bool()
Out[55]: True

In [56]: pd.DataFrame([[False]]).bool()
Out[56]: False

Warning: You might be tempted to do the following:

>>> if df:
...     pass

Or:

>>> df and df2

These will both raise errors, as you are trying to compare multiple values:
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See gotchas for a more detailed discussion.
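
If you need a single boolean, reduce explicitly; for elementwise logic, use the bitwise operators instead. A brief sketch, assuming the df and df2 from above:

>>> if not df.empty:             # explicit single-boolean test
...     pass
>>> (df > 0) & (df2 > 0)         # elementwise "and", returns a boolean DataFrame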

Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example,
consider df + df and df * 2. To test that these two computations produce the same result, given the tools
shown above, you might imagine using (df + df == df * 2).all(). But in fact, this expression is False:
In [57]: df + df == df * 2
Out[57]:
one two three
a True True False
b True True True
c True True True
d False True True

In [58]: (df + df == df * 2).all()


Out[58]:
one False
two True
three False
dtype: bool
Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs
do not compare as equals:

In [59]: np.nan == np.nan


Out[59]: False


So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs
in corresponding locations treated as equal.

In [60]: (df + df).equals(df * 2)


Out[60]: True

Note that the Series or DataFrame index needs to be in the same order for equality to be True:
In [61]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})

In [62]: df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [63]: df1.equals(df2)
Out[63]: False

In [64]: df1.equals(df2.sort_index())
Out[64]: True

Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a
scalar value:
In [65]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[65]:
0 True
1 False
2 False
dtype: bool

In [66]: pd.Index(['foo', 'bar', 'baz']) == 'foo'


Out[66]: array([ True, False, False])
Pandas also handles element-wise comparisons between different array-like objects of the same length:
In [67]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[67]:
0 True
1 True
2 False
dtype: bool

In [68]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])


Out[68]:
0 True
1 True
2 False
dtype: bool
Trying to compare Index or Series objects of different lengths will raise a ValueError:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])


ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])


ValueError: Series lengths must match to compare


Note that this is different from the NumPy behavior where a comparison can be broadcast:

In [69]: np.array([1, 2, 3]) == np.array([2])


Out[69]: array([False, True, False])

or it can return False if broadcasting cannot be done:

In [70]: np.array([1, 2, 3]) == np.array([1, 2])


Out[70]: False

Combining overlapping data sets

A problem occasionally arising is the combination of two similar data sets where values in one are preferred
over the other. An example would be two data series representing a particular economic indicator where one is
considered to be of “higher quality”. However, the lower quality series might extend further back in history
or have more complete data coverage. As such, we would like to combine two DataFrame objects where
missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame.
The function implementing this operation is combine_first(), which we illustrate:
In [71]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
....: 'B': [np.nan, 2., 3., np.nan, 6.]})
....:

In [72]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
   ....:

In [73]: df1
Out[73]:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0

In [74]: df2
Out[74]:
A B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0

In [75]: df1.combine_first(df2)
Out[75]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0

General DataFrame combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes
another DataFrame and a combiner function, aligns the input DataFrames, and then passes the combiner
function pairs of Series (i.e., columns whose names are the same).
So, for instance, to reproduce combine_first() as above:

In [76]: def combiner(x, y):
   ....:     return np.where(pd.isna(x), y, x)
   ....:
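
Calling combine with this function then mirrors the combine_first() result shown above (a short usage sketch; the output follows from df1 and df2 as defined earlier):

>>> df1.combine(df2, combiner)   # NaN in df1 is filled from df2
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0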

3.3.5 Descriptive statistics

There exists a large number of methods for computing descriptive statistics and other related operations on
Series, DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(),
mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same
size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the
axis can be specified by name or integer:
• Series: no axis argument needed
• DataFrame: “index” (axis=0, default), “columns” (axis=1)
For example:
In [77]: df
Out[77]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [78]: df.mean(0)
Out[78]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64

In [79]: df.mean(1)
Out[79]:
a -0.218790
b 0.205744
c 0.574443
d -0.042369
dtype: float64
All such methods have a skipna option signaling whether to exclude missing data (True by default):
In [80]: df.sum(0, skipna=False)
Out[80]:
one NaN
two 2.417701
three NaN
dtype: float64

In [81]: df.sum(axis=1, skipna=True)


Out[81]:
a -0.437579
b 0.617232
c 1.723330
d -0.084738
dtype: float64
Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like
standardization (rendering data zero mean and standard deviation 1), very concisely:

In [82]: ts_stand = (df - df.mean()) / df.std()

In [83]: ts_stand.std()
Out[83]:
one 1.0
two 1.0
three 1.0
dtype: float64

In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [85]: xs_stand.std(1)
Out[85]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64

Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat
different from expanding() and rolling(). For more details please see this note.

In [86]: df.cumsum()
Out[86]:
one two three
a -0.466716 0.029137 NaN
b -1.432415 1.996760 -0.384693
c -0.887909 2.609259 0.181633
d NaN 2.417701 0.288453

Here is a quick reference summary table of common functions. Each also takes an optional level parameter
which applies only if the object has a hierarchical index.


Function Description
count Number of non-NA observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by
default:
In [87]: np.mean(df['one'])
Out[87]: -0.29596980138380163

In [88]: np.mean(df['one'].to_numpy())
Out[88]: nan
Series.nunique() will return the number of unique non-NA values in a Series:

In [89]: series = pd.Series(np.random.randn(500))

In [90]: series[20:500] = np.nan

In [91]: series[10:20] = 5

In [92]: series.nunique()
Out[92]: 11

Summarizing data: describe

There is a convenient describe() function which computes a variety of summary statistics about a Series
or the columns of a DataFrame (excluding NAs of course):

In [93]: series = pd.Series(np.random.randn(1000))

In [94]: series[::2] = np.nan

In [95]: series.describe()
Out[95]:
count 500.000000
mean 0.008270
std 0.957001
min -2.832536
25% -0.652563
50% 0.011414
75% 0.659473
max 3.179873
dtype: float64

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5),
   ....:                      columns=['a', 'b', 'c', 'd', 'e'])
   ....:

In [97]: frame.iloc[::2] = np.nan

In [98]: frame.describe()
Out[98]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean -0.051793 0.014006 -0.088499 -0.065968 0.099254
std 0.946751 0.953261 1.012176 1.021467 0.939771
min -2.778772 -2.932532 -3.194228 -2.490294 -2.640602
25% -0.701138 -0.632182 -0.716483 -0.787017 -0.546624
50% -0.075766 0.010519 -0.061189 -0.076329 0.095151
75% 0.620752 0.681123 0.642695 0.673043 0.743923
max 2.470484 2.415173 2.952825 2.850750 3.220515

You can select specific percentiles to include in the output:

In [99]: series.describe(percentiles=[.05, .25, .75, .95])


Out[99]:
count 500.000000
mean 0.008270
std 0.957001
min -2.832536
5% -1.640749
25% -0.652563
50% 0.011414
75% 0.659473
95% 1.524857
max 3.179873
dtype: float64

By default, the median is always included.


For a non-numerical Series object, describe() will give a simple summary of the number of unique values
and most frequently occurring values:
In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [101]: s.describe()
Out[101]:
count 9
unique 4
top a
freq 5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numer-
ical columns or, if none are, only categorical columns:

In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [103]: frame.describe()
Out[103]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000

This behavior can be controlled by providing a list of types as include/exclude arguments. The special
value all can also be used:
In [104]: frame.describe(include=['object'])
Out[104]:
a
count 4
unique 2
top Yes
freq 2

In [105]: frame.describe(include=['number'])
Out[105]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000

In [106]: frame.describe(include='all')
Out[106]:
a b
count 4 4.000000
unique 2 NaN
top Yes NaN
freq 2 NaN
mean NaN 1.500000
std NaN 1.290994
min NaN 0.000000
25% NaN 0.750000
50% NaN 1.500000
75% NaN 2.250000
max NaN 3.000000
That feature relies on select_dtypes. Refer to it for details about accepted inputs.

Index of min/max values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum
and maximum corresponding values:
In [107]: s1 = pd.Series(np.random.randn(5))

In [108]: s1
Out[108]:
0 -1.882481
1 0.007554
2 0.851716
3 -0.395161
4 -0.304605
dtype: float64

In [109]: s1.idxmin(), s1.idxmax()


Out[109]: (0, 2)

In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [111]: df1
Out[111]:
A B C
0 0.029120 -1.109406 -0.690599
1 -1.083192 0.419577 -1.165015
2 1.112228 0.169718 -1.251593
3 -0.313268 -0.517805 -0.736270
4 -0.354983 0.601455 0.452575

In [112]: df1.idxmin(axis=0)
Out[112]:
A 1
B 0
C 2
dtype: int64

In [113]: df1.idxmax(axis=1)
Out[113]:
0 A
1 B
2 A
3 A
4 B
dtype: object
When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and
idxmax() return the first matching index:
In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [115]: df3
Out[115]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN

In [116]: df3['A'].idxmin()
Out[116]: 'd'

Note: idxmin and idxmax are called argmin and argmax in NumPy.
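
The distinction is label versus position. A minimal sketch using the s1 from above (here the label happens to equal the position because of the default RangeIndex):

>>> s1.idxmax()             # label of the maximum
2
>>> s1.to_numpy().argmax()  # integer position of the maximum
2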

Value counts (histogramming) / mode

The value_counts() Series method and top-level function compute a histogram of a 1D array of values.
The top-level function can also be used on regular arrays:
In [117]: data = np.random.randint(0, 7, size=50)

In [118]: data
Out[118]:
array([1, 1, 0, 4, 1, 3, 4, 5, 2, 3, 0, 6, 5, 1, 5, 4, 5, 2, 6, 3, 6, 6,
1, 2, 0, 3, 3, 3, 4, 3, 2, 4, 4, 0, 3, 2, 5, 3, 5, 5, 6, 5, 1, 3,
6, 5, 0, 6, 1, 1])

In [119]: s = pd.Series(data)

In [120]: s.value_counts()
Out[120]:
3 10
5 9
1 8
6 7
4 6
2 5
0 5
dtype: int64

In [121]: pd.value_counts(data)
Out[121]:
3 10
5 9
1 8
6 7
4 6
2 5
0 5
dtype: int64
Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or
DataFrame:

In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [123]: s5.mode()
Out[123]:
0 3
1 7
dtype: int64

In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
   .....:                     "B": np.random.randint(-10, 15, size=50)})
   .....:

In [125]: df5.mode()
Out[125]:
A B
0 1 1
1 4 2

Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on
sample quantiles) functions:

In [126]: arr = np.random.randn(20)

In [127]: factor = pd.cut(arr, 4)

In [128]: factor
Out[128]:
[(0.631, 1.318], (0.631, 1.318], (0.631, 1.318], (-0.0556, 0.631], (-0.742, -0.0556], ..., (-0.742, -0.0556], (-0.742, -0.0556], (-0.742, -0.0556], (-0.0556, 0.631], (0.631, 1.318]]
Length: 20
Categories (4, interval[float64]): [(-1.432, -0.742] < (-0.742, -0.0556] < (-0.0556, 0.631] < (0.631, 1.318]]

In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [130]: factor
Out[130]:
[(0, 1], (0, 1], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (0, 1], (1, 5]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]


qcut() computes sample quantiles. For example, we could slice up some normally distributed data into
equal-size quartiles like so:
In [131]: arr = np.random.randn(30)

In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [133]: factor
Out[133]:
[(-0.69, 0.0118], (0.874, 2.622], (0.0118, 0.874], (-0.69, 0.0118], (0.874, 2.622], ..., (-1.7619999999999998, -0.69], (0.874, 2.622], (0.0118, 0.874], (0.874, 2.622], (0.0118, 0.874]]
Length: 30
Categories (4, interval[float64]): [(-1.7619999999999998, -0.69] < (-0.69, 0.0118] < (0.0118, 0.874] < (0.874, 2.622]]

In [134]: pd.value_counts(factor)
Out[134]:
(0.874, 2.622] 8
(-1.7619999999999998, -0.69] 8
(0.0118, 0.874] 7
(-0.69, 0.0118] 7
dtype: int64
We can also pass infinite values to define the bins:

In [135]: arr = np.random.randn(20)

In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [137]: factor
Out[137]:
[(-inf, 0.0], (0.0, inf], (-inf, 0.0], (-inf, 0.0], (0.0, inf], ..., (0.0, inf], (0.0, inf], (0.0, inf], (-inf, 0.0], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]

3.3.6 Function application

To apply your own or another library’s functions to pandas objects, you should be aware of the three methods
below. The appropriate method to use depends on whether your function expects to operate on an entire
DataFrame or Series, row- or column-wise, or elementwise.
1. Tablewise Function Application: pipe()
2. Row or Column-wise Function Application: apply()
3. Aggregation API : agg() and transform()
4. Applying Elementwise Functions: applymap()


Tablewise function application

DataFrames and Series can of course just be passed into functions. However, if the function needs to be
called in a chain, consider using the pipe() method. Compare the following

# f, g, and h are functions taking and returning ``DataFrames``
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent

>>> (df.pipe(h)
... .pipe(g, arg1=1)
... .pipe(f, arg2=2, arg3=3))

Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your
own or another library’s functions in method chains, alongside pandas’ methods.
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument.
What if the function you wish to apply takes its data as, say, the second argument? In this case, provide
pipe with a tuple of (callable, data_keyword). .pipe will route the DataFrame to the argument specified
in the tuple.
For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame
as the second argument, data. We pass in the function, keyword pair (sm.ols, 'data') to pipe:
In [138]: import statsmodels.formula.api as sm

In [139]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [140]: (bb.query('h > 0')
   .....:    .assign(ln_h=lambda df: np.log(df.h))
   .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   .....:    .fit()
   .....:    .summary()
   .....:  )
   .....:
Out[140]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: hr R-squared: 0.685
Model: OLS Adj. R-squared: 0.665
Method: Least Squares F-statistic: 34.28
Date: Wed, 23 Oct 2019 Prob (F-statistic): 3.48e-15
Time: 14:27:30 Log-Likelihood: -205.92
No. Observations: 68 AIC: 421.8
Df Residuals: 63 BIC: 432.9
Df Model: 4
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -8484.7720 4664.146 -1.819 0.074 -1.78e+04 835.780
C(lg)[T.NL] -2.2736 1.325 -1.716 0.091 -4.922 0.375
ln_h -1.3542 0.875 -1.547 0.127 -3.103 0.395
year 4.2277 2.324 1.819 0.074 -0.417 8.872
g 0.1841 0.029 6.258 0.000 0.125 0.243
==============================================================================
Omnibus: 10.875 Durbin-Watson: 1.999
Prob(Omnibus): 0.004 Jarque-Bera (JB): 17.298
Skew: 0.537 Prob(JB): 0.000175
Kurtosis: 5.225 Cond. No. 1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
,→specified.

[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

The pipe method is inspired by unix pipes and more recently dplyr and magrittr, which have introduced the
popular (%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at
home in python. We encourage you to view the source code of pipe().

Row or column-wise function application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like
the descriptive statistics methods, takes an optional axis argument:
In [141]: df.apply(np.mean)
Out[141]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64

In [142]: df.apply(np.mean, axis=1)


Out[142]:
a -0.218790
b 0.205744
c 0.574443
d -0.042369
dtype: float64

In [143]: df.apply(lambda x: x.max() - x.min())


Out[143]:
one 1.510204
two 2.159182
three 0.951019
dtype: float64

In [144]: df.apply(np.cumsum)
Out[144]:
one two three
a -0.466716 0.029137 NaN
b -1.432415 1.996760 -0.384693
c -0.887909 2.609259 0.181633
d NaN 2.417701 0.288453

In [145]: df.apply(np.exp)
Out[145]:
one two three
a 0.627058 1.029566 NaN
b 0.380717 7.153656 0.680659
c 1.723756 1.845035 1.761782
d NaN 0.825672 1.112734
The apply() method will also dispatch on a string method name.
In [146]: df.apply('mean')
Out[146]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64

In [147]: df.apply('mean', axis=1)


Out[147]:
a -0.218790
b 0.205744
c 0.574443
d -0.042369
dtype: float64
The return type of the function passed to apply() affects the type of the final output from DataFrame.apply
for the default behaviour:
• If the applied function returns a Series, the final output is a DataFrame. The columns match the
index of the Series returned by the applied function.
• If the applied function returns any other type, the final output is a Series.
This default behaviour can be overridden using the result_type argument, which accepts three options:
reduce, broadcast, and expand. These determine how list-like return values expand (or not) to a DataFrame.
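
A brief sketch of result_type with a function that returns a list (the frame df6 and its values are hypothetical):

>>> df6 = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'])
>>> df6.apply(lambda x: [1, 2], axis=1)                        # a Series of lists
>>> df6.apply(lambda x: [1, 2], axis=1, result_type='expand')  # a DataFrame with columns 0 and 1
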
apply() combined with some cleverness can be used to answer many questions about a data set. For
example, suppose we wanted to extract the date where the maximum value for each column occurred:

In [148]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=1000))
   .....:

In [149]: tsdf.apply(lambda x: x.idxmax())


Out[149]:
A 2002-01-27
B 2000-12-15
C 2001-05-30
dtype: datetime64[ns]

You may also pass additional arguments and keyword arguments to the apply() method. For instance,
consider the following function you would like to apply:


def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You may then apply this function as follows:

df.apply(subtract_and_divide, args=(5,), divide=3)

Another useful feature is the ability to pass Series methods to carry out some Series operation on each
column or row:
In [150]: tsdf
Out[150]:
A B C
2000-01-01 0.752721 -0.364275 0.134481
2000-01-02 0.826213 1.208276 -0.154643
2000-01-03 0.041831 0.479876 -0.956380
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -1.549128 1.145240 0.320799
2000-01-09 0.507849 -1.142744 -0.155781
2000-01-10 -0.169863 1.813758 0.771396

In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]:
A B C
2000-01-01 0.752721 -0.364275 0.134481
2000-01-02 0.826213 1.208276 -0.154643
2000-01-03 0.041831 0.479876 -0.956380
2000-01-04 -0.276361 0.612949 -0.700945
2000-01-05 -0.594552 0.746021 -0.445509
2000-01-06 -0.912744 0.879094 -0.190073
2000-01-07 -1.230936 1.012167 0.065363
2000-01-08 -1.549128 1.145240 0.320799
2000-01-09 0.507849 -1.142744 -0.155781
2000-01-10 -0.169863 1.813758 0.771396
Finally, apply() takes an argument raw which is False by default, which converts each row or column into
a Series before applying the function. When set to True, the passed function will instead receive an ndarray
object, which has positive performance implications if you do not need the indexing functionality.
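
For example, the same reduction both ways (a sketch on a hypothetical all-numeric frame df7; note that with raw=True the function receives plain ndarrays, so NaN handling can differ from the Series path):

>>> df7 = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
>>> df7.apply(np.sum)            # each column is passed as a Series
>>> df7.apply(np.sum, raw=True)  # each column is passed as an ndarray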

Aggregation API

New in version 0.20.0.


The aggregation API allows one to express possibly multiple aggregation operations in a single concise way.
This API is similar across pandas objects, see groupby API , the window functions API , and the resample
API . The entry point for aggregation is DataFrame.aggregate(), or the alias DataFrame.agg().
We will use a similar starting frame from above:

In [152]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
.....:

In [153]: tsdf.iloc[3:7] = np.nan

In [154]: tsdf
Out[154]:
A B C
2000-01-01 -0.096512 0.747788 0.465973
2000-01-02 1.596492 -1.093022 -0.571837
2000-01-03 0.090216 -1.615127 -0.025061
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -0.051882 -0.369806 -1.363744
2000-01-09 2.243620 -1.063587 0.448386
2000-01-10 0.096756 0.781555 -1.912402

Using a single function is equivalent to apply(). You can also pass named methods as strings. These will
return a Series of the aggregated output:
In [155]: tsdf.agg(np.sum)
Out[155]:
A 3.878691
B -2.612201
C -2.958685
dtype: float64

In [156]: tsdf.agg('sum')
Out[156]:
A 3.878691
B -2.612201
C -2.958685
dtype: float64

# these are equivalent to a ``.sum()`` because we are aggregating
# on a single function
In [157]: tsdf.sum()
Out[157]:
A 3.878691
B -2.612201
C -2.958685
dtype: float64
Single aggregations on a Series will return a scalar value:

In [158]: tsdf.A.agg('sum')
Out[158]: 3.8786905103802636


Aggregating with multiple functions

You can pass multiple aggregation arguments as a list. The results of each of the passed functions will be a
row in the resulting DataFrame. These are naturally named from the aggregation function.

In [159]: tsdf.agg(['sum'])
Out[159]:
A B C
sum 3.878691 -2.612201 -2.958685

Multiple functions yield multiple rows:

In [160]: tsdf.agg(['sum', 'mean'])


Out[160]:
A B C
sum 3.878691 -2.612201 -2.958685
mean 0.646448 -0.435367 -0.493114

On a Series, multiple functions return a Series, indexed by the function names:

In [161]: tsdf.A.agg(['sum', 'mean'])


Out[161]:
sum 3.878691
mean 0.646448
Name: A, dtype: float64

Passing a lambda function will yield a <lambda> named row:

In [162]: tsdf.A.agg(['sum', lambda x: x.mean()])


Out[162]:
sum 3.878691
<lambda> 0.646448
Name: A, dtype: float64

Passing a named function will yield that name for the row:

In [163]: def mymean(x):
   .....:     return x.mean()
   .....:

In [164]: tsdf.A.agg(['sum', mymean])


Out[164]:
sum 3.878691
mymean 0.646448
Name: A, dtype: float64

Aggregating with a dict

Passing a dictionary of column names to a scalar or a list of scalars to DataFrame.agg allows you to
customize which functions are applied to which columns. Note that the results are not in any particular
order; you can use an OrderedDict instead to guarantee ordering.
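
A sketch of the OrderedDict variant (the same aggregations as the example below, with a guaranteed row order):

>>> from collections import OrderedDict
>>> tsdf.agg(OrderedDict([('A', 'mean'), ('B', 'sum')]))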


In [165]: tsdf.agg({'A': 'mean', 'B': 'sum'})


Out[165]:
A 0.646448
B -2.612201
dtype: float64

Passing a list-like will generate a DataFrame output. You will get a matrix-like output of all of the aggregators.
The output will consist of all unique functions. Those that are not noted for a particular column will be
NaN:

In [166]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})


Out[166]:
A B
mean 0.646448 NaN
min -0.096512 NaN
sum NaN -2.612201

Mixed dtypes

When presented with mixed dtypes that cannot aggregate, .agg will only take the valid aggregations. This
is similar to how groupby .agg works.

In [167]: mdf = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [1., 2., 3.],
   .....:                     'C': ['foo', 'bar', 'baz'],
   .....:                     'D': pd.date_range('20130101', periods=3)})
   .....:

In [168]: mdf.dtypes
Out[168]:
A int64
B float64
C object
D datetime64[ns]
dtype: object

In [169]: mdf.agg(['min', 'sum'])


Out[169]:
A B C D
min 1 1.0 bar 2013-01-01
sum 6 6.0 foobarbaz NaT

Custom describe

With .agg() it is possible to easily create a custom describe function, similar to the built-in describe function.

In [170]: from functools import partial

In [171]: q_25 = partial(pd.Series.quantile, q=0.25)

In [172]: q_25.__name__ = '25%'

In [173]: q_75 = partial(pd.Series.quantile, q=0.75)

In [174]: q_75.__name__ = '75%'

In [175]: tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])


Out[175]:
A B C
count 6.000000 6.000000 6.000000
mean 0.646448 -0.435367 -0.493114
std 1.010421 1.010352 0.980195
min -0.096512 -1.615127 -1.912402
25% -0.016358 -1.085664 -1.165767
median 0.093486 -0.716697 -0.298449
75% 1.221558 0.468389 0.330024
max 2.243620 0.781555 0.465973

Transform API

New in version 0.20.0.


The transform() method returns an object that is indexed the same (same size) as the original. This API
allows you to provide multiple operations at the same time rather than one-by-one. Its API is quite similar
to the .agg API.
We create a frame similar to the one used in the above sections.

In [176]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....:

In [177]: tsdf.iloc[3:7] = np.nan

In [178]: tsdf
Out[178]:
A B C
2000-01-01 -0.646172 0.316546 0.442399
2000-01-02 -0.656399 -2.103519 0.373859
2000-01-03 0.479981 -0.047075 -0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 -1.402090 -0.915210 0.541518

Transform the entire frame. .transform() allows input functions as: a NumPy function, a string function
name, or a user-defined function.
In [179]: tsdf.transform(np.abs)
Out[179]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518

In [180]: tsdf.transform('abs')
Out[180]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518

In [181]: tsdf.transform(lambda x: x.abs())


Out[181]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518
Here transform() received a single function; this is equivalent to a ufunc application.
In [182]: np.abs(tsdf)
Out[182]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518

Passing a single function to .transform() with a Series will yield a single Series in return.

In [183]: tsdf.A.transform(np.abs)
Out[183]:
2000-01-01 0.646172
2000-01-02 0.656399
2000-01-03 0.479981
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 1.029689
2000-01-09 0.800346
2000-01-10 1.402090
Freq: D, Name: A, dtype: float64

Transform with multiple functions

Passing multiple functions will yield a column MultiIndexed DataFrame. The first level will be the original
frame column names; the second level will be the names of the transforming functions.

In [184]: tsdf.transform([np.abs, lambda x: x + 1])


Out[184]:
A B C
absolute <lambda> absolute <lambda> absolute <lambda>
2000-01-01 0.646172 0.353828 0.316546 1.316546 0.442399 1.442399
2000-01-02 0.656399 0.343601 2.103519 -1.103519 0.373859 1.373859
2000-01-03 0.479981 1.479981 0.047075 0.952925 0.002847 0.997153
2000-01-04 NaN NaN NaN NaN NaN NaN
2000-01-05 NaN NaN NaN NaN NaN NaN
2000-01-06 NaN NaN NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN NaN NaN
2000-01-08 1.029689 2.029689 0.548916 1.548916 0.960356 1.960356
2000-01-09 0.800346 1.800346 0.604359 1.604359 0.324512 1.324512
2000-01-10 1.402090 -0.402090 0.915210 0.084790 0.541518 1.541518

Passing multiple functions to a Series will yield a DataFrame. The resulting column names will be the
transforming functions.

In [185]: tsdf.A.transform([np.abs, lambda x: x + 1])


Out[185]:
absolute <lambda>
2000-01-01 0.646172 0.353828
2000-01-02 0.656399 0.343601
2000-01-03 0.479981 1.479981
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 NaN NaN
2000-01-07 NaN NaN
2000-01-08 1.029689 2.029689
2000-01-09 0.800346 1.800346
2000-01-10 1.402090 -0.402090

Transforming with a dict

Passing a dict of functions will allow selective transforming per column.

In [186]: tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})


Out[186]:
A B
2000-01-01 0.646172 1.316546
2000-01-02 0.656399 -1.103519
2000-01-03 0.479981 0.952925
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 NaN NaN
2000-01-07 NaN NaN
2000-01-08 1.029689 1.548916
2000-01-09 0.800346 1.604359
2000-01-10 1.402090 0.084790

Passing a dict of lists will generate a MultiIndexed DataFrame with these selective transforms.

In [187]: tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})


Out[187]:
A B
absolute <lambda> sqrt
2000-01-01 0.646172 1.316546 0.562625
2000-01-02 0.656399 -1.103519 NaN
2000-01-03 0.479981 0.952925 NaN
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 1.548916 0.740889
2000-01-09 0.800346 1.604359 0.777405
2000-01-10 1.402090 0.084790 NaN

Applying elementwise functions

Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the
methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a
single value and returning a single value. For example:
In [188]: df4
Out[188]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [189]: def f(x):
   .....:     return len(str(x))
   .....:

In [190]: df4['one'].map(f)
Out[190]:
a 19
b 19
c 17
d 3
Name: one, dtype: int64

In [191]: df4.applymap(f)
Out[191]:
one two three
a 19 20 3
b 19 18 19
c 17 17 18
d 3 20 18
Series.map() has an additional feature; it can be used to easily “link” or “map” values defined by a
secondary series. This is closely related to merging/joining functionality:
In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
.....: index=['a', 'b', 'c', 'd', 'e'])
.....:

In [193]: t = pd.Series({'six': 6., 'seven': 7.})

In [194]: s
Out[194]:
a six
b seven
c six
d seven
e six
dtype: object

In [195]: s.map(t)
Out[195]:
a 6.0
b 7.0
c 6.0
d 7.0
e 6.0
dtype: float64

3.3.7 Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other
features relying on label-alignment functionality. To reindex means to conform the data to match a given
set of labels along a particular axis. This accomplishes several things:


• Reorders the existing data to match a new set of labels
• Inserts missing value (NA) markers in label locations where no data for that label existed
• If specified, fills data for missing labels using logic (highly relevant to working with time series data)
Here is a simple example:
In [196]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [197]: s
Out[197]:
a 1.910501
b -0.247105
c -1.134167
d -1.591918
e 0.243080
dtype: float64

In [198]: s.reindex(['e', 'b', 'f', 'd'])


Out[198]:
e 0.243080
b -0.247105
f NaN
d -1.591918
dtype: float64
Here, the f label was not contained in the Series and hence appears as NaN in the result.
With a DataFrame, you can simultaneously reindex the index and columns:
In [199]: df
Out[199]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [200]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])


Out[200]:
three two one
c 0.566326 0.612498 0.544505
f NaN NaN NaN
b -0.384693 1.967624 -0.965699
You may also use reindex with an axis keyword:

In [201]: df.reindex(['c', 'f', 'b'], axis='index')


Out[201]:
one two three
c 0.544505 0.612498 0.566326
f NaN NaN NaN
b -0.965699 1.967624 -0.384693

Note that the Index objects containing the actual axis labels can be shared between objects. So if we have
a Series and a DataFrame, the following can be done:
In [202]: rs = s.reindex(df.index)


In [203]: rs
Out[203]:
a 1.910501
b -0.247105
c -1.134167
d -1.591918
dtype: float64

In [204]: rs.index is df.index


Out[204]: True
This means that the reindexed Series’s index is the same Python object as the DataFrame’s index.
New in version 0.21.0.
DataFrame.reindex() also supports an “axis-style” calling convention, where you specify a single labels
argument and the axis it applies to.
In [205]: df.reindex(['c', 'f', 'b'], axis='index')
Out[205]:
one two three
c 0.544505 0.612498 0.566326
f NaN NaN NaN
b -0.965699 1.967624 -0.384693

In [206]: df.reindex(['three', 'two', 'one'], axis='columns')


Out[206]:
three two one
a NaN 0.029137 -0.466716
b -0.384693 1.967624 -0.965699
c 0.566326 0.612498 0.544505
d 0.106820 -0.191558 NaN
See also:
MultiIndex / Advanced Indexing is an even more concise way of doing reindexing.

Note: When writing performance-sensitive code, there is a good reason to spend some time becoming a
reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames
internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because
reindex has been heavily optimized), but when CPU cycles matter sprinkling a few explicit reindex calls
here and there can have an impact.
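
A sketch of what explicit pre-alignment can look like (the frame names here are illustrative):

# Align once up front, then operate on frames that already share an index
common = df1.index.intersection(df2.index)
a = df1.reindex(common)
b = df2.reindex(common)
total = a + b   # no implicit reindexing step inside the addition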

Reindexing to align with another object

You may wish to take an object and reindex its axes to be labeled the same as another object. While the
syntax for this is straightforward albeit verbose, it is a common enough operation that the reindex_like()
method is available to make this simpler:
In [207]: df2
Out[207]:
one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498

In [208]: df3
Out[208]:
one two
a -0.170747 -0.840616
b -0.669729 1.097871
c 0.840475 -0.257255

In [209]: df.reindex_like(df2)
Out[209]:
one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498

Aligning objects with each other with align

The align() method is the fastest way to simultaneously align two objects. It supports a join argument
(related to joining and merging):
• join='outer': take the union of the indexes (default)
• join='left': use the calling object’s index
• join='right': use the passed object’s index
• join='inner': intersect the indexes
It returns a tuple with both of the reindexed Series:
In [210]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [211]: s1 = s[:4]

In [212]: s2 = s[1:]

In [213]: s1.align(s2)
Out[213]:
(a -1.362711
b 0.385943
c -0.538184
d 0.502612
e NaN
dtype: float64, a NaN
b 0.385943
c -0.538184
d 0.502612
e -1.491718
dtype: float64)

In [214]: s1.align(s2, join='inner')


Out[214]:
(b 0.385943
c -0.538184
d 0.502612
dtype: float64, b 0.385943
c -0.538184
d 0.502612
dtype: float64)

In [215]: s1.align(s2, join='left')


Out[215]:
(a -1.362711
b 0.385943
c -0.538184
d 0.502612
dtype: float64, a NaN
b 0.385943
c -0.538184
d 0.502612
dtype: float64)
For DataFrames, the join method will be applied to both the index and the columns by default:

In [216]: df.align(df2, join='inner')


Out[216]:
( one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498, one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498)

You can also pass an axis option to only align on the specified axis:

In [217]: df.align(df2, join='inner', axis=0)


Out[217]:
( one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326, one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498)

If you pass a Series to DataFrame.align(), you can choose to align both objects either on the DataFrame’s
index or columns using the axis argument:

In [218]: df.align(df2.iloc[0], axis=1)


Out[218]:
( one three two
a -0.466716 NaN 0.029137
b -0.965699 -0.384693 1.967624
c 0.544505 0.566326 0.612498
d NaN 0.106820 -0.191558, one -0.466716
three NaN
two 0.029137
Name: a, dtype: float64)


Filling while reindexing

reindex() takes an optional parameter method which is a filling method chosen from the following table:

Method             Action
pad / ffill        Fill values forward
bfill / backfill   Fill values backward
nearest            Fill from the nearest index value

We illustrate these fill methods on a simple Series:


In [219]: rng = pd.date_range('1/3/2000', periods=8)

In [220]: ts = pd.Series(np.random.randn(8), index=rng)

In [221]: ts2 = ts[[0, 3, 6]]

In [222]: ts
Out[222]:
2000-01-03 -0.267818
2000-01-04 0.951544
2000-01-05 1.220097
2000-01-06 1.083557
2000-01-07 -0.556464
2000-01-08 0.614753
2000-01-09 0.617631
2000-01-10 -0.131812
Freq: D, dtype: float64

In [223]: ts2
Out[223]:
2000-01-03 -0.267818
2000-01-06 1.083557
2000-01-09 0.617631
dtype: float64

In [224]: ts2.reindex(ts.index)
Out[224]:
2000-01-03 -0.267818
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 1.083557
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 0.617631
2000-01-10 NaN
Freq: D, dtype: float64

In [225]: ts2.reindex(ts.index, method='ffill')


Out[225]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 -0.267818
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 1.083557
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64

In [226]: ts2.reindex(ts.index, method='bfill')


Out[226]:
2000-01-03 -0.267818
2000-01-04 1.083557
2000-01-05 1.083557
2000-01-06 1.083557
2000-01-07 0.617631
2000-01-08 0.617631
2000-01-09 0.617631
2000-01-10 NaN
Freq: D, dtype: float64

In [227]: ts2.reindex(ts.index, method='nearest')


Out[227]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 1.083557
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 0.617631
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64
These methods require that the indexes are ordered increasing or decreasing.
Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate:

In [228]: ts2.reindex(ts.index).fillna(method='ffill')
Out[228]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 -0.267818
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 1.083557
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64

reindex() will raise a ValueError if the index is not monotonically increasing or decreasing. fillna() and
interpolate() will not perform any checks on the order of the index.

Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing. Limit specifies
the maximum count of consecutive matches:


In [229]: ts2.reindex(ts.index, method='ffill', limit=1)


Out[229]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 NaN
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 NaN
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and indexer values:

In [230]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')


Out[230]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 NaN
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 NaN
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64

Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into
a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
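
Equivalently, you can pass an explicit Timedelta rather than a string (a sketch using the ts2 from above):

>>> ts2.reindex(ts.index, method='ffill', tolerance=pd.Timedelta('1 day'))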

Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:
In [231]: df
Out[231]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [232]: df.drop(['a', 'd'], axis=0)


Out[232]:
one two three
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326

In [233]: df.drop(['one'], axis=1)


Out[233]:
two three
a 0.029137 NaN
b 1.967624 -0.384693
c 0.612498 0.566326
d -0.191558 0.106820
Note that the following also works, but is a bit less obvious / clean:

In [234]: df.reindex(df.index.difference(['a', 'd']))


Out[234]:
one two three
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326

Renaming / mapping labels

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary
function.
In [235]: s
Out[235]:
a -1.362711
b 0.385943
c -0.538184
d 0.502612
e -1.491718
dtype: float64

In [236]: s.rename(str.upper)
Out[236]:
A -1.362711
B 0.385943
C -0.538184
D 0.502612
E -1.491718
dtype: float64
If you pass a function, it must return a value when called with any of the labels (and must produce a set of
unique values). A dict or Series can also be used:

In [237]: df.rename(columns={'one': 'foo', 'two': 'bar'},


.....: index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
.....:
Out[237]:
foo bar three
apple -0.466716 0.029137 NaN
banana -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
durian NaN -0.191558 0.106820

If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping
don’t throw an error.
New in version 0.21.0.
DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper
and the axis to apply that mapping to.
In [238]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
Out[238]:

foo bar three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820

In [239]: df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')


Out[239]:
one two three
apple -0.466716 0.029137 NaN
banana -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
durian NaN -0.191558 0.106820
The rename() method also provides an inplace named parameter that is by default False and copies the
underlying data. Pass inplace=True to rename the data in place.
New in version 0.18.0.
Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [240]: s.rename("scalar-name")
Out[240]:
a -1.362711
b 0.385943
c -0.538184
d 0.502612
e -1.491718
Name: scalar-name, dtype: float64

New in version 0.24.0.


The methods DataFrame.rename_axis() and Series.rename_axis() allow specific names of a MultiIndex to be
changed (as opposed to the labels).
In [241]: df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
.....: 'y': [10, 20, 30, 40, 50, 60]},
.....: index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
.....: names=['let', 'num']))
.....:

In [242]: df
Out[242]:
x y
let num
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60

In [243]: df.rename_axis(index={'let': 'abc'})


Out[243]:
x y
abc num
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60

In [244]: df.rename_axis(index=str.upper)
Out[244]:
x y
LET NUM
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60

3.3.8 Iteration

The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is
regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention
of iterating over the “keys” of the objects.
In short, basic iteration (for i in object) produces:
• Series: values
• DataFrame: column labels
Thus, for example, iterating over a DataFrame gives you the column names:

In [245]: df = pd.DataFrame({'col1': np.random.randn(3),


.....: 'col2': np.random.randn(3)}, index=['a', 'b', 'c'])
.....:

In [246]: for col in df:


.....: print(col)
.....:
col1
col2

Pandas objects also have the dict-like items() method to iterate over the (key, value) pairs.
To iterate over the rows of a DataFrame, you can use the following methods:
• iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to
Series objects, which can change the dtypes and has some performance implications.
• itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster
than iterrows(), and is in most cases preferable for iterating over the values of a DataFrame.

Warning: Iterating through pandas objects is generally slow. In many cases, iterating manually over
the rows is not needed and can be avoided with one of the following approaches:
• Look for a vectorized solution: many operations can be performed using built-in methods or NumPy
functions, (boolean) indexing, … (see the sketch after this warning)

• When you have a function that cannot work on the full DataFrame/Series at once, it is better to
use apply() instead of iterating over the values. See the docs on function application.
• If you need to do iterative manipulations on the values but performance is important, consider
writing the inner loop with cython or numba. See the enhancing performance section for some
examples of this approach.
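
As a sketch of the vectorized approach from the warning above (dfv is a throwaway name for this example),
a whole-column operation replaces an explicit row loop, and apply() covers the cases a single expression
cannot:

>>> dfv = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> dfv['sum_vec'] = dfv['a'] + dfv['b']                              # vectorized: fast
>>> dfv['sum_apply'] = dfv.apply(lambda r: r['a'] + r['b'], axis=1)   # row-wise: much slower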

Warning: You should never modify something you are iterating over. This is not guaranteed to work
in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it
will have no effect!
For example, in the following case setting the value has no effect:
In [247]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [248]: for index, row in df.iterrows():


.....: row['a'] = 10
.....:

In [249]: df
Out[249]:
a b
0 1 a
1 2 b
2 3 c

items

Consistent with the dict-like interface, items() iterates through key-value pairs:
• Series: (index, scalar value) pairs
• DataFrame: (column, Series) pairs
For example:

In [250]: for label, ser in df.items():


.....: print(label)
.....: print(ser)
.....:
a
0 1
1 2
2 3
Name: a, dtype: int64
b
0 a
1 b
2 c
Name: b, dtype: object

iterrows

iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator
yielding each index value along with a Series containing the data in each row:

In [251]: for row_index, row in df.iterrows():


.....: print(row_index, row, sep='\n')
.....:
0
a 1
b a
Name: 0, dtype: object
1
a 2
b b
Name: 1, dtype: object
2
a 3
b c
Name: 2, dtype: object

Note: Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows
(dtypes are preserved across columns for DataFrames). For example,

In [252]: df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

In [253]: df_orig.dtypes
Out[253]:
int int64
float float64
dtype: object

In [254]: row = next(df_orig.iterrows())[1]

In [255]: row
Out[255]:
int 1.0
float 1.5
Name: 0, dtype: float64

All values in row, returned as a Series, are now upcast to floats, including the original integer value in
column int:
In [256]: row['int'].dtype
Out[256]: dtype('float64')

In [257]: df_orig['int'].dtype
Out[257]: dtype('int64')
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples
of the values and which is generally much faster than iterrows().

For instance, a contrived way to transpose the DataFrame would be:

In [258]: df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

In [259]: print(df2)
x y
0 1 4
1 2 5
2 3 6

In [260]: print(df2.T)
0 1 2
x 1 2 3
y 4 5 6

In [261]: df2_t = pd.DataFrame({idx: values for idx, values in df2.iterrows()})

In [262]: print(df2_t)
0 1 2
x 1 2 3
y 4 5 6

itertuples

The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame.
The first element of the tuple will be the row’s corresponding index value, while the remaining values are
the row values.
For instance:

In [263]: for row in df.itertuples():


.....: print(row)
.....:
Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')

This method does not convert the row to a Series object; it merely returns the values inside a namedtuple.
Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().

Note: The column names will be renamed to positional names if they are invalid Python identifiers,
repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
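
A minimal sketch of the renaming (the column names here are made up for illustration): a column name that
is not a valid Python identifier is replaced by its positional name in the namedtuple:

>>> df_bad = pd.DataFrame({'a b': [1], 'a': [2]})
>>> for row in df_bad.itertuples():
...     print(row)
Pandas(Index=0, _1=1, a=2)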

3.3.9 .dt accessor

Series has an accessor to succinctly return datetime like properties for the values of the Series, if it is a
datetime/period like Series. This will return a Series, indexed like the existing Series.
# datetime
In [264]: s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

In [265]: s
Out[265]:
0 2013-01-01 09:10:12
1 2013-01-02 09:10:12
2 2013-01-03 09:10:12
3 2013-01-04 09:10:12
dtype: datetime64[ns]

In [266]: s.dt.hour
Out[266]:
0 9
1 9
2 9
3 9
dtype: int64

In [267]: s.dt.second
Out[267]:
0 12
1 12
2 12
3 12
dtype: int64

In [268]: s.dt.day
Out[268]:
0 1
1 2
2 3
3 4
dtype: int64
This enables nice expressions like this:

In [269]: s[s.dt.day == 2]
Out[269]:
1 2013-01-02 09:10:12
dtype: datetime64[ns]

You can easily produce tz-aware transformations:


In [270]: stz = s.dt.tz_localize('US/Eastern')

In [271]: stz
Out[271]:
0 2013-01-01 09:10:12-05:00
1 2013-01-02 09:10:12-05:00
2 2013-01-03 09:10:12-05:00
3 2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [272]: stz.dt.tz
Out[272]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
You can also chain these types of operations:

In [273]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[273]:
0 2013-01-01 04:10:12-05:00
1 2013-01-02 04:10:12-05:00
2 2013-01-03 04:10:12-05:00
3 2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]

You can also format datetime values as strings with Series.dt.strftime() which supports the same format
as the standard strftime().
# DatetimeIndex
In [274]: s = pd.Series(pd.date_range('20130101', periods=4))

In [275]: s
Out[275]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: datetime64[ns]

In [276]: s.dt.strftime('%Y/%m/%d')
Out[276]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
# PeriodIndex
In [277]: s = pd.Series(pd.period_range('20130101', periods=4))

In [278]: s
Out[278]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]

In [279]: s.dt.strftime('%Y/%m/%d')
Out[279]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
The .dt accessor works for period and timedelta dtypes.
# period
In [280]: s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))

In [281]: s
Out[281]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]

In [282]: s.dt.year
Out[282]:
0 2013
1 2013
2 2013
3 2013
dtype: int64

In [283]: s.dt.day
Out[283]:
0 1
1 2
2 3
3 4
dtype: int64
# timedelta
In [284]: s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))

In [285]: s
Out[285]:
0 1 days 00:00:05
1 1 days 00:00:06
2 1 days 00:00:07
3 1 days 00:00:08
dtype: timedelta64[ns]

In [286]: s.dt.days
Out[286]:
0 1
1 1
2 1
3 1
dtype: int64

In [287]: s.dt.seconds
Out[287]:
0 5
1 6
2 7
3 8
dtype: int64

In [288]: s.dt.components
Out[288]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 1 0 0 5 0 0 0
1 1 0 0 6 0 0 0
2 1 0 0 7 0 0 0
3 1 0 0 8 0 0 0

Note: Series.dt will raise a TypeError if you access it with non-datetime-like values.

3.3.10 Vectorized string methods

Series is equipped with a set of string processing methods that make it easy to operate on each element of
the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are
accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in
string methods. For example:

In [289]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [290]: s.str.lower()
Out[290]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object

Powerful pattern-matching methods are provided as well, but note that pattern-matching generally uses
regular expressions by default (and in some cases always uses them).
Please see Vectorized String Methods for a complete description.
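
For instance, using the Series s defined above, a minimal sketch of regex-based matching (missing values
propagate as NaN rather than raising):

>>> s.str.contains('^[ab]', case=False)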

3.3.11 Sorting

Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a
combination of both.

By index

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its
index levels.
In [291]: df = pd.DataFrame({
.....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
.....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
.....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
.....:

In [292]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],


.....: columns=['three', 'two', 'one'])
.....:

In [293]: unsorted_df
Out[293]:
three two one
a NaN 0.498142 -0.579017
d -1.636064 -0.051421 NaN
c 0.439120 -1.425223 -1.211745
b 1.021821 0.226007 1.192046

# DataFrame
In [294]: unsorted_df.sort_index()
Out[294]:
three two one
a NaN 0.498142 -0.579017
b 1.021821 0.226007 1.192046
c 0.439120 -1.425223 -1.211745
d -1.636064 -0.051421 NaN

In [295]: unsorted_df.sort_index(ascending=False)
Out[295]:
three two one
d -1.636064 -0.051421 NaN
c 0.439120 -1.425223 -1.211745
b 1.021821 0.226007 1.192046
a NaN 0.498142 -0.579017

In [296]: unsorted_df.sort_index(axis=1)
Out[296]:
one three two
a -0.579017 NaN 0.498142
d NaN -1.636064 -0.051421
c -1.211745 0.439120 -1.425223
b 1.192046 1.021821 0.226007

# Series
In [297]: unsorted_df['three'].sort_index()
Out[297]:
a NaN
b 1.021821
c 0.439120
d -1.636064
Name: three, dtype: float64

By values

The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values()
method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.
sort_values() may be used to specify one or more columns to use to determine the sorted order.

In [298]: df1 = pd.DataFrame({'one': [2, 1, 1, 1],


.....: 'two': [1, 3, 2, 4],
.....: 'three': [5, 4, 3, 2]})
.....:

In [299]: df1.sort_values(by='two')
Out[299]:
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2

The by parameter can take a list of column names, e.g.:

In [300]: df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])


Out[300]:
one two three
2 1 2 3
1 1 3 4
3 1 4 2
0 2 1 5

These methods have special treatment of NA values via the na_position argument:
In [301]: s[2] = np.nan

In [302]: s.sort_values()
Out[302]:
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
2 NaN
5 NaN
dtype: object

In [303]: s.sort_values(na_position='first')
Out[303]:
2 NaN
5 NaN
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
dtype: object

By indexes and values

New in version 0.23.0.

Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level
names.

# Build MultiIndex
In [304]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
.....: ('b', 2), ('b', 1), ('b', 1)])
.....:

In [305]: idx.names = ['first', 'second']

# Build DataFrame
In [306]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
.....: index=idx)
.....:

In [307]: df_multi
Out[307]:
A
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1

Sort by ‘second’ (index) and ‘A’ (column)

In [308]: df_multi.sort_values(by=['second', 'A'])


Out[308]:
A
first second
b 1 1
1 2
a 1 6
b 2 3
a 2 4
2 5

Note: If a string matches both a column name and an index level name then a warning is issued and the
column takes precedence. This will result in an ambiguity error in a future version.

searchsorted

Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted().


In [309]: ser = pd.Series([1, 2, 3])

In [310]: ser.searchsorted([0, 3])


Out[310]: array([0, 2])

In [311]: ser.searchsorted([0, 4])

Out[311]: array([0, 3])

In [312]: ser.searchsorted([1, 3], side='right')


Out[312]: array([1, 3])

In [313]: ser.searchsorted([1, 3], side='left')


Out[313]: array([0, 2])

In [314]: ser = pd.Series([3, 1, 2])

In [315]: ser.searchsorted([0, 3], sorter=np.argsort(ser))


Out[315]: array([0, 2])

smallest / largest values

Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a
large Series this can be much faster than sorting the entire Series and calling head(n) on the result.
In [316]: s = pd.Series(np.random.permutation(10))

In [317]: s
Out[317]:
0 3
1 0
2 2
3 6
4 9
5 7
6 4
7 1
8 8
9 5
dtype: int64

In [318]: s.sort_values()
Out[318]:
1 0
7 1
2 2
0 3
6 4
9 5
3 6
5 7
8 8
4 9
dtype: int64

In [319]: s.nsmallest(3)
Out[319]:
1 0
7 1
2 2
dtype: int64

In [320]: s.nlargest(3)
Out[320]:
4 9
8 8
5 7
dtype: int64
DataFrame also has the nlargest and nsmallest methods.
In [321]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
.....: 'b': list('abdceff'),
.....: 'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
.....:

In [322]: df.nlargest(3, 'a')


Out[322]:
a b c
5 11 f 3.0
3 10 c 3.2
4 8 e NaN

In [323]: df.nlargest(5, ['a', 'c'])


Out[323]:
a b c
5 11 f 3.0
3 10 c 3.2
4 8 e NaN
2 1 d 4.0
6 -1 f 4.0

In [324]: df.nsmallest(3, 'a')


Out[324]:
a b c
0 -2 a 1.0
1 -1 b 2.0
6 -1 f 4.0

In [325]: df.nsmallest(5, ['a', 'c'])


Out[325]:
a b c
0 -2 a 1.0
1 -1 b 2.0
6 -1 f 4.0
2 1 d 4.0
4 8 e NaN

Sorting by a MultiIndex column

You must be explicit about sorting when the column is a MultiIndex, and fully specify all levels to by.
In [326]: df1.columns = pd.MultiIndex.from_tuples([('a', 'one'),
.....: ('a', 'two'),
.....: ('b', 'three')])
.....:

In [327]: df1.sort_values(by=('a', 'two'))


Out[327]:
a b
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2

3.3.12 Copying

The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they
are immutable) and returns a new object. Note that it is seldom necessary to copy objects. For
example, there are only a handful of ways to alter a DataFrame in-place:
• Inserting, deleting, or modifying a column.
• Assigning to the index or columns attributes.
• For homogeneous data, directly modifying the values via the values attribute or advanced indexing.
To be clear, no pandas method has the side effect of modifying your data; almost every method returns a
new object, leaving the original object untouched. If the data is modified, it is because you did so explicitly.

3.3.13 dtypes

For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame.
NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns] (note that NumPy
does not support timezone-aware datetimes).
Pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the
extensions pandas has made internally. See Extension types for how to write your own extension that works
with pandas. See Extension data types for a list of third-party libraries that have implemented an extension.
The following table lists all of pandas extension types. See the respective documentation sections for more
on each type.

Kind of Data          Data Type          Scalar     Array                  Documentation
tz-aware datetime     DatetimeTZDtype    Timestamp  arrays.DatetimeArray   Time zone handling
Categorical           CategoricalDtype   (none)     Categorical            categorical
period (time spans)   PeriodDtype        Period     arrays.PeriodArray     Time span representation
sparse                SparseDtype        (none)     arrays.SparseArray     sparse
intervals             IntervalDtype      Interval   arrays.IntervalArray   IntervalIndex
nullable integer      Int64Dtype, ...    (none)     arrays.IntegerArray    Nullable integer data type

Pandas uses the object dtype for storing strings.

Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible
(for performance and for interoperability with other libraries and methods; see object conversion).
A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.
In [328]: dft = pd.DataFrame({'A': np.random.rand(3),
.....: 'B': 1,
.....: 'C': 'foo',
.....: 'D': pd.Timestamp('20010102'),
.....: 'E': pd.Series([1.0] * 3).astype('float32'),
.....: 'F': False,
.....: 'G': pd.Series([1] * 3, dtype='int8')})
.....:

In [329]: dft
Out[329]:
A B C D E F G
0 0.537956 1 foo 2001-01-02 1.0 False 1
1 0.623232 1 foo 2001-01-02 1.0 False 1
2 0.558728 1 foo 2001-01-02 1.0 False 1

In [330]: dft.dtypes
Out[330]:
A float64
B int64
C object
D datetime64[ns]
E float32
F bool
G int8
dtype: object
On a Series object, use the dtype attribute.

In [331]: dft['A'].dtype
Out[331]: dtype('float64')

If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be
chosen to accommodate all of the data types (object is the most general).
# these ints are coerced to floats
In [332]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[332]:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
dtype: float64

# string data forces an ``object`` dtype


In [333]: pd.Series([1, 2, 3, 6., 'foo'])
Out[333]:
0 1
1 2
2 3
3 6
4 foo
dtype: object
The number of columns of each type in a DataFrame can be found by calling DataFrame.dtypes.
value_counts().

In [334]: dft.dtypes.value_counts()
Out[334]:
datetime64[ns] 1
object 1
bool 1
float64 1
int64 1
int8 1
float32 1
dtype: int64

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the
dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations.
Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.
In [335]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [336]: df1
Out[336]:
A
0 1.111131
1 0.459178
2 0.633793
3 1.315839
4 0.075656
5 0.526596
6 0.442085
7 -0.224085

In [337]: df1.dtypes
Out[337]:
A float32
dtype: object

In [338]: df2 = pd.DataFrame({'A': pd.Series(np.random.randn(8), dtype='float16'),


.....: 'B': pd.Series(np.random.randn(8)),
.....: 'C': pd.Series(np.array(np.random.randn(8),
.....: dtype='uint8'))})
.....:

In [339]: df2
Out[339]:
A B C
0 -0.809082 0.268481 0
1 -2.189453 0.991689 255
2 0.447021 -0.273606 0
3 0.072876 0.848166 1
4 -0.362061 -0.340956 255
5 -0.530273 -0.602481 0
6 -0.096802 -0.552552 0
7 1.262695 0.324815 255

In [340]: df2.dtypes
Out[340]:
A float16
B float64
C uint8
dtype: object

defaults

By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit).
The following will all result in int64 dtypes.
In [341]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[341]:
a int64
dtype: object

In [342]: pd.DataFrame({'a': [1, 2]}).dtypes


Out[342]:
a int64
dtype: object

In [343]: pd.DataFrame({'a': 1}, index=list(range(2))).dtypes


Out[343]:
a int64
dtype: object
Note that NumPy will choose platform-dependent types when creating arrays. The following WILL result
in int32 on a 32-bit platform.

In [344]: frame = pd.DataFrame(np.array([1, 2]))

upcasting

Types can potentially be upcast when combined with other types, meaning they are promoted from the
current type (e.g. int to float).
In [345]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [346]: df3
Out[346]:
A B C
0 0.302049 0.268481 0.0
1 -1.730275 0.991689 255.0
2 1.080814 -0.273606 0.0
3 1.388715 0.848166 1.0
4 -0.286405 -0.340956 255.0
5 -0.003677 -0.602481 0.0
6 0.345284 -0.552552 0.0
7 1.038611 0.324815 255.0

In [347]: df3.dtypes
Out[347]:
A float32
B float64
C float64
dtype: object
DataFrame.to_numpy() will return the lowest common denominator of the dtypes, meaning the dtype that
can accommodate ALL of the types in the resulting homogeneous dtyped NumPy array. This can force
some upcasting.

In [348]: df3.to_numpy().dtype
Out[348]: dtype('float64')

astype

You can use the astype() method to explicitly convert dtypes from one to another. These will by default
return a copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition,
they will raise an exception if the astype operation is invalid.
Upcasting is always according to the numpy rules. If two different dtypes are involved in an operation, then
the more general one will be used as the result of the operation.
In [349]: df3
Out[349]:
A B C
0 0.302049 0.268481 0.0
1 -1.730275 0.991689 255.0
2 1.080814 -0.273606 0.0
3 1.388715 0.848166 1.0
4 -0.286405 -0.340956 255.0
5 -0.003677 -0.602481 0.0
6 0.345284 -0.552552 0.0
7 1.038611 0.324815 255.0

In [350]: df3.dtypes
Out[350]:
A float32
B float64
C float64
dtype: object

# conversion of dtypes
In [351]: df3.astype('float32').dtypes
Out[351]:
A float32
B float32
C float32
dtype: object
Convert a subset of columns to a specified type using astype().
In [352]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [353]: dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)

In [354]: dft
Out[354]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9

In [355]: dft.dtypes
Out[355]:
a uint8
b uint8
c int64
dtype: object
New in version 0.19.0.
Convert certain columns to a specific dtype by passing a dict to astype().
In [356]: dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [357]: dft1 = dft1.astype({'a': np.bool, 'c': np.float64})

In [358]: dft1
Out[358]:
a b c
0 True 4 7.0
1 False 5 8.0
2 True 6 9.0

In [359]: dft1.dtypes
Out[359]:
a bool
b int64
c float64
dtype: object

Note: When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting
occurs.
loc() tries to fit what we are assigning into the current dtypes, while [] will overwrite them, taking the
dtype from the right-hand side. Therefore the following piece of code produces the unintended result.
In [360]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [361]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes


Out[361]:
a uint8
b uint8
dtype: object

In [362]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [363]: dft.dtypes
Out[363]:
a int64
b int64
c int64
dtype: object

object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types.
In cases where the data is already of the correct type, but stored in an object array, the DataFrame.
infer_objects() and Series.infer_objects() methods can be used to soft convert to the correct type.
In [364]: import datetime

In [365]: df = pd.DataFrame([[1, 2],


.....: ['a', 'b'],
.....: [datetime.datetime(2016, 3, 2),
.....: datetime.datetime(2016, 3, 2)]])
.....:

In [366]: df = df.T

In [367]: df
Out[367]:
0 1 2
0 1 a 2016-03-02
1 2 b 2016-03-02

In [368]: df.dtypes
Out[368]:
0 object
1 object
2 datetime64[ns]
dtype: object
Because the data was transposed, the original inference stored all columns as object, which infer_objects
will correct.

In [369]: df.infer_objects().dtypes
Out[369]:
0 int64
1 object
2 datetime64[ns]
dtype: object

The following functions are available for one dimensional object arrays or scalars to perform hard conversion
of objects to a specified type:
• to_numeric() (conversion to numeric dtypes)

In [370]: m = ['1.1', 2, 3]

In [371]: pd.to_numeric(m)
Out[371]: array([1.1, 2. , 3. ])

• to_datetime() (conversion to datetime objects)

In [372]: import datetime

In [373]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

In [374]: pd.to_datetime(m)
Out[374]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

• to_timedelta() (conversion to timedelta objects)

In [375]: m = ['5us', pd.Timedelta('1day')]

In [376]: pd.to_timedelta(m)
Out[376]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with
elements that cannot be converted to desired dtype or object. By default, errors='raise', meaning that
any errors encountered will be raised during the conversion process. However, if errors='coerce', these
errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta)
or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired
dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to
represent as missing:

In [377]: import datetime

In [378]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [379]: pd.to_datetime(m, errors='coerce')


Out[379]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [380]: m = ['apple', 2, 3]

In [381]: pd.to_numeric(m, errors='coerce')


Out[381]: array([nan, 2., 3.])

In [382]: m = ['apple', pd.Timedelta('1day')]

In [383]: pd.to_timedelta(m, errors='coerce')


Out[383]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)

The errors parameter has a third option of errors='ignore', which will simply return the passed in data
if it encounters any errors with the conversion to a desired data type:

In [384]: import datetime

In [385]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [386]: pd.to_datetime(m, errors='ignore')


Out[386]: Index(['apple', 2016-03-02 00:00:00], dtype='object')

In [387]: m = ['apple', 2, 3]

In [388]: pd.to_numeric(m, errors='ignore')


Out[388]: array(['apple', 2, 3], dtype=object)

In [389]: m = ['apple', pd.Timedelta('1day')]

In [390]: pd.to_timedelta(m, errors='ignore')


Out[390]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)

In addition to object conversion, to_numeric() provides another argument downcast, which gives the option
of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:
In [391]: m = ['1', 2, 3]

In [392]: pd.to_numeric(m, downcast='integer') # smallest signed int dtype


Out[392]: array([1, 2, 3], dtype=int8)

In [393]: pd.to_numeric(m, downcast='signed') # same as 'integer'


Out[393]: array([1, 2, 3], dtype=int8)

In [394]: pd.to_numeric(m, downcast='unsigned') # smallest unsigned int dtype


Out[394]: array([1, 2, 3], dtype=uint8)

In [395]: pd.to_numeric(m, downcast='float') # smallest float dtype


Out[395]: array([1., 2., 3.], dtype=float32)
Because these methods apply only to one-dimensional arrays, lists, or scalars, they cannot be used directly on
multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over
each column efficiently:
In [396]: import datetime

In [397]: df = pd.DataFrame([
.....: ['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
.....:

In [398]: df
Out[398]:
0 1
0 2016-07-09 2016-03-02 00:00:00
1 2016-07-09 2016-03-02 00:00:00

In [399]: df.apply(pd.to_datetime)
Out[399]:
0 1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02

In [400]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

In [401]: df
Out[401]:
0 1 2
0 1.1 2 3
1 1.1 2 3

In [402]: df.apply(pd.to_numeric)
Out[402]:
0 1 2
0 1.1 2 3
1 1.1 2 3

In [403]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [404]: df
Out[404]:
0 1
0 5us 1 days 00:00:00
1 5us 1 days 00:00:00

In [405]: df.apply(pd.to_timedelta)
Out[405]:
0 1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days

gotchas

Performing selection operations on integer-type data can easily upcast the data to floating point. The dtype
of the input data will be preserved in cases where nans are not introduced. See also Support for integer NA.
In [406]: dfi = df3.astype('int32')

In [407]: dfi['E'] = 1

In [408]: dfi
Out[408]:
A B C E
0 0 0 0 1
1 -1 0 255 1
2 1 0 0 1
3 1 0 1 1
4 0 0 255 1
5 0 0 0 1
6 0 0 0 1
7 1 0 255 1

In [409]: dfi.dtypes
Out[409]:
A int32
B int32
C int32
E int64
dtype: object

In [410]: casted = dfi[dfi > 0]

In [411]: casted
Out[411]:
A B C E
0 NaN NaN NaN 1
1 NaN NaN 255.0 1
2 1.0 NaN NaN 1
3 1.0 NaN 1.0 1
4 NaN NaN 255.0 1
5 NaN NaN NaN 1
6 NaN NaN NaN 1
7 1.0 NaN 255.0 1

In [412]: casted.dtypes
Out[412]:
A float64
B float64
C float64
E int64
dtype: object
Float dtypes, in contrast, are unchanged.
In [413]: dfa = df3.copy()

In [414]: dfa['A'] = dfa['A'].astype('float32')

In [415]: dfa.dtypes
Out[415]:
A float32
B float64
C float64
dtype: object

In [416]: casted = dfa[df2 > 0]

In [417]: casted
Out[417]:
A B C
0 NaN 0.268481 NaN
1 NaN 0.991689 255.0
2 1.080814 NaN NaN
3 1.388715 0.848166 1.0
4 NaN NaN 255.0
5 NaN NaN NaN
6 NaN NaN NaN
7 1.038611 0.324815 255.0

In [418]: casted.dtypes
Out[418]:
A float32
B float64
C float64
dtype: object

3.3.14 Selecting columns based on dtype

The select_dtypes() method implements subsetting of columns based on their dtype.


First, let’s create a DataFrame with a slew of different dtypes:

In [419]: df = pd.DataFrame({'string': list('abc'),


.....: 'int64': list(range(1, 4)),
.....: 'uint8': np.arange(3, 6).astype('u1'),
.....: 'float64': np.arange(4.0, 7.0),
.....: 'bool1': [True, False, True],
.....: 'bool2': [False, True, False],
.....: 'dates': pd.date_range('now', periods=3),
.....: 'category': pd.Series(list("ABC")).astype('category')})
.....:

In [420]: df['tdeltas'] = df.dates.diff()

In [421]: df['uint64'] = np.arange(3, 6).astype('u8')

In [422]: df['other_dates'] = pd.date_range('20130101', periods=3)

In [423]: df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [424]: df
Out[424]:
  string  int64  uint8  float64  bool1  bool2                      dates category tdeltas  uint64 other_dates             tz_aware_dates
0      a      1      3      4.0   True  False 2019-10-23 14:27:32.265347        A     NaT       3  2013-01-01  2013-01-01 00:00:00-05:00
1      b      2      4      5.0  False   True 2019-10-24 14:27:32.265347        B  1 days       4  2013-01-02  2013-01-02 00:00:00-05:00
2      c      3      5      6.0   True  False 2019-10-25 14:27:32.265347        C  1 days       5  2013-01-03  2013-01-03 00:00:00-05:00

And the dtypes:

In [425]: df.dtypes
Out[425]:
string object
int64 int64
uint8 uint8
float64 float64
bool1 bool
bool2 bool
dates datetime64[ns]
category category
tdeltas timedelta64[ns]
uint64 uint64
other_dates datetime64[ns]
tz_aware_dates datetime64[ns, US/Eastern]
dtype: object

select_dtypes() has two parameters include and exclude that allow you to say “give me the columns
with these dtypes” (include) and/or “give the columns without these dtypes” (exclude).
For example, to select bool columns:

In [426]: df.select_dtypes(include=[bool])
Out[426]:
bool1 bool2
0 True False
1 False True
2 True False

You can also pass the name of a dtype in the NumPy dtype hierarchy:

In [427]: df.select_dtypes(include=['bool'])
Out[427]:
bool1 bool2
0 True False
1 False True
2 True False

select_dtypes() also works with generic dtypes.


For example, to select all numeric and boolean columns while excluding unsigned integers:

In [428]: df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])


Out[428]:
int64 float64 bool1 bool2 tdeltas
0 1 4.0 True False NaT
1 2 5.0 False True 1 days
2 3 6.0 True False 1 days

To select string columns you must use the object dtype:

In [429]: df.select_dtypes(include=['object'])
Out[429]:
string
0 a
1 b
2 c

To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a
tree of child dtypes:

In [430]: def subdtypes(dtype):


.....: subs = dtype.__subclasses__()
.....: if not subs:
.....: return dtype
.....: return [dtype, [subdtypes(dt) for dt in subs]]
.....:

All NumPy dtypes are subclasses of numpy.generic:

In [431]: subdtypes(np.generic)
Out[431]:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.int64,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.uint64]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]

Note: Pandas also defines the types category and datetime64[ns, tz], which are not integrated into
the normal NumPy hierarchy and won’t show up with the above function.
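
They can still be selected by name, though; a minimal sketch using the df built above:

>>> df.select_dtypes(include=['category', 'datetimetz'])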


3.4 Intro to data structures

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get
you started. The fundamental behavior about data types, indexing, and axis labeling / alignment apply
across all of the objects. To get started, import NumPy and load pandas into your namespace:

In [1]: import numpy as np

In [2]: import pandas as pd

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will
not be broken unless done so explicitly by you.
We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and
methods in separate sections.

3.4.1 Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point
numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method
to create a Series is to call:

>>> s = pd.Series(data, index=index)

Here, data can be many different things:


• a Python dict
• an ndarray
• a scalar value (like 5)
The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:
From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created
having values [0, ..., len(data) - 1].
In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [4]: s
Out[4]:
a 0.544055
b 0.457937
c -0.055260
d 0.487673
e 0.021212
dtype: float64

In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0 -0.269995
1 -0.453290
2 0.118795
3 0.658569
4 -0.208094
dtype: float64

Note: pandas supports non-unique index values. If an operation that does not support duplicate index
values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all
performance-based (there are many instances in computations, like parts of GroupBy, where the index is not
used).
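
A minimal sketch of duplicate labels (s_dup is a throwaway name for this example):

>>> s_dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
>>> s_dup['a']            # both rows labeled 'a' come back
a    1
a    2
dtype: int64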

From dict
Series can be instantiated from dicts:

In [7]: d = {'b': 1, 'a': 0, 'c': 2}

In [8]: pd.Series(d)
Out[8]:
b 1
a 0
c 2
dtype: int64

Note: When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s
insertion order, if you’re using Python version >= 3.6 and Pandas version >= 0.23.
If you’re using Python < 3.6 or Pandas < 0.23, and an index is not passed, the Series index will be the
lexically ordered list of dict keys.

In the example above, if you were on a Python version lower than 3.6 or a Pandas version lower than 0.23,
the Series would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b',
'a', 'c']).
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
In [9]: d = {'a': 0., 'b': 1., 'c': 2.}

In [10]: pd.Series(d)
Out[10]:
a 0.0
b 1.0
c 2.0
dtype: float64

In [11]: pd.Series(d, index=['b', 'c', 'd', 'a'])


Out[11]:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

Note: NaN (not a number) is the standard missing data marker used in pandas.

From scalar value


If data is a scalar value, an index must be provided. The value will be repeated to match the length of
index.

In [12]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])


Out[12]:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64

Series is ndarray-like

Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However,
operations such as slicing will also slice the index.
In [13]: s[0]
Out[13]: 0.544055375065062

In [14]: s[:3]
Out[14]:
a 0.544055
b 0.457937
c -0.055260
dtype: float64

In [15]: s[s > s.median()]


Out[15]:
a 0.544055
d 0.487673
dtype: float64

In [16]: s[[4, 3, 1]]


Out[16]:
e 0.021212
d 0.487673
b 0.457937
dtype: float64

In [17]: np.exp(s)
Out[17]:
a 1.722980
b 1.580809
c 0.946239
d 1.628523
e 1.021439
dtype: float64

Note: We will address array-based indexing like s[[4, 3, 1]] in the section on indexing.

Like a NumPy array, a pandas Series has a dtype.

In [18]: s.dtype
Out[18]: dtype('float64')

This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few
places, in which case the dtype would be a ExtensionDtype. Some examples within pandas are categorical
and Nullable integer data type. See dtypes for more.
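
For example, a nullable-integer Series carries an extension dtype (a minimal sketch):

>>> pd.Series([1, 2, None], dtype='Int64').dtype
Int64Dtype()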
If you need the actual array backing a Series, use Series.array.

In [19]: s.array
Out[19]:
<PandasArray>
[ 0.544055375065062, 0.4579369393771382, -0.05526002668877243,
0.48767341506473233, 0.021212151323940893]
Length: 5, dtype: float64

Accessing the array can be useful when you need to do some operation without the index (to disable automatic
alignment, for example).
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one
or more concrete arrays like a numpy.ndarray. Pandas knows how to take an ExtensionArray and store it
in a Series or a column of a DataFrame. See dtypes for more.
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

In [20]: s.to_numpy()
Out[20]: array([ 0.54405538, 0.45793694, -0.05526003, 0.48767342, 0.02121215])

Even if the Series is backed by a ExtensionArray, Series.to_numpy() will return a NumPy ndarray.

Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:
In [21]: s['a']
Out[21]: 0.544055375065062

In [22]: s['e'] = 12.

In [23]: s
Out[23]:
a 0.544055
b 0.457937
c -0.055260
d 0.487673
e 12.000000
dtype: float64

In [24]: 'e' in s
Out[24]: True

In [25]: 'f' in s
Out[25]: False
If a label is not contained, an exception is raised:

>>> s['f']
KeyError: 'f'

Using the get method, a missing label will return None or the specified default:

In [26]: s.get('f')

In [27]: s.get('f', np.nan)


Out[27]: nan

See also the section on attribute access.

Vectorized operations and label alignment with Series

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same
is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting
an ndarray.
In [28]: s + s
Out[28]:
a 1.088111
b 0.915874
c -0.110520
d 0.975347
e 24.000000
dtype: float64

In [29]: s * 2
Out[29]:
a 1.088111
b 0.915874
c -0.110520
d 0.975347
e 24.000000
dtype: float64

In [30]: np.exp(s)
Out[30]:
a 1.722980
b 1.580809
c 0.946239
d 1.628523
e 162754.791419
dtype: float64
A key difference between Series and ndarray is that operations between Series automatically align the data
based on label. Thus, you can write computations without giving consideration to whether the Series involved
have the same labels.

In [31]: s[1:] + s[:-1]


Out[31]:
a NaN
b 0.915874
c -0.110520
d 0.975347
e NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label
is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code
without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis
and research. The integrated data alignment features of the pandas data structures set pandas apart from
the majority of related tools for working with labeled data.

Note: In general, we chose to make the default result of operations between differently indexed objects
yield the union of the indexes in order to avoid loss of information. Having an index label, though the data
is missing, is typically important information as part of a computation. You of course have the option of
dropping labels with missing data via the dropna function.
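
Continuing the example above, a minimal sketch:

>>> (s[1:] + s[:-1]).dropna()
b    0.915874
c   -0.110520
d    0.975347
dtype: float64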

Name attribute

Series can also have a name attribute:


In [32]: s = pd.Series(np.random.randn(5), name='something')

In [33]: s
Out[33]:
0 0.482656
1 0.041324
2 0.376516
3 0.386480
4 0.884489
Name: something, dtype: float64

In [34]: s.name
Out[34]: 'something'
The Series name will be assigned automatically in many cases, in particular when taking 1D slices of
DataFrame as you will see below.
New in version 0.18.0.
You can rename a Series with the pandas.Series.rename() method.

In [35]: s2 = s.rename("different")

In [36]: s2.name
Out[36]: 'different'

Note that s and s2 refer to different objects.

3.4.2 DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can
think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly
used pandas object. Like Series, DataFrame accepts many different kinds of input:
• Dict of 1D ndarrays, lists, dicts, or Series
• 2-D numpy.ndarray
• Structured or record ndarray
• A Series
• Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting
DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed
index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.

Note: When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by
the dict’s insertion order, if you are using Python version >= 3.6 and Pandas >= 0.23.
If you are using Python < 3.6 or Pandas < 0.23, and columns is not specified, the DataFrame columns will
be the lexically ordered list of dict keys.

From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts,
these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict
keys.
In [37]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:

In [38]: df = pd.DataFrame(d)

In [39]: df
Out[39]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0

In [40]: pd.DataFrame(d, index=['d', 'b', 'a'])


Out[40]:
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0

In [41]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])


Out[41]:
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN
The row and column labels can be accessed respectively by accessing the index and columns attributes:

Note: When a particular set of columns is passed along with a dict of data, the passed columns override
the keys in the dict.

In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')

From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as
the arrays. If no index is passed, the result will be range(n), where n is the array length.
In [44]: d = {'one': [1., 2., 3., 4.],
....: 'two': [4., 3., 2., 1.]}
....:

In [45]: pd.DataFrame(d)
Out[45]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0

In [46]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])


Out[46]:
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0

From structured or record array

This case is handled identically to a dict of arrays.


In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
Out[49]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])


Out[50]:
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])


Out[51]:
C A B
0 b'Hello' 1 2.0
1 b'World' 2 3.0

Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

From a list of dicts

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
Out[53]:
a b c
0 1 2 NaN
1 5 10 20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])


Out[54]:
a b c
first 1 2 NaN
second 5 10 20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])


Out[55]:
a b
0 1 2
1 5 10

From a dict of tuples

You can automatically create a MultiIndexed frame by passing a tuples dictionary.

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},


....: ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
....: ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
....: ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
....: ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
....:
Out[56]:
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0

From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name
is the original name of the Series (only if no other column name provided).
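
A minimal sketch (ser is a throwaway name for this example):

>>> ser = pd.Series([1., 2., 3.], index=['a', 'b', 'c'], name='ser')
>>> pd.DataFrame(ser)
   ser
a  1.0
b  2.0
c  3.0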
Missing data
Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing
data, we use np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the
data argument to the DataFrame constructor, and its masked entries will be considered missing.
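
A minimal sketch of the masked-array case (masked is a throwaway name):

>>> masked = np.ma.masked_array([1.0, 2.0], mask=[False, True])
>>> pd.DataFrame(masked, columns=['a'])    # the masked entry becomes NaN
     a
0  1.0
1  NaN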

Alternate constructors

DataFrame.from_dict

DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It
operates like the DataFrame constructor except for the orient parameter which is 'columns' by default,
but which can be set to 'index' in order to use the dict keys as row labels.

In [57]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))


Out[57]:
A B
0 1 4
1 2 5
2 3 6

If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired
column names:

In [58]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),


....: orient='index', columns=['one', 'two', 'three'])
....:
Out[58]:
one two three
A 1 2 3
B 4 5 6

DataFrame.from_records
DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously
to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of
the structured dtype. For example:
In [59]: data
Out[59]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [60]: pd.DataFrame.from_records(data, index='C')


Out[60]:
A B
C
b'Hello' 1 2.0
b'World' 2 3.0

Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and
deleting columns works with the same syntax as the analogous dict operations:

In [61]: df['one']
Out[61]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64

In [62]: df['three'] = df['one'] * df['two']


In [63]: df['flag'] = df['one'] > 2

In [64]: df
Out[64]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False

Columns can be deleted or popped like with a dict:

In [65]: del df['two']

In [66]: three = df.pop('three')

In [67]: df
Out[67]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False

When inserting a scalar value, it will naturally be propagated to fill the column:

In [68]: df['foo'] = 'bar'

In [69]: df
Out[69]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the
DataFrame’s index:

In [70]: df['one_trunc'] = df['one'][:2]

In [71]: df
Out[71]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
By default, columns get inserted at the end. The insert function is available to insert at a particular
location in the columns:

In [72]: df.insert(1, 'bar', df['one'])

In [73]: df
Out[73]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN

Assigning new columns in method chains

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new
columns that are potentially derived from existing columns.
In [74]: iris = pd.read_csv('data/iris.data')

In [75]: iris.head()
Out[75]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

In [76]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....:
Out[76]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
In the example above, we inserted a precomputed value. We can also pass in a function of one argument to
be evaluated on the DataFrame being assigned to.

In [77]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth'] / x['SepalLength'])).head()


Out[77]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000

assign always returns a copy of the data, leaving the original DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to
the DataFrame at hand. This is common when using assign in a chain of operations. For example, we can

limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and
plot:

In [78]: (iris.query('SepalLength > 5')
   ....:      .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
   ....:              PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
   ....:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
   ....:
Out[78]: <matplotlib.axes._subplots.AxesSubplot at 0x13806f910>

Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly,
this is the DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering
happens first, and then the ratio calculations. This is an example where we didn’t have a reference to the
filtered DataFrame available.
The function signature for assign is simply **kwargs. The keys are the column names for the new fields,
and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of
one argument to be called on the DataFrame. A copy of the original DataFrame is returned, with the new
values inserted.
Changed in version 0.23.0.
Starting with Python 3.6 the order of **kwargs is preserved. This allows for dependent assignment, where
an expression later in **kwargs can refer to a column created earlier in the same assign().


In [79]: dfa = pd.DataFrame({"A": [1, 2, 3],
   ....:                     "B": [4, 5, 6]})
   ....:

In [80]: dfa.assign(C=lambda x: x['A'] + x['B'],
   ....:            D=lambda x: x['A'] + x['C'])
   ....:
Out[80]:
A B C D
0 1 4 5 6
1 2 5 7 9
2 3 6 9 12

In the second expression, x['C'] will refer to the newly created column, which is equal to
dfa['A'] + dfa['B'].
To write code compatible with all versions of Python, split the assignment in two.

In [81]: dependent = pd.DataFrame({"A": [1, 1, 1]})

In [82]: (dependent.assign(A=lambda x: x['A'] + 1)
   ....:           .assign(B=lambda x: x['A'] + 2))
   ....:
Out[82]:
A B
0 2 4
1 2 4
2 2 4

Warning: Dependent assignment may subtly change the behavior of your code between Python 3.6
and older versions of Python.
If you wish to write code that supports versions of Python before and after 3.6, you'll need to take care
when passing assign expressions that
• Update an existing column
• Refer to the newly updated column in the same assign
For example, we’ll update column “A” and then refer to it when creating “B”.
>>> dependent = pd.DataFrame({"A": [1, 1, 1]})
>>> dependent.assign(A=lambda x: x["A"] + 1, B=lambda x: x["A"] + 2)

For Python 3.5 and earlier the expression creating B refers to the “old” value of A, [1, 1, 1]. The
output is then
A B
0 2 3
1 2 3
2 2 3

For Python 3.6 and later, the expression creating A refers to the “new” value of A, [2, 2, 2], which
results in


A B
0 2 4
1 2 4
2 2 4

Indexing / selection

The basics of indexing are as follows:

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame:
In [83]: df.loc['b']
Out[83]:
one 2
bar 2
flag False
foo bar
one_trunc 2
Name: b, dtype: object

In [84]: df.iloc[2]
Out[84]:
one 3
bar 3
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing.
We will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.

Data alignment and arithmetic

Data alignment between DataFrame objects automatically aligns on both the columns and the index
(row labels). Again, the resulting object will have the union of the column and row labels.

In [85]: df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

In [86]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

In [87]: df + df2
Out[87]:


A B C D
0 -0.694484 0.192766 -0.945741 NaN
1 0.292784 -0.453735 -0.159854 NaN
2 -0.904154 -0.209882 1.955291 NaN
3 2.029295 0.009923 -0.434713 NaN
4 1.166285 -1.157271 -0.100607 NaN
5 1.101510 -0.298808 -1.311913 NaN
6 0.033708 1.731677 0.180057 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN

When doing an operation between DataFrame and Series, the default behavior is to align the Series index
on the DataFrame columns, thus broadcasting row-wise. For example:

In [88]: df - df.iloc[0]
Out[88]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 0.823010 -1.407428 1.810920 0.036966
2 -0.836963 -0.013977 3.361133 -0.321214
3 0.215511 0.136409 2.241683 -1.609052
4 0.010095 -1.006592 2.969119 -1.362727
5 1.268823 -1.691757 2.108988 -2.197647
6 0.192297 0.926025 1.937398 -0.193649
7 -1.637169 1.843618 2.244498 -2.147770
8 2.009625 -2.136301 2.017400 1.580465
9 0.027738 1.062145 3.472820 -1.503278

In the special case of working with time series data, if the DataFrame index contains dates, the broadcasting
will be column-wise:
In [89]: index = pd.date_range('1/1/2000', periods=8)

In [90]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

In [91]: df
Out[91]:
A B C
2000-01-01 -0.334372 1.607388 -2.433251
2000-01-02 -0.196343 1.455554 -0.134245
2000-01-03 0.351420 1.148340 -1.691404
2000-01-04 -0.493614 0.073640 0.182179
2000-01-05 -0.203285 -0.010826 -1.539934
2000-01-06 -1.415928 -0.245523 -0.167154
2000-01-07 -0.223275 -1.167290 2.137569
2000-01-08 0.120902 -0.185518 1.927187

In [92]: type(df['A'])
Out[92]: pandas.core.series.Series

In [93]: df - df['A']
Out[93]:
            2000-01-01 00:00:00 2000-01-02 00:00:00 2000-01-03 00:00:00 2000-01-04 00:00:00 2000-01-05 00:00:00 2000-01-06 00:00:00 2000-01-07 00:00:00 2000-01-08 00:00:00   A   B   C
2000-01-01                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN
2000-01-02                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN
2000-01-03                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN
2000-01-04                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN
2000-01-05                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN
2000-01-06                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN
2000-01-07                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN
2000-01-08                  NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN                 NaN NaN NaN NaN

Warning: df - df['A'] is now deprecated and will be removed in a future release. The preferred
way to replicate this behavior is:

df.sub(df['A'], axis=0)

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
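
As a short illustration of that explicit control, the axis argument of the flexible arithmetic
methods chooses which axis to align on. A minimal sketch with a small frame:

>>> other = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
>>> other.sub(other['A'], axis=0)  # align the Series on the index, broadcasting column-wise
   A   B
0  0   9
1  0  18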
Operations with scalars are just as you would expect:
In [94]: df * 5 + 2
Out[94]:
A B C
2000-01-01 0.328140 10.036938 -10.166254
2000-01-02 1.018287 9.277769 1.328775
2000-01-03 3.757099 7.741701 -6.457019
2000-01-04 -0.468070 2.368199 2.910897
2000-01-05 0.983577 1.945869 -5.699668
2000-01-06 -5.079639 0.772383 1.164231
2000-01-07 0.883623 -3.836452 12.687845
2000-01-08 2.604509 1.072411 11.635937

In [95]: 1 / df
Out[95]:


A B C
2000-01-01 -2.990682 0.622128 -0.410973
2000-01-02 -5.093138 0.687024 -7.449068
2000-01-03 2.845600 0.870822 -0.591225
2000-01-04 -2.025875 13.579631 5.489096
2000-01-05 -4.919213 -92.369255 -0.649379
2000-01-06 -0.706251 -4.072930 -5.982514
2000-01-07 -4.478772 -0.856685 0.467821
2000-01-08 8.271174 -5.390317 0.518891

In [96]: df ** 4
Out[96]:
A B C
2000-01-01 0.012500 6.675478e+00 35.054796
2000-01-02 0.001486 4.488623e+00 0.000325
2000-01-03 0.015251 1.738931e+00 8.184445
2000-01-04 0.059368 2.940682e-05 0.001102
2000-01-05 0.001708 1.373695e-08 5.623517
2000-01-06 4.019430 3.633894e-03 0.000781
2000-01-07 0.002485 1.856589e+00 20.877601
2000-01-08 0.000214 1.184521e-03 13.794178
Boolean operators work as well:
In [97]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

In [98]: df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [99]: df1 & df2


Out[99]:
a b
0 False False
1 False True
2 True False

In [100]: df1 | df2


Out[100]:
a b
0 True True
1 True True
2 True True

In [101]: df1 ^ df2


Out[101]:
a b
0 True True
1 True False
2 False True

In [102]: -df1
Out[102]:
a b
0 False True
1 True False
2 False False


Transposing

To transpose, access the T attribute (also the transpose function), similar to an ndarray:

# only show the first 5 rows


In [103]: df[:5].T
Out[103]:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05
A -0.334372 -0.196343 0.351420 -0.493614 -0.203285
B 1.607388 1.455554 1.148340 0.073640 -0.010826
C -2.433251 -0.134245 -1.691404 0.182179 -1.539934

DataFrame interoperability with NumPy functions

Elementwise NumPy ufuncs (log, exp, sqrt, …) and various other NumPy functions can be used with no
issues on Series and DataFrame, assuming the data within are numeric:
In [104]: np.exp(df)
Out[104]:
A B C
2000-01-01 0.715788 4.989759 0.087751
2000-01-02 0.821731 4.286857 0.874376
2000-01-03 1.421084 3.152955 0.184261
2000-01-04 0.610416 1.076419 1.199829
2000-01-05 0.816046 0.989232 0.214395
2000-01-06 0.242700 0.782295 0.846069
2000-01-07 0.799894 0.311209 8.478801
2000-01-08 1.128514 0.830674 6.870160

In [105]: np.asarray(df)
Out[105]:
array([[-0.33437191, 1.60738755, -2.43325073],
[-0.19634262, 1.45555386, -0.13424497],
[ 0.35141973, 1.1483402 , -1.69140383],
[-0.49361395, 0.0736397 , 0.18217936],
[-0.20328455, -0.01082611, -1.53993362],
[-1.41592783, -0.24552349, -0.16715382],
[-0.22327549, -1.16729044, 2.13756903],
[ 0.12090182, -0.18551784, 1.92718745]])
DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics and data model
are quite different in places from an n-dimensional array.
Series implements __array_ufunc__, which allows it to work with NumPy’s universal functions.
The ufunc is applied to the underlying array in a Series.

In [106]: ser = pd.Series([1, 2, 3, 4])

In [107]: np.exp(ser)
Out[107]:
0 2.718282
1 7.389056
2 20.085537
3 54.598150
dtype: float64

Changed in version 0.25.0: When multiple Series are passed to a ufunc, they are aligned before performing
the operation.
Like other parts of the library, pandas will automatically align labeled inputs as part of a ufunc with
multiple inputs. For example, using numpy.remainder() on two Series with differently ordered labels will
align before the operation.
In [108]: ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [109]: ser2 = pd.Series([1, 3, 5], index=['b', 'a', 'c'])

In [110]: ser1
Out[110]:
a 1
b 2
c 3
dtype: int64

In [111]: ser2
Out[111]:
b 1
a 3
c 5
dtype: int64

In [112]: np.remainder(ser1, ser2)


Out[112]:
a 1
b 0
c 3
dtype: int64
As usual, the union of the two indices is taken, and non-overlapping values are filled with missing values.
In [113]: ser3 = pd.Series([2, 4, 6], index=['b', 'c', 'd'])

In [114]: ser3
Out[114]:
b 2
c 4
d 6
dtype: int64

In [115]: np.remainder(ser1, ser3)


Out[115]:
a NaN
b 0.0
c 3.0
d NaN
dtype: float64
When a binary ufunc is applied to a Series and Index, the Series implementation takes precedence
and a Series is returned.

In [116]: ser = pd.Series([1, 2, 3])

In [117]: idx = pd.Index([4, 5, 6])

In [118]: np.maximum(ser, idx)


Out[118]:
0 4
1 5
2 6
dtype: int64

NumPy ufuncs are safe to apply to Series backed by non-ndarray arrays, for example SparseArray (see
Sparse calculation). If possible, the ufunc is applied without converting the underlying data to an ndarray.
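
A minimal sketch of that behavior (assuming the SparseArray API as of 0.25; the variable names
are illustrative):

>>> sparr = pd.Series(pd.SparseArray([1.0, -1.0, 0.0]))
>>> result = np.abs(sparr)   # the ufunc is applied to the SparseArray itself
>>> result.dtype             # still a Sparse[float64, ...] dtype; the data was not densified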

Console display

Very large DataFrames will be truncated to display them in the console. You can also get a summary using
info(). (Here I am reading a CSV version of the baseball dataset from the plyr R package):
In [119]: baseball = pd.read_csv('data/baseball.csv')

In [120]: print(baseball)
       id     player  year  stint team  lg   g   ab   r    h  X2b  X3b  hr   rbi   sb   cs  bb    so  ibb  hbp   sh   sf  gidp
0   88641  womacto01  2006      2  CHN  NL  19   50   6   14    1    0   1   2.0  1.0  1.0   4   4.0  0.0  0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  BOS  AL  31    2   0    1    0    0   0   0.0  0.0  0.0   0   1.0  0.0  0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...  ..  ..  ...  ..  ...  ...  ...  ..   ...  ...  ...  ..   ...  ...  ...  ...  ...   ...
98  89533   aloumo01  2007      1  NYN  NL  87  328  51  112   19    1  13  49.0  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  NYN  NL   8   22   1    3    1    0   0   0.0  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0

[100 rows x 23 columns]

In [121]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id 100 non-null int64
player 100 non-null object
year 100 non-null int64
stint 100 non-null int64
team 100 non-null object
lg 100 non-null object
g 100 non-null int64
ab 100 non-null int64
r 100 non-null int64
h 100 non-null int64
X2b 100 non-null int64
X3b 100 non-null int64
hr 100 non-null int64
rbi 100 non-null float64
sb 100 non-null float64
cs 100 non-null float64
bb 100 non-null int64
so 100 non-null float64
ibb 100 non-null float64
hbp 100 non-null float64
sh 100 non-null float64
sf 100 non-null float64
gidp 100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.1+ KB
However, using to_string will return a string representation of the DataFrame in tabular form, though it
won’t always fit the console width:

In [122]: print(baseball.iloc[-20:, :12].to_string())


id player year stint team lg g ab r h X2b X3b
80 89474 finlest01 2007 1 COL NL 43 94 9 17 3 0
81 89480 embreal01 2007 1 OAK AL 4 0 0 0 0 0
82 89481 edmonji01 2007 1 SLN NL 117 365 39 92 15 2
83 89482 easleda01 2007 1 NYN NL 76 193 24 54 6 0
84 89489 delgaca01 2007 1 NYN NL 139 538 71 139 30 0
85 89493 cormirh01 2007 1 CIN NL 6 0 0 0 0 0
86 89494 coninje01 2007 2 NYN NL 21 41 2 8 2 0
87 89495 coninje01 2007 1 CIN NL 80 215 23 57 11 1
88 89497 clemero02 2007 1 NYA AL 2 2 0 1 0 0
89 89498 claytro01 2007 2 BOS AL 8 6 1 0 0 0
90 89499 claytro01 2007 1 TOR AL 69 189 23 48 14 0
91 89501 cirilje01 2007 2 ARI NL 28 40 6 8 4 0
92 89502 cirilje01 2007 1 MIN AL 50 153 18 40 9 2
93 89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 0
94 89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 3
95 89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0
96 89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0
97 89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0

Wide DataFrames will be printed across multiple rows by default:

In [123]: pd.DataFrame(np.random.randn(3, 12))


Out[123]:
          0         1         2         3         4         5         6         7  \
0 -0.835921 -0.701274  1.007114 -0.587443  0.217855 -1.238129  1.317672 -1.082700
1  1.500972 -0.190502  1.176097 -0.133727 -0.629027  0.254270 -1.944128  0.412923
2 -1.410771  0.289453  0.644314  0.589727  0.257609  1.860029  0.782877 -0.786969

          8         9         10        11
0  0.897335  0.877220 -0.566652 -0.530428
1 -1.276202  0.162727  1.015371 -1.278855
2  0.052563 -0.863285  0.717709  1.216119

You can change how much to print on a single row by setting the display.width option:


In [124]: pd.set_option('display.width', 40) # default is 80

In [125]: pd.DataFrame(np.random.randn(3, 12))


Out[125]:
          0         1         2  \
0  0.909755  0.209170 -0.369342
1 -1.947718 -1.745282  0.953580
2 -0.216119  1.261238 -1.854250

          3         4         5  \
0 -0.045207  0.252971  0.501157
1  0.278749  1.523093 -0.481708
2 -0.041328 -1.883850  0.427516

          6         7         8  \
0 -0.342172  0.199789 -1.001807
1  2.247095  0.486608  0.885270
2 -0.324040 -0.376712  0.180279

          9         10        11
0 -0.548156 -0.720468 -0.758675
1  0.587560  0.195590 -0.485761
2 -1.032801 -0.012413  2.057730

You can adjust the max width of the individual columns by setting display.max_colwidth

In [126]: datafile = {'filename': ['filename_01', 'filename_02'],
    .....:            'path': ["media/user_name/storage/folder_01/filename_01",
    .....:                     "media/user_name/storage/folder_02/filename_02"]}
    .....:

In [127]: pd.set_option('display.max_colwidth', 30)

In [128]: pd.DataFrame(datafile)
Out[128]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...

In [129]: pd.set_option('display.max_colwidth', 100)

In [130]: pd.DataFrame(datafile)
Out[130]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02

You can also disable this feature via the expand_frame_repr option. This will print the table in one block.
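
A quick sketch of toggling that option:

pd.set_option('expand_frame_repr', False)  # repr wide frames in one block, truncating columns
pd.reset_option('expand_frame_repr')       # restore the default behavior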

DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:
In [131]: df = pd.DataFrame({'foo1': np.random.randn(5),
.....: 'foo2': np.random.randn(5)})
.....:

In [132]: df
Out[132]:
foo1 foo2
0 0.175701 0.688798
1 1.857269 0.874645
2 1.062442 -0.593865
3 0.780121 0.422091
4 0.357684 -0.227634

In [133]: df.foo1
Out[133]:
0 0.175701
1 1.857269
2 1.062442
3 0.780121
4 0.357684
Name: foo1, dtype: float64
The columns are also connected to the IPython completion mechanism so they can be tab-completed:

In [5]: df.fo<TAB> # noqa: E225, E999


df.foo1 df.foo2


3.5 Comparison with other tools


3.5.1 Comparison with R / R libraries

Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for,
this page was started to provide a more detailed look at the R language and its many third party libraries
as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
• Functionality / flexibility: what can/cannot be done with each tool
• Performance: how fast are operations. Hard numbers/benchmarks are preferable
• Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side
code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files, see External compatibility
for an example.

Quick reference

We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas
equivalents.


Querying, filtering, sampling

R pandas
dim(df) df.shape
head(df) df.head()
slice(df, 1:10) df.iloc[:9]
filter(df, col1 == 1, col2 == 1) df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,] df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2) df[['col1', 'col2']]
select(df, col1:col3) df.loc[:, 'col1':'col3']
select(df, -(col1:col3)) df.drop(cols_to_drop, axis=1) but see footnote 1
distinct(select(df, col1)) df[['col1']].drop_duplicates()
distinct(select(df, col1, col2)) df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10) df.sample(n=10)
sample_frac(df, 0.01) df.sample(frac=0.01)

Sorting

R pandas
arrange(df, col1, col2) df.sort_values(['col1', 'col2'])
arrange(df, desc(col1)) df.sort_values('col1', ascending=False)

Transforming

R pandas
select(df, col_one = col1) df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1) df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b) df.assign(c=df.a-df.b)

Grouping and summarizing

R pandas
summary(df) df.describe()
gdf <- group_by(df, col1) gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE)) df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1)) df.groupby('col1').sum()

Base R

Slicing with R’s c

R makes it easy to access data.frame columns by name


1 R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas, if you have the

list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.


df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))

df[, c("a", "c", "e")]

or by integer location
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward


In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

In [2]: df[['a', 'c']]


Out[2]:
a c
0 -0.828480 1.681178
1 1.743374 0.633181
2 0.422089 0.361566
3 -0.059013 0.317450
4 0.290561 -0.038037
5 0.470639 0.018485
6 -0.177483 -0.993452
7 0.506565 -1.156543
8 -0.522029 -0.897365
9 -2.230400 0.898255

In [3]: df.loc[:, ['a', 'c']]


Out[3]:
a c
0 -0.828480 1.681178
1 1.743374 0.633181
2 0.422089 0.361566
3 -0.059013 0.317450
4 0.290561 -0.038037
5 0.470639 0.018485
6 -0.177483 -0.993452
7 0.506565 -1.156543
8 -0.522029 -0.897365
9 -2.230400 0.898255
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the
iloc indexer attribute and numpy.r_.
In [4]: named = list('abcdefg')

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]


Out[8]:
            a          b          c          d          e          f          g          7          8          9         24         25         26         27         28         29
0   -0.161711  -0.622519   0.208225   0.915346   0.635973   1.944184  -1.173432   0.573772  -1.363568  -0.984202  -0.582592   0.465614   0.139290   1.806161   1.121979   0.202927
1    0.527815  -0.624958   1.284564  -2.344063   0.005860   2.101818  -0.416434  -0.426243  -0.127257  -1.958374  -0.330766  -1.211749   1.522725  -0.473234  -0.002865  -0.621242
2   -0.999066   0.029902   0.703150  -2.205892  -1.965842  -0.314911  -0.289613   0.254820  -0.857537  -2.803506   0.819210  -1.112451  -0.683897  -0.220029   1.133887   0.688133
3   -1.463074  -0.566623   0.162533   0.766610  -0.805727   0.945124   0.087959   1.724708  -0.122325   0.021923  -0.807053   0.445671   1.154067  -0.734967   0.440447   0.769739
4    0.215595  -0.513622  -0.724535   0.334407  -1.170314  -0.475158  -0.812475   1.355643   0.593809   1.196558  -0.071286  -0.009840  -0.010951   1.024874  -0.393390  -1.440673
5    1.138059  -0.847605  -0.992190  -0.668747  -0.768389  -1.694061  -0.246534   0.504333   0.442761  -0.354326   1.956604   1.402969  -0.266741  -1.130245   0.421790   0.223395
6   -1.162280  -1.597809   0.243894  -0.561676   0.037052  -0.938355  -0.108639   0.299087   0.983394   0.703327   0.653705   0.461272   0.631294   0.167015  -0.784300  -1.068329
7    0.493737  -0.733397  -2.228098  -0.006975  -0.664988   0.472423  -0.737506   0.138756  -0.128530   1.142776  -1.264995   0.357805   1.130951   0.821112   1.169008   0.630292
8    0.040723   0.082398   1.942241   1.657618   0.356987   0.777680   0.984500   1.089167  -1.101399  -0.587561  -0.936084  -0.131603  -0.176408   0.950228   0.637109  -1.124648
9   -1.311002   1.724906  -2.441010   0.648934  -0.416592   0.743233  -1.159325  -1.274940   1.215875  -0.338257  -1.396232   0.764043   0.271799   1.091285   2.555156   1.391844
10  -0.943759  -0.134100   0.273861  -0.955675   0.460598  -1.577831  -0.441305  -0.322939   1.522145   0.707912  -0.118141  -1.015213   1.228203  -2.588917   0.573895  -0.910279
11   1.692702  -0.987310   1.567212  -0.529861  -0.537105   2.620652   0.692978   0.280707  -0.625992  -1.424688   0.131929  -0.702685  -1.210344  -0.221143  -0.299330  -0.259589
12   0.915448   0.013494  -0.180022  -0.965226  -1.691447   0.020024   0.093684  -0.587921  -1.342976   0.560023   0.734603  -1.475418  -0.827790   0.794828  -1.044533  -0.177689
13   1.348951   0.558242   0.009562   0.917228   0.300415   0.567582  -0.225959  -1.008159   0.411266   0.542485   1.244790  -0.186405   0.044299   0.004000   0.411114   0.183082
14   0.390766   2.131754   0.138597  -0.105058  -0.625766  -0.339669  -0.571208   1.046532  -1.094131  -0.387665  -1.297371  -0.655904  -0.874871   1.742691  -0.624580  -0.060751
15  -0.565421  -0.577121   0.468792  -0.683787   0.270778  -0.018834   1.345719   0.403336   0.619400   0.715512  -0.666283   0.606945  -0.721326   0.002093   0.254483   0.713546
16   0.194919   0.670606  -1.058857   1.940832   0.222525   0.052094   1.914943  -0.447290  -1.881098  -0.325380  -0.406802   0.526870   0.799240  -0.450075   0.290738  -0.985542
17   0.312100  -0.360546  -1.462463   0.178105  -1.126410   0.366556  -1.951150   0.023669  -1.236156   1.025478   0.822134   0.852692  -0.979683   0.270377   2.082737   0.668868
18   1.252015   0.063103  -1.351977   1.082954   0.852517  -0.214895  -0.801415   0.365058  -0.273134   0.921398   1.278862  -1.748976  -1.170064   0.960483  -0.697434   0.628717
19  -0.366057  -0.753817  -0.702712  -1.016079   1.257663   0.041608   0.814353   0.108811   0.191233   1.038856  -0.796688  -0.781127   1.369885   2.732703   0.849647  -0.548488
20  -0.557851   0.782481   0.668544  -0.239956  -0.894642   0.805307   0.380797   0.609791   0.568315   0.256028   1.098834   0.456425   0.169587   1.508115   1.472847  -0.904914
21  -0.720713  -0.229595  -1.145494  -0.055998  -0.987261   0.709130   1.051061   1.006326  -2.875135  -0.326405   0.367558  -1.770901  -0.696234  -3.050078   0.108486   2.290667
22   1.372930   0.444712  -0.111335  -0.296310  -1.486215  -0.462569   0.506591  -1.211042  -0.762951  -0.312849   0.419791   1.714021  -1.206119  -0.755663  -1.079239   0.855764
23   0.121264  -0.685644   1.979086  -1.133284  -0.868269   0.467931  -1.953613  -0.083020  -0.600455   0.198438   0.625309   0.509967  -1.606826  -1.005504   0.063706   1.112144
24   0.128090  -0.330723   0.724605  -0.088528   0.138086   2.069012  -0.122688   1.117345  -0.875341   0.477046  -0.718772   1.051831  -1.241496  -0.822363  -1.143073  -0.814782
25  -2.641425   0.384697  -0.952814  -1.825965  -0.615649   1.104175   0.496878   1.184616  -0.110253   0.169438   1.223878  -0.931636  -0.557628   1.691746  -1.814562  -1.042293
26   0.526819  -0.128238   1.366779   0.579849   0.802006   0.632004  -0.132257  -0.338430   0.624478   0.788751   1.738257   0.021383   0.226782  -0.136237  -0.286149   1.044522
27  -0.496393   0.199336  -0.986075  -0.041069  -0.112649  -1.521067  -2.245727  -1.776357   2.644873   0.679011  -2.417109   1.155864  -0.305424  -0.006122  -0.794772  -0.737119
28  -1.925161  -2.690689   0.309696  -0.098474   0.011248  -0.146943  -0.262566   0.491443  -0.979954  -0.100812   0.169930   0.414276   0.741593  -0.640909   0.816449   0.711821
29   0.209986  -1.146731  -0.425711   1.538095   0.396371  -1.225278   1.326060  -0.357467  -0.074940  -0.871937   0.024117   0.762464  -0.468373   0.352259  -1.788974  -0.560986

aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df
and splitting it into groups by1 and by2:

df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.

In [9]: df = pd.DataFrame(
...: {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
...: 'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
...: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
...: np.nan]})
...:

In [10]: g = df.groupby(['by1', 'by2'])

In [11]: g[['v1', 'v2']].mean()


Out[11]:
v1 v2
by1 by2
1 95 5.0 55.0
99 5.0 55.0
2 95 7.0 77.0
99 NaN NaN
big damp 3.0 33.0
blue dry 3.0 33.0
red red 4.0 44.0
wet 1.0 11.0

For more details and examples see the groupby documentation.


match / %in%

A common way to select data in R is using %in% which is defined using the function match. The operator
%in% is used to return a logical vector indicating if there is a match or not:

s <- 0:4
s %in% c(2,4)

The isin() method is similar to the R %in% operator:

In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])


Out[13]:
0 False
1 False
2 True
3 False
4 True
dtype: bool

The match function returns a vector of the positions of matches of its first argument in its second:

s <- 0:4
match(s, c(2,4))
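
pandas has no direct match equivalent, but Index.get_indexer() provides a similar positional
lookup (a sketch; -1 plays the role of R's NA for values without a match):

>>> s = pd.Series(np.arange(5))
>>> pd.Index([2, 4]).get_indexer(s)
array([-1, -1,  0, -1,  1])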

For more details and examples see the reshaping documentation.

tapply

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly
irregular. Using a data.frame called baseball, and retrieving information based on the array team:

baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team, max)

In pandas we may use the pivot_table() method to handle this:

In [14]: import random

In [15]: import string

In [16]: baseball = pd.DataFrame(
   ....:     {'team': ["team %d" % (x + 1) for x in range(5)] * 5,
   ....:      'player': random.sample(list(string.ascii_lowercase), 25),
   ....:      'batting avg': np.random.uniform(.200, .400, 25)})
   ....:

In [17]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)


Out[17]:
team team 1 team 2 team 3 team 4 team 5
batting avg 0.375407 0.386991 0.382943 0.3652 0.370616

For more details and examples see the reshaping documentation.

subset

The query() method is similar to the base R subset function. In R you might want to get the rows of a
data.frame where one column’s values are less than another column’s values:

df <- data.frame(a=rnorm(10), b=rnorm(10))

subset(df, a <= b)
df[df$a <= df$b,] # note the comma

In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it
were an index/slice as well as standard boolean indexing:
In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [19]: df.query('a <= b')


Out[19]:
a b
2 -1.356239 -0.339505

In [20]: df[df.a <= df.b]


Out[20]:
a b
2 -1.356239 -0.339505

In [21]: df.loc[df.a <= df.b]


Out[21]:
a b
2 -1.356239 -0.339505
For more details and examples see the query documentation.

with

An expression using a data.frame called df in R with the columns a and b would be evaluated using with
like so:

df <- data.frame(a=rnorm(10), b=rnorm(10))

with(df, a + b)
df$a + df$b # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:
In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [23]: df.eval('a + b')


Out[23]:
0 1.582100
1 1.102172
2 -0.621243
3 0.059273
4 -1.744880
5 0.915561
6 -2.643852
7 -0.519496
8 0.733243
9 -2.914023
dtype: float64

In [24]: df.a + df.b # same as the previous expression


Out[24]:
0 1.582100
1 1.102172
2 -0.621243
3 0.059273
4 -1.744880
5 0.915561
6 -2.643852
7 -0.519496
8 0.733243
9 -2.914023
dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples
see the eval documentation.

plyr

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around
three data structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how
these data structures could be mapped in Python.

R Python
array list
lists dictionary or list of objects
data.frame dataframe

ddply

An expression using a data.frame called df in R where you want to summarize x by month:

require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)

ddply(df, .(month, week), summarize,
      mean = round(mean(x), 2),
      sd = round(sd(x), 2))

In pandas the equivalent expression, using the groupby() method, would be:

In [25]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 120),
   ....:                    'y': np.random.uniform(7., 334., 120),
   ....:                    'z': np.random.uniform(1.7, 20.7, 120),
   ....:                    'month': [5, 6, 7, 8] * 30,
   ....:                    'week': np.random.randint(1, 4, 120)})
   ....:

In [26]: grouped = df.groupby(['month', 'week'])

In [27]: grouped['x'].agg([np.mean, np.std])


Out[27]:
mean std
month week
5 1 65.472325 45.594965
2 69.963684 53.928648
3 58.151030 65.291197
6 1 90.928081 51.648455
2 95.919255 36.887073
3 102.051638 54.414241
7 1 101.783668 39.286572
2 81.657661 54.996207
3 73.186147 46.615222
8 1 85.351475 54.016463
2 88.593449 39.724889
3 82.336311 45.392411

For more details and examples see the groupby documentation.

reshape / reshape2

melt.array

An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:

a <- array(c(1:23, NA), c(2,3,4))

data.frame(melt(a))

In Python, a is an ndarray here rather than a list, so you can melt it with a list comprehension over np.ndenumerate.

In [28]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)

In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])


Out[29]:
0 1 2 3
0 0 0 0 1.0
1 0 0 1 2.0
2 0 0 2 3.0
3 0 0 3 4.0
4 0 1 0 5.0
5 0 1 1 6.0
6 0 1 2 7.0
7 0 1 3 8.0
8 0 2 0 9.0
9 0 2 1 10.0
10 0 2 2 11.0
11 0 2 3 12.0
12 1 0 0 13.0
13 1 0 1 14.0
14 1 0 2 15.0
15 1 0 3 16.0
16 1 1 0 17.0
17 1 1 1 18.0
18 1 1 2 19.0
19 1 1 3 20.0
20 1 2 0 21.0
21 1 2 1 22.0
22 1 2 2 23.0
23 1 2 3 NaN

melt.list

An expression using a list called a in R where you want to melt it into a data.frame:

a <- as.list(c(1:4, NA))

data.frame(melt(a))

In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as
required.

In [30]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))

In [31]: pd.DataFrame(a)
Out[31]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN

For more details and examples see the Intro to Data Structures documentation.


melt.data.frame

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

cheese <- data.frame(
  first = c('John', 'Mary'),
  last = c('Doe', 'Bo'),
  height = c(5.5, 6.0),
  weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))

In Python, the melt() method is the R equivalent:


In [32]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
....: 'last': ['Doe', 'Bo'],
....: 'height': [5.5, 6.0],
....: 'weight': [130, 150]})
....:

In [33]: pd.melt(cheese, id_vars=['first', 'last'])


Out[33]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0

In [34]: cheese.set_index(['first', 'last']).stack() # alternative way


Out[34]:
first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64
For more details and examples see the reshaping documentation.

cast

In R, acast is used on a melted data.frame called df to cast it into a higher-dimensional array:

df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)

mdf <- melt(df, id=c("month", "week"))

acast(mdf, week ~ month ~ variable, mean)


In Python the best way is to make use of pivot_table():


In [35]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 12),
....: 'y': np.random.uniform(7., 334., 12),
....: 'z': np.random.uniform(1.7, 20.7, 12),
....: 'month': [5, 6, 7] * 4,
....: 'week': [1, 2] * 6})
....:

In [36]: mdf = pd.melt(df, id_vars=['month', 'week'])

In [37]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
   ....:                columns=['month'], aggfunc=np.mean)
   ....:
Out[37]:
month 5 6 7
variable week
x 1 33.676819 14.477056 84.573585
2 152.619607 70.434830 53.454458
y 1 182.422131 124.361465 209.180885
2 187.435897 256.116538 228.092618
z 1 10.668081 18.434863 11.888183
2 18.817980 16.421603 10.912752

Similarly for dcast which uses a data.frame called df in R to aggregate information based on Animal and
FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)

dcast(df, Animal ~ FeedType, sum, fill=NaN)

# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))

Python can approach this in two different ways. Firstly, similar to above using pivot_table():

In [38]: df = pd.DataFrame({
....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
....: 'Animal2', 'Animal3'],
....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
....: })
....:

In [39]: df.pivot_table(values='Amount', index='Animal', columns='FeedType',
   ....:                aggfunc='sum')
   ....:
Out[39]:
FeedType A B
Animal
Animal1 10.0 5.0
Animal2 2.0 13.0
Animal3 6.0 NaN

The second approach is to use the groupby() method:

In [40]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()


Out[40]:
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
Name: Amount, dtype: int64
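
If you want the groupby result in the same wide shape that dcast produces, one sketch is to
unstack the grouped result:

>>> df.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()
FeedType     A     B
Animal
Animal1   10.0   5.0
Animal2    2.0  13.0
Animal3    6.0   NaN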

For more details and examples see the reshaping documentation or the groupby documentation.

factor

pandas has a data type for categorical data.

cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))

In pandas this is accomplished with pd.cut and astype("category"):


In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
Out[41]:
0 (0.995, 2.667]
1 (0.995, 2.667]
2 (2.667, 4.333]
3 (2.667, 4.333]
4 (4.333, 6.0]
5 (4.333, 6.0]
dtype: category
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")


Out[42]:
0 1
1 2
2 3
3 2
4 2
5 3
dtype: category
Categories (3, int64): [1, 2, 3]
For more details and examples see categorical introduction and the API documentation. There is
also documentation on the differences relative to R's factor.


3.5.2 Comparison with SQL

Since many potential pandas users have some familiarity with SQL, this page is meant to provide some
examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a
DataFrame called tips and assume we have a database table of the same name and structure.

In [3]: url = ('https://raw.github.com/pandas-dev'
   ...:        '/pandas/master/pandas/tests/data/tips.csv')
   ...:

In [4]: tips = pd.read_csv(url)

In [5]: tips.head()
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

SELECT

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all
columns):

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

With pandas, column selection is done by passing a list of column names to your DataFrame:

In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)


Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner

Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
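
A computed column in the select list, such as SELECT *, tip/total_bill AS tip_rate FROM tips,
maps naturally onto assign() (a small sketch using the tips frame above):

tips.assign(tip_rate=tips['tip'] / tips['total_bill']).head(5)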


WHERE

Filtering in SQL is done via a WHERE clause.

SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;

DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.

In [7]: tips[tips['time'] == 'Dinner'].head(5)


Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows
with True.
In [8]: is_dinner = tips['time'] == 'Dinner'

In [9]: is_dinner.value_counts()
Out[9]:
True 176
False 68
Name: time, dtype: int64

In [10]: tips[is_dinner].head(5)
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).

-- tips of more than $5.00 at Dinner meals
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;

# tips of more than $5.00 at Dinner meals
In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
Out[11]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3

-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;

# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5

NULL checking is done using the notna() and isna() methods.

In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
   ....:                       'col2': ['F', np.NaN, 'G', 'H', 'I']})
   ....:

In [14]: frame
Out[14]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I

Assume we have a table of the same structure as our DataFrame above. We can see only the records
where col2 IS NULL with the following query:

SELECT *
FROM frame
WHERE col2 IS NULL;

In [15]: frame[frame['col2'].isna()]
Out[15]:
col1 col2
1 B NaN

Getting items where col1 IS NOT NULL can be done with notna().

SELECT *
FROM frame
WHERE col1 IS NOT NULL;

In [16]: frame[frame['col1'].notna()]
Out[16]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I

GROUP BY

In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method.
groupby() typically refers to a process where we'd like to split a dataset into groups, apply some function
(typically aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For
instance, a query getting us the number of tips left by sex:

SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/

The pandas equivalent would be:

In [17]: tips.groupby('sex').size()
Out[17]:
sex
Female 87
Male 157
dtype: int64

Notice that in the pandas code we used size() and not count(). This is because count() applies the
function to each column, returning the number of not null records within each.


In [18]: tips.groupby('sex').count()
Out[18]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157

Alternatively, we could have applied the count() method to an individual column:

In [19]: tips.groupby('sex')['total_bill'].count()
Out[19]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64

Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs
by day of the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which
functions to apply to specific columns.

SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/

In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})


Out[20]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62

Grouping by more than one column is done by passing a list of columns to the groupby() method.

SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/

In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})


Out[21]:
tip
size mean
smoker day
No Fri 4.0 2.812500
Sat 45.0 3.102889
Sun 57.0 3.167895
Thur 45.0 2.673778
Yes Fri 15.0 2.714000
Sat 42.0 2.875476
Sun 19.0 3.516842
Thur 17.0 3.030000

JOIN

JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their
indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT,
INNER, FULL) or the columns to join on (column names or indices).

In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                     'value': np.random.randn(4)})
   ....:

Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.

INNER JOIN

SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;

# merge performs an INNER JOIN by default
In [24]: pd.merge(df1, df2, on='key')
Out[24]:
key value_x value_y
0 B -1.181780 0.292988
1 D 0.606962 0.095945
2 D 0.606962 0.245328


merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another
DataFrame’s index.

In [25]: indexed_df2 = df2.set_index('key')

In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)


Out[26]:
key value_x value_y
1 B -1.181780 0.292988
3 D 0.606962 0.095945
3 D 0.606962 0.245328

LEFT OUTER JOIN

-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;

# show all records from df1
In [27]: pd.merge(df1, df2, on='key', how='left')
Out[27]:
key value_x value_y
0 A -0.630829 NaN
1 B -1.181780 0.292988
2 C -0.290432 NaN
3 D 0.606962 0.095945
4 D 0.606962 0.245328

RIGHT JOIN

-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;

# show all records from df2
In [28]: pd.merge(df1, df2, on='key', how='right')
Out[28]:
key value_x value_y
0 B -1.181780 0.292988
1 D 0.606962 0.095945
2 D 0.606962 0.245328
3 E NaN -0.807939


FULL JOIN

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined
columns find a match. At the time of writing, FULL JOINs are not supported in all RDBMS (for example, MySQL).

-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;

# show all records from both frames
In [29]: pd.merge(df1, df2, on='key', how='outer')
Out[29]:
key value_x value_y
0 A -0.630829 NaN
1 B -1.181780 0.292988
2 C -0.290432 NaN
3 D 0.606962 0.095945
4 D 0.606962 0.245328
5 E NaN -0.807939
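
If you need to audit which side of the FULL JOIN each row came from, merge() also accepts an
indicator argument (a small sketch):

pd.merge(df1, df2, on='key', how='outer', indicator=True)
# adds a _merge column taking the values 'left_only', 'right_only' or 'both'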

UNION

UNION ALL can be performed using concat().

In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
   ....:                     'rank': range(1, 4)})
   ....:

In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
   ....:                     'rank': [1, 4, 5]})
   ....:

SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/

In [32]: pd.concat([df1, df2])


Out[32]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5

SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.

SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/

In pandas, you can use concat() in conjunction with drop_duplicates().

In [33]: pd.concat([df1, df2]).drop_duplicates()


Out[33]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5

Pandas equivalents for some SQL analytic and aggregate functions

Top N rows with offset

-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;

In [34]: tips.nlargest(10 + 5, columns='tip').tail(10)


Out[34]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
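
An equivalent sketch using an explicit sort plus positional slicing (skip 5 rows, then take 10;
tie-breaking may differ slightly from nlargest):

tips.sort_values('tip', ascending=False).iloc[5:15]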

Top N rows per group

-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;

In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
   ....:                       .groupby(['day'])
   ....:                       .cumcount() + 1)
   ....:      .query('rn < 3')
   ....:      .sort_values(['day', 'rn']))
   ....:
Out[35]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2

The same result, using the rank(method='first') function:

In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
....: .rank(method='first', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['day', 'rnk']))
....:
Out[36]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0

-- Oracle's RANK() analytic function
SELECT * FROM (
  SELECT
    t.*,
    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;

Let’s find tips with (rank < 3) per gender group for (tips < 2). Notice that when using rank(method='min')
function rnk_min remains the same for the same tip (as Oracle’s RANK() function)

In [37]: (tips[tips['tip'] < 2]
   ....:     .assign(rnk_min=tips.groupby(['sex'])['tip']
   ....:                         .rank(method='min'))
   ....:     .query('rnk_min < 3')
   ....:     .sort_values(['sex', 'rnk_min']))
   ....:
Out[37]:
total_bill tip sex smoker day time size rnk_min
67 3.07 1.00 Female Yes Sat Dinner 1 1.0
92 5.75 1.00 Female Yes Fri Dinner 2 1.0
111 7.25 1.00 Female No Sat Dinner 1 1.0
236 12.60 1.00 Male Yes Sat Dinner 2 1.0
237 32.83 1.17 Male Yes Sat Dinner 2 2.0

UPDATE

UPDATE tips
SET tip = tip*2
WHERE tip < 2;

In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2

DELETE

DELETE FROM tips
WHERE tip > 9;

In pandas we select the rows that should remain instead of deleting them:


In [39]: tips = tips.loc[tips['tip'] <= 9]


3.5.3 Comparison with SAS

For potential users coming from SAS this page is meant to demonstrate how different SAS operations would
be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which
displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter
notebook or terminal) - the equivalent in SAS would be:

proc print data=df(obs=5);
run;

Data structures

General terminology translation

pandas      SAS
DataFrame   data set
column      variable
row         observation
groupby     BY-group
NaN         .

DataFrame / Series

A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns
that can be of different types. As will be shown in this document, almost any operation that can be applied
to a data set using SAS’s DATA step, can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data
structure for a single column, but in general, working with a Series is analogous to referencing a column in
the DATA step.


Index

Every DataFrame and Series has an Index - which are labels on the rows of the data. SAS does not have an
exactly analogous concept. A data set’s rows are essentially unlabeled, other than an implicit integer index
that can be accessed during the DATA step (_N_).
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1,
and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately
an important part of pandas to understand, for this comparison we will essentially ignore the Index and just
treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on
how to use an Index effectively.
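For orientation, here is a minimal sketch of the default integer index and of promoting a column to row
labels (illustrative only; it is not needed for the rest of this comparison):

import pandas as pd

df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})
# rows receive a default RangeIndex: 0, 1, 2, ...
df.index

# a column can be promoted to a labeled index; reset_index() undoes this
df_labeled = df.set_index('x')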

Data input / output

Constructing a DataFrame from values

A SAS data set can be built from specified values by placing the data after a datalines statement and
specifying the column names.

data df;
input x y;
datalines;
1 2
3 4
5 6
;
run;

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the
data.

In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})

In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6

Reading external data

Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within
the pandas tests (csv) will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.

proc import datafile='tips.csv' dbms=csv out=tips replace;
getnames=yes;
run;

The pandas method is read_csv(), which works similarly.


In [5]: url = ('https://raw.github.com/pandas-dev/'
...: 'pandas/master/pandas/tests/data/tips.csv')
...:

In [6]: tips = pd.read_csv(url)

In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed.
For example, if the data was instead tab delimited, and did not have column names, the pandas command
would be:

tips = pd.read_csv('tips.csv', sep='\t', header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter
tips = pd.read_table('tips.csv', header=None)

In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL
databases. These are all read via a pd.read_* function. See the IO documentation for more details.
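For instance, a hypothetical Excel workbook could be read with the same pattern (a sketch; 'tips.xlsx' is an
illustrative file name, and an Excel engine such as xlrd must be installed):

import pandas as pd

# read the first worksheet of a (hypothetical) Excel file
tips_xl = pd.read_excel('tips.xlsx', sheet_name=0)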

Exporting data

The inverse of PROC IMPORT in SAS is PROC EXPORT

proc export data=tips outfile='tips2.csv' dbms=csv;
run;

Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar API.

tips.to_csv('tips2.csv')

Data operations

Operations on columns

In the DATA step, arbitrary math expressions can be used on new or existing columns.

data tips;
set tips;
total_bill = total_bill - 2;
new_bill = total_bill / 2;
run;

pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way.


In [8]: tips['total_bill'] = tips['total_bill'] - 2

In [9]: tips['new_bill'] = tips['total_bill'] / 2.0

In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295

Filtering

Filtering in SAS is done with an if or where statement, on one or more columns.

data tips;
set tips;
if total_bill > 10;
run;

data tips;
set tips;
where total_bill > 10;
/* equivalent in this case - where happens before the
DATA step begins and can also be used in PROC statements */
run;

DataFrames can be filtered in multiple ways, the most intuitive of which is boolean indexing.

In [11]: tips[tips['total_bill'] > 10].head()
Out[11]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
5 23.29 4.71 Male No Sun Dinner 4

If/then logic

In SAS, if/then logic can be used to create new columns.

data tips;
set tips;
format bucket $4.;

if total_bill < 10 then bucket = 'low';
else bucket = 'high';
run;


The same operation in pandas can be accomplished using numpy's where function.

In [12]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')

In [13]: tips.head()
Out[13]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high

Date functionality

SAS provides a variety of functions to do operations on date/datetime columns.

data tips;
set tips;
format date1 date2 date1_plusmonth mmddyy10.;
date1 = mdy(1, 15, 2013);
date2 = mdy(2, 15, 2015);
date1_year = year(date1);
date2_month = month(date2);
* shift date to beginning of next interval;
date1_next = intnx('MONTH', date1, 1);
* count intervals between dates;
months_between = intck('MONTH', date1, date2);
run;

The equivalent pandas operations are shown below. In addition to these functions pandas supports other
Time Series features not available in Base SAS (such as resampling and custom offsets) - see the timeseries
documentation for more details.

In [14]: tips['date1'] = pd.Timestamp('2013-01-15')

In [15]: tips['date2'] = pd.Timestamp('2015-02-15')

In [16]: tips['date1_year'] = tips['date1'].dt.year

In [17]: tips['date2_month'] = tips['date2'].dt.month

In [18]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()

In [19]: tips['months_between'] = (
....: tips['date2'].dt.to_period('M') - tips['date1'].dt.to_period('M'))
....:

In [20]: tips[['date1', 'date2', 'date1_year', 'date2_month',
....: 'date1_next', 'months_between']].head()
....:
Out[20]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
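As a brief illustration of the resampling mentioned above, here is a sketch using made-up data (not the
tips dataset):

import numpy as np
import pandas as pd

# a hypothetical daily series, resampled to monthly means
s = pd.Series(np.arange(60), index=pd.date_range('2013-01-15', periods=60, freq='D'))
s.resample('M').mean()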

Selection of columns

SAS provides keywords in the DATA step to select, drop, and rename columns.

data tips;
set tips;
keep sex total_bill tip;
run;

data tips;
set tips;
drop sex;
run;

data tips;
set tips;
rename total_bill=total_bill_2;
run;

The same operations are expressed in pandas below.


# keep
In [21]: tips[['sex', 'total_bill', 'tip']].head()
Out[21]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61

# drop
In [22]: tips.drop('sex', axis=1).head()
Out[22]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4

# rename
In [23]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[23]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4

Sorting by values

Sorting in SAS is accomplished via PROC SORT

proc sort data=tips;
by sex total_bill;
run;

pandas objects have a sort_values() method, which takes a list of columns to sort by.

In [24]: tips = tips.sort_values(['sex', 'total_bill'])

In [25]: tips.head()
Out[25]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2

String processing

Length

SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN
excludes trailing blanks and LENGTHC includes trailing blanks.

data _null_;
set tips;
put(LENGTHN(time));
put(LENGTHC(time));
run;

Python determines the length of a character string with the len function. len includes trailing blanks. Use
len and rstrip to exclude trailing blanks.
In [26]: tips['time'].str.len().head()
Out[26]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64

In [27]: tips['time'].str.rstrip().str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64

Find

SAS determines the position of a character in a string with the FINDW function. FINDW takes the string
defined by the first argument and searches for the first position of the substring you supply as the second
argument.

data _null_;
set tips;
put(FINDW(sex,'ale'));
run;

Python determines the position of a character in a string with the find function. find searches for the first
position of the substring. If the substring is found, the function returns its position. Keep in mind that
Python indexes are zero-based and the function will return -1 if it fails to find the substring.

In [28]: tips['sex'].str.find("ale").head()
Out[28]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64

Substring

SAS extracts a substring from a string based on its position with the SUBSTR function.

data _null_;
set tips;
put(substr(sex,1,1));
run;

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in
mind that Python indexes are zero-based.

In [29]: tips['sex'].str[0:1].head()
Out[29]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object


Scan

The SAS SCAN function returns the nth word from a string. The first argument is the string you want to
parse and the second argument specifies which word you want to extract.

data firstlast;
input String $60.;
First_Name = scan(string, 1);
Last_Name = scan(string, -1);
datalines2;
John Smith;
Jane Cook;
;;;
run;

Python extracts a substring from a string based on its text by using regular expressions. There are much
more powerful approaches, but this just shows a simple approach.

In [30]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})

In [31]: firstlast['First_Name'] = firstlast['String'].str.split(" ", expand=True)[0]

In [32]: firstlast['Last_Name'] = firstlast['String'].str.rsplit(" ", n=1, expand=True)[1]

In [33]: firstlast
Out[33]:
String First_Name Last_Name
0 John Smith John Smith
1 Jane Cook Jane Cook

Upcase, lowcase, and propcase

The SAS UPCASE, LOWCASE, and PROPCASE functions change the case of the argument.

data firstlast;
input String $60.;
string_up = UPCASE(string);
string_low = LOWCASE(string);
string_prop = PROPCASE(string);
datalines2;
John Smith;
Jane Cook;
;;;
run;

The equivalent Python functions are upper, lower, and title.

In [34]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})

In [35]: firstlast['string_up'] = firstlast['String'].str.upper()

In [36]: firstlast['string_low'] = firstlast['String'].str.lower()


In [37]: firstlast['string_prop'] = firstlast['String'].str.title()

In [38]: firstlast
Out[38]:
String string_up string_low string_prop
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook

Merging

The following tables will be used in the merge examples

In [39]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
....: 'value': np.random.randn(4)})
....:

In [40]: df1
Out[40]:
key value
0 A 0.970270
1 B 0.438338
2 C 0.402038
3 D -0.629526

In [41]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
....: 'value': np.random.randn(4)})
....:

In [42]: df2
Out[42]:
key value
0 B -0.600662
1 D -0.478682
2 D -0.546915
3 E -0.054672

In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the
in= dummy variables to track whether a match was found in one or both input frames.

proc sort data=df1;
by key;
run;

proc sort data=df2;
by key;
run;

data left_join inner_join right_join outer_join;
merge df1(in=a) df2(in=b);
if a and b then output inner_join;
if a then output left_join;
if b then output right_join;
if a or b then output outer_join;
run;

pandas DataFrames have a merge() method, which provides similar functionality. Note that the data does
not have to be sorted ahead of time, and different join types are accomplished via the how keyword.

In [43]: inner_join = df1.merge(df2, on=['key'], how='inner')

In [44]: inner_join
Out[44]:
key value_x value_y
0 B 0.438338 -0.600662
1 D -0.629526 -0.478682
2 D -0.629526 -0.546915

In [45]: left_join = df1.merge(df2, on=['key'], how='left')

In [46]: left_join
Out[46]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915

In [47]: right_join = df1.merge(df2, on=['key'], how='right')

In [48]: right_join
Out[48]:
key value_x value_y
0 B 0.438338 -0.600662
1 D -0.629526 -0.478682
2 D -0.629526 -0.546915
3 E NaN -0.054672

In [49]: outer_join = df1.merge(df2, on=['key'], how='outer')

In [50]: outer_join
Out[50]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E NaN -0.054672
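pandas can also emulate the in= bookkeeping from the SAS step above through the indicator argument of
merge (a sketch; the variable name is illustrative):

# adds a _merge column taking the values 'left_only', 'right_only', or 'both'
tagged = df1.merge(df2, on=['key'], how='outer', indicator=True)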


Missing data

Like SAS, pandas has a representation for missing data - which is the special float value NaN (not a number).
Many of the semantics are the same, for example missing data propagates through numeric operations, and
is ignored by default for aggregations.
In [51]: outer_join
Out[51]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E NaN -0.054672

In [52]: outer_join['value_x'] + outer_join['value_y']
Out[52]:
0 NaN
1 -0.162324
2 NaN
3 -1.108208
4 -1.176442
5 NaN
dtype: float64

In [53]: outer_join['value_x'].sum()
Out[53]: 0.5515934095727003
One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could
do this to filter missing values.

data outer_join_nulls;
set outer_join;
if value_x = .;
run;

data outer_join_no_nulls;
set outer_join;
if value_x ^= .;
run;

This doesn’t work in pandas. Instead, the pd.isna or pd.notna functions should be used for comparisons.
In [54]: outer_join[pd.isna(outer_join['value_x'])]
Out[54]:
key value_x value_y
5 E NaN -0.054672

In [55]: outer_join[pd.notna(outer_join['value_x'])]
Out[55]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
pandas also provides a variety of methods to work with missing data - some of which would be challenging to
express in SAS. For example, there are methods to drop all rows with any missing values, replace missing
values with a specified value (such as the mean), or forward fill from previous rows. See the missing data
documentation for more.
In [56]: outer_join.dropna()
Out[56]:
key value_x value_y
1 B 0.438338 -0.600662
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915

In [57]: outer_join.fillna(method='ffill')
Out[57]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 -0.600662
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E -0.629526 -0.054672

In [58]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[58]:
0 0.970270
1 0.438338
2 0.402038
3 -0.629526
4 -0.629526
5 0.110319
Name: value_x, dtype: float64

GroupBy

Aggregation

SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on
numeric columns.

proc summary data=tips nway;
class sex smoker;
var total_bill tip;
output out=tips_summed sum=;
run;

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documen-
tation for more details and examples.

In [59]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()

In [60]: tips_summed.head()
Out[60]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
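groupby can also compute several aggregations in a single call through agg; a brief sketch:

# sum and mean of total_bill, plus the sum of tip, per sex/smoker group
tips.groupby(['sex', 'smoker']).agg({'total_bill': ['sum', 'mean'], 'tip': 'sum'})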

Transformation

In SAS, if the group aggregations need to be used with the original frame, it must be merged back together.
For example, to subtract the mean for each observation by smoker group.

proc summary data=tips missing nway;
class smoker;
var total_bill;
output out=smoker_means mean(total_bill)=group_bill;
run;

proc sort data=tips;
by smoker;
run;

data tips;
merge tips(in=a) smoker_means(in=b);
by smoker;
adj_total_bill = total_bill - group_bill;
if a and b;
run;

pandas groupby provides a transform mechanism that allows these types of operations to be succinctly
expressed in one operation.

In [61]: gb = tips.groupby('smoker')['total_bill']

In [62]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')

In [63]: tips.head()
Out[63]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278

By group processing

In addition to aggregation, pandas groupby can be used to replicate most other by group processing from
SAS. For example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.


proc sort data=tips;
by sex smoker;
run;

data tips_first;
set tips;
by sex smoker;
if FIRST.sex or FIRST.smoker then output;
run;

In pandas this would be written as:

In [64]: tips.groupby(['sex', 'smoker']).first()
Out[64]:
total_bill tip day time size adj_total_bill
sex smoker
Female No 5.25 1.00 Sat Dinner 1 -11.938278
Yes 1.07 1.00 Sat Dinner 1 -17.686344
Male No 5.51 2.00 Thur Lunch 2 -11.678278
Yes 5.25 5.15 Sun Dinner 2 -13.506344
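A closely related alternative that keeps the original row labels, rather than moving the group keys into
the index, is head (a sketch):

# first row per sex/smoker group, with the original index preserved
tips.groupby(['sex', 'smoker']).head(1)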

Other Considerations

Disk vs memory

pandas operates exclusively in memory, whereas a SAS data set exists on disk. This means that the size of
data able to be loaded in pandas is limited by your machine’s memory, but also that the operations on that
data may be faster.
If out of core processing is needed, one possibility is the dask.dataframe library (currently in development),
which provides a subset of pandas functionality for an on-disk DataFrame.
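A minimal sketch of what that might look like, assuming dask is installed ('big.csv' and its columns are
hypothetical):

import dask.dataframe as dd

# dask evaluates lazily and out of core; .compute() materializes the result
ddf = dd.read_csv('big.csv')
result = ddf.groupby('key')['value'].mean().compute()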

Data interop

pandas provides a read_sas() method that can read SAS data saved in the XPORT or SAS7BDAT binary
format.

libname xportout xport 'transport-file.xpt';

data xportout.tips;
set tips(rename=(total_bill=tbill));
* xport variable names limited to 6 characters;
run;

df = pd.read_sas('transport-file.xpt')
df = pd.read_sas('binary-file.sas7bdat')

You can also specify the file format directly. By default, pandas will try to infer the file format based on its
extension.

df = pd.read_sas('transport-file.xpt', format='xport')
df = pd.read_sas('binary-file.sas7bdat', format='sas7bdat')


XPORT is a relatively limited format and the parsing of it is not as optimized as some of the other pandas
readers. An alternative way to move data between SAS and pandas is to serialize to csv.

# version 0.17, 10M rows

In [8]: %time df = pd.read_sas('big.xpt')
Wall time: 14.6 s

In [9]: %time df = pd.read_csv('big.csv')
Wall time: 4.86 s


3.5.4 Comparison with Stata

For potential users coming from Stata this page is meant to demonstrate how different Stata operations
would be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows. This means that we can refer to the libraries as
pd and np, respectively, for the rest of the document.

In [1]: import pandas as pd

In [2]: import numpy as np

Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which
displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter
notebook or terminal) – the equivalent in Stata would be:

list in 1/5

Data structures

General terminology translation

pandas Stata
DataFrame data set
column variable
row observation
groupby bysort
NaN .

DataFrame / Series

A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns
that can be of different types. As will be shown in this document, almost any operation that can be applied
to a data set in Stata can also be accomplished in pandas.


A Series is the data structure that represents one column of a DataFrame. Stata doesn’t have a separate
data structure for a single column, but in general, working with a Series is analogous to referencing a
column of a data set in Stata.

Index

Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly
analogous concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index
that can be accessed with _n.
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1,
and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately
an important part of pandas to understand, for this comparison we will essentially ignore the Index and just
treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on
how to use an Index effectively.

Data input / output

Constructing a DataFrame from values

A Stata data set can be built from specified values by placing the data after an input statement and
specifying the column names.

input x y
1 2
3 4
5 6
end

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the
data.

In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})

In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6

Reading external data

Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within
the pandas tests (csv) will be used in many of the following examples.
Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the
current working directory, we can import it as follows.


import delimited tips.csv

The pandas method is read_csv(), which works similarly. Additionally, it will automatically download the
data set if presented with a url.

In [5]: url = ('https://raw.github.com/pandas-dev'
...: '/pandas/master/pandas/tests/data/tips.csv')
...:

In [6]: tips = pd.read_csv(url)

In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Like import delimited, read_csv() can take a number of parameters to specify how the data should be
parsed. For example, if the data were instead tab delimited, did not have column names, and existed in the
current working directory, the pandas command would be:

tips = pd.read_csv('tips.csv', sep='\t', header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter
tips = pd.read_table('tips.csv', header=None)

Pandas can also read Stata data sets in .dta format with the read_stata() function.

df = pd.read_stata('data.dta')

In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS,
HDF5, Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation
for more details.

Exporting data

The inverse of import delimited in Stata is export delimited

export delimited tips2.csv

Similarly in pandas, the opposite of read_csv is DataFrame.to_csv().

tips.to_csv('tips2.csv')

Pandas can also export to Stata file format with the DataFrame.to_stata() method.

tips.to_stata('tips2.dta')

Data operations


Operations on columns

In Stata, arbitrary math expressions can be used with the generate and replace commands on new or
existing columns. The drop command drops the column from the data set.

replace total_bill = total_bill - 2
generate new_bill = total_bill / 2
drop new_bill

pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way. The DataFrame.drop() method drops a column from the
DataFrame.

In [8]: tips['total_bill'] = tips['total_bill'] - 2

In [9]: tips['new_bill'] = tips['total_bill'] / 2

In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295

In [11]: tips = tips.drop('new_bill', axis=1)

Filtering

Filtering in Stata is done with an if clause on one or more columns.

list if total_bill > 10

DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.

In [12]: tips[tips['total_bill'] > 10].head()
Out[12]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
5 23.29 4.71 Male No Sun Dinner 4

If/then logic

In Stata, an if clause can also be used to create new columns.

generate bucket = "low" if total_bill < 10
replace bucket = "high" if total_bill >= 10


The same operation in pandas can be accomplished using numpy's where function.

In [13]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')

In [14]: tips.head()
Out[14]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high

Date functionality

Stata provides a variety of functions to do operations on date/datetime columns.

generate date1 = mdy(1, 15, 2013)
generate date2 = date("Feb152015", "MDY")

generate date1_year = year(date1)
generate date2_month = month(date2)

* shift date to beginning of next month
generate date1_next = mdy(month(date1) + 1, 1, year(date1)) if month(date1) != 12
replace date1_next = mdy(1, 1, year(date1) + 1) if month(date1) == 12
generate months_between = mofd(date2) - mofd(date1)

list date1 date2 date1_year date2_month date1_next months_between

The equivalent pandas operations are shown below. In addition to these functions, pandas supports other
Time Series features not available in Stata (such as time zone handling and custom offsets) – see the timeseries
documentation for more details.

In [15]: tips['date1'] = pd.Timestamp('2013-01-15')

In [16]: tips['date2'] = pd.Timestamp('2015-02-15')

In [17]: tips['date1_year'] = tips['date1'].dt.year

In [18]: tips['date2_month'] = tips['date2'].dt.month

In [19]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()

In [20]: tips['months_between'] = (tips['date2'].dt.to_period('M')
....: - tips['date1'].dt.to_period('M'))
....:

In [21]: tips[['date1', 'date2', 'date1_year', 'date2_month', 'date1_next',
....: 'months_between']].head()
....:
Out[21]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>

Selection of columns

Stata provides keywords to select, drop, and rename columns.

keep sex total_bill tip

drop sex

rename total_bill total_bill_2

The same operations are expressed in pandas below. Note that in contrast to Stata, these operations do not
happen in place. To make these changes persist, assign the operation back to a variable.
# keep
In [22]: tips[['sex', 'total_bill', 'tip']].head()
Out[22]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61

# drop
In [23]: tips.drop('sex', axis=1).head()
Out[23]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4

# rename
In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[24]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4


Sorting by values

Sorting in Stata is accomplished via sort

sort sex total_bill

pandas objects have a DataFrame.sort_values() method, which takes a list of columns to sort by.

In [25]: tips = tips.sort_values(['sex', 'total_bill'])

In [26]: tips.head()
Out[26]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2

String processing

Finding length of string

Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII
and Unicode strings, respectively.

generate strlen_time = strlen(time)
generate ustrlen_time = ustrlen(time)

Python determines the length of a character string with the len function. In Python 3, all strings are
Unicode strings. len includes trailing blanks. Use len and rstrip to exclude trailing blanks.
In [27]: tips['time'].str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64

In [28]: tips['time'].str.rstrip().str.len().head()
Out[28]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64


Finding position of substring

Stata determines the position of a character in a string with the strpos() function. This takes the string
defined by the first argument and searches for the first position of the substring you supply as the second
argument.

generate str_position = strpos(sex, "ale")

Python determines the position of a character in a string with the find() function. find searches for the
first position of the substring. If the substring is found, the function returns its position. Keep in mind that
Python indexes are zero-based and the function will return -1 if it fails to find the substring.

In [29]: tips['sex'].str.find("ale").head()
Out[29]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64

Extracting substring by position

Stata extracts a substring from a string based on its position with the substr() function.

generate short_sex = substr(sex, 1, 1)

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in
mind that Python indexes are zero-based.

In [30]: tips['sex'].str[0:1].head()
Out[30]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object

Extracting nth word

The Stata word() function returns the nth word from a string. The first argument is the string you want to
parse and the second argument specifies which word you want to extract.

clear
input str20 string
"John Smith"
"Jane Cook"
end

generate first_name = word(string, 1)
generate last_name = word(string, -1)

Python extracts a substring from a string based on its text by using regular expressions. There are much
more powerful approaches, but this just shows a simple approach.

In [31]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})

In [32]: firstlast['First_Name'] = firstlast['string'].str.split(" ", expand=True)[0]

In [33]: firstlast['Last_Name'] = firstlast['string'].str.rsplit(" ", n=1, expand=True)[1]

In [34]: firstlast
Out[34]:
string First_Name Last_Name
0 John Smith John Smith
1 Jane Cook Jane Cook

Changing case

The Stata strupper(), strlower(), strproper(), ustrupper(), ustrlower(), and ustrtitle() functions
change the case of ASCII and Unicode strings, respectively.

clear
input str20 string
"John Smith"
"Jane Cook"
end

generate upper = strupper(string)
generate lower = strlower(string)
generate title = strproper(string)
list

The equivalent Python functions are upper, lower, and title.

In [35]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})

In [36]: firstlast['upper'] = firstlast['string'].str.upper()

In [37]: firstlast['lower'] = firstlast['string'].str.lower()

In [38]: firstlast['title'] = firstlast['string'].str.title()

In [39]: firstlast
Out[39]:
string upper lower title
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook


Merging

The following tables will be used in the merge examples

In [40]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
....: 'value': np.random.randn(4)})
....:

In [41]: df1
Out[41]:
key value
0 A -1.456807
1 B 1.779988
2 C -1.602829
3 D 0.944411

In [42]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
....: 'value': np.random.randn(4)})
....:

In [43]: df2
Out[43]:
key value
0 B 0.750030
1 D 0.349022
2 D 1.433851
3 E -0.720900

In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file
name on disk. In contrast, Python must have both DataFrames already in memory.
By default, Stata performs an outer join, where all observations from both data sets are left in memory after
the merge. One can keep only observations from the initial data set, the merged data set, or the intersection
of the two by using the values created in the _merge variable.

* First create df2 and save to disk
clear
input str1 key
B
D
D
E
end
generate value = rnormal()
save df2.dta

* Now create df1 in memory
clear
input str1 key
A
B
C
D
end
generate value = rnormal()

preserve

* Left join
merge 1:n key using df2.dta
keep if _merge == 1

* Right join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 2

* Inner join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 3

* Outer join
restore
merge 1:n key using df2.dta

pandas DataFrames have a DataFrame.merge() method, which provides similar functionality. Note that
different join types are accomplished via the how keyword.

In [44]: inner_join = df1.merge(df2, on=['key'], how='inner')

In [45]: inner_join
Out[45]:
key value_x value_y
0 B 1.779988 0.750030
1 D 0.944411 0.349022
2 D 0.944411 1.433851

In [46]: left_join = df1.merge(df2, on=['key'], how='left')

In [47]: left_join
Out[47]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851

In [48]: right_join = df1.merge(df2, on=['key'], how='right')

In [49]: right_join
Out[49]:
key value_x value_y
0 B 1.779988 0.750030
1 D 0.944411 0.349022
2 D 0.944411 1.433851
3 E NaN -0.720900

In [50]: outer_join = df1.merge(df2, on=['key'], how='outer')

In [51]: outer_join
Out[51]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E NaN -0.720900
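pandas can also enforce the merge cardinality that Stata's 1:1/1:n/n:1 syntax declares, via the validate
argument (a sketch; df1 has unique keys, so this check passes):

# raises MergeError if 'key' is not unique in df1, mirroring a 1:n merge
checked = df1.merge(df2, on=['key'], how='outer', validate='1:m')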

Missing data

Like Stata, pandas has a representation for missing data – the special float value NaN (not a number). Many
of the semantics are the same; for example missing data propagates through numeric operations, and is
ignored by default for aggregations.
In [52]: outer_join
Out[52]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E NaN -0.720900

In [53]: outer_join['value_x'] + outer_join['value_y']
Out[53]:
0 NaN
1 2.530018
2 NaN
3 1.293433
4 2.378261
5 NaN
dtype: float64

In [54]: outer_join['value_x'].sum()
Out[54]: 0.6091740246349007
One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you
could do this to filter missing values.

* Keep missing values
list if value_x == .

* Keep non-missing values
list if value_x != .


This doesn’t work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for comparisons.
In [55]: outer_join[pd.isna(outer_join['value_x'])]
Out[55]:
key value_x value_y
5 E NaN -0.7209

In [56]: outer_join[pd.notna(outer_join['value_x'])]
Out[56]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
Pandas also provides a variety of methods to work with missing data – some of which would be challenging
to express in Stata. For example, there are methods to drop all rows with any missing values, replace
missing values with a specified value (such as the mean), or forward fill from previous rows. See the
missing data documentation for more.
# Drop rows with any missing value
In [57]: outer_join.dropna()
Out[57]:
key value_x value_y
1 B 1.779988 0.750030
3 D 0.944411 0.349022
4 D 0.944411 1.433851

# Fill forwards
In [58]: outer_join.fillna(method='ffill')
Out[58]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 0.750030
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E 0.944411 -0.720900

# Impute missing values with the mean
In [59]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[59]:
0 -1.456807
1 1.779988
2 -1.602829
3 0.944411
4 0.944411
5 0.121835
Name: value_x, dtype: float64

GroupBy


Aggregation

Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric
columns.

collapse (sum) total_bill tip, by(sex smoker)

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documen-
tation for more details and examples.

In [60]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()

In [61]: tips_summed.head()
Out[61]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07

Transformation

In Stata, if the group aggregations need to be used with the original data set, one would usually use bysort
with egen(). For example, to subtract the mean for each observation by smoker group.

bysort sex smoker: egen group_bill = mean(total_bill)
generate adj_total_bill = total_bill - group_bill

pandas groupby provides a transform mechanism that allows these types of operations to be succinctly
expressed in one operation.

In [62]: gb = tips.groupby('smoker')['total_bill']

In [63]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')

In [64]: tips.head()
Out[64]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278

By group processing

In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from
Stata. For example, the following example lists the first observation in the current sort order by sex/smoker
group.


bysort sex smoker: list if _n == 1

In pandas this would be written as:

In [65]: tips.groupby(['sex', 'smoker']).first()
Out[65]:
total_bill tip day time size adj_total_bill
sex smoker
Female No 5.25 1.00 Sat Dinner 1 -11.938278
Yes 1.07 1.00 Sat Dinner 1 -17.686344
Male No 5.51 2.00 Thur Lunch 2 -11.678278
Yes 5.25 5.15 Sun Dinner 2 -13.506344

Other considerations

Disk vs memory

Pandas and Stata both operate exclusively in memory. This means that the size of data able to be loaded
in pandas is limited by your machine’s memory. If out of core processing is needed, one possibility is the
dask.dataframe library, which provides a subset of pandas functionality for an on-disk DataFrame.

3.6 Tutorials

This is a guide to many pandas tutorials, geared mainly for new users.

3.6.1 Internal guides

pandas’ own 10 Minutes to pandas.


More complex recipes are in the Cookbook.
A handy pandas cheat sheet.

3.6.2 Community guides

pandas Cookbook by Julia Evans

The goal of this 2015 cookbook (by Julia Evans) is to give you some concrete examples for getting started
with pandas. These are examples with real-world data, and all the bugs and weirdness that entails. For the
table of contents, see the pandas-cookbook GitHub repository.

Learn Pandas by Hernan Rojas

A set of lessons for new pandas users: https://bitbucket.org/hrojas/learn-pandas


Practical data analysis with Python

This guide is an introduction to the data analysis process using the Python data ecosystem and an interesting
open dataset. There are four sections covering selected topics such as munging data, aggregating data,
visualizing data, and time series.

Exercises for new users

Practice your skills with real data sets and exercises. For more resources, please visit the main repository.

Modern pandas

Tutorial series written in 2016 by Tom Augspurger. The source may be found in the GitHub repository
TomAugspurger/effective-pandas.
• Modern Pandas
• Method Chaining
• Indexes
• Performance
• Tidy Data
• Visualization
• Timeseries

Excel charts with pandas, vincent and xlsxwriter

• Using Pandas and XlsxWriter to create Excel charts

Video tutorials

• Pandas From The Ground Up (2015) (2:24) GitHub repo


• Introduction Into Pandas (2016) (1:28) GitHub repo
• Pandas: .head() to .tail() (2016) (1:26) GitHub repo
• Data analysis in Python with pandas (2016-2018) GitHub repo and Jupyter Notebook
• Best practices with pandas (2018) GitHub repo and Jupyter Notebook

Various tutorials

• Wes McKinney’s (pandas BDFL) blog


• Statistical analysis made easy in Python with SciPy and pandas DataFrames, by Randal Olson
• Statistical Data Analysis in Python, tutorial videos, by Christopher Fonnesbeck from SciPy 2013
• Financial analysis in Python, by Thomas Wiecki
• Intro to pandas data structures, by Greg Reda
• Pandas and Python: Top 10, by Manish Amde


• Pandas DataFrames Tutorial, by Karlijn Willems


• A concise tutorial with real life examples



CHAPTER

FOUR

USER GUIDE

The User Guide covers all of pandas by topic area. Each of the subsections introduces a topic (such as
“working with missing data”), and discusses how pandas approaches the problem, with many examples
throughout.
Users brand-new to pandas should start with 10min.
Further information on any specific method can be obtained in the API reference.

4.1 IO tools (text, CSV, HDF5, …)

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally
return a pandas object. The corresponding writer functions are object methods that are accessed like
DataFrame.to_csv(). Below is a table containing available readers and writers.

Format Type Data Description Reader Writer
text CSV read_csv to_csv
text JSON read_json to_json
text HTML read_html to_html
text Local clipboard read_clipboard to_clipboard
binary MS Excel read_excel to_excel
binary OpenDocument read_excel
binary HDF5 Format read_hdf to_hdf
binary Feather Format read_feather to_feather
binary Parquet Format read_parquet to_parquet
binary Msgpack read_msgpack to_msgpack
binary Stata read_stata to_stata
binary SAS read_sas
binary Python Pickle Format read_pickle to_pickle
SQL SQL read_sql to_sql
SQL Google Big Query read_gbq to_gbq

Here is an informal performance comparison for some of these IO methods.

Note: For examples that use the StringIO class, make sure you import it according to your Python
version, i.e. from StringIO import StringIO for Python 2 and from io import StringIO for Python 3.


4.1.1 CSV & text files

The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some
advanced strategies.

Parsing options

read_csv() accepts the following common arguments:

Basic

filepath_or_buffer [various] Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath),
URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or
StringIO).
sep [str, defaults to ',' for read_csv(), \t for read_table()] Delimiter to use. If sep is None, the C engine
cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will
be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In
addition, separators longer than 1 character and different from '\s+' will be interpreted as regular
expressions and will also force the use of the Python parsing engine. Note that regex delimiters are
prone to ignoring quoted data. Regex example: '\\r\\t'.
delimiter [str, default None] Alternative argument name for sep.
delim_whitespace [boolean, default False] Specifies whether or not whitespace (e.g. ' ' or '\t') will be
used as the delimiter. Equivalent to setting sep='\s+'. If this option is set to True, nothing should
be passed in for the delimiter parameter.
New in version 0.18.1: support for the Python parser.

Column and index locations and names

header [int or list of ints, default 'infer'] Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names are passed the behavior is identical
to header=0 and column names are inferred from the first line of the file, if column names are passed
explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace
existing names.
The header can be a list of ints that specify row locations for a MultiIndex on the columns e.g. [0,1,
3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note
that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
names [array-like, default None] List of column names to use. If file contains no header row, then you should
explicitly pass header=None. Duplicates in this list are not allowed.
index_col [int, str, sequence of int / str, or False, default None] Column(s) to use as the row labels of
the DataFrame, either given as string name or column index. If a sequence of int / str is given, a
MultiIndex is used.
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g.
when you have a malformed file with delimiters at the end of each line.
usecols [list-like or callable, default None] Return a subset of the columns. If list-like, all elements must
either be positional (i.e. integer indices into the document columns) or strings that correspond to


column names provided either by the user in names or inferred from the document header row(s). For
example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame
from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo',
'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where
the callable function evaluates to True:
In [1]: from io import StringIO, BytesIO

In [2]: data = ('col1,col2,col3\n'
...: 'a,b,1\n'
...: 'a,b,2\n'
...: 'c,d,3')
...:

In [3]: pd.read_csv(StringIO(data))
Out[3]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3

In [4]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])
Out[4]:
col1 col3
0 a 1
1 a 2
2 c 3
Using this parameter results in much faster parsing time and lower memory usage.
squeeze [boolean, default False] If the parsed data only contains one column then return a Series.
prefix [str, default None] Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …
mangle_dupe_cols [boolean, default True] Duplicate columns will be specified as ‘X’, ‘X.1’…’X.N’, rather
than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the
columns.

General parsing configuration

dtype [Type name or dict of column -> type, default None] Data type for data or columns. E.g. {'a':
np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together with
suitable na_values settings to preserve and not interpret dtype.
New in version 0.20.0: support for the Python parser.
engine [{'c', 'python'}] Parser engine to use. The C engine is faster while the Python engine is currently
more feature-complete.
converters [dict, default None] Dict of functions for converting values in certain columns. Keys can either
be integers or column labels.
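For example, a small sketch with made-up data:

from io import StringIO
import pandas as pd

data = 'col1,col2\na,1\nb,2'
# values arrive as strings; this converter parses col2 as float and doubles it
df = pd.read_csv(StringIO(data), converters={'col2': lambda v: float(v) * 2})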
true_values [list, default None] Values to consider as True.


false_values [list, default None] Values to consider as False.


skipinitialspace [boolean, default False] Skip spaces after delimiter.
skiprows [list-like or integer, default None] Line numbers to skip (0-indexed) or number of lines to skip
(int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row
should be skipped and False otherwise:
In [5]: data = ('col1,col2,col3\n'
...: 'a,b,1\n'
...: 'a,b,2\n'
...: 'c,d,3')
...:

In [6]: pd.read_csv(StringIO(data))
Out[6]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3

In [7]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[7]:
col1 col2 col3
0 a b 2
skipfooter [int, default 0] Number of lines at bottom of file to skip (unsupported with engine=’c’).
nrows [int, default None] Number of rows of file to read. Useful for reading pieces of large files.
low_memory [boolean, default True] Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify
the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless,
use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
memory_map [boolean, default False] If a filepath is provided for filepath_or_buffer, map the file
object directly onto memory and access the data directly from there. Using this option can improve
performance because there is no longer any I/O overhead.

NA and missing data handling

na_values [scalar, str, list-like, or dict, default None] Additional strings to recognize as NA/NaN. If dict
passed, specific per-column NA values. See na values const below for a list of the values interpreted as
NaN by default.
keep_default_na [boolean, default True] Whether or not to include the default NaN values when parsing
the data. Depending on whether na_values is passed in, the behavior is as follows:
• If keep_default_na is True, and na_values are specified, na_values is appended to the default
NaN values used for parsing.
• If keep_default_na is True, and na_values are not specified, only the default NaN values are used
for parsing.
• If keep_default_na is False, and na_values are specified, only the NaN values specified na_values
are used for parsing.


• If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be
ignored.
na_filter [boolean, default True] Detect missing value markers (empty strings and the value of na_values).
In data without any NAs, passing na_filter=False can improve the performance of reading a large
file.
verbose [boolean, default False] Indicate number of NA values placed in non-numeric columns.
skip_blank_lines [boolean, default True] If True, skip over blank lines rather than interpreting as NaN
values.

Datetime handling

parse_dates [boolean or list of ints or names or list of lists or dict, default False.]
• If True -> try parsing the index.
• If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
• If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
• If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’. A fast-path exists for
iso8601-formatted dates.
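A brief illustration of the list-of-names form, with made-up data:

from io import StringIO
import pandas as pd

data = 'date,value\n2013-01-01,1\n2013-01-02,2'
# the 'date' column is parsed to datetime64[ns] during the read
df = pd.read_csv(StringIO(data), parse_dates=['date'])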
infer_datetime_format [boolean, default False] If True and parse_dates is enabled for a column, at-
tempt to infer the datetime format to speed up the processing.
keep_date_col [boolean, default False] If True and parse_dates specifies combining multiple columns
then keep the original columns.
date_parser [function, default None] Function to use for converting a sequence of string columns to an
array of datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas
will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1)
Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array and pass that; and 3) call
date_parser once for each row using one or more strings (corresponding to the columns defined by
parse_dates) as arguments.
dayfirst [boolean, default False] DD/MM format dates, international and European format.
cache_dates [boolean, default True] If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with
timezone offsets.
New in version 0.25.0.

Iteration

iterator [boolean, default False] Return TextFileReader object for iteration or getting chunks with
get_chunk().
chunksize [int, default None] Return TextFileReader object for iteration. See iterating and chunking below.


Quoting, compression, and file format

compression [{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'] For on-the-fly decompression
of on-disk data. If 'infer', then use gzip, bz2, zip, or xz if filepath_or_buffer is a string ending
in '.gz', '.bz2', '.zip', or '.xz', respectively, and no decompression otherwise. If using 'zip', the ZIP file
must contain only one data file to be read in. Set to None for no decompression. (A short usage example
follows this parameter group.)
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
Changed in version 0.24.0: ‘infer’ option added and set to default.
thousands [str, default None] Thousands separator.
decimal [str, default '.'] Character to recognize as decimal point. E.g. use ',' for European data.
float_precision [string, default None] Specifies which converter the C engine should use for floating-point
values. The options are None for the ordinary converter, high for the high-precision converter, and
round_trip for the round-trip converter.
lineterminator [str (length 1), default None] Character to break file into lines. Only valid with C parser.
quotechar [str (length 1)] The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per csv.QUOTE_* constants.
Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [boolean, default True] When quotechar is specified and quoting is not QUOTE_NONE, indicate
whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar
element.
escapechar [str (length 1), default None] One-character string used to escape delimiter when quoting is
QUOTE_NONE.
comment [str, default None] Indicates remainder of line should not be parsed. If found at the beginning of
a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines
(as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but
not by skiprows. For example, if comment='#', parsing ‘#empty\na,b,c\n1,2,3’ with header=0 will
result in ‘a,b,c’ being treated as the header.
encoding [str, default None] Encoding to use for UTF when reading/writing (e.g. 'utf-8'). List of Python
standard encodings.
dialect [str or csv.Dialect instance, default None] If provided, this parameter will override values (default
or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar,
and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect
documentation for more details.
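
A minimal sketch of the compression option (the file name data.csv.gz is hypothetical): with the default
compression='infer', the '.gz' suffix selects gzip on both the write and the read side:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('data.csv.gz', index=False)  # gzip compression inferred from the extension
pd.read_csv('data.csv.gz')             # decompressed on the fly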

Error handling

error_bad_lines [boolean, default True] Lines with too many fields (e.g. a csv line with too many
commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False,
then these “bad lines” will be dropped from the DataFrame that is returned. See bad lines below.
warn_bad_lines [boolean, default True] If error_bad_lines is False, and warn_bad_lines is True, a
warning for each “bad line” will be output.


Specifying column data types

You can indicate the data type for the whole DataFrame or individual columns:
In [8]: data = ('a,b,c,d\n'
...: '1,2,3,4\n'
...: '5,6,7,8\n'
...: '9,10,11')
...:

In [9]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [10]: df = pd.read_csv(StringIO(data), dtype=object)

In [11]: df
Out[11]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NaN

In [12]: df['a'][0]
Out[12]: '1'

In [13]: df = pd.read_csv(StringIO(data),
....: dtype={'b': object, 'c': np.float64, 'd': 'Int64'})
....:

In [14]: df.dtypes
Out[14]:
a int64
b object
c float64
d Int64
dtype: object
Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype. If
you’re unfamiliar with these concepts, you can see here to learn more about dtypes, and here to learn more
about object conversion in pandas.
For instance, you can use the converters argument of read_csv():
In [15]: data = ("col_1\n"
....: "1\n"
....: "2\n"
....: "'A'\n"
....: "4.22")
....:

In [16]: df = pd.read_csv(StringIO(data), converters={'col_1': str})

In [17]: df


Out[17]:
col_1
0 1
1 2
2 'A'
3 4.22

In [18]: df['col_1'].apply(type).value_counts()
Out[18]:
<class 'str'> 4
Name: col_1, dtype: int64
Or you can use the to_numeric() function to coerce the dtypes after reading in the data,
In [19]: df2 = pd.read_csv(StringIO(data))

In [20]: df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')

In [21]: df2
Out[21]:
col_1
0 1.00
1 2.00
2 NaN
3 4.22

In [22]: df2['col_1'].apply(type).value_counts()
Out[22]:
<class 'float'> 4
Name: col_1, dtype: int64
which will convert all valid parsing to floats, leaving the invalid parsing as NaN.
Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs.
In the case above, if you wanted to NaN out the data anomalies, then to_numeric() is probably your best
option. However, if you wanted all of the data to be coerced to a single type, no matter what, then using
the converters argument of read_csv() would certainly be worth trying.
New in version 0.20.0: support for the Python parser.
The dtype option is supported by the ‘python’ engine.

Note: In some cases, reading in abnormal data with columns containing mixed dtypes will result in an
inconsistent dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will go
and infer the dtypes for different chunks of the data, rather than the whole dataset at once. Consequently,
you can end up with column(s) with mixed dtypes. For example,
In [23]: col_1 = list(range(500000)) + ['a', 'b'] + list(range(500000))

In [24]: df = pd.DataFrame({'col_1': col_1})

In [25]: df.to_csv('foo.csv')

In [26]: mixed_df = pd.read_csv('foo.csv')

In [27]: mixed_df['col_1'].apply(type).value_counts()
Out[27]:


<class 'int'> 737858
<class 'str'> 262144
Name: col_1, dtype: int64

In [28]: mixed_df['col_1'].dtype
Out[28]: dtype('O')
will result in mixed_df containing an int dtype for certain chunks of the column, and str for others due
to the mixed dtypes from the data that was read in. It is important to note that the overall column will be
marked with a dtype of object, which is used for columns with mixed dtypes.

Specifying categorical dtype

New in version 0.19.0.


Categorical columns can be parsed directly by specifying dtype='category' or
dtype=CategoricalDtype(categories, ordered).
In [29]: data = ('col1,col2,col3\n'
....: 'a,b,1\n'
....: 'a,b,2\n'
....: 'c,d,3')
....:

In [30]: pd.read_csv(StringIO(data))
Out[30]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3

In [31]: pd.read_csv(StringIO(data)).dtypes
Out[31]:
col1 object
col2 object
col3 int64
dtype: object

In [32]: pd.read_csv(StringIO(data), dtype='category').dtypes


Out[32]:
col1 category
col2 category
col3 category
dtype: object
Individual columns can be parsed as a Categorical using a dict specification:

In [33]: pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes


Out[33]:
col1 category
col2 object
col3 int64
dtype: object


New in version 0.21.0.


Specifying dtype='category' will result in an unordered Categorical whose categories are the unique
values observed in the data. For more control on the categories and order, create a CategoricalDtype ahead
of time, and pass that for that column’s dtype.

In [34]: from pandas.api.types import CategoricalDtype

In [35]: dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)

In [36]: pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes


Out[36]:
col1 category
col2 object
col3 int64
dtype: object

When using dtype=CategoricalDtype, “unexpected” values outside of dtype.categories are treated as
missing values.

In [37]: dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c'

In [38]: pd.read_csv(StringIO(data), dtype={'col1': dtype}).col1


Out[38]:
0 a
1 a
2 NaN
Name: col1, dtype: category
Categories (3, object): [a, b, d]

This matches the behavior of Categorical.set_categories().

Note: With dtype='category', the resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the to_numeric() function, or as appropriate,
another converter such as to_datetime().
When dtype is a CategoricalDtype with homogeneous categories (all numeric, all datetimes, etc.), the
conversion is done automatically.
In [39]: df = pd.read_csv(StringIO(data), dtype='category')

In [40]: df.dtypes
Out[40]:
col1 category
col2 category
col3 category
dtype: object

In [41]: df['col3']
Out[41]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, object): [1, 2, 3]


In [42]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)

In [43]: df['col3']
Out[43]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]

Naming and using columns

Handling column names

A file may or may not have a header row. pandas assumes the first row should be used as the column names:
In [44]: data = ('a,b,c\n'
....: '1,2,3\n'
....: '4,5,6\n'
....: '7,8,9')
....:

In [45]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9

In [46]: pd.read_csv(StringIO(data))
Out[46]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
By specifying the names argument in conjunction with header you can indicate other names to use and
whether or not to throw away the header row (if any):
In [47]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9

In [48]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=0)


Out[48]:
foo bar baz
0 1 2 3
1 4 5 6
2 7 8 9

In [49]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=None)


Out[49]:
foo bar baz
0 a b c
1 1 2 3
2 4 5 6
3 7 8 9
If the header is in a row other than the first, pass the row number to header. This will skip the preceding
rows:

In [50]: data = ('skip this skip it\n'
....: 'a,b,c\n'
....: '1,2,3\n'
....: '4,5,6\n'
....: '7,8,9')
....:

In [51]: pd.read_csv(StringIO(data), header=1)


Out[51]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9

Note: Default behavior is to infer the column names: if no names are passed the behavior is identical to
header=0 and column names are inferred from the first non-blank line of the file, if column names are passed
explicitly then the behavior is identical to header=None.

Duplicate names parsing

If the file or header contains duplicate names, pandas will by default distinguish between them so as to
prevent overwriting data:

In [52]: data = ('a,b,a\n'
....: '0,1,2\n'
....: '3,4,5')
....:

In [53]: pd.read_csv(StringIO(data))
Out[53]:
a b a.1
0 0 1 2
1 3 4 5

There is no more duplicate data because mangle_dupe_cols=True by default, which modifies a series of
duplicate columns ‘X’, …, ‘X’ to become ‘X’, ‘X.1’, …, ‘X.N’. If mangle_dupe_cols=False, duplicate data
can arise:

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'


In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
Out[3]:
a b a
0 2 1 2
1 5 4 5

To prevent users from encountering this problem with duplicate data, a ValueError exception is raised if
mangle_dupe_cols != True:

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'


In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
...
ValueError: Setting mangle_dupe_cols=False is not supported yet

Filtering columns (usecols)

The usecols argument allows you to select any subset of the columns in a file, either using the column
names, position numbers or a callable:
New in version 0.20.0: support for callable usecols arguments
In [54]: data = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'

In [55]: pd.read_csv(StringIO(data))
Out[55]:
a b c d
0 1 2 3 foo
1 4 5 6 bar
2 7 8 9 baz

In [56]: pd.read_csv(StringIO(data), usecols=['b', 'd'])


Out[56]:
b d
0 2 foo
1 5 bar
2 8 baz

In [57]: pd.read_csv(StringIO(data), usecols=[0, 2, 3])


Out[57]:
a c d
0 1 3 foo
1 4 6 bar
2 7 9 baz

In [58]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['A', 'C'])


Out[58]:
a c
0 1 3
1 4 6
2 7 9
The usecols argument can also be used to specify which columns not to use in the final result:


In [59]: pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c'])


Out[59]:
b d
0 2 foo
1 5 bar
2 8 baz

In this case, the callable is specifying that we exclude the “a” and “c” columns from the output.

Comments and empty lines

Ignoring line comments and empty lines

If the comment parameter is specified, then completely commented lines will be ignored. By default, com-
pletely blank lines will be ignored as well.
In [60]: data = ('\n'
....: 'a,b,c\n'
....: ' \n'
....: '# commented line\n'
....: '1,2,3\n'
....: '\n'
....: '4,5,6')
....:

In [61]: print(data)

a,b,c

# commented line
1,2,3

4,5,6

In [62]: pd.read_csv(StringIO(data), comment='#')


Out[62]:
a b c
0 1 2 3
1 4 5 6
If skip_blank_lines=False, then read_csv will not ignore blank lines:
In [63]: data = ('a,b,c\n'
....: '\n'
....: '1,2,3\n'
....: '\n'
....: '\n'
....: '4,5,6')
....:

In [64]: pd.read_csv(StringIO(data), skip_blank_lines=False)


Out[64]:
a b c
0 NaN NaN NaN
1 1.0 2.0 3.0
2 NaN NaN NaN
3 NaN NaN NaN
4 4.0 5.0 6.0

Warning: The presence of ignored lines might create ambiguities involving line numbers; the param-
eter header uses row numbers (ignoring commented/empty lines), while skiprows uses line numbers
(including commented/empty lines):
In [65]: data = ('#comment\n'
....: 'a,b,c\n'
....: 'A,B,C\n'
....: '1,2,3')
....:

In [66]: pd.read_csv(StringIO(data), comment='#', header=1)


Out[66]:
A B C
0 1 2 3

In [67]: data = ('A,B,C\n'
....: '#comment\n'
....: 'a,b,c\n'
....: '1,2,3')
....:

In [68]: pd.read_csv(StringIO(data), comment='#', skiprows=2)


Out[68]:
a b c
0 1 2 3

If both header and skiprows are specified, header will be relative to the end of skiprows. For example:

In [69]: data = ('# empty\n'
....: '# second empty line\n'
....: '# third emptyline\n'
....: 'X,Y,Z\n'
....: '1,2,3\n'
....: 'A,B,C\n'
....: '1,2.,4.\n'
....: '5.,NaN,10.0\n')
....:

In [70]: print(data)
# empty
# second empty line
# third emptyline
X,Y,Z
1,2,3
A,B,C
1,2.,4.
5.,NaN,10.0

In [71]: pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)


Out[71]:
A B C
0 1.0 2.0 4.0
1 5.0 NaN 10.0

Comments

Sometimes comments or metadata may be included in a file:

In [72]: print(open('tmp.csv').read())
ID,level,category
Patient1,123000,x # really unpleasant
Patient2,23000,y # wouldn't take his medicine
Patient3,1234018,z # awesome

By default, the parser includes the comments in the output:

In [73]: df = pd.read_csv('tmp.csv')

In [74]: df
Out[74]:
ID level category
0 Patient1 123000 x # really unpleasant
1 Patient2 23000 y # wouldn't take his medicine
2 Patient3 1234018 z # awesome

We can suppress the comments using the comment keyword:

In [75]: df = pd.read_csv('tmp.csv', comment='#')

In [76]: df
Out[76]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z

Dealing with Unicode data

The encoding argument should be used for encoded unicode data, which will result in byte strings being
decoded to unicode in the result:
In [77]: data = (b'word,length\n'
....: b'Tr\xc3\xa4umen,7\n'
....: b'Gr\xc3\xbc\xc3\x9fe,5')
....:

In [78]: data = data.decode('utf8').encode('latin-1')


In [79]: df = pd.read_csv(BytesIO(data), encoding='latin-1')

In [80]: df
Out[80]:
word length
0 Träumen 7
1 Grüße 5

In [81]: df['word'][1]
Out[81]: 'Grüße'
Some formats which encode all characters as multiple bytes, like UTF-16, won’t parse correctly at all without
specifying the encoding. Full list of Python standard encodings.

Index columns and trailing delimiters

If a file has one more column of data than the number of column names, the first column will be used as the
DataFrame’s row names:

In [82]: data = ('a,b,c\n'
....: '4,apple,bat,5.7\n'
....: '8,orange,cow,10')
....:

In [83]: pd.read_csv(StringIO(data))
Out[83]:
a b c
4 apple bat 5.7
8 orange cow 10.0

In [84]: data = ('index,a,b,c\n'
....: '4,apple,bat,5.7\n'
....: '8,orange,cow,10')
....:

In [85]: pd.read_csv(StringIO(data), index_col=0)


Out[85]:
a b c
index
4 apple bat 5.7
8 orange cow 10.0

Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line,
confusing the parser. To explicitly disable the index column inference and discard the last column, pass
index_col=False:
In [86]: data = ('a,b,c\n'
....: '4,apple,bat,\n'
....: '8,orange,cow,')
....:

In [87]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,

In [88]: pd.read_csv(StringIO(data))
Out[88]:
a b c
4 apple bat NaN
8 orange cow NaN

In [89]: pd.read_csv(StringIO(data), index_col=False)


Out[89]:
a b c
0 4 apple bat
1 8 orange cow
If a subset of data is being parsed using the usecols option, the index_col specification is based on that
subset, not the original data.
In [90]: data = ('a,b,c\n'
....: '4,apple,bat,\n'
....: '8,orange,cow,')
....:

In [91]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,

In [92]: pd.read_csv(StringIO(data), usecols=['b', 'c'])


Out[92]:
b c
4 bat NaN
8 cow NaN

In [93]: pd.read_csv(StringIO(data), usecols=['b', 'c'], index_col=0)


Out[93]:
b c
4 bat NaN
8 cow NaN

Date Handling

Specifying date columns

To better facilitate working with datetime data, read_csv() uses the keyword arguments parse_dates and
date_parser to allow users to specify a variety of columns and date/time formats to turn the input text
data into datetime objects.
The simplest case is to just pass in parse_dates=True:
# Use a column as an index, and parse it as dates.
In [94]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True)

In [95]: df
Out[95]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5

# These are Python datetime objects
In [96]: df.index
Out[96]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)

It is often the case that we may want to store date and time data separately, or store various date fields
separately. The parse_dates keyword can be used to specify a combination of columns to parse the dates
and/or times from.
You can specify a list of column lists to parse_dates, the resulting date columns will be prepended to the
output (so as to not affect the existing column order) and the new column names will be the concatenation
of the component column names:

In [97]: print(open('tmp.csv').read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900

In [98]: df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]])

In [99]: df
Out[99]:
1_2 1_3 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

By default the parser removes the component date columns, but you can choose to retain them via the
keep_date_col keyword:

In [100]: df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]],
.....: keep_date_col=True)
.....:

In [101]: df
Out[101]:
1_2 1_3 0 1 2 3 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 19990127 19:00:00 18:56:00 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 19990127 20:00:00 19:56:00 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD 19990127 21:00:00 20:56:00 -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD 19990127 21:00:00 21:18:00 -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD 19990127 22:00:00 21:56:00 -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD 19990127 23:00:00 22:56:00 -0.59


Note that if you wish to combine multiple columns into a single date column, a nested list must be used.
In other words, parse_dates=[1, 2] indicates that the second and third columns should each be parsed as
separate date columns while parse_dates=[[1, 2]] means the two columns should be parsed into a single
column.
You can also use a dict to specify custom names for the combined columns:

In [102]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}

In [103]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec)

In [104]: df
Out[104]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

It is important to remember that if multiple text columns are to be parsed into a single date column, then
a new column is prepended to the data. The index_col specification is based off of this new set of columns
rather than the original data columns:

In [105]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}

In [106]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
.....: index_col=0) # index is the nominal column
.....:

In [107]: df
Out[107]:
actual 0 4
nominal
1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

Note: If a column or index contains an unparsable date, the entire column or index will be returned
unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.

Note: read_csv has a fast path for parsing datetime strings in iso8601 format, e.g. "2000-01-01T00:01:02+00:00"
and similar variations. If you can arrange for your data to store datetimes in this
format, load times will be significantly faster; speedups of ~20x have been observed.

Note: When passing a dict as the parse_dates argument, the order of the columns prepended is not
guaranteed, because dict objects do not impose an ordering on their keys. On Python 2.7+ you may use
collections.OrderedDict instead of a regular dict if this matters to you. Because of this, when using a
dict for parse_dates in conjunction with the index_col argument, it's best to specify index_col as a
column label rather than as an index on the resulting frame.
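
A minimal sketch of that recommendation, reusing the tmp.csv layout and the date_spec dict from above:

# Refer to the generated 'nominal' column by label, not by position
df = pd.read_csv('tmp.csv', header=None,
                 parse_dates={'nominal': [1, 2], 'actual': [1, 3]},
                 index_col='nominal')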

Date parsing functions

Finally, the parser allows you to specify a custom date_parser function to take full advantage of the
flexibility of the date parsing API:

In [108]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
.....: date_parser=pd.io.date_converters.parse_date_time)
.....:

In [109]: df
Out[109]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

Pandas will try to call the date_parser function in three different ways. If an exception is raised, the next
one is tried:
1. date_parser is first called with one or more arrays as arguments, as defined using parse_dates (e.g.,
date_parser(['2013', '2013'], ['1', '2'])).
2. If #1 fails, date_parser is called with all the columns concatenated row-wise into a single array (e.g.,
date_parser(['2013 1', '2013 2'])).
3. If #2 fails, date_parser is called once for every row with one or more string arguments from
the columns indicated with parse_dates (e.g., date_parser('2013', '1') for the first row,
date_parser('2013', '2') for the second, etc.).
Note that performance-wise, you should try these methods of parsing dates in order:
1. Try to infer the format using infer_datetime_format=True (see section below).
2. If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x,
format=...).
3. If you have a really non-standard format, use a custom date_parser function. For optimal perfor-
mance, this should be vectorized, i.e., it should accept arrays as arguments.
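
A minimal sketch of option 2 (the data string is invented for illustration): a vectorized parser built from
pd.to_datetime with an explicit format:

from io import StringIO
import pandas as pd

data = 'when,value\n2011/12/30 00:00:00,1\n2011/12/31 00:00:00,2'
df = pd.read_csv(StringIO(data), parse_dates=['when'],
                 date_parser=lambda col: pd.to_datetime(col, format='%Y/%m/%d %H:%M:%S'))
df.dtypes  # 'when' is datetime64[ns]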
You can explore the date parsing functionality in date_converters.py and add your own. We would
love to turn this module into a community supported set of date/time parsers. To get you started,
date_converters.py contains functions to parse dual date and time columns, year/month/day columns,
and year/month/day/hour/minute/second columns. It also contains a generic_parser function so you can
curry it with a function that deals with a single date rather than the entire array.


Parsing a CSV with mixed timezones

Pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains
columns with a mixture of timezones, the default result will be an object-dtype column with strings, even
with parse_dates.

In [110]: content = """\
.....: a
.....: 2000-01-01T00:00:00+05:00
.....: 2000-01-01T00:00:00+06:00"""
.....:

In [111]: df = pd.read_csv(StringIO(content), parse_dates=['a'])

In [112]: df['a']
Out[112]:
0 2000-01-01 00:00:00+05:00
1 2000-01-01 00:00:00+06:00
Name: a, dtype: object

To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with
utc=True as the date_parser.

In [113]: df = pd.read_csv(StringIO(content), parse_dates=['a'],
.....: date_parser=lambda col: pd.to_datetime(col, utc=True))
.....:

In [114]: df['a']
Out[114]:
0 1999-12-31 19:00:00+00:00
1 1999-12-31 18:00:00+00:00
Name: a, dtype: datetime64[ns, UTC]

Inferring datetime format

If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted
the same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will
attempt to guess the format of your datetime strings, and then use a faster means of parsing the strings.
Speedups of 5-10x have been observed. pandas will fall back to the usual parsing if either the format
cannot be guessed or the format that was guessed cannot properly parse the entire column of strings. So in
general, infer_datetime_format should not have any negative consequences if enabled.
Here are some examples of datetime strings that can be guessed (All representing December 30th, 2011 at
00:00:00):
• “20111230”
• “2011/12/30”
• “20111230 00:00:00”
• “12/30/2011 00:00:00”
• “30/Dec/2011 00:00:00”
• “30/December/2011 00:00:00”


Note that infer_datetime_format is sensitive to dayfirst. With dayfirst=True, it will guess
“01/12/2011” to be December 1st. With dayfirst=False (default) it will guess “01/12/2011” to be January
12th.

# Try to infer the format for the index column
In [115]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
.....: infer_datetime_format=True)
.....:

In [116]: df
Out[116]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5

International date formats

While US date formats tend to be MM/DD/YYYY, many international formats use DD/MM/YYYY instead.
For convenience, a dayfirst keyword is provided:
In [117]: print(open('tmp.csv').read())
date,value,cat
1/6/2000,5,a
2/6/2000,10,b
3/6/2000,15,c

In [118]: pd.read_csv('tmp.csv', parse_dates=[0])


Out[118]:
date value cat
0 2000-01-06 5 a
1 2000-02-06 10 b
2 2000-03-06 15 c

In [119]: pd.read_csv('tmp.csv', dayfirst=True, parse_dates=[0])


Out[119]:
date value cat
0 2000-06-01 5 a
1 2000-06-02 10 b
2 2000-06-03 15 c

Specifying method for floating-point conversion

The parameter float_precision can be specified in order to use a specific floating-point converter during
parsing with the C engine. The options are the ordinary converter, the high-precision converter, and the
round-trip converter (which is guaranteed to round-trip values after writing to a file). For example:
In [120]: val = '0.3066101993807095471566981359501369297504425048828125'

In [121]: data = 'a,b,c\n1,2,{0}'.format(val)

In [122]: abs(pd.read_csv(StringIO(data), engine='c',
.....: float_precision=None)['c'][0] - float(val))
.....:
Out[122]: 1.1102230246251565e-16

In [123]: abs(pd.read_csv(StringIO(data), engine='c',
.....: float_precision='high')['c'][0] - float(val))
.....:
Out[123]: 5.551115123125783e-17

In [124]: abs(pd.read_csv(StringIO(data), engine='c',
.....: float_precision='round_trip')['c'][0] - float(val))
.....:
Out[124]: 0.0

Thousand separators

For large numbers that have been written with a thousands separator, you can set the thousands keyword
to a string of length 1 so that integers will be parsed correctly:
By default, numbers with a thousands separator will be parsed as strings:
In [125]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z

In [126]: df = pd.read_csv('tmp.csv', sep='|')

In [127]: df
Out[127]:
ID level category
0 Patient1 123,000 x
1 Patient2 23,000 y
2 Patient3 1,234,018 z

In [128]: df.level.dtype
Out[128]: dtype('O')
The thousands keyword allows integers to be parsed correctly:
In [129]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z

In [130]: df = pd.read_csv('tmp.csv', sep='|', thousands=',')

In [131]: df
Out[131]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z


In [132]: df.level.dtype
Out[132]: dtype('int64')

NA values

To control which values are parsed as missing values (which are signified by NaN), specify a string in
na_values. If you specify a list of strings, then all values in it are considered to be missing values. If
you specify a number (a float, like 5.0 or an integer like 5), the corresponding equivalent values will also
imply a missing value (in this case effectively [5.0, 5] are recognized as NaN).
To completely override the default values that are recognized as missing, specify keep_default_na=False.
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A',
'#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
Let us consider some examples:

pd.read_csv('path_to_file.csv', na_values=[5])

In the example above 5 and 5.0 will be recognized as NaN, in addition to the defaults. A string like "5"
will first be interpreted as the numerical value 5, and will then be recognized as NaN.

pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=[""])

Above, only an empty field will be recognized as NaN.

pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=["NA", "0"])

Above, both NA and 0 as strings are NaN.

pd.read_csv('path_to_file.csv', na_values=["Nope"])

The default values, in addition to the string "Nope" are recognized as NaN.

Infinity

inf like values will be parsed as np.inf (positive infinity), and -inf as -np.inf (negative infinity). The
parser ignores the case of the value, meaning Inf will also be parsed as np.inf.
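
A minimal sketch (invented data):

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('x\ninf\n-Inf\nINF'))
df['x'].tolist()  # [inf, -inf, inf]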

Returning Series

Using the squeeze keyword, the parser will return output with a single column as a Series:
In [133]: print(open('tmp.csv').read())
level
Patient1,123000
Patient2,23000
Patient3,1234018

In [134]: output = pd.read_csv('tmp.csv', squeeze=True)

In [135]: output
Out[135]:
Patient1 123000
Patient2 23000
Patient3 1234018
Name: level, dtype: int64

In [136]: type(output)
Out[136]: pandas.core.series.Series

Boolean values

The common values True, False, TRUE, and FALSE are all recognized as boolean. Occasionally you might
want to recognize other values as being boolean. To do this, use the true_values and false_values options
as follows:
In [137]: data = ('a,b,c\n'
.....: '1,Yes,2\n'
.....: '3,No,4')
.....:

In [138]: print(data)
a,b,c
1,Yes,2
3,No,4

In [139]: pd.read_csv(StringIO(data))
Out[139]:
a b c
0 1 Yes 2
1 3 No 4

In [140]: pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])


Out[140]:
a b c
0 1 True 2
1 3 False 4

Handling “bad” lines

Some files may have malformed lines with too few fields or too many. Lines with too few fields will have NA
values filled in the trailing fields. Lines with too many fields will raise an error by default:

In [141]: data = ('a,b,c\n'
.....: '1,2,3\n'
.....: '4,5,6,7\n'
.....: '8,9,10')
.....:

In [142]: pd.read_csv(StringIO(data))
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-142-6388c394e6b8> in <module>
----> 1 pd.read_csv(StringIO(data))
...
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

You can elect to skip bad lines:


In [29]: pd.read_csv(StringIO(data), error_bad_lines=False)


Skipping line 3: expected 3 fields, saw 4

Out[29]:
a b c
0 1 2 3
1 8 9 10

You can also use the usecols parameter to eliminate extraneous column data that appear in some lines but
not others:

In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])

Out[30]:
a b c
0 1 2 3
1 4 5 6
2 8 9 10

Dialect

The dialect keyword gives greater flexibility in specifying the file format. By default it uses the Excel
dialect but you can specify either the dialect name or a csv.Dialect instance.
Suppose you had data with unenclosed quotes:

In [143]: print(data)
label1,label2,label3
index1,"a,c,e
index2,b,d,f

By default, read_csv uses the Excel dialect and treats the double quote as the quote character, which causes
it to fail when it finds a newline before it finds the closing double quote.
We can get around this using dialect:

In [144]: import csv

In [145]: dia = csv.excel()

In [146]: dia.quoting = csv.QUOTE_NONE

In [147]: pd.read_csv(StringIO(data), dialect=dia)


Out[147]:
label1 label2 label3
index1 "a c e
index2 b d f

All of the dialect options can be specified separately by keyword arguments:

In [148]: data = 'a,b,c~1,2,3~4,5,6'

In [149]: pd.read_csv(StringIO(data), lineterminator='~')


Out[149]:
a b c
0 1 2 3
1 4 5 6

Another common dialect option is skipinitialspace, to skip any whitespace after a delimiter:
In [150]: data = 'a, b, c\n1, 2, 3\n4, 5, 6'

In [151]: print(data)
a, b, c
1, 2, 3
4, 5, 6

In [152]: pd.read_csv(StringIO(data), skipinitialspace=True)


Out[152]:
a b c
0 1 2 3
1 4 5 6
The parsers make every attempt to “do the right thing” and not be fragile. Type inference is a pretty big
deal. If a column can be coerced to integer dtype without altering the contents, the parser will do so. Any
non-numeric columns will come through as object dtype as with the rest of pandas objects.

Quoting and Escape Characters

Quotes (and other escape characters) in embedded fields can be handled in any number of ways. One way
is to use backslashes; to properly parse this data, you should pass the escapechar option:
In [153]: data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'

In [154]: print(data)
a,b
"hello, \"Bob\", nice to see you",5

In [155]: pd.read_csv(StringIO(data), escapechar='\\')


Out[155]:
a b
0 hello, "Bob", nice to see you 5

Files with fixed width columns

While read_csv() reads delimited data, the read_fwf() function works with data files that have known
and fixed column widths. The function parameters to read_fwf are largely the same as read_csv with two
extra parameters, and a different usage of the delimiter parameter:
• colspecs: A list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open
intervals (i.e., [from, to)). The string value 'infer' can be used to instruct the parser to try detecting the
column specifications from the first 100 rows of the data. Default behavior, if not specified, is to infer.
• widths: A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.
• delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify
the filler character of the fields if it is not spaces (e.g., ‘~’).


Consider a typical fixed-width data file:

In [156]: print(open('bar.csv').read())
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3

In order to parse this file into a DataFrame, we simply need to supply the column specifications to the
read_fwf function along with the file name:

# Column specifications are a list of half-intervals
In [157]: colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]

In [158]: df = pd.read_fwf('bar.csv', colspecs=colspecs, header=None, index_col=0)

In [159]: df
Out[159]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3

Note how the parser automatically picks column names X.<column number> when the header=None argument
is specified. Alternatively, you can supply just the column widths for contiguous columns:

# Widths are a list of integers
In [160]: widths = [6, 14, 13, 10]

In [161]: df = pd.read_fwf('bar.csv', widths=widths, header=None)

In [162]: df
Out[162]:
0 1 2 3
0 id8141 360.242940 149.910199 11950.7
1 id1594 444.953632 166.985655 11788.4
2 id1849 364.136849 183.628767 11806.2
3 id1230 413.836124 184.375703 11916.8
4 id1948 502.953953 173.237159 12468.3

The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between
the columns in the file.
By default, read_fwf will try to infer the file’s colspecs by using the first 100 rows of the file. It can do
it only in cases when the columns are aligned and correctly separated by the provided delimiter (default
delimiter is whitespace).

In [163]: df = pd.read_fwf('bar.csv', header=None, index_col=0)

In [164]: df
Out[164]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3

New in version 0.20.0.


read_fwf supports the dtype parameter for specifying the types of parsed columns to be different from the
inferred type.
In [165]: pd.read_fwf('bar.csv', header=None, index_col=0).dtypes
Out[165]:
1 float64
2 float64
3 float64
dtype: object

In [166]: pd.read_fwf('bar.csv', header=None, dtype={2: 'object'}).dtypes


Out[166]:
0 object
1 float64
2 object
3 float64
dtype: object

Indexes

Files with an “implicit” index column

Consider a file with one less entry in the header than the number of data columns:

In [167]: print(open('foo.csv').read())
A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5

In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame:

In [168]: pd.read_csv('foo.csv')
Out[168]:
A B C
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5

Note that the dates weren’t automatically parsed. In that case you would need to do as before:


In [169]: df = pd.read_csv('foo.csv', parse_dates=True)

In [170]: df.index
Out[170]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', freq=None)

Reading an index with a MultiIndex

Suppose you have data indexed by two columns:

In [171]: print(open('data/mindex_ex.csv').read())
year,indiv,zit,xit
1977,"A",1.2,.6
1977,"B",1.5,.5
1977,"C",1.7,.8
1978,"A",.2,.06
1978,"B",.7,.2
1978,"C",.8,.3
1978,"D",.9,.5
1978,"E",1.4,.9
1979,"C",.2,.15
1979,"D",.14,.05
1979,"E",.5,.15
1979,"F",1.2,.5
1979,"G",3.4,1.9
1979,"H",5.4,2.7
1979,"I",6.4,1.2

The index_col argument to read_csv can take a list of column numbers to turn multiple columns into a
MultiIndex for the index of the returned object:
In [172]: df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])

In [173]: df
Out[173]:
zit xit
year indiv
1977 A 1.20 0.60
B 1.50 0.50
C 1.70 0.80
1978 A 0.20 0.06
B 0.70 0.20
C 0.80 0.30
D 0.90 0.50
E 1.40 0.90
1979 C 0.20 0.15
D 0.14 0.05
E 0.50 0.15
F 1.20 0.50
G 3.40 1.90
H 5.40 2.70
I 6.40 1.20


In [174]: df.loc[1978]
Out[174]:
zit xit
indiv
A 0.2 0.06
B 0.7 0.20
C 0.8 0.30
D 0.9 0.50
E 1.4 0.90

Reading columns with a MultiIndex

By specifying a list of row locations for the header argument, you can read in a MultiIndex for the columns.
Specifying non-consecutive rows will skip the intervening rows.
In [175]: from pandas.util.testing import makeCustomDataframe as mkdf

In [176]: df = mkdf(5, 3, r_idx_nlevels=2, c_idx_nlevels=4)

In [177]: df.to_csv('mi.csv')

In [178]: print(open('mi.csv').read())
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2

In [179]: pd.read_csv('mi.csv', header=[0, 1, 2, 3], index_col=[0, 1])


Out[179]:
C0 C_l0_g0 C_l0_g1 C_l0_g2
C1 C_l1_g0 C_l1_g1 C_l1_g2
C2 C_l2_g0 C_l2_g1 C_l2_g2
C3 C_l3_g0 C_l3_g1 C_l3_g2
R0 R1
R_l0_g0 R_l1_g0 R0C0 R0C1 R0C2
R_l0_g1 R_l1_g1 R1C0 R1C1 R1C2
R_l0_g2 R_l1_g2 R2C0 R2C1 R2C2
R_l0_g3 R_l1_g3 R3C0 R3C1 R3C2
R_l0_g4 R_l1_g4 R4C0 R4C1 R4C2
read_csv is also able to interpret a more common format of multi-column indices.
In [180]: print(open('mi2.csv').read())
,a,a,a,b,c,c
,q,r,s,t,u,v
one,1,2,3,4,5,6
two,7,8,9,10,11,12


In [181]: pd.read_csv('mi2.csv', header=[0, 1], index_col=0)


Out[181]:
a b c
q r s t u v
one 1 2 3 4 5 6
two 7 8 9 10 11 12
Note: If an index_col is not specified (e.g. you don't have an index, or wrote it with df.to_csv(...,
index=False)), then any names on the columns index will be lost.

Automatically “sniffing” the delimiter

read_csv is capable of inferring delimited (not necessarily comma-separated) files, as pandas uses the csv.
Sniffer class of the csv module. For this, you have to specify sep=None.
In [182]: print(open('tmp2.sv').read())
:0:1:2:3
0:1.1214905765122583:-1.1011663421613171:-1.2725711408453018:0.8434589457722285
1:0.8739661419816901:-1.1622548707272122:0.12618578996106738:0.5057848504967111
2:0.6695152369722812:0.4833977900441433:-0.4383565886430891:-0.13952146077085656
3:1.6678766138462109:0.906356209978661:0.8603041052486606:-0.009413710135323125
4:-0.8075485015292924:-0.7848128653629299:-1.3155155066668116:0.6875244729698119
5:-0.1572352664979729:0.30339976035788174:-0.36340691002502046:-0.5526511482544121
6:0.41442095212262187:0.17517103850750262:-0.5295157789486404:-0.06745694327155764
7:1.058814717443789:-0.11789792502832808:-1.8534207864364352:-0.7018494437516053
8:0.26239634172416604:-1.7245959745828128:0.2765803759042711:1.0730241342647273
9:0.6352851164219758:-2.1785482358583024:0.3120437647651685:1.5723784501068536

In [183]: pd.read_csv('tmp2.sv', sep=None, engine='python')


Out[183]:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
4 4 -0.807549 -0.784813 -1.315516 0.687524
5 5 -0.157235 0.303400 -0.363407 -0.552651
6 6 0.414421 0.175171 -0.529516 -0.067457
7 7 1.058815 -0.117898 -1.853421 -0.701849
8 8 0.262396 -1.724596 0.276580 1.073024
9 9 0.635285 -2.178548 0.312044 1.572378

Reading multiple files to create a single DataFrame

It’s best to use concat() to combine multiple files. See the cookbook for an example.

Iterating through files chunk by chunk

Suppose you wish to iterate through a (potentially very large) file lazily rather than reading the entire file
into memory, such as the following:


In [184]: print(open('tmp.sv').read())
|0|1|2|3
0|1.1214905765122583|-1.1011663421613171|-1.2725711408453018|0.8434589457722285
1|0.8739661419816901|-1.1622548707272122|0.12618578996106738|0.5057848504967111
2|0.6695152369722812|0.4833977900441433|-0.4383565886430891|-0.13952146077085656
3|1.6678766138462109|0.906356209978661|0.8603041052486606|-0.009413710135323125
4|-0.8075485015292924|-0.7848128653629299|-1.3155155066668116|0.6875244729698119
5|-0.1572352664979729|0.30339976035788174|-0.36340691002502046|-0.5526511482544121
6|0.41442095212262187|0.17517103850750262|-0.5295157789486404|-0.06745694327155764
7|1.058814717443789|-0.11789792502832808|-1.8534207864364352|-0.7018494437516053
8|0.26239634172416604|-1.7245959745828128|0.2765803759042711|1.0730241342647273
9|0.6352851164219758|-2.1785482358583024|0.3120437647651685|1.5723784501068536

In [185]: table = pd.read_csv('tmp.sv', sep='|')

In [186]: table
Out[186]:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
4 4 -0.807549 -0.784813 -1.315516 0.687524
5 5 -0.157235 0.303400 -0.363407 -0.552651
6 6 0.414421 0.175171 -0.529516 -0.067457
7 7 1.058815 -0.117898 -1.853421 -0.701849
8 8 0.262396 -1.724596 0.276580 1.073024
9 9 0.635285 -2.178548 0.312044 1.572378

By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader:
In [187]: reader = pd.read_csv('tmp.sv', sep='|', chunksize=4)

In [188]: reader
Out[188]: <pandas.io.parsers.TextFileReader at 0x13eb31a90>

In [189]: for chunk in reader:
.....: print(chunk)
.....:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
Unnamed: 0 0 1 2 3
4 4 -0.807549 -0.784813 -1.315516 0.687524
5 5 -0.157235 0.303400 -0.363407 -0.552651
6 6 0.414421 0.175171 -0.529516 -0.067457
7 7 1.058815 -0.117898 -1.853421 -0.701849
Unnamed: 0 0 1 2 3
8 8 0.262396 -1.724596 0.276580 1.073024
9 9 0.635285 -2.178548 0.312044 1.572378


Specifying iterator=True will also return the TextFileReader object:

In [190]: reader = pd.read_csv('tmp.sv', sep='|', iterator=True)

In [191]: reader.get_chunk(5)
Out[191]:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
4 4 -0.807549 -0.784813 -1.315516 0.687524

Specifying the parser engine

Under the hood pandas uses a fast and efficient parser implemented in C as well as a Python implementation
which is currently more feature-complete. Where possible pandas uses the C parser (specified as engine='c'),
but may fall back to Python if C-unsupported options are specified. Currently, C-unsupported options
include:
• sep other than a single character (e.g. regex separators)
• skipfooter
• sep=None with delim_whitespace=False
Specifying any of the above options will produce a ParserWarning unless the python engine is selected
explicitly using engine='python'.
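
A minimal sketch (invented data): a regex separator is one of the C-unsupported options above, so selecting
engine='python' explicitly avoids the ParserWarning:

from io import StringIO
import pandas as pd

data = 'a; b;c\n1;2; 3'
df = pd.read_csv(StringIO(data), sep=r';\s*', engine='python')  # columns a, b, c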

Reading remote files

You can pass in a URL to a CSV file:

df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',
sep='\t')

S3 URLs are handled as well but require installing the S3Fs library:

df = pd.read_csv('s3://pandas-test/tips.csv')

If your S3 bucket requires credentials you will need to set them as environment variables or in the ~/.aws/
credentials config file, refer to the S3Fs documentation on credentials.

Writing out data

Writing to CSV format

The Series and DataFrame objects have an instance method to_csv which allows storing the contents of
the object as a comma-separated-values file. The function takes a number of arguments. Only the first is
required.
• path_or_buf: A string path to the file to write or a file object. If a file object, it must be opened with
newline=''
• sep : Field delimiter for the output file (default “,”)


• na_rep: A string representation of a missing value (default '')
• float_format: Format string for floating point numbers
• columns: Columns to write (default None)
• header: Whether to write out the column names (default True)
• index: whether to write row (index) names (default True)
• index_label: Column label(s) for index column(s) if desired. If None (default), and header and
index are True, then the index names are used. (A sequence should be given if the DataFrame uses
MultiIndex).
• mode : Python write mode, default ‘w’
• encoding: a string representing the encoding to use if the contents are non-ASCII, for Python versions
prior to 3
• line_terminator: Character sequence denoting line end (default os.linesep)
• quoting: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have
set a float_format then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat
them as non-numeric
• quotechar: Character used to quote fields (default ‘”’)
• doublequote: Control quoting of quotechar in fields (default True)
• escapechar: Character used to escape sep and quotechar when appropriate (default None)
• chunksize: Number of rows to write at a time
• date_format: Format string for datetime objects
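
A minimal sketch exercising a few of the options above (out.csv is a hypothetical path):

import pandas as pd

df = pd.DataFrame({'x': [1.5, None], 'y': ['a', 'b']})
df.to_csv('out.csv', index=False, na_rep='NA', float_format='%.2f')
# out.csv now contains:
# x,y
# 1.50,a
# NA,b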

Writing a formatted string

The DataFrame object has an instance method to_string which allows control over the string representation
of the object. All arguments are optional:
• buf default None, for example a StringIO object
• columns default None, which columns to write
• col_space default None, minimum width of each column.
• na_rep default NaN, representation of NA value
• formatters default None, a dictionary (by column) of functions each of which takes a single argument
and returns a formatted string
• float_format default None, a function which takes a single (float) argument and returns a formatted
string; to be applied to floats in the DataFrame.
• sparsify default True, set to False for a DataFrame with a hierarchical index to print every MultiIndex
key at each row.
• index_names default True, will print the names of the indices
• index default True, will print the index (i.e., row labels)
• header default True, will print the column labels
• justify default left, will print column headers left- or right-justified


The Series object also has a to_string method, but with only the buf, na_rep, float_format arguments.
There is also a length argument which, if set to True, will additionally output the length of the Series.
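
A minimal sketch of a couple of these options (invented data):

import pandas as pd

df = pd.DataFrame({'a': [1.23456, 2.5], 'b': ['x', 'y']})
print(df.to_string(index=False,
                   float_format=lambda v: '{:.2f}'.format(v)))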

4.1.2 JSON

Read and write JSON format files and strings.

Writing JSON

A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:
• path_or_buf : the pathname or buffer to write the output. This can be None, in which case a JSON
string is returned
• orient :
Series:
– default is index
– allowed values are {split, records, index}
DataFrame:
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string

split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, … , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array

• date_format : string, type of date conversion, ‘epoch’ for timestamp, ‘iso’ for ISO8601.
• double_precision : The number of decimal places to use when encoding floating point values, default
10.
• force_ascii : force encoded string to be ASCII, default True.
• date_unit : The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’,
‘us’ or ‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’.
• default_handler : The handler to call if an object cannot otherwise be converted to a suitable format
for JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
• lines : If using the records orient, then write each record as a separate line of JSON.
Note: NaN's, NaT's and None will be converted to null, and datetime objects will be converted based on
the date_format and date_unit parameters.

In [192]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [193]: json = dfj.to_json()



In [194]: json
Out[194]: '{"A":{"0":2.0027932898,"1":-1.1284197337,"2":-0.2671230751,"3":0.0590811856,"4
,→":0.2126018166},"B":{"0":0.009310115,"1":-1.2591311739,"2":1.7549089729,"3":0.

,→9464922966,"4":-0.5276761509}}'

Orient options

There are a number of different options for the format of the resulting JSON file / string. Consider the
following DataFrame and Series:

In [195]: dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),


.....: columns=list('ABC'), index=list('xyz'))
.....:

In [196]: dfjo
Out[196]:
A B C
x 1 4 7
y 2 5 8
z 3 6 9

In [197]: sjo = pd.Series(dict(x=15, y=16, z=17), name='D')

In [198]: sjo
Out[198]:
x 15
y 16
z 17
Name: D, dtype: int64

Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column
labels acting as the primary index:

In [199]: dfjo.to_json(orient="columns")
Out[199]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'

# Not available for Series

Index oriented (the default for Series) is similar to column oriented, but the index labels are now primary:
In [200]: dfjo.to_json(orient="index")
Out[200]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'

In [201]: sjo.to_json(orient="index")
Out[201]: '{"x":15,"y":16,"z":17}'
Record oriented serializes the data to a JSON array of column -> value records; index labels are not
included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library
d3.js:
In [202]: dfjo.to_json(orient="records")
Out[202]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'


In [203]: sjo.to_json(orient="records")
Out[203]: '[15,16,17]'
Value oriented is a bare-bones option which serializes to nested JSON arrays of values only; column and
index labels are not included:
In [204]: dfjo.to_json(orient="values")
Out[204]: '[[1,4,7],[2,5,8],[3,6,9]]'

# Not available for Series

Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name
is also included for Series:
In [205]: dfjo.to_json(orient="split")
Out[205]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,
,→9]]}'

In [206]: sjo.to_json(orient="split")
Out[206]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'
Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including
but not limited to dtypes and index names.

Note: Any orient option that encodes to a JSON object will not preserve the ordering of index and column
labels during round-trip serialization. If you wish to preserve label ordering use the split option as it uses
ordered containers.
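For instance, a round-trip through the split orient (a sketch using dfjo from above) preserves both orders:

js = dfjo.to_json(orient='split')
pd.read_json(js, orient='split')  # index and column order preserved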

Date handling

Writing in ISO date format:

In [207]: dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [208]: dfd['date'] = pd.Timestamp('20130101')

In [209]: dfd = dfd.sort_index(1, ascending=False)

In [210]: json = dfd.to_json(date_format='iso')

In [211]: json
Out[211]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":
,→"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z

,→"},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.1144885256,"4":0.

,→5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,"3":1.5730764249,

,→"4":0.6200376615}}'

Writing in ISO date format, with microseconds:


In [212]: json = dfd.to_json(date_format='iso', date_unit='us')

In [213]: json
Out[213]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z",
,→"2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-
,→01T00:00:00.000000Z"},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.
,→1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,
,→"3":1.5730764249,"4":0.6200376615}}'

Epoch timestamps, in seconds:

In [214]: json = dfd.to_json(date_format='epoch', date_unit='s')

In [215]: json
Out[215]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4
,→":1356998400},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.

,→1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,

,→"3":1.5730764249,"4":0.6200376615}}'

Writing to a file, with a date index and a date column:

In [216]: dfj2 = dfj.copy()

In [217]: dfj2['date'] = pd.Timestamp('20130101')

In [218]: dfj2['ints'] = list(range(5))

In [219]: dfj2['bools'] = True

In [220]: dfj2.index = pd.date_range('20130101', periods=5)

In [221]: dfj2.to_json('test.json')

In [222]: with open('test.json') as fh:


.....: print(fh.read())
.....:
{"A":{"1356998400000":2.0027932898,"1357084800000":-1.1284197337,"1357171200000":-0.
,→2671230751,"1357257600000":0.0590811856,"1357344000000":0.2126018166},"B":{

,→"1356998400000":0.009310115,"1357084800000":-1.2591311739,"1357171200000":1.7549089729,

,→"1357257600000":0.9464922966,"1357344000000":-0.5276761509},"date":{"1356998400000

,→":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,

,→"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,

,→"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{

,→"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,

,→"1357344000000":true}}

Fallback behavior

If the JSON serializer cannot handle the container contents directly it will fall back in the following manner:
• if the dtype is unsupported (e.g. np.complex) then the default_handler, if provided, will be called
for each value, otherwise an exception is raised.
• if an object is unsupported it will attempt the following:
– check if the object has defined a toDict method and call it. A toDict method should return a
dict which will then be JSON serialized.
– invoke the default_handler if one was provided.
– convert the object to a dict by traversing its contents. However this will often fail with an
OverflowError or give unexpected results.
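As a sketch of the toDict hook described above (the Point class and the shown output are invented for
illustration):

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def toDict(self):
        # consulted by the serializer before the other fallbacks
        return {'x': self.x, 'y': self.y}

pd.DataFrame({'p': [Point(1, 2)]}).to_json()
# e.g. '{"p":{"0":{"x":1,"y":2}}}'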


In general the best approach for unsupported objects or dtypes is to provide a default_handler. For
example:

>>> DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json() # raises


RuntimeError: Unhandled numpy dtype 15

can be dealt with by specifying a simple default_handler:

In [223]: pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)


Out[223]: '{"0":{"0":"(1+0j)","1":"(2+0j)","2":"(1+2j)"}}'

Reading JSON

Reading a JSON string into a pandas object can take a number of parameters. The parser will try to parse a
DataFrame if typ is not supplied or is None. To explicitly force Series parsing, pass typ=series.
• filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be a URL.
Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a
local file could be file://localhost/path/to/table.json
• typ : type of object to recover (series or frame), default ‘frame’
• orient :
Series :
– default is index
– allowed values are {split, records, index}
DataFrame
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string

split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, … , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array
table adhering to the JSON Table Schema

• dtype : if True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer
dtypes at all. Default is True; applies only to the data.
• convert_axes : boolean, try to convert the axes to the proper dtypes, default is True
• convert_dates : a list of columns to parse for dates; If True, then try to parse date-like columns,
default is True.
• keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like
columns.
• numpy : direct decoding to NumPy arrays. default is False; Supports numeric data only, although
labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if
numpy=True.


• precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when
decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
• date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the
timestamp precision will be detected, if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force
timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.
• lines : reads file as one json object per line.
• encoding : The encoding to use to decode py3 bytes.
• chunksize : when used in combination with lines=True, return a JsonReader which reads in
chunksize lines per iteration.
The parser will raise one of ValueError/TypeError/AssertionError if the JSON is not parseable.
If a non-default orient was used when encoding to JSON be sure to pass the same option here so that
decoding produces sensible results, see Orient Options for an overview.
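For example, a sketch round-tripping the sjo Series from above with the split orient:

js = sjo.to_json(orient='split')
pd.read_json(js, orient='split', typ='series')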

Data conversion

The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and
all of the data into appropriate types, including dates. If you need to override specific dtypes, pass a dict to
dtype. convert_axes should only be set to False if you need to preserve string-like numbers (e.g. ‘1’, ‘2’)
in the axes.

Note: Large integer values may be converted to dates if convert_dates=True and the data and / or
column labels appear ‘date-like’. The exact threshold depends on the date_unit specified. ‘date-like’ means
that the column label meets one of the following criteria:
• it ends with '_at'
• it ends with '_time'
• it begins with 'timestamp'
• it is 'modified'
• it is 'date'
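For instance, with the default settings only the date-like label is parsed as a datetime (a sketch with
invented column names):

js = '{"modified":{"0":1356998400000},"count":{"0":1356998400000}}'
pd.read_json(js).dtypes  # 'modified' -> datetime64[ns], 'count' -> int64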

Warning: When reading JSON data, automatic coercing into dtypes has some quirks:
• an index can be reconstructed in a different order from serialization, that is, the returned order is
not guaranteed to be the same as before serialization
• a column that was float data will be converted to integer if it can be done safely, e.g. a column
of 1.0 values.
• bool columns will be converted to integer on reconstruction
Thus there are times where you may want to specify specific dtypes via the dtype keyword argument.

Reading from a JSON string:

In [224]: pd.read_json(json)
Out[224]:
date B A
0 2013-01-01 0.390338 -1.095412
1 2013-01-01 -0.522368 -0.147142
2 2013-01-01 2.024915 0.630583
3 2013-01-01 2.114489 1.573076
4 2013-01-01 0.533759 0.620038

Reading from a file:

In [225]: pd.read_json('test.json')
Out[225]:
A B date ints bools
2013-01-01 2.002793 0.009310 2013-01-01 0 True
2013-01-02 -1.128420 -1.259131 2013-01-01 1 True
2013-01-03 -0.267123 1.754909 2013-01-01 2 True
2013-01-04 0.059081 0.946492 2013-01-01 3 True
2013-01-05 0.212602 -0.527676 2013-01-01 4 True

Don’t convert any data (but still convert axes and dates):

In [226]: pd.read_json('test.json', dtype=object).dtypes


Out[226]:
A object
B object
date object
ints object
bools object
dtype: object

Specify dtypes for conversion:

In [227]: pd.read_json('test.json', dtype={'A': 'float32', 'bools': 'int8'}).dtypes


Out[227]:
A float32
B float64
date datetime64[ns]
ints int64
bools int8
dtype: object

Preserve string indices:


In [228]: si = pd.DataFrame(np.zeros((4, 4)), columns=list(range(4)),
.....: index=[str(i) for i in range(4)])
.....:

In [229]: si
Out[229]:
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0

In [230]: si.index


Out[230]: Index(['0', '1', '2', '3'], dtype='object')

In [231]: si.columns
Out[231]: Int64Index([0, 1, 2, 3], dtype='int64')

In [232]: json = si.to_json()

In [233]: sij = pd.read_json(json, convert_axes=False)

In [234]: sij
Out[234]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0

In [235]: sij.index
Out[235]: Index(['0', '1', '2', '3'], dtype='object')

In [236]: sij.columns
Out[236]: Index(['0', '1', '2', '3'], dtype='object')
Dates written in nanoseconds need to be read back in nanoseconds:
In [237]: json = dfj2.to_json(date_unit='ns')

# Try to parse timestamps as milliseconds -> Won't Work


In [238]: dfju = pd.read_json(json, date_unit='ms')

In [239]: dfju
Out[239]:
A B date ints bools
1356998400000000000 2.002793 0.009310 1356998400000000000 0 True
1357084800000000000 -1.128420 -1.259131 1356998400000000000 1 True
1357171200000000000 -0.267123 1.754909 1356998400000000000 2 True
1357257600000000000 0.059081 0.946492 1356998400000000000 3 True
1357344000000000000 0.212602 -0.527676 1356998400000000000 4 True

# Let pandas detect the correct precision


In [240]: dfju = pd.read_json(json)

In [241]: dfju
Out[241]:
A B date ints bools
2013-01-01 2.002793 0.009310 2013-01-01 0 True
2013-01-02 -1.128420 -1.259131 2013-01-01 1 True
2013-01-03 -0.267123 1.754909 2013-01-01 2 True
2013-01-04 0.059081 0.946492 2013-01-01 3 True
2013-01-05 0.212602 -0.527676 2013-01-01 4 True

# Or specify that all timestamps are in nanoseconds


In [242]: dfju = pd.read_json(json, date_unit='ns')



In [243]: dfju
Out[243]:
A B date ints bools
2013-01-01 2.002793 0.009310 2013-01-01 0 True
2013-01-02 -1.128420 -1.259131 2013-01-01 1 True
2013-01-03 -0.267123 1.754909 2013-01-01 2 True
2013-01-04 0.059081 0.946492 2013-01-01 3 True
2013-01-05 0.212602 -0.527676 2013-01-01 4 True

The Numpy parameter

Note: This supports numeric data only. Index and column labels may be non-numeric, e.g. strings, dates
etc.

If numpy=True is passed to read_json an attempt will be made to sniff an appropriate dtype during deseri-
alization and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python
objects.
This can provide speedups if you are deserialising a large amount of numeric data:

In [244]: randfloats = np.random.uniform(-100, 1000, 10000)

In [245]: randfloats.shape = (1000, 10)

In [246]: dffloats = pd.DataFrame(randfloats, columns=list('ABCDEFGHIJ'))

In [247]: jsonfloats = dffloats.to_json()

In [248]: %timeit pd.read_json(jsonfloats)


10.7 ms +- 321 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [249]: %timeit pd.read_json(jsonfloats, numpy=True)


8.68 ms +- 352 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

The speedup is less noticeable for smaller datasets:

In [250]: jsonfloats = dffloats.head(100).to_json()

In [251]: %timeit pd.read_json(jsonfloats)


7.13 ms +- 153 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [252]: %timeit pd.read_json(jsonfloats, numpy=True)


6.36 ms +- 250 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

Warning: Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected
output if these assumptions are not satisfied:
• data is numeric.


• data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised,
or incorrect output may be produced if this condition is not satisfied.
• labels are ordered. Labels are only read from the first container, it is assumed that each subsequent
row / column has been encoded in the same order. This should be satisfied if the data was encoded
using to_json but may not be the case if the JSON is from another source.

Normalization

pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into
a flat table.

In [253]: from pandas.io.json import json_normalize

In [254]: data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},


.....: {'name': {'given': 'Mose', 'family': 'Regner'}},
.....: {'id': 2, 'name': 'Faye Raker'}]
.....:

In [255]: json_normalize(data)
Out[255]:
id name.first name.last name.given name.family name
0 1.0 Coleen Volk NaN NaN NaN
1 NaN NaN NaN Mose Regner NaN
2 2.0 NaN NaN NaN NaN Faye Raker

In [256]: data = [{'state': 'Florida',


.....: 'shortname': 'FL',
.....: 'info': {'governor': 'Rick Scott'},
.....: 'counties': [{'name': 'Dade', 'population': 12345},
.....: {'name': 'Broward', 'population': 40000},
.....: {'name': 'Palm Beach', 'population': 60000}]},
.....: {'state': 'Ohio',
.....: 'shortname': 'OH',
.....: 'info': {'governor': 'John Kasich'},
.....: 'counties': [{'name': 'Summit', 'population': 1234},
.....: {'name': 'Cuyahoga', 'population': 1337}]}]
.....:

In [257]: json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])


Out[257]:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich

The max_level parameter provides more control over which level to end normalization at. With max_level=1,
the following snippet normalizes only up to the first nesting level of the provided dict.


In [258]: data = [{'CreatedBy': {'Name': 'User001'},


.....: 'Lookup': {'TextField': 'Some text',
.....: 'UserField': {'Id': 'ID001',
.....: 'Name': 'Name001'}},
.....: 'Image': {'a': 'b'}
.....: }]
.....:

In [259]: json_normalize(data, max_level=1)


Out[259]:
CreatedBy.Name Lookup.TextField Lookup.UserField Image.a
0 User001 Some text {'Id': 'ID001', 'Name': 'Name001'} b

Line delimited json

New in version 0.19.0.


pandas is able to read and write line-delimited json files that are common in data processing pipelines using
Hadoop or Spark.
New in version 0.21.0.
For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time.
This can be useful for large files or to read from a stream.
In [260]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:

In [261]: df = pd.read_json(jsonl, lines=True)

In [262]: df
Out[262]:
a b
0 1 2
1 3 4

In [263]: df.to_json(orient='records', lines=True)


Out[263]: '{"a":1,"b":2}\n{"a":3,"b":4}'

# reader is an iterator that returns `chunksize` lines each iteration


In [264]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)

In [265]: reader
Out[265]: <pandas.io.json._json.JsonReader at 0x13ed13b10>

In [266]: for chunk in reader:


.....: print(chunk)
.....:
Empty DataFrame
Columns: []
Index: []


a b
0 1 2
a b
1 3 4

Table schema

New in version 0.20.0.


Table Schema is a spec for describing tabular datasets as a JSON object. The JSON includes information
on the field names, types, and other attributes. You can use the orient table to build a JSON string with
two fields, schema and data.
In [267]: df = pd.DataFrame({'A': [1, 2, 3],
.....: 'B': ['a', 'b', 'c'],
.....: 'C': pd.date_range('2016-01-01', freq='d', periods=3)},
.....: index=pd.Index(range(3), name='idx'))
.....:

In [268]: df
Out[268]:
A B C
idx
0 1 a 2016-01-01
1 2 b 2016-01-02
2 3 c 2016-01-03

In [269]: df.to_json(orient='table', date_format="iso")


Out[269]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A",
,→"type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],

,→"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a",

,→"C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},

,→{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'

The schema field contains the fields key, which itself contains a list of column name to type pairs, including
the Index or MultiIndex (see below for a list of types). The schema field also contains a primaryKey field
if the (Multi)index is unique.
The second field, data, contains the serialized data with the records orient. The index is included, and any
datetimes are ISO 8601 formatted, as required by the Table Schema spec.
The full list of types supported are described in the Table Schema spec. This table shows the mapping from
pandas types:

Pandas type Table Schema type


int64 integer
float64 number
bool boolean
datetime64[ns] datetime
timedelta64[ns] duration
categorical any
object str

A few notes on the generated table schema:


• The schema object contains a pandas_version field. This contains the version of pandas’ dialect of
the schema, and will be incremented with each revision.
• All dates are converted to UTC when serializing; even timezone-naive values are treated as UTC with
an offset of 0.

In [270]: from pandas.io.json import build_table_schema

In [271]: s = pd.Series(pd.date_range('2016', periods=4))

In [272]: build_table_schema(s)
Out[272]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• datetimes with a timezone (before serializing) include an additional field tz with the time zone name
(e.g. 'US/Central').

In [273]: s_tz = pd.Series(pd.date_range('2016', periods=12,


.....: tz='US/Central'))
.....:

In [274]: build_table_schema(s_tz)
Out[274]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• Periods are converted to timestamps before serialization, and so have the same behavior of being con-
verted to UTC. In addition, periods will contain an additional field freq with the period’s frequency,
e.g. 'A-DEC'.

In [275]: s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',


.....: periods=4))
.....:

In [276]: build_table_schema(s_per)
Out[276]:
{'fields': [{'name': 'index', 'type': 'datetime', 'freq': 'A-DEC'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• Categoricals use the any type and an enum constraint listing the set of possible values. Additionally,
an ordered field is included:

In [277]: s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))

In [278]: build_table_schema(s_cat)
Out[278]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values',
'type': 'any',
'constraints': {'enum': ['a', 'b']},
'ordered': False}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• A primaryKey field, containing an array of labels, is included if the index is unique:

In [279]: s_dupe = pd.Series([1, 2], index=[1, 1])

In [280]: build_table_schema(s_dupe)
Out[280]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'pandas_version': '0.20.0'}

• The primaryKey behavior is the same with MultiIndexes, but in this case the primaryKey is an array:

In [281]: s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),


.....: (0, 1)]))
.....:

In [282]: build_table_schema(s_multi)
Out[282]:
{'fields': [{'name': 'level_0', 'type': 'string'},
{'name': 'level_1', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': FrozenList(['level_0', 'level_1']),
'pandas_version': '0.20.0'}

• The default naming roughly follows these rules:


– For Series, the object.name is used. If that is None, then the name is values
– For DataFrames, the stringified version of the column name is used
– For Index (not MultiIndex), index.name is used, with a fallback to index if that is None.
– For MultiIndex, mi.names is used. If any level has no name, then level_<i> is used.
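For example, the fallback naming for an unnamed index can be seen directly (a sketch):

build_table_schema(pd.DataFrame({'A': [1]}))
# {'fields': [{'name': 'index', 'type': 'integer'},
#             {'name': 'A', 'type': 'integer'}],
#  'primaryKey': ['index'], 'pandas_version': '0.20.0'}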
New in version 0.23.0.
read_json also accepts orient='table' as an argument. This allows for the preservation of metadata such
as dtypes and index names in a round-trippable manner.
In [283]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
.....: 'bar': ['a', 'b', 'c', 'd'],
.....: 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
.....: 'qux': pd.Categorical(['a', 'b', 'c', 'c'])
.....: }, index=pd.Index(range(4), name='idx'))
.....:

In [284]: df
Out[284]:
foo bar baz qux
idx


0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c

In [285]: df.dtypes
Out[285]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object

In [286]: df.to_json('test.json', orient='table')

In [287]: new_df = pd.read_json('test.json', orient='table')

In [288]: new_df
Out[288]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c

In [289]: new_df.dtypes
Out[289]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
Please note that the literal string ‘index’ as the name of an Index is not round-trippable, nor are any names
beginning with 'level_' within a MultiIndex. These are used by default in DataFrame.to_json() to
indicate missing index/level names, and the subsequent read cannot distinguish the intent.

In [290]: df.index.name = 'index'

In [291]: df.to_json('test.json', orient='table')

In [292]: new_df = pd.read_json('test.json', orient='table')

In [293]: print(new_df.index.name)
None

4.1.3 HTML


Reading HTML content

Warning: We highly encourage you to read the HTML Table Parsing gotchas below regarding the
issues surrounding the BeautifulSoup4/html5lib/lxml parsers.

The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into a
list of pandas DataFrames. Let’s look at a few examples.

Note: read_html returns a list of DataFrame objects, even if there is only a single table contained in the
HTML content.

Read a URL with no options:

In [294]: url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'

In [295]: dfs = pd.read_html(url)

In [296]: dfs
Out[296]:
[ Bank Name City ST CERT ␣
,→ Acquiring Institution Closing Date Updated Date
0 The Enloe State Bank Cooper TX 10716 ␣
,→ Legend Bank, N. A. May 31, 2019 August 22, 2019
1 Washington Federal Bank for Savings Chicago IL 30570 ␣
,→ Royal Savings Bank December 15, 2017 July 24, 2019
2 The Farmers and Merchants State Bank of Argonia Argonia KS 17719 ␣
,→ Conway Bank October 13, 2017 August 12, 2019
3 Fayette County Bank Saint Elmo IL 1802 ␣
,→ United Fidelity Bank, fsb May 26, 2017 January 29, 2019
4 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee WI 30003 First-
,→Citizens Bank & Trust Company May 5, 2017 March 22, 2018
.. ... ... .. ... ␣
,→ ... ... ...
551 Superior Bank, FSB Hinsdale IL 32646 ␣
,→ Superior Federal, FSB July 27, 2001 August 19, 2014
552 Malta National Bank Malta OH 6629 ␣
,→ North Valley Bank May 3, 2001 November 18, 2002
553 First Alliance Bank & Trust Co. Manchester NH 34264 Southern␣
,→New Hampshire Bank & Trust February 2, 2001 February 18, 2003
554 National State Bank of Metropolis Metropolis IL 3815 ␣
,→ Banterra Bank of Marion December 14, 2000 March 17, 2005
555 Bank of Honolulu Honolulu HI 21029 ␣
,→ Bank of the Orient October 13, 2000 March 17, 2005

[556 rows x 7 columns]]

Note: The data from the above URL changes every Monday so the resulting data above and the data
below may be slightly different.

Read in the content of the file from the above URL and pass it to read_html as a string:


In [297]: with open(file_path, 'r') as f:


.....: dfs = pd.read_html(f.read())
.....:

In [298]: dfs
Out[298]:
[ Bank Name City ST CERT ␣
,→Acquiring Institution Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386 ␣
,→North Shore Bank, FSB May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale AZ 34527 ␣
,→ Western State Bank May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta GA 58185 ␣
,→ Synovus Bank May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville NC 58701 ␣
,→ Capital Bank, N.A. May 10, 2013 May 14, 2013
4 Douglas County Bank Douglasville GA 21649 ␣
,→Hamilton State Bank April 26, 2013 May 16, 2013
.. ... ... .. ... ␣
,→ ... ... ...
500 Superior Bank, FSB Hinsdale IL 32646 ␣
,→Superior Federal, FSB July 27, 2001 June 5, 2012
501 Malta National Bank Malta OH 6629 ␣
,→ North Valley Bank May 3, 2001 November 18, 2002
502 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New␣
,→Hampshire Bank & Trust February 2, 2001 February 18, 2003
503 National State Bank of Metropolis Metropolis IL 3815 ␣
,→Banterra Bank of Marion December 14, 2000 March 17, 2005
504 Bank of Honolulu Honolulu HI 21029 ␣
,→ Bank of the Orient October 13, 2000 March 17, 2005

[505 rows x 7 columns]]

You can even pass in an instance of StringIO if you so desire:


In [299]: with open(file_path, 'r') as f:
.....: sio = StringIO(f.read())
.....:

In [300]: dfs = pd.read_html(sio)

In [301]: dfs
Out[301]:
[ Bank Name City ST CERT ␣
,→Acquiring Institution Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386 ␣
,→North Shore Bank, FSB May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale AZ 34527 ␣
,→ Western State Bank May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta GA 58185 ␣
,→ Synovus Bank May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville NC 58701 ␣
,→ Capital Bank, N.A. May 10, 2013 May 14, 2013
4 Douglas County Bank Douglasville GA 21649 ␣
,→Hamilton State Bank April 26, 2013 May 16, 2013
.. ... ... .. ... ␣
,→ ... ... ...
500 Superior Bank, FSB Hinsdale IL 32646 ␣
,→Superior Federal, FSB July 27, 2001 June 5, 2012
501 Malta National Bank Malta OH 6629 ␣
,→ North Valley Bank May 3, 2001 November 18, 2002
502 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New␣
,→Hampshire Bank & Trust February 2, 2001 February 18, 2003
503 National State Bank of Metropolis Metropolis IL 3815 ␣
,→Banterra Bank of Marion December 14, 2000 March 17, 2005
504 Bank of Honolulu Honolulu HI 21029 ␣
,→ Bank of the Orient October 13, 2000 March 17, 2005

[505 rows x 7 columns]]

Note: The following examples are not run by the IPython evaluator because so many network-accessing
functions would slow down the documentation build. If you spot an error or an example that doesn’t run,
please do not hesitate to report it over on pandas GitHub issues page.

Read a URL and match a table that contains specific text:

match = 'Metcalf Bank'


df_list = pd.read_html(url, match=match)

Specify a header row (by default <th> or <td> elements located within a <thead> are used to form the
column index, if multiple rows are contained within <thead> then a MultiIndex is created); if specified, the
header row is taken from the data minus the parsed header elements (<th> elements).

dfs = pd.read_html(url, header=0)

Specify an index column:

dfs = pd.read_html(url, index_col=0)

Specify a number of rows to skip:

dfs = pd.read_html(url, skiprows=0)

Specify a number of rows to skip using a list (xrange (Python 2 only) works as well):

dfs = pd.read_html(url, skiprows=range(2))

Specify an HTML attribute:

dfs1 = pd.read_html(url, attrs={'id': 'table'})


dfs2 = pd.read_html(url, attrs={'class': 'sortable'})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True

Specify values that should be converted to NaN:


dfs = pd.read_html(url, na_values=['No Acquirer'])

New in version 0.19.


Specify whether to keep the default set of NaN values:

dfs = pd.read_html(url, keep_default_na=False)

New in version 0.19.


Specify converters for columns. This is useful for numerical text data that has leading zeros. By default
columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can
convert these columns to strings.

url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0,
converters={'MNC': str})

New in version 0.19.


Use some combination of the above:

dfs = pd.read_html(url, match='Metcalf Bank', index_col=0)

Read in pandas to_html output (with some loss of floating point precision):

df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format='{0:.40g}'.format)
dfin = pd.read_html(s, index_col=0)

The lxml backend will raise an error on a failed parse if that is the only parser you provide. If you only have
a single parser you can provide just a string, but it is considered good practice to pass a list with one string
if, for example, the function expects a sequence of strings. You may use:

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml'])

Or you could pass flavor='lxml' without a list:

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor='lxml')

However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'] then the parse will most
likely succeed. Note that as soon as a parse succeeds, the function will return.

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml', 'bs4'])

Writing to HTML files

DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an
HTML table. The function arguments are as in the method to_string described above.

Note: Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See
to_html() for the full set of options.


In [302]: df = pd.DataFrame(np.random.randn(2, 2))

In [303]: df
Out[303]:
0 1
0 -1.050304 1.131622
1 -0.692581 -1.174172

In [304]: print(df.to_html()) # raw html


<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.050304</td>
<td>1.131622</td>
</tr>
<tr>
<th>1</th>
<td>-0.692581</td>
<td>-1.174172</td>
</tr>
</tbody>
</table>
The columns argument will limit the columns shown:

In [305]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.050304</td>
</tr>
<tr>
<th>1</th>
<td>-0.692581</td>
</tr>
</tbody>
</table>


float_format takes a Python callable to control the precision of floating point values:

In [306]: print(df.to_html(float_format='{0:.10f}'.format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.0503044154</td>
<td>1.1316218324</td>
</tr>
<tr>
<th>1</th>
<td>-0.6925807265</td>
<td>-1.1741715747</td>
</tr>
</tbody>
</table>

bold_rows will make the row labels bold by default, but you can turn that off:

In [307]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-1.050304</td>
<td>1.131622</td>
</tr>
<tr>
<td>1</td>
<td>-0.692581</td>
<td>-1.174172</td>
</tr>
</tbody>
</table>

The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these


classes are appended to the existing 'dataframe' class.


In [308]: print(df.to_html(classes=['awesome_table_class', 'even_more_awesome_class']))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.050304</td>
<td>1.131622</td>
</tr>
<tr>
<th>1</th>
<td>-0.692581</td>
<td>-1.174172</td>
</tr>
</tbody>
</table>

The render_links argument provides the ability to add hyperlinks to cells that contain URLs.
New in version 0.24.
In [309]: url_df = pd.DataFrame({
.....: 'name': ['Python', 'Pandas'],
.....: 'url': ['https://www.python.org/', 'http://pandas.pydata.org']})
.....:

In [310]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Python</td>
<td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></
,→td>

</tr>
<tr>
<th>1</th>
<td>Pandas</td>
<td><a href="http://pandas.pydata.org" target="_blank">http://pandas.pydata.org</a>
,→</td>
</tr>
</tbody>
</table>

Finally, the escape argument allows you to control whether the “<”, “>” and “&” characters are escaped in
the resulting HTML (by default it is True). To get the HTML without escaped characters, pass escape=False:

In [311]: df = pd.DataFrame({'a': list('&<>'), 'b': np.random.randn(3)})

Escaped:

In [312]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&amp;</td>
<td>1.254800</td>
</tr>
<tr>
<th>1</th>
<td>&lt;</td>
<td>1.131996</td>
</tr>
<tr>
<th>2</th>
<td>&gt;</td>
<td>-1.311021</td>
</tr>
</tbody>
</table>

Not escaped:

In [313]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>1.254800</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>1.131996</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-1.311021</td>
</tr>
</tbody>
</table>

Note: Some browsers may not show a difference in the rendering of the previous two HTML tables.

HTML Table Parsing Gotchas

There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level
pandas io function read_html.
Issues with lxml
• Benefits
– lxml is very fast.
• Drawbacks
– lxml requires Cython to install correctly.
– lxml does not make any guarantees about the results of its parse unless it is given strictly valid
markup.
– In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this
backend will use html5lib if lxml fails to parse.
– It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that
you will still get a valid result (provided everything else is valid) even if lxml fails.
Issues with BeautifulSoup4 using lxml as a backend
• The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser
backend.
Issues with BeautifulSoup4 using html5lib as a backend
• Benefits
– html5lib is far more lenient than lxml and consequently deals with real-life markup in a much
saner way rather than just, e.g., dropping an element without notifying you.


– html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely
important for parsing HTML tables, since it guarantees a valid document. However, that does
NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.
– html5lib is pure Python and requires no additional build steps beyond its own installation.
• Drawbacks
– The biggest drawback to using html5lib is that it is slow as molasses. However consider the fact
that many tables on the web are not big enough for the parsing algorithm runtime to matter. It
is more likely that the bottleneck will be in the process of reading the raw text from the URL
over the web, i.e., IO (input-output). For very large tables, this might not be true.

4.1.4 Excel files

The read_excel() method can read Excel 2003 (.xls) files using the xlrd Python module. Excel 2007+
(.xlsx) files can be read using either xlrd or openpyxl. The to_excel() instance method is used for saving
a DataFrame to Excel. Generally the semantics are similar to working with csv data. See the cookbook for
some advanced strategies.

Reading Excel files

In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which
sheet to parse.

# Returns a DataFrame
pd.read_excel('path_to_file.xls', sheet_name='Sheet1')

ExcelFile class

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the
file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets as
the file is read into memory only once.

xlsx = pd.ExcelFile('path_to_file.xls')
df = pd.read_excel(xlsx, 'Sheet1')

The ExcelFile class can also be used as a context manager.

with pd.ExcelFile('path_to_file.xls') as xls:


df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')

The sheet_names property will generate a list of the sheet names in the file.
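For example (a sketch, assuming the same illustrative file):

with pd.ExcelFile('path_to_file.xls') as xls:
    print(xls.sheet_names)  # e.g. ['Sheet1', 'Sheet2']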
The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile('path_to_file.xls') as xls:
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
na_values=['NA'])
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)


Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed
to read_excel with no loss in performance.

# using the ExcelFile class


data = {}
with pd.ExcelFile('path_to_file.xls') as xls:
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
na_values=['NA'])
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=None,
na_values=['NA'])

# equivalent using the read_excel function


data = pd.read_excel('path_to_file.xls', ['Sheet1', 'Sheet2'],
index_col=None, na_values=['NA'])

ExcelFile can also be called with an xlrd.book.Book object as a parameter. This allows the user to control
how the Excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook()
with on_demand=True.

import xlrd
xlrd_book = xlrd.open_workbook('path_to_file.xls', on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')

Specifying sheets

Note: The second argument is sheet_name, not to be confused with ExcelFile.sheet_names.

Note: An ExcelFile’s attribute sheet_names provides access to a list of sheets.

• The argument sheet_name allows specifying the sheet or sheets to read.
• The default value for sheet_name is 0, indicating to read the first sheet.
• Pass a string to refer to the name of a particular sheet in the workbook.
• Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0.
• Pass a list of either strings or integers, to return a dictionary of specified sheets.
• Pass None to return a dictionary of all available sheets.

# Returns a DataFrame
pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

Using the sheet index:

# Returns a DataFrame
pd.read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])

Using all default values:


# Returns a DataFrame
pd.read_excel('path_to_file.xls')

Using None to get all sheets:

# Returns a dictionary of DataFrames


pd.read_excel('path_to_file.xls', sheet_name=None)

Using a list to get multiple sheets:

# Returns the 1st and 4th sheet, as a dictionary of DataFrames.


pd.read_excel('path_to_file.xls', sheet_name=['Sheet1', 3])

read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of
sheet positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an
integer or string, respectively.

Reading a MultiIndex

read_excel can read a MultiIndex index, by passing a list of columns to index_col and a MultiIndex
column by passing a list of rows to header. If either the index or columns have serialized level names those
will be read in as well by specifying the rows/columns that make up the levels.
For example, to read in a MultiIndex index without names:

In [314]: df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]},


.....: index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
.....:

In [315]: df.to_excel('path_to_file.xlsx')

In [316]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])

In [317]: df
Out[317]:
a b
a c 1 5
d 2 6
b c 3 7
d 4 8

If the index has level names, they will be parsed as well, using the same parameters.

In [318]: df.index = df.index.set_names(['lvl1', 'lvl2'])

In [319]: df.to_excel('path_to_file.xlsx')

In [320]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])

In [321]: df
Out[321]:
a b
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8

If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col
and header:

In [322]: df.columns = pd.MultiIndex.from_product([['a'], ['b', 'd']],


.....: names=['c1', 'c2'])
.....:

In [323]: df.to_excel('path_to_file.xlsx')

In [324]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1], header=[0, 1])

In [325]: df
Out[325]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8

Parsing specific columns

It is often the case that users will insert columns to do temporary computations in Excel and you may not
want to read in those columns. read_excel takes a usecols keyword to allow you to specify a subset of
columns to parse.
Deprecated since version 0.24.0.
Passing in an integer for usecols has been deprecated. Please pass in a list of ints from 0 to usecols
inclusive instead.
If usecols is an integer, then it is assumed to indicate the last column to be parsed.

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=2)

You can also specify a comma-delimited set of Excel columns and ranges as a string:

pd.read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')

If usecols is a list of integers, then it is assumed to be the file column indices to be parsed.

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])

Element order is ignored, so usecols=[0, 1] is the same as [1, 0].


New in version 0.24.


If usecols is a list of strings, it is assumed that each string corresponds to a column name provided either
by the user in names or inferred from the document header row(s). Those strings define which columns will
be parsed:

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])

Element order is ignored, so usecols=['baz', 'joe'] is the same as ['joe', 'baz'].


New in version 0.24.
If usecols is callable, the callable function will be evaluated against the column names, returning names
where the callable function evaluates to True.

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())

Parsing dates

Datetime-like values are normally automatically converted to the appropriate dtype when reading the Excel
file. But if you have a column of strings that look like dates (but are not actually formatted as dates in
Excel), you can use the parse_dates keyword to parse those strings to datetimes:

pd.read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])

Cell converters

It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a
column to boolean:

pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})

This option handles missing values and treats exceptions in the converters as missing data. Transformations
are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For
instance, a column of integers with missing values cannot be transformed to an array with integer dtype,
because NaN is strictly a float. You can manually mask missing data to recover integer dtype:

def cfun(x):
return int(x) if x else -1

pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})

Dtype specifications

New in version 0.20.


As an alternative to converters, the type for an entire column can be specified using the dtype keyword,
which takes a dictionary mapping column names to types. To interpret data with no type inference, use the
type str or object.

pd.read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})


Writing Excel files

Writing Excel files to disk

To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The
arguments are largely the same as to_csv described above, the first argument being the name of the excel
file, and the optional second argument the name of the sheet to which the DataFrame should be written. For
example:

df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')

Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written
using xlsxwriter (if available) or openpyxl.
The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be
placed in the second row instead of the first. You can place it in the first row by setting the merge_cells
option in to_excel() to False:

df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False)

In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.

with pd.ExcelWriter('path_to_file.xlsx') as writer:


df1.to_excel(writer, sheet_name='Sheet1')
df2.to_excel(writer, sheet_name='Sheet2')

Note: Wringing a little more performance out of read_excel: internally, Excel stores all numeric data as
floats. Because this can produce unexpected behavior when reading in data, pandas defaults to trying to
convert integers to floats if it doesn’t lose information (1.0 --> 1). You can pass convert_float=False to
disable this behavior, which may give a slight performance improvement.
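A sketch of disabling the conversion:

pd.read_excel('path_to_file.xls', 'Sheet1', convert_float=False)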

Writing Excel files to memory

Pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.

# Safe import for either Python 2.x or 3.x


try:
from io import BytesIO
except ImportError:
from cStringIO import StringIO as BytesIO

bio = BytesIO()

# By setting the 'engine' in the ExcelWriter constructor.


writer = pd.ExcelWriter(bio, engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')

# Save the workbook


writer.save()

# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()

Note: engine is optional but recommended. Setting the engine determines the version of workbook pro-
duced. Setting engine='xlwt' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl'
or 'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted
workbook is produced.

Excel writer engines

Pandas chooses an Excel writer via two methods:


1. the engine keyword argument
2. the filename extension (via the default specified in config options)
By default, pandas uses the XlsxWriter for .xlsx, openpyxl for .xlsm, and xlwt for .xls files. If you have
multiple engines installed, you can set the default engine through setting the config options io.excel.xlsx.
writer and io.excel.xls.writer. pandas will fall back on openpyxl for .xlsx files if Xlsxwriter is not
available.
To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to
ExcelWriter. The built-in engines are:
• openpyxl: version 2.4 or higher is required
• xlsxwriter
• xlwt

# By setting the 'engine' in the DataFrame 'to_excel()' methods.


df.to_excel('path_to_file.xlsx', sheet_name='Sheet1', engine='xlsxwriter')

# By setting the 'engine' in the ExcelWriter constructor.


writer = pd.ExcelWriter('path_to_file.xlsx', engine='xlsxwriter')

# Or via pandas configuration.


from pandas import options # noqa: E402
options.io.excel.xlsx.writer = 'xlsxwriter'

df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')

Style and formatting

The look and feel of Excel worksheets created from pandas can be modified using the following parameters
on the DataFrame’s to_excel method.
• float_format : Format string for floating point numbers (default None).
• freeze_panes : A tuple of two integers representing the bottommost row and rightmost column to
freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default
None).
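For example, a sketch that freezes the header row and the index column while formatting floats:

df.to_excel('path_to_file.xlsx', float_format='%.2f', freeze_panes=(1, 1))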


Using the Xlsxwriter engine provides many options for controlling the format of an Excel worksheet created
with the to_excel method. Excellent examples can be found in the Xlsxwriter documentation here:
https://xlsxwriter.readthedocs.io/working_with_pandas.html

4.1.5 OpenDocument Spreadsheets

New in version 0.25.


The read_excel() method can also read OpenDocument spreadsheets using the odfpy module. The se-
mantics and features for reading OpenDocument spreadsheets match what can be done for Excel files using
engine='odf'.

# Returns a DataFrame
pd.read_excel('path_to_file.ods', engine='odf')

Note: Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.

4.1.6 Clipboard

A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard
buffer and passes them to the read_csv method. For instance, you can copy the following text to the
clipboard (CTRL-C on many operating systems):

A B C
x 1 4 p
y 2 5 q
z 3 6 r

And then import the data directly to a DataFrame by calling:

>>> clipdf = pd.read_clipboard()
>>> clipdf
A B C
x 1 4 p
y 2 5 q
z 3 6 r

The to_clipboard method can be used to write the contents of a DataFrame to the clipboard, after which
you can paste the clipboard contents into other applications (CTRL-V on many operating systems).
Here we illustrate writing a DataFrame to the clipboard and reading it back.

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [4, 5, 6],
...                    'C': ['p', 'q', 'r']},
...                   index=['x', 'y', 'z'])
>>> df
   A  B  C
x  1  4  p
y  2  5  q
z  3  6  r
>>> df.to_clipboard()
>>> pd.read_clipboard()
   A  B  C
x  1  4  p
y  2  5  q
z  3  6  r

We can see that we get back the same content we earlier wrote to the clipboard.

Note: You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.

4.1.7 Pickling

All pandas objects are equipped with to_pickle methods which use Python's pickle module to save data
structures to disk using the pickle format.

In [326]: df
Out[326]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8

In [327]: df.to_pickle('foo.pkl')

The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any
other pickled object) from file:

In [328]: pd.read_pickle('foo.pkl')
Out[328]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8

Warning: Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html

Warning: read_pickle() is only guaranteed to be backwards compatible to pandas version 0.20.3


Compressed pickle files

New in version 0.20.0.


read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can read and write compressed pickle
files. The compression types of gzip, bz2, xz are supported for reading and writing. The zip file format
only supports reading and must contain only one data file to be read.
The compression type can be an explicit parameter or be inferred from the file extension. If ‘infer’, then use
gzip, bz2, zip, or xz if filename ends in '.gz', '.bz2', '.zip', or '.xz', respectively.

In [329]: df = pd.DataFrame({
.....: 'A': np.random.randn(1000),
.....: 'B': 'foo',
.....: 'C': pd.date_range('20130101', periods=1000, freq='s')})
.....:

In [330]: df
Out[330]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

Using an explicit compression type:

In [331]: df.to_pickle("data.pkl.compress", compression="gzip")

In [332]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")

In [333]: rt
Out[333]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

Inferring compression type from the extension:

In [334]: df.to_pickle("data.pkl.xz", compression="infer")

In [335]: rt = pd.read_pickle("data.pkl.xz", compression="infer")

In [336]: rt
Out[336]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

The default is to ‘infer’:

In [337]: df.to_pickle("data.pkl.gz")

In [338]: rt = pd.read_pickle("data.pkl.gz")

In [339]: rt
Out[339]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

In [340]: df["A"].to_pickle("s1.pkl.bz2")

In [341]: rt = pd.read_pickle("s1.pkl.bz2")

In [342]: rt
Out[342]:
0 -0.053113
1 0.348832
2 -0.162729
3 -1.269943
4 -0.481824
...
995 -1.001718
996 -0.471336
997 -0.071712
998 0.578273
999 0.595708
Name: A, Length: 1000, dtype: float64
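As noted above, the zip format is read-only for pickles; a minimal sketch of producing a zipped pickle externally and reading it back (file names are illustrative):

import zipfile

df.to_pickle('data.pkl')  # write an uncompressed pickle first

# the archive must contain exactly one data file to be readable
with zipfile.ZipFile('data.pkl.zip', 'w') as zf:
    zf.write('data.pkl')

rt = pd.read_pickle('data.pkl.zip')  # compression inferred from '.zip'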

4.1.8 msgpack

pandas supports the msgpack format for object serialization. This is a lightweight, portable binary format,
similar to binary JSON, that is highly space-efficient and provides good performance for both writing
(serialization) and reading (deserialization).

Warning: The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is
recommended to use pyarrow for on-the-wire transmission of pandas objects.
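
As a rough sketch of that alternative using pyarrow's streaming IPC format (a reasonably recent pyarrow is assumed to be installed; variable names are illustrative):

import pyarrow as pa

# serialize a DataFrame to an in-memory Arrow IPC stream
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()
buf = sink.getvalue()  # a pyarrow.Buffer suitable for sending over the wire

# deserialize back to a DataFrame on the receiving side
df_roundtrip = pa.ipc.open_stream(buf).read_pandas()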

Warning: read_msgpack() is only guaranteed to be backwards compatible to pandas version 0.20.3

In [343]: df = pd.DataFrame(np.random.rand(5, 2), columns=list('AB'))

In [344]: df.to_msgpack('foo.msg')

In [345]: pd.read_msgpack('foo.msg')
Out[345]:
A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928

In [346]: s = pd.Series(np.random.rand(5), index=pd.date_range('20130101', periods=5))

You can pass a list of objects and you will receive them back on deserialization.

In [347]: pd.to_msgpack('foo.msg', df, 'foo', np.array([1, 2, 3]), s)

In [348]: pd.read_msgpack('foo.msg')
Out[348]:
[ A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928, 'foo', array([1, 2, 3]), 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64]

You can pass iterator=True to iterate over the unpacked results:

In [349]: for o in pd.read_msgpack('foo.msg', iterator=True):
   .....:     print(o)
   .....:
A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928
foo
[1 2 3]
2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64

You can pass append=True to the writer to append to an existing pack:

In [350]: df.to_msgpack('foo.msg', append=True)

In [351]: pd.read_msgpack('foo.msg')
Out[351]:
[ A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928, 'foo', array([1, 2, 3]), 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64, A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928]

Unlike other IO methods, to_msgpack is available both on a per-object basis, df.to_msgpack(), and as the
top-level pd.to_msgpack(...), with which you can pack arbitrary collections of Python lists, dicts, and
scalars while intermixing pandas objects.

In [352]: pd.to_msgpack('foo2.msg', {'dict': [{'df': df}, {'string': 'foo'},
   .....:                                     {'scalar': 1.}, {'s': s}]})
   .....:

In [353]: pd.read_msgpack('foo2.msg')
Out[353]:
{'dict': ({'df': A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928},
{'string': 'foo'},
{'scalar': 1.0},
{'s': 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64})}

Read/write API

Msgpacks can also be read from and written to in-memory byte strings.

In [354]: df.to_msgpack()
Out[354]: b"\x84\xa3typ\xadblock_manager\xa5klass\xa9DataFrame\xa4axes\x92\x86\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype..."

Furthermore, you can concatenate the strings to produce a list of the original objects.

In [355]: pd.read_msgpack(df.to_msgpack() + s.to_msgpack())
Out[355]:
[ A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928, 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64]

4.1.9 HDF5 (PyTables)

HDFStore is a dict-like object which reads and writes pandas objects using the high-performance HDF5
format via the excellent PyTables library. See the cookbook for some advanced strategies.

Warning: pandas requires PyTables >= 3.0.0. There is an indexing bug in PyTables < 3.2 which may
appear when querying stores using an index. If you see a subset of results being returned, upgrade to
PyTables >= 3.2. Stores created previously will need to be rewritten using the updated version.

In [356]: store = pd.HDFStore('store.h5')

In [357]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Objects can be written to the file just like adding key-value pairs to a dict:

In [358]: index = pd.date_range('1/1/2000', periods=8)

In [359]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [360]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   .....:                   columns=['A', 'B', 'C'])
   .....:

# store.put('s', s) is an equivalent method
In [361]: store['s'] = s

In [362]: store['df'] = df

In [363]: store
Out[363]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

In a current or later Python session, you can retrieve stored objects:


# store.get('df') is an equivalent method
In [364]: store['df']
Out[364]:
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640

# dotted (attribute) access provides get as well
In [365]: store.df
Out[365]:
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640

Deletion of the object specified by the key:

# store.remove('df') is an equivalent method
In [366]: del store['df']

In [367]: store
Out[367]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Closing a Store and using a context manager:


In [368]: store.close()

In [369]: store
Out[369]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

In [370]: store.is_open
Out[370]: False

# Working with, and automatically closing the store using a context manager
In [371]: with pd.HDFStore('store.h5') as store:
.....: store.keys()
.....:

Read/write API

HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how
read_csv and to_csv work.

In [372]: df_tl = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})

In [373]: df_tl.to_hdf('store_tl.h5', 'table', append=True)

In [374]: pd.read_hdf('store_tl.h5', 'table', where=['index>2'])
Out[374]:
   A  B
3  3  3
4  4  4

HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting
dropna=True.

In [375]: df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
.....: 'col2': [1, np.nan, np.nan]})
.....:

In [376]: df_with_missing
Out[376]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN

In [377]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
   .....:                        format='table', mode='w')
   .....:

In [378]: pd.read_hdf('file.h5', 'df_with_missing')
Out[378]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN

In [379]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
   .....:                        format='table', mode='w', dropna=True)
   .....:

In [380]: pd.read_hdf('file.h5', 'df_with_missing')
Out[380]:
col1 col2
0 0.0 1.0
2 2.0 NaN

Fixed format

The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format,
called the fixed format. These types of stores are not appendable once written (though you can simply
remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do
not support dataframes with non-unique column names. The fixed format stores offer very fast writing and
slightly faster reading than table stores. This format is specified by default when using put or to_hdf or
by format='fixed' or format='f'.
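For instance, a minimal sketch of requesting it explicitly (the file name is illustrative):

# 'fixed' is also the default when using put or to_hdf
df.to_hdf('store_fixed.h5', 'df', format='fixed')

# fixed stores are not queryable and must be read back in their entirety
pd.read_hdf('store_fixed.h5', 'df')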


Warning: A fixed format will raise a TypeError if you try to retrieve using a where:
>>> pd.DataFrame(np.random.randn(10, 2)).to_hdf('test_fixed.h5', 'df')
>>> pd.read_hdf('test_fixed.h5', 'df', where='index>5')
TypeError: cannot pass a where specification when reading a fixed format.
this store must be selected in its entirety

Table format

HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very
much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions.
In addition, delete and query type operations are supported. This format is specified by format='table'
or format='t' to append or put or to_hdf.
This format can also be set as an option, pd.set_option('io.hdf.default_format', 'table'), to make
put/append/to_hdf store in the table format by default.
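For example, run once per session before writing:

# make the table format the default for put/append/to_hdf
pd.set_option('io.hdf.default_format', 'table')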
In [381]: store = pd.HDFStore('store.h5')

In [382]: df1 = df[0:4]

In [383]: df2 = df[4:]

# append data (creates a table automatically)
In [384]: store.append('df', df1)

In [385]: store.append('df', df2)

In [386]: store
Out[386]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# select the entire object
In [387]: store.select('df')
Out[387]:
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640

# the type of stored data
In [388]: store.root.df._v_attrs.pandas_type
Out[388]: 'frame_table'

Note: You can also create a table by passing format='table' or format='t' to a put operation.


Hierarchical keys

Keys to a store can be specified as a string. These can be in a hierarchical path-name like format (e.g.
foo/bar/bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be
specified without the leading '/' and are always absolute (e.g. 'foo' refers to '/foo'). Removal operations
can remove everything in the sub-store and below, so be careful.
In [389]: store.put('foo/bar/bah', df)

In [390]: store.append('food/orange', df)

In [391]: store.append('food/apple', df)

In [392]: store
Out[392]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# a list of keys is returned
In [393]: store.keys()
Out[393]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah']

# remove all nodes under this level
In [394]: store.remove('food')

In [395]: store
Out[395]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

You can walk through the group hierarchy using the walk method, which will yield a tuple for each group
key along with the relative keys of its contents.
New in version 0.24.0.

In [396]: for (path, subgroups, subkeys) in store.walk():
   .....:     for subgroup in subgroups:
   .....:         print('GROUP: {}/{}'.format(path, subgroup))
   .....:     for subkey in subkeys:
   .....:         key = '/'.join([path, subkey])
   .....:         print('KEY: {}'.format(key))
   .....:         print(store.get(key))
   .....:
GROUP: /foo
KEY: /df
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640
GROUP: /foo/bar
KEY: /foo/bar/bah
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640

Warning: Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for
items stored under the root node.
In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'

# you can directly access the actual PyTables node by using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
  children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]

Instead, use explicit string-based keys:


In [397]: store['foo/bar/bah']
Out[397]:
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640

Storing types

Storing mixed types in a table

Storing mixed-dtype data is supported. Strings are stored as a fixed width using the maximum size of the
appended column. Subsequent attempts at appending longer strings will raise a ValueError.
Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the
string columns. Storing floats, strings, ints, bools, and datetime64 is currently supported. For string
columns, passing nan_rep='nan' to append will change the default NaN representation on disk (which
converts to/from np.nan); this defaults to nan.
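A minimal sketch of nan_rep, assuming store is an open HDFStore and df has string columns; the key name and representation are illustrative:

# write NaN in string columns as the literal text 'missing' on disk;
# values are converted back to np.nan on read
store.append('df_nan', df, nan_rep='missing')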
In [398]: df_mixed = pd.DataFrame({'A': np.random.randn(8),
   .....:                          'B': np.random.randn(8),
   .....:                          'C': np.array(np.random.randn(8), dtype='float32'),
   .....:                          'string': 'string',
   .....:                          'int': 1,
   .....:                          'bool': True,
   .....:                          'datetime64': pd.Timestamp('20010102')},
   .....:                         index=list(range(8)))
   .....:

In [399]: df_mixed.loc[df_mixed.index[3:5],
.....: ['A', 'B', 'string', 'datetime64']] = np.nan
.....:

In [400]: store.append('df_mixed', df_mixed, min_itemsize={'values': 50})

In [401]: df_mixed1 = store.select('df_mixed')

In [402]: df_mixed1
Out[402]:
A B C string int bool datetime64
0 0.894171 -1.452159 -0.105646 string 1 True 2001-01-02
1 -1.539066 1.018959 0.028593 string 1 True 2001-01-02
2 -0.114019 -0.087476 0.693070 string 1 True 2001-01-02
3 NaN NaN -0.646571 NaN 1 True NaT
4 NaN NaN -0.174558 NaN 1 True NaT
5 -2.110838 -1.234633 -1.257271 string 1 True 2001-01-02
6 -0.704558 0.463419 -0.917264 string 1 True 2001-01-02
7 -0.929182 0.841053 0.414183 string 1 True 2001-01-02

In [403]: df_mixed1.dtypes.value_counts()
Out[403]:
float64 2
datetime64[ns] 1
int64 1
object 1
bool 1
float32 1
dtype: int64

# we have provided a minimum string column size
In [404]: store.root.df_mixed.table
Out[404]:
/df_mixed/table (Table(8,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos