
pandas: powerful Python data analysis toolkit
Release 0.24.1

Wes McKinney & PyData Development Team

Feb 03, 2019

pandas: powerful Python data analysis toolkit, Release 0.24.1

Date: Feb 03, 2019 Version: 0.24.1

Download documentation: PDF Version | Zipped HTML
Useful links: Binary Installers | Source Repository | Issues & Ideas | Q&A Support | Mailing List
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data
analysis tools for the Python programming language.
See the Package overview for more detail about what’s in the library.

CHAPTER ONE

WHAT’S NEW IN 0.24.1 (FEBRUARY 3, 2019)

Warning: The 0.24.x series of releases will be the last to support Python 2. Future feature releases will support
Python 3 only. See Plan for dropping Python 2.7 for more.

These are the changes in pandas 0.24.1. See Release Notes for a full changelog including other versions of pandas.
See What’s New in 0.24.0 (January 25, 2019) for the 0.24.0 changelog.

1.1 API Changes

1.1.1 Changing the sort parameter for Index set operations

The default sort value for Index.union() has changed from True to None (GH24959). The default behavior,
however, remains the same: the result is sorted, unless
1. self and other are identical
2. self or other is empty
3. self or other contains values that cannot be compared (a RuntimeWarning is raised).
This change will allow sort=True to mean “always sort” in a future release.
The same change applies to Index.difference() and Index.symmetric_difference(), which would
not sort the result when the values could not be compared.
The sort option for Index.intersection() has changed in three ways.
1. The default has changed from True to False, to restore the pandas 0.23.4 and earlier behavior of not sorting
by default.
2. The behavior of sort=True can now be obtained with sort=None. This will sort the result only if the values
in self and other are not identical.
3. The value sort=True is no longer allowed. A future version of pandas will properly support sort=True
meaning “always sort”.
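The sort options described above can be sketched as follows. This is a minimal illustration, assuming small integer indexes invented for the example; it uses only the default (sort=None for union) and the sort=False / sort=None values that the text describes as available.

```python
import pandas as pd

left = pd.Index([3, 1, 2])
right = pd.Index([5, 4])

# Default (sort=None): the union is sorted because the values are comparable
# and neither index is empty or identical to the other.
union_default = left.union(right)

# sort=False: no sorting is attempted; values from `left` come first.
union_unsorted = left.union(right, sort=False)

# intersection defaults to sort=False, so the order of `left` is kept.
common_default = left.intersection(pd.Index([2, 3]))

# sort=None asks for the result to be sorted where possible.
common_sorted = left.intersection(pd.Index([2, 3]), sort=None)

print(list(union_default))
print(list(union_unsorted))
print(list(common_default))
print(list(common_sorted))
```

Passing sort=True is deliberately left out, since the text states that it is not allowed for Index.intersection() in this release.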

1.2 Fixed Regressions

• Fixed regression in DataFrame.to_dict() with records orient raising an AttributeError when
the DataFrame contained more than 255 columns, or wrongly converting column names that were not valid
python identifiers (GH24939, GH24940).

• Fixed regression in read_sql() when passing certain queries with MySQL/pymysql (GH24988).
• Fixed regression in Index.intersection incorrectly sorting the values by default (GH24959).
• Fixed regression in merge() when merging an empty DataFrame with multiple timezone-aware columns on
one of the timezone-aware columns (GH25014).
• Fixed regression in Series.rename_axis() and DataFrame.rename_axis() where passing None
failed to remove the axis name (GH25034)
• Fixed regression in to_timedelta() with box=False incorrectly returning a datetime64 object instead
of a timedelta64 object (GH24961)
• Fixed regression where custom hashable types could not be used as column keys in DataFrame.
set_index() (GH24969)
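Two of the fixed behaviors listed above can be checked with a short sketch. The frame and its column names ("my col", "2nd") are made up for illustration; the example only shows the expected post-fix behavior and does not reproduce the original bug reports.

```python
import pandas as pd

# Column names that are not valid Python identifiers should be kept as-is
# when converting to a list of records.
df = pd.DataFrame({"my col": [1, 2], "2nd": [3, 4]})
records = df.to_dict(orient="records")
print(records)

# Passing None to rename_axis removes the axis name again.
named = df.rename_axis("row_id")
unnamed = named.rename_axis(None)
print(named.index.name, unnamed.index.name)
```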

1.3 Bug Fixes

Reshaping
• Bug in DataFrame.groupby() with Grouper when there is a time change (DST) and grouping frequency
is '1d' (GH24972)
Visualization
• Fixed the warning for implicitly registered matplotlib converters not showing. See Restore Matplotlib datetime
Converter Registration for more (GH24963).
Other
• Fixed AttributeError when printing a DataFrame’s HTML repr after accessing the IPython config object
(GH25036)

1.4 Contributors

A total of 4 people contributed patches to this release. People with a “+” by their names contributed a patch for the
first time.
• Joris Van den Bossche
• MeeseeksMachine +
• Roman Yurchak
• Tom Augspurger


CHAPTER TWO

INSTALLATION

The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross-platform distribution for
data analysis and scientific computing. This is the recommended installation method for most users.
Instructions for installing from source, PyPI, ActivePython, various Linux distributions, or a development version are
also provided.

2.1 Plan for dropping Python 2.7

The Python core team plans to stop supporting Python 2.7 on January 1st, 2020. In line with NumPy’s plans, all
pandas releases through December 31, 2018 will support Python 2.
The 0.24.x feature release will be the last release to support Python 2. The released package will continue to be
available on PyPI and through conda.
Starting January 1, 2019, all new feature releases (> 0.24) will be Python 3 only.
If there are people interested in continued support for Python 2.7 past December 31, 2018 (either backporting bug
fixes or funding) please reach out to the maintainers on the issue tracker.
For more information, see the Python 3 statement and the Porting to Python 3 guide.

2.2 Python version support

pandas officially supports Python 2.7, 3.5, 3.6, and 3.7.

2.3 Installing pandas

2.3.1 Installing with Anaconda

Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make up the SciPy
stack (IPython, NumPy, Matplotlib, . . . ) is with Anaconda, a cross-platform (Linux, Mac OS X, Windows) Python
distribution for data analytics and scientific computing.
After running the installer, the user will have access to pandas and the rest of the SciPy stack without needing to install
anything else, and without needing to wait for any software to be compiled.
Installation instructions for Anaconda can be found here.

A full list of the packages available as part of the Anaconda distribution can be found here.
Another advantage of installing Anaconda is that you don’t need admin rights to install it. Anaconda can install in the
user’s home directory, which makes it trivial to remove later if you decide to (just delete that folder).

2.3.2 Installing with Miniconda

The previous section outlined how to get pandas installed as part of the Anaconda distribution. However, this approach
means you will install well over one hundred packages and involves downloading an installer that is a few hundred
megabytes in size.
If you want more control over which packages are installed, or have limited internet bandwidth, then installing pandas
with Miniconda may be a better solution.
Conda is the package manager that the Anaconda distribution is built upon. It is a package manager that is both
cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination).
Miniconda allows you to create a minimal self-contained Python installation, and then use the Conda command to
install additional packages.
First you will need Conda to be installed; downloading and running the Miniconda installer will do this for you. The
installer can be found here.
The next step is to create a new conda environment. A conda environment is like a virtualenv that allows you to specify
a specific version of Python and set of libraries. Run the following commands from a terminal window:

conda create -n name_of_my_env python

This will create a minimal environment with only Python installed in it. To put yourself inside this environment, run:

source activate name_of_my_env

On Windows the command is:

activate name_of_my_env

The final step required is to install pandas. This can be done with the following command:

conda install pandas

To install a specific pandas version:

conda install pandas=0.20.3

To install other packages, IPython for example:

conda install ipython

To install the full Anaconda distribution:

conda install anaconda

If you need packages that are available to pip but not conda, then install pip, and then use pip to install those packages:

conda install pip

pip install django


2.3.3 Installing from PyPI

pandas can be installed via pip from PyPI.

pip install pandas
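Whichever installation method you used, a quick way to confirm that pandas is importable is to print its version from Python. This is a minimal sketch; the version string you see depends on your environment.

```python
import pandas as pd

# Report the installed version; the exact string depends on your environment.
print(pd.__version__)
```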

2.3.4 Installing with ActivePython

Installation instructions for ActivePython can be found here. Versions 2.7 and 3.5 include pandas.

2.3.5 Installing using your Linux distribution’s package manager

The commands in this table will install pandas for Python 3 from your distribution. To install pandas for Python 2,
you may need to use the python-pandas package.

Distribution      Status                       Download / Repository Link   Install method
Debian            stable                       official Debian repository   sudo apt-get install python3-pandas
Debian & Ubuntu   unstable (latest packages)   NeuroDebian                  sudo apt-get install python3-pandas
Ubuntu            stable                       official Ubuntu repository   sudo apt-get install python3-pandas
OpenSuse          stable                       OpenSuse Repository          zypper in python3-pandas
Fedora            stable                       official Fedora repository   dnf install python3-pandas
Centos/RHEL       stable                       EPEL repository              yum install python3-pandas

However, the packages in the Linux package managers are often a few versions behind, so to get the newest version of
pandas, it’s recommended to install using the pip or conda methods described above.

2.3.6 Installing from source

See the contributing guide for complete instructions on building from the git source tree. Further, see creating a
development environment if you wish to create a pandas development environment.

2.4 Running the test suite

pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing. To
run it on your machine to verify that everything is working (and that you have all of the dependencies, soft and hard,
installed), make sure you have pytest >= 3.6 and Hypothesis >= 3.58, then run:

>>> import pandas as pd
>>> pd.test()
running: pytest --skip-slow --skip-network C:\Users\TP\Anaconda3\envs\py36\lib\site-packages\pandas

============================= test session starts =============================

platform win32 -- Python 3.6.2, pytest-3.6.0, py-1.4.34, pluggy-0.4.0
rootdir: C:\Users\TP\Documents\Python\pandasdev\pandas, inifile: setup.cfg
collected 12145 items / 3 skipped

..................................................................S......
........S................................................................
.........................................................................

==================== 12130 passed, 12 skipped in 368.339 seconds =====================

2.5 Dependencies

• setuptools: 24.2.0 or higher

• NumPy: 1.12.0 or higher
• python-dateutil: 2.5.0 or higher
• pytz

2.5.1 Recommended Dependencies

• numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well as smart chunking
and caching to achieve large speedups. If installed, must be Version 2.6.1 or higher.
• bottleneck: for accelerating certain types of nan evaluations. bottleneck uses specialized cython routines
to achieve large speedups. If installed, must be Version 1.2.0 or higher.

Note: You are highly encouraged to install these libraries, as they provide speed improvements, especially when
working with large data sets.
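To see whether the recommended libraries are present in the current environment, a check along these lines can help. It is a minimal sketch using only the standard library (Python 3.4 or later); the helper name is_installed is hypothetical, and the check does not verify the minimum versions listed above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("numexpr", "bottleneck"):
    status = "installed" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```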

2.5.2 Optional Dependencies

• Cython: Only necessary to build development version. Version 0.28.2 or higher.

• SciPy: miscellaneous statistical functions, Version 0.18.1 or higher
• xarray: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or
higher is recommended.
• PyTables: necessary for HDF5-based storage, Version 3.4.2 or higher
• pyarrow (>= 0.9.0): necessary for feather-based storage.
• Apache Parquet, either pyarrow (>= 0.7.0) or fastparquet (>= 0.2.1) for parquet-based storage. The snappy and
brotli libraries are available for compression support.
• SQLAlchemy: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you
also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the
SQLAlchemy docs. Some common drivers are:
– psycopg2: for PostgreSQL
– pymysql: for MySQL.

– SQLite: for SQLite, this is included in Python’s standard library by default.

• matplotlib: for plotting, Version 2.0.0 or higher.
• For Excel I/O:
– xlrd/xlwt: Excel reading (xlrd), version 1.0.0 or higher required, and writing (xlwt)
– openpyxl: openpyxl version 2.4.0 for writing .xlsx files (xlrd >= 0.9.0)
– XlsxWriter: Alternative Excel writer
• Jinja2: Template engine for conditional HTML formatting.
• s3fs: necessary for Amazon S3 access (s3fs >= 0.0.7).
• blosc: for msgpack compression using blosc
• gcsfs: necessary for Google Cloud Storage access (gcsfs >= 0.1.0).
• One of qtpy (requires PyQt or PySide), PyQt5, PyQt4, pygtk, xsel, or xclip: necessary to use
read_clipboard(). Most package managers on Linux distributions will have xclip and/or xsel
immediately available for installation.
• pandas-gbq: for Google BigQuery I/O. (pandas-gbq >= 0.8.0)
• Backports.lzma: Only for Python 2, for writing to and/or reading from an xz compressed DataFrame in CSV;
Python 3 support is built into the standard library.
• One of the following combinations of libraries is needed to use the top-level read_html() function:
Changed in version 0.23.0.

Note: If using BeautifulSoup4 a minimum version of 4.2.1 is required

– BeautifulSoup4 and html5lib (Any recent version of html5lib is okay.)

– BeautifulSoup4 and lxml
– BeautifulSoup4 and html5lib and lxml
– Only lxml, although see HTML Table Parsing for reasons as to why you should probably not take this
approach.

Warning:
– If you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will
not work with only BeautifulSoup4 installed.
– You are highly encouraged to read HTML Table Parsing gotchas. It explains issues surrounding the
installation and usage of the above three libraries.

Note:
– If you’re on a system with apt-get, you can run

sudo apt-get build-dep python-lxml

to get the necessary dependencies for installation of lxml. This will prevent further headaches down the
line.

Note: Without the optional dependencies, many useful features will not work. Hence, it is highly recommended that
you install these. A packaged distribution like Anaconda, ActivePython (version 2.7 or 3.5), or Enthought Canopy
may be worth considering.

CHAPTER THREE

GETTING STARTED

3.1 Package overview

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing
practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful
and flexible open source data analysis / manipulation tool available in any language. It is already well on its way
toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed
into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the
vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users,
DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy
and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can
simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both
aggregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into
DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining data sets
• Flexible reshaping and pivoting of data sets
• Hierarchical labeling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading
data from the ultrafast HDF5 format

• Time series-specific functionality: date range generation and frequency conversion, moving window statistics,
moving window linear regressions, date shifting and lagging, etc.
Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific
research environments. For data scientists, working with data is typically divided into multiple stages: munging and
cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or
tabular display. pandas is the ideal tool for all of these tasks.
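Two of the capabilities listed above, automatic label alignment and split-apply-combine, can be illustrated with a short sketch. The labels, keys, and values are invented for the example.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label, not by position; labels present in only one
# operand produce NaN in the result.
total = s1 + s2
print(total)

# Split-apply-combine: sum `value` within each `key` group.
df = pd.DataFrame({"key": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]})
sums = df.groupby("key")["value"].sum()
print(sums)
```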
Some other notes
• pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However,
as with anything else, generalization usually sacrifices performance. So if you focus on one feature for your
application, you may be able to create a faster specialized tool.
• pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in
Python.
• pandas has been used extensively in production in financial applications.

3.1.1 Data Structures

Dimensions   Name        Description
1            Series      1D labeled homogeneously-typed array
2            DataFrame   General 2D labeled, size-mutable tabular structure with potentially
                         heterogeneously-typed columns

Why more than one data structure?

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For
example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert
and remove objects from these containers in a dictionary-like fashion.
Also, we would like sensible default behaviors for the common API functions which take into account the typical
orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-dimensional data, a
burden is placed on the user to consider the orientation of the data set when writing functions; axes are considered
more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In pandas, the axes are
intended to lend more semantic meaning to the data; i.e., for a particular data set there is likely to be a “right” way to
orient the data. The goal, then, is to reduce the amount of mental effort required to code up data transformations in
downstream functions.
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the
columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results in more readable
code:

for col in df.columns:
    series = df[col]
    # do something with series
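The dictionary-like container behavior mentioned earlier in this section can be sketched as follows. The frame and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Insert a new column (a Series) under a key, as with a dict.
df["c"] = df["a"] + df["b"]

# Membership tests and deletion also follow dict conventions.
has_b = "b" in df
del df["b"]

print(list(df.columns), has_b)
```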

3.1.2 Mutability and copying of data

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The
length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast
majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability
where sensible.
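A short sketch of the distinction, using a made-up single-column frame: a method such as sort_values() returns a new object and leaves its input alone, whereas inserting a column changes the DataFrame itself.

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 1, 2]})

# sort_values returns a new DataFrame; `df` keeps its original row order.
sorted_df = df.sort_values(by="x")

# Inserting a column, by contrast, changes `df` itself (size mutability).
df["y"] = df["x"] * 10

print(df["x"].tolist())
print(sorted_df["x"].tolist())
print(list(sorted_df.columns))
```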


3.1.3 Getting Support

The first stop for pandas issues and ideas is the GitHub Issue Tracker. If you have a general question, pandas community
experts can answer through Stack Overflow.

3.1.4 Community

pandas is actively supported today by a community of like-minded individuals around the world who contribute their
valuable time and energy to help make open source pandas possible. Thanks to all of our contributors.
If you’re interested in contributing, please visit the contributing guide.
pandas is a NumFOCUS sponsored project. This helps ensure the success of the development of pandas as a
world-class open-source project, and makes it possible to donate to the project.

3.1.5 Project Governance

The governance process that the pandas project has used informally since its inception in 2008 is formalized in the
Project Governance documents. The documents clarify how decisions are made and how the various elements of our
community interact, including the relationship between open source collaborative development and work that may be
funded by for-profit or non-profit entities.
Wes McKinney is the Benevolent Dictator for Life (BDFL).

3.1.6 Development Team

The list of the Core Team members and more detailed information can be found on the people’s page of the governance
repo.

3.1.7 Institutional Partners

Information about current institutional partners can be found on the pandas website.

3.1.8 License

BSD 3-Clause License

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

3.2 10 Minutes to pandas

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.
Customarily, we import as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

3.2.1 Object Creation

See the Data Structure Intro section.

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [5]: dates = pd.date_range('20130101', periods=6)

In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [8]: df
Out[8]:
A B C D
2013-01-01 -1.357418 -0.142843 -0.744858 1.962986
2013-01-02 -1.349466 0.358791 -0.668969 -0.586338
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724
2013-01-04 1.076448 0.275756 0.178572 -0.669920
2013-01-05 -0.876710 -0.830077 -1.053295 -0.977186
2013-01-06 0.940624 -0.657478 -0.658436 0.077478

Creating a DataFrame by passing a dict of objects that can be converted to series-like structures:

In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
   ...:

In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

The columns of the resulting DataFrame have different dtypes.

In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled.
Here’s a subset of the attributes that will be completed:

In [12]: df2.<TAB>

df2.A df2.bool
df2.abs df2.boxplot
df2.add df2.C
df2.add_prefix df2.clip
df2.add_suffix df2.clip_lower
df2.align df2.clip_upper
df2.all df2.columns
df2.any df2.combine
df2.append df2.combine_first
df2.apply df2.compound
df2.applymap df2.consolidate
df2.D

As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes
have been truncated for brevity.

3.2.2 Viewing Data

See the Basics section.

Here is how to view the top and bottom rows of the frame:

In [13]: df.head()
Out[13]:
A B C D
2013-01-01 -1.357418 -0.142843 -0.744858 1.962986
2013-01-02 -1.349466 0.358791 -0.668969 -0.586338
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724
2013-01-04 1.076448 0.275756 0.178572 -0.669920
2013-01-05 -0.876710 -0.830077 -1.053295 -0.977186

In [14]: df.tail(3)
Out[14]:
A B C D
2013-01-04 1.076448 0.275756 0.178572 -0.669920
2013-01-05 -0.876710 -0.830077 -1.053295 -0.977186
2013-01-06 0.940624 -0.657478 -0.658436 0.077478

Display the index and columns:

In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')

In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive
operation when your DataFrame has columns with different data types, which comes down to a fundamental
difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames
have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that
can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a
Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying
data.

In [17]: df.to_numpy()
Out[17]:
array([[-1.35741793, -0.14284332, -0.74485798, 1.96298552],
[-1.3494656 , 0.35879089, -0.66896897, -0.58633758],
[-0.58865118, -0.83778629, -1.24457304, 0.04272407],
[ 1.07644808, 0.2757562 , 0.17857179, -0.66992021],
[-0.8767104 , -0.83007713, -1.05329532, -0.97718584],
[ 0.94062388, -0.65747798, -0.65843634, 0.07747846]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)

Note: DataFrame.to_numpy() does not include the index or column labels in the output.
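The dtype behavior described above can be checked directly. This is a minimal sketch with two small frames invented for the example: one holding only floats, one mixing floats and strings.

```python
import numpy as np
import pandas as pd

floats = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

# A single shared dtype: the result is a float64 array.
print(floats.to_numpy().dtype)

# No common numeric dtype: every value is held as a Python object.
print(mixed.to_numpy().dtype)

# Index and column labels are not part of the array; only its shape
# reflects the number of rows and columns.
print(floats.to_numpy().shape)
```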

describe() shows a quick statistical summary of your data:

In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.359196 -0.305606 -0.698593 -0.025043
std 1.099832 0.545527 0.489481 1.058759
min -1.357418 -0.837786 -1.244573 -0.977186
25% -1.231277 -0.786927 -0.976186 -0.649025
50% -0.732681 -0.400161 -0.706913 -0.271807
75% 0.558305 0.171106 -0.661069 0.068790
max 1.076448 0.358791 0.178572 1.962986

Transposing your data:

In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A -1.357418 -1.349466 -0.588651 1.076448 -0.876710 0.940624
B -0.142843 0.358791 -0.837786 0.275756 -0.830077 -0.657478
C -0.744858 -0.668969 -1.244573 0.178572 -1.053295 -0.658436
D 1.962986 -0.586338 0.042724 -0.669920 -0.977186 0.077478

Sorting by an axis:

In [21]: df.sort_index(axis=1, ascending=False)

Out[21]:
D C B A
2013-01-01 1.962986 -0.744858 -0.142843 -1.357418
2013-01-02 -0.586338 -0.668969 0.358791 -1.349466
2013-01-03 0.042724 -1.244573 -0.837786 -0.588651
2013-01-04 -0.669920 0.178572 0.275756 1.076448
2013-01-05 -0.977186 -1.053295 -0.830077 -0.876710
2013-01-06 0.077478 -0.658436 -0.657478 0.940624

Sorting by values:

In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724
2013-01-05 -0.876710 -0.830077 -1.053295 -0.977186
2013-01-06 0.940624 -0.657478 -0.658436 0.077478
2013-01-01 -1.357418 -0.142843 -0.744858 1.962986
2013-01-04 1.076448 0.275756 0.178572 -0.669920
2013-01-02 -1.349466 0.358791 -0.668969 -0.586338

3.2.3 Selection

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for
interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc
and .iloc.
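As a quick orientation, the four accessors named in the note can be compared on a single cell. The frame and its row labels are made up for illustration; all four expressions read the same value.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
                  index=["r1", "r2", "r3"])

by_label = df.loc["r2", "A"]       # label-based, general purpose
by_position = df.iloc[1, 0]        # integer-position-based, general purpose
scalar_label = df.at["r2", "A"]    # label-based, scalar access only
scalar_position = df.iat[1, 0]     # position-based, scalar access only

print(by_label, by_position, scalar_label, scalar_position)
```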

See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [23]: df['A']
Out[23]:
2013-01-01 -1.357418
2013-01-02 -1.349466
2013-01-03 -0.588651
2013-01-04 1.076448
2013-01-05 -0.876710
2013-01-06 0.940624
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.

In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 -1.357418 -0.142843 -0.744858 1.962986
2013-01-02 -1.349466 0.358791 -0.668969 -0.586338
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724

In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 -1.349466 0.358791 -0.668969 -0.586338
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724
2013-01-04 1.076448 0.275756 0.178572 -0.669920

Selection by Label

See more in Selection by Label.

For getting a cross section using a label:

In [26]: df.loc[dates[0]]
Out[26]:
A -1.357418
B -0.142843
C -0.744858
D 1.962986
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [27]: df.loc[:, ['A', 'B']]

Out[27]:
A B
2013-01-01 -1.357418 -0.142843
2013-01-02 -1.349466 0.358791
2013-01-03 -0.588651 -0.837786
2013-01-04 1.076448 0.275756
2013-01-05 -0.876710 -0.830077
2013-01-06 0.940624 -0.657478

Showing label slicing, both endpoints are included:

In [28]: df.loc['20130102':'20130104', ['A', 'B']]

Out[28]:
A B
2013-01-02 -1.349466 0.358791
2013-01-03 -0.588651 -0.837786
2013-01-04 1.076448 0.275756
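
The difference between label slicing and positional slicing can be checked directly. The following is a minimal sketch on a small hypothetical frame (the name `demo` and its values are illustrative, not the data shown above): a `.loc` slice includes both endpoints, while an `.iloc` slice excludes the stop position, as ordinary Python slices do.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
dates = pd.date_range('20130101', periods=6)
demo = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slice: both '20130102' and '20130104' are included (3 rows).
label_slice = demo.loc['20130102':'20130104', ['A', 'B']]

# Position slice: the stop position 3 is excluded (2 rows).
position_slice = demo.iloc[1:3, 0:2]

print(len(label_slice), len(position_slice))
```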

Reduction in the dimensions of the returned object:

In [29]: df.loc['20130102', ['A', 'B']]

Out[29]:
A -1.349466
B 0.358791
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

In [30]: df.loc[dates[0], 'A']

Out[30]: -1.3574179252656504

For getting fast access to a scalar (equivalent to the prior method):

In [31]: df.at[dates[0], 'A']

Out[31]: -1.3574179252656504

Selection by Position

See more in Selection by Position.

Select via the position of the passed integers:

In [32]: df.iloc[3]
Out[32]:
A 1.076448
B 0.275756
C 0.178572
D -0.669920
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similarly to NumPy/Python:

In [33]: df.iloc[3:5, 0:2]

Out[33]:
A B
2013-01-04 1.076448 0.275756
2013-01-05 -0.876710 -0.830077

By lists of integer position locations, similar to the NumPy/Python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]

Out[34]:
A C
2013-01-02 -1.349466 -0.668969
2013-01-03 -0.588651 -1.244573
2013-01-05 -0.876710 -1.053295

For slicing rows explicitly:

In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 -1.349466 0.358791 -0.668969 -0.586338
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724

For slicing columns explicitly:

In [36]: df.iloc[:, 1:3]

Out[36]:
B C
2013-01-01 -0.142843 -0.744858
2013-01-02 0.358791 -0.668969
2013-01-03 -0.837786 -1.244573
2013-01-04 0.275756 0.178572
2013-01-05 -0.830077 -1.053295
2013-01-06 -0.657478 -0.658436

For getting a value explicitly:

In [37]: df.iloc[1, 1]
Out[37]: 0.35879089022289157

For getting fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1, 1]
Out[38]: 0.35879089022289157

Boolean Indexing

Using a single column’s values to select data.

In [39]: df[df.A > 0]

Out[39]:
A B C D
2013-01-04 1.076448 0.275756 0.178572 -0.669920
2013-01-06 0.940624 -0.657478 -0.658436 0.077478

Selecting values from a DataFrame where a boolean condition is met.

In [40]: df[df > 0]

Out[40]:
A B C D
2013-01-01 NaN NaN NaN 1.962986
2013-01-02 NaN 0.358791 NaN NaN
2013-01-03 NaN NaN NaN 0.042724
2013-01-04 1.076448 0.275756 0.178572 NaN
2013-01-05 NaN NaN NaN NaN
2013-01-06 0.940624 NaN NaN 0.077478

Using the isin() method for filtering:

In [41]: df2 = df.copy()

In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

In [43]: df2
Out[43]:
A B C D E
2013-01-01 -1.357418 -0.142843 -0.744858 1.962986 one
2013-01-02 -1.349466 0.358791 -0.668969 -0.586338 one
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724 two
2013-01-04 1.076448 0.275756 0.178572 -0.669920 three
2013-01-05 -0.876710 -0.830077 -1.053295 -0.977186 four
2013-01-06 0.940624 -0.657478 -0.658436 0.077478 three

In [44]: df2[df2['E'].isin(['two', 'four'])]

Out[44]:
A B C D E
2013-01-03 -0.588651 -0.837786 -1.244573 0.042724 two
2013-01-05 -0.876710 -0.830077 -1.053295 -0.977186 four
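
Boolean masks can also be combined. The sketch below uses a hypothetical frame (`demo`, `mask`, `selected` and `rest` are illustrative names) to show a comparison joined with an `isin()` test via `&`, and a negated mask via `~`. Each comparison is wrapped in parentheses because `&` binds more tightly than `>`.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({'A': [-1.0, 2.0, 3.0, -4.0],
                     'E': ['one', 'two', 'four', 'two']})

# Rows where A is positive AND E is one of the listed values.
mask = (demo['A'] > 0) & demo['E'].isin(['two', 'four'])
selected = demo[mask]

# ~ inverts the mask, selecting every remaining row.
rest = demo[~mask]

print(selected)
print(rest)
```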

Setting

Setting a new column automatically aligns the data by the indexes.

In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))

In [46]: s1
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64

In [47]: df['F'] = s1
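
To see what the alignment does, here is a minimal sketch with hypothetical names (`demo`, `shifted`). The Series starts one day later than the frame's index, so the first row receives NaN and the Series label that has no matching row is dropped.

```python
import pandas as pd

# Hypothetical frame indexed by three consecutive days.
demo = pd.DataFrame({'A': [10.0, 20.0, 30.0]},
                    index=pd.date_range('20130101', periods=3))

# Series whose index starts one day later than the frame's index.
shifted = pd.Series([1, 2, 3], index=pd.date_range('20130102', periods=3))

# Assignment aligns on the index: values are matched by date, not by position.
demo['F'] = shifted

print(demo)
```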

Setting values by label:

In [48]: df.at[dates[0], 'A'] = 0

Setting values by position:

In [49]: df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

In [50]: df.loc[:, 'D'] = np.array([5] * len(df))

The result of the prior setting operations.

In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -0.744858 5 NaN
2013-01-02 -1.349466 0.358791 -0.668969 5 1.0
2013-01-03 -0.588651 -0.837786 -1.244573 5 2.0
2013-01-04 1.076448 0.275756 0.178572 5 3.0
2013-01-05 -0.876710 -0.830077 -1.053295 5 4.0
2013-01-06 0.940624 -0.657478 -0.658436 5 5.0

A where operation with setting.

In [52]: df2 = df.copy()

In [53]: df2[df2 > 0] = -df2

In [54]: df2
Out[54]:
A B C D F
2013-01-01 0.000000 0.000000 -0.744858 -5 NaN
2013-01-02 -1.349466 -0.358791 -0.668969 -5 -1.0
2013-01-03 -0.588651 -0.837786 -1.244573 -5 -2.0
2013-01-04 -1.076448 -0.275756 -0.178572 -5 -3.0
2013-01-05 -0.876710 -0.830077 -1.053295 -5 -4.0
2013-01-06 -0.940624 -0.657478 -0.658436 -5 -5.0

3.2.4 Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See
the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1

In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.744858 5 NaN 1.0
2013-01-02 -1.349466 0.358791 -0.668969 5 1.0 1.0
2013-01-03 -0.588651 -0.837786 -1.244573 5 2.0 NaN
2013-01-04 1.076448 0.275756 0.178572 5 3.0 NaN

To drop any rows that have missing data.

In [58]: df1.dropna(how='any')
Out[58]:
A B C D F E
2013-01-02 -1.349466 0.358791 -0.668969 5 1.0 1.0

Filling missing data.

In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.744858 5 5.0 1.0
2013-01-02 -1.349466 0.358791 -0.668969 5 1.0 1.0
2013-01-03 -0.588651 -0.837786 -1.244573 5 2.0 5.0
2013-01-04 1.076448 0.275756 0.178572 5 3.0 5.0

To get the boolean mask where values are nan.

In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
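
The statement that np.nan is not included in computations by default can be illustrated with a short sketch. The Series below is hypothetical; `skipna` is the keyword that controls this behavior on reductions such as `mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = s.mean(skipna=False)  # NaN propagates into the result
n_valid = s.count()                 # counts only non-missing values

print(mean_default, mean_strict, n_valid)
```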

3.2.5 Operations

See the Basic section on Binary Ops.

Stats

Operations in general exclude missing data.

Performing a descriptive statistic:
In [61]: df.mean()
Out[61]:
A -0.132959
B -0.281799
C -0.698593
D 5.000000
F 3.000000
dtype: float64

Same operation on the other axis:

In [62]: df.mean(1)
Out[62]:
2013-01-01 1.063786
2013-01-02 0.868071
2013-01-03 0.865798
2013-01-04 1.906155
2013-01-05 1.247983
2013-01-06 1.924942
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically
broadcasts along the specified dimension.
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis='index')

Out[65]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.588651 -1.837786 -2.244573 4.0 1.0
2013-01-04 -1.923552 -2.724244 -2.821428 2.0 0.0
2013-01-05 -5.876710 -5.830077 -6.053295 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
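
A smaller sketch of the same operation, using hypothetical names (`demo`, `offsets`, `row_wise`): with `axis='index'` the Series is matched against the row labels and its value for each row is subtracted from every column of that row. A NaN in the Series produces NaN across the whole row.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)

# Hypothetical frame and Series sharing the same index.
demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]}, index=dates)
offsets = pd.Series([1.0, np.nan, 3.0], index=dates)

# Subtract offsets[row] from every column in that row.
row_wise = demo.sub(offsets, axis='index')

print(row_wise)
```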

Apply

Applying functions to the data:

In [66]: df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -0.744858 5 NaN
2013-01-02 -1.349466 0.358791 -1.413827 10 1.0
2013-01-03 -1.938117 -0.478995 -2.658400 15 3.0
2013-01-04 -0.861669 -0.203239 -2.479828 20 6.0
2013-01-05 -1.738379 -1.033316 -3.533124 25 10.0
2013-01-06 -0.797755 -1.690794 -4.191560 30 15.0

In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A 2.425914
B 1.196577
C 1.423145
D 0.000000
F 4.000000
dtype: float64
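
By default `apply()` passes each column to the function. The sketch below, on a hypothetical frame, contrasts that with `axis=1`, which passes each row instead.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({'A': [1.0, 4.0, 2.0], 'B': [10.0, 10.0, 13.0]})

# Default (axis=0): the function receives one column at a time.
col_range = demo.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```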

Histogramming

See more at Histogramming and Discretization.

In [68]: s = pd.Series(np.random.randint(0, 7, size=10))

In [69]: s
Out[69]:
0 0
1 4
2 6
3 5
4 5
5 2
6 2
7 0
8 4
9 2
dtype: int64

In [70]: s.value_counts()
Out[70]:
2 3
5 2
4 2
0 2
6 1
dtype: int64

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each
element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions
by default (and in some cases always uses them). See more at Vectorized String Methods.

In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
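
The note about regular expressions matters as soon as a pattern contains a metacharacter. A minimal sketch, with a hypothetical Series, using `str.contains()` and its `regex` keyword:

```python
import pandas as pd

# Hypothetical strings; only the first contains a literal dot.
s = pd.Series(['a.b', 'axb', 'ab'])

# Default: the pattern is a regular expression, so '.' matches any character.
as_regex = s.str.contains('a.b')

# regex=False: the pattern is treated as a literal substring.
as_literal = s.str.contains('a.b', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```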

3.2.6 Merge

Concat

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various
kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together with concat():
In [73]: df = pd.DataFrame(np.random.randn(10, 4))

In [74]: df
Out[74]:
0 1 2 3
0 1.231856 0.283952 0.118992 -0.724206
1 -1.653430 0.118499 1.300022 -1.526964
2 0.081166 0.233922 1.264717 0.935723
3 -0.689927 -0.517942 -0.277571 -1.175085
4 1.795727 -1.335807 -0.846357 0.062066
5 -0.408033 0.071026 0.528408 -0.646874
6 -1.151626 -0.323356 1.020479 -0.308118
7 -0.303478 -0.233435 2.130528 0.479592
8 0.061129 0.481290 0.259358 -0.979837
9 0.714785 0.176319 0.647529 1.651090

# break it into pieces

In [75]: pieces = [df[:3], df[3:7], df[7:]]

In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 1.231856 0.283952 0.118992 -0.724206
1 -1.653430 0.118499 1.300022 -1.526964
2 0.081166 0.233922 1.264717 0.935723
3 -0.689927 -0.517942 -0.277571 -1.175085
4 1.795727 -1.335807 -0.846357 0.062066
5 -0.408033 0.071026 0.528408 -0.646874
6 -1.151626 -0.323356 1.020479 -0.308118
7 -0.303478 -0.233435 2.130528 0.479592
8 0.061129 0.481290 0.259358 -0.979837
9 0.714785 0.176319 0.647529 1.651090

Join

SQL-style merges. See the Database style joining section.

In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2

In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5

In [81]: pd.merge(left, right, on='key')

Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

Another example, this time with unique keys:

In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2

In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5

In [86]: pd.merge(left, right, on='key')

Out[86]:
key lval rval
0 foo 1 4
1 bar 2 5
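
The two examples above differ only in whether the key is duplicated. The sketch below combines both situations in hypothetical frames and also shows the `how` parameter of `pd.merge()`: duplicated keys produce every pairing of matching rows, and `how='outer'` additionally keeps keys found on only one side.

```python
import pandas as pd

# Hypothetical frames: 'foo' appears twice on each side, 'bar' and 'baz' on one side only.
left = pd.DataFrame({'key': ['foo', 'foo', 'bar'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'foo', 'baz'], 'rval': [4, 5, 6]})

# Inner join (the default): 2 x 2 pairings of 'foo', nothing for 'bar' or 'baz'.
inner = pd.merge(left, right, on='key')

# Outer join: the same 4 rows plus one row each for 'bar' and 'baz', padded with NaN.
outer = pd.merge(left, right, on='key', how='outer')

print(inner)
print(outer)
```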

Append

Append rows to a dataframe. See the Appending section.

In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

In [88]: df
Out[88]:
A B C D
0 1.392422 0.748325 -0.436142 -1.278450
1 0.043423 -0.533787 -0.351340 0.716190
2 0.969599 -1.009690 -0.706241 1.735671
3 -0.602929 0.052976 0.317099 0.309610
4 -1.071123 -2.226177 -0.279247 -0.216166
5 0.386439 -0.072899 -0.985965 -0.626187
6 -0.008656 -0.771064 -0.377700 1.899199
7 -0.286120 -0.281921 -0.367725 -0.919801

In [89]: s = df.iloc[3]

In [90]: df.append(s, ignore_index=True)

Out[90]:
A B C D
0 1.392422 0.748325 -0.436142 -1.278450
1 0.043423 -0.533787 -0.351340 0.716190
2 0.969599 -1.009690 -0.706241 1.735671
3 -0.602929 0.052976 0.317099 0.309610
4 -1.071123 -2.226177 -0.279247 -0.216166
5 0.386439 -0.072899 -0.985965 -0.626187
6 -0.008656 -0.771064 -0.377700 1.899199
7 -0.286120 -0.281921 -0.367725 -0.919801
8 -0.602929 0.052976 0.317099 0.309610
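
The same result can be obtained with `pd.concat()`, which is worth knowing because pandas releases after 0.24 deprecated and later removed `DataFrame.append`. The following is a sketch under that assumption, on a hypothetical frame (`demo`, `row` and `extended` are illustrative names): the row Series is turned into a one-row frame and concatenated.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=['A', 'B'])

# A single row, returned as a Series indexed by the column names.
row = demo.iloc[3]

# Convert the Series to a one-row frame, then concatenate and renumber the index.
extended = pd.concat([demo, row.to_frame().T], ignore_index=True)

print(extended)
```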

3.2.7 Grouping

By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
See the Grouping section.

In [91]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B': ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C': np.random.randn(8),
....: 'D': np.random.randn(8)})
....:

In [92]: df
Out[92]:
A B C D
0 foo one -1.304704 -0.260415
1 bar one -0.218177 1.004457
2 foo two -0.598015 -1.822984
3 bar three -0.342118 0.873662
4 foo two 0.034035 1.571507
5 bar two -0.033906 -0.785660
6 foo one -0.637461 0.191702
7 foo three 0.622365 0.771074

Grouping and then applying the sum() function to the resulting groups.

In [93]: df.groupby('A').sum()
Out[93]:
C D
A
bar -0.594201 1.092460
foo -1.883780 0.450884

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [94]: df.groupby(['A', 'B']).sum()

Out[94]:
C D
A B
bar one -0.218177 1.004457
three -0.342118 0.873662
two -0.033906 -0.785660
foo one -1.942164 -0.068713
three 0.622365 0.771074
two -0.563980 -0.251477
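
The split, apply and combine steps listed at the start of this section can be written out by hand, which may help clarify what `groupby(...).sum()` does. This is only an illustrative sketch on hypothetical data, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                     'C': [1.0, 2.0, 3.0, 4.0]})

# Split: iterating over a groupby yields (key, sub-frame) pairs.
# Apply: sum column C within each sub-frame.
# Combine: collect the per-group results into one Series.
manual = pd.Series({key: group['C'].sum() for key, group in demo.groupby('A')})

# The built-in equivalent.
builtin = demo.groupby('A')['C'].sum()

print(manual)
print(builtin)
```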

3.2.8 Reshaping

See the sections on Hierarchical Indexing and Reshaping.

Stack

In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
....: 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two',
....: 'one', 'two', 'one', 'two']]))
....:

In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [98]: df2 = df[:4]

In [99]: df2
Out[99]:
A B
first second
bar one 1.318905 0.645529
two -1.061967 -1.019842
baz one -0.428076 0.273714
two 0.670740 0.674048

The stack() method “compresses” a level in the DataFrame’s columns.

In [100]: stacked = df2.stack()

In [101]: stacked
Out[101]:
first second
bar one A 1.318905
B 0.645529
two A -1.061967
B -1.019842
baz one A -0.428076
B 0.273714
two A 0.670740
B 0.674048
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is
unstack(), which by default unstacks the last level:
In [102]: stacked.unstack()
Out[102]:
A B
first second
bar one 1.318905 0.645529
two -1.061967 -1.019842
baz one -0.428076 0.273714
two 0.670740 0.674048

In [103]: stacked.unstack(1)
Out[103]:
second one two
first
bar A 1.318905 -1.061967
B 0.645529 -1.019842
baz A -0.428076 0.670740
B 0.273714 0.674048

In [104]: stacked.unstack(0)
Out[104]:
first bar baz
second
one A 1.318905 -0.428076
B 0.645529 0.273714
two A -1.061967 0.670740
B -1.019842 0.674048
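
As a check on the relationship between the two methods, the sketch below (hypothetical frame and names) stacks a frame, unstacks it again, and confirms that the original is recovered. It also shows that a level can be selected by name rather than by number.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level row index.
index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
demo = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=['A', 'B'])

# stack() moves the column labels into a new innermost index level.
stacked = demo.stack()

# unstack() with no argument moves the innermost level back to the columns.
round_trip = stacked.unstack()

# A level can be given by name; 'second' is level 1 here.
by_name = stacked.unstack('second')

print(round_trip.equals(demo))
```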

Pivot Tables

See the section on Pivot Tables.

In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
.....: 'B': ['A', 'B', 'C'] * 4,
.....: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
.....: 'D': np.random.randn(12),
.....: 'E': np.random.randn(12)})
.....:

In [106]: df
Out[106]:
A B C D E
0 one A foo 1.286758 -0.434526
1 one B foo 1.050525 0.587465
2 two C foo -0.743868 2.197816
3 three A bar 0.347013 0.353235
4 one B bar 0.308812 -1.030456
5 one C bar -0.345680 -0.319095
6 two A foo -0.238305 0.718997
7 three B foo -1.962755 -0.136761
8 one C foo 1.446097 1.056697
9 one A bar -0.293508 -0.566902
10 two B bar 1.450870 0.272084
11 three C bar 0.017218 -0.275535

We can produce pivot tables from this data very easily:

In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C bar foo
A B
one A -0.293508 1.286758
B 0.308812 1.050525
C -0.345680 1.446097
three A 0.347013 NaN
B NaN -1.962755
C 0.017218 NaN
two A NaN -0.238305
B 1.450870 NaN
C NaN -0.743868
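
One way to read a pivot table is as a group-by followed by an unstack. The sketch below, on hypothetical data, compares `pd.pivot_table()` (whose `aggfunc` defaults to the mean) with the equivalent `groupby(...).mean().unstack()` chain. Combinations that never occur in the data appear as NaN.

```python
import pandas as pd

# Hypothetical data: ('two', 'foo') occurs twice, ('two', 'bar') never occurs.
demo = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                     'C': ['foo', 'bar', 'foo', 'foo'],
                     'D': [1.0, 2.0, 3.0, 5.0]})

# Rows keyed by A, columns keyed by C, cells hold the mean of D.
pivoted = pd.pivot_table(demo, values='D', index='A', columns='C')

# The same table built by hand: group, average, then move C to the columns.
manual = demo.groupby(['A', 'C'])['D'].mean().unstack()

print(pivoted)
```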

3.2.9 Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency
conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to,
financial applications. See the Time Series section.

In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 24484
Freq: 5T, dtype: int64
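
Because the example above falls entirely into one 5-minute bin, a smaller sketch with two bins may make the mechanics easier to follow. The data here is hypothetical: ten one-minute observations are summed into 5-minute buckets, each labelled by the start of its bucket.

```python
import numpy as np
import pandas as pd

# Hypothetical series: the values 0..9 at one-minute intervals.
rng = pd.date_range('1/1/2012', periods=10, freq='min')
ts = pd.Series(np.arange(10), index=rng)

# Downsample: minutes 00:00-00:04 form the first bin, 00:05-00:09 the second.
five_min = ts.resample('5min').sum()

print(five_min)
```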

Time zone representation:

In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [113]: ts
Out[113]:
2012-03-06 -1.455120
2012-03-07 -0.600515
2012-03-08 0.931334
2012-03-09 -0.824164
2012-03-10 -0.778422
Freq: D, dtype: float64

In [114]: ts_utc = ts.tz_localize('UTC')

In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 -1.455120
2012-03-07 00:00:00+00:00 -0.600515
2012-03-08 00:00:00+00:00 0.931334
2012-03-09 00:00:00+00:00 -0.824164
2012-03-10 00:00:00+00:00 -0.778422
Freq: D, dtype: float64

Converting to another time zone:

In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 -1.455120
2012-03-06 19:00:00-05:00 -0.600515
2012-03-07 19:00:00-05:00 0.931334
2012-03-08 19:00:00-05:00 -0.824164
2012-03-09 19:00:00-05:00 -0.778422
Freq: D, dtype: float64
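
The difference between the two calls is easiest to see on a single timestamp. In this sketch (variable names are illustrative), `tz_localize()` attaches a zone to a naive timestamp without changing the wall-clock reading, while `tz_convert()` keeps the instant fixed and changes the wall-clock reading.

```python
import pandas as pd

# A naive timestamp carries no time zone information.
naive = pd.Timestamp('2012-03-06 00:00')

# Declare that the wall-clock reading is in UTC.
utc = naive.tz_localize('UTC')

# Express the same instant in another zone (UTC-5 on this date).
eastern = utc.tz_convert('US/Eastern')

print(utc, eastern)
```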

Converting between time span representations:

In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [119]: ts
Out[119]:
2012-01-31 0.340828
2012-02-29 -2.392782
2012-03-31 -1.354964
2012-04-30 0.602123
2012-05-31 1.194818
Freq: M, dtype: float64

In [120]: ps = ts.to_period()

In [121]: ps
Out[121]:
2012-01 0.340828
2012-02 -2.392782
2012-03 -1.354964
2012-04 0.602123
2012-05 1.194818
Freq: M, dtype: float64

In [122]: ps.to_timestamp()
Out[122]:
2012-01-01 0.340828
2012-02-01 -2.392782
2012-03-01 -1.354964
2012-04-01 0.602123
2012-05-01 1.194818
Freq: MS, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following
example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following
the quarter end:

In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [126]: ts.head()
Out[126]:
1990-03-01 09:00 -0.791097
1990-06-01 09:00 -0.435129
1990-09-01 09:00 0.159314
1990-12-01 09:00 1.550794
1991-03-01 09:00 -1.841614
Freq: H, dtype: float64
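
The index arithmetic above can be traced on a single Period. This sketch follows the first two conversions (quarter to its last month, then one month forward) and stops at the first day of that month rather than continuing to an hourly frequency. The variable names are illustrative.

```python
import pandas as pd

# First quarter of a fiscal year that ends in November (December to February).
quarter = pd.Period('1990Q1', freq='Q-NOV')

# Last month of the quarter.
month_end = quarter.asfreq('M', how='end')

# Adding an integer moves forward by that many periods (months here).
next_month = month_end + 1

# First day of the following month.
first_day = next_month.asfreq('D', how='start')

print(month_end, next_month, first_day)
```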

3.2.10 Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API
documentation.

In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
.....: "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
.....:

Convert the raw grades to a categorical data type.

In [128]: df["grade"] = df["raw_grade"].astype("category")

Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!).

In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]

Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new
Series by default).

In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
.....: "good", "very good"])
.....:

In [132]: df["grade"]
Out[132]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Grouping by a categorical column also shows empty categories.

In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
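
The whole categorical workflow can be condensed into one sketch. Two choices here are assumptions rather than what the text above does: `rename_categories()` is used as a non-mutating alternative to assigning to `Series.cat.categories`, and `observed=False` is passed explicitly so that empty categories are kept regardless of the default in the installed pandas version.

```python
import pandas as pd

# Raw grades converted to a categorical; the categories are 'a', 'b', 'e'.
grades = pd.Series(['a', 'b', 'b', 'a', 'a', 'e']).astype('category')

# Rename positionally: 'a' -> 'very good', 'b' -> 'good', 'e' -> 'very bad'.
renamed = grades.cat.rename_categories(['very good', 'good', 'very bad'])

# Reorder and add the categories that have no observations.
full = renamed.cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good'])

# Group sizes, including the empty categories 'bad' and 'medium'.
frame = pd.DataFrame({'grade': full})
sizes = frame.groupby('grade', observed=False).size()

print(sizes)
```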

3.2.11 Plotting

See the Plotting docs.

In [135]: ts = pd.Series(np.random.randn(1000),
.....: index=pd.date_range('1/1/2000', periods=1000))
.....:

In [136]: ts = ts.cumsum()

In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7f9bd00be518>

On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:

In [139]: df = df.cumsum()

In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>

In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f9bd00be358>

In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f9bcf984198>

3.2.12 Getting Data In/Out

CSV

Writing to a csv file.

In [143]: df.to_csv('foo.csv')

Reading from a csv file.

In [144]: pd.read_csv('foo.csv')
Out[144]:
Unnamed: 0 A B C D
0 2000-01-01 -1.106212 0.052439 0.258578 0.167704
1 2000-01-02 0.096291 -0.168443 0.919563 3.146129
2 2000-01-03 0.751733 -1.444076 0.151257 1.914114
3 2000-01-04 0.320187 -2.203755 0.083312 2.386900
4 2000-01-05 -0.771781 -2.477903 -0.590597 0.984160
5 2000-01-06 -3.661357 -1.987684 -2.283570 -0.069557
6 2000-01-07 -6.574314 -2.202263 -4.097481 1.613022
7 2000-01-08 -7.034926 -3.331085 -2.322236 1.296695
8 2000-01-09 -7.054467 -2.845888 -3.695062 1.710311
9 2000-01-10 -7.264232 -2.808375 -2.293146 1.705831
10 2000-01-11 -8.255336 -3.507692 -1.732582 1.786498
11 2000-01-12 -9.013493 -1.793710 -2.723867 0.926982
12 2000-01-13 -8.013036 -2.114864 -2.837715 1.582205
13 2000-01-14 -7.536122 0.062405 -4.281060 1.689125
14 2000-01-15 -7.730554 -0.829700 -4.331371 0.531031
15 2000-01-16 -8.498195 -1.308348 -2.787777 -0.259439
16 2000-01-17 -7.424926 -3.153015 -1.993948 -0.253209
17 2000-01-18 -7.218294 -2.179350 -1.021484 -2.308590
18 2000-01-19 -7.090618 -1.482352 -0.681502 -1.576042
19 2000-01-20 -8.145643 -0.722648 -0.165623 -1.533321
20 2000-01-21 -7.780785 -1.858508 1.638698 -2.510842
21 2000-01-22 -6.967511 -1.914543 0.949562 -2.571928
22 2000-01-23 -6.531915 -2.273032 2.319920 -2.849229
23 2000-01-24 -6.377541 -2.330573 2.434004 -2.750015
24 2000-01-25 -5.808551 -1.426266 2.786692 -2.278309
25 2000-01-26 -9.053528 -1.579310 3.541408 -2.287926
26 2000-01-27 -10.288457 -3.429093 2.354397 -0.456415
27 2000-01-28 -10.707880 -3.436909 3.105953 1.112988
28 2000-01-29 -12.341329 -4.646381 0.908548 0.357143
29 2000-01-30 -12.749743 -4.185839 -0.464568 -0.365092
.. ... ... ... ... ...
970 2002-08-28 -49.300819 -10.625787 -19.795064 39.837107
971 2002-08-29 -50.392348 -10.418061 -19.378916 39.793901
972 2002-08-30 -50.732231 -10.648245 -19.245351 38.327009
973 2002-08-31 -51.497373 -11.348523 -18.228027 38.063597
974 2002-09-01 -50.660243 -11.462782 -18.183167 38.562628
975 2002-09-02 -51.762582 -12.035179 -19.585008 38.995538
976 2002-09-03 -51.484775 -13.387867 -20.044674 37.573936
977 2002-09-04 -49.577076 -14.414938 -18.518241 36.573115
978 2002-09-05 -49.128468 -14.643546 -18.708406 36.314032
979 2002-09-06 -48.015028 -15.202744 -19.568041 36.091346
980 2002-09-07 -47.940819 -14.991537 -17.992336 35.628932
981 2002-09-08 -46.866179 -15.406872 -19.890962 36.394331
982 2002-09-09 -46.373293 -16.111291 -19.094009 33.575639
983 2002-09-10 -43.817322 -15.010925 -19.156849 34.569563
984 2002-09-11 -42.428442 -14.225363 -16.896909 34.043532
985 2002-09-12 -42.866898 -14.312137 -18.545225 33.674460
986 2002-09-13 -43.778587 -14.385213 -17.998220 34.099622
987 2002-09-14 -42.350177 -15.635587 -18.182924 34.375302
988 2002-09-15 -44.207904 -12.995189 -16.650983 37.059518
989 2002-09-16 -44.730886 -14.522054 -16.289390 37.288436
990 2002-09-17 -44.281695 -14.120764 -18.318399 38.276995
991 2002-09-18 -43.424951 -13.241882 -18.959816 38.934585
992 2002-09-19 -42.235279 -14.842305 -18.216721 40.166678
993 2002-09-20 -43.290689 -14.197939 -16.755549 41.301336
994 2002-09-21 -42.812843 -15.131699 -19.759868 40.693523
995 2002-09-22 -42.400940 -15.513853 -19.523622 40.976186
996 2002-09-23 -42.913685 -16.636313 -20.827121 43.044706
997 2002-09-24 -44.918366 -13.358913 -20.794718 42.358436
998 2002-09-25 -43.780017 -13.686018 -21.022278 42.238321
999 2002-09-26 -43.725842 -13.353634 -21.775078 42.442283

[1000 rows x 5 columns]
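
Note that the dates come back as an ordinary column named "Unnamed: 0" rather than as the index. A sketch of one way to restore the original layout uses the `index_col` and `parse_dates` parameters of `read_csv()`. An in-memory buffer stands in for the file here, and the frame is hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with a date index.
demo = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                    index=pd.date_range('1/1/2000', periods=3),
                    columns=['A', 'B'])

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# Use the first column as the index and parse it as dates.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(restored)
```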

HDF5

Reading and writing to HDFStores.

Writing to an HDF5 store.

In [145]: df.to_hdf('foo.h5', 'df')

Reading from an HDF5 store.

In [146]: pd.read_hdf('foo.h5', 'df')

Out[146]:
A B C D
2000-01-01 -1.106212 0.052439 0.258578 0.167704
2000-01-02 0.096291 -0.168443 0.919563 3.146129
2000-01-03 0.751733 -1.444076 0.151257 1.914114
2000-01-04 0.320187 -2.203755 0.083312 2.386900
2000-01-05 -0.771781 -2.477903 -0.590597 0.984160
2000-01-06 -3.661357 -1.987684 -2.283570 -0.069557
2000-01-07 -6.574314 -2.202263 -4.097481 1.613022
2000-01-08 -7.034926 -3.331085 -2.322236 1.296695
2000-01-09 -7.054467 -2.845888 -3.695062 1.710311
2000-01-10 -7.264232 -2.808375 -2.293146 1.705831
2000-01-11 -8.255336 -3.507692 -1.732582 1.786498
2000-01-12 -9.013493 -1.793710 -2.723867 0.926982
2000-01-13 -8.013036 -2.114864 -2.837715 1.582205
2000-01-14 -7.536122 0.062405 -4.281060 1.689125
2000-01-15 -7.730554 -0.829700 -4.331371 0.531031
2000-01-16 -8.498195 -1.308348 -2.787777 -0.259439
2000-01-17 -7.424926 -3.153015 -1.993948 -0.253209
2000-01-18 -7.218294 -2.179350 -1.021484 -2.308590
2000-01-19 -7.090618 -1.482352 -0.681502 -1.576042
2000-01-20 -8.145643 -0.722648 -0.165623 -1.533321
2000-01-21 -7.780785 -1.858508 1.638698 -2.510842
2000-01-22 -6.967511 -1.914543 0.949562 -2.571928
2000-01-23 -6.531915 -2.273032 2.319920 -2.849229
2000-01-24 -6.377541 -2.330573 2.434004 -2.750015
2000-01-25 -5.808551 -1.426266 2.786692 -2.278309
2000-01-26 -9.053528 -1.579310 3.541408 -2.287926
2000-01-27 -10.288457 -3.429093 2.354397 -0.456415
2000-01-28 -10.707880 -3.436909 3.105953 1.112988
2000-01-29 -12.341329 -4.646381 0.908548 0.357143
2000-01-30 -12.749743 -4.185839 -0.464568 -0.365092
... ... ... ... ...
2002-08-28 -49.300819 -10.625787 -19.795064 39.837107
2002-08-29 -50.392348 -10.418061 -19.378916 39.793901
2002-08-30 -50.732231 -10.648245 -19.245351 38.327009
2002-08-31 -51.497373 -11.348523 -18.228027 38.063597
2002-09-01 -50.660243 -11.462782 -18.183167 38.562628
2002-09-02 -51.762582 -12.035179 -19.585008 38.995538
2002-09-03 -51.484775 -13.387867 -20.044674 37.573936
2002-09-04 -49.577076 -14.414938 -18.518241 36.573115
2002-09-05 -49.128468 -14.643546 -18.708406 36.314032
2002-09-06 -48.015028 -15.202744 -19.568041 36.091346
2002-09-07 -47.940819 -14.991537 -17.992336 35.628932
2002-09-08 -46.866179 -15.406872 -19.890962 36.394331
2002-09-09 -46.373293 -16.111291 -19.094009 33.575639
2002-09-10 -43.817322 -15.010925 -19.156849 34.569563
2002-09-11 -42.428442 -14.225363 -16.896909 34.043532
2002-09-12 -42.866898 -14.312137 -18.545225 33.674460
2002-09-13 -43.778587 -14.385213 -17.998220 34.099622
2002-09-14 -42.350177 -15.635587 -18.182924 34.375302
2002-09-15 -44.207904 -12.995189 -16.650983 37.059518
2002-09-16 -44.730886 -14.522054 -16.289390 37.288436
2002-09-17 -44.281695 -14.120764 -18.318399 38.276995
2002-09-18 -43.424951 -13.241882 -18.959816 38.934585
2002-09-19 -42.235279 -14.842305 -18.216721 40.166678
2002-09-20 -43.290689 -14.197939 -16.755549 41.301336
2002-09-21 -42.812843 -15.131699 -19.759868 40.693523
2002-09-22 -42.400940 -15.513853 -19.523622 40.976186
2002-09-23 -42.913685 -16.636313 -20.827121 43.044706
2002-09-24 -44.918366 -13.358913 -20.794718 42.358436
2002-09-25 -43.780017 -13.686018 -21.022278 42.238321
2002-09-26 -43.725842 -13.353634 -21.775078 42.442283

[1000 rows x 4 columns]

Excel

Reading and writing to MS Excel.

Writing to an excel file.

In [147]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from an excel file.

In [148]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Out[148]:
Unnamed: 0 A B C D
0 2000-01-01 -1.106212 0.052439 0.258578 0.167704
1 2000-01-02 0.096291 -0.168443 0.919563 3.146129
2 2000-01-03 0.751733 -1.444076 0.151257 1.914114
3 2000-01-04 0.320187 -2.203755 0.083312 2.386900
4 2000-01-05 -0.771781 -2.477903 -0.590597 0.984160
5 2000-01-06 -3.661357 -1.987684 -2.283570 -0.069557
6 2000-01-07 -6.574314 -2.202263 -4.097481 1.613022
7 2000-01-08 -7.034926 -3.331085 -2.322236 1.296695
8 2000-01-09 -7.054467 -2.845888 -3.695062 1.710311
9 2000-01-10 -7.264232 -2.808375 -2.293146 1.705831
10 2000-01-11 -8.255336 -3.507692 -1.732582 1.786498
11 2000-01-12 -9.013493 -1.793710 -2.723867 0.926982
12 2000-01-13 -8.013036 -2.114864 -2.837715 1.582205
13 2000-01-14 -7.536122 0.062405 -4.281060 1.689125
14 2000-01-15 -7.730554 -0.829700 -4.331371 0.531031
15 2000-01-16 -8.498195 -1.308348 -2.787777 -0.259439
16 2000-01-17 -7.424926 -3.153015 -1.993948 -0.253209
17 2000-01-18 -7.218294 -2.179350 -1.021484 -2.308590
18 2000-01-19 -7.090618 -1.482352 -0.681502 -1.576042
19 2000-01-20 -8.145643 -0.722648 -0.165623 -1.533321
20 2000-01-21 -7.780785 -1.858508 1.638698 -2.510842
21 2000-01-22 -6.967511 -1.914543 0.949562 -2.571928
22 2000-01-23 -6.531915 -2.273032 2.319920 -2.849229
23 2000-01-24 -6.377541 -2.330573 2.434004 -2.750015
24 2000-01-25 -5.808551 -1.426266 2.786692 -2.278309
25 2000-01-26 -9.053528 -1.579310 3.541408 -2.287926
26 2000-01-27 -10.288457 -3.429093 2.354397 -0.456415
27 2000-01-28 -10.707880 -3.436909 3.105953 1.112988
28 2000-01-29 -12.341329 -4.646381 0.908548 0.357143
29 2000-01-30 -12.749743 -4.185839 -0.464568 -0.365092
.. ... ... ... ... ...
970 2002-08-28 -49.300819 -10.625787 -19.795064 39.837107
971 2002-08-29 -50.392348 -10.418061 -19.378916 39.793901
972 2002-08-30 -50.732231 -10.648245 -19.245351 38.327009
973 2002-08-31 -51.497373 -11.348523 -18.228027 38.063597
974 2002-09-01 -50.660243 -11.462782 -18.183167 38.562628
975 2002-09-02 -51.762582 -12.035179 -19.585008 38.995538
976 2002-09-03 -51.484775 -13.387867 -20.044674 37.573936
977 2002-09-04 -49.577076 -14.414938 -18.518241 36.573115
978 2002-09-05 -49.128468 -14.643546 -18.708406 36.314032
979 2002-09-06 -48.015028 -15.202744 -19.568041 36.091346
980 2002-09-07 -47.940819 -14.991537 -17.992336 35.628932
981 2002-09-08 -46.866179 -15.406872 -19.890962 36.394331
982 2002-09-09 -46.373293 -16.111291 -19.094009 33.575639
983 2002-09-10 -43.817322 -15.010925 -19.156849 34.569563
984 2002-09-11 -42.428442 -14.225363 -16.896909 34.043532
985 2002-09-12 -42.866898 -14.312137 -18.545225 33.674460
986 2002-09-13 -43.778587 -14.385213 -17.998220 34.099622
987 2002-09-14 -42.350177 -15.635587 -18.182924 34.375302
988 2002-09-15 -44.207904 -12.995189 -16.650983 37.059518
989 2002-09-16 -44.730886 -14.522054 -16.289390 37.288436
990 2002-09-17 -44.281695 -14.120764 -18.318399 38.276995
991 2002-09-18 -43.424951 -13.241882 -18.959816 38.934585
992 2002-09-19 -42.235279 -14.842305 -18.216721 40.166678
993 2002-09-20 -43.290689 -14.197939 -16.755549 41.301336
994 2002-09-21 -42.812843 -15.131699 -19.759868 40.693523
995 2002-09-22 -42.400940 -15.513853 -19.523622 40.976186
996 2002-09-23 -42.913685 -16.636313 -20.827121 43.044706
997 2002-09-24 -44.918366 -13.358913 -20.794718 42.358436
998 2002-09-25 -43.780017 -13.686018 -21.022278 42.238321
999 2002-09-26 -43.725842 -13.353634 -21.775078 42.442283

[1000 rows x 5 columns]

3.2.13 Gotchas

If you use a Series or DataFrame in a boolean context, such as an if statement, you will see an exception like:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See Comparisons for an explanation and what to do.

See Gotchas as well.
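
As a minimal sketch of the usual way around this error, the snippet below catches the exception and then states the intended test explicitly. The variable names are illustrative only.

```python
import pandas as pd

flags = pd.Series([False, True, False])

# Using the Series directly in a boolean context is ambiguous and raises.
try:
    if flags:
        pass
    raised = False
except ValueError:
    raised = True

# State the intent explicitly instead.
any_true = bool(flags.any())   # is at least one element True?
all_true = bool(flags.all())   # are all elements True?
is_empty = flags.empty         # does the Series have no elements at all?

print(raised, any_true, all_true, is_empty)
```

Which reduction is appropriate depends on what the condition is meant to express, so pandas leaves the choice to you rather than guessing.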

3.3 Essential Basic Functionality

Here we discuss a lot of the essential functionality common to the pandas data structures. Here’s how to create some
of the objects used in the examples from the previous section:

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])
   ...:

In [4]: wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
   ...:               major_axis=pd.date_range('1/1/2000', periods=5),
   ...:               minor_axis=['A', 'B', 'C', 'D'])
   ...:

3.3.1 Head and Tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number
of elements to display is five, but you may pass a custom number.

In [5]: long_series = pd.Series(np.random.randn(1000))

In [6]: long_series.head()
Out[6]:
0 -2.211372
1 0.974466
2 -2.006747
3 -0.410001
4 -0.078638
dtype: float64

In [7]: long_series.tail(3)
Out[7]:
997 -0.196166
998 0.380733
999 -0.275874
dtype: float64

3.3.2 Attributes and Underlying Data

pandas objects have a number of attributes that give you access to their metadata:
• shape: gives the axis dimensions of the object, consistent with ndarray
• Axis labels
– Series: index (only axis)
– DataFrame: index (rows) and columns
– Panel: items, major_axis, and minor_axis

Note, these attributes can be safely assigned to!

In [8]: df[:2]
Out[8]:
A B C
2000-01-01 -0.173215 0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929

In [9]: df.columns = [x.lower() for x in df.columns]

In [10]: df
Out[10]:
a b c
2000-01-01 -0.173215 0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03 1.071804 0.721555 -0.706771
2000-01-04 -1.039575 0.271860 -0.424972
2000-01-05 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427
2000-01-07 0.524988 0.404705 0.577046
2000-01-08 -1.715002 -1.039268 -0.370647

Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual
data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However, pandas
and 3rd party libraries may extend NumPy’s type system to add support for custom arrays (see dtypes).
To get the actual data inside an Index or Series, use the .array property:

In [11]: s.array
Out[11]:
<PandasArray>
[ 0.46911229990718628, -0.28286334432866328, -1.5090585031735124,
-1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64

In [12]: s.index.array
Out[12]:
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas
uses them are a bit beyond the scope of this introduction. See dtypes for more.
If you know you need a NumPy array, use to_numpy() or numpy.asarray().

In [13]: s.to_numpy()
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

In [14]: np.asarray(s)
Out[14]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing
values. See dtypes for more.
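
The difference between the two accessors can be sketched as follows, using a small Series with made-up values. The check against pd.api.extensions.ExtensionArray is just one way to confirm the type of .array.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, -1.25, 2.0], index=["a", "b", "c"])

backing = s.array            # an ExtensionArray wrapping the stored values
as_numpy = s.to_numpy()      # always a numpy.ndarray
via_asarray = np.asarray(s)  # equivalent conversion through NumPy

is_extension = isinstance(backing, pd.api.extensions.ExtensionArray)
is_ndarray = isinstance(as_numpy, np.ndarray)

print(type(backing).__name__, type(as_numpy).__name__)
```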
to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider
datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two
possibly useful representations:
1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
2. A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone
discarded
Timezones may be preserved with dtype=object

In [15]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [16]: ser.to_numpy(dtype=object)
Out[16]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')], dtype=object)

Or thrown away with dtype='datetime64[ns]'

In [17]: ser.to_numpy(dtype="datetime64[ns]")
Out[17]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
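
A short sketch of the two representations side by side, assuming the "CET" timezone is available in the local timezone database. It checks that the object array keeps the UTC offset while the datetime64[ns] array holds the UTC instants.

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2000-01-01", periods=2, tz="CET"))

with_tz = ser.to_numpy(dtype=object)           # Timestamp objects, tz kept
as_utc = ser.to_numpy(dtype="datetime64[ns]")  # converted to UTC, tz dropped

# Midnight CET in winter is one hour ahead of UTC.
first_offset_seconds = with_tz[0].utcoffset().total_seconds()

print(with_tz[0], as_utc[0], first_offset_seconds)
```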

Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has a
single data type for all the columns, DataFrame.to_numpy() will return the underlying data:

In [18]: df.to_numpy()
Out[18]:
array([[-0.1732, 0.1192, -1.0442],
[-0.8618, -2.1046, -0.4949],
[ 1.0718, 0.7216, -0.7068],
[-1.0396, 0.2719, -0.425 ],
[ 0.567 , 0.2762, -1.0874],
[-0.6737, 0.1136, -1.4784],
[ 0.525 , 0.4047, 0.577 ],
[-1.715 , -1.0393, -0.3706]])

If a DataFrame or Panel contains homogeneously-typed data, the ndarray can actually be modified in-place, and the
changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not
all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all
of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and
integers, the resulting array will be of float dtype.
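
The dtype rule in the note can be checked with two tiny frames. The column names and values here are invented for illustration.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"ints": [1, 2], "floats": [1.5, 2.5]})
mixed = pd.DataFrame({"ints": [1, 2], "labels": ["x", "y"]})

# Integers and floats are upcast to a common float dtype.
numeric_dtype = numeric.to_numpy().dtype

# Once strings are involved, only object dtype can hold every value.
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```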

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series
or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend
avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:
1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array
or the extension array. Series.array will always return an ExtensionArray, and will never copy data.
Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and
coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a
method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.

3.3.3 Accelerated operations

pandas has support for accelerating certain types of binary numerical and boolean operations using the numexpr
and bottleneck libraries.
These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr uses
smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that are especially
fast when dealing with arrays that have nans.
Here is a sample (using 100 column x 100,000 row DataFrames):

Operation    0.11.0 (ms)    Prior Version (ms)    Ratio to Prior
df1 > df2    13.32          125.35                0.1063
df1 * df2    21.71          36.63                 0.5928
df1 + df2    22.04          36.50                 0.6039

You are highly encouraged to install both libraries. See the section Recommended Dependencies for more installation
info.
Both are enabled by default; you can control this by setting the options:
New in version 0.20.0.
pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
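
If the change should apply only to a limited stretch of code, one possible approach is pd.option_context, which restores the previous value on exit. This is a sketch that assumes the option names shown above are available in the installed pandas version.

```python
import pandas as pd

before = pd.get_option("compute.use_numexpr")

# Temporarily switch numexpr off; the previous setting is restored on exit.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

after = pd.get_option("compute.use_numexpr")

print(before, inside, after)
```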

3.3.4 Flexible binary operations

With binary operations between pandas data structures, there are two key points of interest:
• Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
• Missing data in computations.
We will demonstrate how to manage these issues independently, though they can be handled simultaneously.

Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), . . . for
carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions,
you can match on either the index or the columns via the axis keyword:
In [19]: df = pd.DataFrame({
....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
....:

In [20]: df
Out[20]:
one two three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [21]: row = df.iloc[1]

In [22]: column = df['two']

In [23]: df.sub(row, axis='columns')

Out[23]:
one two three
a 1.757280 -2.688953 NaN
b 0.000000 0.000000 0.000000
c 1.153738 -0.121396 -0.402113
d NaN 0.507782 -2.065853

In [24]: df.sub(row, axis=1)
Out[24]:
one two three
a 1.757280 -2.688953 NaN
b 0.000000 0.000000 0.000000
c 1.153738 -0.121396 -0.402113
d NaN 0.507782 -2.065853

In [25]: df.sub(column, axis='index')
Out[25]:
one two three
a 3.043851 0.0 NaN
b -1.402381 0.0 -0.650888
c -0.127247 0.0 -0.931605
d NaN 0.0 -3.224524

In [26]: df.sub(column, axis=0)
Out[26]:
one two three
a 3.043851 0.0 NaN
b -1.402381 0.0 -0.650888
c -0.127247 0.0 -0.931605
d NaN 0.0 -3.224524
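
Because the frames above hold random numbers, the matching rule is easier to verify with fixed values. The following sketch uses a small hand-written DataFrame (all values are illustrative) to show which labels the Series is matched against for each axis.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
                  index=["a", "b", "c"])

row = df.iloc[1]      # Series labelled by column names
column = df["two"]    # Series labelled by row labels

# axis='columns' matches the Series labels against the column names,
# so the same row is subtracted from every row.
by_columns = df.sub(row, axis="columns")

# axis='index' matches the Series labels against the row labels,
# so the same column is subtracted from every column.
by_index = df.sub(column, axis="index")

print(by_columns)
print(by_index)
```

With axis='columns' the result is the same as the plain operator form df - row.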

Furthermore you can align a level of a MultiIndexed DataFrame with a Series.

In [27]: dfmi = df.copy()

In [28]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
   ....:                                         (1, 'c'), (2, 'a')],
   ....:                                        names=['first', 'second'])
   ....:

In [29]: dfmi.sub(column, axis=0, level='second')

Out[29]:
one two three
first second
1 a 3.043851 0.000000 NaN
b -1.402381 0.000000 -0.650888
c -0.127247 0.000000 -0.931605
2 a NaN 3.196734 -0.027789
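
A smaller sketch of the same idea with fixed numbers: the Series is labelled only by values of the 'second' level, and each row of the frame picks up the entry matching its own 'second' label. The names dfmi_small and offsets are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(1, "a"), (1, "b"), (2, "a")],
                                  names=["first", "second"])
dfmi_small = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=index)

# One offset per label of the 'second' level.
offsets = pd.Series({"a": 1.0, "b": 2.0})

shifted = dfmi_small.sub(offsets, axis=0, level="second")

print(shifted)
```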

With Panel, describing the matching behavior is a bit more difficult, so the arithmetic methods instead (and perhaps
confusingly?) give you the option to specify the broadcast axis. For example, suppose we wished to demean the data
over a particular axis. This can be accomplished by taking the mean over an axis and broadcasting over the same axis:

In [30]: major_mean = wp.mean(axis='major')

In [31]: major_mean
Out[31]:
Item1 Item2
A -0.378069 0.675929
B -0.241429 -0.018080
C -0.597702 0.129006
D 0.204005 0.245570

In [32]: wp.sub(major_mean, axis='major')
Out[32]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

And similarly for axis="items" and axis="minor".

Note: I could be convinced to make the axis argument in the DataFrame methods match the broadcasting behavior of
Panel. Though it would require a transition period so users can change their code. . .

Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation
at the same time, returning a two-tuple of objects of the same type as the left-hand side. For example:

In [33]: s = pd.Series(np.arange(10))

In [34]: s
Out[34]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64

In [35]: div, rem = divmod(s, 3)

In [36]: div
Out[36]:
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
dtype: int64

In [37]: rem
Out[37]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
9 0
dtype: int64

In [38]: idx = pd.Index(np.arange(10))

In [39]: idx
Out[39]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [40]: div, rem = divmod(idx, 3)

In [41]: div
Out[41]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [42]: rem
Out[42]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

We can also do elementwise divmod():

In [43]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [44]: div
Out[44]:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64

In [45]: rem
Out[45]:
0 0
1 1
2 2
3 0
4 0
5 1
6 1
7 2
8 2
9 3
dtype: int64
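
As a quick sanity check on what divmod() returns, the sketch below rebuilds the original values from the quotient and remainder. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

quotient, remainder = divmod(s, 3)

# For every element: value == quotient * divisor + remainder.
rebuilt = quotient * 3 + remainder

print(quotient.tolist())
print(remainder.tolist())
```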

Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions accept a fill_value argument, namely a value to substitute
when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may
wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can
later replace NaN with some other value using fillna if you wish).

In [46]: df
Out[46]:
one two three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [47]: df2
Out[47]:
one two three
a 1.400810 -1.643041 1.000000
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [48]: df + df2
Out[48]:
one two three
a 2.801620 -3.286083 NaN
b -0.712940 2.091822 0.790046
c 1.594536 1.849030 -0.014180
d NaN 3.107386 -3.341661

In [49]: df.add(df2, fill_value=0)
Out[49]:
one two three
a 2.801620 -3.286083 1.000000
b -0.712940 2.091822 0.790046
c 1.594536 1.849030 -0.014180
d NaN 3.107386 -3.341661
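
The three possible cases (both present, one missing, both missing) can be seen in a three-row sketch with made-up values:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

plain = left + right                     # NaN wherever either side is missing
filled = left.add(right, fill_value=0)   # NaN only where both sides are missing

print(plain["x"].tolist())
print(filled["x"].tolist())
```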

Flexible Comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is analogous
to the binary arithmetic operations described above:

In [50]: df.gt(df2)
Out[50]:
one two three
a False False False
b False False False
c False False False
d False False False

In [51]: df2.ne(df)
Out[51]:
one two three
a False False True
b False False False
c False False False
d True False False

These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool. These
boolean objects can be used in indexing operations, see the section on Boolean indexing.

Boolean Reductions

You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result.

In [52]: (df > 0).all()

Out[52]:
one False
two False
three False
dtype: bool

In [53]: (df > 0).any()
Out[53]:
one True
two True
three True
dtype: bool

You can reduce to a final boolean value.

In [54]: (df > 0).any().any()

Out[54]: True
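
For a deterministic illustration of these reductions, here is a sketch on a small frame with fixed values: reduce column by column first, then reduce the resulting Series to a single boolean.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, -2, 3], "two": [4, 5, 6]})
positive = df > 0

all_by_column = positive.all()   # one: False, two: True
any_by_column = positive.any()   # one: True,  two: True

# Chain a second reduction to collapse the per-column result.
any_overall = bool(positive.any().any())
all_overall = bool(positive.all().all())

print(all_by_column.to_dict(), any_by_column.to_dict())
print(any_overall, all_overall)
```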

You can test if a pandas object is empty, via the empty property.

In [55]: df.empty
Out[55]: False

In [56]: pd.DataFrame(columns=list('ABC')).empty
Out[56]: True

To evaluate single-element pandas objects in a boolean context, use the method bool():

In [57]: pd.Series([True]).bool()
Out[57]: True

In [58]: pd.Series([False]).bool()
Out[58]: False

In [59]: pd.DataFrame([[True]]).bool()
Out[59]: True

In [60]: pd.DataFrame([[False]]).bool()
Out[60]: False

Warning: You might be tempted to do the following:

>>> if df:
... pass

Or
>>> df and df2

These will both raise errors, as you are trying to compare multiple values:
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See gotchas for a more detailed discussion.

Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example, consider df
+ df and df * 2. To test that these two computations produce the same result, given the tools shown above, you
might imagine using (df + df == df * 2).all(). But in fact, this expression is False:

In [61]: df + df == df * 2
Out[61]:
one two three
a True True False
b True True True
c True True True
d False True True

In [62]: (df + df == df * 2).all()
Out[62]:
one False
two True
three False
dtype: bool

Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs do
not compare as equals:

In [63]: np.nan == np.nan

Out[63]: False

So, NDFrames (such as Series, DataFrames, and Panels) have an equals() method for testing equality, with NaNs
in corresponding locations treated as equal.

In [64]: (df + df).equals(df * 2)

Out[64]: True
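
The contrast between elementwise comparison and equals() can be reproduced with a two-row frame containing a single NaN. The values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan], "two": [3.0, 4.0]})

# Elementwise ==: the NaN position compares unequal, so all() fails.
elementwise_all = bool((df + df == df * 2).all().all())

# equals(): NaNs in corresponding locations count as equal.
nan_aware = (df + df).equals(df * 2)

print(elementwise_all, nan_aware)
```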

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

In [65]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})

In [66]: df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [67]: df1.equals(df2)
Out[67]: False

In [68]: df1.equals(df2.sort_index())
Out[68]: True

Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [69]: pd.Series(['foo', 'bar', 'baz']) == 'foo'

Out[69]:
0 True
1 False
2 False
dtype: bool

In [70]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[70]: array([ True, False, False], dtype=bool)

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [71]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

Out[71]:
0 True
1 True
2 False
dtype: bool

In [72]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[72]:
0 True
1 True
2 False
dtype: bool

Trying to compare Index or Series objects of different lengths will raise a ValueError:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])

ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])

ValueError: Series lengths must match to compare
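
In code that cannot guarantee matching lengths, the error can be handled explicitly. A minimal sketch follows; the exact error message may differ between pandas versions, so only the exception type is relied on here.

```python
import pandas as pd

left = pd.Series(["foo", "bar", "baz"])

# Same length: compared element by element.
same_length = left == pd.Index(["foo", "bar", "qux"])

# Different length: pandas refuses to guess how to line the values up.
try:
    left == pd.Series(["foo", "bar"])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(same_length.tolist(), mismatch_raised)
```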

Note that this is different from the NumPy behavior where a comparison can be broadcast:

In [73]: np.array([1, 2, 3]) == np.array([2])

Out[73]: array([False, True, False], dtype=bool)

or it can return False if broadcasting cannot be done:

In [74]: np.array([1, 2, 3]) == np.array([1, 2])

Out[74]: False

Combining overlapping data sets

A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the
other. An example would be two data series representing a particular economic indicator where one is considered to
be of “higher quality”. However, the lower quality series might extend further back in history or have more complete
data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame
are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation
is combine_first(), which we illustrate:

In [75]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
   ....:                     'B': [np.nan, 2., 3., np.nan, 6.]})
   ....:

In [76]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
   ....:

In [77]: df1
Out[77]:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0

In [78]: df2
Out[78]:
A B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0

In [79]: df1.combine_first(df2)
Out[79]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0

General DataFrame Combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes
another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs
of Series (i.e., columns whose names are the same).
So, for instance, to reproduce combine_first() as above:
In [80]: def combiner(x, y):
   ....:     return np.where(pd.isna(x), y, x)
   ....:
   ....: df1.combine(df2, combiner)
   ....:
Out[80]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0

3.3.5 Descriptive statistics

There exists a large number of methods for computing descriptive statistics and other related operations on Series,
DataFrame, and Panel. Most of these are aggregations (hence producing a lower-dimensional result) like sum(),
mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same
size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, . . . }, but the axis can be
specified by name or integer:
• Series: no axis argument needed
• DataFrame: “index” (axis=0, default), “columns” (axis=1)
• Panel: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)
For example:
In [81]: df
Out[81]:
one two three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [82]: df.mean(0)
Out[82]:
one 0.613869
two 0.470270
three -0.427633
dtype: float64

In [83]: df.mean(1)
Out[83]:
a -0.121116
b 0.361488
c 0.571564
d -0.058569
dtype: float64

All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [84]: df.sum(0, skipna=False)

Out[84]:
one NaN
two 1.881078
three NaN
dtype: float64

In [85]: df.sum(axis=1, skipna=True)
Out[85]:
a -0.242232
b 1.084464
c 1.714693
d -0.117137
dtype: float64

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like
standardization (rendering data zero mean and standard deviation 1), very concisely:

In [86]: ts_stand = (df - df.mean()) / df.std()

In [87]: ts_stand.std()
Out[87]:
one 1.0
two 1.0
three 1.0
dtype: float64

In [88]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [89]: xs_stand.std(1)
Out[89]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64
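
The same two standardizations, sketched on a small frame with fixed values so the result can be checked numerically. np.allclose is used because the results are only equal to 0 and 1 up to floating-point rounding.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0],
                   "two": [10.0, 30.0, 20.0, 60.0]})

# Column-wise: the column means and stds align on the column labels.
by_column = (df - df.mean()) / df.std()

# Row-wise: the row means and stds must be matched on the index (axis=0).
by_row = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)

print(by_column.std().tolist())
print(by_row.std(axis=1).tolist())
```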

Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different
from expanding() and rolling(). For more details please see this note.

In [90]: df.cumsum()
Out[90]:
one two three
a 1.400810 -1.643041 NaN
b 1.044340 -0.597130 0.395023
c 1.841608 0.327385 0.387933
d NaN 1.881078 -1.282898
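
To make the difference from expanding() concrete, here is a sketch on a four-element Series with one missing value. cumsum() leaves the NaN in place, while expanding().sum() reports the running total of the values seen so far.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 3.0])

running = s.cumsum()              # NaN stays at position 1
expanded = s.expanding().sum()    # position 1 carries the total so far

print(running.tolist())
print(expanded.tolist())
```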

Here is a quick reference summary table of common functions. Each also takes an optional level parameter which
applies only if the object has a hierarchical index.

Function    Description
count       Number of non-NA observations
sum         Sum of values
mean        Mean of values
mad         Mean absolute deviation
median      Arithmetic median of values
min         Minimum
max         Maximum
mode        Mode
abs         Absolute Value
prod        Product of values
std         Bessel-corrected sample standard deviation
var         Unbiased variance
sem         Standard error of the mean
skew        Sample skewness (3rd moment)
kurt        Sample kurtosis (4th moment)
quantile    Sample quantile (value at %)
cumsum      Cumulative sum
cumprod     Cumulative product
cummax      Cumulative maximum
cummin      Cumulative minimum

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:

In [91]: np.mean(df['one'])
Out[91]: 0.6138692844180106

In [92]: np.mean(df['one'].to_numpy())
Out[92]: nan
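
The same contrast with fixed values, as a small sketch: calling np.mean on the Series skips the missing value, while calling it on the raw ndarray propagates it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

skips_nan = np.mean(s)                   # handled by the Series, NaN excluded
propagates_nan = np.mean(s.to_numpy())   # plain ndarray, NaN propagates

print(skips_nan, propagates_nan)
```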

Series.nunique() will return the number of unique non-NA values in a Series:

In [93]: series = pd.Series(np.random.randn(500))

In [94]: series[20:500] = np.nan

In [95]: series[10:20] = 5

In [96]: series.nunique()
Out[96]: 11
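
A deterministic sketch of the same count follows. The dropna=False variant is shown as one way to include the missing value in the count.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, 5.0, 1.5, np.nan, np.nan])

distinct = s.nunique()                       # NaN is not counted
distinct_with_nan = s.nunique(dropna=False)  # NaN counted once

print(distinct, distinct_with_nan)
```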

Summarizing data: describe

There is a convenient describe() function which computes a variety of summary statistics about a Series or the
columns of a DataFrame (excluding NAs of course):

In [97]: series = pd.Series(np.random.randn(1000))

In [98]: series[::2] = np.nan

In [99]: series.describe()
Out[99]:
count 500.000000
mean -0.020695
std 1.011840
min -2.683763
25% -0.709297
50% -0.070211
75% 0.712856
max 3.160915
dtype: float64

In [100]: frame = pd.DataFrame(np.random.randn(1000, 5),
   .....:                      columns=['a', 'b', 'c', 'd', 'e'])
   .....:

In [101]: frame.iloc[::2] = np.nan

In [102]: frame.describe()
Out[102]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.026515 0.022952 -0.047307 -0.052551 0.011210
std 1.016752 0.980046 1.020837 1.008271 1.006726
min -3.000951 -2.637901 -3.303099 -3.159200 -3.188821
25% -0.647623 -0.593587 -0.709906 -0.691338 -0.689176
50% 0.047578 -0.026675 -0.029655 -0.032769 -0.015775
75% 0.723946 0.771931 0.603753 0.667044 0.652221
max 2.740139 2.752332 3.004229 2.728702 3.240991

You can select specific percentiles to include in the output:

In [103]: series.describe(percentiles=[.05, .25, .75, .95])

Out[103]:
count 500.000000
mean -0.020695
std 1.011840
min -2.683763
5% -1.641337
25% -0.709297
50% -0.070211
75% 0.712856
95% 1.699176
max 3.160915
dtype: float64

By default, the median is always included.
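
With non-random input, the requested percentiles can be checked by hand. This sketch uses the values 1 to 100 and assumes the default linear interpolation, which puts the 5% quantile at 1 + 0.05 * 99 = 5.95.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 101, dtype="float64"))

summary = s.describe(percentiles=[0.05, 0.95])

# Requested percentiles appear as labelled rows in the summary.
print(summary)
```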

For a non-numerical Series object, describe() will give a simple summary of the number of unique values and
most frequently occurring values:

In [104]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [105]: s.describe()
Out[105]:
count 9
unique 4
top a
freq 5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical
columns or, if there are none, only categorical columns:

In [106]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [107]: frame.describe()
Out[107]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000

This behavior can be controlled by providing a list of types as include/exclude arguments. The special value
all can also be used:

In [108]: frame.describe(include=['object'])
Out[108]:
a
count 4
unique 2
top Yes
freq 2

In [109]: frame.describe(include=['number'])
Out[109]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000

In [110]: frame.describe(include='all')
Out[110]:
a b
count 4 4.000000
unique 2 NaN
top Yes NaN
freq 2 NaN
mean NaN 1.500000
std NaN 1.290994
min NaN 0.000000
25% NaN 0.750000
50% NaN 1.500000
75% NaN 2.250000
max NaN 3.000000

This feature relies on select_dtypes. Refer to its documentation for details about accepted inputs.
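As a rough illustration of that relationship, the sketch below checks which columns select_dtypes picks for a given include or exclude list and compares them with what describe() summarizes. The variable names are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

# Columns that an include=['number'] selection keeps
numeric_cols = list(frame.select_dtypes(include=['number']).columns)

# Columns that remain when numeric columns are excluded
other_cols = list(frame.select_dtypes(exclude=['number']).columns)

# describe() with the same include list summarizes the same columns
described = frame.describe(include=['number'])
print(numeric_cols, other_cols, list(described.columns))
```

The same type specifications (for example 'number' or 'all') should therefore behave consistently between the two methods.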

Index of Min/Max Values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and
maximum corresponding values:

In [111]: s1 = pd.Series(np.random.randn(5))

In [112]: s1
Out[112]:
0 -0.068822
1 -1.129788
2 -0.269798
3 -0.375580
4 0.513381
dtype: float64

In [113]: s1.idxmin(), s1.idxmax()

Out[113]: (1, 4)

In [114]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [115]: df1
Out[115]:
A B C
0 0.333329 -0.910090 -1.321220
1 2.111424 1.701169 0.858336
2 -0.608055 -2.082155 -0.069618
3 1.412817 -0.562658 0.770042
4 0.373294 -0.965381 -1.607840

In [116]: df1.idxmin(axis=0)
Out[116]:
A 2
B 2
C 4
dtype: int64

In [117]: df1.idxmax(axis=1)
Out[117]:
0 A
1 A
2 C
3 A
4 A
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax()
return the first matching index:

In [118]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [119]: df3
Out[119]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN

In [120]: df3['A'].idxmin()
Out[120]: 'd'

Note: idxmin and idxmax are called argmin and argmax in NumPy.
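The difference between the two conventions can be sketched as follows: idxmin() returns an index label, while NumPy's argmin returns an integer position. The small Series here is made up for illustration and deliberately contains no NaN values.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 1.0, 1.0, 3.0], index=list('edcb'))

label = s.idxmin()                        # index label of the first minimum
position = int(np.argmin(s.to_numpy()))   # integer position of the first minimum

# The label found by idxmin() sits at the position found by argmin
print(label, position, s.index[position])
```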

Value counts (histogramming) / Mode

The value_counts() Series method and top-level function compute a histogram of a 1D array of values. The top-level
function can also be applied to regular arrays:

In [121]: data = np.random.randint(0, 7, size=50)

In [122]: data
Out[122]:
array([6, 4, 1, 3, 4, 4, 4, 6, 5, 2, 6, 1, 0, 4, 3, 2, 5, 3, 4, 0, 5, 3, 0,
1, 5, 0, 1, 5, 3, 4, 1, 2, 3, 2, 4, 6, 1, 4, 3, 5, 2, 1, 2, 4, 1, 6,
3, 6, 3, 3])

In [123]: s = pd.Series(data)

In [124]: s.value_counts()
Out[124]:
4 10
3 10
1 8
6 6
5 6
2 6
0 4
dtype: int64

In [125]: pd.value_counts(data)
Out[125]:
4 10
3 10
1 8
6 6
5 6
2 6
0 4
dtype: int64

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:
In [126]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [127]: s5.mode()
Out[127]:
0 3
1 7
dtype: int64

In [128]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
   .....:                     "B": np.random.randint(-10, 15, size=50)})
   .....:

In [129]: df5.mode()
Out[129]:
A B
0 0 -9

Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample
quantiles) functions:
In [130]: arr = np.random.randn(20)

In [131]: factor = pd.cut(arr, 4)

In [132]: factor
Out[132]:
[(1.27, 2.31], (0.231, 1.27], (-0.809, 0.231], (-1.853, -0.809], (1.27, 2.31], ..., (0.231, 1.27], (-0.809, 0.231], (-1.853, -0.809], (1.27, 2.31], (0.231, 1.27]]
Length: 20
Categories (4, interval[float64]): [(-1.853, -0.809] < (-0.809, 0.231] < (0.231, 1.27] < (1.27, 2.31]]

In [133]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [134]: factor
Out[134]:
[(1, 5], (0, 1], (-1, 0], (-5, -1], (1, 5], ..., (1, 5], (-1, 0], (-5, -1], (1, 5], (0, 1]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size
quartiles like so:

In [135]: arr = np.random.randn(30)

In [136]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [137]: factor
Out[137]:
[(-2.219, -0.669], (-0.669, 0.00453], (0.367, 2.369], (0.00453, 0.367], (0.367, 2.369], ..., (0.00453, 0.367], (0.367, 2.369], (0.00453, 0.367], (-0.669, 0.00453], (0.367, 2.369]]
Length: 30
Categories (4, interval[float64]): [(-2.219, -0.669] < (-0.669, 0.00453] < (0.00453, 0.367] < (0.367, 2.369]]

In [138]: pd.value_counts(factor)
Out[138]:
(0.367, 2.369] 8
(-2.219, -0.669] 8
(0.00453, 0.367] 7
(-0.669, 0.00453] 7
dtype: int64

We can also pass infinite values to define the bins:

In [139]: arr = np.random.randn(20)

In [140]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [141]: factor
Out[141]:
[(0.0, inf], (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (0.0, inf], (-inf, 0.0], (-inf, 0.0]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
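To make the contrast between value-based and quantile-based bins concrete, here is a minimal sketch on deterministic, skewed data (the squares of 1 to 100, chosen only for illustration). cut() splits the value range into equal-width bins, so the counts per bin differ; qcut() places the edges at sample quantiles, so each bin holds the same number of observations.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 101) ** 2       # skewed data: 1, 4, 9, ..., 10000

by_value = pd.cut(values, 4)          # four bins of equal width
by_quantile = pd.qcut(values, 4)      # four bins of equal count

# Count the observations per bin, listed in bin order
width_counts = pd.Series(by_value).value_counts().sort_index()
quantile_counts = pd.Series(by_quantile).value_counts().sort_index()

print(width_counts.tolist())
print(quantile_counts.tolist())
```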

3.3.6 Function application

To apply your own or another library’s functions to pandas objects, you should be aware of the methods below.
The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or
Series, row- or column-wise, or elementwise.

1. Tablewise Function Application: pipe()
2. Row or Column-wise Function Application: apply()
3. Aggregation API: agg() and transform()
4. Applying Elementwise Functions: applymap()

Tablewise Function Application

DataFrames and Series can of course just be passed into functions. However, if the function needs to be called
in a chain, consider using the pipe() method. Compare the following:

# f, g, and h are functions taking and returning ``DataFrames``

>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent

>>> (df.pipe(h)
... .pipe(g, arg1=1)
... .pipe(f, arg2=2, arg3=3))

Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or
another library’s functions in method chains, alongside pandas’ methods.
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. What
if the function you wish to apply takes its data as, say, the second argument? In this case, provide pipe with a tuple
of (callable, data_keyword). .pipe will route the DataFrame to the argument specified in the tuple.
For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame as the
second argument, data. We pass in the function, keyword pair (sm.ols, 'data') to pipe:
In [142]: import statsmodels.formula.api as sm

In [143]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [144]: (bb.query('h > 0')
   .....:    .assign(ln_h=lambda df: np.log(df.h))
   .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   .....:    .fit()
   .....:    .summary()
   .....:  )
   .....:
Out[144]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: hr R-squared: 0.685
Model: OLS Adj. R-squared: 0.665
Method: Least Squares F-statistic: 34.28
Date: Sun, 03 Feb 2019 Prob (F-statistic): 3.48e-15
Time: 21:34:07 Log-Likelihood: -205.92
No. Observations: 68 AIC: 421.8
Df Residuals: 63 BIC: 432.9
Df Model: 4
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -8484.7720 4664.146 -1.819 0.074 -1.78e+04 835.780
C(lg)[T.NL] -2.2736 1.325 -1.716 0.091 -4.922 0.375
ln_h -1.3542 0.875 -1.547 0.127 -3.103 0.395
year 4.2277 2.324 1.819 0.074 -0.417 8.872
g 0.1841 0.029 6.258 0.000 0.125 0.243
==============================================================================
Omnibus: 10.875 Durbin-Watson: 1.999
Prob(Omnibus): 0.004 Jarque-Bera (JB): 17.298
Skew: 0.537 Prob(JB): 0.000175
Kurtosis: 5.225 Cond. No. 1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which have introduced the popular
(%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at home in Python.
We encourage you to view the source code of pipe().
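The (callable, data_keyword) form can be tried without any external dataset. In the sketch below, scale_column is a hypothetical helper written for this illustration (it is not part of pandas); it takes the DataFrame as its second argument, named data, so pipe is given the tuple (scale_column, 'data').

```python
import pandas as pd


def scale_column(factor, data, column):
    """Hypothetical helper: the DataFrame is the *second* argument, ``data``."""
    return data.assign(**{column: data[column] * factor})


df = pd.DataFrame({'x': [1, 2, 3]})

# pipe passes ``df`` as the keyword ``data``; the other arguments go through unchanged
piped = df.pipe((scale_column, 'data'), 10, column='x')

# The equivalent direct call, for comparison
direct = scale_column(10, data=df, column='x')

print(piped)
```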

Row or Column-wise Function Application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the
descriptive statistics methods, takes an optional axis argument:

In [145]: df.apply(np.mean)
Out[145]:
one 0.613869
two 0.470270
three -0.427633
dtype: float64

In [146]: df.apply(np.mean, axis=1)

Out[146]:
a -0.121116
b 0.361488
c 0.571564
d -0.058569
dtype: float64

In [147]: df.apply(lambda x: x.max() - x.min())

Out[147]:
one 1.757280
two 3.196734
three 2.065853
dtype: float64

In [148]: df.apply(np.cumsum)
Out[148]:
one two three

a 1.400810 -1.643041 NaN
b 1.044340 -0.597130 0.395023
c 1.841608 0.327385 0.387933
d NaN 1.881078 -1.282898

In [149]: df.apply(np.exp)
Out[149]:
one two three

a 4.058485 0.193391 NaN
b 0.700143 2.845991 1.484418
c 2.219469 2.520646 0.992935
d NaN 4.728902 0.188091

The apply() method will also dispatch on a string method name.

In [150]: df.apply('mean')
Out[150]:
one 0.613869
two 0.470270
three -0.427633
dtype: float64

In [151]: df.apply('mean', axis=1)

Out[151]:
a -0.121116
b 0.361488
c 0.571564
d -0.058569
dtype: float64

The return type of the function passed to apply() affects the type of the final output from DataFrame.apply for
the default behaviour:
• If the applied function returns a Series, the final output is a DataFrame. The columns match the index of
the Series returned by the applied function.
• If the applied function returns any other type, the final output is a Series.
This default behaviour can be overridden using the result_type argument, which accepts three options: reduce,
broadcast, and expand. These determine how list-like return values expand (or not) to a DataFrame.
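A small sketch of the difference, assuming a made-up function low_high that returns a plain list for each row: with the default behaviour the lists are kept as the values of a Series, whereas result_type='expand' spreads them across columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})


def low_high(row):
    # Returns a plain list, i.e. a list-like that is not a Series
    return [row.min(), row.max()]


default = df.apply(low_high, axis=1)                          # Series whose values are lists
expanded = df.apply(low_high, axis=1, result_type='expand')   # DataFrame with columns 0 and 1

print(default)
print(expanded)
```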
apply() combined with some cleverness can be used to answer many questions about a data set. For example,
suppose we wanted to extract the date where the maximum value for each column occurred:

In [152]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],

.....: index=pd.date_range('1/1/2000', periods=1000))
.....:

In [153]: tsdf.apply(lambda x: x.idxmax())

Out[153]:
A 2000-06-10
B 2001-07-04
C 2002-08-09
dtype: datetime64[ns]

You may also pass additional arguments and keyword arguments to the apply() method. For instance, consider the
following function you would like to apply:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You may then apply this function as follows:

df.apply(subtract_and_divide, args=(5,), divide=3)
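For a concrete check of how the extra arguments are forwarded, the sketch below runs the same call on a small made-up frame and compares it with the equivalent direct arithmetic. The value 5 is passed positionally as sub through args, and divide=3 is forwarded as a keyword.

```python
import pandas as pd


def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide


df = pd.DataFrame({'one': [5.0, 8.0], 'two': [11.0, 14.0]})

# args supplies ``sub``; extra keywords such as ``divide`` are forwarded as-is
result = df.apply(subtract_and_divide, args=(5,), divide=3)

# The same computation written out directly
same = (df - 5) / 3

print(result)
```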

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [154]: tsdf
Out[154]:
A B C
2000-01-01 -0.652077 -0.239118 0.841272
2000-01-02 0.130224 0.347505 -0.385666
2000-01-03 -1.700237 -0.925899 0.199564
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.339319 -0.978307 0.689492
2000-01-09 0.601495 -0.630417 -1.040079
2000-01-10 1.511723 -0.427952 -0.400154

In [155]: tsdf.apply(pd.Series.interpolate)
Out[155]:
A B C
2000-01-01 -0.652077 -0.239118 0.841272
2000-01-02 0.130224 0.347505 -0.385666
2000-01-03 -1.700237 -0.925899 0.199564
2000-01-04 -1.292326 -0.936380 0.297550
2000-01-05 -0.884415 -0.946862 0.395535
2000-01-06 -0.476503 -0.957344 0.493521
2000-01-07 -0.068592 -0.967825 0.591507
2000-01-08 0.339319 -0.978307 0.689492
2000-01-09 0.601495 -0.630417 -1.040079
2000-01-10 1.511723 -0.427952 -0.400154

Finally, apply() takes an argument raw, which is False by default; in that case each row or column is converted into a
Series before the function is applied. When raw is set to True, the passed function instead receives an ndarray object,
which can improve performance if you do not need the indexing functionality.
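The effect of raw can be observed by recording the type of object the function receives. This is only a sketch: value_range and the seen list are illustrative names, and no timing claims are made here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['A', 'B'])

seen = []


def value_range(col):
    seen.append(type(col).__name__)   # record what apply() handed to the function
    return col.max() - col.min()


as_series = df.apply(value_range)              # each column arrives as a Series
as_ndarray = df.apply(value_range, raw=True)   # each column arrives as a NumPy ndarray

print(set(seen))
```

Both calls compute the same result here because value_range only uses methods that exist on Series and ndarray alike.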

Aggregation API

New in version 0.20.0.

The aggregation API allows one to express possibly multiple aggregation operations in a single concise way. This API
is similar across pandas objects, see groupby API, the window functions API, and the resample API. The entry point
for aggregation is DataFrame.aggregate(), or the alias DataFrame.agg().
We will use a similar starting frame from above:
In [156]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
.....: index=pd.date_range('1/1/2000', periods=10))
.....:

In [157]: tsdf.iloc[3:7] = np.nan

In [158]: tsdf
Out[158]:
A B C
2000-01-01 0.396575 -0.364907 0.051290
2000-01-02 -0.310517 -0.369093 -0.353151
2000-01-03 -0.522441 1.659115 -0.272364
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -0.057890 1.148901 0.011528
2000-01-09 -0.155578 0.742150 0.107324
2000-01-10 0.531797 0.080254 0.833297

Using a single function is equivalent to apply(). You can also pass named methods as strings. These will return a
Series of the aggregated output:

In [159]: tsdf.agg(np.sum)
Out[159]:
A -0.118055
B 2.896420
C 0.377923
dtype: float64

In [160]: tsdf.agg('sum')
Out[160]:
A -0.118055
B 2.896420
C 0.377923
dtype: float64

# these are equivalent to a ``.sum()`` because we are aggregating
# on a single function
In [161]: tsdf.sum()
Out[161]:
A -0.118055
B 2.896420
C 0.377923
dtype: float64

A single aggregation on a Series returns a scalar value:

In [162]: tsdf.A.agg('sum')
Out[162]: -0.11805495013260869

Aggregating with multiple functions

You can pass multiple aggregation arguments as a list. The results of each of the passed functions will be a row in the
resulting DataFrame. These are naturally named from the aggregation function.

In [163]: tsdf.agg(['sum'])
Out[163]:
A B C
sum -0.118055 2.89642 0.377923

Multiple functions yield multiple rows:

In [164]: tsdf.agg(['sum', 'mean'])

Out[164]:
A B C
sum -0.118055 2.896420 0.377923
mean -0.019676 0.482737 0.062987

On a Series, multiple functions return a Series, indexed by the function names:

In [165]: tsdf.A.agg(['sum', 'mean'])

Out[165]:
sum -0.118055
mean -0.019676
Name: A, dtype: float64

Passing a lambda function will yield a <lambda> named row:

In [166]: tsdf.A.agg(['sum', lambda x: x.mean()])

Out[166]:
sum -0.118055
<lambda> -0.019676
Name: A, dtype: float64

Passing a named function will yield that name for the row:

In [167]: def mymean(x):
   .....:     return x.mean()
   .....:

In [168]: tsdf.A.agg(['sum', mymean])

Out[168]:
sum -0.118055
mymean -0.019676
Name: A, dtype: float64

Aggregating with a dict

Passing a dictionary that maps column names to an aggregation, or to a list of aggregations, to DataFrame.agg allows
you to customize which functions are applied to which columns. Note that the results are not in any particular order;
you can use an OrderedDict instead to guarantee ordering.

In [169]: tsdf.agg({'A': 'mean', 'B': 'sum'})

Out[169]:
A -0.019676
B 2.896420
dtype: float64

Passing a list-like will generate a DataFrame output. You will get a matrix-like output of all of the aggregators. The
output will consist of all unique functions. Those that are not noted for a particular column will be NaN:

In [170]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})

Out[170]:
A B
mean -0.019676 NaN
min -0.522441 NaN
sum NaN 2.89642
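The two dict forms can be compared on a tiny frame with known values (made up for this sketch). Results are looked up by label rather than by position, since the row order of the output is not relied upon here.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

# One aggregation per column: the result is a Series indexed by column name
per_column = df.agg({'A': 'mean', 'B': 'sum'})

# Lists of aggregations: the result is a DataFrame of functions by columns,
# with NaN where a function was not requested for a column
matrix = df.agg({'A': ['mean', 'min'], 'B': ['sum']})

print(per_column)
print(matrix)
```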

Mixed Dtypes

When presented with mixed dtypes that cannot aggregate, .agg will only take the valid aggregations. This is similar
to how groupby .agg works.

In [171]: mdf = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [1., 2., 3.],
   .....:                     'C': ['foo', 'bar', 'baz'],
   .....:                     'D': pd.date_range('20130101', periods=3)})
   .....:

In [172]: mdf.dtypes
Out[172]:
A int64
B float64
C object
D datetime64[ns]
dtype: object

In [173]: mdf.agg(['min', 'sum'])

Out[173]:
A B C D
min 1 1.0 bar 2013-01-01
sum 6 6.0 foobarbaz NaT

Custom describe

With .agg() it is possible to easily create a custom describe function, similar to the built-in describe function.

In [174]: from functools import partial

In [175]: q_25 = partial(pd.Series.quantile, q=0.25)

In [176]: q_25.__name__ = '25%'

In [177]: q_75 = partial(pd.Series.quantile, q=0.75)

In [178]: q_75.__name__ = '75%'

In [179]: tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])

Out[179]:
A B C
count 6.000000 6.000000 6.000000
mean -0.019676 0.482737 0.062987
std 0.408577 0.836785 0.420419
min -0.522441 -0.369093 -0.353151
25% -0.271782 -0.253617 -0.201391
median -0.106734 0.411202 0.031409
75% 0.282958 1.047213 0.093315
max 0.531797 1.659115 0.833297

Transform API

New in version 0.20.0.

The transform() method returns an object that is indexed the same (same size) as the original. This API allows
you to provide multiple operations at the same time rather than one-by-one. Its API is quite similar to the .agg API.
We create a frame similar to the one used in the above sections.

In [180]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....:

In [181]: tsdf.iloc[3:7] = np.nan

In [182]: tsdf
Out[182]:
A B C
2000-01-01 -1.219234 -1.652700 -0.698277
2000-01-02 1.858653 -0.738520 0.630364
2000-01-03 -0.112596 1.525897 1.364225
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -0.527790 -1.715506 0.387274
2000-01-09 -0.569341 0.569386 0.134136
2000-01-10 -0.413993 -0.862280 0.662690

Transform the entire frame. .transform() accepts a NumPy function, a string function name, or a user-defined
function.

In [183]: tsdf.transform(np.abs)
Out[183]:
A B C
2000-01-01 1.219234 1.652700 0.698277
2000-01-02 1.858653 0.738520 0.630364
2000-01-03 0.112596 1.525897 1.364225
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.527790 1.715506 0.387274
2000-01-09 0.569341 0.569386 0.134136
2000-01-10 0.413993 0.862280 0.662690

In [184]: tsdf.transform('abs')
Out[184]:
A B C
2000-01-01 1.219234 1.652700 0.698277
2000-01-02 1.858653 0.738520 0.630364
2000-01-03 0.112596 1.525897 1.364225
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.527790 1.715506 0.387274
2000-01-09 0.569341 0.569386 0.134136
2000-01-10 0.413993 0.862280 0.662690

In [185]: tsdf.transform(lambda x: x.abs())

Out[185]:
A B C
2000-01-01 1.219234 1.652700 0.698277
2000-01-02 1.858653 0.738520 0.630364
2000-01-03 0.112596 1.525897 1.364225
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.527790 1.715506 0.387274
2000-01-09 0.569341 0.569386 0.134136
2000-01-10 0.413993 0.862280 0.662690

Here transform() received a single function; this is equivalent to a ufunc application.

In [186]: np.abs(tsdf)
Out[186]:
A B C
2000-01-01 1.219234 1.652700 0.698277
2000-01-02 1.858653 0.738520 0.630364
2000-01-03 0.112596 1.525897 1.364225
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.527790 1.715506 0.387274
2000-01-09 0.569341 0.569386 0.134136
2000-01-10 0.413993 0.862280 0.662690

Passing a single function to .transform() with a Series will yield a single Series in return.
In [187]: tsdf.A.transform(np.abs)
Out[187]:
2000-01-01 1.219234
2000-01-02 1.858653
2000-01-03 0.112596
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 0.527790
2000-01-09 0.569341
2000-01-10 0.413993
Freq: D, Name: A, dtype: float64

Transform with multiple functions

Passing multiple functions will yield a column MultiIndexed DataFrame. The first level will be the original frame
column names; the second level will be the names of the transforming functions.
In [188]: tsdf.transform([np.abs, lambda x: x + 1])
Out[188]:
A B C
absolute <lambda> absolute <lambda> absolute <lambda>
2000-01-01 1.219234 -0.219234 1.652700 -0.652700 0.698277 0.301723
2000-01-02 1.858653 2.858653 0.738520 0.261480 0.630364 1.630364
2000-01-03 0.112596 0.887404 1.525897 2.525897 1.364225 2.364225
2000-01-04 NaN NaN NaN NaN NaN NaN
2000-01-05 NaN NaN NaN NaN NaN NaN
2000-01-06 NaN NaN NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN NaN NaN
2000-01-08 0.527790 0.472210 1.715506 -0.715506 0.387274 1.387274
2000-01-09 0.569341 0.430659 0.569386 1.569386 0.134136 1.134136
2000-01-10 0.413993 0.586007 0.862280 0.137720 0.662690 1.662690

Passing multiple functions to a Series will yield a DataFrame. The resulting column names will be the transforming
functions.
In [189]: tsdf.A.transform([np.abs, lambda x: x + 1])
Out[189]:
absolute <lambda>
2000-01-01 1.219234 -0.219234
2000-01-02 1.858653 2.858653
2000-01-03 0.112596 0.887404
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 NaN NaN
2000-01-07 NaN NaN
2000-01-08 0.527790 0.472210
2000-01-09 0.569341 0.430659
2000-01-10 0.413993 0.586007

Transforming with a dict

Passing a dict of functions will allow selective transforming per column.

In [190]: tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})
Out[190]:
A B
2000-01-01 1.219234 -0.652700
2000-01-02 1.858653 0.261480
2000-01-03 0.112596 2.525897
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 NaN NaN
2000-01-07 NaN NaN
2000-01-08 0.527790 -0.715506
2000-01-09 0.569341 1.569386
2000-01-10 0.413993 0.137720

Passing a dict of lists will generate a MultiIndexed DataFrame with these selective transforms.
In [191]: tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})
Out[191]:
A B
absolute <lambda> sqrt
2000-01-01 1.219234 -0.652700 NaN
2000-01-02 1.858653 0.261480 NaN
2000-01-03 0.112596 2.525897 1.235272
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.527790 -0.715506 NaN
2000-01-09 0.569341 1.569386 0.754577
2000-01-10 0.413993 0.137720 NaN
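The defining property of transform(), that the result keeps the index of the input, can be checked directly against an aggregation, which reduces each column to one value. The frame below is a made-up example with exact values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-1.0, 2.0, -3.0], 'B': [4.0, -5.0, 6.0]})

reduced = df.agg('sum')             # one value per column
same_shape = df.transform('abs')    # same index and columns as df

# A dict applies a different transformation to each listed column
selective = df.transform({'A': np.abs, 'B': lambda x: x + 1})

print(reduced.shape, same_shape.shape)
print(selective)
```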

Applying Elementwise Functions

Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods
applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and
returning a single value. For example:
In [192]: df4
Out[192]:
one two three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [193]: def f(x):
   .....:     return len(str(x))
   .....:

In [194]: df4['one'].map(f)
Out[194]:
a 18
b 19
c 18
d 3
Name: one, dtype: int64

In [195]: df4.applymap(f)
Out[195]:
one two three
a 18 19 3
b 19 18 19
c 18 18 21
d 3 18 19

Series.map() has an additional feature; it can be used to easily “link” or “map” values defined by a secondary
series. This is closely related to merging/joining functionality:
In [196]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
.....: index=['a', 'b', 'c', 'd', 'e'])
.....:

In [197]: t = pd.Series({'six': 6., 'seven': 7.})

In [198]: s
Out[198]:
a six
b seven
c six
d seven
e six
dtype: object

In [199]: s.map(t)
Out[199]:
a 6.0
b 7.0
c 6.0
d 7.0
e 6.0
dtype: float64
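Series.map() also accepts a plain dict as the lookup. The sketch below uses one, and includes a value ('eight') that has no entry in the lookup to show that unmatched values become NaN, much like a left join with no matching key.

```python
import pandas as pd

s = pd.Series(['six', 'seven', 'eight'], index=['a', 'b', 'c'])
lookup = {'six': 6.0, 'seven': 7.0}

# 'eight' is not a key of ``lookup``, so it maps to NaN
mapped = s.map(lookup)

print(mapped)
```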

3.3.7 Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features
relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a
particular axis. This accomplishes several things:
• Reorders the existing data to match a new set of labels
• Inserts missing value (NA) markers in label locations where no data for that label existed
• If specified, fills data for missing labels using logic (highly relevant to working with time series data)
Here is a simple example:

In [200]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [201]: s
Out[201]:
a -0.368437
b -0.036473
c 0.774830
d -0.310545
e 0.709717
dtype: float64

In [202]: s.reindex(['e', 'b', 'f', 'd'])

Out[202]:
e 0.709717
b -0.036473
f NaN
d -0.310545
dtype: float64

Here, the f label was not contained in the Series and hence appears as NaN in the result.
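If NaN is not the desired placeholder, reindex() also takes a fill_value argument (not covered in the text above) that supplies a constant for labels that were absent. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

with_nan = s.reindex(['c', 'f', 'a'])                    # 'f' is new, so it becomes NaN
with_zero = s.reindex(['c', 'f', 'a'], fill_value=0.0)   # 'f' becomes 0.0 instead

print(with_nan)
print(with_zero)
```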
With a DataFrame, you can simultaneously reindex the index and columns:

In [203]: df
Out[203]:
one two three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [204]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])

Out[204]:
three two one

c -0.007090 0.924515 0.797268
f NaN NaN NaN
b 0.395023 1.045911 -0.356470

You may also use reindex with an axis keyword:

In [205]: df.reindex(['c', 'f', 'b'], axis='index')

Out[205]:
one two three
c 0.797268 0.924515 -0.007090
f NaN NaN NaN
b -0.356470 1.045911 0.395023

Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series
and a DataFrame, the following can be done:

In [206]: rs = s.reindex(df.index)

In [207]: rs
Out[207]:
a -0.368437
b -0.036473
c 0.774830
d -0.310545
dtype: float64

In [208]: rs.index is df.index

Out[208]: True

This means that the reindexed Series’s index is the same Python object as the DataFrame’s index.
New in version 0.21.0.
DataFrame.reindex() also supports an “axis-style” calling convention, where you specify a single labels
argument and the axis it applies to.

In [209]: df.reindex(['c', 'f', 'b'], axis='index')

Out[209]:
one two three
c 0.797268 0.924515 -0.007090
f NaN NaN NaN
b -0.356470 1.045911 0.395023

In [210]: df.reindex(['three', 'two', 'one'], axis='columns')

Out[210]:
three two one

a NaN -1.643041 1.400810
b 0.395023 1.045911 -0.356470
c -0.007090 0.924515 0.797268
d -1.670830 1.553693 NaN


See also:
MultiIndex / Advanced Indexing is an even more concise way of doing reindexing.

Note: When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing
ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames internally triggers a
reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily
optimized), but when CPU cycles matter, sprinkling a few explicit reindex calls here and there can have an impact.

Reindexing to align with another object

You may wish to take an object and reindex its axes to be labeled the same as another object. While the syntax for this
is straightforward albeit verbose, it is a common enough operation that the reindex_like() method is available
to make this simpler:

In [211]: df2
Out[211]:
one two
a 1.400810 -1.643041
b -0.356470 1.045911
c 0.797268 0.924515

In [212]: df3
Out[212]:
one two
a 0.786941 -1.752170
b -0.970339 0.936783
c 0.183399 0.815387

In [213]: df.reindex_like(df2)
Out[213]:
one two
a 1.400810 -1.643041
b -0.356470 1.045911
c 0.797268 0.924515

Aligning objects with each other with align

The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to
joining and merging):
• join='outer': take the union of the indexes (default)
• join='left': use the calling object’s index
• join='right': use the passed object’s index
• join='inner': intersect the indexes
It returns a tuple with both of the reindexed Series:

In [214]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [215]: s1 = s[:4]

In [216]: s2 = s[1:]

In [217]: s1.align(s2)
Out[217]:
(a -0.610263
b -0.170883
c 0.367255
d 0.273860
e NaN
dtype: float64, a NaN
b -0.170883
c 0.367255
d 0.273860
e 0.314782
dtype: float64)

In [218]: s1.align(s2, join='inner')

Out[218]:
(b -0.170883
c 0.367255
d 0.273860
dtype: float64, b -0.170883
c 0.367255
d 0.273860
dtype: float64)

In [219]: s1.align(s2, join='left')

Out[219]:
(a -0.610263
b -0.170883
c 0.367255
d 0.273860
dtype: float64, a NaN
b -0.170883
c 0.367255
d 0.273860
dtype: float64)

For DataFrames, the join method will be applied to both the index and the columns by default:

In [220]: df.align(df2, join='inner')

Out[220]:
( one two
a 1.400810 -1.643041
b -0.356470 1.045911
c 0.797268 0.924515, one two
a 1.400810 -1.643041
b -0.356470 1.045911
c 0.797268 0.924515)

You can also pass an axis option to only align on the specified axis:

In [221]: df.align(df2, join='inner', axis=0)

Out[221]:
( one two three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090, one two
a 1.400810 -1.643041
b -0.356470 1.045911
c 0.797268 0.924515)

If you pass a Series to DataFrame.align(), you can choose to align both objects either on the DataFrame’s index
or columns using the axis argument:

In [222]: df.align(df2.iloc[0], axis=1)

Out[222]:
( one three two
a 1.400810 NaN -1.643041
b -0.356470 0.395023 1.045911
c 0.797268 -0.007090 0.924515
d NaN -1.670830 1.553693, one 1.400810
three NaN
two -1.643041
Name: a, dtype: float64)
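
The join options listed above can be compared side by side with a small sketch. The Series left and right are hypothetical and chosen so that their indexes only partly overlap.

```python
import pandas as pd

# Hypothetical Series with partially overlapping labels.
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

outer_l, outer_r = left.align(right)                # default join='outer'
inner_l, inner_r = left.align(right, join='inner')  # intersection of labels
left_l, left_r = left.align(right, join='left')     # labels of the caller

print(list(outer_l.index))  # union: a, b, c, d
print(list(inner_l.index))  # intersection: b, c
print(list(left_r.index))   # caller's labels: a, b, c
```

In every case both returned objects share one index, and labels missing from either input are filled with NaN.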

Filling while reindexing

reindex() takes an optional parameter method which is a filling method chosen from the following table:

Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward
nearest Fill from the nearest index value

We illustrate these fill methods on a simple Series:

In [223]: rng = pd.date_range('1/3/2000', periods=8)

In [224]: ts = pd.Series(np.random.randn(8), index=rng)

In [225]: ts2 = ts[[0, 3, 6]]

In [226]: ts
Out[226]:
2000-01-03 -0.082578
2000-01-04 0.768554
2000-01-05 0.398842
2000-01-06 -0.357956
2000-01-07 0.156403
2000-01-08 -1.347564
2000-01-09 0.253506
2000-01-10 1.228964
Freq: D, dtype: float64

In [227]: ts2
Out[227]:
2000-01-03 -0.082578
2000-01-06 -0.357956
2000-01-09 0.253506
dtype: float64

In [228]: ts2.reindex(ts.index)
Out[228]:
2000-01-03 -0.082578
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 -0.357956
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 0.253506
2000-01-10 NaN
Freq: D, dtype: float64

In [229]: ts2.reindex(ts.index, method='ffill')
Out[229]:
2000-01-03 -0.082578
2000-01-04 -0.082578
2000-01-05 -0.082578
2000-01-06 -0.357956
2000-01-07 -0.357956
2000-01-08 -0.357956
2000-01-09 0.253506
2000-01-10 0.253506
Freq: D, dtype: float64

In [230]: ts2.reindex(ts.index, method='bfill')
Out[230]:
2000-01-03 -0.082578
2000-01-04 -0.357956
2000-01-05 -0.357956
2000-01-06 -0.357956
2000-01-07 0.253506
2000-01-08 0.253506
2000-01-09 0.253506
2000-01-10 NaN
Freq: D, dtype: float64

In [231]: ts2.reindex(ts.index, method='nearest')
Out[231]:
2000-01-03 -0.082578
2000-01-04 -0.082578
2000-01-05 -0.357956
2000-01-06 -0.357956
2000-01-07 -0.357956
2000-01-08 0.253506
2000-01-09 0.253506
2000-01-10 0.253506
Freq: D, dtype: float64

These methods require that the indexes are ordered increasing or decreasing.
Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate:

In [232]: ts2.reindex(ts.index).fillna(method='ffill')
Out[232]:
2000-01-03 -0.082578
2000-01-04 -0.082578
2000-01-05 -0.082578
2000-01-06 -0.357956
2000-01-07 -0.357956
2000-01-08 -0.357956
2000-01-09 0.253506
2000-01-10 0.253506
Freq: D, dtype: float64

reindex() will raise a ValueError if the index is not monotonically increasing or decreasing. fillna() and
interpolate() will not perform any checks on the order of the index.

Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing. limit specifies the
maximum number of consecutive labels that are filled from a single match:

In [233]: ts2.reindex(ts.index, method='ffill', limit=1)

Out[233]:
2000-01-03 -0.082578
2000-01-04 -0.082578
2000-01-05 NaN
2000-01-06 -0.357956
2000-01-07 -0.357956
2000-01-08 NaN
2000-01-09 0.253506
2000-01-10 0.253506
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and indexer values:

In [234]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')

Out[234]:
2000-01-03 -0.082578
2000-01-04 -0.082578
2000-01-05 NaN
2000-01-06 -0.357956
2000-01-07 -0.357956
2000-01-08 NaN
2000-01-09 0.253506
2000-01-10 0.253506
Freq: D, dtype: float64

Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced
into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
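
In the two examples above limit and tolerance happen to give the same result because the target index is evenly spaced. The sketch below uses a hypothetical irregular target index, where the two arguments behave differently: limit counts filled labels, while tolerance measures distance along the index.

```python
import pandas as pd

# Hypothetical data: one observation, and a target index that skips a day.
source = pd.Series([1.0], index=pd.to_datetime(['2000-01-03']))
target = pd.to_datetime(['2000-01-03', '2000-01-05', '2000-01-06'])

# limit=1: at most one consecutive label is forward filled.
by_limit = source.reindex(target, method='ffill', limit=1)

# tolerance='1 day': a label is filled only if its match is at most one day away.
by_tolerance = source.reindex(target, method='ffill', tolerance='1 day')

print(by_limit)
print(by_tolerance)
```

With limit=1 the value is carried to 2000-01-05 (the first unmatched label) but not to 2000-01-06. With tolerance='1 day' neither later label is filled, because both are more than one day from 2000-01-03.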

Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [235]: df
Out[235]:
one two three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [236]: df.drop(['a', 'd'], axis=0)
Out[236]:
one two three
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090

In [237]: df.drop(['one'], axis=1)
Out[237]:
two three
a -1.643041 NaN
b 1.045911 0.395023
c 0.924515 -0.007090
d 1.553693 -1.670830

Note that the following also works, but is a bit less obvious / clean:

In [238]: df.reindex(df.index.difference(['a', 'd']))

Out[238]:
one two three
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
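
As a brief sketch, drop() also accepts the index and columns keywords (available since 0.21.0) in place of axis, and an errors argument for labels that may be absent. The frame below is hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]},
                     index=['a', 'b', 'c', 'd'])

rows_dropped = frame.drop(index=['a', 'd'])    # same as drop(['a', 'd'], axis=0)
cols_dropped = frame.drop(columns=['one'])     # same as drop(['one'], axis=1)

# errors='ignore' skips labels that are not present instead of raising KeyError.
tolerant = frame.drop(['a', 'zzz'], errors='ignore')

print(list(rows_dropped.index), list(cols_dropped.columns), list(tolerant.index))
```

Each call returns a new object; frame itself is left unchanged.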

Renaming / mapping labels

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [239]: s
Out[239]:
a -0.610263
b -0.170883
c 0.367255
d 0.273860
e 0.314782
dtype: float64

In [240]: s.rename(str.upper)
Out[240]:
A -0.610263
B -0.170883
C 0.367255
D 0.273860
E 0.314782
dtype: float64

If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique
values). A dict or Series can also be used:

In [241]: df.rename(columns={'one': 'foo', 'two': 'bar'},

.....: index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
.....:
Out[241]:
foo bar three
apple 1.400810 -1.643041 NaN
banana -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
durian NaN 1.553693 -1.670830

If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping don’t
throw an error.
New in version 0.21.0.
DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper and
the axis to apply that mapping to.

In [242]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')

Out[242]:
foo bar three
a 1.400810 -1.643041 NaN
b -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
d NaN 1.553693 -1.670830

In [243]: df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')
Out[243]:
one two three
apple 1.400810 -1.643041 NaN
banana -0.356470 1.045911 0.395023
c 0.797268 0.924515 -0.007090
durian NaN 1.553693 -1.670830

The rename() method also provides an inplace named parameter that is by default False and copies the underlying
data. Pass inplace=True to rename the data in place.
New in version 0.18.0.
Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [244]: s.rename("scalar-name")
Out[244]:
a -0.610263
b -0.170883
c 0.367255
d 0.273860
e 0.314782
Name: scalar-name, dtype: float64

New in version 0.24.0.

The methods DataFrame.rename_axis() and Series.rename_axis() allow specific names of a MultiIndex to be
changed (as opposed to the labels).

In [245]: df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
   .....:                    'y': [10, 20, 30, 40, 50, 60]},
   .....:                   index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
   .....:                                                    names=['let', 'num']))
   .....:

In [246]: df
Out[246]:
x y
let num
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60

In [247]: df.rename_axis(index={'let': 'abc'})
Out[247]:
x y
abc num
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60

In [248]: df.rename_axis(index=str.upper)
Out[248]:
x y
LET NUM
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60
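
To contrast the two families of methods, here is a small sketch on a hypothetical frame: rename() changes the labels themselves, while rename_axis() changes only the name attached to an axis.

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])

# rename: a dict for the columns (the extra key 'missing' is silently ignored)
# and a function for the index labels.
renamed = frame.rename(columns={'one': 'foo', 'missing': 'ignored'},
                       index=str.upper)

# rename_axis: the labels stay the same, only the axis names are set.
named = frame.rename_axis(index='letter', columns='field')

print(list(renamed.columns), list(renamed.index))
print(named.index.name, named.columns.name)
```

Neither call modifies frame, since inplace defaults to False.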

3.3.8 Iteration

The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded
as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the
dict-like convention of iterating over the “keys” of the objects.
In short, basic iteration (for i in object) produces:
• Series: values
• DataFrame: column labels
• Panel: item labels

Thus, for example, iterating over a DataFrame gives you the column names:

In [249]: df = pd.DataFrame({'col1': np.random.randn(3),

.....: 'col2': np.random.randn(3)}, index=['a', 'b', 'c'])
.....:

In [250]: for col in df:

.....: print(col)
.....:
col1
col2

Pandas objects also have the dict-like iteritems() method to iterate over the (key, value) pairs.
To iterate over the rows of a DataFrame, you can use the following methods:
• iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series
objects, which can change the dtypes and has some performance implications.
• itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than
iterrows(), and is in most cases preferable to use to iterate over the values of a DataFrame.

Warning: Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is
not needed and can be avoided with one of the following approaches:
• Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions,
(boolean) indexing, ...
• When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply()
instead of iterating over the values. See the docs on function application.
• If you need to do iterative manipulations on the values but performance is important, consider writing the inner
loop with Cython or Numba. See the enhancing performance section for some examples of this approach.

Warning: You should never modify something you are iterating over. This is not guaranteed to work in all cases.
Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!
For example, in the following case setting the value has no effect:
In [251]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [252]: for index, row in df.iterrows():

.....: row['a'] = 10
.....:

In [253]: df
Out[253]:
a b
0 1 a
1 2 b
2 3 c
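
When the goal is to change values, assigning through the DataFrame itself avoids the problem. The following is only a sketch of one possible alternative, using a hypothetical frame with the same shape as the one above.

```python
import pandas as pd

# Hypothetical frame mirroring the example above.
frame = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# Assigning to a column updates the frame itself, with no row loop.
frame['a'] = 10

# A conditional update can be written with a boolean mask and .loc.
frame.loc[frame['b'] == 'b', 'a'] = 20

print(frame)
```

Both assignments operate on frame directly, so the changes are visible afterwards.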

iteritems

Consistent with the dict-like interface, iteritems() iterates through key-value pairs:

• Series: (index, scalar value) pairs

• DataFrame: (column, Series) pairs
• Panel: (item, DataFrame) pairs
For example:

In [254]: for item, frame in wp.iteritems():

.....: print(item)
.....: print(frame)
.....:
Item1
A B C D
2000-01-01 -1.157892 -1.344312 0.844885 1.075770
2000-01-02 -0.109050 1.643563 -1.469388 0.357021
2000-01-03 -0.674600 -1.776904 -0.968914 -1.294524
2000-01-04 0.413738 0.276662 -0.472035 -0.013960
2000-01-05 -0.362543 -0.006154 -0.923061 0.895717
Item2
A B C D
2000-01-01 0.805244 -1.206412 2.565646 1.431256
2000-01-02 1.340309 -1.170299 -0.226169 0.410835
2000-01-03 0.813850 0.132003 -0.827317 -0.076467
2000-01-04 -1.187678 1.130127 -1.436737 -1.413681
2000-01-05 1.607920 1.024180 0.569605 0.875906

iterrows

iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding
each index value along with a Series containing the data in each row:

In [255]: for row_index, row in df.iterrows():

.....: print(row_index, row, sep='\n')
.....:
0
a 1
b a
Name: 0, dtype: object
1
a 2
b b
Name: 1, dtype: object
2
a 3
b c
Name: 2, dtype: object

Note: Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows (dtypes are
preserved across columns for DataFrames). For example,

In [256]: df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

In [257]: df_orig.dtypes
Out[257]:
int int64
float float64
dtype: object

In [258]: row = next(df_orig.iterrows())[1]

In [259]: row
Out[259]:
int 1.0
float 1.5
Name: 0, dtype: float64

All values in row, returned as a Series, are now upcast to floats, including the original integer value in column int:

In [260]: row['int'].dtype
Out[260]: dtype('float64')

In [261]: df_orig['int'].dtype
Out[261]: dtype('int64')

To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the
values and which is generally much faster than iterrows().

For instance, a contrived way to transpose the DataFrame would be:

In [262]: df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

In [263]: print(df2)
x y
0 1 4
1 2 5
2 3 6

In [264]: print(df2.T)
0 1 2
x 1 2 3
y 4 5 6

In [265]: df2_t = pd.DataFrame({idx: values for idx, values in df2.iterrows()})

In [266]: print(df2_t)
0 1 2
x 1 2 3
y 4 5 6

itertuples

The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame. The first
element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.
For instance:

In [267]: for row in df.itertuples():

.....: print(row)
.....:
Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')

This method does not convert the row to a Series object; it merely returns the values inside a namedtuple. Therefore,
itertuples() preserves the data type of the values and is generally faster than iterrows().

Note: The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start
with an underscore. With a large number of columns (>255), regular tuples are returned.
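
The index and name parameters of itertuples() control the shape of each row, and the renaming described in the note can be observed directly. This is a sketch on hypothetical frames; the tuple name 'Row' is an arbitrary choice.

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# index=False omits the index; name sets the namedtuple's type name.
rows = list(frame.itertuples(index=False, name='Row'))
first = rows[0]
print(first.a, first.b)

# name=None yields plain tuples instead of namedtuples.
plain = list(frame.itertuples(index=False, name=None))
print(plain)

# A column name that is not a valid identifier is replaced by a positional name.
odd = pd.DataFrame({'ok': [1], 'not ok': [2]})
fields = next(odd.itertuples())._fields
print(fields)
```

Here 'not ok' sits at position 2 of the tuple (after Index and ok), so it becomes the field _2.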

3.3.9 .dt accessor

Series has an accessor to succinctly return datetime-like properties for the values of the Series, if it is a
datetime/period-like Series. This will return a Series, indexed like the existing Series.

# datetime
In [268]: s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

In [269]: s
Out[269]:
0 2013-01-01 09:10:12
1 2013-01-02 09:10:12
2 2013-01-03 09:10:12
3 2013-01-04 09:10:12
dtype: datetime64[ns]

In [270]: s.dt.hour
Out[270]:
0 9
1 9
2 9
3 9
dtype: int64

In [271]: s.dt.second
Out[271]:
0 12
1 12
2 12
3 12
dtype: int64

In [272]: s.dt.day
Out[272]:
0 1
1 2
2 3
3 4
dtype: int64

This enables nice expressions like this:

In [273]: s[s.dt.day == 2]
Out[273]:
1 2013-01-02 09:10:12
dtype: datetime64[ns]

You can easily produce tz-aware transformations:

In [274]: stz = s.dt.tz_localize('US/Eastern')

In [275]: stz
Out[275]:
0 2013-01-01 09:10:12-05:00
1 2013-01-02 09:10:12-05:00
2 2013-01-03 09:10:12-05:00
3 2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [276]: stz.dt.tz
Out[276]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

You can also chain these types of operations:

In [277]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[277]:
0 2013-01-01 04:10:12-05:00
1 2013-01-02 04:10:12-05:00
2 2013-01-03 04:10:12-05:00
3 2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]

You can also format datetime values as strings with Series.dt.strftime() which supports the same format as
the standard strftime().
# DatetimeIndex
In [278]: s = pd.Series(pd.date_range('20130101', periods=4))

In [279]: s
Out[279]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: datetime64[ns]

In [280]: s.dt.strftime('%Y/%m/%d')
Out[280]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object

# PeriodIndex
In [281]: s = pd.Series(pd.period_range('20130101', periods=4))

In [282]: s
Out[282]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]

In [283]: s.dt.strftime('%Y/%m/%d')
Out[283]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object

The .dt accessor works for period and timedelta dtypes.

# period
In [284]: s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))

In [285]: s
Out[285]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]

In [286]: s.dt.year
Out[286]:
0 2013
1 2013
2 2013
3 2013
dtype: int64

In [287]: s.dt.day
Out[287]:
0 1
1 2
2 3
3 4
dtype: int64

# timedelta
In [288]: s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))

In [289]: s
Out[289]:
0 1 days 00:00:05
1 1 days 00:00:06
2 1 days 00:00:07
3 1 days 00:00:08
dtype: timedelta64[ns]

In [290]: s.dt.days
Out[290]:
0 1
1 1
2 1
3 1
dtype: int64

In [291]: s.dt.seconds
Out[291]:
0 5
1 6
2 7
3 8
dtype: int64

In [292]: s.dt.components
Out[292]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 1 0 0 5 0 0 0
1 1 0 0 6 0 0 0
2 1 0 0 7 0 0 0
3 1 0 0 8 0 0 0

Note: Series.dt will raise a TypeError if you access it with non-datetime-like values.
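
As a short recap of the accessor, the sketch below combines a .dt property with boolean indexing, formats dates with strftime(), and converts timedeltas to seconds. The Series names are hypothetical.

```python
import pandas as pd

# Hypothetical week of timestamps; 2013-01-01 was a Tuesday.
stamps = pd.Series(pd.date_range('2013-01-01 09:10:12', periods=7))

# dayofweek runs from Monday=0 to Sunday=6, so >= 5 selects the weekend.
weekend = stamps[stamps.dt.dayofweek >= 5]

# strftime returns a Series of strings with the same index.
labels = stamps.dt.strftime('%Y-%m-%d')

# For timedelta data, total_seconds() gives the whole duration in seconds.
durations = pd.Series(pd.to_timedelta(['1 days 00:00:05', '2 days']))
seconds = durations.dt.total_seconds()

print(weekend.index.tolist(), labels.iloc[0], seconds.tolist())
```

Note that dt.seconds (shown earlier) returns only the seconds component, whereas total_seconds() includes the days.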

3.3.10 Vectorized string methods

Series is equipped with a set of string processing methods that make it easy to operate on each element of the array.
Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s
str attribute and generally have names matching the equivalent (scalar) built-in string methods. For example:

In [293]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [294]: s.str.lower()
Out[294]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object

Powerful pattern-matching methods are provided as well, but note that pattern-matching generally uses regular
expressions by default (and in some cases always uses them).
Please see Vectorized String Methods for a complete description.
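
The effect of the regular-expression default can be seen with str.contains(). This is a minimal sketch on a hypothetical Series; the pattern 'A.1' contains a dot, which matches any character when treated as a regular expression.

```python
import numpy as np
import pandas as pd

# Hypothetical data including a missing value.
words = pd.Series(['A.1', 'AB1', np.nan])

# Default: the pattern is a regular expression, so '.' matches any character.
as_regex = words.str.contains('A.1')

# regex=False treats the pattern as a literal substring.
as_literal = words.str.contains('A.1', regex=False)

# Missing values propagate as NaN unless na= supplies a replacement.
filled = words.str.contains('A.1', regex=False, na=False)

print(as_regex.tolist(), as_literal.tolist(), filled.tolist())
```

Passing na=False is convenient when the result is used as a boolean mask, since a mask containing NaN cannot be used for indexing.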

3.3.11 Sorting

Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination
of both.

By Index

The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its
index levels.
In [295]: df = pd.DataFrame({
.....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
.....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
.....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
.....:

In [296]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],

.....: columns=['three', 'two', 'one'])
.....:

In [297]: unsorted_df
Out[297]:
three two one
a NaN -0.867293 0.050162
d 1.215473 -0.051744 NaN
c -0.421091 -0.712097 0.953102
b 1.205223 0.632624 -1.534113

# DataFrame
In [298]: unsorted_df.sort_index()
Out[298]:
three two one
a NaN -0.867293 0.050162
b 1.205223 0.632624 -1.534113
c -0.421091 -0.712097 0.953102
d 1.215473 -0.051744 NaN

In [299]: unsorted_df.sort_index(ascending=False)
Out[299]:
three two one
d 1.215473 -0.051744 NaN
c -0.421091 -0.712097 0.953102
b 1.205223 0.632624 -1.534113
a NaN -0.867293 0.050162

In [300]: unsorted_df.sort_index(axis=1)
Out[300]:
one three two
a 0.050162 NaN -0.867293
d NaN 1.215473 -0.051744
c 0.953102 -0.421091 -0.712097
b -1.534113 1.205223 0.632624

# Series
In [301]: unsorted_df['three'].sort_index()
Out[301]:
a NaN
b 1.205223
c -0.421091
d 1.215473
Name: three, dtype: float64

By Values

The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values()
method is used to sort a DataFrame by its column or row values. The optional by parameter to
DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.
In [302]: df1 = pd.DataFrame({'one': [2, 1, 1, 1],
.....: 'two': [1, 3, 2, 4],
.....: 'three': [5, 4, 3, 2]})
.....:

In [303]: df1.sort_values(by='two')
Out[303]:
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2

The by parameter can take a list of column names, e.g.:

In [304]: df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])
Out[304]:
one two three
2 1 2 3
1 1 3 4
3 1 4 2
0 2 1 5

These methods have special treatment of NA values via the na_position argument:
In [305]: s[2] = np.nan

In [306]: s.sort_values()
Out[306]:
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
2 NaN
5 NaN
dtype: object

In [307]: s.sort_values(na_position='first')
Out[307]:
2 NaN
5 NaN
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
dtype: object
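
The ascending argument can be combined with by and na_position. The sketch below, on a hypothetical frame, sorts one column in descending order with missing values first, and then sorts by two columns with a different direction for each.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({'one': [2, 1, 1, 1],
                      'two': [1.0, 3.0, np.nan, 4.0],
                      'three': [5, 4, 3, 2]})

# Descending by 'two', with the NaN row placed first.
nan_first = frame.sort_values(by='two', ascending=False, na_position='first')

# 'one' ascending, ties broken by 'three' descending.
mixed = frame.sort_values(by=['one', 'three'], ascending=[True, False])

print(nan_first.index.tolist())
print(mixed.index.tolist())
```

When ascending is a list, it must have one entry per column named in by.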

By Indexes and Values

New in version 0.23.0.

Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level
names.

# Build MultiIndex
In [308]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
.....: ('b', 2), ('b', 1), ('b', 1)])
.....:

In [309]: idx.names = ['first', 'second']

# Build DataFrame
In [310]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
.....: index=idx)
.....:

In [311]: df_multi
Out[311]:
A
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1

Sort by ‘second’ (index) and ‘A’ (column)

In [312]: df_multi.sort_values(by=['second', 'A'])

Out[312]:
A
first second
b 1 1
1 2
a 1 6
b 2 3
a 2 4
2 5

Note: If a string matches both a column name and an index level name then a warning is issued and the column takes
precedence. This will result in an ambiguity error in a future version.

searchsorted

Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted().

In [313]: ser = pd.Series([1, 2, 3])

In [314]: ser.searchsorted([0, 3])

Out[314]: array([0, 2])

In [315]: ser.searchsorted([0, 4])
Out[315]: array([0, 3])

In [316]: ser.searchsorted([1, 3], side='right')
Out[316]: array([1, 3])

In [317]: ser.searchsorted([1, 3], side='left')
Out[317]: array([0, 2])

In [318]: ser = pd.Series([3, 1, 2])

In [319]: ser.searchsorted([0, 3], sorter=np.argsort(ser))

Out[319]: array([0, 2])
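
One way to read the result of searchsorted() is as the position at which each value would be inserted to keep the Series sorted. The following sketch uses a hypothetical sorted Series and shows how side changes the answer only for values that are already present.

```python
import pandas as pd

# Hypothetical sorted Series.
sorted_ser = pd.Series([10, 20, 30, 40])
queries = [5, 20, 25, 45]

# side='left' (the default) inserts before an equal element.
left_pos = sorted_ser.searchsorted(queries)

# side='right' inserts after an equal element.
right_pos = sorted_ser.searchsorted(queries, side='right')

print(left_pos.tolist())
print(right_pos.tolist())
```

Only the query 20, which exists in the Series, gets a different position under the two settings.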

smallest / largest values

Series has the nsmallest() and nlargest() methods which return the smallest or largest 𝑛 values. For a
large Series this can be much faster than sorting the entire Series and calling head(n) on the result.
In [320]: s = pd.Series(np.random.permutation(10))

In [321]: s
Out[321]:
0 5
1 3
2 2
3 0
4 7
5 6
6 9
7 1
8 4
9 8
dtype: int64

In [322]: s.sort_values()
Out[322]:
3 0
7 1
2 2
1 3
8 4
0 5
5 6
4 7
9 8
6 9
dtype: int64

In [323]: s.nsmallest(3)
Out[323]:
3 0
7 1
2 2
dtype: int64

In [324]: s.nlargest(3)
Out[324]:
6 9
9 8
4 7
dtype: int64

DataFrame also has the nlargest and nsmallest methods.

In [325]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
.....: 'b': list('abdceff'),
.....: 'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
.....:

In [326]: df.nlargest(3, 'a')

Out[326]:
a b c
5 11 f 3.0
3 10 c 3.2
4 8 e NaN

In [327]: df.nlargest(5, ['a', 'c'])
Out[327]:
a b c
5 11 f 3.0
3 10 c 3.2
4 8 e NaN
2 1 d 4.0
6 -1 f 4.0
In [328]: df.nsmallest(3, 'a')
Out[328]:
a b c
0 -2 a 1.0
1 -1 b 2.0
6 -1 f 4.0

In [329]: df.nsmallest(5, ['a', 'c'])
Out[329]:
a b c
0 -2 a 1.0
1 -1 b 2.0
6 -1 f 4.0
2 1 d 4.0
4 8 e NaN
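
The relationship between nlargest() and a full sort can be checked with a small sketch. The data are a hypothetical random permutation (seeded so the run is repeatable), which guarantees there are no ties.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 0..999 in random order, so all values are unique.
rng = np.random.RandomState(0)
values = pd.Series(rng.permutation(1000))

# Select the three largest values directly...
top = values.nlargest(3)

# ...or sort everything and take the head.
same = values.sort_values(ascending=False).head(3)

print(top.tolist())
print(top.equals(same))
```

Both approaches return the same rows in the same order here; nlargest() simply avoids sorting the whole Series.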

Sorting by a MultiIndex column

You must be explicit about sorting when the column is a MultiIndex, and fully specify all levels to by.

In [330]: df1.columns = pd.MultiIndex.from_tuples([('a', 'one'),

.....: ('a', 'two'),
.....: ('b', 'three')])
.....:

In [331]: df1.sort_values(by=('a', 'two'))

Out[331]:
a b
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2

3.3.12 Copying

The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they are
immutable) and returns a new object. Note that it is seldom necessary to copy objects. For example, there are only a
handful of ways to alter a DataFrame in-place:
• Inserting, deleting, or modifying a column.
• Assigning to the index or columns attributes.
• For homogeneous data, directly modifying the values via the values attribute or advanced indexing.
To be clear, no pandas method has the side effect of modifying your data; almost every method returns a new object,
leaving the original object untouched. If the data is modified, it is because you did so explicitly.
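
A minimal sketch of this behavior, using a hypothetical frame: a copy taken before explicit in-place modifications is not affected by them.

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({'a': [1, 2, 3]})

# Take an independent copy of the data.
snapshot = frame.copy()

# Explicit in-place changes: replacing a column, then setting one value.
frame['a'] = frame['a'] * 10
frame.loc[0, 'a'] = -1

print(frame['a'].tolist())
print(snapshot['a'].tolist())
```

Only frame reflects the changes; snapshot still holds the original values.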

3.3.13 dtypes

For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy
provides support for float, int, bool, timedelta64[ns] and datetime64[ns] (note that NumPy does not
support timezone-aware datetimes).
Pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the extensions
pandas has made internally. See Extension Types for how to write your own extension that works with pandas. See
Extension Data Types for a list of third-party libraries that have implemented an extension.
The following table lists all of pandas extension types. See the respective documentation sections for more on each
type.

Kind of Data          Data Type          Scalar      Array                  Documentation
tz-aware datetime     DatetimeTZDtype    Timestamp   arrays.DatetimeArray   Time Zone Handling
Categorical           CategoricalDtype   (none)      Categorical            Categorical Data
period (time spans)   PeriodDtype        Period      arrays.PeriodArray     Time Span Representation
sparse                SparseDtype        (none)      arrays.SparseArray     Sparse data structures
intervals             IntervalDtype      Interval    arrays.IntervalArray   IntervalIndex
nullable integer      Int64Dtype, ...    (none)      arrays.IntegerArray    Nullable Integer Data Type

Pandas uses the object dtype for storing strings.

Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible (for
performance and interoperability with other libraries and methods). See object conversion.
A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.
In [332]: dft = pd.DataFrame({'A': np.random.rand(3),
.....: 'B': 1,
.....: 'C': 'foo',
.....: 'D': pd.Timestamp('20010102'),
.....: 'E': pd.Series([1.0] * 3).astype('float32'),
.....: 'F': False,
.....: 'G': pd.Series([1] * 3, dtype='int8')})
.....:

In [333]: dft
Out[333]:
A B C D E F G
0 0.278831 1 foo 2001-01-02 1.0 False 1
1 0.242124 1 foo 2001-01-02 1.0 False 1
2 0.078031 1 foo 2001-01-02 1.0 False 1

In [334]: dft.dtypes
Out[334]:
A float64
B int64
C object
D datetime64[ns]
E float32
F bool
G int8
dtype: object

On a Series object, use the dtype attribute.

In [335]: dft['A'].dtype
Out[335]: dtype('float64')

If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be chosen to
accommodate all of the data types (object is the most general).

# these ints are coerced to floats

In [336]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[336]:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
dtype: float64

# string data forces an ``object`` dtype

In [337]: pd.Series([1, 2, 3, 6., 'foo'])
Out[337]:
0 1
1 2
2 3
3 6
4 foo
dtype: object

The number of columns of each type in a DataFrame can be found by calling get_dtype_counts().

In [338]: dft.get_dtype_counts()
Out[338]:
float64 1
float32 1
int64 1
int8 1
datetime64[ns] 1
bool 1
object 1
dtype: int64
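
As a hedged alternative that relies only on the dtypes attribute shown earlier, the same per-dtype tally can be sketched with value_counts(). The frame and its column names below are illustrative assumptions, not the dft object from the example above.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column names are assumptions for this sketch.
frame = pd.DataFrame({'A': np.random.rand(3),
                      'B': np.arange(3, dtype='int64'),
                      'F': [True, False, True],
                      'G': np.array([1.5, 2.5, 3.5])})

# dtypes is a Series of dtype objects, so value_counts() tallies columns per dtype.
counts = frame.dtypes.value_counts()

# Key the tally by dtype name so that lookups are straightforward.
counts_by_name = {str(dtype): int(n) for dtype, n in counts.items()}
print(counts_by_name)
```

Keying by name is only a convenience for the sketch; the counts Series itself is indexed by the dtype objects.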

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype
keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore,
different numeric dtypes will NOT be combined. The following example will give you a taste.

In [339]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [340]: df1
Out[340]:
A
0 -1.641339
1 -0.314062
2 -0.679206
3 1.178243
4 0.181790
5 -2.044248
6 1.151282
7 -1.641398

In [341]: df1.dtypes
Out[341]:
A float32
dtype: object

In [342]: df2 = pd.DataFrame({'A': pd.Series(np.random.randn(8), dtype='float16'),
.....: 'B': pd.Series(np.random.randn(8)),
.....: 'C': pd.Series(np.array(np.random.randn(8),
.....: dtype='uint8'))})
.....:

In [343]: df2
Out[343]:
A B C
0 0.130737 -1.143729 1
1 0.289551 2.787500 0
2 0.590820 -0.708143 254
3 -0.020142 -1.512388 0
4 -1.048828 -0.243145 1
5 -0.808105 -0.650992 0
6 1.373047 2.090108 0
7 -0.254395 0.433098 0

In [344]: df2.dtypes
Out[344]:
A float16
B float64
C uint8
dtype: object

defaults

By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The
following will all result in int64 dtypes.

In [345]: pd.DataFrame([1, 2], columns=['a']).dtypes

Out[345]:
a int64
dtype: object

In [346]: pd.DataFrame({'a': [1, 2]}).dtypes

Out[346]:
a int64
dtype: object

In [347]: pd.DataFrame({'a': 1}, index=list(range(2))).dtypes

Out[347]:
a int64
dtype: object

Note that NumPy will choose platform-dependent types when creating arrays. The following WILL result in int32
on a 32-bit platform.

In [348]: frame = pd.DataFrame(np.array([1, 2]))

upcasting

Types can potentially be upcast when combined with other types, meaning they are promoted from the current type
to a more general one (e.g. int to float).

In [349]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [350]: df3
Out[350]:
A B C
0 -1.510602 -1.143729 1.0
1 -0.024511 2.787500 0.0
2 -0.088385 -0.708143 254.0
3 1.158101 -1.512388 0.0
4 -0.867039 -0.243145 1.0
5 -2.852354 -0.650992 0.0
6 2.524329 2.090108 0.0
7 -1.895793 0.433098 0.0

In [351]: df3.dtypes
Out[351]:
A float32
B float64
C float64
dtype: object

DataFrame.to_numpy() will return the lowest common denominator of the dtypes, meaning the dtype that can
accommodate ALL of the types in the resulting homogeneously typed NumPy array. This can force some upcasting.

In [352]: df3.to_numpy().dtype
Out[352]: dtype('float64')
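
The common-dtype rule can be sketched on a small frame. This is a minimal illustration with made-up column names; it assumes only DataFrame.to_numpy() and DataFrame.assign().

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'i': np.array([1, 2], dtype='int64'),
                        'f': np.array([0.5, 1.5], dtype='float64')})

# int64 and float64 columns share float64 as the common dtype.
numeric_dtype = numeric.to_numpy().dtype

# Adding a string column leaves object as the only dtype that fits everything.
mixed = numeric.assign(s=['x', 'y'])
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```

Once a frame mixes numeric and non-numeric columns, the resulting array holds Python objects, which is usually slower to work with than a numeric array.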

astype

You can use the astype() method to explicitly convert dtypes from one to another. By default, astype() returns a
copy, even if the dtype is unchanged (pass copy=False to change this behavior). In addition, it raises an
exception if the requested conversion is invalid.
Upcasting always follows the NumPy rules. If two different dtypes are involved in an operation, then the more
general one will be used as the result of the operation.

In [353]: df3
Out[353]:
A B C
0 -1.510602 -1.143729 1.0
1 -0.024511 2.787500 0.0
2 -0.088385 -0.708143 254.0
3 1.158101 -1.512388 0.0
4 -0.867039 -0.243145 1.0
5 -2.852354 -0.650992 0.0
6 2.524329 2.090108 0.0
7 -1.895793 0.433098 0.0

In [354]: df3.dtypes
Out[354]:
A float32
B float64
C float64
dtype: object

# conversion of dtypes
In [355]: df3.astype('float32').dtypes
Out[355]:
A float32
B float32
C float32
dtype: object
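
The statement that astype() raises on an invalid conversion can be sketched as follows. The Series contents are illustrative assumptions; the sketch catches ValueError, which is what an unparsable string produces here.

```python
import pandas as pd

# Object-dtype strings; 'x' cannot be parsed as an integer.
raw = pd.Series(['1', '2', 'x'], dtype=object)

try:
    raw.astype('int64')
    failed = False
except ValueError:
    # astype() refuses the conversion instead of silently dropping the value.
    failed = True

# A fully numeric input converts without trouble.
clean = pd.Series(['1', '2', '3'], dtype=object).astype('int64')
print(failed, clean.dtype)
```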

Convert a subset of columns to a specified type using astype().

In [356]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [357]: dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)

In [358]: dft
Out[358]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9

In [359]: dft.dtypes
Out[359]:
a uint8
b uint8
c int64
dtype: object

New in version 0.19.0.

Convert certain columns to a specific dtype by passing a dict to astype().
In [360]: dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [361]: dft1 = dft1.astype({'a': bool, 'c': np.float64})

In [362]: dft1
Out[362]:
a b c
0 True 4 7.0
1 False 5 8.0
2 True 6 9.0

In [363]: dft1.dtypes
Out[363]:
a bool
b int64
c float64
dtype: object

Note: When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.
loc() tries to fit what we are assigning into the current dtypes, while [] overwrites them, taking the dtype from
the right-hand side. Therefore the following piece of code produces an unintended result.

In [364]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [365]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes

Out[365]:
a uint8
b uint8
dtype: object

In [366]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [367]: dft.dtypes
Out[367]:
a int64
b int64
c int64
dtype: object

object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases
where the data is already of the correct type, but stored in an object array, the DataFrame.infer_objects()
and Series.infer_objects() methods can be used to soft convert to the correct type.

In [368]: import datetime

In [369]: df = pd.DataFrame([[1, 2],
.....: ['a', 'b'],
.....: [datetime.datetime(2016, 3, 2),
.....: datetime.datetime(2016, 3, 2)]])
.....:

In [370]: df = df.T

In [371]: df
Out[371]:
0 1 2
0 1 a 2016-03-02 00:00:00
1 2 b 2016-03-02 00:00:00

In [372]: df.dtypes
Out[372]:
0 object
1 object
2 object
dtype: object

Because the data was transposed, the original inference stored all columns as object, which infer_objects will
correct.

In [373]: df.infer_objects().dtypes
Out[373]:
0 int64
1 object
2 datetime64[ns]
dtype: object

The following functions are available for one dimensional object arrays or scalars to perform hard conversion of objects
to a specified type:
• to_numeric() (conversion to numeric dtypes)

In [374]: m = ['1.1', 2, 3]

In [375]: pd.to_numeric(m)
Out[375]: array([ 1.1, 2. , 3. ])

• to_datetime() (conversion to datetime objects)

In [376]: import datetime

In [377]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

In [378]: pd.to_datetime(m)
Out[378]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]',
˓→freq=None)

• to_timedelta() (conversion to timedelta objects)

In [379]: m = ['5us', pd.Timedelta('1day')]

In [380]: pd.to_timedelta(m)
Out[380]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype=
˓→'timedelta64[ns]', freq=None)

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements
that cannot be converted to the desired dtype or object. By default, errors='raise', meaning that any errors
encountered will be raised during the conversion process. However, if errors='coerce', these errors will be ignored
and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric).
This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but
occasionally has non-conforming elements intermixed that you want to represent as missing:

In [381]: import datetime

In [382]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [383]: pd.to_datetime(m, errors='coerce')

Out[383]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [384]: m = ['apple', 2, 3]

In [385]: pd.to_numeric(m, errors='coerce')

Out[385]: array([ nan, 2., 3.])

In [386]: m = ['apple', pd.Timedelta('1day')]

In [387]: pd.to_timedelta(m, errors='coerce')

Out[387]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
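
A typical follow-up to errors='coerce' is to count or drop the elements that failed to convert. The sketch below uses made-up input strings and assumes nothing beyond to_numeric(), isna() and dropna().

```python
import pandas as pd

# Mostly numeric text with two non-conforming entries (illustrative data).
raw = pd.Series(['1.5', 'n/a', '3', 'oops'])

# Unparsable elements become NaN instead of raising.
values = pd.to_numeric(raw, errors='coerce')

# Count the failures, then keep only the elements that converted.
n_bad = int(values.isna().sum())
cleaned = values.dropna()
print(n_bad, cleaned.tolist())
```

Counting the coerced elements before dropping them is a cheap way to notice when more of the input is malformed than expected.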

The errors parameter has a third option of errors='ignore', which will simply return the passed in data if it
encounters any errors with the conversion to a desired data type:

In [388]: import datetime

In [389]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [390]: pd.to_datetime(m, errors='ignore')

Out[390]: Index(['apple', 2016-03-02 00:00:00], dtype='object')

In [391]: m = ['apple', 2, 3]

In [392]: pd.to_numeric(m, errors='ignore')

Out[392]: array(['apple', 2, 3], dtype=object)

In [393]: m = ['apple', pd.Timedelta('1day')]

In [394]: pd.to_timedelta(m, errors='ignore')

Out[394]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)

In addition to object conversion, to_numeric() provides another argument downcast, which gives the option of
downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:

In [395]: m = ['1', 2, 3]

In [396]: pd.to_numeric(m, downcast='integer') # smallest signed int dtype

Out[396]: array([1, 2, 3], dtype=int8)

In [397]: pd.to_numeric(m, downcast='signed') # same as 'integer'

Out[397]: array([1, 2, 3], dtype=int8)

In [398]: pd.to_numeric(m, downcast='unsigned') # smallest unsigned int dtype

Out[398]: array([1, 2, 3], dtype=uint8)

In [399]: pd.to_numeric(m, downcast='float') # smallest float dtype

Out[399]: array([ 1., 2., 3.], dtype=float32)
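
To make the memory claim concrete, the following sketch compares the size of the underlying buffers before and after downcasting. The value range (0 to 999) is an assumption chosen so that the smallest signed integer dtype that fits is int16.

```python
import numpy as np
import pandas as pd

# 1000 values stored as 8-byte integers.
wide = pd.Series(np.arange(1000, dtype='int64'))

# Values up to 999 do not fit in int8, so the smallest fitting dtype is int16.
narrow = pd.to_numeric(wide, downcast='integer')

wide_bytes = wide.to_numpy().nbytes
narrow_bytes = narrow.to_numpy().nbytes
print(narrow.dtype, wide_bytes, narrow_bytes)
```

The downcast result holds the same values as the original; only the storage width changes.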

As these methods apply only to one-dimensional arrays, lists or scalars, they cannot be used directly on
multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column
efficiently:

In [400]: import datetime

In [401]: df = pd.DataFrame([
.....: ['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
.....:

In [402]: df
Out[402]:
0 1
0 2016-07-09 2016-03-02 00:00:00
1 2016-07-09 2016-03-02 00:00:00

In [403]: df.apply(pd.to_datetime)
Out[403]:
0 1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02

In [404]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

In [405]: df
Out[405]:
0 1 2
0 1.1 2 3
1 1.1 2 3

In [406]: df.apply(pd.to_numeric)
Out[406]:
0 1 2
0 1.1 2 3
1 1.1 2 3

In [407]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [408]: df
Out[408]:
0 1
0 5us 1 days 00:00:00
1 5us 1 days 00:00:00

In [409]: df.apply(pd.to_timedelta)
Out[409]:
0 1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
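
Because apply() forwards extra keyword arguments to the function it calls, the errors option described earlier can be combined with column-wise conversion. The frame below is an illustrative assumption.

```python
import pandas as pd

# Two text columns, each with one entry that is not a number (illustrative data).
frame = pd.DataFrame({'x': ['1', '2', 'bad'],
                      'y': ['4.5', 'oops', '6']})

# errors='coerce' is passed through apply() to pd.to_numeric for every column.
converted = frame.apply(pd.to_numeric, errors='coerce')

n_missing = int(converted.isna().sum().sum())
print(converted.dtypes)
print(n_missing)
```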

gotchas

Performing selection operations on integer type data can easily upcast the data to floating. The dtype of the
input data will be preserved in cases where nans are not introduced. See also Support for integer NA.

In [410]: dfi = df3.astype('int32')

In [411]: dfi['E'] = 1

In [412]: dfi
Out[412]:
A B C E
0 -1 -1 1 1
1 0 2 0 1
2 0 0 254 1
3 1 -1 0 1
4 0 0 1 1
5 -2 0 0 1
6 2 2 0 1
7 -1 0 0 1

In [413]: dfi.dtypes
Out[413]:
A int32
B int32
C int32
E int64
dtype: object

In [414]: casted = dfi[dfi > 0]

In [415]: casted
Out[415]:
A B C E
0 NaN NaN 1.0 1
1 NaN 2.0 NaN 1
2 NaN NaN 254.0 1
3 1.0 NaN NaN 1
4 NaN NaN 1.0 1
5 NaN NaN NaN 1
6 2.0 2.0 NaN 1
7 NaN NaN NaN 1

In [416]: casted.dtypes
Out[416]:
A float64
B float64
C float64
E int64
dtype: object
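
One way to avoid this upcast is the nullable integer dtype listed in the extension types table earlier in this section. The sketch below is a minimal comparison on illustrative data; it assumes the 'Int64' string alias for that dtype is available in the installed version.

```python
import pandas as pd

# Plain int64: masking introduces NaN, which forces float64.
plain = pd.Series([1, -2, 3])
masked_plain = plain.where(plain > 0)

# Nullable integer: the missing value is stored without leaving the integer dtype.
nullable = pd.Series([1, -2, 3], dtype='Int64')
masked_nullable = nullable.where(nullable > 0)

print(masked_plain.dtype, masked_nullable.dtype)
```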

Float dtypes, by contrast, are unchanged.

In [417]: dfa = df3.copy()

In [418]: dfa['A'] = dfa['A'].astype('float32')

In [419]: dfa.dtypes
Out[419]:
A float32
B float64
C float64
dtype: object

In [420]: casted = dfa[df2 > 0]

In [421]: casted
Out[421]:
A B C
0 -1.510602 NaN 1.0
1 -0.024511 2.787500 NaN
2 -0.088385 NaN 254.0
3 NaN NaN NaN
4 NaN NaN 1.0
5 NaN NaN NaN
6 2.524329 2.090108 NaN
7 NaN 0.433098 NaN

In [422]: casted.dtypes
Out[422]:
A float32
B float64
C float64
dtype: object

3.3.14 Selecting columns based on dtype

The select_dtypes() method implements subsetting of columns based on their dtype.

First, let’s create a DataFrame with a slew of different dtypes:
In [423]: df = pd.DataFrame({'string': list('abc'),
.....: 'int64': list(range(1, 4)),
.....: 'uint8': np.arange(3, 6).astype('u1'),
.....: 'float64': np.arange(4.0, 7.0),
.....: 'bool1': [True, False, True],
.....: 'bool2': [False, True, False],
.....: 'dates': pd.date_range('now', periods=3),
.....: 'category': pd.Series(list("ABC")).astype('category')})
.....:

In [424]: df['tdeltas'] = df.dates.diff()

In [425]: df['uint64'] = np.arange(3, 6).astype('u8')

In [426]: df['other_dates'] = pd.date_range('20130101', periods=3)

In [427]: df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [428]: df
Out[428]:
  string  int64  uint8  float64  bool1  ...  category  tdeltas  uint64  other_dates             tz_aware_dates
0      a      1      3      4.0   True  ...         A      NaT       3   2013-01-01  2013-01-01 00:00:00-05:00
1      b      2      4      5.0  False  ...         B   1 days       4   2013-01-02  2013-01-02 00:00:00-05:00
2      c      3      5      6.0   True  ...         C   1 days       5   2013-01-03  2013-01-03 00:00:00-05:00

[3 rows x 12 columns]

And the dtypes:

In [429]: df.dtypes
Out[429]:
string object
int64 int64
uint8 uint8
float64 float64
bool1 bool
bool2 bool
dates datetime64[ns]
category category
tdeltas timedelta64[ns]
uint64 uint64
other_dates datetime64[ns]
tz_aware_dates datetime64[ns, US/Eastern]
dtype: object

select_dtypes() has two parameters, include and exclude, that allow you to say “give me the columns with
these dtypes” (include) and/or “give me the columns without these dtypes” (exclude).
For example, to select bool columns:

In [430]: df.select_dtypes(include=[bool])
Out[430]:
bool1 bool2
0 True False
1 False True
2 True False

You can also pass the name of a dtype in the NumPy dtype hierarchy:

In [431]: df.select_dtypes(include=['bool'])
Out[431]:
bool1 bool2
0 True False
1 False True
2 True False

select_dtypes() also works with generic dtypes.

For example, to select all numeric and boolean columns while excluding unsigned integers:

In [432]: df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])

Out[432]:
int64 float64 bool1 bool2 tdeltas
0 1 4.0 True False NaT
1 2 5.0 False True 1 days
2 3 6.0 True False 1 days

To select string columns you must use the object dtype:

In [433]: df.select_dtypes(include=['object'])
Out[433]:
string
0 a
1 b
2 c

To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a tree of
child dtypes:

In [434]: def subdtypes(dtype):
.....:     subs = dtype.__subclasses__()
.....:     if not subs:
.....:         return dtype
.....:     return [dtype, [subdtypes(dt) for dt in subs]]
.....:

All NumPy dtypes are subclasses of numpy.generic:

In [435]: subdtypes(np.generic)
Out[435]:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.int64,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.uint64]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]

Note: Pandas also defines the types category and datetime64[ns, tz], which are not integrated into the
normal NumPy hierarchy and won’t show up with the above function.
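
Membership in the hierarchy printed above can also be checked directly with numpy.issubdtype(). The sketch below is a small illustration. Among other things, it suggests why the tdeltas column appeared in the include=['number'] selection earlier: timedelta64 sits under signedinteger in the tree.

```python
import numpy as np

# int8 is a leaf under number -> integer -> signedinteger.
int8_is_number = np.issubdtype(np.int8, np.number)

# Unsigned integers live on a separate branch from signed ones.
uint8_is_signed = np.issubdtype(np.uint8, np.signedinteger)

# timedelta64 is listed under signedinteger, so it counts as a number.
timedelta_is_number = np.issubdtype(np.timedelta64, np.number)

# bool_ hangs directly off generic, outside the number branch.
bool_is_number = np.issubdtype(np.bool_, np.number)

print(int8_is_number, uint8_is_signed, timedelta_is_number, bool_is_number)
```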

3.4 Intro to Data Structures

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started.
The fundamental behavior of data types, indexing, and axis labeling / alignment applies across all of the objects. To
get started, import NumPy and load pandas into your namespace:
In [1]: import numpy as np

In [2]: import pandas as pd

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken
unless done so explicitly by you.
We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in
separate sections.

3.4.1 Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers,
Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is
to call:
>>> s = pd.Series(data, index=index)

Here, data can be many different things:

• a Python dict
• an ndarray
• a scalar value (like 5)
The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:
From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values
[0, ..., len(data) - 1].
In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [4]: s
Out[4]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 1.212112
dtype: float64

In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [6]: pd.Series(np.random.randn(5))
Out[6]:
0 -0.173215
1 0.119209
2 -1.044236
3 -0.861849
4 -2.104569
dtype: float64
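
The length requirement stated above can be sketched as follows. The array and labels are illustrative; the sketch assumes only that a mismatched index raises ValueError.

```python
import numpy as np
import pandas as pd

values = np.arange(3)

# Without an index, labels 0..len(values)-1 are generated.
default = pd.Series(values)

try:
    # Three values but only two labels: the constructor rejects this.
    pd.Series(values, index=['a', 'b'])
    raised = False
except ValueError:
    raised = True

print(list(default.index), raised)
```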

Note: pandas supports non-unique index values. If an operation that does not support duplicate index values is
attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there
are many instances in computations, like parts of GroupBy, where the index is not used).
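
A minimal sketch of this lazy behavior, using an illustrative Series with a repeated label: construction and label lookup succeed, while reindex(), which requires unique labels, raises only when it is called.

```python
import pandas as pd

# Building a Series with a duplicated label is allowed.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
is_unique = dup.index.is_unique

# Looking up the duplicated label returns every matching row.
both = dup['a']

try:
    # reindex() does not support duplicate labels, so the error surfaces here.
    dup.reindex(['a', 'b'])
    raised = False
except ValueError:
    raised = True

print(is_unique, len(both), raised)
```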

From dict
Series can be instantiated from dicts:

In [7]: d = {'b': 1, 'a': 0, 'c': 2}

In [8]: pd.Series(d)
Out[8]:
b 1
a 0
c 2
dtype: int64

Note: When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion
order, if you’re using Python version >= 3.6 and Pandas version >= 0.23.
If you’re using Python < 3.6 or Pandas < 0.23, and an index is not passed, the Series index will be the lexically
ordered list of dict keys.

In the example above, if you were on a Python version lower than 3.6 or a Pandas version lower than 0.23, the Series
would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [9]: d = {'a': 0., 'b': 1., 'c': 2.}

In [10]: pd.Series(d)
Out[10]:
a 0.0
b 1.0
c 2.0
dtype: float64

In [11]: pd.Series(d, index=['b', 'c', 'd', 'a'])

Out[11]:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

Note: NaN (not a number) is the standard missing data marker used in pandas.

From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [12]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

Out[12]:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64

Series is ndarray-like

Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, operations
such as slicing will also slice the index.

In [13]: s[0]
Out[13]: 0.46911229990718628

In [14]: s[:3]
Out[14]:
a 0.469112
b -0.282863
c -1.509059
dtype: float64

In [15]: s[s > s.median()]

Out[15]:
a 0.469112
e 1.212112
dtype: float64

In [16]: s[[4, 3, 1]]

Out[16]:
e 1.212112
d -1.135632
b -0.282863
dtype: float64

In [17]: np.exp(s)
Out[17]:
a 1.598575
b 0.753623
c 0.221118
d 0.321219
e 3.360575
dtype: float64

Note: We will address array-based indexing like s[[4, 3, 1]] in a separate section.

Like a NumPy array, a pandas Series has a dtype.

In [18]: s.dtype
Out[18]: dtype('float64')

This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places,
in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical Data and
Nullable Integer Data Type. See dtypes for more.
If you need the actual array backing a Series, use Series.array.

In [19]: s.array
Out[19]:
<PandasArray>
[ 0.46911229990718628, -0.28286334432866328, -1.5090585031735124,
-1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64

Accessing the array can be useful when you need to do some operation without the index (to disable automatic
alignment, for example).
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one
or more concrete arrays like a numpy.ndarray. Pandas knows how to take an ExtensionArray and store it in
a Series or a column of a DataFrame. See dtypes for more.
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().

In [20]: s.to_numpy()
Out[20]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

Even if the Series is backed by an ExtensionArray, Series.to_numpy() will return a NumPy ndarray.
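
The difference between Series.array and Series.to_numpy() can be sketched on two illustrative Series, one backed by a NumPy dtype and one by the categorical extension type. The sketch assumes the public ExtensionArray base class exposed under pandas.api.extensions.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

floats = pd.Series([0.5, 1.5, 2.5])
cats = pd.Series(['a', 'b', 'a'], dtype='category')

# .array hands back an ExtensionArray in both cases.
float_array_is_ext = isinstance(floats.array, ExtensionArray)
cat_array_is_ext = isinstance(cats.array, ExtensionArray)

# .to_numpy() hands back a plain ndarray in both cases.
float_numpy_is_nd = isinstance(floats.to_numpy(), np.ndarray)
cat_numpy_is_nd = isinstance(cats.to_numpy(), np.ndarray)

print(type(floats.array).__name__, type(cats.array).__name__)
```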

Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [21]: s['a']
Out[21]: 0.46911229990718628

In [22]: s['e'] = 12.

In [23]: s
Out[23]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 12.000000
dtype: float64

In [24]: 'e' in s
Out[24]: True

In [25]: 'f' in s
Out[25]: False

If a label is not contained, an exception is raised:

>>> s['f']
KeyError: 'f'

Using the get method, a missing label will return None or specified default:
In [26]: s.get('f')

In [27]: s.get('f', np.nan)

Out[27]: nan

Vectorized operations and label alignment with Series

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true
when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.
In [28]: s + s
Out[28]:
a 0.938225
b -0.565727
c -3.018117
d -2.271265
e 24.000000
dtype: float64

In [29]: s * 2
Out[29]:
a 0.938225
b -0.565727
c -3.018117
d -2.271265
e 24.000000
dtype: float64

In [30]: np.exp(s)
Out[30]:
a 1.598575
b 0.753623
c 0.221118
d 0.321219
e 162754.791419
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on
label. Thus, you can write computations without giving consideration to whether the Series involved have the same
labels.
In [31]: s[1:] + s[:-1]
Out[31]:
a NaN
b -0.565727
c -3.018117
d -2.271265
e NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found
in one Series or the other, the result will be marked as missing NaN. Being able to write code without doing any explicit
data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data
alignment features of the pandas data structures set pandas apart from the majority of related tools for working with
labeled data.

Note: In general, we chose to make the default result of operations between differently indexed objects yield the
union of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is
typically important information as part of a computation. You of course have the option of dropping labels with
missing data via the dropna function.
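
The union-of-indexes behavior and two ways of handling the resulting gaps can be sketched as follows. The Series are illustrative; besides dropna(), the sketch uses Series.add() with its fill_value parameter, which treats a label missing on one side as the given value.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# The result carries the union of labels; 'a' and 'd' have no partner and become NaN.
total = left + right

# Option 1: drop the labels with missing results.
overlap = total.dropna()

# Option 2: treat a missing side as 0 so that every label keeps a value.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```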

Name attribute

Series can also have a name attribute:

In [32]: s = pd.Series(np.random.randn(5), name='something')

In [33]: s
Out[33]:
0 -0.494929
1 1.071804
2 0.721555
3 -0.706771
4 -1.039575
Name: something, dtype: float64

In [34]: s.name
Out[34]: 'something'

The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as
you will see below.
New in version 0.18.0.
You can rename a Series with the pandas.Series.rename() method.

In [35]: s2 = s.rename("different")

In [36]: s2.name
Out[36]: 'different'

Note that s and s2 refer to different objects.

3.4.2 DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it
like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
Like Series, DataFrame accepts many different kinds of input:

• Dict of 1D ndarrays, lists, dicts, or Series

• 2-D numpy.ndarray
• Structured or record ndarray
• A Series
• Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass
an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict
of Series plus a specific index will discard all data not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.

Note: When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by the dict’s
insertion order, if you are using Python version >= 3.6 and Pandas >= 0.23.
If you are using Python < 3.6 or Pandas < 0.23, and columns is not specified, the DataFrame columns will be the
lexically ordered list of dict keys.

From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first
be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.

In [37]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:

In [38]: df = pd.DataFrame(d)

In [39]: df
Out[39]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0

In [40]: pd.DataFrame(d, index=['d', 'b', 'a'])

Out[40]:
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0

In [41]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Out[41]:
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN

Note: When a particular set of columns is passed along with a dict of data, the passed columns override the keys in
the dict.

The row and column labels can be accessed respectively by accessing the index and columns attributes:

In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')

From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays.
If no index is passed, the result will be range(n), where n is the array length.
In [44]: d = {'one': [1., 2., 3., 4.],
....: 'two': [4., 3., 2., 1.]}
....:

In [45]: pd.DataFrame(d)
Out[45]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0

In [46]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Out[46]:
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0

From structured or record array

This case is handled identically to a dict of arrays.

In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
Out[49]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])

Out[50]:
A B C
first 1 2.0 b'Hello'
second 2 3.0 b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]:
C A B
0 b'Hello' 1 2.0
1 b'World' 2 3.0

Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

From a list of dicts

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
Out[53]:
a b c
0 1 2 NaN
1 5 10 20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]:
a b c
first 1 2 NaN
second 5 10 20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]:
a b
0 1 2
1 5 10

From a dict of tuples

You can automatically create a MultiIndexed frame by passing a dictionary whose keys are tuples.

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
....:
Out[56]:
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0


From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the
original name of the Series (only if no other column name provided).
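
As a small sketch of the behavior just described (the Series contents and the names price and cost are made up for illustration), a named Series becomes a one-column DataFrame that keeps the Series index. Series.to_frame is shown as one way to choose a different column name.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='price')

# The single column takes the name of the Series; the index is preserved.
named_frame = pd.DataFrame(s)

# An unnamed Series gets the default column label 0.
unnamed_frame = pd.DataFrame(pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']))

# to_frame lets you pick the column name explicitly.
renamed_frame = s.to_frame(name='cost')
```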

Missing Data

Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing data, we use
np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the data argument to
the DataFrame constructor, and its masked entries will be considered missing.
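
Both routes can be sketched as follows. The data values and column names here are invented for illustration; the example assumes float data, where masked entries are expected to come through as NaN.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

# Route 1: mark missing values explicitly with np.nan.
from_nan = pd.DataFrame({'x': [1.0, np.nan], 'y': [3.0, 4.0]})

# Route 2: pass a masked array; the masked entry (row 0, column 'y')
# is treated as missing.
masked = ma.masked_array([[1.0, 2.0], [3.0, 4.0]],
                         mask=[[False, True], [False, False]])
from_masked = pd.DataFrame(masked, columns=['x', 'y'])

missing_in_nan = int(from_nan.isna().sum().sum())
missing_in_masked = int(from_masked.isna().sum().sum())
```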

Alternate Constructors

DataFrame.from_dict
DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates
like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can
be set to 'index' in order to use the dict keys as row labels.

In [57]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))

Out[57]:
A B
0 1 4
1 2 5
2 3 6

If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired column
names:

In [58]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
....:                        orient='index', columns=['one', 'two', 'three'])
....:
Out[58]:
one two three
A 1 2 3
B 4 5 6

DataFrame.from_records
DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously to the
normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured
dtype. For example:

In [59]: data
Out[59]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [60]: pd.DataFrame.from_records(data, index='C')
Out[60]:
A B
C
b'Hello' 1 2.0
b'World' 2 3.0


Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting
columns work with the same syntax as the analogous dict operations:

In [61]: df['one']
Out[61]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64

In [62]: df['three'] = df['one'] * df['two']

In [63]: df['flag'] = df['one'] > 2

In [64]: df
Out[64]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False

Columns can be deleted or popped, as with a dict:

In [65]: del df['two']

In [66]: three = df.pop('three')

In [67]: df
Out[67]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False

When inserting a scalar value, it will naturally be propagated to fill the column:
In [68]: df['foo'] = 'bar'

In [69]: df
Out[69]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s
index:
In [70]: df['one_trunc'] = df['one'][:2]

In [71]: df
Out[71]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN

You can insert raw ndarrays, but their length must match the length of the DataFrame’s index.

By default, columns get inserted at the end. The insert function is available to insert at a particular location in the
columns:

In [72]: df.insert(1, 'bar', df['one'])

In [73]: df
Out[73]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
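
The length requirement for raw ndarrays mentioned above can be illustrated with a short sketch. The frame and the column names arr and bad are hypothetical, and the example assumes the mismatch surfaces as a ValueError.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# An ndarray of matching length is assigned positionally (no label alignment).
frame['arr'] = np.array([10, 20, 30])

# An ndarray of the wrong length is rejected and the frame is left unchanged.
length_error = None
try:
    frame['bad'] = np.array([1, 2])
except ValueError as exc:
    length_error = exc
```

Unlike a Series, an ndarray carries no labels, so there is nothing to conform to the index; that is why the lengths have to agree exactly.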

Assigning New Columns in Method Chains

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns
that are potentially derived from existing columns.

In [74]: iris = pd.read_csv('data/iris.data')

In [75]: iris.head()
Out[75]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

In [76]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
....:         .head())
....:
Out[76]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000

In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated
on the DataFrame being assigned to.

In [77]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth'] / x['SepalLength'])).head()
Out[77]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000

assign always returns a copy of the data, leaving the original DataFrame untouched.
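
A quick way to see this, using a small made-up frame rather than the iris data (the column names width, length, and ratio are illustrative):

```python
import pandas as pd

original = pd.DataFrame({'width': [3.5, 3.0], 'length': [5.1, 4.9]})

# assign builds a new DataFrame; the callable receives the frame being assigned to.
with_ratio = original.assign(ratio=lambda x: x['width'] / x['length'])

original_columns = list(original.columns)
new_columns = list(with_ratio.columns)
```
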
Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the
DataFrame at hand. This is common when using assign in a chain of operations. For example, we can limit the
DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:

In [78]: (iris.query('SepalLength > 5')
....:         .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
....:                 PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
....:         .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
....:
Out[78]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7a2be9c128>

Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the
DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the
ratio calculations. This is an example where we didn’t have a reference to the filtered DataFrame available.


The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the
values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be
called on the DataFrame. A copy of the original DataFrame is returned, with the new values inserted.
Changed in version 0.23.0.
Starting with Python 3.6 the order of **kwargs is preserved. This allows for dependent assignment, where an
expression later in **kwargs can refer to a column created earlier in the same assign().

In [79]: dfa = pd.DataFrame({"A": [1, 2, 3],
....:                     "B": [4, 5, 6]})
....:

In [80]: dfa.assign(C=lambda x: x['A'] + x['B'],
....:            D=lambda x: x['A'] + x['C'])
....:
Out[80]:
A B C D
0 1 4 5 6
1 2 5 7 9
2 3 6 9 12

In the second expression, x['C'] will refer to the newly created column, which is equal to dfa['A'] + dfa['B'].

To write code compatible with all versions of Python, split the assignment in two.

In [81]: dependent = pd.DataFrame({"A": [1, 1, 1]})

In [82]: (dependent.assign(A=lambda x: x['A'] + 1)
....:             .assign(B=lambda x: x['A'] + 2))
....:
Out[82]:
A B
0 2 4
1 2 4
2 2 4

Warning: Dependent assignment may subtly change the behavior of your code between Python 3.6 and older
versions of Python.
If you wish to write code that supports versions of Python before and after 3.6, you’ll need to take care when passing
assign expressions that
• Update an existing column
• Refer to the newly updated column in the same assign
For example, we’ll update column “A” and then refer to it when creating “B”.
>>> dependent = pd.DataFrame({"A": [1, 1, 1]})
>>> dependent.assign(A=lambda x: x["A"] + 1, B=lambda x: x["A"] + 2)

For Python 3.5 and earlier the expression creating B refers to the “old” value of A, [1, 1, 1]. The output is
then
A B
0 2 3
1 2 3
2 2 3

For Python 3.6 and later, the expression creating B refers to the “new” value of A, [2, 2, 2], which results in
A B
0 2 4
1 2 4
2 2 4

Indexing / Selection

The basics of indexing are as follows:

Operation                        Syntax           Result

Select column                    df[col]          Series
Select row by label              df.loc[label]    Series
Select row by integer location   df.iloc[loc]     Series
Slice rows                       df[5:10]         DataFrame
Select rows by boolean vector    df[bool_vec]     DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame:

In [83]: df.loc['b']
Out[83]:
one 2
bar 2
flag False
foo bar
one_trunc 2
Name: b, dtype: object

In [84]: df.iloc[2]
Out[84]:
one 3
bar 3
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
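
The remaining rows of the table (slicing and boolean selection) can be sketched in the same way. The frame below is a small stand-in built for this illustration, not the df used in the transcript above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0, np.nan],
                      'two': [1.0, 2.0, 3.0, 4.0]},
                     index=['a', 'b', 'c', 'd'])

column = frame['one']               # select column -> Series
row_by_label = frame.loc['b']       # select row by label -> Series
row_by_position = frame.iloc[2]     # select row by integer location -> Series
row_slice = frame[1:3]              # slice rows -> DataFrame
filtered = frame[frame['two'] > 2]  # boolean vector -> DataFrame
```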

For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing. We
will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.

Data alignment and arithmetic

Operations between DataFrame objects automatically align on both the columns and the index (row labels).
Again, the resulting object will have the union of the column and row labels.

In [85]: df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

In [86]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

In [87]: df + df2
Out[87]:
A B C D
0 0.045691 -0.014138 1.380871 NaN
1 -0.955398 -1.501007 0.037181 NaN
2 -0.662690 1.534833 -0.859691 NaN
3 -2.452949 1.237274 -0.133712 NaN
4 1.414490 1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN

When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the
DataFrame columns, thus broadcasting row-wise. For example:

In [88]: df - df.iloc[0]
Out[88]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2 0.253128 0.829678 0.010026 -1.991234
3 -1.311128 0.054325 -1.724913 -1.620544
4 0.573025 1.500742 -0.676070 1.367331
5 -1.741248 0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282 0.000430
7 -0.743894 0.411013 -0.929563 -0.282386
8 -1.194921 1.320690 0.238224 -1.482644
9 2.293786 1.856228 0.773289 -1.446531

In the special case of working with time series data, where the DataFrame index also contains dates, the broadcasting
will be column-wise:

In [89]: index = pd.date_range('1/1/2000', periods=8)

In [90]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

In [91]: df
Out[91]:
A B C
2000-01-01 -1.226825 0.769804 -1.281247
2000-01-02 -0.727707 -0.121306 -0.097883
2000-01-03 0.695775 0.341734 0.959726
2000-01-04 -1.110336 -0.619976 0.149748
2000-01-05 -0.732339 0.687738 0.176444
2000-01-06 0.403310 -0.154951 0.301624
2000-01-07 -2.179861 -1.369849 -0.954208
2000-01-08 1.462696 -1.743161 -0.826591

In [92]: type(df['A'])
Out[92]: pandas.core.series.Series

In [93]: df - df['A']
Out[93]:
            2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00  ...   A   B   C
2000-01-01                  NaN                  NaN                  NaN  ...  NaN NaN NaN
2000-01-02                  NaN                  NaN                  NaN  ...  NaN NaN NaN
2000-01-03                  NaN                  NaN                  NaN  ...  NaN NaN NaN
2000-01-04                  NaN                  NaN                  NaN  ...  NaN NaN NaN
2000-01-05                  NaN                  NaN                  NaN  ...  NaN NaN NaN
2000-01-06                  NaN                  NaN                  NaN  ...  NaN NaN NaN
2000-01-07                  NaN                  NaN                  NaN  ...  NaN NaN NaN
2000-01-08                  NaN                  NaN                  NaN  ...  NaN NaN NaN

[8 rows x 11 columns]

Warning:
df - df['A']

is now deprecated and will be removed in a future release. The preferred way to replicate this behavior is
df.sub(df['A'], axis=0)
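
A minimal sketch of the preferred form, using a small deterministic frame in place of the random data above (the name centered is illustrative):

```python
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=4)
frame = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                     index=index, columns=list('ABC'))

# axis=0 aligns the Series on the row labels, so column 'A' is
# subtracted from every column.
centered = frame.sub(frame['A'], axis=0)
```

Here each row of frame is [a, a + 1, a + 2], so centered holds 0.0, 1.0, and 2.0 in columns A, B, and C respectively.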

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
Operations with scalars are just as you would expect:

In [94]: df * 5 + 2
Out[94]:
A B C
2000-01-01 -4.134126 5.849018 -4.406237
2000-01-02 -1.638535 1.393469 1.510587
2000-01-03 5.478873 3.708672 6.798628
2000-01-04 -3.551681 -1.099880 2.748742
2000-01-05 -1.661697 5.438692 2.882222
2000-01-06 4.016548 1.225246 3.508122
2000-01-07 -8.899303 -4.849247 -2.771039
2000-01-08 9.313480 -6.715805 -2.132955

In [95]: 1 / df
Out[95]:
A B C
2000-01-01 -0.815112 1.299033 -0.780489
2000-01-02 -1.374179 -8.243600 -10.216313
2000-01-03 1.437247 2.926250 1.041965
2000-01-04 -0.900628 -1.612966 6.677871
2000-01-05 -1.365487 1.454041 5.667510
2000-01-06 2.479485 -6.453662 3.315381
2000-01-07 -0.458745 -0.730007 -1.047990
2000-01-08 0.683669 -0.573671 -1.209788

In [96]: df ** 4
Out[96]:
A B C
2000-01-01 2.265327 0.351172 2.694833
2000-01-02 0.280431 0.000217 0.000092
2000-01-03 0.234355 0.013638 0.848376
2000-01-04 1.519910 0.147740 0.000503
2000-01-05 0.287640 0.223714 0.000969
2000-01-06 0.026458 0.000576 0.008277
2000-01-07 22.579530 3.521204 0.829033
2000-01-08 4.577374 9.233151 0.466834

Boolean operators work as well:

In [97]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

In [98]: df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [99]: df1 & df2

Out[99]:
a b
0 False False
1 False True
2 True False

In [100]: df1 | df2
Out[100]:
a b
0 True True
1 True True
2 True True

In [101]: df1 ^ df2
Out[101]:
a b
0 True True
1 True False
2 False True

In [102]: -df1
Out[102]:
a b
0 False True
1 True False
2 False False

Transposing

To transpose, access the T attribute (also the transpose function), similar to an ndarray:


# only show the first 5 rows

In [103]: df[:5].T
Out[103]:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05
A -1.226825 -0.727707 0.695775 -1.110336 -0.732339
B 0.769804 -0.121306 0.341734 -0.619976 0.687738
C -1.281247 -0.097883 0.959726 0.149748 0.176444

DataFrame interoperability with NumPy functions

Elementwise NumPy ufuncs (log, exp, sqrt, . . . ) and various other NumPy functions can be used with no issues on
DataFrame, assuming the data within are numeric:

In [104]: np.exp(df)
Out[104]:
A B C
2000-01-01 0.293222 2.159342 0.277691
2000-01-02 0.483015 0.885763 0.906755
2000-01-03 2.005262 1.407386 2.610980
2000-01-04 0.329448 0.537957 1.161542
2000-01-05 0.480783 1.989212 1.192968
2000-01-06 1.496770 0.856457 1.352053
2000-01-07 0.113057 0.254145 0.385117
2000-01-08 4.317584 0.174966 0.437538

In [105]: np.asarray(df)
Out[105]:
array([[-1.2268, 0.7698, -1.2812],
[-0.7277, -0.1213, -0.0979],
[ 0.6958, 0.3417, 0.9597],
[-1.1103, -0.62 , 0.1497],
[-0.7323, 0.6877, 0.1764],
[ 0.4033, -0.155 , 0.3016],
[-2.1799, -1.3698, -0.9542],
[ 1.4627, -1.7432, -0.8266]])

The dot method on DataFrame implements matrix multiplication:

In [106]: df.T.dot(df)
Out[106]:
A B C
A 11.341858 -0.059772 3.007998
B -0.059772 6.520556 2.083308
C 3.007998 2.083308 4.310549

Similarly, the dot method on Series implements dot product:

In [107]: s1 = pd.Series(np.arange(5, 10))

In [108]: s1.dot(s1)
Out[108]: 255

DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in
places from a matrix.
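
One such difference can be sketched as follows (the array contents are arbitrary): integer indexing with [] selects a row of an ndarray but a column of a DataFrame.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
frame = pd.DataFrame(arr)        # default column labels are 0 and 1

array_first = arr[0]             # ndarray: first row
frame_first = frame[0]           # DataFrame: column labeled 0
frame_row = frame.iloc[0]        # DataFrame: first row, by position
```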


Console display

Very large DataFrames will be truncated to display them in the console. You can also get a summary using info().
(The example below reads a CSV version of the baseball dataset from the plyr R package):

In [109]: baseball = pd.read_csv('data/baseball.csv')

In [110]: print(baseball)
       id     player  year  stint team  lg   g   ab   r  ...   sb   cs  bb    so  ibb  hbp   sh   sf  gidp
0   88641  womacto01  2006      2  CHN  NL  19   50   6  ...  1.0  1.0   4   4.0  0.0  0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  BOS  AL  31    2   0  ...  0.0  0.0   0   1.0  0.0  0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...  ..  ..  ...  ..  ...  ...  ...  ..   ...  ...  ...  ...  ...   ...
98  89533   aloumo01  2007      1  NYN  NL  87  328  51  ...  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  NYN  NL   8   22   1  ...  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0

[100 rows x 23 columns]

In [111]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id 100 non-null int64
player 100 non-null object
year 100 non-null int64
stint 100 non-null int64
team 100 non-null object
lg 100 non-null object
g 100 non-null int64
ab 100 non-null int64
r 100 non-null int64
h 100 non-null int64
X2b 100 non-null int64
X3b 100 non-null int64
hr 100 non-null int64
rbi 100 non-null float64
sb 100 non-null float64
cs 100 non-null float64
bb 100 non-null int64
so 100 non-null float64
ibb 100 non-null float64
hbp 100 non-null float64
sh 100 non-null float64
sf 100 non-null float64
gidp 100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.0+ KB

However, using to_string will return a string representation of the DataFrame in tabular form, though it won’t
always fit the console width:


In [112]: print(baseball.iloc[-20:, :12].to_string())

id player year stint team lg g ab r h X2b X3b
80 89474 finlest01 2007 1 COL NL 43 94 9 17 3 0
81 89480 embreal01 2007 1 OAK AL 4 0 0 0 0 0
82 89481 edmonji01 2007 1 SLN NL 117 365 39 92 15 2
83 89482 easleda01 2007 1 NYN NL 76 193 24 54 6 0
84 89489 delgaca01 2007 1 NYN NL 139 538 71 139 30 0
85 89493 cormirh01 2007 1 CIN NL 6 0 0 0 0 0
86 89494 coninje01 2007 2 NYN NL 21 41 2 8 2 0
87 89495 coninje01 2007 1 CIN NL 80 215 23 57 11 1
88 89497 clemero02 2007 1 NYA AL 2 2 0 1 0 0
89 89498 claytro01 2007 2 BOS AL 8 6 1 0 0 0
90 89499 claytro01 2007 1 TOR AL 69 189 23 48 14 0
91 89501 cirilje01 2007 2 ARI NL 28 40 6 8 4 0
92 89502 cirilje01 2007 1 MIN AL 50 153 18 40 9 2
93 89521 bondsba01 2007 1 SFN NL 126 340 75 94 14 0
94 89523 biggicr01 2007 1 HOU NL 141 517 68 130 31 3
95 89525 benitar01 2007 2 FLO NL 34 0 0 0 0 0
96 89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0
97 89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0

Wide DataFrames will be printed across multiple rows by default:

In [113]: pd.DataFrame(np.random.randn(3, 12))

Out[113]:
          0         1         2         3         4  ...         7         8         9        10        11
0 -0.345352  1.314232  0.690579  0.995761  2.396780  ... -0.317441 -1.236269  0.896171 -0.487602 -0.082240
1 -2.182937  0.380396  0.084844  0.432390  1.519970  ...  0.274230  0.132885 -0.023688  2.410179  1.450520
2  0.206053 -0.251905 -2.213588  1.063327  1.266143  ...  0.408204 -1.048089 -0.025747 -0.988387  0.094055

[3 rows x 12 columns]

You can change how much to print on a single row by setting the display.width option:

In [114]: pd.set_option('display.width', 40) # default is 80

In [115]: pd.DataFrame(np.random.randn(3, 12))

Out[115]:
          0         1         2         3         4  ...         7         8         9        10        11
0  1.262731  1.289997  0.082423 -0.055758  0.536580  ... -0.034571 -2.484478 -0.281461  0.030711  0.109121
1  1.126203 -0.977349  1.474071 -0.064034 -1.282782  ...  0.441153  2.353925  0.583787  0.221471 -0.744471
2  0.758527  1.729689 -0.964980 -0.845696 -1.340896  ...  1.682706 -1.717693  0.888782  0.228440  0.901805

[3 rows x 12 columns]

You can adjust the max width of the individual columns by setting display.max_colwidth:


In [116]: datafile = {'filename': ['filename_01', 'filename_02'],
.....:             'path': ["media/user_name/storage/folder_01/filename_01",
.....:                      "media/user_name/storage/folder_02/filename_02"]}
.....:

In [117]: pd.set_option('display.max_colwidth', 30)

In [118]: pd.DataFrame(datafile)
Out[118]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...

In [119]: pd.set_option('display.max_colwidth', 100)

In [120]: pd.DataFrame(datafile)
Out[120]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02

You can also disable this feature via the expand_frame_repr option. This will print the table in one block.
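
A sketch of that option follows. It uses pd.option_context so the setting is only changed temporarily; also setting display.max_columns to None is an assumption made here so that no columns are elided, and the frame contents are arbitrary.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.arange(36).reshape(3, 12))

# Temporarily disable wrapping of wide frames. max_columns=None is set
# as well so that all 12 columns are shown rather than truncated.
with pd.option_context('display.expand_frame_repr', False,
                       'display.max_columns', None):
    one_block = repr(wide)

# One header line plus one line per row.
line_count = len(one_block.splitlines())
```

The same option can be set globally with pd.set_option('display.expand_frame_repr', False) if the one-block layout is wanted for a whole session.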

DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:

In [121]: df = pd.DataFrame({'foo1': np.random.randn(5),
.....:                    'foo2': np.random.randn(5)})
.....:

In [122]: df
Out[122]:
foo1 foo2
0 1.171216 -0.858447
1 0.520260 0.306996
2 -1.197071 -0.028665
3 -1.066969 0.384316
4 -0.303421 1.574159

In [123]: df.foo1
Out[123]:
0 1.171216
1 0.520260
2 -1.197071
3 -1.066969
4 -0.303421
Name: foo1, dtype: float64

The columns are also connected to the IPython completion mechanism so they can be tab-completed:

In [5]: df.fo<TAB> # noqa: E225, E999

df.foo1 df.foo2


3.4.3 Panel

Warning: In 0.20.0, Panel is deprecated and will be removed in a future version. See the section Deprecate
Panel.

Panel is a somewhat less-used, but still important container for 3-dimensional data. The term panel data is derived
from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are
intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric
analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you
may find the axis names slightly arbitrary:
• items: axis 0, each item corresponds to a DataFrame contained inside
• major_axis: axis 1, it is the index (rows) of each of the DataFrames
• minor_axis: axis 2, it is the columns of each of the DataFrames
Construction of Panels works much as you would expect:

From 3D ndarray with optional axis labels

In [124]: wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
.....:               major_axis=pd.date_range('1/1/2000', periods=5),
.....:               minor_axis=['A', 'B', 'C', 'D'])
.....:

In [125]: wp
Out[125]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

From dict of DataFrame objects

In [126]: data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
.....:         'Item2': pd.DataFrame(np.random.randn(4, 2))}
.....:

In [127]: pd.Panel(data)
Out[127]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

Note that the values in the dict need only be convertible to DataFrame. Thus, they can be any of the other valid
inputs to DataFrame as per above.
One helpful factory method is Panel.from_dict, which takes a dictionary of DataFrames as above, and the
following named parameters:


Parameter   Default   Description

intersect   False     drops elements whose indices do not align
orient      items     use minor to use DataFrames’ columns as panel items

For example, compare to the construction above:

In [128]: pd.Panel.from_dict(data, orient='minor')

Out[128]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: 0 to 2
Major_axis axis: 0 to 3
Minor_axis axis: Item1 to Item2

The orient parameter is especially useful for mixed-type DataFrames. If you pass a dict of DataFrame objects with
mixed-type columns, all of the data will be upcast to dtype=object unless you pass orient='minor':

In [129]: df = pd.DataFrame({'a': ['foo', 'bar', 'baz'],
.....:                    'b': np.random.randn(3)})
.....:

In [130]: df
Out[130]:
a b
0 foo -0.308853
1 bar -0.681087
2 baz 0.377953

In [131]: data = {'item1': df, 'item2': df}

In [132]: panel = pd.Panel.from_dict(data, orient='minor')

In [133]: panel['a']
Out[133]:
item1 item2
0 foo foo
1 bar bar
2 baz baz

In [134]: panel['b']
Out[134]:
item1 item2
0 -0.308853 -0.308853
1 -0.681087 -0.681087
2 0.377953 0.377953

In [135]: panel['b'].dtypes
Out[135]:
item1 float64
item2 float64
dtype: object

Note: Panel, being less commonly used than Series and DataFrame, has been slightly neglected feature-wise. A
number of methods and options available in DataFrame are not available in Panel.


From DataFrame using to_panel method

to_panel converts a DataFrame with a two-level index to a Panel.

In [136]: midx = pd.MultiIndex(levels=[['one', 'two'], ['x', 'y']],
.....:                      codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
.....:

In [137]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=midx)

In [138]: df.to_panel()
Out[138]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: A to B
Major_axis axis: one to two
Minor_axis axis: x to y

Item selection / addition / deletion

Similar to DataFrame functioning as a dict of Series, Panel is like a dict of DataFrames:

In [139]: wp['Item1']
Out[139]:
A B C D
2000-01-01 1.588931 0.476720 0.473424 -0.242861
2000-01-02 -0.014805 -0.284319 0.650776 -1.461665
2000-01-03 -1.137707 -0.891060 -0.693921 1.613616
2000-01-04 0.464000 0.227371 -0.496922 0.306389
2000-01-05 -2.290613 -1.134623 -1.561819 -0.260838

In [140]: wp['Item3'] = wp['Item1'] / wp['Item2']

The API for insertion and deletion is the same as for DataFrame. And as with DataFrame, if the item is a valid Python
identifier, you can access it as an attribute and tab-complete it in IPython.

Transposing

A Panel can be rearranged using its transpose method (which does not make a copy by default unless the data are
heterogeneous):

In [141]: wp.transpose(2, 0, 1)
Out[141]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 5 (minor_axis)
Items axis: A to D
Major_axis axis: Item1 to Item3
Minor_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00


Indexing / Selection

Operation                       Syntax             Result

Select item                     wp[item]           DataFrame
Get slice at major_axis label   wp.major_xs(val)   DataFrame
Get slice at minor_axis label   wp.minor_xs(val)   DataFrame

For example, using the earlier example data, we could do:

In [142]: wp['Item1']
Out[142]:
A B C D
2000-01-01 1.588931 0.476720 0.473424 -0.242861
2000-01-02 -0.014805 -0.284319 0.650776 -1.461665
2000-01-03 -1.137707 -0.891060 -0.693921 1.613616
2000-01-04 0.464000 0.227371 -0.496922 0.306389
2000-01-05 -2.290613 -1.134623 -1.561819 -0.260838

In [143]: wp.major_xs(wp.major_axis[2])
Out[143]:
Item1 Item2 Item3
A -1.137707 0.800193 -1.421791
B -0.891060 0.782098 -1.139320
C -0.693921 -1.069094 0.649074
D 1.613616 -1.099248 -1.467927

In [144]: wp.minor_axis
Out[144]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [145]: wp.minor_xs('C')
Out[145]:
Item1 Item2 Item3
2000-01-01 0.473424 -0.902937 -0.524316
2000-01-02 0.650776 -1.144073 -0.568824
2000-01-03 -0.693921 -1.069094 0.649074
2000-01-04 -0.496922 0.661084 -0.751678
2000-01-05 -1.561819 -1.056652 1.478083

Squeezing

Another way to change the dimensionality of an object is to squeeze it when it has a length-1 axis, which gives a
result similar to wp['Item1'].

In [146]: wp.reindex(items=['Item1']).squeeze()
Out[146]:
A B C D
2000-01-01 1.588931 0.476720 0.473424 -0.242861
2000-01-02 -0.014805 -0.284319 0.650776 -1.461665
2000-01-03 -1.137707 -0.891060 -0.693921 1.613616
2000-01-04 0.464000 0.227371 -0.496922 0.306389
2000-01-05 -2.290613 -1.134623 -1.561819 -0.260838

In [147]: wp.reindex(items=['Item1'], minor=['B']).squeeze()
Out[147]:
2000-01-01 0.476720
2000-01-02 -0.284319
2000-01-03 -0.891060
2000-01-04 0.227371
2000-01-05 -1.134623
Freq: D, Name: B, dtype: float64

Conversion to DataFrame

A Panel can be represented in 2D form as a hierarchically indexed DataFrame. See the section hierarchical indexing
for more on this. To convert a Panel to a DataFrame, use the to_frame method:

In [148]: panel = pd.Panel(np.random.randn(3, 5, 4), items=['one', 'two', 'three'],
.....:                  major_axis=pd.date_range('1/1/2000', periods=5),
.....:                  minor_axis=['a', 'b', 'c', 'd'])
.....:

In [149]: panel.to_frame()
Out[149]:
one two three
major minor
2000-01-01 a 0.493672 1.219492 -1.290493
b -2.461467 0.062297 0.787872
c -1.553902 -0.110388 1.515707
d 2.015523 -1.184357 -0.276487
2000-01-02 a -1.833722 -0.558081 -0.223762
b 1.771740 0.077849 1.397431
c -0.670027 0.629498 1.503874
d 0.049307 -1.035260 -0.478905
2000-01-03 a -0.521493 -0.438229 -0.135950
b -3.201750 0.503703 -0.730327
c 0.792716 0.413086 -0.033277
d 0.146111 -1.139050 0.281151
2000-01-04 a 1.903247 0.660342 -1.298915
b -0.747169 0.464794 -2.819487
c -0.309038 -0.309337 -0.851985
d 0.393876 -0.649593 -1.106952
2000-01-05 a 1.861468 0.683758 -0.937731
b 0.936527 -0.643834 -1.537770
c 1.255746 0.421287 0.555759
d -2.655452 1.032814 -2.277282

3.4.4 Deprecate Panel

Over the last few years, pandas has increased in both breadth and depth, with new features, datatype support, and
manipulation routines. As a result, supporting efficient indexing and functional routines for Series, DataFrame
and Panel has contributed to an increasingly fragmented and difficult-to-understand code base.
The 3-D structure of a Panel is much less common for many types of data analysis than the 1-D structure of the Series or
the 2-D structure of the DataFrame. Going forward, it makes sense for pandas to focus on these two structures exclusively.
Oftentimes, one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.
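
As a rough sketch of that idea, a 3-D NumPy array can be stored as a DataFrame whose rows carry a two-level MultiIndex. The array shape and the labels Item1 and Item2 below are made up for illustration and are not taken from the examples above:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data laid out as (items, major_axis, minor_axis).
values = np.arange(24, dtype=float).reshape(2, 3, 4)
items = ['Item1', 'Item2']
major = pd.date_range('2000-01-01', periods=3)
minor = list('ABCD')

# One row per (major, minor) pair, one column per item.
index = pd.MultiIndex.from_product([major, minor], names=['major', 'minor'])
frame = pd.DataFrame(values.reshape(len(items), -1).T,
                     index=index, columns=items)

# Recover the 2-D view of a single item (rows: major, columns: minor).
item1 = frame['Item1'].unstack('minor')

# Cross-section along the minor axis, comparable to Panel.minor_xs('C').
c_slice = frame.xs('C', level='minor')
```

The layout mirrors what to_frame() produces for a Panel, so code written against the MultiIndex form does not depend on Panel at all.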

In addition, the xarray package was built from the ground up specifically to support the multi-dimensional
analysis that is one of Panel’s main use cases. Here is a link to the xarray panel-transition documentation.
In [150]: import pandas.util.testing as tm

In [151]: p = tm.makePanel()

In [152]: p
Out[152]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 30 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-02-11 00:00:00
Minor_axis axis: A to D

Convert to a MultiIndex DataFrame.

In [153]: p.to_frame()
Out[153]:
ItemA ItemB ItemC
major minor
2000-01-03 A -0.390201 -1.624062 -0.605044
B 1.562443 0.483103 0.583129
C -1.085663 0.768159 -0.273458
D 0.136235 -0.021763 -0.700648
2000-01-04 A 1.207122 -0.758514 0.878404
B 0.763264 0.061495 -0.876690
C -1.114738 0.225441 -0.335117
D 0.886313 -0.047152 -1.166607
2000-01-05 A 0.178690 -0.560859 -0.921485
B 0.162027 0.240767 -1.919354
C -0.058216 0.543294 -0.476268
D -1.350722 0.088472 -0.367236
2000-01-06 A -1.004168 -0.589005 -0.200312
B -0.902704 0.782413 -0.572707
C -0.486768 0.771931 -1.765602
D -0.886348 -0.857435 1.296674
2000-01-07 A -1.377627 -1.070678 0.522423
B 1.106010 0.628462 -1.736484
C 1.685148 -0.968145 0.578223
D -1.013316 -2.503786 0.641385
2000-01-10 A 0.499281 -1.681101 0.722511
B -0.199234 -0.880627 -1.335113
C 0.112572 -1.176383 0.242697
D 1.920906 -1.058041 -0.779432
2000-01-11 A -1.405256 0.403776 -1.702486
B 0.458265 0.777575 -1.244471
C -1.495309 -3.192716 0.208129
D -0.388231 -0.657981 0.602456
2000-01-12 A 0.162565 0.609862 -0.709535
B 0.491048 -0.779367 0.347339
... ... ... ...
2000-02-02 C -0.303961 -0.463752 -0.288962
D 0.104050 1.116086 0.506445
2000-02-03 A -2.338595 -0.581967 -0.801820
B -0.557697 -0.033731 -0.176382
C 0.625555 -0.055289 0.875359
D 0.174068 -0.443915 1.626369
2000-02-04 A -0.374279 -1.233862 -0.915751
B 0.381353 -1.108761 -1.970108
C -0.059268 -0.360853 -0.614618
D -0.439461 -0.200491 0.429518
2000-02-07 A -2.359958 -3.520876 -0.288156
B 1.337122 -0.314399 -1.044208
C 0.249698 0.728197 0.565375
D -0.741343 1.092633 0.013910
2000-02-08 A -1.157886 0.516870 -1.199945
B -1.531095 -0.860626 -0.821179
C 1.103949 1.326768 0.068184
D -0.079673 -1.675194 -0.458272
2000-02-09 A -0.551865 0.343125 -0.072869
B 1.331458 0.370397 -1.914267
C -1.087532 0.208927 0.788871
D -0.922875 0.437234 -1.531004
2000-02-10 A 1.592673 2.137827 -1.828740
B -0.571329 -1.761442 -0.826439
C 1.998044 0.292058 -0.280343
D 0.303638 0.388254 -0.500569
2000-02-11 A 1.559318 0.452429 -1.716981
B -0.026671 -0.899454 0.124808
C -0.244548 -2.019610 0.931536
D -0.917368 0.479630 0.870690

[120 rows x 3 columns]

Alternatively, one can convert to an xarray DataArray.

In [154]: p.to_xarray()
Out[154]:
<xarray.DataArray (items: 3, major_axis: 30, minor_axis: 4)>
array([[[-0.390201, 1.562443, -1.085663, 0.136235],
[ 1.207122, 0.763264, -1.114738, 0.886313],
...,
[ 1.592673, -0.571329, 1.998044, 0.303638],
[ 1.559318, -0.026671, -0.244548, -0.917368]],

[[-1.624062, 0.483103, 0.768159, -0.021763],

[-0.758514, 0.061495, 0.225441, -0.047152],
...,
[ 2.137827, -1.761442, 0.292058, 0.388254],
[ 0.452429, -0.899454, -2.01961 , 0.47963 ]],

[[-0.605044, 0.583129, -0.273458, -0.700648],

[ 0.878404, -0.87669 , -0.335117, -1.166607],
...,
[-1.82874 , -0.826439, -0.280343, -0.500569],
[-1.716981, 0.124808, 0.931536, 0.87069 ]]])
Coordinates:
* items (items) object 'ItemA' 'ItemB' 'ItemC'
* major_axis (major_axis) datetime64[ns] 2000-01-03 2000-01-04 ... 2000-02-11
* minor_axis (minor_axis) object 'A' 'B' 'C' 'D'

See the full documentation for the xarray package for more details.

3.5 Comparison with other tools

3.5.1 Comparison with R / R libraries

Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this
page was started to provide a more detailed look at the R language and its many third party libraries as they relate to
pandas. In comparisons with R and CRAN libraries, we care about the following things:
• Functionality / flexibility: what can/cannot be done with each tool
• Performance: how fast are operations? Hard numbers/benchmarks are preferable
• Ease-of-use: is one tool easier/harder to use? (You may have to be the judge of this, given side-by-side code
comparisons.)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files, see External Compatibility
for an example.

Quick Reference

We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.

Querying, Filtering, Sampling

R                                   pandas
dim(df)                             df.shape
head(df)                            df.head()
slice(df, 1:10)                     df.iloc[:10]
filter(df, col1 == 1, col2 == 1)    df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,]    df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2)              df[['col1', 'col2']]
select(df, col1:col3)               df.loc[:, 'col1':'col3']
select(df, -(col1:col3))            df.drop(cols_to_drop, axis=1) but see [1]
distinct(select(df, col1))          df[['col1']].drop_duplicates()
distinct(select(df, col1, col2))    df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10)                    df.sample(n=10)
sample_frac(df, 0.01)               df.sample(frac=0.01)

Sorting

R                          pandas
arrange(df, col1, col2)    df.sort_values(['col1', 'col2'])
arrange(df, desc(col1))    df.sort_values('col1', ascending=False)
[1] R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns,
for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
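
One possible way to drop a named range of columns is to let a label slice produce the list of names first. This is only a sketch; the frame and its column names col0 to col4 are invented for the example:

```python
import pandas as pd

# Hypothetical frame with five columns.
df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['col0', 'col1', 'col2', 'col3', 'col4'])

# Label-based slices include both endpoints, so this picks col1, col2, col3.
to_drop = df.loc[:, 'col1':'col3'].columns

# Rough counterpart of dplyr's select(df, -(col1:col3)).
remaining = df.drop(to_drop, axis=1)
print(list(remaining.columns))  # ['col0', 'col4']
```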

Transforming

R                             pandas
select(df, col_one = col1)    df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1)    df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b)             df.assign(c=df.a-df.b)

Grouping and Summarizing

R                                             pandas
summary(df)                                   df.describe()
gdf <- group_by(df, col1)                     gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE))    df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1))               df.groupby('col1').sum()

Base R

Slicing with R’s c

R makes it easy to access data.frame columns by name

df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]

or by integer location
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward

In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

In [2]: df[['a', 'c']]

Out[2]:
a c
0 0.469112 -1.509059
1 -1.135632 -0.173215
2 0.119209 -0.861849
3 -2.104569 1.071804
4 0.721555 -1.039575
5 0.271860 0.567020
6 0.276232 -0.673690
7 0.113648 0.524988
8 0.404705 -1.715002
9 -1.039268 -1.157892

In [3]: df.loc[:, ['a', 'c']]

Out[3]:

a c
0 0.469112 -1.509059
1 -1.135632 -0.173215
2 0.119209 -0.861849
3 -2.104569 1.071804
4 0.721555 -1.039575
5 0.271860 0.567020
6 0.276232 -0.673690
7 0.113648 0.524988
8 0.404705 -1.715002
9 -1.039268 -1.157892

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer
attribute and numpy.r_.
In [4]: named = list('abcdefg')

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]

Out[8]:
a b c d e ... 25 26
˓→27 28 29
0 -1.344312 0.844885 1.075770 -0.109050 1.643563 ... -0.226169 0.410835 0.
˓→813850 0.132003 -0.827317
1 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 ... -1.110336 -0.619976 0.
˓→149748 -0.732339 0.687738
2 0.176444 0.403310 -0.154951 0.301624 -2.179861 ... 0.432390 1.519970 -0.
˓→493662 0.600178 0.274230
3 0.132885 -0.023688 2.410179 1.450520 0.206053 ... -0.281461 0.030711 0.
˓→109121 1.126203 -0.977349
4 1.474071 -0.064034 -1.282782 0.781836 -1.071357 ... -1.066969 -0.303421 -0.
˓→858447 0.306996 -0.028665
5 0.384316 1.574159 1.588931 0.476720 0.473424 ... 0.068159 -0.057873 -0.
˓→368204 -1.144073 0.861209
6 0.800193 0.782098 -1.069094 -1.099248 0.255269 ... 2.121453 0.597701 0.
˓→563700 0.967661 -1.057909
.. ... ... ... ... ... ... ... ... ..
˓→. ... ...
23 1.534417 -1.374226 -0.367477 0.782551 1.356489 ... -1.690959 0.961088 0.
˓→052372 1.166439 0.407281
24 0.859275 -0.995910 0.261263 1.783442 0.380989 ... 0.840316 0.638172 0.
˓→890673 -1.949397 -0.003437
25 1.492125 -0.068190 0.681456 1.221829 -0.434352 ... 0.042344 -0.307904 0.
˓→428572 0.880609 0.487645
26 0.725238 0.624607 -0.141185 -0.143948 -0.328162 ... 1.190624 0.778507 1.
˓→008500 1.424017 0.717110
27 1.262419 1.950057 0.301038 -0.933858 0.814946 ... 0.334281 -0.162227 1.
˓→007824 2.826008 1.458383
28 -1.585746 -0.899734 0.921494 -0.211762 -0.059182 ... -0.026602 -0.240481 0.
˓→577223 -1.088417 0.326687
29 -0.986248 0.169729 -1.158091 1.019673 0.646039 ... -0.671466 0.332872 -2.
˓→013086 -1.602549 0.333109

[30 rows x 16 columns]
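
To make the role of numpy.r_ more concrete, the following small sketch shows that it simply concatenates slices into one array of integer positions, which iloc then consumes. The 2 x 8 frame is invented for the example:

```python
import numpy as np
import pandas as pd

# np.r_ turns slice notation into a flat array of integer positions.
positions = np.r_[:3, 5:7]
print(positions)  # [0 1 2 5 6]

# Hypothetical frame with eight columns named a..h.
df = pd.DataFrame(np.arange(16).reshape(2, 8), columns=list('abcdefgh'))

# Select the noncontiguous columns by position.
subset = df.iloc[:, positions]
print(list(subset.columns))  # ['a', 'b', 'c', 'f', 'g']
```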

aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting
it into groups by1 and by2:

df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.

In [9]: df = pd.DataFrame(
...: {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
...: 'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
...: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
...: np.nan]})
...:

In [10]: g = df.groupby(['by1', 'by2'])

In [11]: g[['v1', 'v2']].mean()

Out[11]:
v1 v2
by1 by2
1 95 5.0 55.0
99 5.0 55.0
2 95 7.0 77.0
99 NaN NaN
big damp 3.0 33.0
blue dry 3.0 33.0
red red 4.0 44.0
wet 1.0 11.0

For more details and examples see the groupby documentation.
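
Note that the rows whose grouping keys are missing do not appear in the result above. A minimal sketch of that behavior, using a made-up four-row frame rather than the data from the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third row has a missing grouping key.
df = pd.DataFrame({'v1': [1.0, 3.0, 5.0, 7.0],
                   'by1': ['red', 'blue', np.nan, 'red']})

# By default, groupby leaves out rows whose key is NaN.
means = df.groupby('by1')['v1'].mean()
print(means.to_dict())  # {'blue': 3.0, 'red': 4.0}
```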

match / %in%

A common way to select data in R is using %in% which is defined using the function match. The operator %in% is
used to return a logical vector indicating if there is a match or not:

s <- 0:4
s %in% c(2,4)

The isin() method is similar to R’s %in% operator:

In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])

Out[13]:
0 False
1 False
2 True
3 False
4 True
dtype: bool

The match function returns a vector of the positions of matches of its first argument in its second:

s <- 0:4
match(s, c(2,4))

For more details and examples see the reshaping documentation.
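
The text above does not show a pandas counterpart for match. One way to approximate it, offered here as a sketch rather than as the documented equivalent, is Index.get_indexer(), which returns 0-based positions and -1 where R would return NA:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))
table = pd.Index([2, 4])

# Position of each element of s within table; -1 marks "no match".
positions = table.get_indexer(s.values)
print(positions)  # [-1 -1  0 -1  1]

# Optional: convert to R-style 1-based positions with NaN for no match.
one_based = np.where(positions >= 0, positions + 1.0, np.nan)
```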

tapply

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular.
Using a data.frame called baseball, and retrieving information based on the array team:

baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team,
max)

In pandas we may use the pivot_table() method to handle this:

In [14]: import random

In [15]: import string

In [16]: baseball = pd.DataFrame(

....: {'team': ["team %d" % (x + 1) for x in range(5)] * 5,
....: 'player': random.sample(list(string.ascii_lowercase), 25),
....: 'batting avg': np.random.uniform(.200, .400, 25)})
....:

In [17]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)

Out[17]:
team team 1 team 2 team 3 team 4 team 5
batting avg 0.352134 0.295327 0.397191 0.394457 0.396194

For more details and examples see the reshaping documentation.
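
A grouped maximum is another way to get the same per-team figure, and it returns a Series indexed by team, which is closer in shape to what tapply returns. The sketch below uses a small invented table instead of the random data above:

```python
import pandas as pd

# Hypothetical batting averages for two teams.
baseball = pd.DataFrame({
    'team': ['team 1', 'team 2', 'team 1', 'team 2'],
    'batting avg': [0.250, 0.310, 0.275, 0.290]})

# Highest batting average per team, as a Series indexed by team.
best = baseball.groupby('team')['batting avg'].max()
print(best.to_dict())  # {'team 1': 0.275, 'team 2': 0.31}
```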

subset

The query() method is similar to the base R subset function. In R you might want to get the rows of a
data.frame where one column’s values are less than another column’s values:

df <- data.frame(a=rnorm(10), b=rnorm(10))

subset(df, a <= b)
df[df$a <= df$b,] # note the comma

In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an
index/slice as well as standard boolean indexing:

In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [19]: df.query('a <= b')

Out[19]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550

In [20]: df[df.a <= df.b]

Out[20]:

a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550

In [21]: df.loc[df.a <= df.b]

Out[21]:

a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550

For more details and examples see the query documentation.
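
A query() expression can also refer to ordinary Python variables by prefixing them with @. The following sketch uses a three-row frame and a variable named limit, both invented for the example:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({'a': [1, 5, 3], 'b': [2, 4, 3]})
limit = 3

# Column names are used directly; @limit refers to the local variable.
result = df.query('a <= b and a < @limit')
print(result.index.tolist())  # [0]
```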

with

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

df <- data.frame(a=rnorm(10), b=rnorm(10))

with(df, a + b)
df$a + df$b # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [23]: df.eval('a + b')

Out[23]:
0 -0.091430
1 -2.483890
2 -0.252728
3 -0.626444
4 -0.261740
5 2.149503
6 -0.332214
7 0.799331
8 -2.377245
9 2.104677
dtype: float64

In [24]: df.a + df.b # same as the previous expression

Out[24]:

0 -0.091430
1 -2.483890
2 -0.252728
3 -0.626444
4 -0.261740
5 2.149503
6 -0.332214
7 0.799331
8 -2.377245
9 2.104677
dtype: float64

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the
eval documentation.

plyr

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data
structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how these data
structures could be mapped in Python.

R             Python
array         list
lists         dictionary or list of objects
data.frame    dataframe

ddply

An expression using a data.frame called df in R where you want to summarize x by month:

require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)

ddply(df, .(month, week), summarize,

mean = round(mean(x), 2),
sd = round(sd(x), 2))

In pandas the equivalent expression, using the groupby() method, would be:
In [25]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 120),
....: 'y': np.random.uniform(7., 334., 120),
....: 'z': np.random.uniform(1.7, 20.7, 120),
....: 'month': [5, 6, 7, 8] * 30,
....: 'week': np.random.randint(1, 4, 120)})
....:

In [26]: grouped = df.groupby(['month', 'week'])

In [27]: grouped['x'].agg([np.mean, np.std])

Out[27]:
mean std
month week
5 1 63.653367 40.601965
2 78.126605 53.342400
3 92.091886 57.630110
6 1 81.747070 54.339218
2 70.971205 54.687287
3 100.968344 54.010081
7 1 61.576332 38.844274
2 61.733510 48.209013
3 71.688795 37.595638
8 1 62.741922 34.618153
2 91.774627 49.790202
3 73.936856 60.773900

For more details and examples see the groupby documentation.
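
Two details of the R snippet are easy to miss when translating it: R’s sample(1:4, ...) can return 4, whereas np.random.randint excludes its upper bound, and the R code rounds the summaries to two decimals. The sketch below is one way to mirror both; the seed and the use of RandomState are choices made here for reproducibility, not part of the original example:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, chosen only for this sketch

df = pd.DataFrame({'x': rng.uniform(1., 168., 120),
                   'month': [5, 6, 7, 8] * 30,
                   # upper bound is exclusive, so this draws weeks 1..4
                   'week': rng.randint(1, 5, 120)})

# Mean and standard deviation of x per (month, week), rounded as in R.
summary = df.groupby(['month', 'week'])['x'].agg(['mean', 'std']).round(2)
```

Passing the aggregation names as strings ('mean', 'std') selects the built-in pandas implementations.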

reshape / reshape2

melt.array

An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))

In Python, since a is a NumPy array, you can simply use a list comprehension over np.ndenumerate().

In [28]: a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])

Out[29]:
0 1 2 3
0 0 0 0 1.0
1 0 0 1 2.0
2 0 0 2 3.0
3 0 0 3 4.0
4 0 1 0 5.0
5 0 1 1 6.0
6 0 1 2 7.0
.. .. .. .. ...
17 1 1 1 18.0
18 1 1 2 19.0
19 1 1 3 20.0
20 1 2 0 21.0
21 1 2 1 22.0
22 1 2 2 23.0
23 1 2 3 NaN

[24 rows x 4 columns]
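
An alternative that avoids the Python-level loop is to build the index with MultiIndex.from_product() and flatten the array. This is a sketch, and the column names d0, d1, d2 and value are invented labels:

```python
import numpy as np
import pandas as pd

a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

# One index level per array dimension; from_product varies the last level
# fastest, which matches the C-order flattening done by ravel().
idx = pd.MultiIndex.from_product([range(n) for n in a.shape],
                                 names=['d0', 'd1', 'd2'])
melted = pd.Series(a.ravel(), index=idx, name='value').reset_index()

print(melted.shape)  # (24, 4)
```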

melt.list

An expression using a list called a in R where you want to melt it into a data.frame:

a <- as.list(c(1:4, NA))

data.frame(melt(a))

In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a dataframe as required.

In [30]: a = list(enumerate(list(range(1, 5)) + [np.nan]))

In [31]: pd.DataFrame(a)
Out[31]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN

For more details and examples see the Intro to Data Structures documentation.

melt.data.frame

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

cheese <- data.frame(

first = c('John', 'Mary'),
last = c('Doe', 'Bo'),
height = c(5.5, 6.0),
weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))

In Python, the melt() function is the equivalent of R’s melt:

In [32]: cheese = pd.DataFrame({'first': ['John', 'Mary'],

....: 'last': ['Doe', 'Bo'],
....: 'height': [5.5, 6.0],
....: 'weight': [130, 150]})
....:

In [33]: pd.melt(cheese, id_vars=['first', 'last'])

Out[33]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0

In [34]: cheese.set_index(['first', 'last']).stack() # alternative way

Out[34]:

first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64

For more details and examples see the reshaping documentation.

cast

In R, acast casts a data.frame called df into a higher dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)

mdf <- melt(df, id=c("month", "week"))

acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of pivot_table():

In [35]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 12),
....: 'y': np.random.uniform(7., 334., 12),
....: 'z': np.random.uniform(1.7, 20.7, 12),
....: 'month': [5, 6, 7] * 4,
....: 'week': [1, 2] * 6})
....:

In [36]: mdf = pd.melt(df, id_vars=['month', 'week'])

In [37]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],

....: columns=['month'], aggfunc=np.mean)
....:
Out[37]:
month 5 6 7
variable week
x 1 93.888747 98.762034 55.219673
2 94.391427 38.112932 83.942781
y 1 94.306912 279.454811 227.840449
2 87.392662 193.028166 173.899260
z 1 11.016009 10.079307 16.170549
2 8.476111 17.638509 19.003494

Similarly for dcast which uses a data.frame called df in R to aggregate information based on Animal and
FeedType:

df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)

dcast(df, Animal ~ FeedType, sum, fill=NaN)

# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))

Python can approach this in two different ways. Firstly, similar to above using pivot_table():

In [38]: df = pd.DataFrame({
....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
....: 'Animal2', 'Animal3'],
....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
....: })
....:

In [39]: df.pivot_table(values='Amount', index='Animal', columns='FeedType',

....: aggfunc='sum')
....:
Out[39]:
FeedType A B
Animal
Animal1 10.0 5.0
Animal2 2.0 13.0
Animal3 6.0 NaN

The second approach is to use the groupby() method:

In [40]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()

Out[40]:
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
Name: Amount, dtype: int64

For more details and examples see the reshaping documentation or the groupby documentation.
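
The two approaches are closely related: unstacking the grouped sums gives the same wide table that pivot_table() produced. The sketch below rebuilds the Animal data and also shows fill_value, which plays a role similar to the fill argument of dcast:

```python
import pandas as pd

df = pd.DataFrame({
    'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
               'Animal2', 'Animal3'],
    'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
    'Amount': [10, 7, 4, 2, 5, 6, 2]})

sums = df.groupby(['Animal', 'FeedType'])['Amount'].sum()

# Move FeedType from the row index into the columns.
wide = sums.unstack()

# Same reshape, but missing combinations become 0 instead of NaN.
filled = sums.unstack(fill_value=0)
```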

factor

pandas has a data type for categorical data.

cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))

In pandas this is accomplished with pd.cut and astype("category"):

In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)

Out[41]:
0 (0.995, 2.667]
1 (0.995, 2.667]
2 (2.667, 4.333]
3 (2.667, 4.333]
4 (4.333, 6.0]
5 (4.333, 6.0]
dtype: category
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")

Out[42]:

0 1
1 2
2 3
3 2
4 2
5 3
dtype: category
Categories (3, int64): [1, 2, 3]

For more details and examples see the categorical introduction and the API documentation. There is also documentation
on the differences from R’s factor.
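
For readers used to inspecting a factor’s levels and integer codes, the .cat accessor exposes the corresponding pieces. This is a small sketch with made-up values; note that pandas codes are 0-based, while R’s integer codes start at 1:

```python
import pandas as pd

s = pd.Series(['b', 'a', 'c', 'a']).astype('category')

print(list(s.cat.categories))  # ['a', 'b', 'c']
print(list(s.cat.codes))       # [1, 0, 2, 0]

# pd.cut can attach readable labels to the bins (label names are invented).
binned = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3,
                labels=['low', 'mid', 'high'])
print(list(binned))  # ['low', 'low', 'mid', 'mid', 'high', 'high']
```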

3.5.2 Comparison with SQL

Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of
how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame
called tips and assume we have a database table of the same name and structure.

In [3]: url = ('https://raw.github.com/pandas-dev'

...: '/pandas/master/pandas/tests/data/tips.csv')
...:

In [4]: tips = pd.read_csv(url)

In [5]: tips.head()
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

SELECT

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):

SELECT total_bill, tip, smoker, time

FROM tips
LIMIT 5;

With pandas, column selection is done by passing a list of column names to your DataFrame:

In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner

Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).

WHERE

Filtering in SQL is done via a WHERE clause.

SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;

DataFrames can be filtered in multiple ways, the most intuitive of which is boolean indexing.

In [7]: tips[tips['time'] == 'Dinner'].head(5)

Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with
True.

In [8]: is_dinner = tips['time'] == 'Dinner'

In [9]: is_dinner.value_counts()
Out[9]:
True 176
False 68
Name: time, dtype: int64

In [10]: tips[is_dinner].head(5)
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).

-- tips of more than $5.00 at Dinner meals

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;

# tips of more than $5.00 at Dinner meals

In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
Out[11]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3

-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;

# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
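
The parentheses around each condition matter: in Python, & and | bind more tightly than comparison operators such as == and >, so each comparison has to be wrapped. The sketch below repeats both filters on a four-row stand-in for the tips table; the rows are a hand-picked sample, not the full dataset:

```python
import pandas as pd

# Hypothetical miniature version of the tips table.
tips = pd.DataFrame({'total_bill': [16.99, 48.27, 29.80, 10.34],
                     'tip': [1.01, 6.73, 4.20, 1.66],
                     'time': ['Dinner', 'Dinner', 'Lunch', 'Dinner'],
                     'size': [2, 4, 6, 3]})

# WHERE time = 'Dinner' AND tip > 5.00
dinner_big_tip = tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]

# WHERE size >= 5 OR total_bill > 45
large_or_costly = tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]

print(dinner_big_tip.index.tolist())   # [1]
print(large_or_costly.index.tolist())  # [1, 2]
```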

NULL checking is done using the notna() and isna() methods.

In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.nan, 'C', 'D'],
....: 'col2': ['F', np.nan, 'G', 'H', 'I']})
....:

In [14]: frame
Out[14]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I

Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS
NULL with the following query:

SELECT *
FROM frame
WHERE col2 IS NULL;

In [15]: frame[frame['col2'].isna()]
Out[15]:
col1 col2
1 B NaN

Getting items where col1 IS NOT NULL can be done with notna().

SELECT *
FROM frame
WHERE col1 IS NOT NULL;

In [16]: frame[frame['col1'].notna()]
Out[16]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I

GROUP BY

In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method.
groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function (typically
aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a
query getting us the number of tips left by sex:

SELECT sex, count(*)

FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/

The pandas equivalent would be:

In [17]: tips.groupby('sex').size()
Out[17]:
sex
Female 87
Male 157
dtype: int64

Notice that in the pandas code we used size() and not count(). This is because count() applies the function
to each column, returning the number of non-null records within each.

In [18]: tips.groupby('sex').count()
Out[18]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157

Alternatively, we could have applied the count() method to an individual column:

In [19]: tips.groupby('sex')['total_bill'].count()
Out[19]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
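
The tips data has no missing values, so size() and count() agree above. The difference only shows up when a column contains nulls, as in this sketch with a made-up three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing tip.
df = pd.DataFrame({'sex': ['Female', 'Female', 'Male'],
                   'tip': [1.0, np.nan, 2.0]})

# size() counts rows per group, like COUNT(*).
sizes = df.groupby('sex').size()

# count() counts non-null values, like COUNT(tip).
counts = df.groupby('sex')['tip'].count()

print(sizes.to_dict())   # {'Female': 2, 'Male': 1}
print(counts.to_dict())  # {'Female': 1, 'Male': 1}
```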

Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of
the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to
specific columns.

SELECT day, AVG(tip), COUNT(*)

FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/

In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})

Out[20]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62

Grouping by more than one column is done by passing a list of columns to the groupby() method.

SELECT smoker, day, COUNT(*), AVG(tip)

FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/

In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})

Out[21]:
tip
size mean
smoker day
No Fri 4.0 2.812500
Sat 45.0 3.102889
Sun 57.0 3.167895
Thur 45.0 2.673778
Yes Fri 15.0 2.714000
Sat 42.0 2.875476
Sun 19.0 3.516842
Thur 17.0 3.030000

JOIN

JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices.
Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or
the columns to join on (column names or indices).

In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],

....: 'value': np.random.randn(4)})
....:

In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],

....: 'value': np.random.randn(4)})
....:

Assume we have two database tables of the same name and structure as our DataFrames.

Now let’s go over the various types of JOINs.

INNER JOIN

SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;

# merge performs an INNER JOIN by default

In [24]: pd.merge(df1, df2, on='key')
Out[24]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209

merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s
index.
In [25]: indexed_df2 = df2.set_index('key')

In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)

Out[26]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
3 D -1.135632 0.119209

LEFT OUTER JOIN

-- show all records from df1

SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;

# show all records from df1

In [27]: pd.merge(df1, df2, on='key', how='left')
Out[27]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

RIGHT JOIN

-- show all records from df2

SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;

# show all records from df2

In [28]: pd.merge(df1, df2, on='key', how='right')
Out[28]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236

FULL JOIN

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a
match. At the time of writing, FULL JOINs are not supported in every RDBMS (MySQL, for example).
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;

# show all records from both frames

In [29]: pd.merge(df1, df2, on='key', how='outer')
Out[29]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
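To check which side each row of an outer join came from, merge() accepts an indicator argument. The following is a minimal sketch; the fixed values are placeholders chosen for reproducibility, not the random values used above.

```python
import pandas as pd

# Placeholder values; only the keys matter for this illustration.
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': [5.0, 6.0, 7.0, 8.0]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'.
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Count how many rows came from each side.
origin_counts = outer['_merge'].value_counts()
print(outer)
```

Rows marked left_only or right_only are the ones an INNER JOIN would have dropped.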

UNION

UNION ALL can be performed using concat().

In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
....: 'rank': range(1, 4)})
....:

In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],

....: 'rank': [1, 4, 5]})
....:

SELECT city, rank

FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/

In [32]: pd.concat([df1, df2])

Out[32]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5

SQL’s UNION is similar to UNION ALL; however, UNION will remove duplicate rows.
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/

In pandas, you can use concat() in conjunction with drop_duplicates().

In [33]: pd.concat([df1, df2]).drop_duplicates()
Out[33]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
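Note that the result keeps the original row labels of each input, so labels can repeat or skip. If a clean 0..n-1 index is preferred, one option is to reset it after de-duplicating, as in this short sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                    'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                    'rank': [1, 4, 5]})

# UNION: stack both frames, drop duplicate rows,
# then discard the old row labels.
union = (pd.concat([df1, df2])
           .drop_duplicates()
           .reset_index(drop=True))
print(union)
```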

Pandas equivalents for some SQL analytic and aggregate functions

Top N rows with offset

-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;

In [34]: tips.nlargest(10 + 5, columns='tip').tail(10)

Out[34]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
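An alternative way to express LIMIT with OFFSET is to sort and then slice by position. The sketch below uses a small hypothetical frame with distinct tip values (ties could be ordered differently by the two approaches); the names limit and offset are illustrative only.

```python
import pandas as pd

# Hypothetical data with distinct values, so the ordering is unambiguous.
df = pd.DataFrame({'tip': [1.0, 7.5, 3.2, 9.0, 5.1, 2.4, 6.3]})

limit, offset = 3, 2

# ORDER BY tip DESC LIMIT 3 OFFSET 2, via sort and positional slice.
page = df.sort_values('tip', ascending=False).iloc[offset:offset + limit]

# The nlargest/tail formulation selects the same rows here.
page_nlargest = df.nlargest(limit + offset, columns='tip').tail(limit)
print(page)
```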

Top N rows per group

-- Oracle's ROW_NUMBER() analytic function

SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;

In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)

....: .groupby(['day'])
....: .cumcount() + 1)
....: .query('rn < 3')
....: .sort_values(['day', 'rn']))
....:
Out[35]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2

The same result can be obtained using the rank(method='first') function:

In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
....: .rank(method='first', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['day', 'rnk']))
....:
Out[36]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0

-- Oracle's RANK() analytic function

SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;

Let’s find tips with (rank < 3) per gender group for (tips < 2). Notice that when using the rank(method='min')
function, rnk_min remains the same for rows with the same tip (as with Oracle’s RANK() function).

In [37]: (tips[tips['tip'] < 2]

....: .assign(rnk_min=tips.groupby(['sex'])['tip']
....: .rank(method='min'))
....: .query('rnk_min < 3')
....: .sort_values(['sex', 'rnk_min']))
....:
Out[37]:
total_bill tip sex smoker day time size rnk_min
67 3.07 1.00 Female Yes Sat Dinner 1 1.0
92 5.75 1.00 Female Yes Fri Dinner 2 1.0
111 7.25 1.00 Female No Sat Dinner 1 1.0
236 12.60 1.00 Male Yes Sat Dinner 2 1.0
237 32.83 1.17 Male Yes Sat Dinner 2 2.0

UPDATE

UPDATE tips
SET tip = tip*2
WHERE tip < 2;

In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2

DELETE

DELETE FROM tips

WHERE tip > 9;

In pandas we select the rows that should remain, instead of deleting them:

In [39]: tips = tips.loc[tips['tip'] <= 9]
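A minimal sketch of both statements on a small hypothetical frame (the values are made up for illustration) shows the effect step by step:

```python
import pandas as pd

# Hypothetical tips: two below 2, one above 9.
df = pd.DataFrame({'tip': [1.0, 1.5, 3.0, 10.0]})

# UPDATE ... SET tip = tip*2 WHERE tip < 2
df.loc[df['tip'] < 2, 'tip'] *= 2

# DELETE ... WHERE tip > 9: keep the complement of the condition.
df = df.loc[df['tip'] <= 9].reset_index(drop=True)
print(df)
```

The boolean mask in the UPDATE step is evaluated once, before the assignment, so the doubled values are not doubled again.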

3.5.3 Comparison with SAS

For potential users coming from SAS this page is meant to demonstrate how different SAS operations would be
performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays
the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or
terminal) - the equivalent in SAS would be:

proc print data=df(obs=5);

run;

Data Structures

General Terminology Translation

pandas       SAS
DataFrame    data set
column       variable
row          observation
groupby      BY-group
NaN          .

DataFrame / Series

A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns that
can be of different types. As will be shown in this document, almost any operation that can be applied to a data set
using SAS’s DATA step, can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data
structure for a single column, but in general, working with a Series is analogous to referencing a column in the
DATA step.

Index

Every DataFrame and Series has an Index, which holds the labels on the rows of the data. SAS does not have an
exactly analogous concept. A data set’s rows are essentially unlabeled, other than an implicit integer index that can be
accessed during the DATA step (_N_).

In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on).
While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part
of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as
a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.

Data Input / Output

Constructing a DataFrame from Values

A SAS data set can be built from specified values by placing the data after a datalines statement and specifying
the column names.
data df;
input x y;
datalines;
1 2
3 4
5 6
;
run;

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.
In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})

In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6

Reading External Data

Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within the pandas
tests (csv), will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.
proc import datafile='tips.csv' dbms=csv out=tips replace;
getnames=yes;
run;

The pandas method is read_csv(), which works similarly.

In [5]: url = ('https://raw.github.com/pandas-dev/'
...: 'pandas/master/pandas/tests/data/tips.csv')
...:

In [6]: tips = pd.read_csv(url)

In [7]: tips.head()

Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed. For
example, if the data was instead tab delimited, and did not have column names, the pandas command would be:
tips = pd.read_csv('tips.csv', sep='\t', header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter

tips = pd.read_table('tips.csv', header=None)
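The following sketch applies the same options to an in-memory buffer, so it can be run without a file on disk. The sample rows and the column names passed through names are assumptions made for illustration.

```python
import io

import pandas as pd

# Two tab-delimited rows without a header line (made-up sample).
raw = "16.99\t1.01\tFemale\n10.34\t1.66\tMale\n"

# header=None tells the parser there is no header row;
# names supplies column labels instead of the default 0, 1, 2.
df = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                 names=['total_bill', 'tip', 'sex'])
print(df)
```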

In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL databases. These
are all read via a pd.read_* function. See the IO documentation for more details.

Exporting Data

The inverse of PROC IMPORT in SAS is PROC EXPORT

proc export data=tips outfile='tips2.csv' dbms=csv;
run;

Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar API.
tips.to_csv('tips2.csv')

Data Operations

Operations on Columns

In the DATA step, arbitrary math expressions can be used on new or existing columns.
data tips;
set tips;
total_bill = total_bill - 2;
new_bill = total_bill / 2;
run;

pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way.
In [8]: tips['total_bill'] = tips['total_bill'] - 2

In [9]: tips['new_bill'] = tips['total_bill'] / 2.0

In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295

Filtering

Filtering in SAS is done with an if or where statement, on one or more columns.

data tips;
set tips;
if total_bill > 10;
run;

data tips;
set tips;
where total_bill > 10;
/* equivalent in this case - where happens before the
DATA step begins and can also be used in PROC statements */
run;

DataFrames can be filtered in multiple ways, the most intuitive of which is boolean indexing:
In [11]: tips[tips['total_bill'] > 10].head()
Out[11]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
5 23.29 4.71 Male No Sun Dinner 4

If/Then Logic

In SAS, if/then logic can be used to create new columns.

data tips;
set tips;
format bucket $4.;

if total_bill < 10 then bucket = 'low';

else bucket = 'high';
run;

The same operation in pandas can be accomplished using the where function from NumPy.
In [12]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')

In [13]: tips.head()
Out[13]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
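np.where covers a two-way if/else. For an if / else if / else chain, one possible approach is NumPy's select function, sketched below on hypothetical data; the 'medium' bucket and the threshold of 20 are assumptions added for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical bills covering each bucket.
df = pd.DataFrame({'total_bill': [5.0, 12.0, 25.0]})

# Conditions are checked in order; the first match wins.
conditions = [df['total_bill'] < 10, df['total_bill'] < 20]
choices = ['low', 'medium']

# Rows matching no condition get the default value.
df['bucket'] = np.select(conditions, choices, default='high')
print(df)
```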

Date Functionality

SAS provides a variety of functions to do operations on date/datetime columns.

data tips;
set tips;
format date1 date2 date1_next mmddyy10.;
date1 = mdy(1, 15, 2013);
date2 = mdy(2, 15, 2015);
date1_year = year(date1);
date2_month = month(date2);
* shift date to beginning of next interval;
date1_next = intnx('MONTH', date1, 1);
* count intervals between dates;
months_between = intck('MONTH', date1, date2);
run;

The equivalent pandas operations are shown below. In addition to these functions pandas supports other Time Series
features not available in Base SAS (such as resampling and custom offsets) - see the timeseries documentation for
more details.

In [14]: tips['date1'] = pd.Timestamp('2013-01-15')

In [15]: tips['date2'] = pd.Timestamp('2015-02-15')

In [16]: tips['date1_year'] = tips['date1'].dt.year

In [17]: tips['date2_month'] = tips['date2'].dt.month

In [18]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()

In [19]: tips['months_between'] = (
....: tips['date2'].dt.to_period('M') - tips['date1'].dt.to_period('M'))
....:

In [20]: tips[['date1', 'date2', 'date1_year', 'date2_month',

....: 'date1_next', 'months_between']].head()
....:
Out[20]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
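In the output above, months_between holds offset objects rather than plain numbers. If an integer count of calendar months is wanted, one simple option is to compute it from the year and month components, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({'date1': pd.to_datetime(['2013-01-15']),
                   'date2': pd.to_datetime(['2015-02-15'])})

# Whole calendar months between the two dates, ignoring the day.
df['months_between'] = (
    (df['date2'].dt.year - df['date1'].dt.year) * 12
    + (df['date2'].dt.month - df['date1'].dt.month))
print(df)
```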

Selection of Columns

SAS provides keywords in the DATA step to select, drop, and rename columns.

data tips;
set tips;
keep sex total_bill tip;
run;

data tips;
set tips;
drop sex;
run;

data tips;
set tips;
rename total_bill=total_bill_2;
run;

The same operations are expressed in pandas below.

# keep
In [21]: tips[['sex', 'total_bill', 'tip']].head()
Out[21]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61

# drop
In [22]: tips.drop('sex', axis=1).head()
Out[22]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4

# rename
In [23]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[23]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4

Sorting by Values

Sorting in SAS is accomplished via PROC SORT

proc sort data=tips;

by sex total_bill;
run;

pandas objects have a sort_values() method, which takes a list of columns to sort by.

In [24]: tips = tips.sort_values(['sex', 'total_bill'])

In [25]: tips.head()
Out[25]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2

String Processing

Length

SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN excludes
trailing blanks and LENGTHC includes trailing blanks.

data _null_;
set tips;
put(LENGTHN(time));
put(LENGTHC(time));
run;

Python determines the length of a character string with the len function. len includes trailing blanks. Use len and
rstrip to exclude trailing blanks.

In [26]: tips['time'].str.len().head()
Out[26]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64

In [27]: tips['time'].str.rstrip().str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
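The tips data has no trailing blanks, so both calls above return the same lengths. A small hypothetical Series with padded values makes the difference visible:

```python
import pandas as pd

# 'Dinner' padded with two trailing blanks (made-up example).
s = pd.Series(['Dinner  ', 'Lunch'])

# len counts every character, including trailing blanks.
with_blanks = s.str.len()

# Stripping on the right first excludes trailing blanks.
without_blanks = s.str.rstrip().str.len()
print(with_blanks.tolist(), without_blanks.tolist())
```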

Find

SAS determines the position of a character in a string with the FINDW function. FINDW takes the string defined by
the first argument and searches for the first position of the substring you supply as the second argument.

data _null_;
set tips;
put(FINDW(sex,'ale'));
run;

Python determines the position of a character in a string with the find function. find searches for the first position
of the substring. If the substring is found, the function returns its position. Keep in mind that Python indexes are
zero-based and the function will return -1 if it fails to find the substring.

In [28]: tips['sex'].str.find("ale").head()
Out[28]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
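The following sketch, on a hypothetical Series, shows both the zero-based positions and the -1 result for a missing substring. Adding 1 yields a one-based position in which 0 means "not found".

```python
import pandas as pd

# The last value does not contain the substring (made-up example).
s = pd.Series(['Female', 'Male', 'Unknown'])

# Zero-based position of the first match, or -1 when absent.
zero_based = s.str.find('ale')

# One-based position, with 0 for "not found".
one_based = zero_based + 1
print(zero_based.tolist(), one_based.tolist())
```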

Substring

SAS extracts a substring from a string based on its position with the SUBSTR function.

data _null_;
set tips;
put(substr(sex,1,1));
run;

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that
Python indexes are zero-based.

In [29]: tips['sex'].str[0:1].head()
Out[29]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object

Scan

The SAS SCAN function returns the nth word from a string. The first argument is the string you want to parse and the
second argument specifies which word you want to extract.

data firstlast;
input String $60.;
First_Name = scan(string, 1);
Last_Name = scan(string, -1);
datalines2;
John Smith;
Jane Cook;
;;;
run;

pandas can extract the nth word by splitting the string on a delimiter with str.split and selecting a position. There are
much more powerful approaches, such as regular expressions, but this just shows a simple approach.

In [30]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})

In [31]: firstlast['First_Name'] = firstlast['String'].str.split(" ", expand=True)[0]

In [32]: firstlast['Last_Name'] = firstlast['String'].str.rsplit(" ", expand=True)[1]

In [33]: firstlast
Out[33]:
String First_Name Last_Name
0 John Smith John Smith
1 Jane Cook Jane Cook
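When the number of words varies between rows, indexing the split result through the .str accessor may be more convenient, since a negative position counts from the end much like scan(string, -1). This is a sketch on hypothetical names:

```python
import pandas as pd

# The second name has three words (made-up example).
s = pd.Series(['John Smith', 'Mary Ann Lee'])

# Split on whitespace into a list of words per row.
words = s.str.split()

# .str[n] picks the nth element of each list; -1 is the last word.
first = words.str[0]
last = words.str[-1]
print(first.tolist(), last.tolist())
```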

Upcase, Lowcase, and Propcase

The SAS UPCASE, LOWCASE, and PROPCASE functions change the case of the argument.

data firstlast;
input String $60.;
string_up = UPCASE(string);
string_low = LOWCASE(string);
string_prop = PROPCASE(string);
datalines2;
John Smith;
Jane Cook;
;;;
run;

The equivalent Python functions are upper, lower, and title.

In [34]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})

In [35]: firstlast['string_up'] = firstlast['String'].str.upper()

In [36]: firstlast['string_low'] = firstlast['String'].str.lower()

In [37]: firstlast['string_prop'] = firstlast['String'].str.title()

In [38]: firstlast
Out[38]:
String string_up string_low string_prop
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook

Merging

The following tables will be used in the merge examples

In [39]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],

....: 'value': np.random.randn(4)})
....:

In [40]: df1
Out[40]:
key value
0 A 0.469112
1 B -0.282863
2 C -1.509059
3 D -1.135632

In [41]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],

....: 'value': np.random.randn(4)})
....:

In [42]: df2
Out[42]:
key value
0 B 1.212112
1 D -0.173215
2 D 0.119209
3 E -1.044236

In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the in= dummy
variables to track whether a match was found in one or both input frames.

proc sort data=df1;

by key;
run;

proc sort data=df2;

by key;
run;

data left_join inner_join right_join outer_join;

merge df1(in=a) df2(in=b);

if a and b then output inner_join;

if a then output left_join;
if b then output right_join;
if a or b then output outer_join;
run;

pandas DataFrames have a merge() method, which provides similar functionality. Note that the data does not have
to be sorted ahead of time, and different join types are accomplished via the how keyword.

In [43]: inner_join = df1.merge(df2, on=['key'], how='inner')

In [44]: inner_join
Out[44]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209

In [45]: left_join = df1.merge(df2, on=['key'], how='left')

In [46]: left_join
Out[46]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

In [47]: right_join = df1.merge(df2, on=['key'], how='right')

In [48]: right_join
Out[48]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236

In [49]: outer_join = df1.merge(df2, on=['key'], how='outer')

In [50]: outer_join
Out[50]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

Missing Data

Like SAS, pandas has a representation for missing data - which is the special float value NaN (not a number). Many
of the semantics are the same, for example missing data propagates through numeric operations, and is ignored by
default for aggregations.

In [51]: outer_join
Out[51]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

In [52]: outer_join['value_x'] + outer_join['value_y']
Out[52]:
0 NaN
1 0.929249
2 NaN
3 -1.308847
4 -1.016424
5 NaN
dtype: float64

In [53]: outer_join['value_x'].sum()
Out[53]: -3.5940742896293765

One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could do this
to filter missing values.

data outer_join_nulls;
set outer_join;
if value_x = .;
run;

data outer_join_no_nulls;
set outer_join;
if value_x ^= .;
run;

This doesn’t work in pandas. Instead, the pd.isna or pd.notna functions should be used for comparisons.

In [54]: outer_join[pd.isna(outer_join['value_x'])]
Out[54]:
key value_x value_y
5 E NaN -1.044236

In [55]: outer_join[pd.notna(outer_join['value_x'])]
Out[55]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
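The reason an equality test fails is that NaN never compares equal to anything, including itself. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# One missing value at row label 1 (made-up example).
df = pd.DataFrame({'value_x': [0.5, np.nan, -1.2]})

# Comparing against NaN is False for every row, so nothing is selected.
by_equality = df[df['value_x'] == np.nan]

# pd.isna tests for missingness directly.
by_isna = df[pd.isna(df['value_x'])]
print(len(by_equality), by_isna.index.tolist())
```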

pandas also provides a variety of methods to work with missing data - some of which would be challenging to express
in SAS. For example, there are methods to drop all rows with any missing values, to replace missing values with a
specified value, like the mean, or to forward fill from previous rows. See the missing data documentation for more.

In [56]: outer_join.dropna()
Out[56]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

In [57]: outer_join.fillna(method='ffill')
Out[57]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E -1.135632 -1.044236

In [58]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[58]:
0 0.469112
1 -0.282863
2 -1.509059
3 -1.135632
4 -1.135632
5 -0.718815
Name: value_x, dtype: float64
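Forward filling can also be written with the ffill() method, a spelling that newer pandas releases appear to prefer over fillna(method='ffill'). A sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in both columns.
df = pd.DataFrame({'value_x': [0.5, np.nan, -1.2, np.nan],
                   'value_y': [np.nan, 1.0, np.nan, 2.0]})

# Each missing value takes the last valid value above it;
# a leading NaN has nothing to copy from and stays missing.
filled = df.ffill()
print(filled)
```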

GroupBy

Aggregation

SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on numeric
columns.

proc summary data=tips nway;

class sex smoker;
var total_bill tip;
output out=tips_summed sum=;
run;

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for
more details and examples.

In [59]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()

In [60]: tips_summed.head()
Out[60]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
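The result above carries the group keys in its index. If a flat table with the keys as ordinary columns is preferred, closer in shape to a SAS output data set, passing as_index=False is one option. The data below is hypothetical:

```python
import pandas as pd

# Made-up miniature of the tips data.
df = pd.DataFrame({'sex': ['Female', 'Female', 'Male', 'Male'],
                   'smoker': ['No', 'Yes', 'No', 'No'],
                   'total_bill': [10.0, 20.0, 30.0, 40.0],
                   'tip': [1.0, 2.0, 3.0, 4.0]})

# as_index=False keeps 'sex' and 'smoker' as regular columns.
summed = (df.groupby(['sex', 'smoker'], as_index=False)
            [['total_bill', 'tip']].sum())
print(summed)
```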

Transformation

In SAS, if the group aggregations need to be used with the original frame, they must be merged back together. For
example, to subtract the mean for each observation by smoker group:

proc summary data=tips missing nway;

class smoker;
var total_bill;
output out=smoker_means mean(total_bill)=group_bill;
run;

proc sort data=tips;

by smoker;
run;

data tips;
merge tips(in=a) smoker_means(in=b);
by smoker;
adj_total_bill = total_bill - group_bill;
if a and b;
run;

pandas groupby provides a transform mechanism that allows these types of operations to be succinctly expressed
in one operation.

In [61]: gb = tips.groupby('smoker')['total_bill']

In [62]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')

In [63]: tips.head()
Out[63]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278

By Group Processing

In addition to aggregation, pandas groupby can be used to replicate most other by group processing from SAS. For
example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.

proc sort data=tips;

by sex smoker;
run;

data tips_first;
set tips;
by sex smoker;
if FIRST.sex or FIRST.smoker then output;
run;

In pandas this would be written as:

In [64]: tips.groupby(['sex', 'smoker']).first()

Out[64]:
total_bill tip day time size adj_total_bill
sex smoker
Female No 5.25 1.00 Sat Dinner 1 -11.938278
Yes 1.07 1.00 Sat Dinner 1 -17.686344
Male No 5.51 2.00 Thur Lunch 2 -11.678278
Yes 5.25 5.15 Sun Dinner 2 -13.506344
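Note that first() returns the first non-null value of each column within a group, which can differ from the first row when data is missing. If the literal first row per group is wanted, head(1) is one option. A small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Group 'a' starts with a missing value (made-up example).
df = pd.DataFrame({'g': ['a', 'a', 'b'],
                   'v': [np.nan, 1.0, 2.0]})

# First non-null value per column and group, indexed by group key.
first_values = df.groupby('g').first()

# First row of each group, keeping the original row labels.
first_rows = df.groupby('g').head(1)
print(first_values)
print(first_rows)
```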

Other Considerations

Disk vs Memory

pandas operates exclusively in memory, whereas a SAS data set exists on disk. This means that the size of data able to
be loaded in pandas is limited by your machine’s memory, but also that the operations on that data may be faster.
If out-of-core processing is needed, one possibility is the dask.dataframe library (currently in development), which
provides a subset of pandas functionality for an on-disk DataFrame.

Data Interop

pandas provides a read_sas() method that can read SAS data saved in the XPORT or SAS7BDAT binary format.

libname xportout xport 'transport-file.xpt';

data xportout.tips;
set tips(rename=(total_bill=tbill));
* xport variable names limited to 6 characters;
run;

df = pd.read_sas('transport-file.xpt')
df = pd.read_sas('binary-file.sas7bdat')

You can also specify the file format directly. By default, pandas will try to infer the file format based on its extension.

df = pd.read_sas('transport-file.xpt', format='xport')
df = pd.read_sas('binary-file.sas7bdat', format='sas7bdat')

XPORT is a relatively limited format and the parsing of it is not as optimized as some of the other pandas readers. An
alternative way to interop data between SAS and pandas is to serialize to csv.

# version 0.17, 10M rows

In [8]: %time df = pd.read_sas('big.xpt')

Wall time: 14.6 s

In [9]: %time df = pd.read_csv('big.csv')

Wall time: 4.86 s

3.5.4 Comparison with Stata

For potential users coming from Stata this page is meant to demonstrate how different Stata operations would be
performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows. This means that we can refer to the libraries as pd and np,
respectively, for the rest of the document.

In [1]: import pandas as pd

In [2]: import numpy as np

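Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays
the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or
terminal) – the equivalent in Stata would be: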

list in 1/5

Data Structures

General Terminology Translation

pandas       Stata
DataFrame    data set
column       variable
row          observation
groupby      bysort
NaN          .

DataFrame / Series

A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns that
can be of different types. As will be shown in this document, almost any operation that can be applied to a data set in
Stata can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. Stata doesn’t have a separate data
structure for a single column, but in general, working with a Series is analogous to referencing a column of a data
set in Stata.

Index

Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly
analogous concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index that can
be accessed with _n.
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on).
While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part
of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as
a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.

Data Input / Output

Constructing a DataFrame from Values

A Stata data set can be built from specified values by placing the data after an input statement and specifying the
column names.

input x y
1 2
3 4
5 6
end

In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})

In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6

Reading External Data

Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within the
pandas tests (csv) will be used in many of the following examples.
Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the
current working directory, we can import it as follows.

import delimited tips.csv

The pandas method is read_csv(), which works similarly. Additionally, it will automatically download the data
set if presented with a url.

In [5]: url = ('https://raw.github.com/pandas-dev'
   ...:        '/pandas/master/pandas/tests/data/tips.csv')
   ...:

In [6]: tips = pd.read_csv(url)

In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Like import delimited, read_csv() can take a number of parameters to specify how the data should be
parsed. For example, if the data were instead tab delimited, did not have column names, and existed in the current
working directory, the pandas command would be:

tips = pd.read_csv('tips.csv', sep='\t', header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter

tips = pd.read_table('tips.csv', header=None)
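The tab-delimited case can be tried without a file on disk. The following is a minimal sketch that parses an in-memory string; the sample rows and the column names passed through names are made up for illustration and are not part of the tips data set.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-delimited data without a header row
raw = "16.99\t1.01\tFemale\n10.34\t1.66\tMale\n"

# header=None stops pandas from treating the first row as column names;
# names= supplies the labels instead
tips_tab = pd.read_csv(StringIO(raw), sep="\t", header=None,
                       names=["total_bill", "tip", "sex"])
print(tips_tab)
```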

Pandas can also read Stata data sets in .dta format with the read_stata() function.

df = pd.read_stata('data.dta')

In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS, HDF5,
Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation for more
details.

Exporting Data

The inverse of import delimited in Stata is export delimited.

export delimited tips2.csv

Similarly in pandas, the opposite of read_csv is DataFrame.to_csv().

tips.to_csv('tips2.csv')

Pandas can also export to Stata file format with the DataFrame.to_stata() method.

tips.to_stata('tips2.dta')
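One detail worth knowing when exporting: to_csv() writes the row index as an extra first column unless index=False is passed. The sketch below, which uses a small made-up DataFrame and an in-memory buffer rather than a file, shows a round trip without the index.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})

buf = StringIO()
df.to_csv(buf, index=False)   # index=False omits the row labels
csv_text = buf.getvalue()
print(csv_text)

# Reading the text back gives an equal DataFrame
round_trip = pd.read_csv(StringIO(csv_text))
```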

Data Operations

Operations on Columns

In Stata, arbitrary math expressions can be used with the generate and replace commands on new or existing
columns. The drop command drops the column from the data set.

replace total_bill = total_bill - 2

generate new_bill = total_bill / 2
drop new_bill

pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.

In [8]: tips['total_bill'] = tips['total_bill'] - 2

In [9]: tips['new_bill'] = tips['total_bill'] / 2

In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295

In [11]: tips = tips.drop('new_bill', axis=1)

Filtering

Filtering in Stata is done with an if clause on one or more columns.

list if total_bill > 10

DataFrames can be filtered in multiple ways; the most intuitive is boolean indexing.

In [12]: tips[tips['total_bill'] > 10].head()

Out[12]:
total_bill tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
5 23.29 4.71 Male No Sun Dinner 4
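Conditions can also be combined, as with & in a Stata if clause. The sketch below uses a small made-up frame with a few tips-like columns; each comparison needs its own parentheses because & binds more tightly than > and ==. DataFrame.query() is shown as one alternative spelling of the same filter.

```python
import pandas as pd

tips = pd.DataFrame({
    "total_bill": [14.99, 8.34, 19.01, 21.68],
    "sex": ["Female", "Male", "Male", "Male"],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner"],
})

# Stata: list if total_bill > 10 & time == "Dinner"
mask = (tips["total_bill"] > 10) & (tips["time"] == "Dinner")
by_mask = tips[mask]

# The same filter written as a query string
by_query = tips.query("total_bill > 10 and time == 'Dinner'")
print(by_mask)
```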

If/Then Logic

In Stata, an if clause can also be used to create new columns.

generate bucket = "low" if total_bill < 10

replace bucket = "high" if total_bill >= 10

The same operation in pandas can be accomplished using the where function from NumPy.

In [13]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')

In [14]: tips.head()
Out[14]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
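np.where handles a two-way split. For more than two buckets, one option is np.select, which takes a list of conditions and a matching list of labels and uses the first condition that holds. The thresholds and the "medium" label below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

bills = pd.DataFrame({"total_bill": [8.34, 14.99, 21.68, 48.27]})

# Conditions are checked in order; the first match wins
conditions = [bills["total_bill"] < 10, bills["total_bill"] < 20]
labels = ["low", "medium"]
bills["bucket"] = np.select(conditions, labels, default="high")
print(bills)
```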

Date Functionality

Stata provides a variety of functions to do operations on date/datetime columns.

generate date1 = mdy(1, 15, 2013)

generate date2 = date("Feb152015", "MDY")

generate date1_year = year(date1)

generate date2_month = month(date2)

* shift date to beginning of next month

generate date1_next = mdy(month(date1) + 1, 1, year(date1)) if month(date1) != 12
replace date1_next = mdy(1, 1, year(date1) + 1) if month(date1) == 12
generate months_between = mofd(date2) - mofd(date1)

list date1 date2 date1_year date2_month date1_next months_between

The equivalent pandas operations are shown below. In addition to these functions, pandas supports other Time Series
features not available in Stata (such as time zone handling and custom offsets) – see the timeseries documentation for
more details.

In [15]: tips['date1'] = pd.Timestamp('2013-01-15')

In [16]: tips['date2'] = pd.Timestamp('2015-02-15')

In [17]: tips['date1_year'] = tips['date1'].dt.year

In [18]: tips['date2_month'] = tips['date2'].dt.month

In [19]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()

In [20]: tips['months_between'] = (tips['date2'].dt.to_period('M')
   ....:                           - tips['date1'].dt.to_period('M'))
   ....:

In [21]: tips[['date1', 'date2', 'date1_year', 'date2_month', 'date1_next',
   ....:       'months_between']].head()
   ....:
Out[21]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
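Two details of the Stata snippet can be mirrored more closely. The sketch below parses the string "Feb152015" with an explicit format, much like date("Feb152015", "MDY"), and computes the month difference as a plain integer instead of an offset object. It works on scalar timestamps for brevity and assumes an English locale for the month abbreviation.

```python
import pandas as pd

# Counterpart of Stata's date("Feb152015", "MDY"): parse with an explicit format
date2 = pd.to_datetime("Feb152015", format="%b%d%Y")
date1 = pd.Timestamp("2013-01-15")

# Whole calendar months between the two dates, as a plain integer
months_between = (date2.year - date1.year) * 12 + (date2.month - date1.month)
print(date2, months_between)
```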

Selection of Columns

Stata provides keywords to select, drop, and rename columns.

keep sex total_bill tip

drop sex

rename total_bill total_bill_2

The same operations are expressed in pandas below. Note that in contrast to Stata, these operations do not happen in
place. To make these changes persist, assign the result back to a variable.

# keep
In [22]: tips[['sex', 'total_bill', 'tip']].head()
Out[22]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61

# drop
In [23]: tips.drop('sex', axis=1).head()
Out[23]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4

# rename
In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[24]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
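The point about persistence can be checked directly. In this sketch, built on a small made-up frame, the first drop() call leaves the DataFrame untouched, and only the reassignment changes its columns.

```python
import pandas as pd

tips = pd.DataFrame({"total_bill": [14.99, 8.34], "tip": [1.01, 1.66],
                     "sex": ["Female", "Male"]})

# drop() returns a new DataFrame; tips itself is unchanged
tips.drop("sex", axis=1)
cols_before = list(tips.columns)

# Assigning the result back makes the change persist
tips = tips.drop("sex", axis=1).rename(columns={"total_bill": "total_bill_2"})
cols_after = list(tips.columns)
print(cols_before, cols_after)
```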

Sorting by Values

Sorting in Stata is accomplished via sort.

sort sex total_bill

pandas objects have a DataFrame.sort_values() method, which takes a list of columns to sort by.

In [25]: tips = tips.sort_values(['sex', 'total_bill'])

In [26]: tips.head()
Out[26]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2

String Processing

Finding Length of String

Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII and
Unicode strings, respectively.

generate strlen_time = strlen(time)

generate ustrlen_time = ustrlen(time)

Python determines the length of a character string with the len function. In Python 3, all strings are Unicode strings.
len includes trailing blanks. Use len and rstrip to exclude trailing blanks.

In [27]: tips['time'].str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64

In [28]: tips['time'].str.rstrip().str.len().head()
Out[28]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
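The time column has no trailing blanks, so both calls above return the same lengths. The difference shows up with padded values, as in this sketch on a made-up Series.

```python
import pandas as pd

s = pd.Series(["Dinner  ", "Lunch"])    # first value has two trailing blanks

raw_len = s.str.len()                   # counts the trailing blanks
trimmed_len = s.str.rstrip().str.len()  # ignores them
print(raw_len.tolist(), trimmed_len.tolist())
```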

Finding Position of Substring

Stata determines the position of a character in a string with the strpos() function. This takes the string defined by
the first argument and searches for the first position of the substring you supply as the second argument.

generate str_position = strpos(sex, "ale")

Python determines the position of a character in a string with the find() function. find searches for the first
position of the substring. If the substring is found, the function returns its position. Keep in mind that Python indexes
are zero-based and the function will return -1 if it fails to find the substring.

In [29]: tips['sex'].str.find("ale").head()
Out[29]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
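Because strpos() is one-based and returns 0 when nothing is found, adding 1 to the result of find() reproduces Stata's numbers. The sketch below uses a made-up Series that includes a value without the substring.

```python
import pandas as pd

sex = pd.Series(["Female", "Male", "Other"])

zero_based = sex.str.find("ale")   # -1 where the substring is absent
stata_like = zero_based + 1        # 1-based position, 0 when not found
print(zero_based.tolist(), stata_like.tolist())
```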

Extracting Substring by Position

Stata extracts a substring from a string based on its position with the substr() function.

generate short_sex = substr(sex, 1, 1)

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that
Python indexes are zero-based.

In [30]: tips['sex'].str[0:1].head()
Out[30]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object

Extracting nth Word

The Stata word() function returns the nth word from a string. The first argument is the string you want to parse and
the second argument specifies which word you want to extract.

clear
input str20 string
"John Smith"
"Jane Cook"
end

generate first_name = word(string, 1)

generate last_name = word(string, -1)

In pandas, words can be extracted by splitting the string on a delimiter with str.split() or str.rsplit() and
selecting the resulting column by position. There are much more powerful approaches, such as regular expressions, but
this shows a simple one.

In [31]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})

In [32]: firstlast['First_Name'] = firstlast['string'].str.split(" ", expand=True)[0]

In [33]: firstlast['Last_Name'] = firstlast['string'].str.rsplit(" ", expand=True)[1]

In [34]: firstlast
Out[34]:
string First_Name Last_Name
0 John Smith John Smith
1 Jane Cook Jane Cook
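When the number of words varies from row to row, selecting a fixed column of the expanded result is fragile. One alternative, sketched here on made-up names, is to split into lists and index them with .str[n], where negative positions count from the end just as in word().

```python
import pandas as pd

names = pd.DataFrame({"string": ["John Smith", "Jane Mary Cook"]})

words = names["string"].str.split()   # split on any whitespace
names["first_word"] = words.str[0]    # like word(string, 1)
names["last_word"] = words.str[-1]    # like word(string, -1)
print(names)
```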

Changing Case

The Stata strupper(), strlower(), strproper(), ustrupper(), ustrlower(), and ustrtitle() functions change the case
of ASCII and Unicode strings, respectively.

clear
input str20 string
"John Smith"
"Jane Cook"
end

generate upper = strupper(string)

generate lower = strlower(string)
generate title = strproper(string)
list

The equivalent Python functions are upper, lower, and title.

In [35]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})

In [36]: firstlast['upper'] = firstlast['string'].str.upper()

In [37]: firstlast['lower'] = firstlast['string'].str.lower()

In [38]: firstlast['title'] = firstlast['string'].str.title()

In [39]: firstlast
Out[39]:
string upper lower title
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook

Merging

The following tables will be used in the merge examples.

In [40]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [41]: df1
Out[41]:
key value
0 A 0.469112
1 B -0.282863
2 C -1.509059
3 D -1.135632

In [42]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                     'value': np.random.randn(4)})
   ....:

In [43]: df2
Out[43]:
key value
0 B 1.212112
1 D -0.173215
2 D 0.119209
3 E -1.044236

In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file name on disk.
In contrast, Python must have both DataFrames already in memory.
By default, Stata performs an outer join, where all observations from both data sets are left in memory after the merge.
One can keep only observations from the initial data set, the merged data set, or the intersection of the two by using
the values created in the _merge variable.

* First create df2 and save to disk

clear
input str1 key
B
D
D
E
end
generate value = rnormal()
save df2.dta

* Now create df1 in memory

clear
input str1 key
A
B
C
D
end
generate value = rnormal()

preserve

* Left join
merge 1:m key using df2.dta
keep if _merge == 1 | _merge == 3

* Right join
restore, preserve
merge 1:m key using df2.dta
keep if _merge == 2 | _merge == 3

* Inner join
restore, preserve
merge 1:m key using df2.dta
keep if _merge == 3

* Outer join
restore
merge 1:m key using df2.dta

pandas DataFrames have a DataFrame.merge() method, which provides similar functionality. Note that different
join types are accomplished via the how keyword.

In [44]: inner_join = df1.merge(df2, on=['key'], how='inner')

In [45]: inner_join
Out[45]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209

In [46]: left_join = df1.merge(df2, on=['key'], how='left')

In [47]: left_join
Out[47]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

In [48]: right_join = df1.merge(df2, on=['key'], how='right')

In [49]: right_join
Out[49]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236

In [50]: outer_join = df1.merge(df2, on=['key'], how='outer')

In [51]: outer_join
Out[51]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
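For readers used to Stata's _merge variable, merge() accepts indicator=True, which adds a categorical _merge column with the values left_only, right_only, and both. The sketch below uses fixed values in place of the random ones above so that it is reproducible.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5.0, 6.0, 7.0, 8.0]})

# indicator=True records where each row came from
merged = df1.merge(df2, on="key", how="outer", indicator=True)
print(merged)

left_only = merged[merged["_merge"] == "left_only"]   # Stata: _merge == 1
matched = merged[merged["_merge"] == "both"]          # Stata: _merge == 3
```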

Missing Data

Like Stata, pandas has a representation for missing data – the special float value NaN (not a number). Many of the
semantics are the same; for example, missing data propagates through numeric operations and is ignored by default
in aggregations.

In [52]: outer_join
Out[52]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236

In [53]: outer_join['value_x'] + outer_join['value_y']
Out[53]:
0 NaN
1 0.929249
2 NaN
3 -1.308847
4 -1.016424
5 NaN
dtype: float64

In [54]: outer_join['value_x'].sum()
Out[54]: -3.5940742896293765

One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you could do this
to filter missing values.

* Keep missing values

list if value_x == .
* Keep non-missing values
list if value_x != .

This doesn’t work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for comparisons.

In [55]: outer_join[pd.isna(outer_join['value_x'])]
Out[55]:
key value_x value_y
5 E NaN -1.044236

In [56]: outer_join[pd.notna(outer_join['value_x'])]
Out[56]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
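The reason the Stata-style comparison does not carry over is that NaN never compares equal to anything, including itself. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.47, np.nan, -1.51])

eq_nan = s == np.nan     # all False: NaN is not equal even to NaN
is_missing = s.isna()    # the supported way to find missing values
print(eq_nan.tolist(), is_missing.tolist())
```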

Pandas also provides a variety of methods to work with missing data – some of which would be challenging to express
in Stata. For example, there are methods to drop all rows with any missing values, to replace missing values with a
specified value such as the mean, or to forward fill from previous rows. See the missing data documentation for more.

# Drop rows with any missing value

In [57]: outer_join.dropna()
Out[57]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209

# Fill forwards
In [58]: outer_join.fillna(method='ffill')
Out[58]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E -1.135632 -1.044236

# Impute missing values with the mean

In [59]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[59]:
0 0.469112
1 -0.282863
2 -1.509059
3 -1.135632
4 -1.135632
5 -0.718815
Name: value_x, dtype: float64

GroupBy

Aggregation

Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric columns.

collapse (sum) total_bill tip, by(sex smoker)

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for
more details and examples.

In [60]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()

In [61]: tips_summed.head()
Out[61]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
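Stata's collapse can apply a different statistic to each variable, for example collapse (sum) total_bill (mean) tip, by(sex smoker). A rough pandas counterpart, sketched on a small made-up frame, passes a dict to agg(); reset_index() then turns the group keys back into ordinary columns, which is closer to the data set that collapse leaves in memory.

```python
import pandas as pd

tips = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "smoker": ["No", "No", "No", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
})

summary = (tips.groupby(["sex", "smoker"])
               .agg({"total_bill": "sum", "tip": "mean"})
               .reset_index())   # group keys become ordinary columns again
print(summary)
```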

Transformation

In Stata, if the group aggregations need to be used with the original data set, one would usually use bysort with
egen. For example, to subtract the group mean from each observation, by smoker group:

bysort smoker: egen group_bill = mean(total_bill)

generate adj_total_bill = total_bill - group_bill

pandas groupby provides a transform mechanism that allows these types of operations to be succinctly expressed
in one operation.

In [62]: gb = tips.groupby('smoker')['total_bill']

In [63]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')

In [64]: tips.head()
Out[64]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
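What transform returns is easier to see on a tiny frame. In this sketch with made-up values, transform('mean') yields one value per original row (the mean of that row's group), so it lines up with the original column for subtraction.

```python
import pandas as pd

tips = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
})

# Same length and index as tips, unlike an aggregation
group_mean = tips.groupby("smoker")["total_bill"].transform("mean")
tips["adj_total_bill"] = tips["total_bill"] - group_mean
print(tips)
```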

By Group Processing

In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from Stata. For
example, the following lists the first observation in the current sort order by sex/smoker group.

bysort sex smoker: list if _n == 1

In pandas this would be written as:

In [65]: tips.groupby(['sex', 'smoker']).first()

Out[65]:
total_bill tip day time size adj_total_bill
sex smoker
Female No 5.25 1.00 Sat Dinner 1 -11.938278
Yes 1.07 1.00 Sat Dinner 1 -17.686344
Male No 5.51 2.00 Thur Lunch 2 -11.678278
Yes 5.25 5.15 Sun Dinner 2 -13.506344
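A within-group counter comparable to Stata's _n can be built with cumcount(). The sketch below, on a made-up frame, numbers the rows inside each group starting from 1 and then selects the first row of each group while keeping the original rows and index. The column name n is an arbitrary choice.

```python
import pandas as pd

tips = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "Yes"],
    "total_bill": [5.25, 6.35, 5.51, 7.25],
})

# cumcount() is zero-based, so add 1 to match _n
tips["n"] = tips.groupby(["sex", "smoker"]).cumcount() + 1
first_rows = tips[tips["n"] == 1]     # Stata: list if _n == 1
print(first_rows)
```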

Other Considerations

Disk vs Memory

Pandas and Stata both operate exclusively in memory. This means that the size of data able to be loaded in pandas is
limited by your machine’s memory. If out of core processing is needed, one possibility is the dask.dataframe library,
which provides a subset of pandas functionality for an on-disk DataFrame.

3.6 Tutorials

This is a guide to many pandas tutorials, geared mainly for new users.

3.6.1 Internal Guides

pandas’ own 10 Minutes to pandas.

More complex recipes are in the Cookbook.
A handy pandas cheat sheet.

3.6.2 Community Guides

pandas Cookbook by Julia Evans

The goal of this 2015 cookbook (by Julia Evans) is to give you some concrete examples for getting started with pandas.
These are examples with real-world data, and all the bugs and weirdness that entails. For the table of contents, see the
pandas-cookbook GitHub repository.

Learn Pandas by Hernan Rojas

A set of lessons for new pandas users: https://bitbucket.org/hrojas/learn-pandas

Practical data analysis with Python

This guide is an introduction to the data analysis process using the Python data ecosystem and an interesting open
dataset. There are four sections covering selected topics such as munging data, aggregating data, visualizing data,
and time series.

Exercises for new users

Practice your skills with real data sets and exercises. For more resources, please visit the main repository.

Modern pandas

Tutorial series written in 2016 by Tom Augspurger. The source may be found in the GitHub repository
TomAugspurger/effective-pandas.
• Modern Pandas
• Method Chaining
• Indexes
• Performance
• Tidy Data
• Visualization
• Timeseries

Excel charts with pandas, vincent and xlsxwriter

• Using Pandas and XlsxWriter to create Excel charts

Video Tutorials

• Pandas From The Ground Up (2015) (2:24) GitHub repo

• Introduction Into Pandas (2016) (1:28) GitHub repo
• Pandas: .head() to .tail() (2016) (1:26) GitHub repo
• Data analysis in Python with pandas (2016-2018) GitHub repo and Jupyter Notebook
• Best practices with pandas (2018) GitHub repo and Jupyter Notebook

Various Tutorials

• Wes McKinney’s (pandas BDFL) blog

• Statistical analysis made easy in Python with SciPy and pandas DataFrames, by Randal Olson
• Statistical Data Analysis in Python, tutorial videos, by Christopher Fonnesbeck from SciPy 2013
• Financial analysis in Python, by Thomas Wiecki
• Intro to pandas data structures, by Greg Reda
• Pandas and Python: Top 10, by Manish Amde
• Pandas DataFrames Tutorial, by Karlijn Willems
• A concise tutorial with real life examples

CHAPTER

FOUR

USER GUIDE

The User Guide covers all of pandas by topic area. Each of the subsections introduces a topic (such as “working with
missing data”), and discusses how pandas approaches the problem, with many examples throughout.
Users brand-new to pandas should start with 10 Minutes to pandas.
Further information on any specific method can be obtained in the API Reference.

4.1 IO Tools (Text, CSV, HDF5, . . . )

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally
return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.
to_csv(). Below is a table containing available readers and writers.

Format Type  Data Description      Reader          Writer
text         CSV                   read_csv        to_csv
text         JSON                  read_json       to_json
text         HTML                  read_html       to_html
text         Local clipboard       read_clipboard  to_clipboard
binary       MS Excel              read_excel      to_excel
binary       HDF5 Format           read_hdf        to_hdf
binary       Feather Format        read_feather    to_feather
binary       Parquet Format        read_parquet    to_parquet
binary       Msgpack               read_msgpack    to_msgpack
binary       Stata                 read_stata      to_stata
binary       SAS                   read_sas
binary       Python Pickle Format  read_pickle     to_pickle
SQL          SQL                   read_sql        to_sql
SQL          Google Big Query      read_gbq        to_gbq

Here is an informal performance comparison for some of these IO methods.

Note: For examples that use the StringIO class, make sure you import it according to your Python version, i.e.
from StringIO import StringIO for Python 2 and from io import StringIO for Python 3.

4.1.1 CSV & Text files

The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some advanced
strategies.

Parsing options

read_csv() accepts the following common arguments:

Basic

filepath_or_buffer [various] Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), URL
(including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).
sep [str, defaults to ',' for read_csv(), \t for read_table()] Delimiter to use. If sep is None, the C engine
cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used
and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators
longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force
the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex
example: '\\r\\t'.
delimiter [str, default None] Alternative argument name for sep.
delim_whitespace [boolean, default False] Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as
the delimiter. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for
the delimiter parameter.
New in version 0.18.1: support for the Python parser.
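The separator options above can be sketched on in-memory strings. In the first call, sep=None together with engine='python' lets the parser sniff the delimiter; in the second, the regular expression \s+ splits on any run of whitespace. Both sample strings are made up for illustration.

```python
from io import StringIO

import pandas as pd

data = "a;b;c\n1;2;3\n4;5;6"

# sep=None asks the Python engine to detect the delimiter with csv.Sniffer
sniffed = pd.read_csv(StringIO(data), sep=None, engine="python")

# Any run of spaces or tabs acts as one delimiter
spaced = pd.read_csv(StringIO("a  b\tc\n1 2   3"), sep=r"\s+")
print(sniffed)
print(spaced)
```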

Column and Index Locations and Names

header [int or list of ints, default 'infer'] Row number(s) to use as the column names, and the start of the data.
Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0
and column names are inferred from the first line of the file, if column names are passed explicitly then the
behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.
The header can be a list of ints that specify row locations for a MultiIndex on the columns e.g. [0,1,3].
Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the
first line of data rather than the first line of the file.
names [array-like, default None] List of column names to use. If file contains no header row, then you should
explicitly pass header=None. Duplicates in this list will cause a UserWarning to be issued.
index_col [int or sequence or False, default None] Column to use as the row labels of the DataFrame. If a
sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line,
you might consider index_col=False to force pandas to not use the first column as the index (row names).
usecols [list-like or callable, default None] Return a subset of the columns. If list-like, all elements must either be
positional (i.e. integer indices into the document columns) or strings that correspond to column names provided
either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with
element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in
['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where the callable
function evaluates to True:
In [1]: from pandas.compat import StringIO, BytesIO

In [2]: data = ('col1,col2,col3\n'
   ...:         'a,b,1\n'
   ...:         'a,b,2\n'
   ...:         'c,d,3')
   ...:

In [3]: pd.read_csv(StringIO(data))
Out[3]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3

In [4]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])
Out[4]:
col1 col3
0 a 1
1 a 2
2 c 3

Using this parameter results in much faster parsing time and lower memory usage.
squeeze [boolean, default False] If the parsed data only contains one column then return a Series.
prefix [str, default None] Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, . . .
mangle_dupe_cols [boolean, default True] Duplicate columns will be specified as ‘X’, ‘X.1’. . . ’X.N’, rather than
‘X’. . . ’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
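Several of the column and index options above can be seen together in a short sketch. The sample data and the replacement column names are made up for illustration; io.StringIO stands in for a file.

```python
from io import StringIO

import pandas as pd

data = "id,foo,bar\n1,x,10\n2,y,20"

# header=0 together with names= replaces the names found in the file
renamed = pd.read_csv(StringIO(data), header=0, names=["key", "left", "right"])

# usecols ignores the order of its elements; reindex to get a chosen order
subset = pd.read_csv(StringIO(data), usecols=["bar", "foo"])[["bar", "foo"]]

# index_col promotes a column to the row labels
indexed = pd.read_csv(StringIO(data), index_col="id")
print(renamed, subset, indexed, sep="\n\n")
```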

General Parsing Configuration

dtype [Type name or dict of column -> type, default None] Data type for data or columns. E.g. {'a': np.
float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together with
suitable na_values settings to preserve and not interpret dtype.
New in version 0.20.0: support for the Python parser.
engine [{'c', 'python'}] Parser engine to use. The C engine is faster while the Python engine is currently more
feature-complete.
converters [dict, default None] Dict of functions for converting values in certain columns. Keys can either be integers
or column labels.
true_values [list, default None] Values to consider as True.
false_values [list, default None] Values to consider as False.
skipinitialspace [boolean, default False] Skip spaces after delimiter.
skiprows [list-like or integer, default None] Line numbers to skip (0-indexed) or number of lines to skip (int) at the
start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be
skipped and False otherwise:

In [5]: data = ('col1,col2,col3\n'
   ...:         'a,b,1\n'
   ...:         'a,b,2\n'
   ...:         'c,d,3')
   ...:

In [6]: pd.read_csv(StringIO(data))
Out[6]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3

In [7]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[7]:
col1 col2 col3
0 a b 2

skipfooter [int, default 0] Number of lines at bottom of file to skip (unsupported with engine='c').
nrows [int, default None] Number of rows of file to read. Useful for reading pieces of large files.
low_memory [boolean, default True] Internally process the file in chunks, resulting in lower memory use while
parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with
the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize
or iterator parameter to return the data in chunks. (Only valid with C parser)
memory_map [boolean, default False] If a filepath is provided for filepath_or_buffer, map the file object
directly onto memory and access the data directly from there. Using this option can improve performance
because there is no longer any I/O overhead.
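A few of the general parsing options can be combined in one call. The sketch below uses made-up data in which the code column would lose its leading zeros if parsed as integers and the paid column holds yes/no strings.

```python
from io import StringIO

import pandas as pd

data = "code,amount,paid\n007,1.5,yes\n042,2.5,no\n100,3.5,yes"

parsed = pd.read_csv(
    StringIO(data),
    dtype={"code": str},      # keep leading zeros instead of parsing as int
    true_values=["yes"],      # map these strings to True ...
    false_values=["no"],      # ... and these to False
    nrows=2,                  # read only the first two data rows
)
print(parsed)
```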

NA and Missing Data Handling

na_values [scalar, str, list-like, or dict, default None] Additional strings to recognize as NA/NaN. If dict passed,
specific per-column NA values. See the NA values constants below for a list of the values interpreted as NaN by default.
keep_default_na [boolean, default True] Whether or not to include the default NaN values when parsing the data.
Depending on whether na_values is passed in, the behavior is as follows:
• If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values
used for parsing.
• If keep_default_na is True, and na_values are not specified, only the default NaN values are used for
parsing.
• If keep_default_na is False, and na_values are specified, only the NaN values specified in na_values are
used for parsing.
• If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
na_filter [boolean, default True] Detect missing value markers (empty strings and the value of na_values). In data
without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose [boolean, default False] Indicate number of NA values placed in non-numeric columns.


skip_blank_lines [boolean, default True] If True, skip over blank lines rather than interpreting as NaN values.
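The four combinations of keep_default_na and na_values listed above can be sketched as follows. The sample data and the custom marker 'missing' are assumptions chosen for illustration; 'NA' is one of the strings pandas treats as NaN by default.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: 'NA' is a default NaN marker, 'missing' is not.
data = "id,score\n1,NA\n2,missing\n3,7\n"

# keep_default_na=True, no na_values: only the default markers become NaN.
df_default = pd.read_csv(StringIO(data))

# keep_default_na=True plus na_values: the custom marker is added to the defaults.
df_extended = pd.read_csv(StringIO(data), na_values=['missing'])

# keep_default_na=False plus na_values: only the custom marker becomes NaN.
df_replaced = pd.read_csv(StringIO(data), na_values=['missing'],
                          keep_default_na=False)

# keep_default_na=False, no na_values: nothing is parsed as NaN.
df_raw = pd.read_csv(StringIO(data), keep_default_na=False)

for frame in (df_default, df_extended, df_replaced, df_raw):
    print(frame['score'].isna().tolist())
```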

Datetime Handling

parse_dates [boolean or list of ints or names or list of lists or dict, default False.]
• If True -> try parsing the index.
• If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
• If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
• If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’. A fast-path exists for iso8601-
formatted dates.
infer_datetime_format [boolean, default False] If True and parse_dates is enabled for a column, attempt to infer
the datetime format to speed up the processing.
keep_date_col [boolean, default False] If True and parse_dates specifies combining multiple columns then keep
the original columns.
date_parser [function, default None] Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to
call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined
by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst [boolean, default False] DD/MM format dates, international and European format.
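To illustrate the dayfirst option, here is a small sketch that parses the same ambiguous strings with and without it. The sample data and variable names are assumptions for illustration only.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample in which day and month are ambiguous.
data = ("date,value\n"
        "01/12/2011,5\n"
        "02/12/2011,10")

# Default (dayfirst=False): the first field is read as the month.
month_first = pd.read_csv(StringIO(data), parse_dates=['date'])

# dayfirst=True: the first field is read as the day.
day_first = pd.read_csv(StringIO(data), parse_dates=['date'], dayfirst=True)

print(month_first['date'].tolist())
print(day_first['date'].tolist())
```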

Iteration

iterator [boolean, default False] Return TextFileReader object for iteration or getting chunks with get_chunk().
chunksize [int, default None] Return TextFileReader object for iteration. See iterating and chunking below.
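A minimal sketch of both iteration options, assuming a small in-memory file built only for this example: chunksize yields DataFrames of at most that many rows, while iterator=True returns a reader from which chunks of any size can be requested with get_chunk().

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: five data rows with columns a and b = a squared.
data = "a,b\n" + "\n".join("{},{}".format(i, i * i) for i in range(5))

# chunksize: iterate over DataFrames of at most two rows each.
chunk_lengths = []
total_b = 0
for chunk in pd.read_csv(StringIO(data), chunksize=2):
    chunk_lengths.append(len(chunk))
    total_b += int(chunk['b'].sum())

# iterator=True: request an explicit number of rows with get_chunk().
reader = pd.read_csv(StringIO(data), iterator=True)
first_three = reader.get_chunk(3)

print(chunk_lengths, total_b)
print(first_three)
```

Aggregating per chunk, as with total_b here, keeps only one chunk in memory at a time.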

Quoting, Compression, and File Format

compression [{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'] For on-the-fly decompres-
sion of on-disk data. If ‘infer’, then use gzip, bz2, zip, or xz if filepath_or_buffer is a string ending in ‘.gz’,
‘.bz2’, ‘.zip’, or ‘.xz’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file must contain
only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
Changed in version 0.24.0: ‘infer’ option added and set to default.
thousands [str, default None] Thousands separator.
decimal [str, default '.'] Character to recognize as decimal point. E.g. use ',' for European data.
float_precision [string, default None] Specifies which converter the C engine should use for floating-point values.
The options are None for the ordinary converter, high for the high-precision converter, and round_trip for
the round-trip converter.
lineterminator [str (length 1), default None] Character to break file into lines. Only valid with C parser.
quotechar [str (length 1)] The character used to denote the start and end of a quoted item. Quoted items can include
the delimiter and it will be ignored.


quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per csv.QUOTE_* constants.
Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [boolean, default True] When quotechar is specified and quoting is not QUOTE_NONE, indi-
cate whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar
element.
escapechar [str (length 1), default None] One-character string used to escape delimiter when quoting is
QUOTE_NONE.
comment [str, default None] Indicates remainder of line should not be parsed. If found at the beginning of a line,
the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long
as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by
skiprows. For example, if comment='#', parsing ‘#empty\na,b,c\n1,2,3’ with header=0 will result in ‘a,b,c’
being treated as the header.
encoding [str, default None] Encoding to use for UTF when reading/writing (e.g. 'utf-8'). List of Python standard
encodings.
dialect [str or csv.Dialect instance, default None] If provided, this parameter will override values (default or
not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting.
If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect documentation for
more details.
tupleize_cols [boolean, default False] Leave a list of tuples on columns as is (default is to convert to a MultiIndex
on the columns).
Deprecated since version 0.21.0: This argument will be removed and will always convert to MultiIndex.
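As an illustration of the decimal, thousands and quotechar options above, the sketch below reads European-style numbers from a semicolon-separated file in which one quoted field contains the delimiter. The sample data is an assumption made for this example.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: ';' as delimiter, ',' as decimal point,
# '.' as thousands separator, and a quoted field containing ';'.
data = ('name;amount\n'
        '"Smith; J.";1.234,56\n'
        'Jones;7,5')

df = pd.read_csv(StringIO(data), sep=';', decimal=',', thousands='.')

print(df)
print(df.dtypes)
```

The default quotechar ('"') is what keeps the ';' inside the first name from being treated as a field separator.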

Error Handling

error_bad_lines [boolean, default True] Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines”
will be dropped from the DataFrame that is returned. See bad lines below.
warn_bad_lines [boolean, default True] If error_bad_lines is False, and warn_bad_lines is True, a warning for
each “bad line” will be output.

Specifying column data types

You can indicate the data type for the whole DataFrame or individual columns:
In [8]: data = ('a,b,c,d\n'
...: '1,2,3,4\n'
...: '5,6,7,8\n'
...: '9,10,11')
...:

In [9]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [10]: df = pd.read_csv(StringIO(data), dtype=object)


In [11]: df
Out[11]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NaN

In [12]: df['a'][0]
Out[12]: '1'

In [13]: df = pd.read_csv(StringIO(data),
....: dtype={'b': object, 'c': np.float64, 'd': 'Int64'})
....:

In [14]: df.dtypes
Out[14]:
a int64
b object
c float64
d Int64
dtype: object

Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype. If you’re
unfamiliar with these concepts, you can see here to learn more about dtypes, and here to learn more about object
conversion in pandas.
For instance, you can use the converters argument of read_csv():
In [15]: data = ("col_1\n"
....: "1\n"
....: "2\n"
....: "'A'\n"
....: "4.22")
....:

In [16]: df = pd.read_csv(StringIO(data), converters={'col_1': str})

In [17]: df
Out[17]:
col_1
0 1
1 2
2 'A'
3 4.22

In [18]: df['col_1'].apply(type).value_counts()
Out[18]:
<class 'str'> 4
Name: col_1, dtype: int64

Or you can use the to_numeric() function to coerce the dtypes after reading in the data,
In [19]: df2 = pd.read_csv(StringIO(data))

In [20]: df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')


In [21]: df2
Out[21]:
col_1
0 1.00
1 2.00
2 NaN
3 4.22

In [22]: df2['col_1'].apply(type).value_counts()
Out[22]:
<class 'float'> 4
Name: col_1, dtype: int64

which converts every value that parses as a number to a float and leaves the values that do not parse as NaN.
Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs. In the case
above, if you wanted to NaN out the data anomalies, then to_numeric() is probably your best option. However, if
you wanted for all the data to be coerced, no matter the type, then using the converters argument of read_csv()
would certainly be worth trying.
New in version 0.20.0: support for the Python parser.
The dtype option is supported by the ‘python’ engine.

Note: In some cases, reading in abnormal data with columns containing mixed dtypes will result in an inconsistent
dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will go and infer the dtypes for
different chunks of the data, rather than the whole dataset at once. Consequently, you can end up with column(s) with
mixed dtypes. For example,

In [23]: col_1 = list(range(500000)) + ['a', 'b'] + list(range(500000))

In [24]: df = pd.DataFrame({'col_1': col_1})

In [25]: df.to_csv('foo.csv')

In [26]: mixed_df = pd.read_csv('foo.csv')

In [27]: mixed_df['col_1'].apply(type).value_counts()
Out[27]:
<class 'int'> 737858
<class 'str'> 262144
Name: col_1, dtype: int64

In [28]: mixed_df['col_1'].dtype
Out[28]: dtype('O')

will result in mixed_df containing int values for certain chunks of the column and str values for others, due to
the mixed dtypes in the data that was read in. It is important to note that the overall column will be marked with a
dtype of object, which is used for columns with mixed dtypes.
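One way to avoid the mixed result described above is to state the dtype explicitly, or to disable chunked inference with low_memory=False. The sketch below is a small-scale illustration under assumed data: a file this small would not trigger chunked inference on its own, so it only shows that both options give a column with a single Python type.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: mostly integers with two stray strings in the middle.
values = list(range(1000)) + ['a', 'b'] + list(range(1000))
csv_text = "col_1\n" + "\n".join(str(v) for v in values)

# Option 1: declare the column as text, so no inference takes place.
as_text = pd.read_csv(StringIO(csv_text), dtype={'col_1': str})

# Option 2: infer the dtype from the whole column in a single pass.
one_pass = pd.read_csv(StringIO(csv_text), low_memory=False)

types_text = set(as_text['col_1'].map(type))
types_one_pass = set(one_pass['col_1'].map(type))

print(types_text, types_one_pass)
```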

Specifying Categorical dtype

New in version 0.19.0.


Categorical columns can be parsed directly by specifying dtype='category' or
dtype=CategoricalDtype(categories, ordered).

In [29]: data = ('col1,col2,col3\n'
....: 'a,b,1\n'
....: 'a,b,2\n'
....: 'c,d,3')
....:

In [30]: pd.read_csv(StringIO(data))
Out[30]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3

In [31]: pd.read_csv(StringIO(data)).dtypes
Out[31]:
col1 object
col2 object
col3 int64
dtype: object

In [32]: pd.read_csv(StringIO(data), dtype='category').dtypes
Out[32]:
col1 category
col2 category
col3 category
dtype: object

Individual columns can be parsed as a Categorical using a dict specification:

In [33]: pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

Out[33]:
col1 category
col2 object
col3 int64
dtype: object

New in version 0.21.0.

Specifying dtype='category' will result in an unordered Categorical whose categories are the unique
values observed in the data. For more control over the categories and order, create a CategoricalDtype ahead of
time, and pass that for that column’s dtype.

In [34]: from pandas.api.types import CategoricalDtype

In [35]: dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)

In [36]: pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes

Out[36]:
col1 category
col2 object
col3 int64
dtype: object

When using dtype=CategoricalDtype, “unexpected” values outside of dtype.categories are treated as
missing values.

In [37]: dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c'

In [38]: pd.read_csv(StringIO(data), dtype={'col1': dtype}).col1

Out[38]:
0 a
1 a
2 NaN
Name: col1, dtype: category
Categories (3, object): [a, b, d]

This matches the behavior of Categorical.set_categories().
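For comparison, a minimal sketch of the same behavior with Categorical.set_categories() on an existing Series; the values mirror col1 from the example above, and the variable names are assumptions.

```python
import pandas as pd

# The same values as col1 above, held as an ordinary categorical Series.
col = pd.Series(['a', 'a', 'c'], dtype='category')

# 'c' is not among the new categories, so it becomes a missing value.
recoded = col.cat.set_categories(['a', 'b', 'd'])

print(recoded)
```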

Note: With dtype='category', the resulting categories will always be parsed as strings (object dtype). If the
categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter
such as to_datetime().
When dtype is a CategoricalDtype with homogeneous categories (all numeric, all datetimes, etc.), the
conversion is done automatically.

In [39]: df = pd.read_csv(StringIO(data), dtype='category')

In [40]: df.dtypes
Out[40]:
col1 category
col2 category
col3 category
dtype: object

In [41]: df['col3']
Out[41]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, object): [1, 2, 3]

In [42]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)

In [43]: df['col3']
Out[43]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]
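The automatic conversion mentioned in the note can be sketched by passing a CategoricalDtype with numeric categories for col3. The data repeats the example above; the variable names are assumptions.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

data = ('col1,col2,col3\n'
        'a,b,1\n'
        'a,b,2\n'
        'c,d,3')

# Numeric categories: the parsed strings are converted to match them.
numeric_cats = CategoricalDtype([1, 2, 3])
df = pd.read_csv(StringIO(data), dtype={'col3': numeric_cats})

print(df['col3'])
```

With this approach no separate to_numeric() step on the categories is needed after reading.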

Naming and Using Columns

Handling column names

A file may or may not have a header row. pandas assumes the first row should be used as the column names:


In [44]: data = ('a,b,c\n'
....: '1,2,3\n'
....: '4,5,6\n'
....: '7,8,9')
....:

In [45]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9

In [46]: pd.read_csv(StringIO(data))
Out[46]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9

By specifying the names argument in conjunction with header you can indicate other names to use and whether or
not to throw away the header row (if any):

In [47]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9

In [48]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=0)
Out[48]:
foo bar baz
0 1 2 3
1 4 5 6
2 7 8 9

In [49]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=None)
Out[49]:
foo bar baz
0 a b c
1 1 2 3
2 4 5 6
3 7 8 9

If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows:

In [50]: data = ('skip this skip it\n'
....: 'a,b,c\n'
....: '1,2,3\n'
....: '4,5,6\n'
....: '7,8,9')
....:

In [51]: pd.read_csv(StringIO(data), header=1)

Out[51]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9

Note: Default behavior is to infer the column names. If no names are passed, the behavior is identical to header=0
and column names are inferred from the first non-blank line of the file; if column names are passed explicitly, the
behavior is identical to header=None.
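A minimal sketch of the default described in this note, assuming a small in-memory file: passing names without header makes the first line part of the data.

```python
from io import StringIO

import pandas as pd

data = 'a,b,c\n1,2,3\n4,5,6'

# No names: the first line supplies the column names (same as header=0).
inferred = pd.read_csv(StringIO(data))

# names given, header omitted: same as header=None, so 'a,b,c' is a data row.
named = pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'])

print(inferred)
print(named)
```

To replace an existing header row rather than keep it as data, pass header=0 together with names.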

Duplicate names parsing

If the file or header contains duplicate names, pandas will by default distinguish between them so as to prevent
overwriting data:

In [52]: data = ('a,b,a\n'
....: '0,1,2\n'
....: '3,4,5')
....:

In [53]: pd.read_csv(StringIO(data))
Out[53]:
a b a.1
0 0 1 2
1 3 4 5

There is no more duplicate data because mangle_dupe_cols=True by default, which modifies a series of duplicate
columns ‘X’, . . . , ‘X’ to become ‘X’, ‘X.1’, . . . , ‘X.N’. If mangle_dupe_cols=False, duplicate data can
arise:

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'

In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
Out[3]:
a b a
0 2 1 2
1 5 4 5

To prevent users from encountering this problem with duplicate data, a ValueError exception is raised if
mangle_dupe_cols != True:

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'

In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
...
ValueError: Setting mangle_dupe_cols=False is not supported yet

Filtering columns (usecols)

The usecols argument allows you to select any subset of the columns in a file, either using the column names,
position numbers or a callable:
New in version 0.20.0: support for callable usecols arguments

In [54]: data = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'


In [55]: pd.read_csv(StringIO(data))
Out[55]:
a b c d
0 1 2 3 foo
1 4 5 6 bar
2 7 8 9 baz

In [56]: pd.read_csv(StringIO(data), usecols=['b', 'd'])
Out[56]:
b d
0 2 foo
1 5 bar
2 8 baz

In [57]: pd.read_csv(StringIO(data), usecols=[0, 2, 3])
Out[57]:
a c d
0 1 3 foo
1 4 6 bar
2 7 9 baz

In [58]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['A', 'C'])
Out[58]:
a c
0 1 3
1 4 6
2 7 9

The usecols argument can also be used to specify which columns not to use in the final result:
In [59]: pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c'])
Out[59]:
b d
0 2 foo
1 5 bar
2 8 baz

In this case, the callable is specifying that we exclude the “a” and “c” columns from the output.

Comments and Empty Lines

Ignoring line comments and empty lines

If the comment parameter is specified, then completely commented lines will be ignored. By default, completely
blank lines will be ignored as well.
In [60]: data = ('\n'
....: 'a,b,c\n'
....: ' \n'
....: '# commented line\n'
....: '1,2,3\n'
....: '\n'
....: '4,5,6')
....:

In [61]: print(data)

a,b,c

# commented line
1,2,3

4,5,6

In [62]: pd.read_csv(StringIO(data), comment='#')
Out[62]:
a b c
0 1 2 3
1 4 5 6

If skip_blank_lines=False, then read_csv will not ignore blank lines:

In [63]: data = ('a,b,c\n'
....: '\n'
....: '1,2,3\n'
....: '\n'
....: '\n'
....: '4,5,6')
....:

In [64]: pd.read_csv(StringIO(data), skip_blank_lines=False)

Out[64]:
a b c
0 NaN NaN NaN
1 1.0 2.0 3.0
2 NaN NaN NaN
3 NaN NaN NaN
4 4.0 5.0 6.0

Warning: The presence of ignored lines might create ambiguities involving line numbers; the parameter header
uses row numbers (ignoring commented/empty lines), while skiprows uses line numbers (including
commented/empty lines):
In [65]: data = ('#comment\n'
....: 'a,b,c\n'
....: 'A,B,C\n'
....: '1,2,3')
....:

In [66]: pd.read_csv(StringIO(data), comment='#', header=1)

Out[66]:
A B C
0 1 2 3

In [67]: data = ('A,B,C\n'
....: '#comment\n'
....: 'a,b,c\n'
....: '1,2,3')
....:

In [68]: pd.read_csv(StringIO(data), comment='#', skiprows=2)
Out[68]:
a b c
0 1 2 3

If both header and skiprows are specified, header will be relative to the end of skiprows. For example:

In [69]: data = ('# empty\n'
....: '# second empty line\n'
....: '# third emptyline\n'
....: 'X,Y,Z\n'
....: '1,2,3\n'
....: 'A,B,C\n'
....: '1,2.,4.\n'
....: '5.,NaN,10.0\n')
....:

In [70]: print(data)
# empty
# second empty line
# third emptyline
X,Y,Z
1,2,3
A,B,C
1,2.,4.
5.,NaN,10.0

In [71]: pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
Out[71]:
A B C
0 1.0 2.0 4.0
1 5.0 NaN 10.0

Comments

Sometimes comments or metadata may be included in a file:

In [72]: print(open('tmp.csv').read())
ID,level,category
Patient1,123000,x # really unpleasant
Patient2,23000,y # wouldn't take his medicine
Patient3,1234018,z # awesome

By default, the parser includes the comments in the output:

In [73]: df = pd.read_csv('tmp.csv')

In [74]: df
Out[74]:
ID level category
0 Patient1 123000 x # really unpleasant
1 Patient2 23000 y # wouldn't take his medicine
2 Patient3 1234018 z # awesome

We can suppress the comments using the comment keyword:


In [75]: df = pd.read_csv('tmp.csv', comment='#')

In [76]: df
Out[76]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z

Dealing with Unicode Data

The encoding argument should be used for encoded unicode data, which will result in byte strings being decoded
to unicode in the result:

In [77]: data = (b'word,length\n'
....: b'Tr\xc3\xa4umen,7\n'
....: b'Gr\xc3\xbc\xc3\x9fe,5')
....:

In [78]: data = data.decode('utf8').encode('latin-1')

In [79]: df = pd.read_csv(BytesIO(data), encoding='latin-1')

In [80]: df
Out[80]:
word length
0 Träumen 7
1 Grüße 5

In [81]: df['word'][1]
Out[81]: 'Grüße'

Some formats which encode all characters as multiple bytes, like UTF-16, won’t parse correctly at all without
specifying the encoding. See the full list of Python standard encodings.
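As a small sketch of the UTF-16 case, the snippet below encodes a short table as UTF-16 bytes and reads it back by naming the encoding. The sample text and variable names are assumptions for illustration.

```python
from io import BytesIO

import pandas as pd

# Hypothetical sample, encoded as UTF-16 (Python prepends a byte order mark).
text = 'word,length\nTräumen,7\nGrüße,5'
raw = text.encode('utf-16')

# Naming the encoding lets the parser decode the multi-byte characters.
df = pd.read_csv(BytesIO(raw), encoding='utf-16')

print(df)
```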

Index columns and trailing delimiters

If a file has one more column of data than the number of column names, the first column will be used as the
DataFrame’s row names:

In [82]: data = ('a,b,c\n'
....: '4,apple,bat,5.7\n'
....: '8,orange,cow,10')
....:

In [83]: pd.read_csv(StringIO(data))
Out[83]:
a b c
4 apple bat 5.7
8 orange cow 10.0

In [84]: data = ('index,a,b,c\n'
....: '4,apple,bat,5.7\n'
....: '8,orange,cow,10')
....:

In [85]: pd.read_csv(StringIO(data), index_col=0)

Out[85]:
a b c
index
4 apple bat 5.7
8 orange cow 10.0

Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing
the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False:
In [86]: data = ('a,b,c\n'
....: '4,apple,bat,\n'
....: '8,orange,cow,')
....:

In [87]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,

In [88]: pd.read_csv(StringIO(data))
Out[88]:
a b c
4 apple bat NaN
8 orange cow NaN

In [89]: pd.read_csv(StringIO(data), index_col=False)
Out[89]:
a b c
0 4 apple bat
1 8 orange cow

If a subset of data is being parsed using the usecols option, the index_col specification is based on that subset,
not the original data.
In [90]: data = ('a,b,c\n'
....: '4,apple,bat,\n'
....: '8,orange,cow,')
....:

In [91]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,

In [92]: pd.read_csv(StringIO(data), usecols=['b', 'c'])
Out[92]:
b c
4 bat NaN
8 cow NaN

In [93]: pd.read_csv(StringIO(data), usecols=['b', 'c'], index_col=0)
Out[93]:
b c
4 bat NaN
8 cow NaN

Date Handling

Specifying Date Columns

To better facilitate working with datetime data, read_csv() uses the keyword arguments parse_dates and
date_parser to allow users to specify a variety of columns and date/time formats to turn the input text data into
datetime objects.
The simplest case is to just pass in parse_dates=True:
# Use a column as an index, and parse it as dates.
In [94]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True)

In [95]: df
Out[95]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5

# The parsed index is a DatetimeIndex
In [96]: df.index
Out[96]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)

It is often the case that we may want to store date and time data separately, or store various date fields separately. The
parse_dates keyword can be used to specify a combination of columns to parse the dates and/or times from.
You can pass a list of column lists to parse_dates; the resulting date columns will be prepended to the output
(so as to not affect the existing column order) and the new column names will be the concatenation of the component
column names:
In [97]: print(open('tmp.csv').read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900

In [98]: df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]])

In [99]: df
Out[99]:
1_2 1_3 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

By default the parser removes the component date columns, but you can choose to retain them via the
keep_date_col keyword:

In [100]: df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]],
.....: keep_date_col=True)
.....:

In [101]: df
Out[101]:
1_2 1_3 0 1 2 3 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 19990127 19:00:00 18:56:00 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 19990127 20:00:00 19:56:00 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD 19990127 21:00:00 20:56:00 -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD 19990127 21:00:00 21:18:00 -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD 19990127 22:00:00 21:56:00 -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD 19990127 23:00:00 22:56:00 -0.59

Note that if you wish to combine multiple columns into a single date column, a nested list must be used. In other
words, parse_dates=[1, 2] indicates that the second and third columns should each be parsed as separate date
columns while parse_dates=[[1, 2]] means the two columns should be parsed into a single column.
You can also use a dict to specify custom names for the combined columns:

In [102]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}

In [103]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec)

In [104]: df
Out[104]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

It is important to remember that if multiple text columns are to be parsed into a single date column, then a new column
is prepended to the data. The index_col specification is based off of this new set of columns rather than the original
data columns:

In [105]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}

In [106]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
.....: index_col=0) # index is the nominal column
.....:

In [107]: df
Out[107]:
actual 0 4
nominal
1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

Note: If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an
object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.
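A minimal sketch of the approach suggested in this note: read the column as text, then convert it with to_datetime() using an explicit format. The sample data, the format string and the choice of errors='coerce' (which turns unparsable entries into NaT) are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with a non-standard format and one unparsable entry.
data = ('when,value\n'
        '27.01.1999 19:00,0.81\n'
        'not a date,0.01')

# Read without parse_dates, so the column stays as text.
df = pd.read_csv(StringIO(data))

# Convert afterwards; entries that do not match the format become NaT.
df['when'] = pd.to_datetime(df['when'], format='%d.%m.%Y %H:%M',
                            errors='coerce')

print(df)
```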

Note: read_csv has a fast path for parsing datetime strings in ISO 8601 format, e.g., “2000-01-01T00:01:02+00:00” and
similar variations. If you can arrange for your data to store datetimes in this format, load times will be significantly
faster; speedups of about 20x have been observed.

Note: When passing a dict as the parse_dates argument, the order of the columns prepended is not guaranteed,
because dict objects do not impose an ordering on their keys. On Python 2.7+ you may use collections.OrderedDict
instead of a regular dict if this matters to you. Because of this, when using a dict for parse_dates in conjunction with
the index_col argument, it’s best to specify index_col as a column label rather than as an index on the resulting frame.

Date Parsing Functions

Finally, the parser allows you to specify a custom date_parser function to take full advantage of the flexibility of
the date parsing API:

In [108]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
.....: date_parser=pd.io.date_converters.parse_date_time)
.....:

In [109]: df
Out[109]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59

Pandas will try to call the date_parser function in three different ways. If an exception is raised, the next one is
tried:
1. date_parser is first called with one or more arrays as arguments, as defined using parse_dates (e.g.,
date_parser(['2013', '2013'], ['1', '2'])).
2. If #1 fails, date_parser is called with all the columns concatenated row-wise into a single array (e.g.,
date_parser(['2013 1', '2013 2'])).
3. If #2 fails, date_parser is called once for every row with one or more string arguments from
the columns indicated with parse_dates (e.g., date_parser('2013', '1') for the first row,
date_parser('2013', '2') for the second, etc.).
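The three calling conventions above can be emulated in plain Python to see which one a given parser ends up using. The helper call_with_fallback below is hypothetical and written only for this illustration; it is not pandas' internal code, and the space used to join columns in step 2 follows the example strings shown in the list.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def call_with_fallback(date_parser, *columns):
    """Hypothetical emulation of the three documented calling conventions."""
    arrays = [np.asarray(col, dtype=object) for col in columns]
    # 1. One array per column.
    try:
        return 'arrays', list(date_parser(*arrays))
    except Exception:
        pass
    # 2. Columns joined row-wise into a single array of strings.
    joined = np.array([' '.join(row) for row in zip(*arrays)], dtype=object)
    try:
        return 'concatenated', list(date_parser(joined))
    except Exception:
        pass
    # 3. One call per row, with one string per column.
    return 'row-wise', [date_parser(*row) for row in zip(*arrays)]


def single_array_parser(values):
    # Accepts exactly one array, so convention 1 fails and 2 succeeds.
    return pd.to_datetime(values, format='%Y %m')


def scalar_parser(year, month):
    # Works on scalars only, so conventions 1 and 2 fail and 3 succeeds.
    return datetime(int(year), int(month), 1)


years, months = ['2013', '2013'], ['1', '2']
how_single, parsed_single = call_with_fallback(single_array_parser, years, months)
how_scalar, parsed_scalar = call_with_fallback(scalar_parser, years, months)

print(how_single, parsed_single)
print(how_scalar, parsed_scalar)
```

A parser that only works row by row is reached last, which is one reason a vectorized parser is preferable for large files.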


Note that performance-wise, you should try these methods of parsing dates in order:
1. Try to infer the format using infer_datetime_format=True (see section below).
2. If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x, format=...).
3. If you have a really non-standard format, use a custom date_parser function. For optimal performance, this
should be vectorized, i.e., it should accept arrays as arguments.
You can explore the date parsing functionality in date_converters.py and add your own. We would love to turn this
module into a community supported set of date/time parsers. To get you started, date_converters.py contains
functions to parse dual date and time columns, year/month/day columns, and year/month/day/hour/minute/second
columns. It also contains a generic_parser function so you can curry it with a function that deals with a single
date rather than the entire array.

Parsing a CSV with mixed Timezones

Pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a
mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.

In [110]: content = """\
.....: a
.....: 2000-01-01T00:00:00+05:00
.....: 2000-01-01T00:00:00+06:00"""
.....:

In [111]: df = pd.read_csv(StringIO(content), parse_dates=['a'])

In [112]: df['a']
Out[112]:
0 2000-01-01 00:00:00+05:00
1 2000-01-01 00:00:00+06:00
Name: a, dtype: object

To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with
utc=True as the date_parser.

In [113]: df = pd.read_csv(StringIO(content), parse_dates=['a'],
.....: date_parser=lambda col: pd.to_datetime(col, utc=True))
.....:

In [114]: df['a']
Out[114]:
0 1999-12-31 19:00:00+00:00
1 1999-12-31 18:00:00+00:00
Name: a, dtype: datetime64[ns, UTC]
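
The same normalization can also be applied after reading, which may be convenient when the frame has already been loaded. This is a minimal sketch, assuming the column is first kept as strings:

```python
from io import StringIO

import pandas as pd

content = "a\n2000-01-01T00:00:00+05:00\n2000-01-01T00:00:00+06:00"

# Keep the mixed-offset values as strings while reading.
df = pd.read_csv(StringIO(content), dtype={"a": str})

# utc=True converts every value to UTC, so the column gets a single timezone.
df["a"] = pd.to_datetime(df["a"], utc=True)

print(df["a"])
```

Note that the original offsets are not kept: each timestamp is shifted to the equivalent UTC instant.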

Inferring Datetime Format

If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted the
same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will attempt
to guess the format of your datetime strings, and then use a faster means of parsing the strings. Parsing speed-ups of
5-10x have been observed. pandas will fall back to the usual parsing if either the format cannot be guessed or the format that
was guessed cannot properly parse the entire column of strings. So in general, infer_datetime_format should
not have any negative consequences if enabled.


Here are some examples of datetime strings that can be guessed (all representing December 30th, 2011 at 00:00:00):
• “20111230”
• “2011/12/30”
• “20111230 00:00:00”
• “12/30/2011 00:00:00”
• “30/Dec/2011 00:00:00”
• “30/December/2011 00:00:00”
Note that infer_datetime_format is sensitive to dayfirst. With dayfirst=True, it will guess
“01/12/2011” to be December 1st. With dayfirst=False (default) it will guess “01/12/2011” to be January
12th.

# Try to infer the format for the index column

In [115]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
.....: infer_datetime_format=True)
.....:

In [116]: df
Out[116]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5

International Date Formats

While US date formats tend to be MM/DD/YYYY, many international formats use DD/MM/YYYY instead. For
convenience, a dayfirst keyword is provided:

In [117]: print(open('tmp.csv').read())
date,value,cat
1/6/2000,5,a
2/6/2000,10,b
3/6/2000,15,c

In [118]: pd.read_csv('tmp.csv', parse_dates=[0])

Out[118]:
date value cat
0 2000-01-06 5 a
1 2000-02-06 10 b
2 2000-03-06 15 c

In [119]: pd.read_csv('tmp.csv', dayfirst=True, parse_dates=[0])

Out[119]:
date value cat
0 2000-06-01 5 a
1 2000-06-02 10 b
2 2000-06-03 15 c


Specifying method for floating-point conversion

The parameter float_precision can be specified in order to use a specific floating-point converter during parsing
with the C engine. The options are the ordinary converter, the high-precision converter, and the round-trip converter
(which is guaranteed to round-trip values after writing to a file). For example:

In [120]: val = '0.3066101993807095471566981359501369297504425048828125'

In [121]: data = 'a,b,c\n1,2,{0}'.format(val)

In [122]: abs(pd.read_csv(StringIO(data), engine='c',
.....: float_precision=None)['c'][0] - float(val))
.....:
Out[122]: 1.1102230246251565e-16

In [123]: abs(pd.read_csv(StringIO(data), engine='c',
.....: float_precision='high')['c'][0] - float(val))
.....:
Out[123]: 5.5511151231257827e-17

In [124]: abs(pd.read_csv(StringIO(data), engine='c',
.....: float_precision='round_trip')['c'][0] - float(val))
.....:
Out[124]: 0.0
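
The three calls above can be wrapped in a small loop for side-by-side comparison. The helper name parse_error is hypothetical, and the exact error of the ordinary and high-precision converters may differ between pandas versions; only the round-trip result is expected to be exactly zero:

```python
from io import StringIO

import pandas as pd

val = "0.3066101993807095471566981359501369297504425048828125"
data = "a,b,c\n1,2,{0}".format(val)


def parse_error(precision):
    """Absolute difference between the parsed value and Python's float(val)."""
    parsed = pd.read_csv(StringIO(data), engine="c",
                         float_precision=precision)["c"][0]
    return abs(parsed - float(val))


# None selects the ordinary converter of the C engine.
errors = {name: parse_error(name) for name in (None, "high", "round_trip")}

for name, err in errors.items():
    print(name, err)
```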

Thousand Separators

For large numbers that have been written with a thousands separator, you can set the thousands keyword to a string
of length 1 so that integers will be parsed correctly.
By default, numbers with a thousands separator will be parsed as strings:

In [125]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z

In [126]: df = pd.read_csv('tmp.csv', sep='|')

In [127]: df
Out[127]:
ID level category
0 Patient1 123,000 x
1 Patient2 23,000 y
2 Patient3 1,234,018 z

In [128]: df.level.dtype
Out[128]: dtype('O')

The thousands keyword allows integers to be parsed correctly:

In [129]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z

In [130]: df = pd.read_csv('tmp.csv', sep='|', thousands=',')

In [131]: df
Out[131]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z

In [132]: df.level.dtype
Out[132]: dtype('int64')

NA Values

To control which values are parsed as missing values (which are signified by NaN), specify a string in na_values.
If you specify a list of strings, then all values in it are considered to be missing values. If you specify a number (a
float, like 5.0 or an integer like 5), the corresponding equivalent values will also imply a missing value (in this
case effectively [5.0, 5] are recognized as NaN).
To completely override the default values that are recognized as missing, specify keep_default_na=False.
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A
N/A', '#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan',
'-nan', ''].
Let us consider some examples:

pd.read_csv('path_to_file.csv', na_values=[5])

In the example above 5 and 5.0 will be recognized as NaN, in addition to the defaults. A string will first be interpreted
as a numerical 5, then as a NaN.

pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=[""])

Above, only an empty field will be recognized as NaN.

pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=["NA", "0"])

Above, both NA and 0 as strings are NaN.

pd.read_csv('path_to_file.csv', na_values=["Nope"])

The default values, in addition to the string "Nope" are recognized as NaN.
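
The effect of keep_default_na is easiest to see by parsing the same text twice. The data below is made up for illustration:

```python
from io import StringIO

import pandas as pd

data = "a,b\nNA,1\nNope,2\nok,3"

# Defaults plus one extra marker: both "NA" and "Nope" become NaN.
with_defaults = pd.read_csv(StringIO(data), na_values=["Nope"])

# Defaults disabled: only "Nope" becomes NaN, "NA" stays an ordinary string.
only_custom = pd.read_csv(StringIO(data), keep_default_na=False,
                          na_values=["Nope"])

print(with_defaults["a"].isna().tolist())  # [True, True, False]
print(only_custom["a"].isna().tolist())    # [False, True, False]
```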

Infinity

Values such as inf are parsed as np.inf (positive infinity), and -inf as -np.inf (negative infinity). Parsing
ignores the case of the value, so Inf is also parsed as np.inf.


Returning Series

Using the squeeze keyword, the parser will return output with a single column as a Series:

In [133]: print(open('tmp.csv').read())
level
Patient1,123000
Patient2,23000
Patient3,1234018

In [134]: output = pd.read_csv('tmp.csv', squeeze=True)

In [135]: output
Out[135]:
Patient1 123000
Patient2 23000
Patient3 1234018
Name: level, dtype: int64

In [136]: type(output)
Out[136]: pandas.core.series.Series
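
An equivalent result can be obtained by calling DataFrame.squeeze on the parsed frame, which may be useful where the squeeze keyword is not available (later pandas releases deprecate it). A minimal sketch using the same data:

```python
from io import StringIO

import pandas as pd

data = "level\nPatient1,123000\nPatient2,23000\nPatient3,1234018"

# The header has one fewer entry than the rows, so the first column
# becomes the index and the frame has a single data column.
frame = pd.read_csv(StringIO(data))

# Collapse the single column into a Series.
output = frame.squeeze("columns")

print(type(output))
```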

Boolean values

The common values True, False, TRUE, and FALSE are all recognized as boolean. Occasionally you might want to
recognize other values as being boolean. To do this, use the true_values and false_values options as follows:

In [137]: data = ('a,b,c\n'
.....: '1,Yes,2\n'
.....: '3,No,4')
.....:

In [138]: print(data)
a,b,c
1,Yes,2
3,No,4

In [139]: pd.read_csv(StringIO(data))
Out[139]:
a b c
0 1 Yes 2
1 3 No 4

In [140]: pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])

Out[140]:
a b c
0 1 True 2
1 3 False 4

Handling “bad” lines

Some files may have malformed lines with too few fields or too many. Lines with too few fields will have NA values
filled in the trailing fields. Lines with too many fields will raise an error by default:


In [141]: data = ('a,b,c\n'
.....: '1,2,3\n'
.....: '4,5,6,7\n'
.....: '8,9,10')
.....:

In [142]: pd.read_csv(StringIO(data))
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-142-6388c394e6b8> in <module>
----> 1 pd.read_csv(StringIO(data))

/pandas/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header,
˓→names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine,
˓→converters, true_values, false_values, skipinitialspace, skiprows, skipfooter,
˓→nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_
˓→dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator,
˓→chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting,
˓→doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines,
˓→ warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)

700 skip_blank_lines=skip_blank_lines)
701
--> 702 return _read(filepath_or_buffer, kwds)
703
704 parser_f.__name__ = name

/pandas/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)

433
434 try:
--> 435 data = parser.read(nrows)
436 finally:
437 parser.close()

/pandas/pandas/io/parsers.py in read(self, nrows)

1137 def read(self, nrows=None):
1138 nrows = _validate_integer('nrows', nrows)
-> 1139 ret = self._engine.read(nrows)
1140
1141 # May alter columns / col_dict

/pandas/pandas/io/parsers.py in read(self, nrows)

1993 def read(self, nrows=None):
1994 try:
-> 1995 data = self._reader.read(nrows)
1996 except StopIteration:
1997 if self._first_chunk:

/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4


You can elect to skip bad lines:

In [29]: pd.read_csv(StringIO(data), error_bad_lines=False)

Skipping line 3: expected 3 fields, saw 4

Out[29]:
a b c
0 1 2 3
1 8 9 10

You can also use the usecols parameter to eliminate extraneous column data that appear in some lines but not others:

In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])

Out[30]:
a b c
0 1 2 3
1 4 5 6
2 8 9 10
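
Before deciding whether to skip or truncate such lines, it can help to find out where they are. The following sketch uses only the standard csv module; the helper find_bad_lines is hypothetical and not part of pandas:

```python
import csv
from io import StringIO

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"


def find_bad_lines(text, delimiter=","):
    """Return (line_number, field_count) for rows longer than the header."""
    rows = csv.reader(StringIO(text), delimiter=delimiter)
    expected = len(next(rows))
    # Line numbers are 1-based and count the header as line 1.
    return [(number, len(row))
            for number, row in enumerate(rows, start=2)
            if len(row) > expected]


print(find_bad_lines(data))  # [(3, 4)]
```

The reported line number matches the one in the parser error above. It is a row count, so it can differ from the physical line number when quoted fields contain embedded newlines.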

Dialect

The dialect keyword gives greater flexibility in specifying the file format. By default it uses the Excel dialect but
you can specify either the dialect name or a csv.Dialect instance.
Suppose you had data with unenclosed quotes:

In [143]: print(data)
label1,label2,label3
index1,"a,c,e
index2,b,d,f

By default, read_csv uses the Excel dialect and treats the double quote as the quote character, which causes it to
fail when it finds a newline before it finds the closing double quote.
We can get around this using dialect:

In [144]: import csv

In [145]: dia = csv.excel()

In [146]: dia.quoting = csv.QUOTE_NONE

In [147]: pd.read_csv(StringIO(data), dialect=dia)

Out[147]:
label1 label2 label3
index1 "a c e
index2 b d f
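
A dialect can also be written as a csv.Dialect subclass and passed as an instance. The class PipeDialect and the data below are illustrative assumptions, not something pandas ships:

```python
import csv
from io import StringIO

import pandas as pd


class PipeDialect(csv.Dialect):
    """Hypothetical dialect: pipe-delimited, spaces after the delimiter ignored."""
    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = True
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


data = "a| b| c\n1| 2| 3\n4| 5| 6"

# The delimiter and skipinitialspace settings are taken from the dialect.
df = pd.read_csv(StringIO(data), dialect=PipeDialect())

print(df)
```

Collecting the settings in one class is convenient when the same file format is read in several places, or shared with code that uses the csv module directly.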

All of the dialect options can be specified separately by keyword arguments:

In [148]: data = 'a,b,c~1,2,3~4,5,6'

In [149]: pd.read_csv(StringIO(data), lineterminator='~')

Out[149]:
a b c
0 1 2 3
1 4 5 6

Another common dialect option is skipinitialspace, to skip any whitespace after a delimiter:

In [150]: data = 'a, b, c\n1, 2, 3\n4, 5, 6'

In [151]: print(data)
a, b, c
1, 2, 3
4, 5, 6

In [152]: pd.read_csv(StringIO(data), skipinitialspace=True)

Out[152]:
a b c
0 1 2 3
1 4 5 6

The parsers make every attempt to “do the right thing” and not be fragile. Type inference is a pretty big deal. If a
column can be coerced to integer dtype without altering the contents, the parser will do so. Any non-numeric columns
will come through as object dtype as with the rest of pandas objects.

Quoting and Escape Characters

Quotes (and other escape characters) in embedded fields can be handled in any number of ways. One way is to use
backslashes; to properly parse this data, you should pass the escapechar option:

In [153]: data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'

In [154]: print(data)
a,b
"hello, \"Bob\", nice to see you",5

In [155]: pd.read_csv(StringIO(data), escapechar='\\')

Out[155]:
a b
0 hello, "Bob", nice to see you 5

Files with Fixed Width Columns

While read_csv() reads delimited data, the read_fwf() function works with data files that have known and fixed
column widths. The function parameters to read_fwf are largely the same as read_csv with two extra parameters,
and a different usage of the delimiter parameter:
• colspecs: A list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open intervals
(i.e., [from, to[ ). String value ‘infer’ can be used to instruct the parser to try detecting the column specifications
from the first 100 rows of the data. Default behavior, if not specified, is to infer.
• widths: A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.
• delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler
character of the fields if it is not spaces (e.g., ‘~’).
Consider a typical fixed-width data file:


In [156]: print(open('bar.csv').read())
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3

In order to parse this file into a DataFrame, we simply need to supply the column specifications to the read_fwf
function along with the file name:

# Column specifications are a list of half-intervals

In [157]: colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]

In [158]: df = pd.read_fwf('bar.csv', colspecs=colspecs, header=None, index_col=0)

In [159]: df
Out[159]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3

Note how the parser automatically assigns integer column labels when the header=None argument is specified.
Alternatively, you can supply just the column widths for contiguous columns:

# Widths are a list of integers

In [160]: widths = [6, 14, 13, 10]

In [161]: df = pd.read_fwf('bar.csv', widths=widths, header=None)

In [162]: df
Out[162]:
0 1 2 3
0 id8141 360.242940 149.910199 11950.7
1 id1594 444.953632 166.985655 11788.4
2 id1849 364.136849 183.628767 11806.2
3 id1230 413.836124 184.375703 11916.8
4 id1948 502.953953 173.237159 12468.3
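
The relationship between widths and colspecs can be sketched as a running sum: each width extends the previous end position. The helper widths_to_colspecs and the two sample rows are illustrative assumptions:

```python
from io import StringIO

import pandas as pd


def widths_to_colspecs(widths):
    """Turn contiguous field widths into half-open (start, end) intervals."""
    colspecs, start = [], 0
    for width in widths:
        colspecs.append((start, start + width))
        start += width
    return colspecs


widths = [6, 14, 13, 10]
colspecs = widths_to_colspecs(widths)

# Build two fixed-width rows that match those widths.
rows = [("id8141", 360.242940, 149.910199, 11950.7),
        ("id1594", 444.953632, 166.985655, 11788.4)]
text = "\n".join("{:<6}{:>14.6f}{:>13.6f}{:>10.1f}".format(*row) for row in rows)

by_widths = pd.read_fwf(StringIO(text), widths=widths, header=None)
by_colspecs = pd.read_fwf(StringIO(text), colspecs=colspecs, header=None)

print(colspecs)
print(by_widths.equals(by_colspecs))
```

Use colspecs directly when fields are separated by gaps that should be ignored, since widths can only describe columns that follow one another without gaps.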

The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between the
columns in the file.
By default, read_fwf will try to infer the file’s colspecs by using the first 100 rows of the file. It can do it
only in cases when the columns are aligned and correctly separated by the provided delimiter (default delimiter is
whitespace).

In [163]: df = pd.read_fwf('bar.csv', header=None, index_col=0)

In [164]: df
Out[164]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3

New in version 0.20.0.

read_fwf supports the dtype parameter for specifying the types of parsed columns to be different from the inferred
type.

In [165]: pd.read_fwf('bar.csv', header=None, index_col=0).dtypes

Out[165]:
1 float64
2 float64
3 float64
dtype: object

In [166]: pd.read_fwf('bar.csv', header=None, dtype={2: 'object'}).dtypes

Out[166]:
0 object
1 float64
2 object
3 float64
dtype: object

Indexes

Files with an “implicit” index column

Consider a file with one fewer entry in the header than the number of data columns:

In [167]: print(open('foo.csv').read())
A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5

In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame:

In [168]: pd.read_csv('foo.csv')
Out[168]:
A B C
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5

Note that the dates weren’t automatically parsed. In that case you would need to do as before:

In [169]: df = pd.read_csv('foo.csv', parse_dates=True)

In [170]: df.index
Out[170]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'],
dtype='datetime64[ns]', freq=None)


Reading an index with a MultiIndex

Suppose you have data indexed by two columns:

In [171]: print(open('data/mindex_ex.csv').read())
year,indiv,zit,xit
1977,"A",1.2,.6
1977,"B",1.5,.5
1977,"C",1.7,.8
1978,"A",.2,.06
1978,"B",.7,.2
1978,"C",.8,.3
1978,"D",.9,.5
1978,"E",1.4,.9
1979,"C",.2,.15
1979,"D",.14,.05
1979,"E",.5,.15
1979,"F",1.2,.5
1979,"G",3.4,1.9
1979,"H",5.4,2.7
1979,"I",6.4,1.2

The index_col argument to read_csv can take a list of column numbers to turn multiple columns into a
MultiIndex for the index of the returned object:

In [172]: df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])

In [173]: df
Out[173]:
zit xit
year indiv
1977 A 1.20 0.60
B 1.50 0.50
C 1.70 0.80
1978 A 0.20 0.06
B 0.70 0.20
C 0.80 0.30
D 0.90 0.50
E 1.40 0.90
1979 C 0.20 0.15
D 0.14 0.05
E 0.50 0.15
F 1.20 0.50
G 3.40 1.90
H 5.40 2.70
I 6.40 1.20

In [174]: df.loc[1978]
Out[174]:
zit xit
indiv
A 0.2 0.06
B 0.7 0.20
C 0.8 0.30
D 0.9 0.50
E 1.4 0.90


Reading columns with a MultiIndex

By specifying a list of row locations for the header argument, you can read in a MultiIndex for the columns.
Specifying non-consecutive rows will skip the intervening rows.

In [175]: from pandas.util.testing import makeCustomDataframe as mkdf

In [176]: df = mkdf(5, 3, r_idx_nlevels=2, c_idx_nlevels=4)

In [177]: df.to_csv('mi.csv')

In [178]: print(open('mi.csv').read())
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2

In [179]: pd.read_csv('mi.csv', header=[0, 1, 2, 3], index_col=[0, 1])

Out[179]:
C0 C_l0_g0 C_l0_g1 C_l0_g2
C1 C_l1_g0 C_l1_g1 C_l1_g2
C2 C_l2_g0 C_l2_g1 C_l2_g2
C3 C_l3_g0 C_l3_g1 C_l3_g2
R0 R1
R_l0_g0 R_l1_g0 R0C0 R0C1 R0C2
R_l0_g1 R_l1_g1 R1C0 R1C1 R1C2
R_l0_g2 R_l1_g2 R2C0 R2C1 R2C2
R_l0_g3 R_l1_g3 R3C0 R3C1 R3C2
R_l0_g4 R_l1_g4 R4C0 R4C1 R4C2

read_csv is also able to interpret a more common format of multi-column indices.

In [180]: print(open('mi2.csv').read())
,a,a,a,b,c,c
,q,r,s,t,u,v
one,1,2,3,4,5,6
two,7,8,9,10,11,12

In [181]: pd.read_csv('mi2.csv', header=[0, 1], index_col=0)

Out[181]:
a b c
q r s t u v
one 1 2 3 4 5 6
two 7 8 9 10 11 12

Note: If an index_col is not specified (e.g. you don’t have an index, or wrote it with df.to_csv(...,
index=False)), then any names on the columns index will be lost.


Automatically “sniffing” the delimiter

read_csv is capable of inferring the delimiter of delimited (not necessarily comma-separated) files, as pandas uses
the csv.Sniffer class of the csv module. For this, you have to specify sep=None.

In [182]: print(open('tmp2.sv').read())
:0:1:2:3
0:0.4691122999071863:-0.2828633443286633:-1.5090585031735124:-1.1356323710171934
1:1.2121120250208506:-0.17321464905330858:0.11920871129693428:-1.0442359662799567
2:-0.8618489633477999:-2.1045692188948086:-0.4949292740687813:1.071803807037338
3:0.7215551622443669:-0.7067711336300845:-1.0395749851146963:0.27185988554282986
4:-0.42497232978883753:0.567020349793672:0.27623201927771873:-1.0874006912859915
5:-0.6736897080883706:0.1136484096888855:-1.4784265524372235:0.5249876671147047
6:0.4047052186802365:0.5770459859204836:-1.7150020161146375:-1.0392684835147725
7:-0.3706468582364464:-1.1578922506419993:-1.344311812731667:0.8448851414248841
8:1.0757697837155533:-0.10904997528022223:1.6435630703622064:-1.4693879595399115
9:0.35702056413309086:-0.6746001037299882:-1.776903716971867:-0.9689138124473498

In [183]: pd.read_csv('tmp2.sv', sep=None, engine='python')

Out[183]:
Unnamed: 0 0 1 2 3
0 0 0.469112 -0.282863 -1.509059 -1.135632
1 1 1.212112 -0.173215 0.119209 -1.044236
2 2 -0.861849 -2.104569 -0.494929 1.071804
3 3 0.721555 -0.706771 -1.039575 0.271860
4 4 -0.424972 0.567020 0.276232 -1.087401
5 5 -0.673690 0.113648 -1.478427 0.524988
6 6 0.404705 0.577046 -1.715002 -1.039268
7 7 -0.370647 -1.157892 -1.344312 0.844885
8 8 1.075770 -0.109050 1.643563 -1.469388
9 9 0.357021 -0.674600 -1.776904 -0.968914
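
To see what the sniffer does on its own, it can be called directly from the standard library. The sample text and the list of candidate delimiters below are assumptions made for the sketch:

```python
import csv
from io import StringIO

import pandas as pd

sample = "name;age;city\nann;31;Oslo\nbob;45;Rome\n"

# The standard-library sniffer guesses the dialect from a text sample.
# Restricting the candidate delimiters makes the guess more predictable.
sniffed = csv.Sniffer().sniff(sample, delimiters=",;:|\t")
print(sniffed.delimiter)

# With sep=None and the Python engine, read_csv performs the detection itself.
df = pd.read_csv(StringIO(sample), sep=None, engine="python")
print(df)
```

Sniffing is a heuristic. When the delimiter is known in advance, passing it explicitly with sep is both faster and more reliable.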

Reading multiple files to create a single DataFrame

It’s best to use concat() to combine multiple files. See the cookbook for an example.
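
A minimal sketch of that pattern, with in-memory buffers standing in for files on disk (real code would pass file paths instead):

```python
from io import StringIO

import pandas as pd

# Stand-ins for several files with the same columns.
sources = [StringIO("a,b\n1,2\n3,4"), StringIO("a,b\n5,6")]

# Read every source into its own frame, then combine them in one call.
frames = [pd.read_csv(source) for source in sources]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Passing ignore_index=True gives the result a fresh 0..n-1 index instead of repeating the row labels of each input.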

Iterating through files chunk by chunk

Suppose you wish to iterate through a (potentially very large) file lazily rather than reading the entire file into memory,
such as the following:

In [184]: print(open('tmp.sv').read())
|0|1|2|3
0|0.4691122999071863|-0.2828633443286633|-1.5090585031735124|-1.1356323710171934
1|1.2121120250208506|-0.17321464905330858|0.11920871129693428|-1.0442359662799567
2|-0.8618489633477999|-2.1045692188948086|-0.4949292740687813|1.071803807037338
3|0.7215551622443669|-0.7067711336300845|-1.0395749851146963|0.27185988554282986
4|-0.42497232978883753|0.567020349793672|0.27623201927771873|-1.0874006912859915
5|-0.6736897080883706|0.1136484096888855|-1.4784265524372235|0.5249876671147047
6|0.4047052186802365|0.5770459859204836|-1.7150020161146375|-1.0392684835147725
7|-0.3706468582364464|-1.1578922506419993|-1.344311812731667|0.8448851414248841
8|1.0757697837155533|-0.10904997528022223|1.6435630703622064|-1.4693879595399115
9|0.35702056413309086|-0.6746001037299882|-1.776903716971867|-0.9689138124473498

In [185]: table = pd.read_csv('tmp.sv', sep='|')

In [186]: table
Out[186]:
Unnamed: 0 0 1 2 3
0 0 0.469112 -0.282863 -1.509059 -1.135632
1 1 1.212112 -0.173215 0.119209 -1.044236
2 2 -0.861849 -2.104569 -0.494929 1.071804
3 3 0.721555 -0.706771 -1.039575 0.271860
4 4 -0.424972 0.567020 0.276232 -1.087401
5 5 -0.673690 0.113648 -1.478427 0.524988
6 6 0.404705 0.577046 -1.715002 -1.039268
7 7 -0.370647 -1.157892 -1.344312 0.844885
8 8 1.075770 -0.109050 1.643563 -1.469388
9 9 0.357021 -0.674600 -1.776904 -0.968914

By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader:

In [187]: reader = pd.read_csv('tmp.sv', sep='|', chunksize=4)

In [188]: reader
Out[188]: <pandas.io.parsers.TextFileReader at 0x7f7a09659400>

In [189]: for chunk in reader:
.....:     print(chunk)
.....:
Unnamed: 0 0 1 2 3
0 0 0.469112 -0.282863 -1.509059 -1.135632
1 1 1.212112 -0.173215 0.119209 -1.044236
2 2 -0.861849 -2.104569 -0.494929 1.071804
3 3 0.721555 -0.706771 -1.039575 0.271860
Unnamed: 0 0 1 2 3
4 4 -0.424972 0.567020 0.276232 -1.087401
5 5 -0.673690 0.113648 -1.478427 0.524988
6 6 0.404705 0.577046 -1.715002 -1.039268
7 7 -0.370647 -1.157892 -1.344312 0.844885
Unnamed: 0 0 1 2 3
8 8 1.075770 -0.10905 1.643563 -1.469388
9 9 0.357021 -0.67460 -1.776904 -0.968914

Specifying iterator=True will also return the TextFileReader object:

In [190]: reader = pd.read_csv('tmp.sv', sep='|', iterator=True)

In [191]: reader.get_chunk(5)
Out[191]:
Unnamed: 0 0 1 2 3
0 0 0.469112 -0.282863 -1.509059 -1.135632
1 1 1.212112 -0.173215 0.119209 -1.044236
2 2 -0.861849 -2.104569 -0.494929 1.071804
3 3 0.721555 -0.706771 -1.039575 0.271860
4 4 -0.424972 0.567020 0.276232 -1.087401
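
A typical use of chunked reading is to accumulate a result without holding the whole file in memory. The data and column names in this sketch are made up:

```python
from io import StringIO

import pandas as pd

# Illustrative pipe-separated data with ten rows.
lines = ["id|value"] + ["{}|{}".format(i, i * 10) for i in range(10)]
text = "\n".join(lines)

total = 0
chunk_lengths = []

# Each chunk is an ordinary DataFrame holding at most `chunksize` rows.
for chunk in pd.read_csv(StringIO(text), sep="|", chunksize=4):
    chunk_lengths.append(len(chunk))
    total += chunk["value"].sum()

print(chunk_lengths, total)  # [4, 4, 2] 450
```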


Specifying the parser engine

Under the hood pandas uses a fast and efficient parser implemented in C as well as a Python implementation which is
currently more feature-complete. Where possible pandas uses the C parser (specified as engine='c'), but may fall
back to Python if C-unsupported options are specified. Currently, C-unsupported options include:
• sep other than a single character (e.g. regex separators)
• skipfooter
• sep=None with delim_whitespace=False
Specifying any of the above options will produce a ParserWarning unless the python engine is selected explicitly
using engine='python'.
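
The fallback can be observed by recording warnings around the call. This sketch assumes a regex separator of two or more whitespace characters; the data is illustrative:

```python
import warnings
from io import StringIO

import pandas as pd

data = "a  b\n1  2\n3  4"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # A regex separator is not supported by the C parser, so pandas falls back.
    fallback = pd.read_csv(StringIO(data), sep=r"\s{2,}")

with warnings.catch_warnings(record=True) as quiet:
    warnings.simplefilter("always")
    # Selecting the Python engine explicitly avoids the ParserWarning.
    explicit = pd.read_csv(StringIO(data), sep=r"\s{2,}", engine="python")

fell_back = any(issubclass(w.category, pd.errors.ParserWarning) for w in caught)
warned_again = any(issubclass(w.category, pd.errors.ParserWarning) for w in quiet)

print(fell_back, warned_again)
```

Both calls return the same frame; the only difference is whether the engine choice is made silently by you or announced by pandas.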

Reading remote files

You can pass in a URL to a CSV file:

df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',
sep='\t')

S3 URLs are handled as well but require installing the S3Fs library:

df = pd.read_csv('s3://pandas-test/tips.csv')

If your S3 bucket requires credentials you will need to set them as environment variables or in the
~/.aws/credentials config file. Refer to the S3Fs documentation on credentials.

Writing out Data

Writing to CSV format

The Series and DataFrame objects have an instance method to_csv which allows storing the contents of the
object as a comma-separated-values file. The function takes a number of arguments. Only the first is required.
• path_or_buf: A string path to the file to write or a StringIO
• sep : Field delimiter for the output file (default “,”)
• na_rep: A string representation of a missing value (default ‘’)
• float_format: Format string for floating point numbers
• columns: Columns to write (default None)
• header: Whether to write out the column names (default True)
• index: whether to write row (index) names (default True)
• index_label: Column label(s) for index column(s) if desired. If None (default), and header and index are
True, then the index names are used. (A sequence should be given if the DataFrame uses MultiIndex).
• mode : Python write mode, default ‘w’
• encoding: a string representing the encoding to use if the contents are non-ASCII, for Python versions prior
to 3
• line_terminator: Character sequence denoting line end (default ‘\n’)


• quoting: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set
a float_format then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-
numeric
• quotechar: Character used to quote fields (default ‘”’)
• doublequote: Control quoting of quotechar in fields (default True)
• escapechar: Character used to escape sep and quotechar when appropriate (default None)
• chunksize: Number of rows to write at a time
• tupleize_cols: If False (default), write as a list of tuples, otherwise write in an expanded line format
suitable for read_csv
• date_format: Format string for datetime objects
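
A short sketch combining a few of these options. Here path_or_buf is left out, in which case to_csv returns the text rather than writing a file; the frame contents are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, np.nan]})

# Semicolon-separated, missing values written as NULL, two decimals, no index.
text = df.to_csv(sep=";", na_rep="NULL", float_format="%.2f", index=False)

print(text)
```

The result can be split with str.splitlines() for inspection, which works regardless of the line terminator in use.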

Writing a formatted string

The DataFrame object has an instance method to_string which allows control over the string representation of
the object. All arguments are optional:
• buf default None, for example a StringIO object
• columns default None, which columns to write
• col_space default None, minimum width of each column.
• na_rep default NaN, representation of NA value
• formatters default None, a dictionary (by column) of functions each of which takes a single argument and
returns a formatted string
• float_format default None, a function which takes a single (float) argument and returns a formatted string;
to be applied to floats in the DataFrame.
• sparsify default True, set to False for a DataFrame with a hierarchical index to print every MultiIndex key
at each row.
• index_names default True, will print the names of the indices
• index default True, will print the index (i.e., row labels)
• header default True, will print the column labels
• justify default left, will print column headers left- or right-justified
The Series object also has a to_string method, but with only the buf, na_rep, float_format arguments.
There is also a length argument which, if set to True, will additionally output the length of the Series.

4.1.2 JSON

Read and write JSON format files and strings.

Writing JSON

A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:
• path_or_buf : the pathname or buffer to write the output. This can be None, in which case a JSON string is
returned.


• orient :
Series:
– default is index
– allowed values are {split, records, index}
DataFrame:
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string

split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, . . . , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array

• date_format : string, type of date conversion, ‘epoch’ for timestamp, ‘iso’ for ISO8601.
• double_precision : The number of decimal places to use when encoding floating point values, default 10.
• force_ascii : force encoded string to be ASCII, default True.
• date_unit : The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’ or
‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’.
• default_handler : The handler to call if an object cannot otherwise be converted to a suitable format for
JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
• lines : If orient is records, write each record on its own line as JSON.
Note that NaN, NaT and None values will be converted to null, and datetime objects will be converted based on the
date_format and date_unit parameters.

In [192]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [193]: json = dfj.to_json()

In [194]: json
Out[194]: '{"A":{"0":-1.2945235903,"1":0.2766617129,"2":-0.0139597524,"3":-0.
˓→0061535699,"4":0.8957173022},"B":{"0":0.4137381054,"1":-0.472034511,"2":-0.
˓→3625429925,"3":-0.923060654,"4":0.8052440254}}'
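
As a sketch of the lines and double_precision options together, the following writes one record per line and reads the lines back with the standard json module. The frame contents are illustrative:

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [1.123456, 2.0], "B": ["x", "y"]})

# One JSON object per line, floats limited to two decimal places.
text = df.to_json(orient="records", lines=True, double_precision=2)

# Each non-empty line is a complete JSON document.
records = [json.loads(line) for line in text.splitlines() if line]

print(records)
```

This line-delimited layout is convenient for streaming, since a consumer can process one record at a time without parsing the whole output.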

Orient Options

There are a number of different options for the format of the resulting JSON file / string. Consider the following
DataFrame and Series:

In [195]: dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
.....: columns=list('ABC'), index=list('xyz'))
.....:

In [196]: dfjo
Out[196]:
A B C
x 1 4 7
y 2 5 8
z 3 6 9

In [197]: sjo = pd.Series(dict(x=15, y=16, z=17), name='D')

In [198]: sjo
Out[198]:
x 15
y 16
z 17
Name: D, dtype: int64

Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting
as the primary index:

In [199]: dfjo.to_json(orient="columns")
Out[199]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'

# Not available for Series

Index oriented (the default for Series) is similar to column oriented, but the index labels are now primary:

In [200]: dfjo.to_json(orient="index")
Out[200]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'

In [201]: sjo.to_json(orient="index")
Out[201]: '{"x":15,"y":16,"z":17}'

Record oriented serializes the data to a JSON array of column -> value records; index labels are not included. This is
useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js:

In [202]: dfjo.to_json(orient="records")
Out[202]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'

In [203]: sjo.to_json(orient="records")
Out[203]: '[15,16,17]'

Value oriented is a bare-bones option which serializes to nested JSON arrays of values only; column and index labels
are not included:

In [204]: dfjo.to_json(orient="values")
Out[204]: '[[1,4,7],[2,5,8],[3,6,9]]'

# Not available for Series

Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name is also
included for Series:

In [205]: dfjo.to_json(orient="split")
Out[205]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'

In [206]: sjo.to_json(orient="split")
Out[206]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'

Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including but not
limited to dtypes and index names.

Note: Any orient option that encodes to a JSON object will not preserve the ordering of index and column labels
during round-trip serialization. If you wish to preserve label ordering use the split option as it uses ordered containers.
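A minimal sketch of the point made in the note: with the split orient, labels travel in JSON arrays, so their order survives a round trip. The frame below is hypothetical, and the JSON text is wrapped in StringIO so that the same call works whether or not a given pandas version accepts a raw JSON string.

```python
from io import StringIO

import pandas as pd

# Labels are deliberately not in sorted order
df = pd.DataFrame({"b": [1, 2], "a": [3, 4]}, index=["z", "y"])

encoded = df.to_json(orient="split")
restored = pd.read_json(StringIO(encoded), orient="split")

print(encoded)
print(list(restored.columns), list(restored.index))
```

The same orient has to be passed on both the writing and the reading side.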

Date Handling

Writing in ISO date format:

In [207]: dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [208]: dfd['date'] = pd.Timestamp('20130101')

In [209]: dfd = dfd.sort_index(axis=1, ascending=False)

In [210]: json = dfd.to_json(date_format='iso')

In [211]: json
Out[211]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'

Writing in ISO date format, with microseconds:

In [212]: json = dfd.to_json(date_format='iso', date_unit='us')

In [213]: json
Out[213]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'

Epoch timestamps, in seconds:

In [214]: json = dfd.to_json(date_format='epoch', date_unit='s')

In [215]: json
Out[215]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
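To see how the epoch values relate to the ISO strings, the following standard-library sketch decodes the integer written for 2013-01-01 in seconds and in milliseconds. It only illustrates the arithmetic and does not use pandas.

```python
from datetime import datetime, timezone

epoch_seconds = 1356998400     # as written with date_unit='s'
epoch_millis = 1356998400000   # as written with date_unit='ms'

from_seconds = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
from_millis = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)

print(from_seconds.isoformat())
print(from_seconds == from_millis)
```

Both values describe the same instant; only the unit of the stored integer differs.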

Writing to a file, with a date index and a date column:

In [216]: dfj2 = dfj.copy()

In [217]: dfj2['date'] = pd.Timestamp('20130101')

In [218]: dfj2['ints'] = list(range(5))

In [219]: dfj2['bools'] = True

In [220]: dfj2.index = pd.date_range('20130101', periods=5)

In [221]: dfj2.to_json('test.json')

In [222]: with open('test.json') as fh:

.....: print(fh.read())
.....:
{"A":{"1356998400000":-1.2945235903,"1357084800000":0.2766617129,"1357171200000":-0.0139597524,"1357257600000":-0.0061535699,"1357344000000":0.8957173022},"B":{"1356998400000":0.4137381054,"1357084800000":-0.472034511,"1357171200000":-0.3625429925,"1357257600000":-0.923060654,"1357344000000":0.8052440254},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}

Fallback Behavior

If the JSON serializer cannot handle the container contents directly it will fall back in the following manner:
• if the dtype is unsupported (e.g. np.complex) then the default_handler, if provided, will be called for
each value, otherwise an exception is raised.
• if an object is unsupported it will attempt the following:
– check if the object has defined a toDict method and call it. A toDict method should return a dict
which will then be JSON serialized.
– invoke the default_handler if one was provided.
– convert the object to a dict by traversing its contents. However this will often fail with an
OverflowError or give unexpected results.
In general the best approach for unsupported objects or dtypes is to provide a default_handler. For example:

>>> pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json()  # raises
RuntimeError: Unhandled numpy dtype 15

can be dealt with by specifying a simple default_handler:

In [223]: pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)

Out[223]: '{"0":{"0":"(1+0j)","1":"(2+0j)","2":"(1+2j)"}}'
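A default_handler does not have to be str. The sketch below assumes a hypothetical user-defined class, Point, that defines no toDict method, and a hypothetical handler, point_handler, that turns it into a string; both names are invented for illustration.

```python
import json

import pandas as pd


class Point:
    """Hypothetical type with no built-in JSON representation."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def point_handler(obj):
    # Called for each value the serializer cannot handle itself
    if isinstance(obj, Point):
        return "Point({}, {})".format(obj.x, obj.y)
    return str(obj)


df = pd.DataFrame({"p": [Point(1, 2), Point(3, 4)]})
encoded = df.to_json(default_handler=point_handler)
decoded = json.loads(encoded)

print(encoded)
```

Returning a plain string keeps the handler simple; the receiving side then has to know how to interpret that string.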

Reading JSON

read_json accepts a number of parameters when reading a JSON string into a pandas object. The parser will try to parse a DataFrame
if typ is not supplied or is None. To explicitly force Series parsing, pass typ='series'.

• filepath_or_buffer : a valid JSON string or file handle / StringIO. The string could be a URL. Valid
URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be
file://localhost/path/to/table.json
• typ : type of object to recover (series or frame), default ‘frame’
• orient : the format of the JSON string.
Series:
– default is index
– allowed values are {split, records, index}
DataFrame:
– default is columns
– allowed values are {split, records, index, columns, values, table}

• dtype : if True, infer dtypes; if a dict of column to dtype, use those; if False, don’t infer dtypes at
all. Default is True. Applies only to the data.
• convert_axes : boolean, try to convert the axes to the proper dtypes, default is True
• convert_dates : a list of columns to parse for dates; If True, then try to parse date-like columns, default
is True.
• keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like columns.
• numpy : direct decoding to NumPy arrays. default is False; Supports numeric data only, although labels may
be non-numeric. Also note that the JSON ordering MUST be the same for each term if numpy=True.
• precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when
decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
• date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp
precision will be detected, if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force timestamp
precision to seconds, milliseconds, microseconds or nanoseconds respectively.
• lines : reads file as one json object per line.
• encoding : The encoding to use to decode py3 bytes.
• chunksize : when used in combination with lines=True, return a JsonReader which reads in chunksize
lines per iteration.
The parser will raise one of ValueError/TypeError/AssertionError if the JSON is not parseable.
If a non-default orient was used when encoding to JSON be sure to pass the same option here so that decoding
produces sensible results, see Orient Options for an overview.
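The typ and orient parameters described above can be combined as in the following sketch, which writes a Series with the split orient and reads it back. The Series mirrors the sjo example used earlier; wrapping the text in StringIO is a choice made here so the call does not depend on raw-string support.

```python
from io import StringIO

import pandas as pd

s = pd.Series({"x": 15, "y": 16, "z": 17}, name="D")

encoded = s.to_json(orient="split")

# typ='series' forces Series parsing; orient must match the one used to write
restored = pd.read_json(StringIO(encoded), typ="series", orient="split")

print(encoded)
print(restored)
```

Because the split orient stores the name alongside the index and data, the Series name is available again after reading.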

Data Conversion

The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and
all of the data into appropriate types, including dates. If you need to override specific dtypes, pass a dict to dtype.
convert_axes should only be set to False if you need to preserve string-like numbers (e.g. ‘1’, ‘2’) in an axis.

Note: Large integer values may be converted to dates if convert_dates=True and the data and / or column labels
appear ‘date-like’. The exact threshold depends on the date_unit specified. ‘date-like’ means that the column label
meets one of the following criteria:
• it ends with '_at'
• it ends with '_time'
• it begins with 'timestamp'
• it is 'modified'
• it is 'date'
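The label criteria listed above can be written down as a small predicate. This is only a sketch of the stated rules, using a hypothetical helper name (looks_date_like); it is not the function pandas uses internally and it ignores the separate check on the data values.

```python
def looks_date_like(label):
    """Return True if a column label matches the criteria listed above."""
    if not isinstance(label, str):
        return False
    return (
        label.endswith("_at")
        or label.endswith("_time")
        or label.startswith("timestamp")
        or label in ("modified", "date")
    )


labels = ["created_at", "start_time", "timestamp_utc", "modified", "date", "price"]
matches = [name for name in labels if looks_date_like(name)]
print(matches)
```

Columns whose labels match are candidates for date conversion when convert_dates=True and keep_default_dates=True.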

Warning: When reading JSON data, automatic coercing into dtypes has some quirks:
• an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed
to be the same as before serialization
• a column that was float data will be converted to integer if it can be done safely, e.g. a column of 1.0 values
• bool columns will be converted to integer on reconstruction
Thus there are times where you may want to specify specific dtypes via the dtype keyword argument.
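A short sketch of the float-to-integer quirk and of the dtype keyword as a remedy. The frame is hypothetical, and the inferred dtype printed on the first line may depend on the pandas version, so only the explicitly requested dtype should be relied on.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "flag": [True, False]})
encoded = df.to_json()

# Let pandas infer the dtypes: whole-number floats may come back as integers
inferred = pd.read_json(StringIO(encoded))

# Request a dtype for column 'a' explicitly
explicit = pd.read_json(StringIO(encoded), dtype={"a": "float64"})

print(inferred["a"].dtype)
print(explicit["a"].dtype)
```

Passing a dict only pins the listed columns; the remaining columns are still inferred.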

Reading from a JSON string:

In [224]: pd.read_json(json)
Out[224]:
date B A
0 2013-01-01 2.565646 -1.206412
1 2013-01-01 1.340309 1.431256
2 2013-01-01 -0.226169 -1.170299
3 2013-01-01 0.813850 0.410835
4 2013-01-01 -0.827317 0.132003

Reading from a file:

In [225]: pd.read_json('test.json')
Out[225]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True

Don’t convert any data (but still convert axes and dates):

In [226]: pd.read_json('test.json', dtype=object).dtypes

Out[226]:
A object
B object
date object
ints object
bools object
dtype: object

Specify dtypes for conversion:

In [227]: pd.read_json('test.json', dtype={'A': 'float32', 'bools': 'int8'}).dtypes

Out[227]:
A float32
B float64
date datetime64[ns]
ints int64
bools int8
dtype: object

Preserve string indices:

In [228]: si = pd.DataFrame(np.zeros((4, 4)), columns=list(range(4)),

.....: index=[str(i) for i in range(4)])
.....:

In [229]: si
Out[229]:
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0

In [230]: si.index
Out[230]: Index(['0', '1', '2', '3'], dtype='object')

In [231]: si.columns
Out[231]: Int64Index([0, 1, 2, 3], dtype='int64')

In [232]: json = si.to_json()

In [233]: sij = pd.read_json(json, convert_axes=False)

In [234]: sij
Out[234]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0

In [235]: sij.index
Out[235]: Index(['0', '1', '2', '3'], dtype='object')

In [236]: sij.columns
Out[236]: Index(['0', '1', '2', '3'], dtype='object')

Dates written in nanoseconds need to be read back in nanoseconds:

In [237]: json = dfj2.to_json(date_unit='ns')

# Try to parse timestamps as milliseconds -> Won't Work

In [238]: dfju = pd.read_json(json, date_unit='ms')

In [239]: dfju
Out[239]:
A B date ints bools
1356998400000000000 -1.294524 0.413738 1356998400000000000 0 True
1357084800000000000 0.276662 -0.472035 1356998400000000000 1 True
1357171200000000000 -0.013960 -0.362543 1356998400000000000 2 True
1357257600000000000 -0.006154 -0.923061 1356998400000000000 3 True
1357344000000000000 0.895717 0.805244 1356998400000000000 4 True

# Let pandas detect the correct precision

In [240]: dfju = pd.read_json(json)

In [241]: dfju
Out[241]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True

# Or specify that all timestamps are in nanoseconds

In [242]: dfju = pd.read_json(json, date_unit='ns')

In [243]: dfju
Out[243]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True
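The nanosecond round trip can also be checked on a smaller, self-contained frame. The sketch below is an illustration with made-up data: it writes epoch nanoseconds, looks at the raw integer, and reads the column back with the matching date_unit.

```python
import json
from io import StringIO

import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2013-01-01", "2013-01-02"])})

# Write epoch timestamps with nanosecond precision
encoded = df.to_json(date_format="epoch", date_unit="ns")
raw = json.loads(encoded)

# Read them back, stating the unit explicitly
restored = pd.read_json(StringIO(encoded), date_unit="ns")

print(raw["date"]["0"])
print(restored["date"].tolist())
```

The column is converted because its label, date, is one of the date-like labels described under Data Conversion.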

The Numpy Parameter

Note: This supports numeric data only. Index and columns labels may be non-numeric, e.g. strings, dates etc.

If numpy=True is passed to read_json an attempt will be made to sniff an appropriate dtype during deserialization
and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python objects.
This can provide speedups if you are deserialising a large amount of numeric data:

In [244]: randfloats = np.random.uniform(-100, 1000, 10000)

In [245]: randfloats.shape = (1000, 10)

In [246]: dffloats = pd.DataFrame(randfloats, columns=list('ABCDEFGHIJ'))

In [247]: jsonfloats = dffloats.to_json()

In [248]: %timeit pd.read_json(jsonfloats)

11.8 ms +- 612 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [249]: %timeit pd.read_json(jsonfloats, numpy=True)

8.59 ms +- 192 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

The speedup is less noticeable for smaller datasets:

In [250]: jsonfloats = dffloats.head(100).to_json()

In [251]: %timeit pd.read_json(jsonfloats)

7.78 ms +- 155 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

In [252]: %timeit pd.read_json(jsonfloats, numpy=True)

6.88 ms +- 164 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

Warning: Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected output if
these assumptions are not satisfied:
• data is numeric.
• data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised, or
incorrect output may be produced if this condition is not satisfied.
• labels are ordered. Labels are only read from the first container, it is assumed that each subsequent row /
column has been encoded in the same order. This should be satisfied if the data was encoded using to_json
but may not be the case if the JSON is from another source.

Normalization

pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table.

In [253]: from pandas.io.json import json_normalize

In [254]: data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},

.....: {'name': {'given': 'Mose', 'family': 'Regner'}},
.....: {'id': 2, 'name': 'Faye Raker'}]
.....:

In [255]: json_normalize(data)
Out[255]:
id name name.family name.first name.given name.last
0 1.0 NaN NaN Coleen NaN Volk
1 NaN NaN Regner NaN Mose NaN
2 2.0 Faye Raker NaN NaN NaN NaN

In [256]: data = [{'state': 'Florida',

.....: 'shortname': 'FL',
.....: 'info': {'governor': 'Rick Scott'},
.....: 'counties': [{'name': 'Dade', 'population': 12345},
.....: {'name': 'Broward', 'population': 40000},
.....: {'name': 'Palm Beach', 'population': 60000}]},
.....: {'state': 'Ohio',
.....: 'shortname': 'OH',
.....: 'info': {'governor': 'John Kasich'},
.....: 'counties': [{'name': 'Summit', 'population': 1234},
.....: {'name': 'Cuyahoga', 'population': 1337}]}]
.....:

In [257]: json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

Out[257]:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
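The record path and metadata arguments used above can be tried on a smaller input. The data below is invented for illustration. The import is written defensively because later pandas releases expose json_normalize at the top level, while the release documented here provides it in pandas.io.json.

```python
try:
    from pandas import json_normalize  # later pandas releases
except ImportError:
    from pandas.io.json import json_normalize  # e.g. pandas 0.24

# Hypothetical nested records: one customer, several items
orders = [
    {"customer": "Ada", "items": [{"sku": "A1", "qty": 2},
                                  {"sku": "B2", "qty": 1}]},
    {"customer": "Grace", "items": [{"sku": "C3", "qty": 5}]},
]

# One output row per item; 'customer' is repeated as metadata
flat = json_normalize(orders, record_path="items", meta=["customer"])
print(flat)
```

Each element of the items lists becomes a row, and the meta field is copied onto every row that came from the same parent record.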

Line delimited json

New in version 0.19.0.

pandas is able to read and write line-delimited json files that are common in data processing pipelines using Hadoop
or Spark.
New in version 0.21.0.
For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can
be useful for large files or to read from a stream.
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:

In [259]: df = pd.read_json(jsonl, lines=True)

In [260]: df
Out[260]:
a b
0 1 2
1 3 4

In [261]: df.to_json(orient='records', lines=True)

Out[261]: '{"a":1,"b":2}\n{"a":3,"b":4}'

# reader is an iterator that returns `chunksize` lines each iteration
# (StringIO is assumed to have been imported, e.g. `from io import StringIO`)

In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)

In [263]: reader
Out[263]: <pandas.io.json.json.JsonReader at 0x7f7a09505898>

In [264]: for chunk in reader:

.....: print(chunk)
.....:
Empty DataFrame
Columns: []
Index: []
a b
0 1 2
a b
1 3 4
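Chunked reading is mostly useful for aggregating over input that should not be held in memory at once. The following sketch, with made-up data, sums a column chunk by chunk. The input has no leading or trailing blank line, so no empty chunk is produced.

```python
from io import StringIO

import pandas as pd

# Three line-delimited JSON records
jsonl = "\n".join([
    '{"a": 1, "b": 2}',
    '{"a": 3, "b": 4}',
    '{"a": 5, "b": 6}',
])

reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=2)

total = 0
n_chunks = 0
for chunk in reader:
    # Each chunk is a DataFrame with at most `chunksize` rows
    total += int(chunk["a"].sum())
    n_chunks += 1

print(total, n_chunks)
```

Only one chunk is materialized as a DataFrame at a time, which is the point of using chunksize on large inputs.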

Table Schema

New in version 0.20.0.

Table Schema is a spec for describing tabular datasets as a JSON object. The JSON includes information on the field
names, types, and other attributes. You can use the orient table to build a JSON string with two fields, schema and
data.

In [265]: df = pd.DataFrame({'A': [1, 2, 3],

.....: 'B': ['a', 'b', 'c'],
.....: 'C': pd.date_range('2016-01-01', freq='d', periods=3)},
.....: index=pd.Index(range(3), name='idx'))
.....:

In [266]: df
Out[266]:
A B C
idx
0 1 a 2016-01-01
1 2 b 2016-01-02
2 3 c 2016-01-03

In [267]: df.to_json(orient='table', date_format="iso")

Out[267]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'

The schema field contains the fields key, which itself contains a list of column name to type pairs, including the
Index or MultiIndex (see below for a list of types). The schema field also contains a primaryKey field if the
(Multi)index is unique.
The second field, data, contains the serialized data with the records orient. The index is included, and any
datetimes are ISO 8601 formatted, as required by the Table Schema spec.
The full list of supported types is described in the Table Schema spec. This table shows the mapping from pandas
types:

Pandas type Table Schema type

int64 integer
float64 number
bool boolean
datetime64[ns] datetime
timedelta64[ns] duration
categorical any
object string
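The mapping can be inspected directly by decoding the output of the table orient. The sketch below uses a hypothetical three-column frame and collects the type reported for each field.

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["a", "b"], "C": [0.5, 1.5]},
                  index=pd.Index([0, 1], name="idx"))

payload = json.loads(df.to_json(orient="table"))

# Map each field name to the Table Schema type chosen for it
types = {field["name"]: field["type"]
         for field in payload["schema"]["fields"]}

print(types)
print(payload["schema"]["primaryKey"])
print(len(payload["data"]))
```

The index appears as an ordinary field, and because it is unique it is also listed under primaryKey.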

A few notes on the generated table schema:

• The schema object contains a pandas_version field. This contains the version of pandas’ dialect of the
schema, and will be incremented with each revision.
• All dates are converted to UTC when serializing, even timezone-naive values, which are treated as UTC with
an offset of 0.

In [268]: from pandas.io.json import build_table_schema

In [269]: s = pd.Series(pd.date_range('2016', periods=4))

In [270]: build_table_schema(s)
Out[270]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• Datetimes with a timezone (before serializing) include an additional field tz with the time zone name (e.g.
'US/Central').

In [271]: s_tz = pd.Series(pd.date_range('2016', periods=12,

.....: tz='US/Central'))
.....:

In [272]: build_table_schema(s_tz)
Out[272]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• Periods are converted to timestamps before serialization, and so have the same behavior of being converted to
UTC. In addition, periods will contain an additional field freq with the period’s frequency, e.g. 'A-DEC'.

In [273]: s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',

.....: periods=4))
.....:

In [274]: build_table_schema(s_per)
Out[274]:
{'fields': [{'name': 'index', 'type': 'datetime', 'freq': 'A-DEC'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• Categoricals use the any type and an enum constraint listing the set of possible values. Additionally, an
ordered field is included:
In [275]: s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))

In [276]: build_table_schema(s_cat)
Out[276]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values',
'type': 'any',
'constraints': {'enum': ['a', 'b']},
'ordered': False}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}

• A primaryKey field, containing an array of labels, is included if the index is unique; it is omitted when the index contains duplicates, as here:

In [277]: s_dupe = pd.Series([1, 2], index=[1, 1])

In [278]: build_table_schema(s_dupe)
Out[278]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'pandas_version': '0.20.0'}

• The primaryKey behavior is the same with MultiIndexes, but in this case the primaryKey is an array:
In [279]: s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
.....: (0, 1)]))
.....:

In [280]: build_table_schema(s_multi)
Out[280]:
{'fields': [{'name': 'level_0', 'type': 'string'},
{'name': 'level_1', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': FrozenList(['level_0', 'level_1']),
'pandas_version': '0.20.0'}

• The default naming roughly follows these rules:

– For Series, the object.name is used. If that is None, then the name is values
– For DataFrames, the stringified version of the column name is used
– For Index (not MultiIndex), index.name is used, with a fallback to index if that is None.
– For MultiIndex, mi.names is used. If any level has no name, then level_<i> is used.
New in version 0.23.0.
read_json also accepts orient='table' as an argument. This allows for the preservation of metadata such as
dtypes and index names in a round-trippable manner.
In [281]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
.....: 'bar': ['a', 'b', 'c', 'd'],
.....: 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
.....: 'qux': pd.Categorical(['a', 'b', 'c', 'c'])
.....: }, index=pd.Index(range(4), name='idx'))
.....:

In [282]: df
Out[282]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c

In [283]: df.dtypes
Out[283]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object

In [284]: df.to_json('test.json', orient='table')

In [285]: new_df = pd.read_json('test.json', orient='table')

In [286]: new_df
Out[286]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c

In [287]: new_df.dtypes
Out[287]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object

Please note that the literal string ‘index’ as the name of an Index is not round-trippable, nor are any names beginning
with 'level_' within a MultiIndex. These are used by default in DataFrame.to_json() to indicate
missing values and the subsequent read cannot distinguish the intent.

In [288]: df.index.name = 'index'

In [289]: df.to_json('test.json', orient='table')

In [290]: new_df = pd.read_json('test.json', orient='table')

In [291]: print(new_df.index.name)
None

4.1.3 HTML

Reading HTML Content

Warning: We highly encourage you to read the HTML Table Parsing gotchas below regarding the issues surrounding
the BeautifulSoup4/html5lib/lxml parsers.

The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into a list of
pandas DataFrames. Let’s look at a few examples.

Note: read_html returns a list of DataFrame objects, even if there is only a single table contained in the
HTML content.
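The list return value is easy to see on a small inline document, which also avoids any network access. The HTML below is invented for illustration. Because read_html needs an HTML parser (lxml, or BeautifulSoup4 together with html5lib), the sketch catches ImportError for environments where none is installed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>bank</th><th>city</th></tr></thead>
  <tbody>
    <tr><td>First Example Bank</td><td>Springfield</td></tr>
    <tr><td>Second Example Bank</td><td>Shelbyville</td></tr>
  </tbody>
</table>
"""

try:
    tables = pd.read_html(StringIO(html))
except ImportError:
    # No HTML parser is installed in this environment
    tables = None

if tables is not None:
    print(len(tables))   # a list, even though there is a single table
    print(tables[0])
```

Indexing with [0] is therefore needed even when the page contains exactly one table.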

Read a URL with no options:

In [292]: url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'

In [293]: dfs = pd.read_html(url)

In [294]: dfs
Out[294]:
[ Bank Name ... Updated Date
0 Washington Federal Bank for Savings ... February 21, 2018
1 The Farmers and Merchants State Bank of Argonia ... February 21, 2018
2 Fayette County Bank ... January 29, 2019
3 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... ... March 22, 2018
4 First NBC Bank ... January 29, 2019
5 Proficio Bank ... January 29, 2019
6 Seaway Bank and Trust Company ... January 29, 2019
.. ... ... ...
548 Hamilton Bank, NA En Espanol ... September 21, 2015
549 Sinclair National Bank ... October 6, 2017
550 Superior Bank, FSB ... August 19, 2014
551 Malta National Bank ... November 18, 2002
552 First Alliance Bank & Trust Co. ... February 18, 2003
553 National State Bank of Metropolis ... March 17, 2005
554 Bank of Honolulu ... March 17, 2005

[555 rows x 7 columns]]

Note: The data from the above URL changes every Monday so the resulting data above and the data below may be
slightly different.

Read in the content of the file from the above URL and pass it to read_html as a string:
In [295]: with open(file_path, 'r') as f:
.....: dfs = pd.read_html(f.read())
.....:

In [296]: dfs
Out[296]:
[                                   Bank Name          City  ...       Closing Date       Updated Date
0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  ...       May 31, 2013       May 31, 2013
1                        Central Arizona Bank    Scottsdale  ...       May 14, 2013       May 20, 2013
2                                Sunrise Bank      Valdosta  ...       May 10, 2013       May 21, 2013
3                       Pisgah Community Bank     Asheville  ...       May 10, 2013       May 14, 2013
4                         Douglas County Bank  Douglasville  ...     April 26, 2013       May 16, 2013
5                                Parkway Bank        Lenoir  ...     April 26, 2013       May 17, 2013
6                      Chipola Community Bank      Marianna  ...     April 19, 2013       May 16, 2013
..                                        ...           ...  ...                ...                ...
498               Hamilton Bank, NAEn Espanol         Miami  ...   January 11, 2002       June 5, 2012
499                    Sinclair National Bank      Gravette  ...  September 7, 2001  February 10, 2004
500                        Superior Bank, FSB      Hinsdale  ...      July 27, 2001       June 5, 2012
501                       Malta National Bank         Malta  ...        May 3, 2001  November 18, 2002
502           First Alliance Bank & Trust Co.    Manchester  ...   February 2, 2001  February 18, 2003
503         National State Bank of Metropolis    Metropolis  ...  December 14, 2000     March 17, 2005
504                          Bank of Honolulu      Honolulu  ...   October 13, 2000     March 17, 2005

[505 rows x 7 columns]]

You can even pass in an instance of StringIO if you so desire:

In [297]: with open(file_path, 'r') as f:
.....: sio = StringIO(f.read())
.....:

In [298]: dfs = pd.read_html(sio)

In [299]: dfs
Out[299]:
[                                   Bank Name          City  ...       Closing Date       Updated Date
0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  ...       May 31, 2013       May 31, 2013
1                        Central Arizona Bank    Scottsdale  ...       May 14, 2013       May 20, 2013
2                                Sunrise Bank      Valdosta  ...       May 10, 2013       May 21, 2013
3                       Pisgah Community Bank     Asheville  ...       May 10, 2013       May 14, 2013
4                         Douglas County Bank  Douglasville  ...     April 26, 2013       May 16, 2013
5                                Parkway Bank        Lenoir  ...     April 26, 2013       May 17, 2013
6                      Chipola Community Bank      Marianna  ...     April 19, 2013       May 16, 2013
..                                        ...           ...  ...                ...                ...
498               Hamilton Bank, NAEn Espanol         Miami  ...   January 11, 2002       June 5, 2012
499                    Sinclair National Bank      Gravette  ...  September 7, 2001  February 10, 2004
500                        Superior Bank, FSB      Hinsdale  ...      July 27, 2001       June 5, 2012
501                       Malta National Bank         Malta  ...        May 3, 2001  November 18, 2002
502           First Alliance Bank & Trust Co.    Manchester  ...   February 2, 2001  February 18, 2003
503         National State Bank of Metropolis    Metropolis  ...  December 14, 2000     March 17, 2005
504                          Bank of Honolulu      Honolulu  ...   October 13, 2000     March 17, 2005

[505 rows x 7 columns]]

Note: The following examples are not run by the IPython evaluator due to the fact that having so many network-
accessing functions slows down the documentation build. If you spot an error or an example that doesn’t run, please
do not hesitate to report it over on pandas GitHub issues page.

Read a URL and match a table that contains specific text:

match = 'Metcalf Bank'

df_list = pd.read_html(url, match=match)

Specify a header row. By default, <th> or <td> elements located within a <thead> are used to form the column
index; if multiple rows are contained within <thead>, a MultiIndex is created. If specified, the header row is
taken from the data minus the parsed header elements (<th> elements).

dfs = pd.read_html(url, header=0)

Specify an index column:

dfs = pd.read_html(url, index_col=0)

Specify a number of rows to skip:

dfs = pd.read_html(url, skiprows=0)

Specify a number of rows to skip using a list (xrange (Python 2 only) works as well):

dfs = pd.read_html(url, skiprows=range(2))

Specify an HTML attribute:

dfs1 = pd.read_html(url, attrs={'id': 'table'})

dfs2 = pd.read_html(url, attrs={'class': 'sortable'})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True

Specify values that should be converted to NaN:

dfs = pd.read_html(url, na_values=['No Acquirer'])

New in version 0.19.

Specify whether to keep the default set of NaN values:

dfs = pd.read_html(url, keep_default_na=False)

New in version 0.19.

Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that
are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to
strings.

url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0,
converters={'MNC': str})

New in version 0.19.
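The effect of converters on leading zeros can be reproduced without a network request. The table below is invented for illustration and only borrows the MNC column name from the example above. As before, ImportError is caught in case no HTML parser is installed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <thead><tr><th>MCC</th><th>MNC</th></tr></thead>
  <tbody>
    <tr><td>276</td><td>01</td></tr>
    <tr><td>276</td><td>02</td></tr>
  </tbody>
</table>
"""

try:
    # Without converters, numeric-looking text is cast to a numeric dtype
    default = pd.read_html(StringIO(html))[0]
    # With a converter, the MNC cells are kept as text
    as_text = pd.read_html(StringIO(html), converters={"MNC": str})[0]
except ImportError:
    default = as_text = None

if as_text is not None:
    print(default["MNC"].tolist())
    print(as_text["MNC"].tolist())
```

The converter is applied per column, so the MCC column is still parsed as numbers.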

Use some combination of the above:

dfs = pd.read_html(url, match='Metcalf Bank', index_col=0)

Read in pandas to_html output (with some loss of floating point precision):

df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format='{0:.40g}'.format)
dfin = pd.read_html(s, index_col=0)
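
The round trip can be checked numerically. This sketch assumes the HTML string is wrapped in io.StringIO and takes the first element of the returned list; np.allclose is used because, as noted above, some floating point precision may be lost.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format='{0:.40g}'.format)

try:
    # read_html returns a list; the first element is the table written above.
    dfin = pd.read_html(io.StringIO(s), index_col=0)[0]
except ImportError:
    # A required HTML parser is not installed.
    dfin = None

if dfin is not None:
    # Compare values only: column labels come back as text from the header.
    print(np.allclose(df.to_numpy(), dfin.to_numpy()))
```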

The lxml backend will raise an error on a failed parse if that is the only parser you provide. If you only have a single
parser you can provide just a string, but it is considered good practice to pass a list with one string if, for example, the
function expects a sequence of strings. You may use:

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml'])

Or you could pass flavor='lxml' without a list:

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor='lxml')

However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'] then the parse will most
likely succeed. Note that as soon as a parse succeeds, the function will return.

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml', 'bs4'])
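
Before choosing a flavor it can help to check which parser packages are importable. The helper below is a hypothetical convenience, not part of pandas; it assumes, as described above, that the 'bs4' flavor needs both bs4 and html5lib.

```python
import importlib.util


def available_flavors():
    """Return the read_html flavors whose packages can be imported here."""
    def has(module_name):
        return importlib.util.find_spec(module_name) is not None

    flavors = []
    if has('lxml'):
        flavors.append('lxml')
    # The 'bs4' flavor relies on BeautifulSoup4 together with html5lib.
    if has('bs4') and has('html5lib'):
        flavors.append('bs4')
    return flavors


print(available_flavors())
```

The returned list could then be passed as the flavor argument when it is not empty.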

Writing to HTML files

DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an HTML
table. The function arguments are as in the method to_string described above.

Note: Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See
to_html() for the full set of options.

In [300]: df = pd.DataFrame(np.random.randn(2, 2))

In [301]: df
Out[301]:
          0         1
0 -0.184744  0.496971
1 -0.856240  1.857977

In [302]: print(df.to_html()) # raw html

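<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>
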
The columns argument will limit the columns shown:

In [303]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
</tr>
</tbody>
</table>

float_format takes a Python callable to control the precision of floating point values:

In [304]: print(df.to_html(float_format='{0:.10f}'.format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.1847438576</td>
<td>0.4969711327</td>
</tr>
<tr>
<th>1</th>
<td>-0.8562396763</td>
<td>1.8579766508</td>
</tr>
</tbody>
</table>
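
Any callable that takes a float and returns a string can be used, not only a bound str.format. As a small sketch, the hypothetical helper as_percent below renders fractions as percentages; the column name and values are made up.

```python
import pandas as pd


def as_percent(value):
    """Format a fraction such as 0.1234 as '12.3%'."""
    return '{:.1%}'.format(value)


shares = pd.DataFrame({'share': [0.1234, 0.5]})
html = shares.to_html(float_format=as_percent)
print(html)
```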

bold_rows will make the row labels bold by default, but you can turn that off:
In [305]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<td>1</td>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>

The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes
are appended to the existing 'dataframe' class.
In [306]: print(df.to_html(classes=['awesome_table_class', 'even_more_awesome_class']))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>

The render_links argument provides the ability to add hyperlinks to cells that contain URLs.
New in version 0.24.

In [307]: url_df = pd.DataFrame({
   .....:     'name': ['Python', 'Pandas'],
   .....:     'url': ['https://www.python.org/', 'http://pandas.pydata.org']})
   .....:

In [308]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Python</td>
<td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></td>
</tr>
<tr>
<th>1</th>
<td>Pandas</td>
<td><a href="http://pandas.pydata.org" target="_blank">http://pandas.pydata.org</a></td>
</tr>
</tbody>
</table>

Finally, the escape argument allows you to control whether the “<”, “>” and “&” characters are escaped in the resulting
HTML (by default it is True). To get the HTML without escaped characters, pass escape=False.

In [309]: df = pd.DataFrame({'a': list('&<>'), 'b': np.random.randn(3)})

Escaped:

In [310]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&amp;</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td>&lt;</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>&gt;</td>
<td>-0.400654</td>
</tr>
</tbody>
</table>

Not escaped:

In [311]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-0.400654</td>
</tr>
</tbody>
</table>

Note: Some browsers may not show a difference in the rendering of the previous two HTML tables.
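
Because the rendered tables can look identical, comparing the raw strings is a more reliable check. A minimal sketch, using a one-column frame with the same three special characters:

```python
import pandas as pd

special = pd.DataFrame({'a': list('&<>')})

escaped = special.to_html()            # escape=True is the default
raw = special.to_html(escape=False)

# Entities appear only in the escaped markup.
print('&amp;' in escaped, '&lt;' in escaped, '&gt;' in escaped)
print('&amp;' in raw)
```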

HTML Table Parsing Gotchas

There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas
io function read_html.
Issues with lxml
• Benefits
– lxml is very fast.
– lxml requires Cython to install correctly.
• Drawbacks
– lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup.
– In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this backend
will use html5lib if lxml fails to parse.
– It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will
still get a valid result (provided everything else is valid) even if lxml fails.
Issues with BeautifulSoup4 using lxml as a backend
• The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser backend.
Issues with BeautifulSoup4 using html5lib as a backend
• Benefits
– html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way
rather than just, e.g., dropping an element without notifying you.
– html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important
for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is
“correct”, since the process of fixing markup does not have a single definition.
– html5lib is pure Python and requires no additional build steps beyond its own installation.
• Drawbacks
– The biggest drawback to using html5lib is that it is slow as molasses. However, consider the fact that many
tables on the web are not big enough for the parsing algorithm runtime to matter. It is more likely that the
bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO (input-output).
For very large tables, this might not be true.

4.1.4 Excel files

The read_excel() method can read Excel 2003 (.xls) and Excel 2007+ (.xlsx) files using the xlrd Python
module. The to_excel() instance method is used for saving a DataFrame to Excel. Generally the semantics are
similar to working with csv data. See the cookbook for some advanced strategies.

Reading Excel Files

In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which sheet
to parse.

# Returns a DataFrame
pd.read_excel('path_to_file.xls', sheet_name='Sheet1')

ExcelFile class

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and
can be passed into read_excel. There will be a performance benefit for reading multiple sheets as the file is read
into memory only once.

xlsx = pd.ExcelFile('path_to_file.xls')
df = pd.read_excel(xlsx, 'Sheet1')

The ExcelFile class can also be used as a context manager.

with pd.ExcelFile('path_to_file.xls') as xls:
    df1 = pd.read_excel(xls, 'Sheet1')
    df2 = pd.read_excel(xls, 'Sheet2')

The sheet_names property will generate a list of the sheet names in the file.
The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile('path_to_file.xls') as xls:
    data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
                                   na_values=['NA'])
    data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)
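
The following end-to-end sketch writes a small two-sheet workbook to a temporary directory and reads it back with different options per sheet. The file name, the sheet contents and the helper round_trip are made up for illustration, and the try/except only guards against a missing Excel engine such as openpyxl.

```python
import os
import tempfile

import pandas as pd


def round_trip(path):
    """Write two sheets, then read them back with per-sheet options."""
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({'x': [1, 2]}).to_excel(
            writer, sheet_name='Sheet1', index=False)
        pd.DataFrame({'k': ['a', 'b'], 'y': [3, 4]}).to_excel(
            writer, sheet_name='Sheet2', index=False)

    data = {}
    with pd.ExcelFile(path) as xls:
        names = xls.sheet_names  # list of sheet names in the workbook
        data['Sheet1'] = pd.read_excel(xls, sheet_name='Sheet1', index_col=None)
        data['Sheet2'] = pd.read_excel(xls, sheet_name='Sheet2', index_col=0)
    return names, data


try:
    with tempfile.TemporaryDirectory() as tmp:
        names, data = round_trip(os.path.join(tmp, 'example.xlsx'))
except ImportError:
    # No suitable Excel reader/writer package is installed.
    names, data = None, None

print(names)
```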

Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to
read_excel with no loss in performance.

# using the ExcelFile class
data = {}
with pd.ExcelFile('path_to_file.xls') as xls:
    data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
                                   na_values=['NA'])
    data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=None,
                                   na_values=['NA'])

# equivalent using the read_excel function
data = pd.read_excel('path_to_file.xls', ['Sheet1', 'Sheet2'],
                     index_col=None, na_values=['NA'])

Specifying Sheets

Note: The second argument is sheet_name, not to be confused with ExcelFile.sheet_names.

Note: An ExcelFile’s attribute sheet_names provides access to a list of sheets.

• The argument sheet_name allows specifying the sheet or sheets to read.
• The default value for sheet_name is 0, indicating to read the first sheet.
• Pass a string to refer to the name of a particular sheet in the workbook.
• Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0.
• Pass a list of either strings or integers, to return a dictionary of specified sheets.
• Pass a None to return a dictionary of all available sheets.

# Returns a DataFrame
pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

Using the sheet index:

# Returns a DataFrame
pd.read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])

Using all default values:

# Returns a DataFrame
pd.read_excel('path_to_file.xls')

Using None to get all sheets:

# Returns a dictionary of DataFrames

pd.read_excel('path_to_file.xls', sheet_name=None)

Using a list to get multiple sheets:

# Returns the 1st and 4th sheet, as a dictionary of DataFrames.

pd.read_excel('path_to_file.xls', sheet_name=['Sheet1', 3])

read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of sheet
positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or string,
respectively.
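
The return types described above can be checked on a throwaway workbook. This is a sketch under stated assumptions: the three sheets and the helper read_variants are invented for the example, and an Excel engine must be installed for it to do anything.

```python
import os
import tempfile

import pandas as pd


def read_variants(path):
    """Write three small sheets, then read them back in three ways."""
    with pd.ExcelWriter(path) as writer:
        for i, name in enumerate(['Sheet1', 'Sheet2', 'Sheet3']):
            pd.DataFrame({'n': [i]}).to_excel(
                writer, sheet_name=name, index=False)

    first = pd.read_excel(path)                           # a single DataFrame
    some = pd.read_excel(path, sheet_name=['Sheet1', 2])  # dict with two entries
    everything = pd.read_excel(path, sheet_name=None)     # dict of all sheets
    return first, some, everything


try:
    with tempfile.TemporaryDirectory() as tmp:
        first, some, everything = read_variants(os.path.join(tmp, 'book.xlsx'))
except ImportError:
    # No suitable Excel reader/writer package is installed.
    first = some = everything = None

if first is not None:
    print(type(first).__name__)
    print(list(some))
    print(list(everything))
```

The dictionary keys mirror what was requested: a name stays a string and a position stays an integer.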

Reading a MultiIndex

read_excel can read a MultiIndex index, by passing a list of columns to index_col and a MultiIndex
column by passing a list of rows to header. If either the index or columns have serialized level names those will
be read in as well by specifying the rows/columns that make up the levels.
For example, to read in a MultiIndex index without names:

In [312]: df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]},
   .....:                   index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
   .....:

In [313]: df.to_excel('path_to_file.xlsx')

In [314]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])

In [315]: df
Out[315]:
     a  b
a c  1  5
  d  2  6
b c  3  7
  d  4  8

If the index has level names, they will be parsed as well, using the same parameters.

In [316]: df.index = df.index.set_names(['lvl1', 'lvl2'])

In [317]: df.to_excel('path_to_file.xlsx')

In [318]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])

In [319]: df
Out[319]:
           a  b
lvl1 lvl2
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col
and header:

In [320]: df.columns = pd.MultiIndex.from_product([['a'], ['b', 'd']],
   .....:                                         names=['c1', 'c2'])
   .....:

In [321]: df.to_excel('path_to_file.xlsx')

In [322]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1], header=[0, 1])

In [323]: df
Out[323]:
c1         a
c2         b  d
lvl1 lvl2
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

Parsing Specific Columns

It is often the case that users will insert columns to do temporary computations in Excel and you may not want to read
in those columns. read_excel takes a usecols keyword to allow you to specify a subset of columns to parse.
Deprecated since version 0.24.0.
Passing in an integer for usecols has been deprecated. Please pass in a list of ints from 0 to usecols inclusive
instead.
If usecols is an integer, then it is assumed to indicate the last column to be parsed.

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=2)

You can also specify a comma-delimited set of Excel columns and ranges as a string:

pd.read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')

If usecols is a list of integers, then it is assumed to be the file column indices to be parsed.

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])

Element order is ignored, so usecols=[0, 1] is the same as [1, 0].

New in version 0.24.
If usecols is a list of strings, it is assumed that each string corresponds to a column name provided either by the
user in names or inferred from the document header row(s). Those strings define which columns will be parsed:

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])

Element order is ignored, so usecols=['baz', 'joe'] is the same as ['joe', 'baz'].

New in version 0.24.
If usecols is callable, the callable function will be evaluated against the column names, returning names where the
callable function evaluates to True.

pd.read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
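
The usecols forms can be compared side by side on a small workbook. In this sketch the column names foo, bar and baz, the temporary file and the helper read_subsets are all made up; the integer form is left out because it is deprecated.

```python
import os
import tempfile

import pandas as pd


def read_subsets(path):
    """Write a three-column sheet and read it back with several usecols forms."""
    frame = pd.DataFrame({'foo': [1, 2], 'bar': [3, 4], 'baz': [5, 6]})
    frame.to_excel(path, index=False)

    by_letter = pd.read_excel(path, usecols='A,C')        # Excel column letters
    by_position = pd.read_excel(path, usecols=[0, 2])     # file column indices
    by_name = pd.read_excel(path, usecols=['foo', 'bar'])
    by_callable = pd.read_excel(
        path, usecols=lambda name: name.startswith('b'))  # keep names where True
    return by_letter, by_position, by_name, by_callable


try:
    with tempfile.TemporaryDirectory() as tmp:
        subsets = read_subsets(os.path.join(tmp, 'cols.xlsx'))
except ImportError:
    # No suitable Excel reader/writer package is installed.
    subsets = None

if subsets is not None:
    for frame in subsets:
        print(list(frame.columns))
```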

Parsing Dates

Datetime-like values are normally automatically converted to the appropriate dtype when reading the excel file. But
if you have a column of strings that look like dates (but are not actually formatted as dates in excel), you can use the
parse_dates keyword to parse those strings to datetimes:

pd.read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])
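
A short sketch of the difference, assuming a made-up sheet whose date_strings column is stored as text rather than as Excel dates. The helper read_dates and the sample values are illustrative only.

```python
import os
import tempfile

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def read_dates(path):
    """Write dates as plain text, then read with and without parse_dates."""
    frame = pd.DataFrame({'date_strings': ['2019-01-25', '2019-02-03'],
                          'value': [1, 2]})
    frame.to_excel(path, index=False)

    plain = pd.read_excel(path)
    parsed = pd.read_excel(path, parse_dates=['date_strings'])
    return plain, parsed


try:
    with tempfile.TemporaryDirectory() as tmp:
        plain, parsed = read_dates(os.path.join(tmp, 'dates.xlsx'))
except ImportError:
    # No suitable Excel reader/writer package is installed.
    plain = parsed = None

if parsed is not None:
    print(is_datetime64_any_dtype(plain['date_strings']))
    print(is_datetime64_any_dtype(parsed['date_strings']))
```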

Cell Converters

It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a column
to boolean:

pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})

This option handles missing values and treats exceptions in the converters as missing data. Transformations are
applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column
of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float.
You can manually mask missing data to recover integer dtype:

def cfun(x):
    return int(x) if x else -1

pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})

dtype Specifications

New in version 0.20.

As an alternative to converters, the type for an entire column can be specified using the dtype keyword, which takes a
dictionary mapping column names to types. To interpret data with no type inference, use the type str or object.

pd.read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})
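
As a sketch of the dtype keyword in action, the example below writes numeric cells into both columns and then asks for MyText to be read back as text. The sample values, the temporary file and the helper read_with_dtypes are assumptions made for the illustration.

```python
import os
import tempfile

import pandas as pd


def read_with_dtypes(path):
    """Write two numeric columns, then read one back as int64 and one as str."""
    frame = pd.DataFrame({'MyInts': [1, 2, 3], 'MyText': [10, 20, 30]})
    frame.to_excel(path, index=False)
    return pd.read_excel(path, dtype={'MyInts': 'int64', 'MyText': str})


try:
    with tempfile.TemporaryDirectory() as tmp:
        typed = read_with_dtypes(os.path.join(tmp, 'typed.xlsx'))
except ImportError:
    # No suitable Excel reader/writer package is installed.
    typed = None

if typed is not None:
    print(typed['MyInts'].dtype)
    print(typed['MyText'].tolist())
```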

Writing Excel Files

Writing Excel Files to Disk

To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments
are largely the same as to_csv described above, the first argument being the name of the excel file, and the optional
second argument the name of the sheet to which the DataFrame should be written. For example:

df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')

Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written using
xlsxwriter (if available) or openpyxl.
The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be placed
in the second row instead of the first. You can place it in the first row by setting the merge_cells option in
to_excel() to False:

df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False)

In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.

with pd.ExcelWriter('path_to_file.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')

Note: Wringing a little more performance out of read_excel. Internally, Excel stores all numeric data as floats.
Because this can produce unexpected behavior when reading in data, pandas defaults to trying to convert floats to
integers if it doesn’t lose information (1.0 --> 1). Yo