pandas: powerful Python data analysis toolkit
Release 0.25.2
CHAPTER ONE
WHAT'S NEW IN 0.25.2
These are the changes in pandas 0.25.2. See the release notes for a full changelog including other versions of pandas.
1.1 Bug fixes
1.1.1 Indexing
1.1.2 I/O
• Fix regression in notebook display where <th> tags were missing for DataFrame.index values
(GH28204).
• Regression in to_csv() where writing a Series or DataFrame indexed by an IntervalIndex would
incorrectly raise a TypeError (GH28210)
• Fix to_csv() with ExtensionArray with list-like values (GH28840).
1.1.3 Groupby/resample/rolling
1.2 Contributors
A total of 6 people contributed patches to this release. People with a “+” by their names contributed a
patch for the first time.
• Felix Divo +
• Jeremy Schendel
• Joris Van den Bossche
• MeeseeksMachine
• Tom Augspurger
• jbrockmendel
CHAPTER TWO
INSTALLATION
The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross platform
distribution for data analysis and scientific computing. This is the recommended installation method for
most users.
Instructions for installing from source, PyPI, ActivePython, various Linux distributions, or a development
version are also provided.
Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make up
the SciPy stack (IPython, NumPy, Matplotlib, …) is with Anaconda, a cross-platform (Linux, Mac OS X,
Windows) Python distribution for data analytics and scientific computing.
After running the installer, the user will have access to pandas and the rest of the SciPy stack without
needing to install anything else, and without needing to wait for any software to be compiled.
Installation instructions for Anaconda can be found here.
A full list of the packages available as part of the Anaconda distribution can be found here.
Another advantage to installing Anaconda is that you don’t need admin rights to install it. Anaconda can
install in the user’s home directory, which makes it trivial to delete Anaconda if you decide to later (just
delete that folder).
The previous section outlined how to get pandas installed as part of the Anaconda distribution. However
this approach means you will install well over one hundred packages and involves downloading the installer
which is a few hundred megabytes in size.
If you want more control over which packages are installed, or have limited internet bandwidth, then
installing pandas with Miniconda may be a better solution.
Conda is the package manager that the Anaconda distribution is built upon. It is a package manager that is
both cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination).
Miniconda allows you to create a minimal self contained Python installation, and then use the Conda
command to install additional packages.
First you will need Conda to be installed; downloading and running the Miniconda installer will do this for
you. The installer can be found here.
The next step is to create a new conda environment. A conda environment is like a virtualenv that allows
you to specify a specific version of Python and set of libraries. Run the following commands from a terminal
window:
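For example (the environment name below is arbitrary):

conda create -n name_of_my_env python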
This will create a minimal environment with only Python installed in it. To put yourself inside this
environment run:
activate name_of_my_env
The final step required is to install pandas. This can be done with the following command:
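conda install pandas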
If you need packages that are available to pip but not conda, then install pip, and then use pip to install
those packages:
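For example (package-name below is a placeholder for whatever package you need):

conda install pip
pip install package-name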
Installation instructions for ActivePython can be found here. Versions 2.7 and 3.5 include pandas.
The commands in this table will install pandas for Python 3 from your distribution. To install pandas for
Python 2, you may need to use the python-pandas package.
However, the packages in the Linux package managers are often a few versions behind, so to get the newest
version of pandas, it’s recommended to install using the pip or conda methods described above.
See the contributing guide for complete instructions on building from the git source tree. Further, see creating
a development environment if you wish to create a pandas development environment.
pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing.
To run it on your machine to verify that everything is working (and that you have all of the dependencies,
soft and hard, installed), make sure you have pytest >= 4.0.2 and Hypothesis >= 3.58, then run:
>>> pd.test()
running: pytest --skip-slow --skip-network C:\Users\TP\Anaconda3\envs\py36\lib\site-packages\pandas
..................................................................S......
........S................................................................
.........................................................................
2.4 Dependencies
• numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well as smart
chunking and caching to achieve large speedups. If installed, must be Version 2.6.2 or higher.
• bottleneck: for accelerating certain types of nan evaluations. bottleneck uses specialized cython
routines to achieve large speedups. If installed, must be Version 1.2.1 or higher.
Note: You are highly encouraged to install these libraries, as they provide speed improvements, especially
when working with large data sets.
pandas has many optional dependencies that are only used for specific methods. For example,
pandas.read_hdf() requires the pytables package. If the optional dependency is not installed, pandas will
raise an ImportError when the method requiring that dependency is called.
One of the following combinations of libraries is needed to use the top-level read_html() function:
Changed in version 0.23.0.
• BeautifulSoup4 and html5lib
• BeautifulSoup4 and lxml
• BeautifulSoup4 and html5lib and lxml
• Only lxml, although see HTML Table Parsing for reasons as to why you should probably not take this
approach.
Warning:
• if you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will
not work with only BeautifulSoup4 installed.
• You are highly encouraged to read HTML Table Parsing gotchas. It explains issues surrounding
the installation and usage of the above three libraries.
CHAPTER THREE
GETTING STARTED
3.1 Package overview
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working
with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building
block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis / manipulation tool available
in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data actually need not be labeled at all to
be placed into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle
the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For
R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on
top of NumPy and is intended to integrate well within a scientific computing environment with many other
3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point
data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional
objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the
user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you
in computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for
both aggregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures
into DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
The best way to think about the pandas data structures is as flexible containers for lower dimensional data.
For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be
able to insert and remove objects from these containers in a dictionary-like fashion.
Also, we would like sensible default behaviors for the common API functions which take into account the
typical orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-
dimensional data, a burden is placed on the user to consider the orientation of the data set when writing
functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters
for performance). In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a
particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the
amount of mental effort required to code up data transformations in downstream functions.
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows)
and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results
in more readable code:
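For example (df here stands for any DataFrame):

for col in df.columns:
    series = df[col]
    # do something with the column, which is available as a Series
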
All pandas data structures are value-mutable (the values they contain can be altered) but not always size-
mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a
DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched.
In general we like to favor immutability where sensible.
The first stop for pandas issues and ideas is the GitHub Issue Tracker. If you have a general question, pandas
community experts can answer through Stack Overflow.
3.1.4 Community
pandas is actively supported today by a community of like-minded individuals around the world who
contribute their valuable time and energy to help make open source pandas possible. Thanks to all of our
contributors.
If you’re interested in contributing, please visit the contributing guide.
pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as
a world-class open-source project, and makes it possible to donate to the project.
The governance process that pandas project has used informally since its inception in 2008 is formalized in
Project Governance documents. The documents clarify how decisions are made and how the various elements
of our community interact, including the relationship between open source collaborative development and
work that may be funded by for-profit or non-profit entities.
Wes McKinney is the Benevolent Dictator for Life (BDFL).
The list of the Core Team members and more detailed information can be found on the people’s page of the
governance repo.
The information about current institutional partners can be found on the pandas website.
3.1.8 License
Copyright (c) 2008-2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
3.2 10 minutes to pandas
This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in
the Cookbook.
Customarily, we import as follows:
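import numpy as np
import pandas as pd

Creating a Series by passing a list of values, letting pandas create a default integer index; a sketch consistent
with the display below (only the first entries of the output are shown here):

s = pd.Series([1, 3, 5, np.nan, 6, 8])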
In [4]: s
Out[4]:
0 1.0
1 3.0
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
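A sketch of the construction used here (the random values in the displays below will differ on each run):

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))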
In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [8]: df
Out[8]:
A B C D
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
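Creating a DataFrame by passing a dict of objects that can be converted to series-like; a sketch matching
the display that follows:

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})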
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically
enabled. Here’s a subset of the attributes that will be completed:
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of
the attributes have been truncated for brevity.
In [14]: df.tail(3)
Out[14]:
A B C D
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-05 0.578869 0.782500 0.045559 -0.490189
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
Display the index, columns:
In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an
expensive operation when your DataFrame has columns with different data types, which comes down to a
fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array,
while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(),
pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being
object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying
data.
In [17]: df.to_numpy()
Out[17]:
array([[ 1.36382983, 0.77377516, -2.2758147 , -0.73333194],
[ 1.77586181, -0.86722409, 1.21445884, 1.84623251],
[ 0.74724665, 0.1665589 , -0.69400832, -0.68641742],
[ 0.34770833, 0.06047343, -1.39385516, -1.24987716],
[ 0.57886924, 0.78250045, 0.04555881, -0.49018939],
[ 0.58749057, -0.67538866, -0.43329251, -0.58303716]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Note: DataFrame.to_numpy() does not include the index or column labels in the output.
In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.900168 0.040116 -0.589492 -0.316103
std 0.549803 0.698714 1.198627 1.091824
min 0.347708 -0.867224 -2.275815 -1.249877
25% 0.581025 -0.491423 -1.218893 -0.721603
50% 0.667369 0.113516 -0.563650 -0.634727
75% 1.209684 0.621971 -0.074154 -0.513401
max 1.775862 0.782500 1.214459 1.846233
In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 1.363830 1.775862 0.747247 0.347708 0.578869 0.587491
B 0.773775 -0.867224 0.166559 0.060473 0.782500 -0.675389
C -2.275815 1.214459 -0.694008 -1.393855 0.045559 -0.433293
D -0.733332 1.846233 -0.686417 -1.249877 -0.490189 -0.583037
Sorting by an axis:
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 -0.733332 -2.275815 0.773775 1.363830
2013-01-02 1.846233 1.214459 -0.867224 1.775862
2013-01-03 -0.686417 -0.694008 0.166559 0.747247
2013-01-04 -1.249877 -1.393855 0.060473 0.347708
2013-01-05 -0.490189 0.045559 0.782500 0.578869
2013-01-06 -0.583037 -0.433293 -0.675389 0.587491
Sorting by values:
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-01 1.363830 0.773775 -2.275815 -0.733332
2013-01-05 0.578869 0.782500 0.045559 -0.490189
3.2.3 Selection
Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in
handy for interactive work, for production code, we recommend the optimized pandas data access methods,
.at, .iat, .loc and .iloc.
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.
Getting
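A sketch of the basic getters (label slicing is shown next):

df['A']    # select a single column, yielding a Series (equivalent to df.A)
df[0:3]    # slice rows by position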
In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
2013-01-04 0.347708 0.060473 -1.393855 -1.249877
Selection by label
In [26]: df.loc[dates[0]]
Out[26]:
A 1.363830
B 0.773775
C -2.275815
D -0.733332
Name: 2013-01-01 00:00:00, dtype: float64
Selection by position
In [32]: df.iloc[3]
Out[32]:
A 0.347708
B 0.060473
C -1.393855
D -1.249877
Name: 2013-01-04 00:00:00, dtype: float64
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 1.775862 -0.867224 1.214459 1.846233
2013-01-03 0.747247 0.166559 -0.694008 -0.686417
In [37]: df.iloc[1, 1]
Out[37]: -0.8672240914412521
In [38]: df.iat[1, 1]
Out[38]: -0.8672240914412521
Boolean indexing
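Using a single column's values to select data, e.g. df[df.A > 0], and filtering with isin(); a sketch of how
the frame below (with its extra labels column E) is built and then filtered:

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2[df2['E'].isin(['two', 'four'])]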
In [43]: df2
Out[43]:
A B C D E
2013-01-01 1.363830 0.773775 -2.275815 -0.733332 one
2013-01-02 1.775862 -0.867224 1.214459 1.846233 one
2013-01-03 0.747247 0.166559 -0.694008 -0.686417 two
2013-01-04 0.347708 0.060473 -1.393855 -1.249877 three
2013-01-05 0.578869 0.782500 0.045559 -0.490189 four
2013-01-06 0.587491 -0.675389 -0.433293 -0.583037 three
Setting
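A sketch of the Series used for setting a new column (its index starts one day later than df's, so alignment
leaves the first row NaN):

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))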
In [46]: s1
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [47]: df['F'] = s1
In [49]: df.iat[0, 1] = 0
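The other setting steps reflected in the frame below, a sketch: setting a value by label, and assigning a whole
column with a NumPy array:

df.at[dates[0], 'A'] = 0
df.loc[:, 'D'] = np.array([5] * len(df))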
In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -2.275815 5 NaN
2013-01-02 1.775862 -0.867224 1.214459 5 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0
2013-01-04 0.347708 0.060473 -1.393855 5 3.0
2013-01-05 0.578869 0.782500 0.045559 5 4.0
2013-01-06 0.587491 -0.675389 -0.433293 5 5.0
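A where operation with setting; a sketch of how df2 below is produced:

df2 = df.copy()
df2[df2 > 0] = -df2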
In [54]: df2
Out[54]:
A B C D F
2013-01-01 0.000000 0.000000 -2.275815 -5 NaN
2013-01-02 -1.775862 -0.867224 -1.214459 -5 -1.0
2013-01-03 -0.747247 -0.166559 -0.694008 -5 -2.0
2013-01-04 -0.347708 -0.060473 -1.393855 -5 -3.0
2013-01-05 -0.578869 -0.782500 -0.045559 -5 -4.0
2013-01-06 -0.587491 -0.675389 -0.433293 -5 -5.0
pandas primarily uses the value np.nan to represent missing data. It is by default not included in
computations. See the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
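A sketch of the reindex used to build df1 below (keeping only the first four dates, adding a new column E,
and then setting part of it):

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1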
In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -2.275815 5 NaN 1.0
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0 NaN
2013-01-04 0.347708 0.060473 -1.393855 5 3.0 NaN
In [58]: df1.dropna(how='any')
Out[58]:
A B C D F E
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -2.275815 5 5.0 1.0
2013-01-02 1.775862 -0.867224 1.214459 5 1.0 1.0
2013-01-03 0.747247 0.166559 -0.694008 5 2.0 5.0
2013-01-04 0.347708 0.060473 -1.393855 5 3.0 5.0
In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
3.2.5 Operations
Stats
In [61]: df.mean()
Out[61]:
A 0.672863
B -0.088847
C -0.589492
D 5.000000
F 3.000000
dtype: float64
In [62]: df.mean(1)
Out[62]:
2013-01-01 0.681046
2013-01-02 1.624619
2013-01-03 1.443959
2013-01-04 1.402865
2013-01-05 2.081386
2013-01-06 1.895762
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas
automatically broadcasts along the specified dimension.
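The shifted Series displayed below can be built like this (a sketch, reusing the dates index from earlier;
shift(2) produces the leading NaNs):

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)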
In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
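The operation performed here subtracts this Series from each column of df, aligning on the index; a sketch
(output omitted since it depends on the random data above):

df.sub(s, axis='index')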
Apply
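Applying functions to the data, column by column; a sketch of two typical calls (outputs depend on the
random data above):

df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())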
Histogramming
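The Series summarized below can be built with, for example (values are random, so yours will differ):

s = pd.Series(np.random.randint(0, 7, size=10))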
In [69]: s
Out[69]:
0 3
1 2
2 4
3 5
4 1
5 1
6 3
7 2
8 4
9 0
dtype: int64
In [70]: s.value_counts()
Out[70]:
4 2
3 2
2 2
1 2
5 1
0 1
dtype: int64
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate
on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses
regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
3.2.6 Merge
Concat
pandas provides various facilities for easily combining together Series and DataFrame objects with various
kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type
operations.
See the Merging section.
Concatenating pandas objects together with concat():
In [74]: df
Out[74]:
0 1 2 3
0 2.215176 -1.315908 -0.317116 -0.842397
1 2.213106 -0.596732 0.409649 1.286960
2 1.006929 -1.120177 0.230672 0.537003
3 -2.502802 0.785754 -0.228391 0.063769
4 -0.469153 -0.620792 1.997251 -0.360238
5 1.116982 -3.165350 -1.410497 -0.466197
6 0.712124 0.581404 0.715763 1.044883
7 -2.015823 0.212701 0.841041 0.089602
8 -1.633435 0.535489 -1.146539 0.930138
9 -0.219756 -0.913111 -0.769806 -0.355672
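Here df can be any DataFrame, e.g. pd.DataFrame(np.random.randn(10, 4)); the pieces below are a sketch
of how it is broken up before being concatenated back together:

pieces = [df[:3], df[3:7], df[7:]]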
In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 2.215176 -1.315908 -0.317116 -0.842397
1 2.213106 -0.596732 0.409649 1.286960
2 1.006929 -1.120177 0.230672 0.537003
3 -2.502802 0.785754 -0.228391 0.063769
4 -0.469153 -0.620792 1.997251 -0.360238
5 1.116982 -3.165350 -1.410497 -0.466197
6 0.712124 0.581404 0.715763 1.044883
7 -2.015823 0.212701 0.841041 0.089602
8 -1.633435 0.535489 -1.146539 0.930138
9 -0.219756 -0.913111 -0.769806 -0.355672
Join
In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2
In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5
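An SQL-style merge on the shared key column; a sketch (with duplicate keys on both sides this produces the
cartesian combination of the matching rows):

pd.merge(left, right, on='key')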
In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2
In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5
Append
In [88]: df
Out[88]:
A B C D
0 0.292785 -0.036327 -0.303377 0.132313
1 -1.437443 -0.521953 0.693930 -0.556417
2 2.246185 -0.175934 1.702784 2.475101
3 0.040067 -0.987877 -0.741536 1.142627
4 -1.657556 -0.812830 0.587969 1.347117
5 0.433323 -0.045170 -0.694659 -0.954585
6 -1.258443 -0.380086 1.610889 0.841358
7 -0.660989 0.519592 -0.360741 -1.441424
In [89]: s = df.iloc[3]
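Appending that row to the frame; a sketch (ignore_index gives the appended row a fresh integer label):

df.append(s, ignore_index=True)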
3.2.7 Grouping
By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
See the Grouping section.
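A sketch of the frame used in this section (the numeric columns are random, so your values will differ):

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})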
In [92]: df
Out[92]:
A B C D
0 foo one -1.593566 1.125380
1 bar one 0.854244 1.315419
2 foo two 1.218171 1.511499
3 bar three -0.028648 0.726054
4 foo two 1.632905 0.664038
5 bar two -0.637469 -0.707051
6 foo one -0.512159 -1.576752
7 foo three 0.312766 0.831920
Grouping and then applying the sum() function to the resulting groups.
In [93]: df.groupby('A').sum()
Out[93]:
C D
A
bar 0.188127 1.334423
foo 1.058118 2.556085
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
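A sketch of the call, using the same df as above:

df.groupby(['A', 'B']).sum()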
3.2.8 Reshaping
Stack
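A sketch of how a MultiIndexed frame like df2 below can be built (df2 is the first four rows of such a frame;
the random values will differ):

tuples = list(zip(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                  ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]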
In [99]: df2
Out[99]:
A B
first second
bar one 1.374920 0.502088
two -0.304574 -1.021713
baz one 0.713245 0.315322
two 0.591517 0.543840
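The stack() method “compresses” a level in the DataFrame’s columns; stacked below is produced by:

stacked = df2.stack()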
In [101]: stacked
Out[101]:
first second
bar one A 1.374920
B 0.502088
two A -0.304574
B -1.021713
baz one A 0.713245
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack()
is unstack(), which by default unstacks the last level:
In [102]: stacked.unstack()
Out[102]:
A B
first second
bar one 1.374920 0.502088
two -0.304574 -1.021713
baz one 0.713245 0.315322
two 0.591517 0.543840
In [103]: stacked.unstack(1)
Out[103]:
second one two
first
bar A 1.374920 -0.304574
B 0.502088 -1.021713
baz A 0.713245 0.591517
B 0.315322 0.543840
In [104]: stacked.unstack(0)
Out[104]:
first bar baz
second
one A 1.374920 0.713245
B 0.502088 0.315322
two A -0.304574 0.591517
B -1.021713 0.543840
Pivot tables
In [106]: df
Out[106]:
A B C D E
0 one A foo -0.278531 0.004556
1 one B foo 0.940889 -2.481391
2 two C foo 1.256053 1.125308
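We can produce pivot tables from this data very easily; a sketch of the call (output depends on the random
values above):

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])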
pandas has simple, powerful, and efficient functionality for performing resampling operations during
frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in,
but not limited to, financial applications. See the Time Series section.
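A sketch of the secondly series being resampled below (values are random integers, so the sum will differ):

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)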
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 26405
Freq: 5T, dtype: int64
In [113]: ts
Out[113]:
2012-03-06 1.106290
2012-03-07 -1.917986
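The time-zone localization step producing ts_utc below; a sketch:

ts_utc = ts.tz_localize('UTC')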
In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 1.106290
2012-03-07 00:00:00+00:00 -1.917986
2012-03-08 00:00:00+00:00 1.099233
2012-03-09 00:00:00+00:00 -1.180562
2012-03-10 00:00:00+00:00 -0.140211
Freq: D, dtype: float64
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 1.106290
2012-03-06 19:00:00-05:00 -1.917986
2012-03-07 19:00:00-05:00 1.099233
2012-03-08 19:00:00-05:00 -1.180562
2012-03-09 19:00:00-05:00 -0.140211
Freq: D, dtype: float64
In [119]: ts
Out[119]:
2012-01-31 0.471911
2012-02-29 -1.370620
2012-03-31 -0.267019
2012-04-30 0.111729
2012-05-31 0.611299
Freq: M, dtype: float64
In [120]: ps = ts.to_period()
In [121]: ps
Out[121]:
2012-01 0.471911
2012-02 -1.370620
2012-03 -0.267019
2012-04 0.111729
2012-05 0.611299
Freq: M, dtype: float64
In [122]: ps.to_timestamp()
Out[122]:
2012-01-01 0.471911
2012-02-01 -1.370620
2012-03-01 -0.267019
2012-04-01 0.111729
2012-05-01 0.611299
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the
following example, we convert a quarterly frequency with year ending in November to 9am of the end of the
month following the quarter end:
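A sketch of the construction described (a period_range with a November year end, then retiming each period
to 9am of the month after the quarter end):

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9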
In [126]: ts.head()
Out[126]:
1990-03-01 09:00 -1.315293
1990-06-01 09:00 -0.487557
1990-09-01 09:00 -0.778802
1990-12-01 09:00 -0.221309
1991-03-01 09:00 0.174483
Freq: H, dtype: float64
3.2.10 Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the
API documentation.
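A sketch of the frame used in this example and of the conversion of the raw grades to a categorical dtype:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")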
In [129]: df["grade"]
Out[129]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Reorder the categories and simultaneously add the missing categories (methods under Series .cat return
a new Series by default).
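A sketch of the two steps performed here: first rename the categories to more meaningful names (assigning
to Series.cat.categories is in place), then reorder while adding the missing categories:

df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
                                              "good", "very good"])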
In [132]: df["grade"]
Out[132]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
In [133]: df.sort_values(by="grade")
Out[133]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
3.2.11 Plotting
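We use the standard convention for referencing the matplotlib API (plt is used below):

import matplotlib.pyplot as plt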
In [135]: ts = pd.Series(np.random.randn(1000),
.....: index=pd.date_range('1/1/2000', periods=1000))
.....:
In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x127495750>
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:
In [139]: df = df.cumsum()
In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>
In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x1277a0a90>
In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x1277a5310>
CSV
In [143]: df.to_csv('foo.csv')
In [144]: pd.read_csv('foo.csv')
Out[144]:
Unnamed: 0 A B C D
0 2000-01-01 0.714384 1.118691 0.214306 -0.209375
1 2000-01-02 1.099535 -0.094588 1.519586 1.103485
2 2000-01-03 2.212839 -0.425442 0.268937 1.276245
3 2000-01-04 2.695004 -2.043315 1.337847 1.970893
4 2000-01-05 2.669479 -2.612468 2.095337 2.351292
.. ... ... ... ... ...
995 2002-09-22 -32.014002 -33.223408 28.122550 55.528126
996 2002-09-23 -31.418475 -33.747331 27.812849 55.601151
HDF5
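Reading and writing to HDFStores; a sketch of the round trip (requires the pytables package):

df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')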
Excel
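Reading and writing to MS Excel; a sketch of the round trip:

df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])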
3.2.13 Gotchas
If you are attempting to perform an operation you might see an exception like:
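For example, code like the following (a sketch) raises a ValueError complaining that the truth value is
ambiguous:

if pd.Series([False, True, False]):
    print("I was true")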
3.3 Essential basic functionality
Here we discuss a lot of the essential functionality common to the pandas data structures. Here’s how to
create some of the objects used in the examples from the previous section:
To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default
number of elements to display is five, but you may pass a custom number.
In [4]: long_series = pd.Series(np.random.randn(1000))
In [5]: long_series.head()
Out[5]:
0 -1.353625
1 0.446611
2 1.600544
3 -0.602700
4 0.096528
dtype: float64
In [6]: long_series.tail(3)
Out[6]:
997 -0.933084
998 -1.895481
999 -1.109266
dtype: float64
pandas objects have a number of attributes enabling you to access the metadata:
• shape: gives the axis dimensions of the object, consistent with ndarray
• Axis labels
– Series: index (only axis)
– DataFrame: index (rows) and columns
Note, these attributes can be safely assigned to!
In [7]: df[:2]
Out[7]:
A B C
2000-01-01 0.717501 -0.936966 0.634588
2000-01-02 -0.712106 -0.600773 0.676949
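Between the two displays the column labels are reassigned; a sketch of the step:

df.columns = [x.lower() for x in df.columns]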
In [9]: df
Out[9]:
a b c
2000-01-01 0.717501 -0.936966 0.634588
2000-01-02 -0.712106 -0.600773 0.676949
2000-01-03 1.578377 -0.108295 0.317850
2000-01-04 1.020407 -0.220174 0.719467
2000-01-05 0.423406 0.742788 -0.047665
2000-01-06 1.516605 -0.287212 -0.086036
2000-01-07 0.411242 0.988457 -1.187729
2000-01-08 -0.053553 0.930622 1.334235
Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual
data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However,
pandas and 3rd party libraries may extend NumPy’s type system to add support for custom arrays (see
dtypes).
To get the actual data inside an Index or Series, use the .array property:
In [10]: s.array
Out[10]:
<PandasArray>
[ -1.245150154666457, -1.1212737834640922, 0.14371758135167426,
1.6052453659926817, 0.49430022914548555]
Length: 5, dtype: float64
In [11]: s.index.array
Out[11]:
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object
array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas
uses them is a bit beyond the scope of this introduction. See dtypes for more.
If you know you need a NumPy array, use to_numpy() or numpy.asarray().
In [12]: s.to_numpy()
Out[12]: array([-1.24515015, -1.12127378, 0.14371758, 1.60524537, 0.49430023])
In [13]: np.asarray(s)
Out[13]: array([-1.24515015, -1.12127378, 0.14371758, 1.60524537, 0.49430023])
When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and
coercing values. See dtypes for more.
to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider
datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are
two possibly useful representations:
1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
2. A datetime64[ns] -dtype numpy.ndarray, where the values have been converted to UTC and the
timezone discarded
Timezones may be preserved with dtype=object
In [15]: ser.to_numpy(dtype=object)
Out[15]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
dtype=object)
In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
dtype='datetime64[ns]')
Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has
a single data type for all the columns, DataFrame.to_numpy() will return the underlying data:
In [17]: df.to_numpy()
Out[17]:
array([[ 0.71750128, -0.93696585, 0.63458795],
[-0.71210567, -0.60077296, 0.67694886],
[ 1.57837673, -0.10829478, 0.31785043],
[ 1.02040735, -0.22017439, 0.71946677],
[ 0.42340608, 0.74278775, -0.04766533],
[ 1.51660469, -0.28721192, -0.08603582],
If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and
the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s
columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis
labels, cannot be assigned to.
Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to
accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype.
If there are only floats and integers, the resulting array will be of float dtype.
In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a
Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we
recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:
1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy
array or the extension array. Series.array will always return an ExtensionArray, and will never
copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying
/ coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data
and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(),
being a method, makes it clearer that the returned NumPy array may not be a view on the same data
in the DataFrame.
pandas has support for accelerating certain types of binary numerical and boolean operations using the
numexpr and bottleneck libraries.
These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr
uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that
are especially fast when dealing with arrays that have nans.
Here is a sample (using 100 column x 100,000 row DataFrames):
You are highly encouraged to install both libraries. See the section Recommended Dependencies for more
installation info.
These are both enabled to be used by default; you can control this by setting the options:
New in version 0.20.0.
pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
With binary operations between pandas data structures, there are two key points of interest:
• Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
• Missing data in computations.
We will demonstrate how to manage these issues independently, though they can be handled simultaneously.
DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), … for
carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these
functions, you can either match on the index or columns via the axis keyword:
In [18]: df = pd.DataFrame({
....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
....:
In [19]: df
Out[19]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
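A sketch of the axis-matching calls described above (row and column are simply one row and one column
pulled out of df):

row = df.iloc[1]
column = df['two']

df.sub(row, axis='columns')   # match on the columns, broadcast down the rows
df.sub(column, axis='index')  # match on the index, broadcast across the columns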
Series and Index also support the divmod() builtin. This function performs the floor division and modulo
operation at the same time, returning a two-tuple of the same type as the left-hand side. For example:
In [29]: s = pd.Series(np.arange(10))
In [30]: s
Out[30]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
In [32]: div
Out[32]:
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
dtype: int64
In [33]: rem
Out[33]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
9 0
dtype: int64
In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')
We can also do elementwise divmod():
In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
In [40]: div
Out[40]:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
In [41]: rem
Out[41]:
0 0
1 1
2 2
3 0
4 0
5 1
6 1
7 2
8 2
9 3
dtype: int64
In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value
to substitute when at most one of the values at a location is missing. For example, when adding two
DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in
which case the result will be NaN (you can later replace NaN with some other value using fillna if you
wish).
In [42]: df
Out[42]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
In [43]: df2
Out[43]:
one two three
a -0.466716 0.029137 1.000000
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
In [44]: df + df2
Out[44]:
one two three
a -0.933433 0.058274 NaN
b -1.931397 3.935247 -0.769386
c 1.089011 1.224997 1.132652
d NaN -0.383116 0.213640
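The fill_value variant of the same addition; a sketch (positions missing in both frames stay NaN):

df.add(df2, fill_value=0)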
Flexible comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is
analogous to the binary arithmetic operations described above:
In [46]: df.gt(df2)
Out[46]:
one two three
a False False False
b False False False
c False False False
d False False False
In [47]: df2.ne(df)
Out[47]:
one two three
a False False True
b False False False
c False False False
d True False False
These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool.
These boolean objects can be used in indexing operations, see the section on Boolean indexing.
Boolean reductions
You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean
result.
In [48]: (df > 0).all()
Out[48]:
one False
two False
three False
dtype: bool
You can test if a pandas object is empty, via the empty property.
In [51]: df.empty
Out[51]: False
In [52]: pd.DataFrame(columns=list('ABC')).empty
Out[52]: True
To evaluate single-element pandas objects in a boolean context, use the method bool():
In [53]: pd.Series([True]).bool()
Out[53]: True
In [54]: pd.Series([False]).bool()
Out[54]: False
In [55]: pd.DataFrame([[True]]).bool()
Out[55]: True
In [56]: pd.DataFrame([[False]]).bool()
Out[56]: False
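Warning: You might be tempted to do the following, which will raise an error (a sketch of the first of the
two problematic patterns):

>>> if df:
...     pass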
Or
>>> df and df2
These will both raise errors, as you are trying to compare multiple values:
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
Often you may find that there is more than one way to compute the same result. As a simple example,
consider df + df and df * 2. To test that these two computations produce the same result, given the tools
shown above, you might imagine using (df + df == df * 2).all(). But in fact, this expression is False:
In [57]: df + df == df * 2
Out[57]:
one two three
a True True False
b True True True
c True True True
d False True True
So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs
in corresponding locations treated as equal.
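For the example above, a sketch:

(df + df).equals(df * 2)   # True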
Note that the Series or DataFrame index needs to be in the same order for equality to be True:
In [61]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})
In [63]: df1.equals(df2)
Out[63]: False
In [64]: df1.equals(df2.sort_index())
Out[64]: True
You can conveniently perform element-wise comparisons when comparing a pandas data structure with a
scalar value:
In [65]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[65]:
0 True
1 False
2 False
dtype: bool
Note that this is different from the NumPy behavior where a comparison can be broadcast:
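A sketch of the NumPy behavior being contrasted (element-wise broadcasting when the shapes are
compatible):

np.array([1, 2, 3]) == np.array([2])     # array([False,  True, False])
np.array([1, 2, 3]) == np.array([1, 2])  # shapes do not broadcast: comparison fails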
A problem occasionally arising is the combination of two similar data sets where values in one are preferred
over the other. An example would be two data series representing a particular economic indicator where one is
considered to be of “higher quality”. However, the lower quality series might extend further back in history
or have more complete data coverage. As such, we would like to combine two DataFrame objects where
missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame.
The function implementing this operation is combine_first(), which we illustrate:
In [71]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
....: 'B': [np.nan, 2., 3., np.nan, 6.]})
....:
In [73]: df1
Out[73]:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0
In [74]: df2
Out[74]:
A B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0
In [75]: df1.combine_first(df2)
Out[75]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
The combine_first() method above calls the more general DataFrame.combine(). This method takes
another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner
function pairs of Series (i.e., columns whose names are the same).
So, for instance, to reproduce combine_first() as above:
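A sketch using a simple combiner that prefers values from the calling frame and falls back to the other:

def combiner(x, y):
    return np.where(pd.isna(x), y, x)

df1.combine(df2, combiner)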
There exists a large number of methods for computing descriptive statistics and other related operations on
Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(),
mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same
size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the
axis can be specified by name or integer:
• Series: no axis argument needed
• DataFrame: “index” (axis=0, default), “columns” (axis=1)
For example:
In [77]: df
Out[77]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
In [78]: df.mean(0)
Out[78]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64
In [79]: df.mean(1)
Out[79]:
a -0.218790
b 0.205744
c 0.574443
d -0.042369
dtype: float64
All such methods have a skipna option signaling whether to exclude missing data (True by default):
In [80]: df.sum(0, skipna=False)
Out[80]:
one NaN
two 2.417701
three NaN
dtype: float64
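Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like
standardization (rendering data zero mean and standard deviation 1), very concisely. A sketch of the two
standardized objects used below:

ts_stand = (df - df.mean()) / df.std()
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)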
In [83]: ts_stand.std()
Out[83]:
one 1.0
two 1.0
three 1.0
dtype: float64
In [85]: xs_stand.std(1)
Out[85]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64
Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat
different from expanding() and rolling(). For more details please see this note.
In [86]: df.cumsum()
Out[86]:
one two three
a -0.466716 0.029137 NaN
b -1.432415 1.996760 -0.384693
c -0.887909 2.609259 0.181633
d NaN 2.417701 0.288453
Here is a quick reference summary table of common functions. Each also takes an optional level parameter
which applies only if the object has a hierarchical index.
Function Description
count Number of non-NA observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum
Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by
default:
In [87]: np.mean(df['one'])
Out[87]: -0.29596980138380163
In [88]: np.mean(df['one'].to_numpy())
Out[88]: nan
Series.nunique() will return the number of unique non-NA values in a Series:
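A sketch of the Series being summarized here (mostly NaN, with a block of constant values inserted in the
next step):

series = pd.Series(np.random.randn(500))
series[20:500] = np.nan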
In [91]: series[10:20] = 5
In [92]: series.nunique()
Out[92]: 11
There is a convenient describe() function which computes a variety of summary statistics about a Series
or the columns of a DataFrame (excluding NAs of course):
In [95]: series.describe()
In [98]: frame.describe()
Out[98]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean -0.051793 0.014006 -0.088499 -0.065968 0.099254
std 0.946751 0.953261 1.012176 1.021467 0.939771
min -2.778772 -2.932532 -3.194228 -2.490294 -2.640602
25% -0.701138 -0.632182 -0.716483 -0.787017 -0.546624
50% -0.075766 0.010519 -0.061189 -0.076329 0.095151
75% 0.620752 0.681123 0.642695 0.673043 0.743923
max 2.470484 2.415173 2.952825 2.850750 3.220515
In [101]: s.describe()
Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only
numerical columns or, if none are, only categorical columns:
In [103]: frame.describe()
Out[103]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
This behavior can be controlled by providing a list of types as include/exclude arguments. The special
value all can also be used:
In [104]: frame.describe(include=['object'])
Out[104]:
a
count 4
unique 2
top Yes
freq 2
In [105]: frame.describe(include=['number'])
Out[105]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
In [106]: frame.describe(include='all')
Out[106]:
a b
count 4 4.000000
unique 2 NaN
top Yes NaN
freq 2 NaN
mean NaN 1.500000
std NaN 1.290994
min NaN 0.000000
25% NaN 0.750000
50% NaN 1.500000
75% NaN 2.250000
max NaN 3.000000
That feature relies on select_dtypes. Refer to there for details about accepted inputs.
The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum
and maximum corresponding values:
In [107]: s1 = pd.Series(np.random.randn(5))
In [108]: s1
Out[108]:
0 -1.882481
1 0.007554
2 0.851716
3 -0.395161
4 -0.304605
dtype: float64
In [111]: df1
Out[111]:
A B C
0 0.029120 -1.109406 -0.690599
1 -1.083192 0.419577 -1.165015
2 1.112228 0.169718 -1.251593
3 -0.313268 -0.517805 -0.736270
4 -0.354983 0.601455 0.452575
In [112]: df1.idxmin(axis=0)
Out[112]:
A 1
B 0
C 2
dtype: int64
In [113]: df1.idxmax(axis=1)
Out[113]:
0 A
1 B
2 A
3 A
4 B
dtype: object
When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and
idxmax() return the first matching index:
In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
In [115]: df3
Out[115]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN
In [116]: df3['A'].idxmin()
Out[116]: 'd'
Note: idxmin and idxmax are called argmin and argmax in NumPy.
The value_counts() Series method and top-level function computes a histogram of a 1D array of values.
It can also be used as a function on regular arrays:
In [117]: data = np.random.randint(0, 7, size=50)
In [118]: data
Out[118]:
array([1, 1, 0, 4, 1, 3, 4, 5, 2, 3, 0, 6, 5, 1, 5, 4, 5, 2, 6, 3, 6, 6,
1, 2, 0, 3, 3, 3, 4, 3, 2, 4, 4, 0, 3, 2, 5, 3, 5, 5, 6, 5, 1, 3,
6, 5, 0, 6, 1, 1])
In [119]: s = pd.Series(data)
In [120]: s.value_counts()
Out[120]:
3 10
5 9
1 8
6 7
4 6
2 5
0 5
dtype: int64
In [121]: pd.value_counts(data)
Out[121]:
3 10
5 9
1 8
6 7
4 6
2 5
0 5
dtype: int64
Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or
DataFrame:
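A sketch of the objects summarized below (the DataFrame columns are random integers, so your modes will
differ):

s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
df5 = pd.DataFrame({'A': np.random.randint(0, 7, size=50),
                    'B': np.random.randint(-10, 15, size=50)})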
In [123]: s5.mode()
Out[123]:
0 3
1 7
dtype: int64
In [125]: df5.mode()
Out[125]:
A B
0 1 1
1 4 2
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on
sample quantiles) functions:
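A sketch of the cut() call producing the factor below (four equal-width bins over 20 random values):

arr = np.random.randn(20)
factor = pd.cut(arr, 4)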
In [128]: factor
Out[128]:
[(0.631, 1.318], (0.631, 1.318], (0.631, 1.318], (-0.0556, 0.631], (-0.742, -0.0556], ..., (-0.742, -0.0556], (-0.742, -0.0556], (-0.742, -0.0556], (-0.0556, 0.631], (0.631, 1.318]]
Length: 20
Categories (4, interval[float64]): [(-1.432, -0.742] < (-0.742, -0.0556] < (-0.0556, 0.631] < (0.631, 1.318]]
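The second factor below uses explicit bin edges instead; a sketch:

factor = pd.cut(arr, [-5, -1, 0, 1, 5])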
In [130]: factor
Out[130]:
[(0, 1], (0, 1], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (0, 1], (1, 5]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
qcut() computes sample quantiles. For example, we could slice up some normally distributed data into
equal-size quartiles like so:
In [131]: arr = np.random.randn(30)
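The qcut() call producing the quartile bins below; a sketch:

factor = pd.qcut(arr, [0, .25, .5, .75, 1])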
In [133]: factor
Out[133]:
[(-0.69, 0.0118], (0.874, 2.622], (0.0118, 0.874], (-0.69, 0.0118], (0.874, 2.622], ..., (-1.7619999999999998, -0.69], (0.874, 2.622], (0.0118, 0.874], (0.874, 2.622], (0.0118, 0.874]]
Length: 30
Categories (4, interval[float64]): [(-1.7619999999999998, -0.69] < (-0.69, 0.0118] < (0.0118, 0.874] < (0.874, 2.622]]
In [134]: pd.value_counts(factor)
Out[134]:
(0.874, 2.622] 8
(-1.7619999999999998, -0.69] 8
(0.0118, 0.874] 7
(-0.69, 0.0118] 7
dtype: int64
We can also pass infinite values to define the bins:
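A sketch:

factor = pd.cut(arr, [-np.inf, 0, np.inf])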
In [137]: factor
Out[137]:
[(-inf, 0.0], (0.0, inf], (-inf, 0.0], (-inf, 0.0], (0.0, inf], ..., (0.0, inf], (0.0, inf], (0.0, inf], (-inf, 0.0], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
To apply your own or another library’s functions to pandas objects, you should be aware of the three methods
below. The appropriate method to use depends on whether your function expects to operate on an entire
DataFrame or Series, row- or column-wise, or elementwise.
1. Tablewise Function Application: pipe()
2. Row or Column-wise Function Application: apply()
3. Aggregation API : agg() and transform()
4. Applying Elementwise Functions: applymap()
DataFrames and Series can of course just be passed into functions. However, if the function needs to be
called in a chain, consider using the pipe() method. Compare the following:
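>>> # f, g, and h are functions taking and returning DataFrames
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent written using pipe():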
>>> (df.pipe(h)
... .pipe(g, arg1=1)
... .pipe(f, arg2=2, arg3=3))
Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your
own or another library’s functions in method chains, alongside pandas’ methods.
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument.
What if the function you wish to apply takes its data as, say, the second argument? In this case, provide
pipe with a tuple of (callable, data_keyword). .pipe will route the DataFrame to the argument specified
in the tuple.
For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame
as the second argument, data. We pass in the function, keyword pair (sm.ols, 'data') to pipe:
In [138]: import statsmodels.formula.api as sm
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
The pipe method is inspired by unix pipes and more recently dplyr and magrittr, which have introduced the
popular (%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at
home in Python. We encourage you to view the source code of pipe().
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like
the descriptive statistics methods, takes an optional axis argument:
In [141]: df.apply(np.mean)
Out[141]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64
In [144]: df.apply(np.cumsum)
Out[144]:
one two three
a -0.466716 0.029137 NaN
In [145]: df.apply(np.exp)
Out[145]:
one two three
a 0.627058 1.029566 NaN
b 0.380717 7.153656 0.680659
c 1.723756 1.845035 1.761782
d NaN 0.825672 1.112734
The apply() method will also dispatch on a string method name.
In [146]: df.apply('mean')
Out[146]:
one -0.295970
two 0.604425
three 0.096151
dtype: float64
You may also pass additional arguments and keyword arguments to the apply() method. For instance,
consider the following function you would like to apply:
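A sketch of such a function and of how extra positional and keyword arguments are forwarded by apply():

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

df.apply(subtract_and_divide, args=(5,), divide=3)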
Another useful feature is the ability to pass Series methods to carry out some Series operation on each
column or row:
In [150]: tsdf
Out[150]:
A B C
2000-01-01 0.752721 -0.364275 0.134481
2000-01-02 0.826213 1.208276 -0.154643
2000-01-03 0.041831 0.479876 -0.956380
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -1.549128 1.145240 0.320799
2000-01-09 0.507849 -1.142744 -0.155781
2000-01-10 -0.169863 1.813758 0.771396
In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]:
A B C
2000-01-01 0.752721 -0.364275 0.134481
2000-01-02 0.826213 1.208276 -0.154643
2000-01-03 0.041831 0.479876 -0.956380
2000-01-04 -0.276361 0.612949 -0.700945
2000-01-05 -0.594552 0.746021 -0.445509
2000-01-06 -0.912744 0.879094 -0.190073
2000-01-07 -1.230936 1.012167 0.065363
2000-01-08 -1.549128 1.145240 0.320799
2000-01-09 0.507849 -1.142744 -0.155781
2000-01-10 -0.169863 1.813758 0.771396
Finally, apply() takes an argument raw which is False by default, which converts each row or column into
a Series before applying the function. When set to True, the passed function will instead receive an ndarray
object, which has positive performance implications if you do not need the indexing functionality.
Aggregation API
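The frame used throughout this section can be built as follows (a sketch; the random values will differ, and
the middle block of rows is set to NaN):

tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
                    index=pd.date_range('1/1/2000', periods=10))
tsdf.iloc[3:7] = np.nan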
In [154]: tsdf
Out[154]:
A B C
2000-01-01 -0.096512 0.747788 0.465973
2000-01-02 1.596492 -1.093022 -0.571837
2000-01-03 0.090216 -1.615127 -0.025061
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -0.051882 -0.369806 -1.363744
2000-01-09 2.243620 -1.063587 0.448386
2000-01-10 0.096756 0.781555 -1.912402
Using a single function is equivalent to apply(). You can also pass named methods as strings. These will
return a Series of the aggregated output:
In [155]: tsdf.agg(np.sum)
Out[155]:
A 3.878691
B -2.612201
C -2.958685
dtype: float64
In [156]: tsdf.agg('sum')
Out[156]:
A 3.878691
B -2.612201
C -2.958685
dtype: float64
In [158]: tsdf.A.agg('sum')
Out[158]: 3.8786905103802636
You can pass multiple aggregation arguments as a list. The results of each of the passed functions will be a
row in the resulting DataFrame. These are naturally named from the aggregation function.
In [159]: tsdf.agg(['sum'])
Out[159]:
A B C
sum 3.878691 -2.612201 -2.958685
Passing a named function will yield that name for the row:
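A sketch (mymean is a user-defined helper shown only for illustration):

def mymean(x):
    return x.mean()

tsdf.A.agg(['sum', mymean])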
Passing a dictionary of column names to a scalar or a list of scalars to DataFrame.agg allows you to
customize which functions are applied to which columns. Note that the results are not in any particular
order; you can use an OrderedDict instead to guarantee ordering.
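A sketch:

tsdf.agg({'A': 'mean', 'B': 'sum'})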
Passing a list-like will generate a DataFrame output. You will get a matrix-like output of all of the aggregators.
The output will consist of all unique functions. Those that are not noted for a particular column will be
NaN:
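A sketch:

tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})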
Mixed dtypes
When presented with mixed dtypes that cannot aggregate, .agg will only take the valid aggregations. This
is similar to how groupby .agg works.
In [168]: mdf.dtypes
Out[168]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
Custom describe
With .agg() it is possible to easily create a custom describe function, similar to the built-in describe function.
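One possible sketch, assuming the quantile helpers below (the helper names and percentiles are illustrative):

from functools import partial

q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = '25%'
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = '75%'

tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])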
Transform API
In [178]: tsdf
Out[178]:
A B C
2000-01-01 -0.646172 0.316546 0.442399
2000-01-02 -0.656399 -2.103519 0.373859
2000-01-03 0.479981 -0.047075 -0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 -1.402090 -0.915210 0.541518
Transform the entire frame. .transform() allows input functions as: a NumPy function, a string function name or a user-defined function.
In [179]: tsdf.transform(np.abs)
Out[179]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518
In [180]: tsdf.transform('abs')
Out[180]:
A B C
2000-01-01 0.646172 0.316546 0.442399
2000-01-02 0.656399 2.103519 0.373859
2000-01-03 0.479981 0.047075 0.002847
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 1.029689 0.548916 0.960356
2000-01-09 0.800346 0.604359 0.324512
2000-01-10 1.402090 0.915210 0.541518
Passing a single function to .transform() with a Series will yield a single Series in return.
In [183]: tsdf.A.transform(np.abs)
Out[183]:
2000-01-01 0.646172
2000-01-02 0.656399
2000-01-03 0.479981
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 1.029689
2000-01-09 0.800346
2000-01-10 1.402090
Freq: D, Name: A, dtype: float64
Passing multiple functions will yield a column MultiIndexed DataFrame. The first level will be the original
frame column names; the second level will be the names of the transforming functions.
Passing multiple functions to a Series will yield a DataFrame. The resulting column names will be the
transforming functions.
Passing a dict of lists will generate a MultiIndexed DataFrame with these selective transforms.
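A condensed sketch of these three forms, using the tsdf frame above (the particular functions are illustrative):

tsdf.transform([np.abs, lambda x: x + 1])                     # column MultiIndex: (original column, function)
tsdf.A.transform([np.abs, lambda x: x + 1])                   # Series input, multiple functions: DataFrame output
tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']}) # dict of lists: MultiIndexed selective transforms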
Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the
methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a
single value and returning a single value. For example:
In [188]: df4
Out[188]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
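The elementwise function f used below is not shown above; judging from the integer outputs (which are string lengths), it is presumably along these lines:

def f(x):
    # length of the string representation of each element
    return len(str(x))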
In [190]: df4['one'].map(f)
Out[190]:
a 19
b 19
c 17
d 3
Name: one, dtype: int64
In [191]: df4.applymap(f)
Out[191]:
one two three
a 19 20 3
b 19 18 19
c 17 17 18
d 3 20 18
Series.map() has an additional feature; it can be used to easily “link” or “map” values defined by a
secondary series. This is closely related to merging/joining functionality:
In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
.....: index=['a', 'b', 'c', 'd', 'e'])
.....:
In [194]: s
Out[194]:
a six
b seven
c six
d seven
e six
dtype: object
In [195]: s.map(t)
Out[195]:
a 6.0
b 7.0
c 6.0
d 7.0
e 6.0
dtype: float64
reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other
features relying on label-alignment functionality. To reindex means to conform the data to match a given
set of labels along a particular axis. This accomplishes several things:
• Reorders the existing data to match a new set of labels
• Inserts missing value (NA) markers in label locations where no data for that label existed
• If specified, fills data for missing labels using logic (highly relevant to working with time series data)
In [197]: s
Out[197]:
a 1.910501
b -0.247105
c -1.134167
d -1.591918
e 0.243080
dtype: float64
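A sketch of the reindex call itself (the labels are illustrative; 'f' is not in the original index and therefore comes back as NaN):

s.reindex(['e', 'b', 'f', 'd'])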
Note that the Index objects containing the actual axis labels can be shared between objects. So if we have
a Series and a DataFrame, the following can be done:
In [202]: rs = s.reindex(df.index)
In [203]: rs
Out[203]:
a 1.910501
b -0.247105
c -1.134167
d -1.591918
dtype: float64
Note: When writing performance-sensitive code, there is a good reason to spend some time becoming a
reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames
internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because
reindex has been heavily optimized), but when CPU cycles matter sprinkling a few explicit reindex calls
here and there can have an impact.
You may wish to take an object and reindex its axes to be labeled the same as another object. While the
syntax for this is straightforward albeit verbose, it is a common enough operation that the reindex_like()
method is available to make this simpler:
In [207]: df2
Out[207]:
one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498
In [208]: df3
Out[208]:
one two
a -0.170747 -0.840616
b -0.669729 1.097871
c 0.840475 -0.257255
In [209]: df.reindex_like(df2)
Out[209]:
one two
a -0.466716 0.029137
b -0.965699 1.967624
c 0.544505 0.612498
The align() method is the fastest way to simultaneously align two objects. It supports a join argument
(related to joining and merging):
• join='outer': take the union of the indexes (default)
• join='left': use the calling object’s index
• join='right': use the passed object’s index
• join='inner': intersect the indexes
It returns a tuple with both of the reindexed Series:
In [210]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [211]: s1 = s[:4]
In [212]: s2 = s[1:]
In [213]: s1.align(s2)
Out[213]:
(a -1.362711
b 0.385943
c -0.538184
d 0.502612
e NaN
dtype: float64, a NaN
b 0.385943
c -0.538184
d 0.502612
e -1.491718
dtype: float64)
You can also pass an axis option to only align on the specified axis:
If you pass a Series to DataFrame.align(), you can choose to align both objects either on the DataFrame’s
index or columns using the axis argument:
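Sketches of these options, reusing the frames from above (the joins chosen are illustrative):

df.align(df2, join='inner')            # align on both index and columns
df.align(df2, join='inner', axis=0)    # only align on the index
df.align(df2.iloc[0], axis=1)          # align a Series on the DataFrame's columns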
reindex() takes an optional parameter method which is a filling method chosen from the following table:
Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward
nearest Fill from the nearest index value
In [222]: ts
Out[222]:
2000-01-03 -0.267818
2000-01-04 0.951544
2000-01-05 1.220097
2000-01-06 1.083557
2000-01-07 -0.556464
2000-01-08 0.614753
2000-01-09 0.617631
2000-01-10 -0.131812
Freq: D, dtype: float64
In [223]: ts2
Out[223]:
2000-01-03 -0.267818
2000-01-06 1.083557
2000-01-09 0.617631
dtype: float64
In [224]: ts2.reindex(ts.index)
Out[224]:
2000-01-03 -0.267818
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 1.083557
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 0.617631
2000-01-10 NaN
Freq: D, dtype: float64
In [225]: ts2.reindex(ts.index, method='ffill')
Out[225]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 -0.267818
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 1.083557
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64
In [228]: ts2.reindex(ts.index).fillna(method='ffill')
Out[228]:
2000-01-03 -0.267818
2000-01-04 -0.267818
2000-01-05 -0.267818
2000-01-06 1.083557
2000-01-07 1.083557
2000-01-08 1.083557
2000-01-09 0.617631
2000-01-10 0.617631
Freq: D, dtype: float64
reindex() will raise a ValueError if the index is not monotonically increasing or decreasing. fillna() and
interpolate() will not perform any checks on the order of the index.
The limit and tolerance arguments provide additional control over filling while reindexing. Limit specifies
the maximum count of consecutive matches:
In contrast, tolerance specifies the maximum distance between the index and indexer values:
Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
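Sketches of both options, reusing ts2 and ts from above (the particular limit and tolerance are illustrative):

ts2.reindex(ts.index, method='ffill', limit=1)            # fill at most 1 consecutive missing label
ts2.reindex(ts.index, method='ffill', tolerance='1 day')  # only fill from labels at most one day away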
A method closely related to reindex is the drop() function. It removes a set of labels from an axis:
In [231]: df
Out[231]:
one two three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
Note that the following also works, but is a bit less obvious / clean:
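Sketches of the drop calls and of the less obvious reindex-based alternative hinted at above (using the df just shown):

df.drop(['a'], axis=0)                  # drop a row label
df.drop(['one'], axis=1)                # drop a column
df.reindex(df.index.difference(['a']))  # the more verbose equivalent of dropping row 'a'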
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary
function.
In [235]: s
Out[235]:
a -1.362711
b 0.385943
c -0.538184
d 0.502612
e -1.491718
dtype: float64
In [236]: s.rename(str.upper)
Out[236]:
A -1.362711
B 0.385943
C -0.538184
D 0.502612
E -1.491718
dtype: float64
If you pass a function, it must return a value when called with any of the labels (and must produce a set of
unique values). A dict or Series can also be used:
If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping
don’t throw an error.
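A sketch of the mapping form, using the df columns and index shown earlier (the new names are illustrative):

df.rename(columns={'one': 'foo', 'two': 'bar'},
          index={'a': 'apple', 'b': 'banana', 'd': 'durian'})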
New in version 0.21.0.
DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper
and the axis to apply that mapping to.
In [238]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
Out[238]:
foo bar three
a -0.466716 0.029137 NaN
b -0.965699 1.967624 -0.384693
c 0.544505 0.612498 0.566326
d NaN -0.191558 0.106820
Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.
In [240]: s.rename("scalar-name")
Out[240]:
a -1.362711
b 0.385943
c -0.538184
d 0.502612
e -1.491718
Name: scalar-name, dtype: float64
The rename_axis() method allows specific names of a MultiIndex to be changed (as opposed to the labels).
In [242]: df
Out[242]:
x y
let num
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60
In [244]: df.rename_axis(index=str.upper)
Out[244]:
x y
LET NUM
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60
3.3.8 Iteration
The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is
regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention
of iterating over the “keys” of the objects.
In short, basic iteration (for i in object) produces:
• Series: values
• DataFrame: column labels
Thus, for example, iterating over a DataFrame gives you the column names:
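For example, a minimal sketch:

for col in df:
    print(col)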
Pandas objects also have the dict-like items() method to iterate over the (key, value) pairs.
To iterate over the rows of a DataFrame, you can use the following methods:
• iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to
Series objects, which can change the dtypes and has some performance implications.
• itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than iterrows(), and is in most cases preferable for iterating over the values of a DataFrame.
Warning: Iterating through pandas objects is generally slow. In many cases, iterating manually over
the rows is not needed and can be avoided with one of the following approaches:
• Look for a vectorized solution: many operations can be performed using built-in methods or NumPy
functions, (boolean) indexing, …
• When you have a function that cannot work on the full DataFrame/Series at once, it is better to
use apply() instead of iterating over the values. See the docs on function application.
• If you need to do iterative manipulations on the values but performance is important, consider
writing the inner loop with cython or numba. See the enhancing performance section for some
examples of this approach.
Warning: You should never modify something you are iterating over. This is not guaranteed to work
in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it
will have no effect!
For example, in the following case setting the value has no effect:
In [247]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
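The loop run between these two displays is elided; it is presumably of this form, and writing to row has no effect because row is a copy:

for index, row in df.iterrows():
    # row is a copy, so this assignment never reaches df
    row['a'] = 10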
In [249]: df
Out[249]:
a b
0 1 a
1 2 b
2 3 c
items
Consistent with the dict-like interface, items() iterates through key-value pairs:
• Series: (index, scalar value) pairs
• DataFrame: (column, Series) pairs
For example:
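A minimal sketch:

for label, ser in df.items():
    print(label)
    print(ser)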
iterrows
iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator
yielding each index value along with a Series containing the data in each row:
Note: Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows
(dtypes are preserved across columns for DataFrames). For example,
In [253]: df_orig.dtypes
Out[253]:
int int64
float float64
dtype: object
In [255]: row
Out[255]:
int 1.0
float 1.5
Name: 0, dtype: float64
All values in row, returned as a Series, are now upcast to floats, including the original integer value in column int:
In [256]: row['int'].dtype
Out[256]: dtype('float64')
In [257]: df_orig['int'].dtype
Out[257]: dtype('int64')
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples
of the values and which is generally much faster than iterrows().
For instance, a contrived way to transpose the DataFrame would be:
In [259]: print(df2)
x y
0 1 4
1 2 5
2 3 6
In [260]: print(df2.T)
0 1 2
x 1 2 3
y 4 5 6
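The construction of df2_t is not shown; it is presumably the contrived dict-comprehension transpose below, which collects each iterrows() row as a column:

df2_t = pd.DataFrame({idx: values for idx, values in df2.iterrows()})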
In [262]: print(df2_t)
0 1 2
x 1 2 3
y 4 5 6
itertuples
The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame.
The first element of the tuple will be the row’s corresponding index value, while the remaining values are
the row values.
For instance:
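A minimal sketch:

for row in df.itertuples():
    print(row)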
This method does not convert the row to a Series object; it merely returns the values inside a namedtuple. Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().
Note: The column names will be renamed to positional names if they are invalid Python identifiers,
repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
3.3.9 .dt accessor
Series has an accessor to succinctly return datetime-like properties for the values of the Series, if it is a datetime/period-like Series. This will return a Series, indexed like the existing Series.
# datetime
In [264]: s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))
In [265]: s
Out[265]:
0 2013-01-01 09:10:12
1 2013-01-02 09:10:12
2 2013-01-03 09:10:12
3 2013-01-04 09:10:12
dtype: datetime64[ns]
In [266]: s.dt.hour
Out[266]:
0 9
1 9
2 9
3 9
dtype: int64
In [267]: s.dt.second
Out[267]:
0 12
1 12
2 12
3 12
dtype: int64
In [268]: s.dt.day
Out[268]:
0 1
1 2
2 3
3 4
dtype: int64
This enables nice expressions like this:
In [269]: s[s.dt.day == 2]
Out[269]:
1 2013-01-02 09:10:12
dtype: datetime64[ns]
In [271]: stz
Out[271]:
0 2013-01-01 09:10:12-05:00
1 2013-01-02 09:10:12-05:00
2 2013-01-03 09:10:12-05:00
3 2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]
In [272]: stz.dt.tz
Out[272]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
You can also chain these types of operations:
In [273]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[273]:
0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]
You can also format datetime values as strings with Series.dt.strftime() which supports the same format
as the standard strftime().
# DatetimeIndex
In [274]: s = pd.Series(pd.date_range('20130101', periods=4))
In [275]: s
Out[275]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: datetime64[ns]
In [276]: s.dt.strftime('%Y/%m/%d')
Out[276]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
# PeriodIndex
In [277]: s = pd.Series(pd.period_range('20130101', periods=4))
In [278]: s
Out[278]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]
In [279]: s.dt.strftime('%Y/%m/%d')
Out[279]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
The .dt accessor works for period and timedelta dtypes.
# period
In [280]: s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))
In [281]: s
Out[281]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]
In [282]: s.dt.year
Out[282]:
0 2013
1 2013
2 2013
3 2013
dtype: int64
In [283]: s.dt.day
Out[283]:
0 1
1 2
2 3
3 4
dtype: int64
# timedelta
In [284]: s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))
In [285]: s
Out[285]:
0 1 days 00:00:05
1 1 days 00:00:06
2 1 days 00:00:07
3 1 days 00:00:08
dtype: timedelta64[ns]
In [286]: s.dt.days
Out[286]:
0 1
1 1
2 1
3 1
dtype: int64
In [287]: s.dt.seconds
Out[287]:
0 5
1 6
2 7
3 8
dtype: int64
In [288]: s.dt.components
Out[288]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 1 0 0 5 0 0 0
1 1 0 0 6 0 0 0
2 1 0 0 7 0 0 0
3 1 0 0 8 0 0 0
Note: Series.dt will raise a TypeError if you access it with non-datetime-like values.
3.3.10 Vectorized string methods
Series is equipped with a set of string processing methods that make it easy to operate on each element of
the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are
accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in
string methods. For example:
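The Series itself is not shown; judging from the lowercased output below it is presumably:

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])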
In [290]: s.str.lower()
Out[290]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
Powerful pattern-matching methods are provided as well, but note that pattern-matching generally uses
regular expressions by default (and in some cases always uses them).
Please see Vectorized String Methods for a complete description.
3.3.11 Sorting
Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a
combination of both.
By index
The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its
index levels.
In [291]: df = pd.DataFrame({
.....: 'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
.....: 'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
.....: 'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
.....:
In [293]: unsorted_df
Out[293]:
three two one
a NaN 0.498142 -0.579017
d -1.636064 -0.051421 NaN
c 0.439120 -1.425223 -1.211745
b 1.021821 0.226007 1.192046
# DataFrame
In [294]: unsorted_df.sort_index()
Out[294]:
three two one
a NaN 0.498142 -0.579017
b 1.021821 0.226007 1.192046
c 0.439120 -1.425223 -1.211745
d -1.636064 -0.051421 NaN
In [295]: unsorted_df.sort_index(ascending=False)
Out[295]:
three two one
d -1.636064 -0.051421 NaN
c 0.439120 -1.425223 -1.211745
b 1.021821 0.226007 1.192046
a NaN 0.498142 -0.579017
In [296]: unsorted_df.sort_index(axis=1)
Out[296]:
one three two
a -0.579017 NaN 0.498142
d NaN -1.636064 -0.051421
c -1.211745 0.439120 -1.425223
b 1.192046 1.021821 0.226007
# Series
In [297]: unsorted_df['three'].sort_index()
Out[297]:
a NaN
b 1.021821
c 0.439120
d -1.636064
Name: three, dtype: float64
By values
The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values()
method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.
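The df1 frame used below is not shown; a frame consistent with the sorted output would be:

df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                    'two': [1, 3, 2, 4],
                    'three': [5, 4, 3, 2]})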
In [299]: df1.sort_values(by='two')
Out[299]:
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2
These methods have special treatment of NA values via the na_position argument:
In [301]: s[2] = np.nan
In [302]: s.sort_values()
Out[302]:
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
2 NaN
5 NaN
dtype: object
In [303]: s.sort_values(na_position='first')
Out[303]:
2 NaN
5 NaN
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
dtype: object
Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level
names.
# Build MultiIndex
In [304]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
.....: ('b', 2), ('b', 1), ('b', 1)])
.....:
# Build DataFrame
In [306]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
.....: index=idx)
.....:
In [307]: df_multi
Out[307]:
A
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1
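The sort itself is elided; assuming the index level names shown above, it would look like this sketch, mixing the index level second with the column A:

df_multi.sort_values(by=['second', 'A'])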
Note: If a string matches both a column name and an index level name then a warning is issued and the
column takes precedence. This will result in an ambiguity error in a future version.
searchsorted
Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted().
smallest / largest values
Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a
large Series this can be much faster than sorting the entire Series and calling head(n) on the result.
In [316]: s = pd.Series(np.random.permutation(10))
In [317]: s
Out[317]:
0 3
1 0
2 2
3 6
4 9
5 7
6 4
7 1
8 8
9 5
dtype: int64
In [318]: s.sort_values()
Out[318]:
1 0
7 1
2 2
0 3
6 4
9 5
3 6
5 7
8 8
4 9
dtype: int64
In [319]: s.nsmallest(3)
Out[319]:
1 0
7 1
2 2
dtype: int64
In [320]: s.nlargest(3)
Out[320]:
4 9
8 8
5 7
dtype: int64
DataFrame also has the nlargest and nsmallest methods.
In [321]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
.....: 'b': list('abdceff'),
.....: 'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
.....:
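Sketches of the calls (the column choices are illustrative):

df.nlargest(3, 'a')
df.nlargest(5, ['a', 'c'])
df.nsmallest(3, 'a')
df.nsmallest(5, ['a', 'c'])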
You must be explicit about sorting when the column is a MultiIndex, and fully specify all levels to by.
In [326]: df1.columns = pd.MultiIndex.from_tuples([('a', 'one'),
.....: ('a', 'two'),
.....: ('b', 'three')])
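The sort elided here would fully specify the MultiIndex level in by, for example:

df1.sort_values(by=('a', 'one'))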
3.3.12 Copying
The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they
are immutable) and returns a new object. Note that it is seldom necessary to copy objects. For
example, there are only a handful of ways to alter a DataFrame in-place:
• Inserting, deleting, or modifying a column.
• Assigning to the index or columns attributes.
• For homogeneous data, directly modifying the values via the values attribute or advanced indexing.
To be clear, no pandas method has the side effect of modifying your data; almost every method returns a
new object, leaving the original object untouched. If the data is modified, it is because you did so explicitly.
3.3.13 dtypes
For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame.
NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns] (note that NumPy
does not support timezone-aware datetimes).
Pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the
extensions pandas has made internally. See Extension types for how to write your own extension that works
with pandas. See Extension data types for a list of third-party libraries that have implemented an extension.
The following table lists all of pandas extension types. See the respective documentation sections for more
on each type.
Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible (for performance and interoperability with other libraries and methods; see object conversion).
A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.
In [328]: dft = pd.DataFrame({'A': np.random.rand(3),
.....: 'B': 1,
.....: 'C': 'foo',
.....: 'D': pd.Timestamp('20010102'),
.....: 'E': pd.Series([1.0] * 3).astype('float32'),
.....: 'F': False,
.....: 'G': pd.Series([1] * 3, dtype='int8')})
.....:
In [329]: dft
Out[329]:
A B C D E F G
0 0.537956 1 foo 2001-01-02 1.0 False 1
1 0.623232 1 foo 2001-01-02 1.0 False 1
2 0.558728 1 foo 2001-01-02 1.0 False 1
In [330]: dft.dtypes
Out[330]:
A float64
B int64
C object
D datetime64[ns]
E float32
F bool
G int8
dtype: object
On a Series object, use the dtype attribute.
In [331]: dft['A'].dtype
Out[331]: dtype('float64')
If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be
chosen to accommodate all of the data types (object is the most general).
# these ints are coerced to floats
In [332]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[332]:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
dtype: float64
# string data forces an object dtype
In [333]: pd.Series([1, 2, 3, 6., 'foo'])
Out[333]:
0 1
1 2
2 3
3 6
4 foo
dtype: object
The number of columns of each type in a DataFrame can be found by calling DataFrame.dtypes.value_counts().
In [334]: dft.dtypes.value_counts()
Out[334]:
datetime64[ns] 1
object 1
bool 1
float64 1
int64 1
int8 1
float32 1
dtype: int64
Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations.
Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.
In [335]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
In [336]: df1
Out[336]:
A
0 1.111131
1 0.459178
2 0.633793
3 1.315839
4 0.075656
5 0.526596
6 0.442085
7 -0.224085
In [337]: df1.dtypes
Out[337]:
A float32
dtype: object
In [339]: df2
Out[339]:
A B C
0 -0.809082 0.268481 0
1 -2.189453 0.991689 255
2 0.447021 -0.273606 0
3 0.072876 0.848166 1
In [340]: df2.dtypes
Out[340]:
A float16
B float64
C uint8
dtype: object
defaults
By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit).
The following will all result in int64 dtypes.
In [341]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[341]:
a int64
dtype: object
upcasting
Types can potentially be upcasted when combined with other types, meaning they are promoted from the
current type (e.g. int to float).
In [345]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [346]: df3
Out[346]:
A B C
0 0.302049 0.268481 0.0
1 -1.730275 0.991689 255.0
2 1.080814 -0.273606 0.0
3 1.388715 0.848166 1.0
4 -0.286405 -0.340956 255.0
5 -0.003677 -0.602481 0.0
6 0.345284 -0.552552 0.0
In [347]: df3.dtypes
Out[347]:
A float32
B float64
C float64
dtype: object
DataFrame.to_numpy() will return the lower-common-denominator of the dtypes, meaning the dtype that
can accommodate ALL of the types in the resulting homogeneous dtyped NumPy array. This can force
some upcasting.
In [348]: df3.to_numpy().dtype
Out[348]: dtype('float64')
astype
You can use the astype() method to explicitly convert dtypes from one to another. These will by default
return a copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition,
they will raise an exception if the astype operation is invalid.
Upcasting is always according to the numpy rules. If two different dtypes are involved in an operation, then
the more general one will be used as the result of the operation.
In [349]: df3
Out[349]:
A B C
0 0.302049 0.268481 0.0
1 -1.730275 0.991689 255.0
2 1.080814 -0.273606 0.0
3 1.388715 0.848166 1.0
4 -0.286405 -0.340956 255.0
5 -0.003677 -0.602481 0.0
6 0.345284 -0.552552 0.0
7 1.038611 0.324815 255.0
In [350]: df3.dtypes
Out[350]:
A float32
B float64
C float64
dtype: object
# conversion of dtypes
In [351]: df3.astype('float32').dtypes
Out[351]:
A float32
B float32
C float32
dtype: object
Convert a subset of columns to a specified type using astype().
In [352]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
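The conversion step itself is not shown; based on the dtypes below it was presumably:

dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)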
In [354]: dft
Out[354]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [355]: dft.dtypes
Out[355]:
a uint8
b uint8
c int64
dtype: object
New in version 0.19.0.
Convert certain columns to a specific dtype by passing a dict to astype().
In [356]: dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})
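The conversion itself is elided; based on the dtypes below it was presumably:

dft1 = dft1.astype({'a': np.bool, 'c': np.float64})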
In [358]: dft1
Out[358]:
a b c
0 True 4 7.0
1 False 5 8.0
2 True 6 9.0
In [359]: dft1.dtypes
Out[359]:
a bool
b int64
c float64
dtype: object
Note: When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting
occurs.
loc() tries to fit in what we are assigning to the current dtypes, while [] will overwrite them taking the
dtype from the right hand side. Therefore the following piece of code produces the unintended result.
In [360]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
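A sketch of that piece of code and its outcome, using the dft just created:

dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes            # uint8, as you would expect
dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)
dft.dtypes                                                # still int64: .loc fitted the values back into the existing dtypes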
object conversion
pandas offers various functions to try to force conversion of types from the object dtype to other types.
In cases where the data is already of the correct type, but stored in an object array, the DataFrame.
infer_objects() and Series.infer_objects() methods can be used to soft convert to the correct type.
In [364]: import datetime
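The frame built here is elided; judging from the transposed display below it was presumably constructed along these lines:

df = pd.DataFrame([[1, 2],
                   ['a', 'b'],
                   [datetime.datetime(2016, 3, 2),
                    datetime.datetime(2016, 3, 2)]])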
In [366]: df = df.T
In [367]: df
Out[367]:
0 1 2
0 1 a 2016-03-02
1 2 b 2016-03-02
In [368]: df.dtypes
Out[368]:
0 object
1 object
2 datetime64[ns]
dtype: object
Because the data was transposed the original inference stored all columns as object, which infer_objects
will correct.
In [369]: df.infer_objects().dtypes
Out[369]:
0 int64
1 object
2 datetime64[ns]
dtype: object
The following functions are available for one dimensional object arrays or scalars to perform hard conversion
of objects to a specified type:
• to_numeric() (conversion to numeric dtypes)
In [370]: m = ['1.1', 2, 3]
In [371]: pd.to_numeric(m)
Out[371]: array([1.1, 2. , 3. ])
• to_datetime() (conversion to datetime objects)
In [373]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]
In [374]: pd.to_datetime(m)
Out[374]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)
• to_timedelta() (conversion to timedelta objects)
In [375]: m = ['5us', pd.Timedelta('1day')]
In [376]: pd.to_timedelta(m)
Out[376]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with
elements that cannot be converted to desired dtype or object. By default, errors='raise', meaning that
any errors encountered will be raised during the conversion process. However, if errors='coerce', these
errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta)
or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired
dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to
represent as missing:
In [380]: m = ['apple', 2, 3]
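A sketch of the coercing call on this m:

pd.to_numeric(m, errors='coerce')   # 'apple' cannot be parsed and becomes NaN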
The errors parameter has a third option of errors='ignore', which will simply return the passed in data
if it encounters any errors with the conversion to a desired data type:
In [387]: m = ['apple', 2, 3]
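A sketch of the call on this m:

pd.to_numeric(m, errors='ignore')   # the conversion fails, so the input is returned unchanged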
In addition to object conversion, to_numeric() provides another argument downcast, which gives the option
of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:
In [391]: m = ['1', 2, 3]
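Sketches of the downcast options on this m:

pd.to_numeric(m, downcast='integer')    # smallest signed integer dtype
pd.to_numeric(m, downcast='unsigned')   # smallest unsigned integer dtype
pd.to_numeric(m, downcast='float')      # smallest float dtype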
These functions apply only to one-dimensional arrays, lists or scalars; they cannot be used directly on a DataFrame. However, you can apply() the function over each column:
In [397]: df = pd.DataFrame([
.....: ['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
.....:
In [398]: df
Out[398]:
0 1
0 2016-07-09 2016-03-02 00:00:00
1 2016-07-09 2016-03-02 00:00:00
In [399]: df.apply(pd.to_datetime)
Out[399]:
0 1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02
In [401]: df
Out[401]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [402]: df.apply(pd.to_numeric)
Out[402]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [404]: df
Out[404]:
0 1
0 5us 1 days 00:00:00
1 5us 1 days 00:00:00
In [405]: df.apply(pd.to_timedelta)
Out[405]:
0 1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
gotchas
Performing selection operations on integer type data can easily upcast the data to floating. The dtype
of the input data will be preserved in cases where nans are not introduced. See also Support for integer NA.
In [406]: dfi = df3.astype('int32')
In [407]: dfi['E'] = 1
In [408]: dfi
Out[408]:
A B C E
0 0 0 0 1
1 -1 0 255 1
2 1 0 0 1
3 1 0 1 1
4 0 0 255 1
5 0 0 0 1
6 0 0 0 1
7 1 0 255 1
In [409]: dfi.dtypes
Out[409]:
A int32
B int32
C int32
E int64
dtype: object
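The selection that produced casted below is elided; it was presumably a boolean selection such as:

casted = dfi[dfi > 0]   # introduces NaN, so the integer columns are upcast to float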
In [411]: casted
Out[411]:
A B C E
0 NaN NaN NaN 1
1 NaN NaN 255.0 1
2 1.0 NaN NaN 1
3 1.0 NaN 1.0 1
4 NaN NaN 255.0 1
5 NaN NaN NaN 1
6 NaN NaN NaN 1
7 1.0 NaN 255.0 1
In [412]: casted.dtypes
Out[412]:
A float64
B float64
C float64
E int64
dtype: object
While float dtypes are unchanged.
In [413]: dfa = df3.copy()
In [415]: dfa.dtypes
Out[415]:
A float32
B float64
C float64
dtype: object
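Again the selection is elided; presumably something like:

casted = dfa[df2 > 0]   # NaN is introduced, but the float columns keep their dtypes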
In [417]: casted
Out[417]:
A B C
0 NaN 0.268481 NaN
1 NaN 0.991689 255.0
2 1.080814 NaN NaN
3 1.388715 0.848166 1.0
4 NaN NaN 255.0
5 NaN NaN NaN
6 NaN NaN NaN
7 1.038611 0.324815 255.0
In [418]: casted.dtypes
Out[418]:
A float32
B float64
C float64
dtype: object
In [424]: df
Out[424]:
string int64 uint8 float64 bool1 bool2 dates category tdeltas uint64 other_dates tz_aware_dates
0 a 1 3 4.0 True False 2019-10-23 14:27:32.265347 A NaT 3 2013-01-01 2013-01-01 00:00:00-05:00
1 b 2 4 5.0 False True 2019-10-24 14:27:32.265347 B 1 days 4 2013-01-02 2013-01-02 00:00:00-05:00
2 c 3 5 6.0 True False 2019-10-25 14:27:32.265347 C 1 days 5 2013-01-03 2013-01-03 00:00:00-05:00
In [425]: df.dtypes
Out[425]:
string object
int64 int64
uint8 uint8
float64 float64
bool1 bool
bool2 bool
dates datetime64[ns]
category category
tdeltas timedelta64[ns]
uint64 uint64
other_dates datetime64[ns]
tz_aware_dates datetime64[ns, US/Eastern]
dtype: object
select_dtypes() has two parameters include and exclude that allow you to say “give me the columns
with these dtypes” (include) and/or “give the columns without these dtypes” (exclude).
For example, to select bool columns:
In [426]: df.select_dtypes(include=[bool])
Out[426]:
bool1 bool2
0 True False
1 False True
2 True False
You can also pass the name of a dtype in the NumPy dtype hierarchy:
In [427]: df.select_dtypes(include=['bool'])
Out[427]:
bool1 bool2
0 True False
1 False True
2 True False
To select string columns you must use the object dtype:
In [429]: df.select_dtypes(include=['object'])
Out[429]:
string
0 a
1 b
2 c
To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a
tree of child dtypes:
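The subdtypes helper called below is not shown; a sketch that walks the subclass tree recursively:

def subdtypes(dtype):
    # recursively collect the subclass tree of a NumPy scalar type
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]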
In [431]: subdtypes(np.generic)
Out[431]:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.int64,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.uint64]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]
Note: Pandas also defines the types category, and datetime64[ns, tz], which are not integrated into
the normal NumPy hierarchy and won’t show up with the above function.
3.4 Intro to data structures
We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior about data types, indexing, and axis labeling / alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:
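The import lines themselves are elided; they are simply:

import numpy as np
import pandas as pd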
Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will
not be broken unless done so explicitly by you.
We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and
methods in separate sections.
3.4.1 Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point
numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method
to create a Series is to call:
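The constructor call is elided; its general form is sketched below, where data can be a dict, an ndarray or a scalar and index is a list of axis labels. The random Series shown next was presumably built this way:

# general form: s = pd.Series(data, index=index)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])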
In [4]: s
Out[4]:
a 0.544055
b 0.457937
c -0.055260
d 0.487673
e 0.021212
dtype: float64
In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
In [6]: pd.Series(np.random.randn(5))
Out[6]:
0 -0.269995
1 -0.453290
2 0.118795
3 0.658569
4 -0.208094
dtype: float64
Note: pandas supports non-unique index values. If an operation that does not support duplicate index
values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all
performance-based (there are many instances in computations, like parts of GroupBy, where the index is not
used).
From dict
Series can be instantiated from dicts:
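The dict itself is not shown; one consistent with the output below is:

d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)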
In [8]: pd.Series(d)
Out[8]:
b 1
a 0
c 2
dtype: int64
Note: When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s
insertion order, if you’re using Python version >= 3.6 and Pandas version >= 0.23.
If you’re using Python < 3.6 or Pandas < 0.23, and an index is not passed, the Series index will be the
lexically ordered list of dict keys.
In the example above, if you were on a Python version lower than 3.6 or a Pandas version lower than 0.23,
the Series would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b',
'a', 'c']).
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
In [9]: d = {'a': 0., 'b': 1., 'c': 2.}
In [10]: pd.Series(d)
Out[10]:
a 0.0
b 1.0
c 2.0
dtype: float64
Note: NaN (not a number) is the standard missing data marker used in pandas.
Series is ndarray-like
Series acts very similarly to a ndarray, and is a valid argument to most NumPy functions. However,
operations such as slicing will also slice the index.
In [13]: s[0]
Out[13]: 0.544055375065062
In [14]: s[:3]
Out[14]:
a 0.544055
b 0.457937
c -0.055260
dtype: float64
In [17]: np.exp(s)
Out[17]:
a 1.722980
b 1.580809
c 0.946239
d 1.628523
e 1.021439
dtype: float64
In [18]: s.dtype
Out[18]: dtype('float64')
This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places, in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical and the nullable integer data type. See dtypes for more.
If you need the actual array backing a Series, use Series.array.
In [19]: s.array
Out[19]:
<PandasArray>
[ 0.54405538, 0.45793694, -0.05526003, 0.48767342, 0.02121215]
Length: 5, dtype: float64
Accessing the array can be useful when you need to do some operation without the index (to disable automatic
alignment, for example).
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one
or more concrete arrays like a numpy.ndarray. Pandas knows how to take an ExtensionArray and store it
in a Series or a column of a DataFrame. See dtypes for more.
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().
In [20]: s.to_numpy()
Out[20]: array([ 0.54405538, 0.45793694, -0.05526003, 0.48767342, 0.02121215])
Even if the Series is backed by an ExtensionArray, Series.to_numpy() will return a NumPy ndarray.
Series is dict-like
A Series is like a fixed-size dict in that you can get and set values by index label:
In [21]: s['a']
Out[21]: 0.544055375065062
In [23]: s
Out[23]:
a 0.544055
b 0.457937
c -0.055260
d 0.487673
e 12.000000
dtype: float64
In [24]: 'e' in s
Out[24]: True
In [25]: 'f' in s
Out[25]: False
If a label is not contained, an exception is raised:
>>> s['f']
KeyError: 'f'
Using the get method, a missing label will return None or a specified default:
In [26]: s.get('f')
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same
is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting
an ndarray.
In [28]: s + s
Out[28]:
a 1.088111
b 0.915874
c -0.110520
d 0.975347
e 24.000000
dtype: float64
In [29]: s * 2
Out[29]:
a 1.088111
b 0.915874
c -0.110520
d 0.975347
e 24.000000
dtype: float64
In [30]: np.exp(s)
Out[30]:
a 1.722980
b 1.580809
c 0.946239
d 1.628523
e 162754.791419
dtype: float64
A key difference between Series and ndarray is that operations between Series automatically align the data
based on label. Thus, you can write computations without giving consideration to whether the Series involved
have the same labels.
The result of an operation between unaligned Series will have the union of the indexes involved. If a label
is not found in one Series or the other, the result will be marked as missing NaN. Being able to write code
without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis
and research. The integrated data alignment features of the pandas data structures set pandas apart from
the majority of related tools for working with labeled data.
Note: In general, we chose to make the default result of operations between differently indexed objects
yield the union of the indexes in order to avoid loss of information. Having an index label, though the data
is missing, is typically important information as part of a computation. You of course have the option of
dropping labels with missing data via the dropna function.
Name attribute
In [33]: s
Out[33]:
0 0.482656
1 0.041324
2 0.376516
3 0.386480
4 0.884489
Name: something, dtype: float64
In [34]: s.name
Out[34]: 'something'
The Series name will be assigned automatically in many cases, in particular when taking 1D slices of
DataFrame as you will see below.
New in version 0.18.0.
You can rename a Series with the pandas.Series.rename() method.
In [35]: s2 = s.rename("different")
In [36]: s2.name
Out[36]: 'different'
3.4.2 DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can
think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly
used pandas object. Like Series, DataFrame accepts many different kinds of input:
• Dict of 1D ndarrays, lists, dicts, or Series
• 2-D numpy.ndarray
• Structured or record ndarray
• A Series
• Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting
DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed
index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.
Note: When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by
the dict’s insertion order, if you are using Python version >= 3.6 and Pandas >= 0.23.
If you are using Python < 3.6 or Pandas < 0.23, and columns is not specified, the DataFrame columns will
be the lexically ordered list of dict keys.
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts,
these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict
keys.
In [37]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [38]: df = pd.DataFrame(d)
In [39]: df
Out[39]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
Note: When a particular set of columns is passed along with a dict of data, the passed columns override
the keys in the dict.
In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')
In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')
The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as
the arrays. If no index is passed, the result will be range(n), where n is the array length.
In [44]: d = {'one': [1., 2., 3., 4.],
....: 'two': [4., 3., 2., 1.]}
....:
In [45]: pd.DataFrame(d)
Out[45]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
In [49]: pd.DataFrame(data)
Out[49]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'
Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.
In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [53]: pd.DataFrame(data2)
Out[53]:
a b c
0 1 2 NaN
1 5 10 20.0
From a Series
The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).
Missing data
Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing
data, we use np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the
data argument to the DataFrame constructor, and its masked entries will be considered missing.
Alternate constructors
DataFrame.from_dict
DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It
operates like the DataFrame constructor except for the orient parameter which is 'columns' by default,
but which can be set to 'index' in order to use the dict keys as row labels.
If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired
column names:
DataFrame.from_records
DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously
to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of
the structured dtype. For example:
In [59]: data
Out[59]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and
deleting columns works with the same syntax as the analogous dict operations:
In [61]: df['one']
Out[61]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
In [64]: df
Out[64]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False
In [67]: df
Out[67]:
one flag
a 1.0 False
b 2.0 False
c 3.0 True
d NaN False
When inserting a scalar value, it will naturally be propagated to fill the column:
In [69]: df
Out[69]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the
DataFrame’s index:
In [71]: df
Out[71]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
By default, columns get inserted at the end. The insert function is available to insert at a particular
location in the columns:
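The insert call is elided; based on the display below it was presumably:

df.insert(1, 'bar', df['one'])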
In [73]: df
Out[73]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new
columns that are potentially derived from existing columns.
In [74]: iris = pd.read_csv('data/iris.data')
In [75]: iris.head()
Out[75]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
assign always returns a copy of the data, leaving the original DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to
the DataFrame at hand. This is common when using assign in a chain of operations. For example, we can
limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and
plot:
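A sketch of such a chain, using the iris columns above (the derived column names are illustrative):

(iris.query('SepalLength > 5')
     .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
             PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
     .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))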
Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly,
this is the DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering
happens first, and then the ratio calculations. This is an example where we didn’t have a reference to the
filtered DataFrame available.
The function signature for assign is simply **kwargs. The keys are the column names for the new fields,
and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of
one argument to be called on the DataFrame. A copy of the original DataFrame is returned, with the new
values inserted.
Changed in version 0.23.0.
Starting with Python 3.6 the order of **kwargs is preserved. This allows for dependent assignment, where
an expression later in **kwargs can refer to a column created earlier in the same assign().
In the second expression, x['C'] will refer to the newly created column, which is equal to dfa['A'] + dfa['B'].
To write code compatible with all versions of Python, split the assignment in two.
Warning: Dependent assignment may subtly change the behavior of your code between Python 3.6
and older versions of Python.
If you wish to write code that supports versions of python before and after 3.6, you’ll need to take care
when passing assign expressions that
• Update an existing column
• Refer to the newly updated column in the same assign
For example, we’ll update column “A” and then refer to it when creating “B”.
>>> dependent = pd.DataFrame({"A": [1, 1, 1]})
>>> dependent.assign(A=lambda x: x["A"] + 1, B=lambda x: x["A"] + 2)
For Python 3.5 and earlier the expression creating B refers to the “old” value of A, [1, 1, 1]. The
output is then
A B
0 2 3
1 2 3
2 2 3
For Python 3.6 and later, the expression creating B refers to the “new” value of A, [2, 2, 2], which results in
A B
0 2 4
1 2 4
2 2 4
Indexing / selection
Row selection, for example, returns a Series whose index is the columns of the DataFrame:
In [83]: df.loc['b']
Out[83]:
one 2
bar 2
flag False
foo bar
one_trunc 2
Name: b, dtype: object
In [84]: df.iloc[2]
Out[84]:
one 3
bar 3
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing.
We will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.
Data alignment between DataFrame objects automatically align on both the columns and the index
(row labels). Again, the resulting object will have the union of the column and row labels.
In [87]: df + df2
Out[87]:
When doing an operation between DataFrame and Series, the default behavior is to align the Series index
on the DataFrame columns, thus broadcasting row-wise. For example:
In [88]: df - df.iloc[0]
Out[88]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 0.823010 -1.407428 1.810920 0.036966
2 -0.836963 -0.013977 3.361133 -0.321214
3 0.215511 0.136409 2.241683 -1.609052
4 0.010095 -1.006592 2.969119 -1.362727
5 1.268823 -1.691757 2.108988 -2.197647
6 0.192297 0.926025 1.937398 -0.193649
7 -1.637169 1.843618 2.244498 -2.147770
8 2.009625 -2.136301 2.017400 1.580465
9 0.027738 1.062145 3.472820 -1.503278
In the special case of working with time series data, if the DataFrame index contains dates, the broadcasting
will be column-wise:
In [89]: index = pd.date_range('1/1/2000', periods=8)
In [91]: df
Out[91]:
A B C
2000-01-01 -0.334372 1.607388 -2.433251
2000-01-02 -0.196343 1.455554 -0.134245
2000-01-03 0.351420 1.148340 -1.691404
2000-01-04 -0.493614 0.073640 0.182179
2000-01-05 -0.203285 -0.010826 -1.539934
2000-01-06 -1.415928 -0.245523 -0.167154
2000-01-07 -0.223275 -1.167290 2.137569
2000-01-08 0.120902 -0.185518 1.927187
In [92]: type(df['A'])
Out[92]: pandas.core.series.Series
In [93]: df - df['A']
Out[93]:
Warning:
df - df['A']
is now deprecated and will be removed in a future release. The preferred way to replicate this behavior
is
df.sub(df['A'], axis=0)
For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
Operations with scalars are just as you would expect:
In [94]: df * 5 + 2
Out[94]:
A B C
2000-01-01 0.328140 10.036938 -10.166254
2000-01-02 1.018287 9.277769 1.328775
2000-01-03 3.757099 7.741701 -6.457019
2000-01-04 -0.468070 2.368199 2.910897
2000-01-05 0.983577 1.945869 -5.699668
2000-01-06 -5.079639 0.772383 1.164231
2000-01-07 0.883623 -3.836452 12.687845
2000-01-08 2.604509 1.072411 11.635937
In [95]: 1 / df
Out[95]:
A B C
2000-01-01 -2.990682 0.622128 -0.410973
2000-01-02 -5.093138 0.687024 -7.449068
2000-01-03 2.845600 0.870822 -0.591225
2000-01-04 -2.025875 13.579631 5.489096
2000-01-05 -4.919213 -92.369255 -0.649379
2000-01-06 -0.706251 -4.072930 -5.982514
2000-01-07 -4.478772 -0.856685 0.467821
2000-01-08 8.271174 -5.390317 0.518891
In [96]: df ** 4
Out[96]:
A B C
2000-01-01 0.012500 6.675478e+00 35.054796
2000-01-02 0.001486 4.488623e+00 0.000325
2000-01-03 0.015251 1.738931e+00 8.184445
2000-01-04 0.059368 2.940682e-05 0.001102
2000-01-05 0.001708 1.373695e-08 5.623517
2000-01-06 4.019430 3.633894e-03 0.000781
2000-01-07 0.002485 1.856589e+00 20.877601
2000-01-08 0.000214 1.184521e-03 13.794178
Boolean operators work as well:
In [97]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
In [102]: -df1
Out[102]:
a b
0 False True
1 True False
2 False False
Transposing
To transpose, access the T attribute (also the transpose function), similar to an ndarray:
Elementwise NumPy ufuncs (log, exp, sqrt, …) and various other NumPy functions can be used with no
issues on Series and DataFrame, assuming the data within are numeric:
In [104]: np.exp(df)
Out[104]:
A B C
2000-01-01 0.715788 4.989759 0.087751
2000-01-02 0.821731 4.286857 0.874376
2000-01-03 1.421084 3.152955 0.184261
2000-01-04 0.610416 1.076419 1.199829
2000-01-05 0.816046 0.989232 0.214395
2000-01-06 0.242700 0.782295 0.846069
2000-01-07 0.799894 0.311209 8.478801
2000-01-08 1.128514 0.830674 6.870160
In [105]: np.asarray(df)
Out[105]:
array([[-0.33437191, 1.60738755, -2.43325073],
[-0.19634262, 1.45555386, -0.13424497],
[ 0.35141973, 1.1483402 , -1.69140383],
[-0.49361395, 0.0736397 , 0.18217936],
[-0.20328455, -0.01082611, -1.53993362],
[-1.41592783, -0.24552349, -0.16715382],
[-0.22327549, -1.16729044, 2.13756903],
[ 0.12090182, -0.18551784, 1.92718745]])
DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics and data model
are quite different in places from an n-dimensional array.
Series implements __array_ufunc__, which allows it to work with NumPy’s universal functions.
The ufunc is applied to the underlying array in a Series.
In [107]: np.exp(ser)
Out[107]:
0 2.718282
1 7.389056
2 20.085537
3 54.598150
dtype: float64
Changed in version 0.25.0: When multiple Series are passed to a ufunc, they are aligned before performing
the operation.
Like other parts of the library, pandas will automatically align labeled inputs as part of a ufunc with
multiple inputs. For example, using numpy.remainder() on two Series with differently ordered labels will
align before the operation.
In [108]: ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [110]: ser1
Out[110]:
a 1
b 2
c 3
dtype: int64
In [111]: ser2
Out[111]:
b 1
a 3
c 5
dtype: int64
In [114]: ser3
Out[114]:
b 2
c 4
d 6
dtype: int64
When a binary ufunc is applied to a Series and an Index, the Series implementation takes precedence and a Series is returned.
NumPy ufuncs are safe to apply to Series backed by non-ndarray arrays, for example SparseArray (see
Sparse calculation). If possible, the ufunc is applied without converting the underlying data to an ndarray.
Console display
Very large DataFrames will be truncated to display them in the console. You can also get a summary using
info(). (Here I am reading a CSV version of the baseball dataset from the plyr R package):
In [119]: baseball = pd.read_csv('data/baseball.csv')
In [120]: print(baseball)
id player year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
0 88641 womacto01 2006 2 CHN NL 19 50 6 14 1 0 1 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
1 88643 schilcu01 2006 1 BOS AL 31 2 0 1 0 0 0 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... .. .. ... .. ... ... ... .. ... ... ... .. ... ... ... ... ... ...
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1 13 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0 0 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
In [121]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
id 100 non-null int64
player 100 non-null object
year 100 non-null int64
stint 100 non-null int64
team 100 non-null object
lg 100 non-null object
g 100 non-null int64
ab 100 non-null int64
r 100 non-null int64
h 100 non-null int64
X2b 100 non-null int64
You can change how much to print on a single row by setting the display.width option. You can adjust the max width of the individual columns by setting display.max_colwidth:
In [128]: pd.DataFrame(datafile)
Out[128]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...
In [130]: pd.DataFrame(datafile)
Out[130]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02
You can also disable this feature via the expand_frame_repr option. This will print the table in one block.
If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:
In [131]: df = pd.DataFrame({'foo1': np.random.randn(5),
.....: 'foo2': np.random.randn(5)})
.....:
In [132]: df
Out[132]:
foo1 foo2
0 0.175701 0.688798
1 1.857269 0.874645
2 1.062442 -0.593865
3 0.780121 0.422091
4 0.357684 -0.227634
In [133]: df.foo1
Out[133]:
0 0.175701
1 1.857269
2 1.062442
3 0.780121
4 0.357684
Name: foo1, dtype: float64
The columns are also connected to the IPython completion mechanism so they can be tab-completed:
Comparison with R / R libraries
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for,
this page was started to provide a more detailed look at the R language and its many third party libraries
as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
• Functionality / flexibility: what can/cannot be done with each tool
• Performance: how fast are operations. Hard numbers/benchmarks are preferable
• Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side
code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files; see External compatibility for an example.
Quick reference
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas
equivalents.
R pandas
dim(df) df.shape
head(df) df.head()
slice(df, 1:10) df.iloc[:10]
filter(df, col1 == 1, col2 == 1) df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,] df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2) df[['col1', 'col2']]
select(df, col1:col3) df.loc[:, 'col1':'col3']
select(df, -(col1:col3)) df.drop(cols_to_drop, axis=1) but see note 1
distinct(select(df, col1)) df[['col1']].drop_duplicates()
distinct(select(df, col1, col2)) df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10) df.sample(n=10)
sample_frac(df, 0.01) df.sample(frac=0.01)
Sorting
R pandas
arrange(df, col1, col2) df.sort_values(['col1', 'col2'])
arrange(df, desc(col1)) df.sort_values('col1', ascending=False)
Transforming
R pandas
select(df, col_one = col1) df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1) df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b) df.assign(c=df.a-df.b)
Grouping and summarizing
R pandas
summary(df) df.describe()
gdf <- group_by(df, col1) gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE)) df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1)) df.groupby('col1').sum()
1. Dropping a range of columns as in select(df, -(col1:col3)) needs the explicit
list of columns, for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
Base R
Slicing with R's c
R makes it easy to access data.frame columns by integer location:
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
In [5]: n = 30
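The pandas equivalent that should follow In [5] was lost in extraction; a minimal sketch of selecting noncontiguous columns by integer location, assuming a 30x30 frame of random data:
n = 30
df = pd.DataFrame(np.random.randn(n, n))
# columns 0-9 and 24-28 by position, using numpy.r_ to build the index array
df.iloc[:, np.r_[:10, 24:29]]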
aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df
and splitting it into groups by1 and by2:
df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
In [9]: df = pd.DataFrame(
...: {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
...: 'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
...: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
...: np.nan]})
...:
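The groupby() call that mirrors aggregate was lost here; a minimal sketch using the df just constructed (as in R, rows whose group keys are missing are dropped from the result):
g = df.groupby(['by1', 'by2'])
g[['v1', 'v2']].mean()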
match / %in%
A common way to select data in R is using %in% which is defined using the function match. The operator
%in% is used to return a logical vector indicating if there is a match or not:
s <- 0:4
s %in% c(2,4)
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
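The pandas equivalent of %in% is Series.isin(); a minimal sketch:
s = pd.Series(np.arange(5), dtype=np.float32)
s.isin([2, 4])
isin() returns a boolean Series, matching the logical vector produced by %in%.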
tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly
irregular. Using a data.frame called baseball, and retrieving information based on the array team:
baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))
tapply(baseball$batting.average, baseball$team, max)
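A pandas version can be written with pivot_table(); a sketch that builds a comparable frame with randomly generated teams, players, and batting averages (the column names here are assumptions):
import random
import string
baseball = pd.DataFrame(
    {'team': ['team %d' % (x + 1) for x in range(5)] * 5,
     'player': random.sample(list(string.ascii_lowercase), 25),
     'batting avg': np.random.uniform(.200, .400, 25)})
baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)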
subset
The query() method is similar to the base R subset function. In R you might want to get the rows of a
data.frame where one column’s values are less than another column’s values:
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it
were an index/slice as well as standard boolean indexing:
In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
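The three forms referred to above were lost in extraction; sketches using the df just created:
df.query('a <= b')     # query() with an expression string
df[df.a <= df.b]       # boolean indexing with an expression
df.loc[df.a <= df.b]   # boolean indexing via loc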
with
An expression using a data.frame called df in R with the columns a and b would be evaluated using with
like so:
In pandas the equivalent expression, using the eval() method, would be:
In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
In [23]: df.eval('a + b')
Out[23]:
0 1.582100
1 1.102172
2 -0.621243
3 0.059273
4 -1.744880
5 0.915561
6 -2.643852
7 -0.519496
8 0.733243
9 -2.914023
dtype: float64
plyr
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around
three data structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how
these data structures could be mapped in Python.
R Python
array list
lists dictionary or list of objects
data.frame dataframe
ddply
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
In pandas the equivalent expression, using the groupby() method, would be:
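A sketch of that groupby() call, assuming a DataFrame built with the same columns as the R df above (the week column is an assumption, since the R definition is cut off):
df = pd.DataFrame({'x': np.random.uniform(1., 168., 120),
                   'y': np.random.uniform(7., 334., 120),
                   'z': np.random.uniform(1.7, 20.7, 120),
                   'month': [5, 6, 7, 8] * 30,
                   'week': np.random.randint(1, 4, 120)})
grouped = df.groupby(['month', 'week'])
grouped['x'].agg([np.mean, np.std])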
reshape / reshape2
melt.array
An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:
melt.list
An expression using a list called a in R where you want to melt it into a data.frame:
In Python, this list would be a list of tuples, so the DataFrame() constructor would convert it to a DataFrame as
required.
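The list a is not defined above; a construction consistent with the Out[31] shown below:
a = list(enumerate(list(range(1, 5)) + [np.nan]))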
In [31]: pd.DataFrame(a)
Out[31]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
cast
In R, acast uses a data.frame called df to cast it into a higher dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
Similarly for dcast which uses a data.frame called df in R to aggregate information based on Animal and
FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)
Python can approach this in two different ways. Firstly, similar to above using pivot_table():
In [38]: df = pd.DataFrame({
....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
....: 'Animal2', 'Animal3'],
....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
....: })
....:
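The pivot_table() call itself is not shown above; a minimal sketch, together with the second approach using groupby():
df.pivot_table(values='Amount', index='Animal', columns='FeedType', aggfunc='sum')
# second approach: group on both keys and aggregate
df.groupby(['Animal', 'FeedType'])['Amount'].sum()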
For more details and examples see the reshaping documentation or the groupby documentation.
factor
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
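pandas covers both with pd.cut() and the category dtype; minimal sketches of the equivalents:
pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)          # like R's cut()
pd.Series([1, 2, 3, 2, 2, 3]).astype('category')  # like R's factor()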
Comparison with SQL
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some
examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows:
Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a
DataFrame called tips and assume we have a database table of the same name and structure.
In [5]: tips.head()
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
SELECT
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all
columns):
With pandas, column selection is done by passing a list of column names to your DataFrame:
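The selection itself was lost in extraction; a sketch (the particular column list is an assumption):
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)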
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
WHERE
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
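The statement referred to in the next sentence appears to have been dropped; it is a plain boolean filter on the time column:
tips[tips['time'] == 'Dinner'].head(5)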
The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows
with True.
In [8]: is_dinner = tips['time'] == 'Dinner'
In [9]: is_dinner.value_counts()
Out[9]:
True 176
False 68
Name: time, dtype: int64
In [10]: tips[is_dinner].head(5)
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
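The next example checks for NULL values. The construction of the example frame is not shown above; a sketch consistent with the output that follows:
frame = pd.DataFrame({'col1': ['A', 'B', np.nan, 'C', 'D'],
                      'col2': ['F', np.nan, 'G', 'H', 'I']})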
In [14]: frame
Out[14]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [15]: frame[frame['col2'].isna()]
Out[15]:
col1 col2
1 B NaN
Getting items where col1 IS NOT NULL can be done with notna().
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [16]: frame[frame['col1'].notna()]
Out[16]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
GROUP BY
In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method.
groupby() typically refers to a process where we'd like to split a dataset into groups, apply some function
(typically aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For
instance, a query getting us the number of tips left by sex:
In [17]: tips.groupby('sex').size()
Out[17]:
sex
Female 87
Male 157
dtype: int64
Notice that in the pandas code we used size() and not count(). This is because count() applies the
function to each column, returning the number of not null records within each.
In [18]: tips.groupby('sex').count()
Out[18]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
In [19]: tips.groupby('sex')['total_bill'].count()
Out[19]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs
by day of the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which
functions to apply to specific columns.
Grouping by more than one column is done by passing a list of columns to the groupby() method.
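The calls described in the two preceding paragraphs were lost in extraction; minimal sketches:
# different aggregations per column via a dict passed to agg()
tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
# grouping by more than one column
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})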
JOIN
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their
indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT,
INNER, FULL) or the columns to join on (column names or indices).
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another
DataFrame’s index.
RIGHT JOIN
FULL JOIN
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined
columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (for example, MySQL does not support them).
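The merge() calls for these join types were lost in extraction; sketches:
pd.merge(df1, df2, on='key', how='right')   # RIGHT JOIN
pd.merge(df1, df2, on='key', how='outer')   # FULL JOIN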
UNION
SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.
Top N rows with offset
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
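A pandas sketch of the same query (top 10 tips starting at offset 5), using nlargest():
tips.nlargest(10 + 5, columns='tip').tail(10)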
Top N rows per group
In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
....: .rank(method='first', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['day', 'rnk']))
....:
Out[36]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
Let’s find tips with (rank < 3) per gender group for (tips < 2). Notice that when using rank(method='min')
function rnk_min remains the same for the same tip (as Oracle’s RANK() function)
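A sketch of that query in pandas:
(tips[tips['tip'] < 2]
 .assign(rnk_min=tips.groupby(['sex'])['tip'].rank(method='min'))
 .query('rnk_min < 3')
 .sort_values(['sex', 'rnk_min']))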
UPDATE
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
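In pandas the same update is an assignment with loc; a sketch:
tips.loc[tips['tip'] < 2, 'tip'] *= 2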
DELETE
In pandas we select the rows that should remain, instead of deleting them
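A sketch, assuming the rows to be deleted are those with tip above 9 (the SQL condition is not shown here):
tips = tips.loc[tips['tip'] <= 9]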
Comparison with SAS
For potential users coming from SAS this page is meant to demonstrate how different SAS operations would
be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows:
Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which
displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter
notebook or terminal) - the equivalent in SAS would be:
Data structures
pandas SAS
DataFrame data set
column variable
row observation
groupby BY-group
NaN .
DataFrame / Series
A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns
that can be of different types. As will be shown in this document, almost any operation that can be applied
to a data set using SAS’s DATA step, can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data
structure for a single column, but in general, working with a Series is analogous to referencing a column in
the DATA step.
Index
Every DataFrame and Series has an Index - which are labels on the rows of the data. SAS does not have an
exactly analogous concept. A data set’s rows are essentially unlabeled, other than an implicit integer index
that can be accessed during the DATA step (_N_).
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1,
and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately
an important part of pandas to understand, for this comparison we will essentially ignore the Index and just
treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on
how to use an Index effectively.
A SAS data set can be built from specified values by placing the data after a datalines statement and
specifying the column names.
data df;
input x y;
datalines;
1 2
3 4
5 6
;
run;
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the
data.
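The dictionary construction itself is not shown; a minimal sketch that produces the frame displayed below:
df = pd.DataFrame({'x': [1, 3, 5],
                   'y': [2, 4, 6]})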
In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6
Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within
the pandas tests (csv) will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.
In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed.
For example, if the data was instead tab delimited, and did not have column names, the pandas command
would be:
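A sketch of that call (the file name is an assumption):
tips = pd.read_csv('tips.csv', sep='\t', header=None)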
In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL
databases. These are all read via a pd.read_* function. See the IO documentation for more details.
Exporting data
Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar api.
tips.to_csv('tips2.csv')
Data operations
Operations on columns
In the DATA step, arbitrary math expressions can be used on new or existing columns.
data tips;
set tips;
total_bill = total_bill - 2;
new_bill = total_bill / 2;
run;
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way.
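The assignments themselves were lost in extraction; a sketch matching the DATA step above and the output below:
tips['total_bill'] = tips['total_bill'] - 2
tips['new_bill'] = tips['total_bill'] / 2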
In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
Filtering
data tips;
set tips;
if total_bill > 10;
run;
data tips;
set tips;
where total_bill > 10;
/* equivalent in this case - where happens before the
DATA step begins and can also be used in PROC statements */
run;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing
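A sketch of the boolean indexing equivalent of the DATA step above:
tips[tips['total_bill'] > 10].head()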
If/then logic
data tips;
set tips;
format bucket $4.;
The same operation in pandas can be accomplished using the where method from numpy.
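A sketch consistent with the bucket column shown below (the 10-dollar threshold is inferred from the output):
tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')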
In [13]: tips.head()
Out[13]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
Date functionality
data tips;
set tips;
format date1 date2 date1_plusmonth mmddyy10.;
date1 = mdy(1, 15, 2013);
date2 = mdy(2, 15, 2015);
date1_year = year(date1);
date2_month = month(date2);
* shift date to beginning of next interval;
date1_next = intnx('MONTH', date1, 1);
* count intervals between dates;
months_between = intck('MONTH', date1, date2);
run;
The equivalent pandas operations are shown below. In addition to these functions pandas supports other
Time Series features not available in Base SAS (such as resampling and custom offsets) - see the timeseries
documentation for more details.
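The earlier steps of this example were lost in extraction; a sketch of the date columns assumed by In [19] below, using the dates from the SAS code above:
tips['date1'] = pd.Timestamp('2013-01-15')
tips['date2'] = pd.Timestamp('2015-02-15')
tips['date1_year'] = tips['date1'].dt.year
tips['date2_month'] = tips['date2'].dt.month
tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()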
In [19]: tips['months_between'] = (
....: tips['date2'].dt.to_period('M') - tips['date1'].dt.to_period('M'))
....:
Selection of columns
SAS provides keywords in the DATA step to select, drop, and rename columns.
data tips;
set tips;
keep sex total_bill tip;
run;
data tips;
set tips;
drop sex;
run;
data tips;
set tips;
rename total_bill=total_bill_2;
run;
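The pandas equivalents follow. The 'keep' example appears to have been dropped in extraction; a sketch mirroring the SAS keep statement above:
# keep
tips[['sex', 'total_bill', 'tip']].head()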
# drop
In [22]: tips.drop('sex', axis=1).head()
Out[22]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
# rename
In [23]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[23]:
total_bill_2 tip sex smoker day time size
Sorting by values
pandas objects have a sort_values() method, which takes a list of columns to sort by.
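The call that produces the ordering below was lost in extraction; a sketch sorting on sex and total_bill:
tips = tips.sort_values(['sex', 'total_bill'])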
In [25]: tips.head()
Out[25]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
String processing
Length
SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN
excludes trailing blanks and LENGTHC includes trailing blanks.
data _null_;
set tips;
put(LENGTHN(time));
put(LENGTHC(time));
run;
Python determines the length of a character string with the len function. len includes trailing blanks. Use
len and rstrip to exclude trailing blanks.
In [26]: tips['time'].str.len().head()
Out[26]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
In [27]: tips['time'].str.rstrip().str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
Find
SAS determines the position of a character in a string with the FINDW function. FINDW takes the string
defined by the first argument and searches for the first position of the substring you supply as the second
argument.
data _null_;
set tips;
put(FINDW(sex,'ale'));
run;
Python determines the position of a character in a string with the find function. find searches for the first
position of the substring. If the substring is found, the function returns its position. Keep in mind that
Python indexes are zero-based and the function will return -1 if it fails to find the substring.
In [28]: tips['sex'].str.find("ale").head()
Out[28]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
Substring
SAS extracts a substring from a string based on its position with the SUBSTR function.
data _null_;
set tips;
put(substr(sex,1,1));
run;
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in
mind that Python indexes are zero-based.
In [29]: tips['sex'].str[0:1].head()
Out[29]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object
Scan
The SAS SCAN function returns the nth word from a string. The first argument is the string you want to
parse and the second argument specifies which word you want to extract.
data firstlast;
input String $60.;
First_Name = scan(string, 1);
Last_Name = scan(string, -1);
datalines2;
John Smith;
Jane Cook;
;;;
run;
Python extracts the words from a string using string methods or regular expressions. There are much
more powerful approaches, but this just shows a simple approach.
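The extraction code itself was lost here; a sketch consistent with the Out[33] shown below. Note that taking column 0 of the rsplit result returns the first token, which is why Last_Name repeats the first name in the output; selecting column 1 (or using n=1 with rsplit) would return the surname.
firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
firstlast['First_Name'] = firstlast['String'].str.split(" ", expand=True)[0]
firstlast['Last_Name'] = firstlast['String'].str.rsplit(" ", expand=True)[0]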
In [33]: firstlast
Out[33]:
String First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane
The SAS UPCASE LOWCASE and PROPCASE functions change the case of the argument.
data firstlast;
input String $60.;
string_up = UPCASE(string);
string_low = LOWCASE(string);
string_prop = PROPCASE(string);
datalines2;
John Smith;
Jane Cook;
;;;
run;
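The pandas code for this example was lost in extraction; a sketch using the .str accessor that is consistent with the output below:
firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
firstlast['string_up'] = firstlast['String'].str.upper()
firstlast['string_low'] = firstlast['String'].str.lower()
firstlast['string_prop'] = firstlast['String'].str.title()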
In [38]: firstlast
Out[38]:
String string_up string_low string_prop
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook
Merging
In [40]: df1
Out[40]:
key value
0 A 0.970270
1 B 0.438338
2 C 0.402038
3 D -0.629526
In [42]: df2
Out[42]:
key value
0 B -0.600662
1 D -0.478682
2 D -0.546915
3 E -0.054672
In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the
in= dummy variables to track whether a match was found in one or both input frames.
pandas DataFrames have a merge() method, which provides similar functionality. Note that the data does
not have to be sorted ahead of time, and different join types are accomplished via the how keyword.
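The merge() calls that produce the joined frames below were lost in extraction; sketches:
inner_join = df1.merge(df2, on=['key'], how='inner')
left_join = df1.merge(df2, on=['key'], how='left')
right_join = df1.merge(df2, on=['key'], how='right')
outer_join = df1.merge(df2, on=['key'], how='outer')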
In [44]: inner_join
Out[44]:
key value_x value_y
0 B 0.438338 -0.600662
1 D -0.629526 -0.478682
2 D -0.629526 -0.546915
In [46]: left_join
Out[46]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
In [48]: right_join
Out[48]:
key value_x value_y
0 B 0.438338 -0.600662
1 D -0.629526 -0.478682
2 D -0.629526 -0.546915
3 E NaN -0.054672
In [50]: outer_join
Out[50]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E NaN -0.054672
Missing data
Like SAS, pandas has a representation for missing data - which is the special float value NaN (not a number).
Many of the semantics are the same, for example missing data propagates through numeric operations, and
is ignored by default for aggregations.
In [51]: outer_join
Out[51]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E NaN -0.054672
In [53]: outer_join['value_x'].sum()
Out[53]: 0.5515934095727003
One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could
do this to filter missing values.
data outer_join_nulls;
set outer_join;
if value_x = .;
run;
data outer_join_no_nulls;
set outer_join;
if value_x ^= .;
run;
Which doesn’t work in pandas. Instead, the pd.isna or pd.notna functions should be used for comparisons.
In [54]: outer_join[pd.isna(outer_join['value_x'])]
Out[54]:
key value_x value_y
5 E NaN -0.054672
In [55]: outer_join[pd.notna(outer_join['value_x'])]
Out[55]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 NaN
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
pandas also provides a variety of methods to work with missing data - some of which would be challenging to
express in SAS. For example, there are methods to drop all rows with any missing values, replacing missing
values with a specified value, like the mean, or forward filling from previous rows. See the missing data
documentation for more.
In [56]: outer_join.dropna()
Out[56]:
key value_x value_y
1 B 0.438338 -0.600662
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
In [57]: outer_join.fillna(method='ffill')
Out[57]:
key value_x value_y
0 A 0.970270 NaN
1 B 0.438338 -0.600662
2 C 0.402038 -0.600662
3 D -0.629526 -0.478682
4 D -0.629526 -0.546915
5 E -0.629526 -0.054672
In [58]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
Out[58]:
0 0.970270
1 0.438338
2 0.402038
3 -0.629526
4 -0.629526
5 0.110319
Name: value_x, dtype: float64
GroupBy
Aggregation
SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on
numeric columns.
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documen-
tation for more details and examples.
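A sketch of the aggregation that produces tips_summed below:
tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()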
In [60]: tips_summed.head()
Transformation
In SAS, if the group aggregations need to be used with the original frame, it must be merged back together.
For example, to subtract the mean for each observation by smoker group.
data tips;
merge tips(in=a) smoker_means(in=b);
by smoker;
adj_total_bill = total_bill - group_bill;
if a and b;
run;
pandas groupby provides a transform mechanism that allows these types of operations to be succinctly
expressed in one operation.
In [61]: gb = tips.groupby('smoker')['total_bill']
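The transform step itself was lost in extraction; it subtracts the per-group mean from each row:
tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')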
In [63]: tips.head()
Out[63]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
By group processing
In addition to aggregation, pandas groupby can be used to replicate most other by group processing from
SAS. For example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.
data tips_first;
set tips;
by sex smoker;
if FIRST.sex or FIRST.smoker then output;
run;
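A sketch of the pandas equivalent, taking the first row of each sex/smoker group:
tips.groupby(['sex', 'smoker']).first()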
Other Considerations
Disk vs memory
pandas operates exclusively in memory, where a SAS data set exists on disk. This means that the size of
data able to be loaded in pandas is limited by your machine’s memory, but also that the operations on that
data may be faster.
If out of core processing is needed, one possibility is the dask.dataframe library (currently in development),
which provides a subset of pandas functionality for an on-disk DataFrame.
Data interop
pandas provides a read_sas() method that can read SAS data saved in the XPORT or SAS7BDAT binary
format.
df = pd.read_sas('transport-file.xpt')
df = pd.read_sas('binary-file.sas7bdat')
You can also specify the file format directly. By default, pandas will try to infer the file format based on its
extension.
df = pd.read_sas('transport-file.xpt', format='xport')
df = pd.read_sas('binary-file.sas7bdat', format='sas7bdat')
XPORT is a relatively limited format and the parsing of it is not as optimized as some of the other pandas
readers. An alternative way to interop data between SAS and pandas is to serialize to csv.
Comparison with Stata
For potential users coming from Stata this page is meant to demonstrate how different Stata operations
would be performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself
with the library.
As is customary, we import pandas and NumPy as follows. This means that we can refer to the libraries as
pd and np, respectively, for the rest of the document.
Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which
displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter
notebook or terminal) – the equivalent in Stata would be:
list in 1/5
Data structures
pandas Stata
DataFrame data set
column variable
row observation
groupby bysort
NaN .
DataFrame / Series
A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns
that can be of different types. As will be shown in this document, almost any operation that can be applied
to a data set in Stata can also be accomplished in pandas.
Index
Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly
analogous concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index
that can be accessed with _n.
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1,
and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately
an important part of pandas to understand, for this comparison we will essentially ignore the Index and just
treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on
how to use an Index effectively.
A Stata data set can be built from specified values by placing the data after an input statement and
specifying the column names.
input x y
1 2
3 4
5 6
end
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the
data.
In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6
Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within
the pandas tests (csv) will be used in many of the following examples.
Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the
current working directory, we can import it as follows.
The pandas method is read_csv(), which works similarly. Additionally, it will automatically download the
data set if presented with a url.
In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Like import delimited, read_csv() can take a number of parameters to specify how the data should be
parsed. For example, if the data were instead tab delimited, did not have column names, and existed in the
current working directory, the pandas command would be:
Pandas can also read Stata data sets in .dta format with the read_stata() function.
df = pd.read_stata('data.dta')
In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS,
HDF5, Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation
for more details.
Exporting data
tips.to_csv('tips2.csv')
Pandas can also export to Stata file format with the DataFrame.to_stata() method.
tips.to_stata('tips2.dta')
Data operations
Operations on columns
In Stata, arbitrary math expressions can be used with the generate and replace commands on new or
existing columns. The drop command drops the column from the data set.
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way. The DataFrame.drop() method drops a column from the
DataFrame.
In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
Filtering
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
If/then logic
The same operation in pandas can be accomplished using the where method from numpy.
In [14]: tips.head()
Out[14]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
Date functionality
The equivalent pandas operations are shown below. In addition to these functions, pandas supports other
Time Series features not available in Stata (such as time zone handling and custom offsets) – see the timeseries
documentation for more details.
Selection of columns
drop sex
The same operations are expressed in pandas below. Note that in contrast to Stata, these operations do not
happen in place. To make these changes persist, assign the operation back to a variable.
# keep
In [22]: tips[['sex', 'total_bill', 'tip']].head()
Out[22]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61
# drop
In [23]: tips.drop('sex', axis=1).head()
Out[23]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
# rename
In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
Out[24]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
Sorting by values
pandas objects have a DataFrame.sort_values() method, which takes a list of columns to sort by.
In [26]: tips.head()
Out[26]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
String processing
Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII
and Unicode strings, respectively.
Python determines the length of a character string with the len function. In Python 3, all strings are
Unicode strings. len includes trailing blanks. Use len and rstrip to exclude trailing blanks.
In [27]: tips['time'].str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
In [28]: tips['time'].str.rstrip().str.len().head()
Out[28]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
Stata determines the position of a character in a string with the strpos() function. This takes the string
defined by the first argument and searches for the first position of the substring you supply as the second
argument.
Python determines the position of a character in a string with the find() function. find searches for the
first position of the substring. If the substring is found, the function returns its position. Keep in mind that
Python indexes are zero-based and the function will return -1 if it fails to find the substring.
In [29]: tips['sex'].str.find("ale").head()
Out[29]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
Stata extracts a substring from a string based on its position with the substr() function.
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in
mind that Python indexes are zero-based.
In [30]: tips['sex'].str[0:1].head()
Out[30]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object
The Stata word() function returns the nth word from a string. The first argument is the string you want to
parse and the second argument specifies which word you want to extract.
clear
input str20 string
"John Smith"
"Jane Cook"
end
Python extracts a substring from a string based on its text by using regular expressions. There are much
more powerful approaches, but this just shows a simple approach.
In [34]: firstlast
Out[34]:
string First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane
Changing case
The Stata strupper(), strlower(), strproper(), ustrupper(), ustrlower(), and ustrtitle() functions
change the case of ASCII and Unicode strings, respectively.
clear
input str20 string
"John Smith"
"Jane Cook"
end
In [39]: firstlast
Out[39]:
string upper lower title
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook
Merging
In [41]: df1
Out[41]:
key value
0 A -1.456807
1 B 1.779988
2 C -1.602829
3 D 0.944411
In [43]: df2
Out[43]:
key value
0 B 0.750030
1 D 0.349022
2 D 1.433851
3 E -0.720900
In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file
name on disk. In contrast, Python must have both DataFrames already in memory.
By default, Stata performs an outer join, where all observations from both data sets are left in memory after
the merge. One can keep only observations from the initial data set, the merged data set, or the intersection
of the two by using the values created in the _merge variable.
preserve
* Left join
merge 1:n key using df2.dta
keep if _merge == 1
* Right join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 2
* Inner join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 3
* Outer join
restore
merge 1:n key using df2.dta
pandas DataFrames have a DataFrame.merge() method, which provides similar functionality. Note that
different join types are accomplished via the how keyword.
In [45]: inner_join
Out[45]:
key value_x value_y
0 B 1.779988 0.750030
1 D 0.944411 0.349022
2 D 0.944411 1.433851
In [47]: left_join
Out[47]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
In [49]: right_join
Out[49]:
key value_x value_y
0 B 1.779988 0.750030
1 D 0.944411 0.349022
In [51]: outer_join
Out[51]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E NaN -0.720900
Missing data
Like Stata, pandas has a representation for missing data – the special float value NaN (not a number). Many
of the semantics are the same; for example missing data propagates through numeric operations, and is
ignored by default for aggregations.
In [52]: outer_join
Out[52]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E NaN -0.720900
In [54]: outer_join['value_x'].sum()
Out[54]: 0.6091740246349007
One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you
could do this to filter missing values.
This doesn’t work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for compar-
isons.
In [55]: outer_join[pd.isna(outer_join['value_x'])]
Out[55]:
key value_x value_y
5 E NaN -0.7209
In [56]: outer_join[pd.notna(outer_join['value_x'])]
Out[56]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 NaN
3 D 0.944411 0.349022
4 D 0.944411 1.433851
Pandas also provides a variety of methods to work with missing data – some of which would be challenging
to express in Stata. For example, there are methods to drop all rows with any missing values, replacing
missing values with a specified value, like the mean, or forward filling from previous rows. See the missing
data documentation for more.
# Drop rows with any missing value
In [57]: outer_join.dropna()
Out[57]:
key value_x value_y
1 B 1.779988 0.750030
3 D 0.944411 0.349022
4 D 0.944411 1.433851
# Fill forwards
In [58]: outer_join.fillna(method='ffill')
Out[58]:
key value_x value_y
0 A -1.456807 NaN
1 B 1.779988 0.750030
2 C -1.602829 0.750030
3 D 0.944411 0.349022
4 D 0.944411 1.433851
5 E 0.944411 -0.720900
GroupBy
Aggregation
Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric
columns.
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documen-
tation for more details and examples.
In [61]: tips_summed.head()
Out[61]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
Transformation
In Stata, if the group aggregations need to be used with the original data set, one would usually use bysort
with egen(). For example, to subtract the mean for each observation by smoker group.
pandas groupby provides a transform mechanism that allows these types of operations to be succinctly
expressed in one operation.
In [62]: gb = tips.groupby('smoker')['total_bill']
In [64]: tips.head()
Out[64]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
By group processing
In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from
Stata. For example, the following example lists the first observation in the current sort order by sex/smoker
group.
Other considerations
Disk vs memory
Pandas and Stata both operate exclusively in memory. This means that the size of data able to be loaded
in pandas is limited by your machine’s memory. If out of core processing is needed, one possibility is the
dask.dataframe library, which provides a subset of pandas functionality for an on-disk DataFrame.
3.6 Tutorials
This is a guide to many pandas tutorials, geared mainly for new users.
The goal of this 2015 cookbook (by Julia Evans) is to give you some concrete examples for getting started
with pandas. These are examples with real-world data, and all the bugs and weirdness that entails. For the
table of contents, see the pandas-cookbook GitHub repository.
This guide is an introduction to the data analysis process using the Python data ecosystem and an interesting
open dataset. There are four sections covering selected topics such as munging data, aggregating data, visualizing
data and time series.
Practice your skills with real data sets and exercises. For more resources, please visit the main repository.
Modern pandas
Tutorial series written in 2016 by Tom Augspurger. The source may be found in the GitHub repository
TomAugspurger/effective-pandas.
• Modern Pandas
• Method Chaining
• Indexes
• Performance
• Tidy Data
• Visualization
• Timeseries
Video tutorials
Various tutorials
FOUR
USER GUIDE
The User Guide covers all of pandas by topic area. Each of the subsections introduces a topic (such as
“working with missing data”), and discusses how pandas approaches the problem, with many examples
throughout.
Users brand-new to pandas should start with 10min.
Further information on any specific method can be obtained in the API reference. {{ header }}
The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally
return a pandas object. The corresponding writer functions are object methods that are accessed like
DataFrame.to_csv(). Below is a table containing available readers and writers.
Note: For examples that use the StringIO class, make sure you import it according to your Python
version, i.e. from StringIO import StringIO for Python 2 and from io import StringIO for Python 3.
The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some
advanced strategies.
Parsing options
Basic
header [int or list of ints, default 'infer'] Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names are passed the behavior is identical
to header=0 and column names are inferred from the first line of the file, if column names are passed
explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace
existing names.
The header can be a list of ints that specify row locations for a MultiIndex on the columns e.g. [0,1,
3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note
that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
names [array-like, default None] List of column names to use. If file contains no header row, then you should
explicitly pass header=None. Duplicates in this list are not allowed.
index_col [int, str, sequence of int / str, or False, default None] Column(s) to use as the row labels of
the DataFrame, either given as string name or column index. If a sequence of int / str is given, a
MultiIndex is used.
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g.
when you have a malformed file with delimiters at the end of each line.
usecols [list-like or callable, default None] Return a subset of the columns. If list-like, all elements must
either be positional (i.e. integer indices into the document columns) or strings that correspond to
column names provided either by the user in names or inferred from the document header row(s). For
example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame
from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo',
'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where
the callable function evaluates to True:
In [1]: from io import StringIO, BytesIO
In [2]: data = ('col1,col2,col3\n'
...: 'a,b,1\n'
...: 'a,b,2\n'
...: 'c,d,3')
...:
In [3]: pd.read_csv(StringIO(data))
Out[3]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
dtype [Type name or dict of column -> type, default None] Data type for data or columns. E.g. {'a':
np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together with
suitable na_values settings to preserve and not interpret dtype.
New in version 0.20.0: support for the Python parser.
engine [{'c', 'python'}] Parser engine to use. The C engine is faster while the Python engine is currently
more feature-complete.
converters [dict, default None] Dict of functions for converting values in certain columns. Keys can either
be integers or column labels.
true_values [list, default None] Values to consider as True.
false_values [list, default None] Values to consider as False.
In [6]: pd.read_csv(StringIO(data))
Out[6]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
na_values [scalar, str, list-like, or dict, default None] Additional strings to recognize as NA/NaN. If dict
passed, specific per-column NA values. See na values const below for a list of the values interpreted as
NaN by default.
keep_default_na [boolean, default True] Whether or not to include the default NaN values when parsing
the data. Depending on whether na_values is passed in, the behavior is as follows:
• If keep_default_na is True, and na_values are specified, na_values is appended to the default
NaN values used for parsing.
• If keep_default_na is True, and na_values are not specified, only the default NaN values are used
for parsing.
• If keep_default_na is False, and na_values are specified, only the NaN values specified na_values
are used for parsing.
• If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be
ignored.
na_filter [boolean, default True] Detect missing value markers (empty strings and the value of na_values).
In data without any NAs, passing na_filter=False can improve the performance of reading a large
file.
verbose [boolean, default False] Indicate number of NA values placed in non-numeric columns.
skip_blank_lines [boolean, default True] If True, skip over blank lines rather than interpreting as NaN
values.
Datetime handling
parse_dates [boolean or list of ints or names or list of lists or dict, default False.]
• If True -> try parsing the index.
• If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
• If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
• If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’. A fast-path exists for
iso8601-formatted dates.
infer_datetime_format [boolean, default False] If True and parse_dates is enabled for a column, at-
tempt to infer the datetime format to speed up the processing.
keep_date_col [boolean, default False] If True and parse_dates specifies combining multiple columns
then keep the original columns.
date_parser [function, default None] Function to use for converting a sequence of string columns to an
array of datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas
will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1)
Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array and pass that; and 3) call
date_parser once for each row using one or more strings (corresponding to the columns defined by
parse_dates) as arguments.
dayfirst [boolean, default False] DD/MM format dates, international and European format.
cache_dates [boolean, default True] If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with
timezone offsets.
New in version 0.25.0.
Iteration
iterator [boolean, default False] Return TextFileReader object for iteration or getting chunks with
get_chunk().
chunksize [int, default None] Return TextFileReader object for iteration. See iterating and chunking below.
compression [{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'] For on-the-fly decompres-
sion of on-disk data. If ‘infer’, then use gzip, bz2, zip, or xz if filepath_or_buffer is a string ending
in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file
must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
Changed in version 0.24.0: ‘infer’ option added and set to default.
thousands [str, default None] Thousands separator.
decimal [str, default '.'] Character to recognize as decimal point. E.g. use ',' for European data.
float_precision [string, default None] Specifies which converter the C engine should use for floating-point
values. The options are None for the ordinary converter, high for the high-precision converter, and
round_trip for the round-trip converter.
lineterminator [str (length 1), default None] Character to break file into lines. Only valid with C parser.
quotechar [str (length 1)] The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per csv.QUOTE_* constants.
Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [boolean, default True] When quotechar is specified and quoting is not QUOTE_NONE, indicate
whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar
element.
escapechar [str (length 1), default None] One-character string used to escape delimiter when quoting is
QUOTE_NONE.
comment [str, default None] Indicates remainder of line should not be parsed. If found at the beginning of
a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines
(as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but
not by skiprows. For example, if comment='#', parsing ‘#empty\na,b,c\n1,2,3’ with header=0 will
result in ‘a,b,c’ being treated as the header.
encoding [str, default None] Encoding to use for UTF when reading/writing (e.g. 'utf-8'). List of Python
standard encodings.
dialect [str or csv.Dialect instance, default None] If provided, this parameter will override values (default
or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar,
and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect
documentation for more details.
Error handling
error_bad_lines [boolean, default True] Lines with too many fields (e.g. a csv line with too many
commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False,
then these “bad lines” will be dropped from the DataFrame that is returned. See bad lines below.
warn_bad_lines [boolean, default True] If error_bad_lines is False, and warn_bad_lines is True, a
warning for each “bad line” will be output.
You can indicate the data type for the whole DataFrame or individual columns:
In [8]: data = ('a,b,c,d\n'
...: '1,2,3,4\n'
...: '5,6,7,8\n'
...: '9,10,11')
...:
In [9]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11
In [10]: df = pd.read_csv(StringIO(data), dtype=object)
In [11]: df
Out[11]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NaN
In [12]: df['a'][0]
Out[12]: '1'
In [13]: df = pd.read_csv(StringIO(data),
....: dtype={'b': object, 'c': np.float64, 'd': 'Int64'})
....:
In [14]: df.dtypes
Out[14]:
a int64
b object
c float64
d Int64
dtype: object
Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype. If
you’re unfamiliar with these concepts, you can see here to learn more about dtypes, and here to learn more
about object conversion in pandas.
For instance, you can use the converters argument of read_csv():
In [15]: data = ("col_1\n"
....: "1\n"
....: "2\n"
....: "'A'\n"
....: "4.22")
....:
In [16]: df = pd.read_csv(StringIO(data), converters={'col_1': str})
In [17]: df
Out[17]:
col_1
0 1
1 2
2 'A'
3 4.22
In [18]: df['col_1'].apply(type).value_counts()
Out[18]:
<class 'str'> 4
Name: col_1, dtype: int64
Or you can use the to_numeric() function to coerce the dtypes after reading in the data,
In [19]: df2 = pd.read_csv(StringIO(data))
In [20]: df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
In [21]: df2
Out[21]:
col_1
0 1.00
1 2.00
2 NaN
3 4.22
In [22]: df2['col_1'].apply(type).value_counts()
Out[22]:
<class 'float'> 4
Name: col_1, dtype: int64
which will convert all valid parsing to floats, leaving the invalid parsing as NaN.
Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs.
In the case above, if you wanted to NaN out the data anomalies, then to_numeric() is probably your best
option. However, if you wanted all of the data to be coerced, no matter the type, then using the converters
argument of read_csv() would certainly be worth trying.
New in version 0.20.0: support for the Python parser.
The dtype option is supported by the ‘python’ engine.
Note: In some cases, reading in abnormal data with columns containing mixed dtypes will result in an
inconsistent dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will go
and infer the dtypes for different chunks of the data, rather than the whole dataset at once. Consequently,
you can end up with column(s) with mixed dtypes. For example,
In [23]: col_1 = list(range(500000)) + ['a', 'b'] + list(range(500000))
In [24]: df = pd.DataFrame({'col_1': col_1})
In [25]: df.to_csv('foo.csv')
In [26]: mixed_df = pd.read_csv('foo.csv')
In [27]: mixed_df['col_1'].apply(type).value_counts()
Out[27]:
In [28]: mixed_df['col_1'].dtype
Out[28]: dtype('O')
will result in mixed_df containing an int dtype for certain chunks of the column, and str for others due
to the mixed dtypes from the data that was read in. It is important to note that the overall column will be
marked with a dtype of object, which is used for columns with mixed dtypes.
In [30]: pd.read_csv(StringIO(data))
Out[30]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
In [31]: pd.read_csv(StringIO(data)).dtypes
Out[31]:
col1 object
col2 object
col3 int64
dtype: object
Note: With dtype='category', the resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the to_numeric() function, or as appropriate,
another converter such as to_datetime().
When dtype is a CategoricalDtype with homogeneous categories (all numeric, all datetimes, etc.), the
conversion is done automatically.
In [39]: df = pd.read_csv(StringIO(data), dtype='category')
In [40]: df.dtypes
Out[40]:
col1 category
col2 category
col3 category
dtype: object
In [41]: df['col3']
Out[41]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, object): [1, 2, 3]
In [42]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
In [43]: df['col3']
Out[43]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]
A file may or may not have a header row. pandas assumes the first row should be used as the column names:
In [44]: data = ('a,b,c\n'
....: '1,2,3\n'
....: '4,5,6\n'
....: '7,8,9')
....:
In [45]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9
In [46]: pd.read_csv(StringIO(data))
Out[46]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
By specifying the names argument in conjunction with header you can indicate other names to use and
whether or not to throw away the header row (if any):
In [47]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9
In [49]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=None)
Out[49]:
foo bar baz
0 a b c
1 1 2 3
2 4 5 6
3 7 8 9
If the header is in a row other than the first, pass the row number to header. This will skip the preceding
rows:
Note: Default behavior is to infer the column names: if no names are passed the behavior is identical to
header=0 and column names are inferred from the first non-blank line of the file, if column names are passed
explicitly then the behavior is identical to header=None.
If the file or header contains duplicate names, pandas will by default distinguish between them so as to
prevent overwriting data:
In [53]: pd.read_csv(StringIO(data))
Out[53]:
a b a.1
0 0 1 2
1 3 4 5
There is no more duplicate data because mangle_dupe_cols=True by default, which modifies a series of
duplicate columns ‘X’, …, ‘X’ to become ‘X’, ‘X.1’, …, ‘X.N’. If mangle_dupe_cols=False, duplicate data
can arise:
To prevent users from encountering this problem with duplicate data, a ValueError exception is raised if
mangle_dupe_cols != True:
The usecols argument allows you to select any subset of the columns in a file, either using the column
names, position numbers or a callable:
New in version 0.20.0: support for callable usecols arguments
In [54]: data = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'
In [55]: pd.read_csv(StringIO(data))
Out[55]:
a b c d
0 1 2 3 foo
1 4 5 6 bar
2 7 8 9 baz
In this case, the callable is specifying that we exclude the “a” and “c” columns from the output.
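For example, excluding those two columns with a callable:
pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c'])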
If the comment parameter is specified, then completely commented lines will be ignored. By default, com-
pletely blank lines will be ignored as well.
In [60]: data = ('\n'
....: 'a,b,c\n'
....: ' \n'
....: '# commented line\n'
....: '1,2,3\n'
....: '\n'
....: '4,5,6')
....:
In [61]: print(data)
a,b,c
# commented line
1,2,3
4,5,6
Warning: The presence of ignored lines might create ambiguities involving line numbers; the param-
eter header uses row numbers (ignoring commented/empty lines), while skiprows uses line numbers
(including commented/empty lines):
In [65]: data = ('#comment\n'
....: 'a,b,c\n'
....: 'A,B,C\n'
....: '1,2,3')
....:
If both header and skiprows are specified, header will be relative to the end of skiprows. For example:
In [70]: print(data)
# empty
# second empty line
# third emptyline
X,Y,Z
1,2,3
A,B,C
1,2.,4.
5.,NaN,10.0
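A sketch combining both options on the data above: skiprows=4 drops the first four physical lines (the three comment lines plus X,Y,Z), and header=1 then takes the second remaining line (A,B,C) as the header, discarding what precedes it:
pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)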
Comments
In [72]: print(open('tmp.csv').read())
ID,level,category
Patient1,123000,x # really unpleasant
Patient2,23000,y # wouldn't take his medicine
Patient3,1234018,z # awesome
In [73]: df = pd.read_csv('tmp.csv')
In [74]: df
Out[74]:
ID level category
0 Patient1 123000 x # really unpleasant
1 Patient2 23000 y # wouldn't take his medicine
2 Patient3 1234018 z # awesome
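The cleaned-up frame shown next comes from passing the comment character, which drops everything on a line after it; a sketch:
df = pd.read_csv('tmp.csv', comment='#')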
In [76]: df
Out[76]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z
The encoding argument should be used for encoded unicode data, which will result in byte strings being
decoded to unicode in the result:
In [77]: data = (b'word,length\n'
....: b'Tr\xc3\xa4umen,7\n'
....: b'Gr\xc3\xbc\xc3\x9fe,5')
....:
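A sketch of reading those UTF-8 encoded bytes (BytesIO is used because data is a bytes object):
from io import BytesIO
df = pd.read_csv(BytesIO(data), encoding='utf-8')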
In [80]: df
Out[80]:
word length
0 Träumen 7
1 Grüße 5
In [81]: df['word'][1]
Out[81]: 'Grüße'
Some formats which encode all characters as multiple bytes, like UTF-16, won’t parse correctly at all without
specifying the encoding. Full list of Python standard encodings.
If a file has one more column of data than the number of column names, the first column will be used as the
DataFrame’s row names:
In [83]: pd.read_csv(StringIO(data))
Out[83]:
a b c
4 apple bat 5.7
8 orange cow 10.0
Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line,
confusing the parser. To explicitly disable the index column inference and discard the last column, pass
index_col=False:
In [86]: data = ('a,b,c\n'
....: '4,apple,bat,\n'
....: '8,orange,cow,')
....:
In [87]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,
In [88]: pd.read_csv(StringIO(data))
Out[88]:
a b c
4 apple bat NaN
8 orange cow NaN
In [91]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,
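Passing index_col=False then keeps a default integer index, uses the three names as columns, and discards the trailing empty field; a sketch:
pd.read_csv(StringIO(data), index_col=False)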
Date Handling
To better facilitate working with datetime data, read_csv() uses the keyword arguments parse_dates and
date_parser to allow users to specify a variety of columns and date/time formats to turn the input text
data into datetime objects.
The simplest case is to just pass in parse_dates=True:
# Use a column as an index, and parse it as dates.
In [94]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True)
In [95]: df
Out[95]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5
It is often the case that we may want to store date and time data separately, or store various date fields
separately. The parse_dates keyword can be used to specify a combination of columns to parse the dates
and/or times from.
You can specify a list of column lists to parse_dates, the resulting date columns will be prepended to the
output (so as to not affect the existing column order) and the new column names will be the concatenation
of the component column names:
In [97]: print(open('tmp.csv').read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900
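A call along these lines produces the frame shown below; column positions 1 and 2, and 1 and 3, are each combined into a single datetime column:
df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]])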
In [99]: df
Out[99]:
1_2 1_3 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
By default the parser removes the component date columns, but you can choose to retain them via the
keep_date_col keyword:
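A sketch of the same read with the component columns retained:
df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]],
                 keep_date_col=True)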
In [101]: df
Out[101]:
1_2 1_3 0 1 2 3 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 19990127 19:00:00 18:56:00 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 19990127 20:00:00 19:56:00 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD 19990127 21:00:00 20:56:00 -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD 19990127 21:00:00 21:18:00 -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD 19990127 22:00:00 21:56:00 -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD 19990127 23:00:00 22:56:00 -0.59
Note that if you wish to combine multiple columns into a single date column, a nested list must be used.
In other words, parse_dates=[1, 2] indicates that the second and third columns should each be parsed as
separate date columns while parse_dates=[[1, 2]] means the two columns should be parsed into a single
column.
You can also use a dict to specify custom name columns:
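A sketch using a dict that maps new column names to the component columns:
date_spec = {'nominal': [1, 2], 'actual': [1, 3]}
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec)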
In [104]: df
Out[104]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
It is important to remember that if multiple text columns are to be parsed into a single date column, then
a new column is prepended to the data. The index_col specification is based off of this new set of columns
rather than the original data columns:
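For example, index_col=0 below refers to the prepended 'nominal' column, not to the first column of the original file; a sketch:
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec, index_col=0)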
In [107]: df
Out[107]:
actual 0 4
nominal
1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
Note: If a column or index contains an unparsable date, the entire column or index will be returned
unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.
Note: read_csv has a fast path for parsing datetime strings in ISO 8601 format, e.g. "2000-01-
01T00:01:02+00:00" and similar variations. If you can arrange for your data to store datetimes in this
format, load times will be significantly faster; speedups of ~20x have been observed.
Note: When passing a dict as the parse_dates argument, the order of the columns prepended is not
guaranteed, because dict objects do not impose an ordering on their keys. On Python 2.7+ you may use
collections.OrderedDict instead of a regular dict if this matters to you. Because of this, when using a dict for
‘parse_dates’ in conjunction with the index_col argument, it’s best to specify index_col as a column label
rather than as an index on the resulting frame.
Finally, the parser allows you to specify a custom date_parser function to take full advantage of the
flexibility of the date parsing API:
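A sketch using one of the parsers shipped in date_converters.py (the module and function below take the date and time columns as separate array arguments):
import pandas.io.date_converters as conv
df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
                 date_parser=conv.parse_date_time)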
In [109]: df
Out[109]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
Pandas will try to call the date_parser function in three different ways. If an exception is raised, the next
one is tried:
1. date_parser is first called with one or more arrays as arguments, as defined using parse_dates (e.g.,
date_parser(['2013', '2013'], ['1', '2'])).
2. If #1 fails, date_parser is called with all the columns concatenated row-wise into a single array (e.g.,
date_parser(['2013 1', '2013 2'])).
3. If #2 fails, date_parser is called once for every row with one or more string arguments from
the columns indicated with parse_dates (e.g., date_parser('2013', '1') for the first row,
date_parser('2013', '2') for the second, etc.).
Note that performance-wise, you should try these methods of parsing dates in order:
1. Try to infer the format using infer_datetime_format=True (see section below).
2. If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x,
format=...).
3. If you have a really non-standard format, use a custom date_parser function. For optimal perfor-
mance, this should be vectorized, i.e., it should accept arrays as arguments.
You can explore the date parsing functionality in date_converters.py and add your own. We would
love to turn this module into a community supported set of date/time parsers. To get you started,
date_converters.py contains functions to parse dual date and time columns, year/month/day columns,
and year/month/day/hour/minute/second columns. It also contains a generic_parser function so you can
curry it with a function that deals with a single date rather than the entire array.
Pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains
columns with a mixture of timezones, the default result will be an object-dtype column with strings, even
with parse_dates.
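A minimal sketch of data that triggers this behavior:
content = """a
2000-01-01T00:00:00+05:00
2000-01-01T00:00:00+06:00"""
df = pd.read_csv(StringIO(content), parse_dates=['a'])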
In [112]: df['a']
Out[112]:
0 2000-01-01 00:00:00+05:00
1 2000-01-01 00:00:00+06:00
Name: a, dtype: object
To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with
utc=True as the date_parser.
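A sketch, reusing the content string from above:
df = pd.read_csv(StringIO(content), parse_dates=['a'],
                 date_parser=lambda col: pd.to_datetime(col, utc=True))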
In [114]: df['a']
Out[114]:
0 1999-12-31 19:00:00+00:00
1 1999-12-31 18:00:00+00:00
Name: a, dtype: datetime64[ns, UTC]
If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted
the same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will
attempt to guess the format of your datetime strings, and then use a faster means of parsing the strings.
5-10x parsing speeds have been observed. pandas will fallback to the usual parsing if either the format
cannot be guessed or the format that was guessed cannot properly parse the entire column of strings. So in
general, infer_datetime_format should not have any negative consequences if enabled.
Here are some examples of datetime strings that can be guessed (All representing December 30th, 2011 at
00:00:00):
• “20111230”
• “2011/12/30”
• “20111230 00:00:00”
• “12/30/2011 00:00:00”
• “30/Dec/2011 00:00:00”
• “30/December/2011 00:00:00”
In [116]: df
Out[116]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5
While US date formats tend to be MM/DD/YYYY, many international formats use DD/MM/YYYY instead.
For convenience, a dayfirst keyword is provided:
In [117]: print(open('tmp.csv').read())
date,value,cat
1/6/2000,5,a
2/6/2000,10,b
3/6/2000,15,c
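A sketch of both readings of this file: without dayfirst the dates are interpreted month-first (January 6th, and so on); with dayfirst=True they are interpreted day-first (June 1st, 2nd, 3rd):
pd.read_csv('tmp.csv', parse_dates=[0])
pd.read_csv('tmp.csv', dayfirst=True, parse_dates=[0])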
The parameter float_precision can be specified in order to use a specific floating-point converter during
parsing with the C engine. The options are the ordinary converter, the high-precision converter, and the
round-trip converter (which is guaranteed to round-trip values after writing to a file). For example:
In [120]: val = '0.3066101993807095471566981359501369297504425048828125'
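A sketch comparing the three converters by measuring the round-trip error on that value (the single-column data string is an illustration):
data = 'a\n' + val
abs(pd.read_csv(StringIO(data), float_precision=None)['a'][0] - float(val))
abs(pd.read_csv(StringIO(data), float_precision='high')['a'][0] - float(val))
abs(pd.read_csv(StringIO(data), float_precision='round_trip')['a'][0] - float(val))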
Thousand separators
For large numbers that have been written with a thousands separator, you can set the thousands keyword
to a string of length 1 so that integers will be parsed correctly:
By default, numbers with a thousands separator will be parsed as strings:
In [125]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z
In [127]: df
Out[127]:
ID level category
0 Patient1 123,000 x
1 Patient2 23,000 y
2 Patient3 1,234,018 z
In [128]: df.level.dtype
Out[128]: dtype('O')
The thousands keyword allows integers to be parsed correctly:
In [129]: print(open('tmp.csv').read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z
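A sketch of the read that produces the integer column below:
df = pd.read_csv('tmp.csv', sep='|', thousands=',')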
In [131]: df
Out[131]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z
In [132]: df.level.dtype
Out[132]: dtype('int64')
NA values
To control which values are parsed as missing values (which are signified by NaN), specify a string in
na_values. If you specify a list of strings, then all values in it are considered to be missing values. If
you specify a number (a float, like 5.0 or an integer like 5), the corresponding equivalent values will also
imply a missing value (in this case effectively [5.0, 5] are recognized as NaN).
To completely override the default values that are recognized as missing, specify keep_default_na=False.
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A',
'#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
Let us consider some examples:
pd.read_csv('path_to_file.csv', na_values=[5])
In the example above 5 and 5.0 will be recognized as NaN, in addition to the defaults. A string will first be
interpreted as a numerical 5, then as a NaN.
pd.read_csv('path_to_file.csv', na_values=["Nope"])
The default values, in addition to the string "Nope", are recognized as NaN.
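To keep only your own sentinels, combine keep_default_na=False with na_values; a sketch:
# only an empty field will be recognized as NaN
pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=[""])
# only the strings "NA" and "0" will be recognized as NaN
pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=["NA", "0"])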
Infinity
inf like values will be parsed as np.inf (positive infinity), and -inf as -np.inf (negative infinity). These
will ignore the case of the value, meaning Inf will also be parsed as np.inf.
Returning Series
Using the squeeze keyword, the parser will return output with a single column as a Series:
In [133]: print(open('tmp.csv').read())
level
Patient1,123000
Patient2,23000
Patient3,1234018
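A sketch of the read that returns the Series below:
output = pd.read_csv('tmp.csv', squeeze=True)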
In [135]: output
Out[135]:
Patient1 123000
Patient2 23000
Patient3 1234018
Name: level, dtype: int64
In [136]: type(output)
Out[136]: pandas.core.series.Series
Boolean values
The common values True, False, TRUE, and FALSE are all recognized as boolean. Occasionally you might
want to recognize other values as being boolean. To do this, use the true_values and false_values options
as follows:
In [137]: data = ('a,b,c\n'
.....: '1,Yes,2\n'
.....: '3,No,4')
.....:
In [138]: print(data)
a,b,c
1,Yes,2
3,No,4
In [139]: pd.read_csv(StringIO(data))
Out[139]:
a b c
0 1 Yes 2
1 3 No 4
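Mapping the Yes/No column to booleans then looks like this sketch:
pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])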
Some files may have malformed lines with too few fields or too many. Lines with too few fields will have NA
values filled in the trailing fields. Lines with too many fields will raise an error by default:
In [142]: pd.read_csv(StringIO(data))
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-142-6388c394e6b8> in <module>
----> 1 pd.read_csv(StringIO(data))
683 )
684
--> 685 return _read(filepath_or_buffer, kwds)
686
687 parser_f.__name__ = name
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
~/sandbox/pandas-doc/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
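A sketch that reproduces the failure and then skips the offending line instead (the data string here is an assumption; error_bad_lines=False warns about and drops lines with too many fields, producing the frame below):
data = 'a,b,c\n1,2,3\n4,5,6,7\n8,9,10'
pd.read_csv(StringIO(data), error_bad_lines=False)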
Out[29]:
a b c
0 1 2 3
1 8 9 10
You can also use the usecols parameter to eliminate extraneous column data that appear in some lines but
not others:
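A sketch, using the same hypothetical data, that keeps only the first three fields of every line:
pd.read_csv(StringIO(data), usecols=[0, 1, 2])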
Out[30]:
a b c
0 1 2 3
1 4 5 6
2 8 9 10
Dialect
The dialect keyword gives greater flexibility in specifying the file format. By default it uses the Excel
dialect but you can specify either the dialect name or a csv.Dialect instance.
Suppose you had data with unenclosed quotes:
In [143]: print(data)
label1,label2,label3
index1,"a,c,e
index2,b,d,f
By default, read_csv uses the Excel dialect and treats the double quote as the quote character, which causes
it to fail when it finds a newline before it finds the closing double quote.
We can get around this using dialect:
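A sketch: build a csv.Dialect based on the Excel dialect, disable quoting, and pass it in:
import csv
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE
pd.read_csv(StringIO(data), dialect=dia)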
Another common dialect option is skipinitialspace, to skip any whitespace after a delimiter:
In [150]: data = 'a, b, c\n1, 2, 3\n4, 5, 6'
In [151]: print(data)
a, b, c
1, 2, 3
4, 5, 6
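A sketch of the corresponding read:
pd.read_csv(StringIO(data), skipinitialspace=True)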
Quotes (and other escape characters) in embedded fields can be handled in any number of ways. One way
is to use backslashes; to properly parse this data, you should pass the escapechar option:
In [153]: data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'
In [154]: print(data)
a,b
"hello, \"Bob\", nice to see you",5
While read_csv() reads delimited data, the read_fwf() function works with data files that have known
and fixed column widths. The function parameters to read_fwf are largely the same as read_csv with two
extra parameters, and a different usage of the delimiter parameter:
• colspecs: A list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open
intervals (i.e., [from, to)). String value ‘infer’ can be used to instruct the parser to try detecting the
column specifications from the first 100 rows of the data. Default behavior, if not specified, is to infer.
• widths: A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.
• delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify
the filler character of the fields if it is not spaces (e.g., ‘~’).
In [156]: print(open('bar.csv').read())
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
In order to parse this file into a DataFrame, we simply need to supply the column specifications to the
read_fwf function along with the file name:
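A sketch of column specifications matching the file above (the exact extents are an estimate from the printed layout):
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df = pd.read_fwf('bar.csv', colspecs=colspecs, header=None, index_col=0)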
In [159]: df
Out[159]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
Note how the parser automatically picks column names X.<column number> when the header=None argument
is specified. Alternatively, you can supply just the column widths for contiguous columns:
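A sketch using widths instead of colspecs:
widths = [6, 14, 13, 10]
df = pd.read_fwf('bar.csv', widths=widths, header=None)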
In [162]: df
Out[162]:
0 1 2 3
0 id8141 360.242940 149.910199 11950.7
1 id1594 444.953632 166.985655 11788.4
2 id1849 364.136849 183.628767 11806.2
3 id1230 413.836124 184.375703 11916.8
4 id1948 502.953953 173.237159 12468.3
The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between
the columns in the file.
By default, read_fwf will try to infer the file’s colspecs by using the first 100 rows of the file. It can do
it only in cases when the columns are aligned and correctly separated by the provided delimiter (default
delimiter is whitespace).
In [164]: df
Indexes
Consider a file with one less entry in the header than the number of data columns:
In [167]: print(open('foo.csv').read())
A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame:
In [168]: pd.read_csv('foo.csv')
Out[168]:
A B C
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5
Note that the dates weren’t automatically parsed. In that case you would need to do as before:
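For example:
df = pd.read_csv('foo.csv', parse_dates=True, index_col=0)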
In [170]: df.index
Out[170]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', freq=None)
In [171]: print(open('data/mindex_ex.csv').read())
year,indiv,zit,xit
1977,"A",1.2,.6
1977,"B",1.5,.5
1977,"C",1.7,.8
1978,"A",.2,.06
1978,"B",.7,.2
1978,"C",.8,.3
1978,"D",.9,.5
1978,"E",1.4,.9
1979,"C",.2,.15
1979,"D",.14,.05
1979,"E",.5,.15
1979,"F",1.2,.5
1979,"G",3.4,1.9
1979,"H",5.4,2.7
1979,"I",6.4,1.2
The index_col argument to read_csv can take a list of column numbers to turn multiple columns into a
MultiIndex for the index of the returned object:
In [172]: df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])
In [173]: df
Out[173]:
zit xit
year indiv
1977 A 1.20 0.60
B 1.50 0.50
C 1.70 0.80
1978 A 0.20 0.06
B 0.70 0.20
C 0.80 0.30
D 0.90 0.50
E 1.40 0.90
1979 C 0.20 0.15
D 0.14 0.05
E 0.50 0.15
F 1.20 0.50
G 3.40 1.90
H 5.40 2.70
I 6.40 1.20
In [174]: df.loc[1978]
Out[174]:
zit xit
indiv
A 0.2 0.06
B 0.7 0.20
C 0.8 0.30
D 0.9 0.50
E 1.4 0.90
By specifying a list of row locations for the header argument, you can read in a MultiIndex for the columns.
Specifying non-consecutive rows will skip the intervening rows.
In [175]: from pandas.util.testing import makeCustomDataframe as mkdf
In [177]: df.to_csv('mi.csv')
In [178]: print(open('mi.csv').read())
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2
read_csv is capable of inferring delimited (not necessarily comma-separated) files, as pandas uses the
csv.Sniffer class of the csv module. For this, you have to specify sep=None.
In [182]: print(open('tmp2.sv').read())
:0:1:2:3
0:1.1214905765122583:-1.1011663421613171:-1.2725711408453018:0.8434589457722285
1:0.8739661419816901:-1.1622548707272122:0.12618578996106738:0.5057848504967111
2:0.6695152369722812:0.4833977900441433:-0.4383565886430891:-0.13952146077085656
3:1.6678766138462109:0.906356209978661:0.8603041052486606:-0.009413710135323125
4:-0.8075485015292924:-0.7848128653629299:-1.3155155066668116:0.6875244729698119
5:-0.1572352664979729:0.30339976035788174:-0.36340691002502046:-0.5526511482544121
6:0.41442095212262187:0.17517103850750262:-0.5295157789486404:-0.06745694327155764
7:1.058814717443789:-0.11789792502832808:-1.8534207864364352:-0.7018494437516053
8:0.26239634172416604:-1.7245959745828128:0.2765803759042711:1.0730241342647273
9:0.6352851164219758:-2.1785482358583024:0.3120437647651685:1.5723784501068536
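A sketch of the sniffing read (sep=None requires the Python engine):
pd.read_csv('tmp2.sv', sep=None, engine='python')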
It’s best to use concat() to combine multiple files. See the cookbook for an example.
Suppose you wish to iterate through a (potentially very large) file lazily rather than reading the entire file
into memory, such as the following:
In [184]: print(open('tmp.sv').read())
|0|1|2|3
0|1.1214905765122583|-1.1011663421613171|-1.2725711408453018|0.8434589457722285
1|0.8739661419816901|-1.1622548707272122|0.12618578996106738|0.5057848504967111
2|0.6695152369722812|0.4833977900441433|-0.4383565886430891|-0.13952146077085656
3|1.6678766138462109|0.906356209978661|0.8603041052486606|-0.009413710135323125
4|-0.8075485015292924|-0.7848128653629299|-1.3155155066668116|0.6875244729698119
5|-0.1572352664979729|0.30339976035788174|-0.36340691002502046|-0.5526511482544121
6|0.41442095212262187|0.17517103850750262|-0.5295157789486404|-0.06745694327155764
7|1.058814717443789|-0.11789792502832808|-1.8534207864364352|-0.7018494437516053
8|0.26239634172416604|-1.7245959745828128|0.2765803759042711|1.0730241342647273
9|0.6352851164219758|-2.1785482358583024|0.3120437647651685|1.5723784501068536
In [186]: table
Out[186]:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
4 4 -0.807549 -0.784813 -1.315516 0.687524
5 5 -0.157235 0.303400 -0.363407 -0.552651
6 6 0.414421 0.175171 -0.529516 -0.067457
7 7 1.058815 -0.117898 -1.853421 -0.701849
8 8 0.262396 -1.724596 0.276580 1.073024
9 9 0.635285 -2.178548 0.312044 1.572378
By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader:
In [187]: reader = pd.read_csv('tmp.sv', sep='|', chunksize=4)
In [188]: reader
Out[188]: <pandas.io.parsers.TextFileReader at 0x13eb31a90>
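The reader can then be iterated chunk by chunk; passing iterator=True instead also returns a TextFileReader, on which get_chunk() can be called as shown below (a sketch):
for chunk in reader:
    print(chunk)

reader = pd.read_csv('tmp.sv', sep='|', iterator=True)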
In [191]: reader.get_chunk(5)
Out[191]:
Unnamed: 0 0 1 2 3
0 0 1.121491 -1.101166 -1.272571 0.843459
1 1 0.873966 -1.162255 0.126186 0.505785
2 2 0.669515 0.483398 -0.438357 -0.139521
3 3 1.667877 0.906356 0.860304 -0.009414
4 4 -0.807549 -0.784813 -1.315516 0.687524
Under the hood pandas uses a fast and efficient parser implemented in C as well as a Python implementation
which is currently more feature-complete. Where possible pandas uses the C parser (specified as engine='c'),
but may fall back to Python if C-unsupported options are specified. Currently, C-unsupported options
include:
• sep other than a single character (e.g. regex separators)
• skipfooter
• sep=None with delim_whitespace=False
Specifying any of the above options will produce a ParserWarning unless the python engine is selected
explicitly using engine='python'.
df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',
sep='\t')
S3 URLs are handled as well but require installing the S3Fs library:
df = pd.read_csv('s3://pandas-test/tips.csv')
If your S3 bucket requires credentials you will need to set them as environment variables or in the ~/.aws/
credentials config file, refer to the S3Fs documentation on credentials.
The Series and DataFrame objects have an instance method to_csv which allows storing the contents of
the object as a comma-separated-values file. The function takes a number of arguments. Only the first is
required.
• path_or_buf: A string path to the file to write or a file object. If a file object it must be opened with
newline=’’
• sep : Field delimiter for the output file (default “,”)
The DataFrame object has an instance method to_string which allows control over the string representation
of the object. All arguments are optional:
• buf default None, for example a StringIO object
• columns default None, which columns to write
• col_space default None, minimum width of each column.
• na_rep default NaN, representation of NA value
• formatters default None, a dictionary (by column) of functions each of which takes a single argument
and returns a formatted string
• float_format default None, a function which takes a single (float) argument and returns a formatted
string; to be applied to floats in the DataFrame.
• sparsify default True, set to False for a DataFrame with a hierarchical index to print every MultiIndex
key at each row.
• index_names default True, will print the names of the indices
• index default True, will print the index (ie, row labels)
• header default True, will print the column labels
• justify default left, will print column headers left- or right-justified
The Series object also has a to_string method, but with only the buf, na_rep, float_format arguments.
There is also a length argument which, if set to True, will additionally output the length of the Series.
4.1.2 JSON
Writing JSON
A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:
• path_or_buf : the pathname or buffer to write the output. This can be None, in which case a JSON
string is returned.
• orient :
Series:
– default is index
– allowed values are {split, records, index}
DataFrame:
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string
split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, … , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array
• date_format : string, type of date conversion, ‘epoch’ for timestamp, ‘iso’ for ISO8601.
• double_precision : The number of decimal places to use when encoding floating point values, default
10.
• force_ascii : force encoded string to be ASCII, default True.
• date_unit : The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’,
‘us’ or ‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’.
• default_handler : The handler to call if an object cannot otherwise be converted to a suitable format
for JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
• lines : If records orient, then each record will be written on its own line as JSON.
Note: NaN, NaT and None will be converted to null, and datetime objects will be converted based on the
date_format and date_unit parameters.
Orient options
There are a number of different options for the format of the resulting JSON file / string. Consider the
following DataFrame and Series:
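A sketch of objects matching the reprs below:
dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
                    columns=list('ABC'), index=list('xyz'))
sjo = pd.Series(dict(x=15, y=16, z=17), name='D')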
In [196]: dfjo
Out[196]:
A B C
x 1 4 7
y 2 5 8
z 3 6 9
In [198]: sjo
Out[198]:
x 15
y 16
z 17
Name: D, dtype: int64
Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column
labels acting as the primary index:
In [199]: dfjo.to_json(orient="columns")
Out[199]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'
Index oriented (the default for Series) similar to column oriented but the index labels are now primary:
In [200]: dfjo.to_json(orient="index")
Out[200]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'
In [201]: sjo.to_json(orient="index")
Out[201]: '{"x":15,"y":16,"z":17}'
Record oriented serializes the data to a JSON array of column -> value records, index labels are not
included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library
d3.js:
In [202]: dfjo.to_json(orient="records")
Out[202]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'
In [203]: sjo.to_json(orient="records")
Out[203]: '[15,16,17]'
Value oriented is a bare-bones option which serializes to nested JSON arrays of values only, column and
index labels are not included:
In [204]: dfjo.to_json(orient="values")
Out[204]: '[[1,4,7],[2,5,8],[3,6,9]]'
Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name
is also included for Series:
In [205]: dfjo.to_json(orient="split")
Out[205]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'
In [206]: sjo.to_json(orient="split")
Out[206]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'
Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including
but not limited to dtypes and index names.
Note: Any orient option that encodes to a JSON object will not preserve the ordering of index and column
labels during round-trip serialization. If you wish to preserve label ordering use the split option as it uses
ordered containers.
Date handling
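A sketch of the calls behind the three outputs below, in order: ISO-formatted dates, ISO dates with microsecond precision, and epoch timestamps in seconds (the DataFrame construction is an assumption, and the random values in columns A and B will differ from run to run):
dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
dfd['date'] = pd.Timestamp('20130101')
dfd = dfd.sort_index(axis=1, ascending=False)
json = dfd.to_json(date_format='iso')                   # ISO dates
json = dfd.to_json(date_format='iso', date_unit='us')   # ISO dates, microseconds
json = dfd.to_json(date_format='epoch', date_unit='s')  # epoch timestamps, seconds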
In [211]: json
Out[211]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,"3":1.5730764249,"4":0.6200376615}}'
In [213]: json
Out[213]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,"3":1.5730764249,"4":0.6200376615}}'
In [215]: json
Out[215]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":0.3903383957,"1":-0.5223681486,"2":2.0249145293,"3":2.1144885256,"4":0.5337588359},"A":{"0":-1.0954121534,"1":-0.147141856,"2":0.6305826658,"3":1.5730764249,"4":0.6200376615}}'
In [221]: dfj2.to_json('test.json')
,→"1356998400000":0.009310115,"1357084800000":-1.2591311739,"1357171200000":1.7549089729,
,→"1357257600000":0.9464922966,"1357344000000":-0.5276761509},"date":{"1356998400000
,→":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,
,→"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,
,→"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{
,→"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,
,→"1357344000000":true}}
Fallback behavior
If the JSON serializer cannot handle the container contents directly it will fall back in the following manner:
• if the dtype is unsupported (e.g. np.complex) then the default_handler, if provided, will be called
for each value, otherwise an exception is raised.
• if an object is unsupported it will attempt the following:
– check if the object has defined a toDict method and call it. A toDict method should return a
dict which will then be JSON serialized.
– invoke the default_handler if one was provided.
– convert the object to a dict by traversing its contents. However this will often fail with an
OverflowError or give unexpected results.
In general the best approach for unsupported objects or dtypes is to provide a default_handler. For
example:
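A sketch, converting otherwise unsupported complex values to strings:
dfc = pd.DataFrame([1.0, 2.0, complex(1, 2)])
dfc.to_json(default_handler=str)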
Reading JSON
Reading a JSON string to a pandas object can take a number of parameters. The parser will try to parse a
DataFrame if typ is not supplied or is None. To explicitly force Series parsing, pass typ='series'.
• filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be a URL.
Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a
local file could be file://localhost/path/to/table.json
• typ : type of object to recover (series or frame), default ‘frame’
• orient :
Series :
– default is index
– allowed values are {split, records, index}
DataFrame
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string
split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, … , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array
table adhering to the JSON Table Schema
• dtype : if True, infer dtypes, if a dict of column to dtype, then use those, if False, then don’t infer
dtypes at all, default is True, apply only to the data.
• convert_axes : boolean, try to convert the axes to the proper dtypes, default is True
• convert_dates : a list of columns to parse for dates; If True, then try to parse date-like columns,
default is True.
• keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like
columns.
• numpy : direct decoding to NumPy arrays. default is False; Supports numeric data only, although
labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if
numpy=True.
• precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when
decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
• date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the
timestamp precision will be detected, if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force
timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.
• lines : reads file as one json object per line.
• encoding : The encoding to use to decode py3 bytes.
• chunksize : when used in combination with lines=True, return a JsonReader which reads in
chunksize lines per iteration.
The parser will raise one of ValueError/TypeError/AssertionError if the JSON is not parseable.
If a non-default orient was used when encoding to JSON be sure to pass the same option here so that
decoding produces sensible results, see Orient Options for an overview.
Data conversion
The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and
all of the data into appropriate types, including dates. If you need to override specific dtypes, pass a dict to
dtype. convert_axes should only be set to False if you need to preserve string-like numbers (e.g. ‘1’, ‘2’)
in an axes.
Note: Large integer values may be converted to dates if convert_dates=True and the data and / or
column labels appear ‘date-like’. The exact threshold depends on the date_unit specified. ‘date-like’ means
that the column label meets one of the following criteria:
• it ends with '_at'
• it ends with '_time'
• it begins with 'timestamp'
• it is 'modified'
• it is 'date'
Warning: When reading JSON data, automatic coercing into dtypes has some quirks:
• an index can be reconstructed in a different order from serialization, that is, the returned order is
not guaranteed to be the same as before serialization
• a column that was float data will be converted to integer if it can be done safely, e.g. a column of
1.0 values
• bool columns will be converted to integer on reconstruction
Thus there are times where you may want to specify specific dtypes via the dtype keyword argument.
In [224]: pd.read_json(json)
Out[224]:
date B A
In [225]: pd.read_json('test.json')
Out[225]:
A B date ints bools
2013-01-01 2.002793 0.009310 2013-01-01 0 True
2013-01-02 -1.128420 -1.259131 2013-01-01 1 True
2013-01-03 -0.267123 1.754909 2013-01-01 2 True
2013-01-04 0.059081 0.946492 2013-01-01 3 True
2013-01-05 0.212602 -0.527676 2013-01-01 4 True
Don’t convert any data (but still convert axes and dates):
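A sketch:
pd.read_json('test.json', dtype=object).dtypes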
In [229]: si
Out[229]:
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
In [230]: si.index
In [231]: si.columns
Out[231]: Int64Index([0, 1, 2, 3], dtype='int64')
In [234]: sij
Out[234]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
In [235]: sij.index
Out[235]: Index(['0', '1', '2', '3'], dtype='object')
In [236]: sij.columns
Out[236]: Index(['0', '1', '2', '3'], dtype='object')
Dates written in nanoseconds need to be read back in nanoseconds:
In [237]: json = dfj2.to_json(date_unit='ns')
In [239]: dfju
Out[239]:
A B date ints bools
1356998400000000000 2.002793 0.009310 1356998400000000000 0 True
1357084800000000000 -1.128420 -1.259131 1356998400000000000 1 True
1357171200000000000 -0.267123 1.754909 1356998400000000000 2 True
1357257600000000000 0.059081 0.946492 1356998400000000000 3 True
1357344000000000000 0.212602 -0.527676 1356998400000000000 4 True
In [241]: dfju
Out[241]:
A B date ints bools
2013-01-01 2.002793 0.009310 2013-01-01 0 True
2013-01-02 -1.128420 -1.259131 2013-01-01 1 True
2013-01-03 -0.267123 1.754909 2013-01-01 2 True
2013-01-04 0.059081 0.946492 2013-01-01 3 True
2013-01-05 0.212602 -0.527676 2013-01-01 4 True
Note: This supports numeric data only. Index and column labels may be non-numeric, e.g. strings, dates
etc.
If numpy=True is passed to read_json an attempt will be made to sniff an appropriate dtype during deseri-
alization and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python
objects.
This can provide speedups if you are deserialising a large amount of numeric data:
Warning: Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected
output if these assumptions are not satisfied:
• data is numeric.
• data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised,
or incorrect output may be produced if this condition is not satisfied.
• labels are ordered. Labels are only read from the first container, it is assumed that each subsequent
row / column has been encoded in the same order. This should be satisfied if the data was encoded
using to_json but may not be the case if the JSON is from another source.
Normalization
pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into
a flat table.
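A sketch of input data that yields the flattened table below, together with the import used here:
from pandas.io.json import json_normalize

data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
        {'name': {'given': 'Mose', 'family': 'Regner'}},
        {'id': 2, 'name': 'Faye Raker'}]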
In [255]: json_normalize(data)
Out[255]:
id name.first name.last name.given name.family name
0 1.0 Coleen Volk NaN NaN NaN
1 NaN NaN NaN Mose Regner NaN
2 2.0 NaN NaN NaN NaN Faye Raker
The max_level parameter provides more control over which level to end normalization. With max_level=1
the following snippet normalizes until 1st nesting level of the provided dict.
In [262]: df
Out[262]:
a b
0 1 2
1 3 4
In [265]: reader
Out[265]: <pandas.io.json._json.JsonReader at 0x13ed13b10>
a b
0 1 2
a b
1 3 4
Table schema
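A sketch of a frame like the one below, and of serializing it with the table orient:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['a', 'b', 'c'],
                   'C': pd.date_range('2016-01-01', freq='d', periods=3)},
                  index=pd.Index(range(3), name='idx'))
df.to_json(orient='table', date_format='iso')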
In [268]: df
Out[268]:
A B C
idx
0 1 a 2016-01-01
1 2 b 2016-01-02
2 3 c 2016-01-03
,→"C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},
,→{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
The schema field contains the fields key, which itself contains a list of column name to type pairs, including
the Index or MultiIndex (see below for a list of types). The schema field also contains a primaryKey field
if the (Multi)index is unique.
The second field, data, contains the serialized data with the records orient. The index is included, and any
datetimes are ISO 8601 formatted, as required by the Table Schema spec.
The full list of supported types is described in the Table Schema spec. This table shows the mapping from
pandas types:
pandas type       Table Schema type
int64             integer
float64           number
bool              boolean
datetime64[ns]    datetime
timedelta64[ns]   duration
categorical       any
object            str
• The schema object contains a pandas_version field. This contains the version of pandas’ dialect of
the schema, and will be incremented with each revision.
• All dates are converted to UTC when serializing. Even timezone naive values, which are treated as
UTC with an offset of 0.
In [272]: build_table_schema(s)
Out[272]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• datetimes with a timezone (before serializing), include an additional field tz with the time zone name
(e.g. 'US/Central').
In [274]: build_table_schema(s_tz)
Out[274]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• Periods are converted to timestamps before serialization, and so have the same behavior of being con-
verted to UTC. In addition, periods will contain an additional field freq with the period’s frequency,
e.g. 'A-DEC'.
In [276]: build_table_schema(s_per)
Out[276]:
{'fields': [{'name': 'index', 'type': 'datetime', 'freq': 'A-DEC'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• Categoricals use the any type and an enum constraint listing the set of possible values. Additionally,
an ordered field is included:
In [278]: build_table_schema(s_cat)
Out[278]:
{'fields': [{'name': 'index', 'type': 'integer'},
In [280]: build_table_schema(s_dupe)
Out[280]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'pandas_version': '0.20.0'}
• The primaryKey behavior is the same with MultiIndexes, but in this case the primaryKey is an array:
In [282]: build_table_schema(s_multi)
Out[282]:
{'fields': [{'name': 'level_0', 'type': 'string'},
{'name': 'level_1', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': FrozenList(['level_0', 'level_1']),
'pandas_version': '0.20.0'}
In [284]: df
Out[284]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [285]: df.dtypes
Out[285]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
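The round trip that preserves these dtypes, including the categorical column and the index name, uses the table orient in both directions; a sketch:
df.to_json('test.json', orient='table')
new_df = pd.read_json('test.json', orient='table')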
In [288]: new_df
Out[288]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [289]: new_df.dtypes
Out[289]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
Please note that the literal string ‘index’ as the name of an Index is not round-trippable, nor are any names
beginning with 'level_' within a MultiIndex. These are used by default in DataFrame.to_json() to
indicate missing values and the subsequent read cannot distinguish the intent.
In [293]: print(new_df.index.name)
None
4.1.3 HTML
Warning: We highly encourage you to read the HTML Table Parsing gotchas below regarding the
issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into
list of pandas DataFrames. Let’s look at a few examples.
Note: read_html returns a list of DataFrame objects, even if there is only a single table contained in the
HTML content.
In [296]: dfs
Out[296]:
[ Bank Name City ST CERT ␣
,→ Acquiring Institution Closing Date Updated Date
0 The Enloe State Bank Cooper TX 10716 ␣
,→ Legend Bank, N. A. May 31, 2019 August 22, 2019
1 Washington Federal Bank for Savings Chicago IL 30570 ␣
,→ Royal Savings Bank December 15, 2017 July 24, 2019
2 The Farmers and Merchants State Bank of Argonia Argonia KS 17719 ␣
,→ Conway Bank October 13, 2017 August 12, 2019
3 Fayette County Bank Saint Elmo IL 1802 ␣
,→ United Fidelity Bank, fsb May 26, 2017 January 29, 2019
4 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee WI 30003 First-
,→Citizens Bank & Trust Company May 5, 2017 March 22, 2018
.. ... ... .. ... ␣
,→ ... ... ...
551 Superior Bank, FSB Hinsdale IL 32646 ␣
,→ Superior Federal, FSB July 27, 2001 August 19, 2014
552 Malta National Bank Malta OH 6629 ␣
,→ North Valley Bank May 3, 2001 November 18, 2002
553 First Alliance Bank & Trust Co. Manchester NH 34264 Southern␣
,→New Hampshire Bank & Trust February 2, 2001 February 18, 2003
554 National State Bank of Metropolis Metropolis IL 3815 ␣
,→ Banterra Bank of Marion December 14, 2000 March 17, 2005
555 Bank of Honolulu Honolulu HI 21029 ␣
,→ Bank of the Orient October 13, 2000 March 17, 2005
Note: The data from the above URL changes every Monday so the resulting data above and the data
below may be slightly different.
Read in the content of the file from the above URL and pass it to read_html as a string:
In [298]: dfs
Out[298]:
[ Bank Name City ST CERT ␣
,→Acquiring Institution Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386 ␣
,→North Shore Bank, FSB May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale AZ 34527 ␣
,→ Western State Bank May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta GA 58185 ␣
,→ Synovus Bank May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville NC 58701 ␣
,→ Capital Bank, N.A. May 10, 2013 May 14, 2013
4 Douglas County Bank Douglasville GA 21649 ␣
,→Hamilton State Bank April 26, 2013 May 16, 2013
.. ... ... .. ... ␣
,→ ... ... ...
500 Superior Bank, FSB Hinsdale IL 32646 ␣
,→Superior Federal, FSB July 27, 2001 June 5, 2012
501 Malta National Bank Malta OH 6629 ␣
,→ North Valley Bank May 3, 2001 November 18, 2002
502 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New␣
,→Hampshire Bank & Trust February 2, 2001 February 18, 2003
503 National State Bank of Metropolis Metropolis IL 3815 ␣
,→Banterra Bank of Marion December 14, 2000 March 17, 2005
504 Bank of Honolulu Honolulu HI 21029 ␣
,→ Bank of the Orient October 13, 2000 March 17, 2005
In [301]: dfs
Out[301]:
[ Bank Name City ST CERT ␣
,→Acquiring Institution Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386 ␣
,→North Shore Bank, FSB May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale AZ 34527 ␣
,→ Western State Bank May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta GA 58185 ␣
,→ Synovus Bank May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville NC 58701 ␣
,→ Capital Bank, N.A. May 10, 2013 May 14, 2013
Note: The following examples are not run by the IPython evaluator because having so many network-accessing
functions slows down the documentation build. If you spot an error or an example that doesn’t run, please
do not hesitate to report it on the pandas GitHub issues page.
Specify a header row (by default <th> or <td> elements located within a <thead> are used to form the
column index, if multiple rows are contained within <thead> then a MultiIndex is created); if specified, the
header row is taken from the data minus the parsed header elements (<th> elements).
Specify a number of rows to skip using a list (xrange (Python 2 only) works as well):
url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0,
converters={'MNC': str})
Read in pandas to_html output (with some loss of floating point precision):
df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format='{0:.40g}'.format)
dfin = pd.read_html(s, index_col=0)
The lxml backend will raise an error on a failed parse if that is the only parser you provide. If you only have
a single parser you can provide just a string, but it is considered good practice to pass a list with one string
if, for example, the function expects a sequence of strings. You may use:
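For example (the URL and the match string are placeholders):
dfs = pd.read_html(url, match='Metcalf Bank', index_col=0, flavor=['lxml'])
# with bs4 and html5lib installed, a list allows falling back to another parser:
dfs = pd.read_html(url, match='Metcalf Bank', index_col=0, flavor=['lxml', 'bs4'])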
However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'] then the parse will most
likely succeed. Note that as soon as a parse succeeds, the function will return.
DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an
HTML table. The function arguments are as in the method to_string described above.
Note: Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See
to_html() for the full set of options.
In [303]: df
Out[303]:
0 1
0 -1.050304 1.131622
1 -0.692581 -1.174172
In [305]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.050304</td>
</tr>
<tr>
<th>1</th>
<td>-0.692581</td>
</tr>
</tbody>
</table>
HTML:
float_format takes a Python callable to control the precision of floating point values:
In [306]: print(df.to_html(float_format='{0:.10f}'.format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-1.0503044154</td>
<td>1.1316218324</td>
</tr>
<tr>
<th>1</th>
<td>-0.6925807265</td>
<td>-1.1741715747</td>
</tr>
</tbody>
</table>
HTML:
bold_rows will make the row labels bold by default, but you can turn that off:
In [307]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-1.050304</td>
<td>1.131622</td>
</tr>
<tr>
<td>1</td>
<td>-0.692581</td>
<td>-1.174172</td>
</tr>
</tbody>
</table>
The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these
classes are appended to the existing 'dataframe' class.
The render_links argument provides the ability to add hyperlinks to cells that contain URLs.
New in version 0.24.
In [309]: url_df = pd.DataFrame({
.....: 'name': ['Python', 'Pandas'],
.....: 'url': ['https://www.python.org/', 'http://pandas.pydata.org']})
.....:
In [310]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Python</td>
<td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></td>
</tr>
<tr>
<th>1</th>
<td>Pandas</td>
<td><a href="http://pandas.pydata.org" target="_blank">http://pandas.pydata.org</a></td>
</tr>
</tbody>
</table>
HTML:
Finally, the escape argument allows you to control whether the “<”, “>” and “&” characters are escaped in
the resulting HTML (by default it is True). So to get the HTML without escaped characters, pass escape=False.
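The frame used in the next two outputs can be built like this sketch (the numeric column is random, so its values will differ):
df = pd.DataFrame({'a': ['&', '<', '>'], 'b': np.random.randn(3)})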
Escaped:
In [312]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&amp;</td>
<td>1.254800</td>
</tr>
<tr>
<th>1</th>
<td>&lt;</td>
<td>1.131996</td>
</tr>
<tr>
<th>2</th>
<td>&gt;</td>
<td>-1.311021</td>
</tr>
</tbody>
</table>
Not escaped:
In [313]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>1.254800</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>1.131996</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-1.311021</td>
</tr>
</tbody>
</table>
Note: Some browsers may not show a difference in the rendering of the previous two HTML tables.
There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level
pandas io function read_html.
Issues with lxml
• Benefits
– lxml is very fast.
– lxml requires Cython to install correctly.
• Drawbacks
– lxml does not make any guarantees about the results of its parse unless it is given strictly valid
markup.
– In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this
backend will use html5lib if lxml fails to parse
– It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that
you will still get a valid result (provided everything else is valid) even if lxml fails.
Issues with BeautifulSoup4 using lxml as a backend
• The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser
backend.
Issues with BeautifulSoup4 using html5lib as a backend
• Benefits
– html5lib is far more lenient than lxml and consequently deals with real-life markup in a much
saner way rather than just, e.g., dropping an element without notifying you.
– html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely
important for parsing HTML tables, since it guarantees a valid document. However, that does
NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.
– html5lib is pure Python and requires no additional build steps beyond its own installation.
• Drawbacks
– The biggest drawback to using html5lib is that it is slow as molasses. However consider the fact
that many tables on the web are not big enough for the parsing algorithm runtime to matter. It
is more likely that the bottleneck will be in the process of reading the raw text from the URL
over the web, i.e., IO (input-output). For very large tables, this might not be true.
The read_excel() method can read Excel 2003 (.xls) files using the xlrd Python module. Excel 2007+
(.xlsx) files can be read using either xlrd or openpyxl. The to_excel() instance method is used for saving
a DataFrame to Excel. Generally the semantics are similar to working with csv data. See the cookbook for
some advanced strategies.
In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which
sheet to parse.
# Returns a DataFrame
pd.read_excel('path_to_file.xls', sheet_name='Sheet1')
ExcelFile class
To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the
file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets as
the file is read into memory only once.
xlsx = pd.ExcelFile('path_to_file.xls')
df = pd.read_excel(xlsx, 'Sheet1')
The sheet_names property will generate a list of the sheet names in the file.
The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:
data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile('path_to_file.xls') as xls:
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
na_values=['NA'])
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)
Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed
to read_excel with no loss in performance.
ExcelFile can also be called with an xlrd.book.Book object as a parameter. This allows the user to control
how the excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook()
with on_demand=True.
import xlrd
xlrd_book = xlrd.open_workbook('path_to_file.xls', on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
    df1 = pd.read_excel(xls, 'Sheet1')
    df2 = pd.read_excel(xls, 'Sheet2')
Specifying sheets
# Using the sheet name - returns a DataFrame
pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
# Using the sheet index - returns a DataFrame
pd.read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])
# Using all default values - returns a DataFrame
pd.read_excel('path_to_file.xls')
read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of
sheet positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an
integer or string, respectively.
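For instance, a sketch of reading several sheets at once (the sheet names and positions here are illustrative); the result is a dict of DataFrames keyed by sheet:
# Returns a dict of DataFrames for the named and positional sheets
pd.read_excel('path_to_file.xls', sheet_name=['Sheet1', 2])
# Returns all sheets as a dict of DataFrames
pd.read_excel('path_to_file.xls', sheet_name=None)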
Reading a MultiIndex
read_excel can read a MultiIndex index, by passing a list of columns to index_col and a MultiIndex
column by passing a list of rows to header. If either the index or columns have serialized level names those
will be read in as well by specifying the rows/columns that make up the levels.
For example, to read in a MultiIndex index without names:
In [315]: df.to_excel('path_to_file.xlsx')
In [317]: df
Out[317]:
a b
a c 1 5
d 2 6
b c 3 7
d 4 8
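The round trip that produces this frame can be sketched as writing with to_excel and then passing the index columns to index_col when reading back:
df.to_excel('path_to_file.xlsx')
df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])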
If the index has level names, they will be parsed as well, using the same parameters.
In [319]: df.to_excel('path_to_file.xlsx')
In [321]: df
Out[321]:
a b
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col
and header:
In [323]: df.to_excel('path_to_file.xlsx')
In [325]: df
Out[325]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
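A sketch of reading such a file back, passing lists to both index_col and header:
df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1], header=[0, 1])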
Users often insert columns to do temporary computations in Excel, and you may not want to read in those
columns. read_excel takes a usecols keyword to allow you to specify a subset of columns to parse.
Deprecated since version 0.24.0.
Passing in an integer for usecols has been deprecated. Please pass in a list of ints from 0 to usecols
inclusive instead.
If usecols is an integer, then it is assumed to indicate the last column to be parsed.
You can also specify a comma-delimited set of Excel columns and ranges as a string:
If usecols is a list of integers, then it is assumed to be the file column indices to be parsed.
If usecols is a list of strings, it is assumed that each string corresponds to a column name provided either
by the user in names or inferred from the document header row(s). Those strings define which columns will
be parsed:
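A sketch of these usecols forms (the column names 'foo' and 'bar' are illustrative):
# A comma-delimited set of Excel columns and ranges, as a string
pd.read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
# A list of file column indices
pd.read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
# A list of column names
pd.read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])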
Parsing dates
Datetime-like values are normally converted automatically to the appropriate dtype when reading the Excel
file. But if you have a column of strings that look like dates (but are not actually formatted as dates in
Excel), you can use the parse_dates keyword to parse those strings to datetimes:
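A sketch, assuming a column named date_strings holding date-like strings:
pd.read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])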
Cell converters
It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a
column to boolean:
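A sketch, assuming a column named MyBools:
pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})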
This option handles missing values and treats exceptions in the converters as missing data. Transformations
are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For
instance, a column of integers with missing values cannot be transformed to an array with integer dtype,
because NaN is strictly a float. You can manually mask missing data to recover integer dtype:
def cfun(x):
    return int(x) if x else -1
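The converter above can then be applied to a column; the column name MyInts is illustrative:
pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})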
Dtype specifications
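As an alternative to converters, the type of an entire column can be specified with the dtype keyword, which takes a dictionary mapping column names to types. A sketch with illustrative column names:
pd.read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})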
Writing Excel files
To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The
arguments are largely the same as to_csv described above: the first argument is the name of the Excel
file, and the optional second argument is the name of the sheet to which the DataFrame should be written.
For example:
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written
using xlsxwriter (if available) or openpyxl.
The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be
placed in the second row instead of the first. You can place it in the first row by setting the merge_cells
option in to_excel() to False:
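For instance, a sketch using an illustrative index label:
df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False)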
In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.
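A sketch of this pattern, assuming two DataFrames df1 and df2:
with pd.ExcelWriter('path_to_file.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')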
Note: Wringing a little more performance out of read_excel. Internally, Excel stores all numeric data as
floats. Because this can produce unexpected behavior when reading in data, pandas defaults to converting
floats to integers if no information is lost (1.0 --> 1). You can pass convert_float=False to disable this
behavior, which may give a slight performance improvement.
Pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.
from io import BytesIO

bio = BytesIO()
# Write the DataFrame to the in-memory buffer via an ExcelWriter
writer = pd.ExcelWriter(bio, engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()
Note: engine is optional but recommended. Setting the engine determines the version of workbook produced.
Setting engine='xlwt' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl'
or 'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted
workbook is produced.
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
The look and feel of Excel worksheets created from pandas can be modified using the following parameters
on the DataFrame’s to_excel method.
• float_format : Format string for floating point numbers (default None).
• freeze_panes : A tuple of two integers representing the bottommost row and rightmost column to
freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default
None).
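For instance, a sketch that freezes the first row and first column:
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1', freeze_panes=(1, 1))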
Using the XlsxWriter engine provides many options for controlling the format of an Excel worksheet created
with the to_excel method. Excellent examples can be found in the XlsxWriter documentation here:
https://xlsxwriter.readthedocs.io/working_with_pandas.html
OpenDocument Spreadsheets
The read_excel() method can also read OpenDocument spreadsheets (.ods) using the odfpy module, by
passing engine='odf':
# Returns a DataFrame
pd.read_excel('path_to_file.ods', engine='odf')
Note: Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.
4.1.6 Clipboard
A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard
buffer and passes them to the read_csv method. For instance, you can copy the following text to the
clipboard (CTRL-C on many operating systems):
A B C
x 1 4 p
y 2 5 q
z 3 6 r
The to_clipboard method can be used to write the contents of a DataFrame to the clipboard, after which
you can paste the clipboard contents into other applications (CTRL-V on many operating systems). Below
we illustrate writing a DataFrame to the clipboard and reading it back, getting the same content back that
we wrote to the clipboard.
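A minimal sketch of the round trip (requires a clipboard backend to be available on the system):
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['p', 'q', 'r']},
                  index=['x', 'y', 'z'])
df.to_clipboard()
pd.read_clipboard()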
Note: You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.
4.1.7 Pickling
All pandas objects are equipped with to_pickle methods which use Python's pickle module to save data
structures to disk using the pickle format.
In [326]: df
Out[326]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
In [327]: df.to_pickle('foo.pkl')
The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any
other pickled object) from file:
In [328]: pd.read_pickle('foo.pkl')
Out[328]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
Warning: Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html
Warning: read_pickle() is only guaranteed backwards compatible back to pandas version 0.20.3
In [329]: df = pd.DataFrame({
.....: 'A': np.random.randn(1000),
.....: 'B': 'foo',
.....: 'C': pd.date_range('20130101', periods=1000, freq='s')})
.....:
In [330]: df
Out[330]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
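read_pickle and to_pickle also accept a compression argument (gzip, bz2, xz, zip, or the default 'infer', which chooses the compression from the file extension); rt below is the result of such a compressed round trip. A sketch with an illustrative file name:
df.to_pickle("data.pkl.compress", compression="gzip")
rt = pd.read_pickle("data.pkl.compress", compression="gzip")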
In [333]: rt
Out[333]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
In [336]: rt
Out[336]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
In [337]: df.to_pickle("data.pkl.gz")
In [338]: rt = pd.read_pickle("data.pkl.gz")
In [339]: rt
Out[339]:
A B C
0 -0.053113 foo 2013-01-01 00:00:00
1 0.348832 foo 2013-01-01 00:00:01
2 -0.162729 foo 2013-01-01 00:00:02
3 -1.269943 foo 2013-01-01 00:00:03
4 -0.481824 foo 2013-01-01 00:00:04
.. ... ... ...
995 -1.001718 foo 2013-01-01 00:16:35
996 -0.471336 foo 2013-01-01 00:16:36
997 -0.071712 foo 2013-01-01 00:16:37
998 0.578273 foo 2013-01-01 00:16:38
999 0.595708 foo 2013-01-01 00:16:39
In [340]: df["A"].to_pickle("s1.pkl.bz2")
In [341]: rt = pd.read_pickle("s1.pkl.bz2")
In [342]: rt
Out[342]:
0 -0.053113
1 0.348832
2 -0.162729
3 -1.269943
4 -0.481824
...
995 -1.001718
996 -0.471336
997 -0.071712
998 0.578273
999 0.595708
Name: A, Length: 1000, dtype: float64
4.1.8 msgpack
pandas supports the msgpack format for object serialization. This is a lightweight portable binary format,
similar to binary JSON, that is highly space efficient and provides good performance both for writing
(serialization) and reading (deserialization).
Warning: The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is
recommended to use pyarrow for on-the-wire transmission of pandas objects.
Warning: read_msgpack() is only guaranteed backwards compatible back to pandas version 0.20.3
In [344]: df.to_msgpack('foo.msg')
In [345]: pd.read_msgpack('foo.msg')
Out[345]:
A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928
You can pass a list of objects and you will receive them back on deserialization.
In [351]: pd.read_msgpack('foo.msg')
Out[351]:
[ A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928, 'foo', array([1, 2, 3]), 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64, A B
0 0.541029 0.554672
1 0.150831 0.503287
Unlike other IO methods, to_msgpack is available both on a per-object basis, df.to_msgpack(), and as the
top-level pd.to_msgpack(...), where you can pack arbitrary collections of Python lists, dicts, and scalars
while intermixing pandas objects.
In [353]: pd.read_msgpack('foo2.msg')
Out[353]:
{'dict': ({'df': A B
0 0.541029 0.554672
1 0.150831 0.503287
2 0.834267 0.881894
3 0.706066 0.726912
4 0.639300 0.067928},
{'string': 'foo'},
{'scalar': 1.0},
{'s': 2013-01-01 0.753932
2013-01-02 0.676180
2013-01-03 0.924728
2013-01-04 0.338661
2013-01-05 0.592241
Freq: D, dtype: float64})}
Read/write API
In [354]: df.to_msgpack()
Out[354]: b"\x84\xa3typ\xadblock_
,→manager\xa5klass\xa9DataFrame\xa4axes\x92\x86\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype\x
,→index\xa5klass\xaaRangeIndex\xa4name\xc0\xa5start\x00\xa4stop\x05\xa4step\x01\xa6blocks\x91\x86\xa4loc
,→`)\x0c3kN\xc3?\xac\xa1:JQ\xb2\xea?\x8c|\xa87\x17\x98\xe6?\xf3H\x83*&u\xe4?\xd4S\xff
,→{\xe0\xbf\xe1?\xd3'2\xea\xed\x1a\xe0?6\x00'gy8\xec?S\x98/\xe7\xdcB\xe7?
,→`\xdbr\xed\xbac\xb1?
,→\xa5shape\x92\x02\x05\xa5dtype\xa7float64\xa5klass\xaaFloatBlock\xa8compress\xc0"
Furthermore you can concatenate the strings to produce a list of the original objects.
4.1.9 HDF5 (PyTables)
HDFStore is a dict-like object which reads and writes pandas objects using the high performance HDF5
format via the excellent PyTables library. See the cookbook for some advanced strategies.
Warning: pandas requires PyTables >= 3.0.0. There is an indexing bug in PyTables < 3.2 which may
appear when querying stores using an index. If you see a subset of results being returned, upgrade to
PyTables >= 3.2. Stores created previously will need to be rewritten using the updated version.
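A minimal sketch of basic usage, assuming the PyTables package is installed:
store = pd.HDFStore('store.h5')   # open (or create) an HDF5 file
store['df'] = df                  # write, like setting a key in a dict
df_again = store['df']            # read it back
store.close()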
In [357]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Objects can be written to the file just like adding key-value pairs to a dict:
In [362]: store['df'] = df
In [363]: store
Out[363]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [367]: store
Out[367]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [369]: store
Out[369]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [370]: store.is_open
Out[370]: False
# Working with, and automatically closing the store using a context manager
In [371]: with pd.HDFStore('store.h5') as store:
.....: store.keys()
.....:
Read/write API
HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how
read_csv and to_csv work.
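A sketch of this top-level API (file and key names are illustrative):
df_tl = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
df_tl.to_hdf('store_tl.h5', 'table', append=True)
pd.read_hdf('store_tl.h5', 'table', where=['index>2'])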
HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting
dropna=True.
In [375]: df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
.....: 'col2': [1, np.nan, np.nan]})
.....:
In [376]: df_with_missing
Out[376]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
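A sketch of writing this frame with and without dropping the all-missing row (the file name is illustrative):
# default: all-missing rows are kept
df_with_missing.to_hdf('file.h5', 'df_with_missing',
                       format='table', mode='w')
# drop rows that are entirely missing
df_with_missing.to_hdf('file.h5', 'df_with_missing',
                       format='table', mode='w', dropna=True)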
Fixed format
The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format,
called the fixed format. These types of stores are not appendable once written (though you can simply
remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do
not support DataFrames with non-unique column names. The fixed format stores offer very fast writing and
slightly faster reading than table stores. This format is specified by default when using put or to_hdf or
by format='fixed' or format='f'.
Warning: A fixed format will raise a TypeError if you try to retrieve using a where:
>>> pd.DataFrame(np.random.randn(10, 2)).to_hdf('test_fixed.h5', 'df')
>>> pd.read_hdf('test_fixed.h5', 'df', where='index>5')
TypeError: cannot pass a where specification when reading a fixed format.
this store must be selected in its entirety
Table format
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very
much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions.
In addition, delete and query type operations are supported. This format is specified by format='table'
or format='t' to append or put or to_hdf.
This format can also be set as a global option, pd.set_option('io.hdf.default_format', 'table'), so
that put/append/to_hdf store in the table format by default.
In [381]: store = pd.HDFStore('store.h5')
In [386]: store
Out[386]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
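A sketch of the typical table-format workflow: appending pieces of a frame creates a table automatically, which can then be queried with select (df here stands for any DataFrame):
df1 = df[0:4]
df2 = df[4:]
# append automatically creates a table store
store.append('df', df1)
store.append('df', df2)
# select the complete object back
store.select('df')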
Note: You can also create a table by passing format='table' or format='t' to a put operation.
Hierarchical keys
Keys to a store can be specified as a string. These can be in a hierarchical path-name-like format (e.g.
foo/bar/bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be
specified without the leading ‘/’ and are always absolute (e.g. ‘foo’ refers to ‘/foo’). Removal operations
can remove everything in the sub-store and below, so be careful.
In [389]: store.put('foo/bar/bah', df)
In [392]: store
Out[392]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [395]: store
Out[395]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
You can walk through the group hierarchy using the walk method which will yield a tuple for each group
key along with the relative keys of its contents.
New in version 0.24.0.
In [396]: for (path, subgroups, subkeys) in store.walk():
.....:     for subgroup in subgroups:
.....:         print('GROUP: {}/{}'.format(path, subgroup))
.....:     for subkey in subkeys:
.....:         key = '/'.join([path, subkey])
.....:         print('KEY: {}'.format(key))
.....:         print(store.get(key))
.....:
GROUP: /foo
KEY: /df
A B C
2000-01-01 0.263806 1.913465 -0.274536
2000-01-02 0.283334 1.798001 -0.053258
2000-01-03 0.799684 -0.733715 1.205089
2000-01-04 0.131478 0.100995 -0.764260
2000-01-05 1.891112 -1.410251 0.752883
2000-01-06 -0.274852 -0.667027 -0.688782
2000-01-07 0.621607 -1.300199 0.050119
2000-01-08 -0.999591 -0.320658 -1.922640
GROUP: /foo/bar
Warning: Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for
items stored under the root node.
In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'
# you can directly access the actual PyTables node by going through the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array),
,→'axis1' (Array)]
Storing types
Storing mixed-dtype data is supported. Strings are stored as fixed-width using the maximum size of the
appended column. Subsequent attempts at appending longer strings will raise a ValueError.
Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the
string columns. Storing floats, strings, ints, bools, and datetime64 values is currently supported. For string
columns, passing nan_rep = 'nan' to append will change the default NaN representation on disk (which
converts to/from np.nan); this defaults to nan.
In [398]: df_mixed = pd.DataFrame({'A': np.random.randn(8),
.....: 'B': np.random.randn(8),
.....: 'C': np.array(np.random.randn(8), dtype='float32'),
.....: 'string': 'string',
.....: 'int': 1,
.....: 'bool': True,
.....: 'datetime64': pd.Timestamp('20010102')})
.....:
In [399]: df_mixed.loc[df_mixed.index[3:5],
.....: ['A', 'B', 'string', 'datetime64']] = np.nan
.....:
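df_mixed1 below is the result of appending df_mixed to a store and selecting it back; a sketch of that step, with an illustrative min_itemsize:
store.append('df_mixed', df_mixed, min_itemsize={'values': 50})
df_mixed1 = store.select('df_mixed')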
In [402]: df_mixed1
Out[402]:
A B C string int bool datetime64
0 0.894171 -1.452159 -0.105646 string 1 True 2001-01-02
1 -1.539066 1.018959 0.028593 string 1 True 2001-01-02
2 -0.114019 -0.087476 0.693070 string 1 True 2001-01-02
3 NaN NaN -0.646571 NaN 1 True NaT
4 NaN NaN -0.174558 NaN 1 True NaT
5 -2.110838 -1.234633 -1.257271 string 1 True 2001-01-02
6 -0.704558 0.463419 -0.917264 string 1 True 2001-01-02
7 -0.929182 0.841053 0.414183 string 1 True 2001-01-02
In [403]: df_mixed1.dtypes.value_counts()
Out[403]:
float64 2
datetime64[ns] 1
int64 1
object 1
bool 1
float32 1
dtype: int64