Pivot Tables - Python Data Science Handbook
Pivot Tables - Python Data Science Handbook
This is an excerpt from the Python Data Science Handbook (http://shop.oreilly.com/product/0636920034919.do) by Jake
VanderPlas; Jupyter notebooks are available on GitHub (https://github.com/jakevdp/PythonDataScienceHandbook).
The text is released under the CC-BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code
is released under the MIT license (https://opensource.org/licenses/MIT). If you find this content useful, please consider
supporting the work by buying the book (http://shop.oreilly.com/product/0636920034919.do)!
Pivot Tables
Open in Colab
(https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/note
Pivot-Tables.ipynb)
We have seen how the GroupBy abstraction lets us explore relationships within
a dataset. A pivot table is a similar operation that is commonly seen in
spreadsheets and other programs that operate on tabular data. The pivot table
takes simple column-wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data.
The difference between pivot tables and GroupBy can sometimes cause
confusion; it helps me to think of pivot tables as essentially a multidimensional
version of GroupBy aggregation. That is, you split-apply-combine, but both the
split and the combine happen across not a one-dimensional index, but across a
two-dimensional grid.
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 1/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
In [2]: titanic.head()
In [3]: titanic.groupby('sex')[['survived']].mean()
Out[3]: survived
sex
female0.742038
male 0.188908
This immediately gives us some insight: overall, three of every four females on
board survived, while only one in five males survived!
This is useful, but we might like to go one step deeper and look at survival by
both sex and, say, class. Using the vocabulary of GroupBy , we might proceed
using something like this: we group by class and gender, select survival, apply a
mean aggregate, combine the resulting groups, and then unstack the
hierarchical index to reveal the hidden multidimensionality. In code:
This gives us a better idea of how both gender and class affected survival, but
the code is starting to look a bit garbled. While each step of this pipeline makes
sense in light of the tools we've previously discussed, the long string of code is
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 2/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
This is eminently more readable than the groupby approach, and produces the
same result. As you might expect of an early 20th-century transatlantic cruise,
the survival gradient favors both women and higher classes. First-class women
survived with near certainty (hi, Rose!), while only one in ten third-class men
survived (sorry, Jack!).
We can apply the same strategy when working with the columns as well; let's
add info on the fare paid using pd.qcut to automatically compute quantiles:
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 3/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
We've already seen examples of the first three arguments; here we'll take a
quick look at the remaining ones. Two of the options, fill_value and
dropna , have to do with missing data and are fairly straightforward; we will not
show examples of them here.
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 4/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
Notice also here that we've omitted the values keyword; when specifying a
mapping for aggfunc , this is determined automatically.
At times it's useful to compute totals along each grouping. This can be done via
the margins keyword:
Taking a look at the data, we see that it's relatively simple–it contains the
number of births grouped by date and gender:
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 5/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
In [12]: births.head()
We can start to understand this data a bit more by using a pivot table. Let's add
a decade column, and take a look at male and female births as a function of
decade:
Out[13]: gender F M
decade
1960 1753634 1846572
1970 1626307517121550
1980 1831035119243452
1990 1947945420420553
2000 1822930919106428
We immediately see that male births outnumber female births in every decade.
To see this trend a bit more clearly, we can use the built-in plotting tools in
Pandas to visualize the total number of births by year (see Introduction to
Matplotlib (04.00-introduction-to-matplotlib.html) for a discussion of plotting
with Matplotlib):
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 6/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
With a simple pivot table and plot() method, we can immediately see the
annual trend in births by gender. By eye, it appears that over the past 50 years
male births have outnumbered female births by around 5%.
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution (You can learn more
about sigma-clipping operations in a book I coauthored with Željko Ivezić,
Andrew J. Connolly, and Alexander Gray: "Statistics, Data Mining, and Machine
Learning in Astronomy" (http://press.princeton.edu/titles/10159.html)
(Princeton University Press, 2014)).
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 7/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
With this we can use the query() method (discussed further in High-
Performance Pandas: eval() and query() (03.12-performance-eval-and-
query.html)) to filter-out rows with births outside these values:
In [16]: births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5
Next we set the day column to integers; previously it had been a string because
some columns in the dataset contained the value 'null' :
In [17]: # set 'day' column to integer; it originally was a string due to nul
births['day'] = births['day'].astype(int)
Finally, we can combine the day, month, and year to create a Date index (see
Working with Time Series (03.11-working-with-time-series.html)). This allows
us to quickly compute the weekday corresponding to each row:
births['dayofweek'] = births.index.dayofweek
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sa
plt.ylabel('mean births by day');
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 8/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
Another intersting view is to plot the mean number of births by the day of the
year. Let's first group the data by month and day separately:
Out[20]: 1 1 4009.225
2 4247.400
3 4500.900
4 4571.350
5 4603.625
Name: births, dtype: float64
The result is a multi-index over months and days. To make this easily plottable,
let's turn these months and days into a date by associating them with a dummy
year variable (making sure to choose a leap year so February 29th is correctly
handled!)
Focusing on the month and day only, we now have a time series reflecting the
average number of births by date of the year. From this, we can use the plot
method to plot the data. It reveals some interesting trends:
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 9/10
2/18/25, 6:45 PM Pivot Tables | Python Data Science Handbook
Looking at this short example, you can see that many of the Python and Pandas
tools we've seen to this point can be combined and used to gain insight from a
variety of datasets. We will see some more sophisticated applications of these
data manipulations in future sections!
Open in Colab
(https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/note
Pivot-Tables.ipynb)
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html 10/10