
Modern Data Science with R
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Francesca Dominici, Harvard School of Public Health, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Texts in Statistical Science

Modern Data Science with R

Benjamin S. Baumer
Daniel T. Kaplan
Nicholas J. Horton
Figures 3.21, 3.27, and 12.1: © ESPN. Reprinted courtesy of FiveThirtyEight.com

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20161221

International Standard Book Number-13: 978-1-4987-2448-7 (Pack - Book and Ebook)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

List of Tables xv

List of Figures xvii

Preface xxiii

I Introduction to Data Science 1


1 Prologue: Why data science? 3
1.1 What is data science? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Case study: The evolution of sabermetrics . . . . . . . . . . . . . . . . . . 6
1.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Data visualization 9
2.1 The 2012 federal election cycle . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Are these two groups different? . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Graphing variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Examining relationships among variables . . . . . . . . . . . . . . . 12
2.1.4 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Composing data graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 A taxonomy for data graphics . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Dissecting data graphics . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Importance of data graphics: Challenger . . . . . . . . . . . . . . . . . . . 23
2.4 Creating effective presentations . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 The wider world of data visualization . . . . . . . . . . . . . . . . . . . . . 28
2.6 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 A grammar for graphics 33


3.1 A grammar for data graphics . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Aesthetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.3 Guides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1.4 Facets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1.5 Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Canonical data graphics in R . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Univariate displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


3.2.2 Multivariate displays . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


3.2.3 Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.4 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Extended example: Historical baby names . . . . . . . . . . . . . . . . . . . 48
3.3.1 Percentage of people alive today . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Most common women’s names . . . . . . . . . . . . . . . . . . . . . 56
3.4 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4 Data wrangling 63
4.1 A grammar for data wrangling . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 select() and filter() . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.2 mutate() and rename() . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1.3 arrange() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1.4 summarize() with group_by() . . . . . . . . . . . . . . . . . . . . . 70
4.2 Extended example: Ben’s time with the Mets . . . . . . . . . . . . . . . . . 72
4.3 Combining multiple tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 inner_join() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 left_join() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Extended example: Manny Ramirez . . . . . . . . . . . . . . . . . . . . . . 82
4.5 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5 Tidy data and iteration 91


5.1 Tidy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.2 What are tidy data? . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Reshaping data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.1 Data verbs for converting wide to narrow and vice versa . . . . . . . 100
5.2.2 Spreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2.3 Gathering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2.4 Example: Gender-neutral names . . . . . . . . . . . . . . . . . . . . 101
5.3 Naming conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Automation and iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Vectorized operations . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.2 The apply() family of functions . . . . . . . . . . . . . . . . . . . . 106
5.4.3 Iteration over subgroups with dplyr::do() . . . . . . . . . . . . . . 110
5.4.4 Iteration with mosaic::do . . . . . . . . . . . . . . . . . . . . . . . 113
5.5 Data intake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.1 Data-table friendly formats . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.2 APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.5.3 Cleaning data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.5.4 Example: Japanese nuclear reactors . . . . . . . . . . . . . . . . . . 126
5.6 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6 Professional Ethics 131


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2 Truthful falsehoods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 Some settings for professional ethics . . . . . . . . . . . . . . . . . . . . . . 134
6.3.1 The chief executive officer . . . . . . . . . . . . . . . . . . . . . . . . 134

6.3.2 Employment discrimination . . . . . . . . . . . . . . . . . . . . . . . 134


6.3.3 Data scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.3.4 Reproducible spreadsheet analysis . . . . . . . . . . . . . . . . . . . 135
6.3.5 Drug dangers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.3.6 Legal negotiations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.4 Some principles to guide ethical action . . . . . . . . . . . . . . . . . . . . . 136
6.4.1 Applying the precepts . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.5 Data and disclosure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5.1 Reidentification and disclosure avoidance . . . . . . . . . . . . . . . 140
6.5.2 Safe data storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.5.3 Data scraping and terms of use . . . . . . . . . . . . . . . . . . . . . 141
6.6 Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.6.1 Example: Erroneous data merging . . . . . . . . . . . . . . . . . . . 142
6.7 Professional guidelines for ethical conduct . . . . . . . . . . . . . . . . . . . 143
6.8 Ethics, collectively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.9 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

II Statistics and Modeling 147


7 Statistical foundations 149
7.1 Samples and populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.2 Sample statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.4 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5 Statistical models: Explaining variation . . . . . . . . . . . . . . . . . . . . 159
7.6 Confounding and accounting for other factors . . . . . . . . . . . . . . . . . 162
7.7 The perils of p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

8 Statistical learning and predictive analytics 171


8.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
8.2 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.2.1 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.2.2 Example: High-earners in the 1994 United States Census . . . . . . 174
8.2.3 Tuning parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.2.4 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.2.5 Nearest neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.2.6 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.2.7 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . . 185
8.3 Ensemble methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.4 Evaluating models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8.4.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8.4.2 Measuring prediction error . . . . . . . . . . . . . . . . . . . . . . . 189
8.4.3 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.4.4 ROC curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.4.5 Bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.4.6 Example: Evaluation of income models . . . . . . . . . . . . . . . . 192
8.5 Extended example: Who has diabetes? . . . . . . . . . . . . . . . . . . . . 196

8.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201


8.7 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

9 Unsupervised learning 205


9.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9.1.1 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 206
9.1.2 k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.2 Dimension reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.2.1 Intuitive approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
9.2.2 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . 213
9.3 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

10 Simulation 221
10.1 Reasoning in reverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.2 Extended example: Grouping cancers . . . . . . . . . . . . . . . . . . . . . 222
10.3 Randomizing functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.4 Simulating variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
10.4.1 The partially planned rendezvous . . . . . . . . . . . . . . . . . . . . 225
10.4.2 The jobs report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.4.3 Restaurant health and sanitation grades . . . . . . . . . . . . . . . . 228
10.5 Simulating a complex system . . . . . . . . . . . . . . . . . . . . . . . . . . 231
10.6 Random networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.7 Key principles of simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.8 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

III Topics in Data Science 241


11 Interactive data graphics 243
11.1 Rich Web content using D3.js and htmlwidgets . . . . . . . . . . . . . . . 243
11.1.1 Leaflet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
11.1.2 Plot.ly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
11.1.3 DataTables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
11.1.4 dygraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
11.1.5 streamgraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
11.2 Dynamic visualization using ggvis . . . . . . . . . . . . . . . . . . . . . . . 246
11.3 Interactive Web apps with Shiny . . . . . . . . . . . . . . . . . . . . . . . . 247
11.4 Further customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
11.5 Extended example: Hot dog eating . . . . . . . . . . . . . . . . . . . . . . . 254
11.6 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

12 Database querying using SQL 261


12.1 From dplyr to SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
12.2 Flat-file databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.3 The SQL universe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
12.4 The SQL data manipulation language . . . . . . . . . . . . . . . . . . . . . 267
12.4.1 SELECT...FROM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

12.4.2 WHERE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272


12.4.3 GROUP BY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
12.4.4 ORDER BY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
12.4.5 HAVING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
12.4.6 LIMIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
12.4.7 JOIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12.4.8 UNION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
12.4.9 Subqueries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
12.5 Extended example: FiveThirtyEight flights . . . . . . . . . . . . . . . . . . 289
12.6 SQL vs. R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
12.7 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
12.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

13 Database administration 301


13.1 Constructing efficient SQL databases . . . . . . . . . . . . . . . . . . . . . . 301
13.1.1 Creating new databases . . . . . . . . . . . . . . . . . . . . . . . . . 301
13.1.2 CREATE TABLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
13.1.3 Keys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.1.4 Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.1.5 EXPLAIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
13.1.6 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
13.2 Changing SQL data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
13.2.1 UPDATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
13.2.2 INSERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.2.3 LOAD DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.3 Extended example: Building a database . . . . . . . . . . . . . . . . . . . . 309
13.3.1 Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
13.3.2 Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
13.3.3 Load into MySQL database . . . . . . . . . . . . . . . . . . . . . . . 310
13.4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
13.5 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
13.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314

14 Working with spatial data 317


14.1 Motivation: What’s so great about spatial data? . . . . . . . . . . . . . . . 317
14.2 Spatial data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
14.3 Making maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
14.3.1 Static maps with ggmap . . . . . . . . . . . . . . . . . . . . . . . . . 322
14.3.2 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
14.3.3 Geocoding, routes, and distances . . . . . . . . . . . . . . . . . . . . 330
14.3.4 Dynamic maps with leaflet . . . . . . . . . . . . . . . . . . . . . . 332
14.4 Extended example: Congressional districts . . . . . . . . . . . . . . . . . . 333
14.4.1 Election results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
14.4.2 Congressional districts . . . . . . . . . . . . . . . . . . . . . . . . . . 336
14.4.3 Putting it all together . . . . . . . . . . . . . . . . . . . . . . . . . . 338
14.4.4 Using ggmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
14.4.5 Using leaflet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.5 Effective maps: How (not) to lie . . . . . . . . . . . . . . . . . . . . . . . . 343
14.6 Extended example: Historical airline route maps . . . . . . . . . . . . . . . 345
14.6.1 Using ggmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
14.6.2 Using leaflet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

14.7 Projecting polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349


14.8 Playing well with others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
14.9 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
14.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

15 Text as data 355


15.1 Tools for working with text . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
15.1.1 Regular expressions using Macbeth . . . . . . . . . . . . . . . . . . . 355
15.1.2 Example: Life and death in Macbeth . . . . . . . . . . . . . . . . . . 359
15.2 Analyzing textual data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
15.2.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
15.2.2 Word clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
15.2.3 Document term matrices . . . . . . . . . . . . . . . . . . . . . . . . . 365
15.3 Ingesting text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
15.3.1 Example: Scraping the songs of the Beatles . . . . . . . . . . . . . . 367
15.3.2 Scraping data from Twitter . . . . . . . . . . . . . . . . . . . . . . . 369
15.4 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
15.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

16 Network science 377


16.1 Introduction to network science . . . . . . . . . . . . . . . . . . . . . . . . . 377
16.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
16.1.2 A brief history of network science . . . . . . . . . . . . . . . . . . . . 378
16.2 Extended example: Six degrees of Kristen Stewart . . . . . . . . . . . . . . 382
16.2.1 Collecting Hollywood data . . . . . . . . . . . . . . . . . . . . . . . . 382
16.2.2 Building the Hollywood network . . . . . . . . . . . . . . . . . . . . 384
16.2.3 Building a Kristen Stewart oracle . . . . . . . . . . . . . . . . . . . . 387
16.3 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
16.4 Extended example: 1996 men’s college basketball . . . . . . . . . . . . . . . 391
16.5 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
16.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398

17 Epilogue: Towards “big data” 401


17.1 Notions of big data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
17.2 Tools for bigger data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
17.2.1 Data and memory structures for big data . . . . . . . . . . . . . . . 403
17.2.2 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
17.2.3 Parallel and distributed computing . . . . . . . . . . . . . . . . . . . 404
17.2.4 Alternatives to SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
17.3 Alternatives to R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
17.4 Closing thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
17.5 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413

IV Appendices 415
A Packages used in this book 417
A.1 The mdsr package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
A.2 The etl package suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
A.3 Other packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
A.4 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420

B Introduction to R and RStudio 421


B.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
B.1.1 Installation under Windows . . . . . . . . . . . . . . . . . . . . . . . 422
B.1.2 Installation under Mac OS X . . . . . . . . . . . . . . . . . . . . . . 422
B.1.3 Installation under Linux . . . . . . . . . . . . . . . . . . . . . . . . . 422
B.1.4 RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
B.2 Running RStudio and sample session . . . . . . . . . . . . . . . . . . . . . . 422
B.3 Learning R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
B.3.1 Getting help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
B.3.2 swirl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
B.4 Fundamental structures and objects . . . . . . . . . . . . . . . . . . . . . . 427
B.4.1 Objects and vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
B.4.2 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
B.4.3 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
B.4.4 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
B.4.5 Dataframes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
B.4.6 Attributes and classes . . . . . . . . . . . . . . . . . . . . . . . . . . 431
B.4.7 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
B.4.8 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
B.5 Add-ons: Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
B.5.1 Introduction to packages . . . . . . . . . . . . . . . . . . . . . . . . . 435
B.5.2 CRAN task views . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
B.5.3 Session information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
B.5.4 Packages and name conflicts . . . . . . . . . . . . . . . . . . . . . . . 438
B.5.5 Maintaining packages . . . . . . . . . . . . . . . . . . . . . . . . . . 438
B.5.6 Installed libraries and packages . . . . . . . . . . . . . . . . . . . . . 438
B.6 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
B.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439

C Algorithmic thinking 443


C.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
C.2 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
C.3 Extended example: Law of large numbers . . . . . . . . . . . . . . . . . . . 446
C.4 Non-standard evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
C.5 Debugging and defensive coding . . . . . . . . . . . . . . . . . . . . . . . . 452
C.6 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
C.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454

D Reproducible analysis and workflow 455


D.1 Scriptable statistical computing . . . . . . . . . . . . . . . . . . . . . . . . 456
D.2 Reproducible analysis with R Markdown . . . . . . . . . . . . . . . . . . . . 456
D.3 Projects and version control . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
D.4 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
D.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461

E Regression modeling 465


E.1 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
E.1.1 Motivating example: Modeling usage of a rail trail . . . . . . . . . . 466
E.1.2 Model visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
E.1.3 Measuring the strength of fit . . . . . . . . . . . . . . . . . . . . . . 467
E.1.4 Categorical explanatory variables . . . . . . . . . . . . . . . . . . . . 469

E.2 Multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470


E.2.1 Parallel slopes: Multiple regression with a categorical
variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
E.2.2 Parallel planes: Multiple regression with a second
quantitative variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
E.2.3 Non-parallel slopes: Multiple regression with interaction . . . . . . . 472
E.2.4 Modelling non-linear relationships . . . . . . . . . . . . . . . . . . . 472
E.3 Inference for regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
E.4 Assumptions underlying regression . . . . . . . . . . . . . . . . . . . . . . . 475
E.5 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
E.6 Further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
E.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482

F Setting up a database server 487


F.1 SQLite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
F.2 MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
F.2.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
F.2.2 Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
F.2.3 Running scripts from the command line . . . . . . . . . . . . . . . . 491
F.3 PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
F.4 Connecting to SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
F.4.1 The command line client . . . . . . . . . . . . . . . . . . . . . . . . . 492
F.4.2 GUIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
F.4.3 R and RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
F.4.4 Load into SQLite database . . . . . . . . . . . . . . . . . . . . . . . 497

Bibliography 499

Indices 513
Subject index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
R index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
List of Tables

3.1 A selection of variables from the first six rows of the CIACountries data table. 34
3.2 Glyph-ready data for the barplot layer in Figure 3.7. . . . . . . . . . . . . . 39
3.3 Table of canonical data graphics and their corresponding ggplot2 commands.
Note that mosaicplot() is not part of the ggplot2 package. . . . . . . . . 47

5.1 A data table showing how many babies were given each name in each year
in the U.S., for a few names. . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 The most popular baby names across all years. . . . . . . . . . . . . . . . . 94
5.3 Ward and precinct votes cast in the 2013 Minneapolis mayoral election. . . 95
5.4 A selection from the Minneapolis election data in tidy form. . . . . . . . . . 96
5.5 Individual ballots in the Minneapolis election. Each voter votes in one ward
in one precinct. The ballot marks the voter’s first three choices for mayor. . 97
5.6 An excerpt of runners’ performance over time in a 10-mile race. . . . . . . . 98
5.7 BP_wide: a data table in a wide format . . . . . . . . . . . . . . . . . . 99
5.8 BP_narrow: a tidy data table in a narrow format. . . . . . . . . . . . . . 100
5.9 A data table extending the information in Tables 5.8 and 5.7 to include ad-
ditional variables and repeated measurements. The narrow format facilitates
including new cases or variables. . . . . . . . . . . . . . . . . . . . . . . . . 100
5.10 The third table embedded in the Wikipedia page on running records. . . . . 119
5.11 The fourth table embedded in the Wikipedia page on running records. . . . 120
5.12 Four of the variables from the houses-for-sale.csv file giving features of
the Saratoga houses stored as integer codes. Each case is a different house. 121
5.13 The Translations data table rendered in a wide format. . . . . . . . . . . 121
5.14 The Houses data with re-coded categorical variables. . . . . . . . . . . . . . 122
5.15 Starting and ending dates for each transcriber involved in the OrdwayBirds
project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

9.1 Sample voting records data from the Scottish Parliament. . . . . . . . . . . 212

12.1 Equivalent commands in SQL and R, where a and b are SQL tables and R
data.frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

14.1 Hypothetical data from 1854 cholera outbreak. . . . . . . . . . . . . . . . . 318

A.1 List of packages used in this book. Most packages are available on CRAN.
Packages available from GitHub include: airlines, fec, imdb, sparklyr,
and streamgraph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420

B.1 Some of the interactive courses available within swirl. . . . . . . . . . . . . 426


B.2 A complete list of CRAN task views. . . . . . . . . . . . . . . . . . . . . . . 437

List of Figures

1.1 Excerpt from Graunt’s bills of mortality. . . . . . . . . . . . . . . . . . . . 4

2.1 Amount of money spent on individual candidates in the general election


phase of the 2012 federal election cycle, in millions of dollars . . . . . . . . 10
2.2 Amount of money spent on individual candidates in the general election
phase of the 2012 federal election cycle, in millions of dollars, broken down
by type of spending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Amount of money spent on individual candidacies by political party affilia-
tion during the general election phase of the 2012 federal election cycle . . 12
2.4 Amount of money spent on individual candidacies by political party affil-
iation during the general election phase of the 2012 federal election cycle,
broken down by office being sought . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Donations made by individuals to the PACs supporting the two major pres-
idential candidates in the 2012 election . . . . . . . . . . . . . . . . . . . . 14
2.6 Donations made by individuals to the PACs supporting the two major pres-
idential candidates in the 2012 election, separated by election phase . . . . 15
2.7 Scatterplot illustrating the relationship between number of dollars spent
supporting and number of votes earned by Democrats in 2012 elections for
the House of Representatives . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Scatterplot illustrating the relationship between percentage of dollars spent
supporting and percentage of votes earned by Democrats in the 2012 House
of Representatives elections . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.9 Campaign funding network for candidates from Massachusetts, 2012 federal
elections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.10 Diverging red-blue color palette . . . . . . . . . . . . . . . . . . . . . . . . 20
2.11 Palettes available through the RColorBrewer package . . . . . . . . . . . . 21
2.12 Bar graph of average SAT scores among states with at least two-thirds of
students taking the test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.13 Scatterplot of world record time in 100-meter freestyle swimming. . . . . . 23
2.14 Pie charts showing the breakdown of substance of abuse among HELP study
participants, faceted by homeless status . . . . . . . . . . . . . . . . . . . . 24
2.15 Choropleth map of population among Massachusetts Census tracts, based
on 2010 U.S. Census. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.16 A scatterplot with smoother demonstrating the relationship between tem-
perature and O-ring damage on solid rocket motors. The dots are semi-
transparent, so that darker dots indicate multiple observations with the
same values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.17 A recreation of Tufte’s scatterplot demonstrating the relationship between
temperature and O-ring damage on solid rocket motors. . . . . . . . . . . . 26


2.18 Reprints of two Morton Thiokol data graphics. [195] . . . . . . . . . . . . . 27


2.19 Still images from Forms, by Memo Akten and Quayola. Each image rep-
resents an athletic movement made by a competitor at the Commonwealth
Games, but reimagined as a collection of moving 3D digital objects. Reprinted
with permission. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1 Scatterplot using only the position aesthetic for glyphs. . . . . . . . . . . . 35


3.2 Scatterplot in which net users is mapped to color. . . . . . . . . . . . . . 35
3.3 Scatterplot using both location and label as aesthetics. . . . . . . . . . . . 36
3.4 Scatterplot in which net users is mapped to color and educ mapped to size.
Compare this graphic to Figure 3.6, which displays the same data using facets. 36
3.5 Scatterplot using a logarithmic transformation of GDP that helps to miti-
gate visual clustering caused by the right-skewed distribution of GDP among
countries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6 Scatterplot using facets for different ranges of Internet connectivity. . . . . 38
3.7 Bar graph of average charges for medical procedures in New Jersey. . . . . 40
3.8 Bar graph adding a second layer to provide a comparison of New Jersey to
other states. Each dot represents one state, while the bars represent New
Jersey. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.9 Histogram showing the distribution of Math SAT scores by state. . . . . . 41
3.10 Density plot showing the distribution of Math SAT scores by state. . . . . 42
3.11 A bar plot showing the distribution of Math SAT scores for a selection of
states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.12 A stacked bar plot showing the distribution of substance of abuse for par-
ticipants in the HELP study. Compare this to Figure 2.14. . . . . . . . . . 43
3.13 Scatterplot using the color aesthetic to separate the relationship between
two numeric variables by a third categorical variable. . . . . . . . . . . . . 44
3.14 Scatterplot using a facet_wrap() to separate the relationship between two
numeric variables by a third categorical variable. . . . . . . . . . . . . . . . 45
3.15 A scatterplot for 1,000 random individuals from the NHANES study. Note
how mapping gender to color illuminates the differences in height between
men and women. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.16 A time series showing the change in temperature at the MacLeish field
station in 2015. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.17 A box-and-whisker plot showing the distribution of foot length by gender
for 39 children. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.18 Mosaic plot (eikosogram) of diabetes by age and weight status (BMI). . . . 47
3.19 A choropleth map displaying oil production by countries around the world
in barrels per day . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.20 A network diagram displaying the relationship between types of cancer cell
lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.21 Popularity of the name “Joseph” as constructed by FiveThirtyEight. . . . 50
3.22 Recreation of the age distribution of “Joseph” plot . . . . . . . . . . . . . 53
3.23 Age distribution of American girls named “Josephine” . . . . . . . . . . . . 54
3.24 Comparison of the name “Jessie” across two genders . . . . . . . . . . . . 54
3.25 Gender breakdown for the three most “unisex” names . . . . . . . . . . . . 55
3.26 Gender breakdown for the three most “unisex” names, oriented vertically . 55
3.27 FiveThirtyEight’s depiction of the age ranges for the 25 most common female
names. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.28 Recreation of FiveThirtyEight’s plot of the age distributions for the 25 most
common women’s names . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.1 The filter() function. At left, a data frame that contains matching entries
in a certain column for only a subset of the rows. At right, the resulting
data frame after filtering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 The select() function. At left, a data frame, from which we retrieve only a
few of the columns. At right, the resulting data frame after selecting those
columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 The mutate() function. At left, a data frame. At right, the resulting data
frame after adding a new column. . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 The arrange() function. At left, a data frame with an ordinal variable. At
right, the resulting data frame after sorting the rows in descending order of
that variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 The summarize() function. At left, a data frame. At right, the resulting
data frame after aggregating three of the columns. . . . . . . . . . . . . . . 70

5.1 A graphical depiction of voter turnout in the different wards . . . . . . . . 96


5.2 Part of the codebook for the HELPrct data table from the mosaicData package. 99
5.3 Fit for the Pythagorean Winning Percentage model for all teams since 1954 111
5.4 Number of home runs hit by the team with the most home runs, 1916–2014 113
5.5 Distribution of best-fitting exponent across single seasons from 1961–2014 114
5.6 Bootstrap distribution of mean optimal exponent . . . . . . . . . . . . . . 115
5.7 Part of a page on mile-run world records from Wikipedia. Two separate
data tables are visible. You can’t tell from this small part of the page, but
there are seven tables altogether on the page. These two tables are the third
and fourth in the page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.8 The transcribers of OrdwayBirds from lab notebooks worked during different
time intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.9 Screenshot of Wikipedia’s list of Japanese nuclear reactors. . . . . . . . . . 126
5.10 Distribution of capacity of Japanese nuclear power plants over time . . . . 128

6.1 Reproduction of a data graphic reporting the number of gun deaths in


Florida over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2 A tweet by National Review on December 14, 2015 showing the change in
global temperature over time. . . . . . . . . . . . . . . . . . . . . . . . . . 133

7.1 The sampling distribution of the mean arrival delay with a sample size of
n = 25 (left) and also for a larger sample size of n = 100 (right). . . . . . . 154
7.2 Distribution of flight arrival delays in 2013 for flights to San Francisco from
NYC airports that were delayed less than seven hours. The distribution
features a long right tail (even after pruning the outliers). . . . . . . . . . . 159
7.3 Association of flight arrival delays with scheduled departure time for flights
to San Francisco from New York airports in 2013. . . . . . . . . . . . . . . 160
7.4 Scatterplot of average SAT scores versus average teacher salaries (in thou-
sands of dollars) for the 50 United States in 2010. . . . . . . . . . . . . . . 163
7.5 Scatterplot of average SAT scores versus average teacher salaries (in thou-
sands of dollars) for the 50 United States in 2010, stratified by the percentage
of students taking the SAT in each state. . . . . . . . . . . . . . . . . . . . 164

8.1 A single partition of the census data set using the capital.gain variable
to determine the split. Color and the vertical line at $5,095.50 in capital
gains tax indicate the split. If one paid more than this amount, one almost
certainly made more than $50,000 in income. On the other hand, if one paid
less than this amount in capital gains, one almost certainly made less than
$50,000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.2 Decision tree for income using the census data . . . . . . . . . . . . . . . . 178
8.3 Graphical depiction of the full recursive partitioning decision tree classifier 179
8.4 Performance of nearest neighbor classifier for different choices of k on census
training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.5 Visualization of an artificial neural network . . . . . . . . . . . . . . . . . . 187
8.6 ROC curve for naive Bayes model . . . . . . . . . . . . . . . . . . . . . . . 191
8.7 Performance of nearest neighbor classifier for different choices of k on census
training and testing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.8 Comparison of ROC curves across five models on the Census testing data . 197
8.9 Illustration of decision tree for diabetes . . . . . . . . . . . . . . . . . . . . 198
8.10 Scatterplot of age against BMI for individuals in the NHANES data set . . 199
8.11 Comparison of predictive models in the data space . . . . . . . . . . . . . . 202

9.1 An evolutionary tree for mammals. Source: [92] . . . . . . . . . . . . . . . 206


9.2 Distances between some U.S. cities. . . . . . . . . . . . . . . . . . . . . . . 208
9.3 A dendrogram constructed by hierarchical clustering from car-to-car dis-
tances implied by the Toyota fuel economy data . . . . . . . . . . . . . . . 209
9.4 The world’s 4,000 largest cities, clustered by the 6-means clustering algorithm 211
9.5 Visualization of the Scottish Parliament votes . . . . . . . . . . . . . . . . 213
9.6 Scottish Parliament votes for two ballots . . . . . . . . . . . . . . . . . . . 214
9.7 Scatterplot showing the correlation between Scottish Parliament votes in
two arbitrary collections of ballots . . . . . . . . . . . . . . . . . . . . . . . 215
9.8 Clustering members of Scottish Parliament based on SVD along the members 216
9.9 Clustering of Scottish Parliament ballots based on SVD along the ballots . 217
9.10 Illustration of the Scottish Parliament votes when ordered by the primary
vector of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

10.1 Comparing the variation in expression for individual probes across cell lines
in the NCI60 data (blue) and a simulation of a null hypothesis (red). . . . 224
10.2 Distribution of Sally and Joan arrival times (shaded area indicates where
they meet). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.3 True number of new jobs from simulation as well as three realizations from
a simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.4 Distribution of NYC restaurant health violation scores. . . . . . . . . . . . 230
10.5 Distribution of health violation scores under a randomization procedure. . 231
10.6 Convergence of the estimate of the proportion of times that Sally and Joan
meet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

11.1 ggplot2 depiction of the frequency of Beatles names over time . . . . . . . 245
11.2 A screenshot of the interactive plot of the frequency of Beatles names over
time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
11.3 A screenshot of the output of the DataTables package applied to the Beatles
names. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

11.4 A screenshot of the dygraphs display of the popularity of Beatles names


over time. In this screenshot, the years range from 1940 to 1980, but in the
live version, one can expand or contract that timespan. . . . . . . . . . . . 247
11.5 A screenshot of the streamgraph display of Beatles names over time. . . . 248
11.6 A screenshot of the ggvis display of the proportion and number of male
babies named “John” over time. . . . . . . . . . . . . . . . . . . . . . . . . 249
11.7 A screenshot of the Shiny app displaying babies with Beatles names. . . . 250
11.8 Comparison of two ggplot2 themes . . . . . . . . . . . . . . . . . . . . . . 252
11.9 Beatles plot with custom ggplot2 theme . . . . . . . . . . . . . . . . . . . 252
11.10 Beatles plot with customized mdsr theme . . . . . . . . . . . . . . . . . . . 253
11.11 Prevalence of Beatles names drawn in the style of an xkcd Web comic . . . 254
11.12 Nathan Yau’s Hot Dog Eating data graphic (reprinted with permission from
flowingdata.com). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
11.13 A simple bar graph of hot dog eating . . . . . . . . . . . . . . . . . . . . . 256
11.14 Recreating the hot dog graphic in R . . . . . . . . . . . . . . . . . . . . . . 258

12.1 FiveThirtyEight data graphic summarizing airline delays by carrier. Repro-


duced with permission. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.2 Re-creation of the FiveThirtyEight plot on flight delays . . . . . . . . . . . 294

14.1 John Snow’s original map of the 1854 Broad Street cholera outbreak. Source:
Wikipedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
14.2 A simple ggplot2 of the cholera deaths, with no context provided . . . . . 322
14.3 A modern-day map of the area surrounding Broad Street in London . . . . 323
14.4 The world according to the Mercator (left) and Gall–Peters (right) projections 325
14.5 The contiguous United States according to the Lambert conformal conic
(left) and Albers equal area (right) projections . . . . . . . . . . . . . . . . 326
14.6 Erroneous reproduction of John Snow’s original map of the 1854 cholera
outbreak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
14.7 Reproduction of John Snow’s original map of the 1854 cholera outbreak . . 329
14.8 The fastest route from Smith College to Amherst College . . . . . . . . . . 331
14.9 Alternative commuting routes from Ben’s old apartment in Brooklyn to Citi
Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
14.10 Static image from a leaflet plot of the White House. . . . . . . . . . . . 333
14.11 A basic map of the North Carolina congressional districts . . . . . . . . . . 338
14.12 Bichromatic choropleth map of the results of the 2012 congressional elections
in North Carolina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
14.13 Full color choropleth of the results of the 2012 congressional elections in
North Carolina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
14.14 Static image from a leaflet plot of the North Carolina congressional districts. 344
14.15 Airports served by Delta Airlines in 2006 . . . . . . . . . . . . . . . . . . . 347
14.16 Full route map for Delta Airlines in 2006 . . . . . . . . . . . . . . . . . . . 348
14.17 Static image from a leaflet plot of the historical Delta airlines route map. 350
14.18 U.S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
14.19 U.S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
14.20 Screenshot of the North Carolina congressional districts as rendered in Google
Earth, after exporting to KML. Compare with Figure 14.13. . . . . . . . . 354

15.1 Speaking parts in Macbeth for four major characters . . . . . . . . . . . . . 361


15.2 A word cloud of terms that appear in the abstracts of arXiv papers on data
science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366

15.3 Distribution of the number of characters in a sample of tweets . . . . . . . 371


15.4 Distribution of the number of retweets in a sample of tweets . . . . . . . . 372

16.1 Two Erdős–Rényi random graphs on 100 vertices with different values of p 379
16.2 Simulation of connectedness of ER random graphs on 1,000 vertices . . . . 380
16.3 Degree distribution for two random graphs . . . . . . . . . . . . . . . . . . 381
16.4 Visualization of Hollywood network for popular 2012 movies . . . . . . . . 385
16.5 Distribution of degrees for actors in the Hollywood network of popular 2012
movies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
16.6 The Hollywood network for popular 2012 movies, in ggplot2 . . . . . . . . 388
16.7 Atlantic 10 Conference network, NCAA men’s basketball, 1995–1996 . . . 396

B.1 Sample session in R. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423


B.2 Documentation on the mean() function. . . . . . . . . . . . . . . . . . . . . 425

C.1 Illustration of the location of the critical value for a 95% confidence interval
for a mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
C.2 Cauchy distribution (solid line) and t-distribution with 4 degrees of freedom
(dashed line). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
C.3 Running average for t-distribution with four degrees of freedom and a Cauchy
random variable (equivalent to a t-distribution with one degree of freedom).
Note that while the former converges, the latter does not. . . . . . . . . . . 448

D.1 Generating a new R Markdown file in RStudio . . . . . . . . . . . . . . . . 457


D.2 Sample R Markdown input file. . . . . . . . . . . . . . . . . . . . . . . . . . 458
D.3 Formatted output from R Markdown example. . . . . . . . . . . . . . . . . 460

E.1 Scatterplot of number of trail crossings as a function of highest daily tem-


perature (in degrees Fahrenheit). . . . . . . . . . . . . . . . . . . . . . . . . 467
E.2 At left, the model based on the overall average high temperature . . . . . . 468
E.3 Visualization of parallel slopes model for the rail trail data . . . . . . . . . 471
E.4 Visualization of interaction model for the rail trail data . . . . . . . . . . . 473
E.5 Scatterplot of height as a function of age with superimposed linear model
(blue) and smoother (green) . . . . . . . . . . . . . . . . . . . . . . . . . . 474
E.6 Scatterplot of volume as a function of high temperature with superimposed
linear and smooth models for the rail trail data . . . . . . . . . . . . . . . 475
E.7 Assessing linearity using a scatterplot of residuals versus fitted (predicted)
values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
E.8 Assessing normality assumption using a Q–Q plot . . . . . . . . . . . . . . 477
E.9 Assessing equal variance using a scale–location plot . . . . . . . . . . . . . 478
E.10 Cook’s distance for rail trail model . . . . . . . . . . . . . . . . . . . . . . 479
E.11 Scatterplot of diabetes as a function of age with superimposed smoother. . 480
E.12 Scatterplot of diabetes as a function of BMI with superimposed smoother. 480
E.13 Predicted probabilities for diabetes as a function of BMI and age . . . . . 481

F.1 Schematic of SQL-related R packages and their dependencies. . . . . . . . 493


Preface

Background and motivation


The increasing volume and sophistication of data poses new challenges for analysts, who
need to be able to transform complex data sets to answer important statistical questions.
The widely-cited McKinsey & Company report stated that “by 2018, the United States
alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well
as 1.5 million managers and analysts with the know-how to use the analysis of big data to
make effective decisions.” There is a pressing need for additional resources to train existing
analysts as well as the next generation to be able to pose questions, suggest hypotheses,
collect, transform, and analyze data, then communicate results. According to the online
company ratings site Glassdoor, “data scientist” was the best job in America in 2016 [142].
Statistics can be defined as the science of learning from data [203]. Michael Jordan has
described data science as the marriage of computational thinking and inferential thinking.
Without the skills to be able to “wrangle” the increasingly rich and complex data that
surround us, analysts will not be able to use these data to make better decisions.
New data technologies and database systems facilitate scraping and merging data from
different sources and formats and restructuring it into a form suitable for analysis. State-of-
the-art workflow tools foster well-documented and reproducible analysis. Modern statistical
methods allow the analyst to fit and assess models as well as to undertake supervised or
unsupervised learning to extract information. Contemporary data science requires tight
integration of these statistical, computing, data-related, and communication skills.
The book is intended for readers to develop and reinforce the appropriate skills to tackle
complex data science projects and “think with data” (as coined by Diane Lambert). The
ability to solve problems using data is at the heart of our approach.
We feature a series of complex, real-world extended case studies and examples from a
broad range of application areas, including politics, transportation, sports, environmental
science, public health, social media, and entertainment. These rich data sets require the
use of sophisticated data extraction techniques, modern data visualization approaches, and
refined computational approaches.
It is impossible to cover all these topics in any level of detail within a single book: Many
of the chapters could productively form the basis for a course or series of courses. Our
goal is to lay a foundation for analysis of real-world data and to ensure that analysts see
the power of statistics and data analysis. After reading this book, readers will have greatly
expanded their skill set for working with these data, and should have a newfound confidence
about their ability to learn new technologies on-the-fly.

Key role of technology


While many tools can be used effectively to undertake data science, and the technologies to
undertake analyses are quickly changing, R and Python have emerged as two powerful and

xxiii

extensible environments. While it is important for data scientists to be able to use multiple
technologies for their analyses, we have chosen to focus on the use of R and RStudio to
avoid cognitive overload. By use of a “Less Volume, More Creativity” approach [162], we
intend to develop a small set of tools that can be mastered within the confines of a single
semester and that facilitate sophisticated data management and exploration.
We take full advantage of the RStudio environment. This powerful and easy-to-use front
end adds innumerable features to R including package support, code-completion, integrated
help, a debugger, and other coding tools. In our experience, the use of RStudio dramati-
cally increases the productivity of R users, and by tightly integrating reproducible analysis
tools, helps avoid error-prone “cut-and-paste” workflows. Our students and colleagues find
RStudio an extremely comfortable interface. No prior knowledge or experience with R or
RStudio is required: we include an introduction within the Appendix.
We used a reproducible analysis system (knitr) to generate the example code and output
in this book. Code extracted from these files is provided on the book’s website. We provide
a detailed discussion of the philosophy and use of these systems. In particular, we feel that
the knitr and markdown packages for R, which are tightly integrated with RStudio, should
become a part of every R user’s toolbox. We can’t imagine working on a project without
them (and we’ve incorporated reproducibility into all of our courses).
Modern data science is a team sport. To be able to fully engage, analysts must be able
to pose a question, seek out data to address it, ingest this into a computing environment,
model and explore, then communicate results. This is an iterative process that requires a
blend of statistics and computing skills.
Context is king for such questions, and we have structured the book to foster the parallel
developments of statistical thinking, data-related skills, and communication. Each chapter
focuses on a different extended example with diverse applications, while exercises allow for
the development and refinement of the skills learned in that chapter.

Intended audiences

This book was originally conceived to support a one-semester, 13-week upper-level course
in data science. We also intend that the book will be useful for more advanced students in
related disciplines, or analysts who want to bolster their data science skills. The book is in-
tended to be accessible to a general audience with some background in statistics (completion
of an introductory statistics course).
In addition to many examples and extended case studies, the book incorporates exercises
at the end of each chapter. Many of the exercises are quite open-ended, and are designed
to allow students to explore their creativity in tackling data science questions.
The book has been structured with three main sections plus supplementary appendices.
Part I provides an introduction to data science, an introduction to visualization, a foun-
dation for data management (or ‘wrangling’), and ethics. Part II extends key modeling
notions including regression modeling, classification and prediction, statistical foundations,
and simulation. Part III introduces more advanced topics, including interactive data visu-
alization, SQL and relational databases, spatial data, text mining, and network science.
We conclude with appendices that introduce the book’s R package, R and RStudio, key
aspects of algorithmic thinking, reproducible analysis, a review of regression, and how to
set up a local SQL database.
We have provided two indices: one organized by subject and the other organized by R
function and package. In addition, the book features extensive cross-referencing (given the
inherent connections between topics and approaches).

Website
The book website at https://mdsr-book.github.io includes the table of contents, subject
and R indices, example datasets, code samples, exercises, additional activities, and a list of
errata.

How to use this book


The material from this book has supported several courses to date at Amherst, Smith, and
Macalester Colleges. This includes an intermediate course in data science (2013 and 2014
at Smith), an introductory course in data science (2016 at Smith), and a capstone course in
advanced data analysis (2015 and 2016 at Amherst). The intermediate data science course
required an introductory statistics course and some programming experience, and discussed
much of the material in this book in one semester, culminating with an integrated final
project [20]. The introductory data science course had no prerequisites and included the
following subset of material:
• Data Visualization: three weeks, covering Chapters 2 and 3
• Data Wrangling: four weeks, covering Chapters 4 and 5
• Database Querying: two weeks, covering Chapter 12
• Spatial Data: two weeks, covering Chapter 14
• Text Mining: two weeks, covering Chapter 15
The capstone course covered the following material:
• Data Visualization: two weeks, covering Chapters 2, 3, and 11
• Data Wrangling: two weeks, covering Chapters 4 and 5
• Ethics: one week, covering Chapter 6
• Simulation: one week, covering Chapter 10
• Statistical Learning: two weeks, covering Chapters 8 and 9
• Databases: one week, covering Chapter 12 and Appendix F
• Text Mining: one week, covering Chapter 15
• Spatial Data: one week, covering Chapter 14
• Big Data: one week, covering Chapter 17
We anticipate that this book could serve as the primary text for a variety of other
courses, with or without additional supplementary material.
The content in Part I—particularly the ggplot2 visualization concepts presented in
Chapter 3 and the dplyr data wrangling operations presented in Chapter 4—is fundamental
and is assumed in Parts II and III. Each of the chapters in Part III are independent of each
other and the material in Part II. Thus, while most instructors will want to cover most (if
not all) of Part I in any course, the material in Parts II and III can be added with almost
total freedom.
The material in Part II is designed to expose students with a beginner’s understanding of
statistics (i.e., basic inference and linear regression) to a richer world of statistical modeling
and statistical inference.

Acknowledgments
We would like to thank John Kimmel at Informa CRC/Chapman and Hall for his support
and guidance. We also thank Jim Albert, Nancy Boynton, Jon Caris, Mine Çetinkaya–
Rundel, Jonathan Che, Patrick Frenett, Scott Gilman, Johanna Hardin, John Horton, Azka
Javaid, Andrew Kim, Eunice Kim, Caroline Kusiak, Ken Kleinman, Priscilla (Wencong) Li,
Amelia McNamara, Tasheena Narraido, Melody Owen, Randall Pruim, Tanya Riseman,
Gabriel Sosa, Katie St. Clair, Amy Wagaman, Susan (Xiaofei) Wang, Hadley Wickham, J.
J. Allaire and the RStudio developers, the anonymous reviewers, the Spring 2015 SDS192
class, the Fall 2016 STAT495 class, and many others for contributions to the R and RStudio
environment, comments, guidance, and/or helpful suggestions on drafts of the manuscript.
Above all we greatly appreciate Cory, Maya, and Julia for their patience and support.

Northampton, MA and St. Paul, MN


December 2016
Part I

Introduction to Data Science

Chapter 1

Prologue: Why data science?

Information is what we want, but data are what we’ve got. The techniques for transforming
data into information go back hundreds of years. A good starting point is 1592, when London
began publishing the weekly “bills of mortality” that John Graunt would later compile and
analyze. (See Figure 1.1.) These
“bills” are tabulations—a condensation of data on individual events into a form more readily
assimilated by the human reader. Constructing such tabulations was a manual operation.
Over the centuries, as data became larger, machines were introduced to speed up the
tabulations. A major step was Herman Hollerith’s development of punched cards and an
electrical tabulating system for the United States Census of 1890. This was so successful
that Hollerith started a company, International Business Machines Corporation (IBM), that
came to play an important role in the development of today’s electronic computers.
Also in the late 19th century, statistical methods began to develop rapidly. These meth-
ods have been tremendously important in interpreting data, but they were not intrinsically
tied to mechanical data processing. Generations of students have learned to carry out
statistical operations by hand on small sets of data.
Nowadays, it is common to have data sets that are so large they can be processed only
by machine. In this era of “big data,” data are amassed by networks of instruments and
computers. The settings where such data arise are diverse: the genome, satellite observa-
tions of Earth, entries by web users, sales transactions, etc. There are new opportunities
for finding and characterizing patterns using techniques described as data mining, ma-
chine learning, data visualization, and so on. Such techniques require computer processing.
Among the tasks that need performing are data cleaning, combining data from multiple
sources, and reshaping data into a form suitable as input to data-summarization operations
for visualization and modeling.
In writing this book we hope to help people gain the understanding and skills for data
wrangling (a process of preparing data for visualization and other modern techniques of sta-
tistical interpretation) and using those data to answer statistical questions via modeling and
visualization. Doing so inevitably involves, at the center, the ability to reason statistically
and utilize computational and algorithmic capacities.
Is an extended study of computer programming necessary to engage in sophisticated
computing? Our view is that it is not. First, over the last half century, a coherent set
of simple data operations have been developed that can be used as the building blocks of
sophisticated data wrangling processes. The trick is not mastering programming but rather
learning to think in terms of these operations. Much of this book is intended to help you
master such thinking.
Second, it is possible to use recent developments in software to vastly reduce the amount
of programming needed to use these data operations. We have drawn on such software—


Figure 1.1: Excerpt from Graunt’s bills of mortality: (a) title page; (b) excerpt on the plague.

particularly R and the packages dplyr and ggplot2—to focus on a small subset of functions
that accomplish data wrangling tasks in a concise and expressive way. The programming
syntax is consistent enough that, with a little practice, you should be able to adapt the
code contained in this book to solve your own problems. (Experienced R programmers will
note the distinctive style of R statements in this book, including a consistent focus on a
small set of functions and extensive use of the “pipe” operator.) Part I of this book focuses
on data wrangling and data visualization as key building blocks for data science.
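As a small, hedged taste of this style (using the built-in mtcars data set purely for illustration, rather than any of the case studies developed later), the sketch below chains a few dplyr verbs together with the pipe operator:

library(dplyr)

# take mtcars, group it by number of cylinders, then compute the
# average fuel economy and the number of cars within each group
mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), n_cars = n())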

1.1 What is data science?


We hold a broad view of data science—we see it as the science of extracting meaningful
information from data. There are several key ideas embedded in that simple definition.
First, data science is a science, a rigorous discipline combining elements of statistics and
computer science, with roots in mathematics. Michael Jordan from the University of Cali-
fornia, Berkeley has described data science as a fine-grained blend of intellectual traditions
from statistics and computer science:

Computer science is more than just programming; it is the creation of appropriate
abstractions to express computational structures and the development of algorithms
that operate on those abstractions. Similarly, statistics is more than just collections
of estimators and tests; it is the interplay of general notions of sampling, models,
distributions and decision-making. [Data science] is based on the idea that these
styles of thinking support each other [159].

Second, data science is best applied in the context of expert knowledge about the domain
from which the data originate. This domain might be anything from astronomy to zoology;
business and health care are two particularly important domain areas. Third, the distinction
between data and information is the raison d’etre of data science. Data scientists are people
who are interested in converting the data that is now abundant into actionable information
that always seems to be scarce.
Many statisticians will say: “But we already have a field for that: it’s called statistics!”
The goals of data scientists and statisticians are the same: They both want to extract
meaningful information from data. Much of statistical technique was originally developed
in an environment where data were scarce and difficult or expensive to collect, so statisticians
focused on creating methods that would maximize the strength of inference one is able to
make, given the least amount of data. These techniques were often ingenious, involved
sophisticated mathematics, and have proven invaluable to the empirical sciences for going
on a century. While several of the most influential early statisticians saw computing as an
integral part of statistics, it is also true that much of the development of statistical theory
was to find mathematical approximations for things that we couldn’t yet compute [56].
Today, the manner in which we extract meaning from data is different in two ways—both
due primarily to advances in computing:

1. we are able to compute many more things than we could before, and;
2. we have a lot more data than we had before.

The first change means that some of the techniques that were ubiquitous in statistics ed-
ucation in the 20th century (e.g., t-tests, ANOVA) are being replaced by computational
techniques that are conceptually simpler, but were simply infeasible until the microcom-
puter revolution (e.g., the bootstrap, permutation tests). The second change means that
many of the data we now collect are observational—they don’t come from a designed experi-
ment and they aren’t really sampled at random. This makes developing realistic probability
models for these data much more challenging, which in turn makes formal statistical infer-
ence a more challenging (and perhaps less relevant) problem. In some settings (e.g., clinical
trials and A/B testing) the careful estimation of a model parameter is still the goal, and
inferential statistics are still the primary tools of the trade. But in an array of academic,
government, and industrial settings, the end result may instead be a predictive model, an
interactive visualization of the data, or a web application that allows the user to slice-and-
dice the data to make simple comparisons. We explore issues related to statistical inference
and modeling in greater depth in Part II of this book.
The increasing complexity and heterogeneity of modern data means that each data
analysis project needs to be custom-built. Simply put, the modern data analyst needs to
be able to read and write computer instructions, the so-called “code” from which data
analysis projects are built. Part I of this book develops foundational abilities in data
visualization and data wrangling—two essential skills for the modern data scientist. These
chapters focus on the traditional two-dimensional representation of data: rows and columns
in a data table, and horizontal and vertical in a data graphic. In Part III, we explore a
variety of non-traditional data types (e.g., spatial, text, network, “big”) and interactive
data graphics.
As you work through this book, you will develop computational skills that we describe
as “precursors” to big data [107]. In Chapter 17, we point to some tools for working with
truly big data. One has to learn to crawl before one can walk, and we argue that for most

people the skills developed herein are more germane to the kinds of problems that you are
likely to encounter.

1.2 Case study: The evolution of sabermetrics


The evolution of baseball analytics (often called sabermetrics) in many ways recapitulates
the evolution of analytics in other domains. Although domain knowledge is always useful
in data science, no background in baseball is required for this section.1
The use of statistics in baseball has a long and storied history—in part because the
game itself is naturally discrete, and in part because Henry Chadwick began publishing
boxscores in the latter half of the 19th century [184]. For these reasons, a rich catalog of baseball data began
to accumulate.
However, while more and more baseball data were piling up, analysis of that data was
not so prevalent. That is, the extant data provided a means to keep records, and as a result
some numerical elements of the game’s history took on a life of their own (e.g., Babe Ruth’s
714 home runs). But it is not as clear how much people were learning about the game of
baseball from the data. Knowing that Babe Ruth hit more home runs than Mel Ott tells us
something about two players, but doesn’t provide any insight into the nature of the game
itself.
In 1947—Jackie Robinson’s rookie season—Brooklyn Dodgers’ general manager Branch
Rickey made another significant innovation: He hired Allan Roth to be baseball’s first
statistical analyst. Roth’s analysis of baseball data led to insights that the Dodgers used
to win more games. In particular, Roth convinced Rickey that a measurement of how
often a batter reaches first base via any means (e.g., hit, walk, or being hit by the pitch)
was a better indicator of that batter’s value than how often he reaches first base via a hit
(which was—and probably still is—the most commonly cited batting statistic). The logic
supporting this insight was based on both Roth’s understanding of the game of baseball
(what we call domain knowledge) and his statistical analysis of baseball data.
During the next 50 years, many important contributions to baseball analytics were made
by a variety of people (most notably “The Godfather of Sabermetrics” Bill James [119]),
most of whom had little formal training in statistics, whose weapon of choice was a spread-
sheet. They were able to use their creativity, domain knowledge, and a keen sense of what
the interesting questions were to make interesting discoveries.
The 2003 publication of Moneyball [131]—which showcased how Billy Beane and Paul
DePodesta used statistical analysis to run the Oakland A’s—triggered a revolution in how
front offices in baseball were managed [27]. Over the next decade, the size of the data
expanded so rapidly that a spreadsheet was no longer a viable mechanism for storing—let
alone analyzing—all of the available data. Today, many professional sports teams have
research and development groups headed by people with Ph.D.’s in statistics or computer
science along with graduate training in machine learning [16]. This is not surprising given
that revenue estimates for major league baseball top $8 billion per year.
The contributions made by the next generation of baseball analysts will require coding
ability. The creativity and domain knowledge that fueled the work of Allan Roth and Bill
James remain necessary traits for success, but they are no longer sufficient. There is nothing
special about baseball in this respect—a similar profusion of data are now available in many
other areas, including astronomy, health services research, genomics, and climate change,
1 The main rules of baseball are these: Two teams of nine players alternate trying to score runs on a

field with four bases (first base, second base, third base, or home). The defensive team pitches while one
member of the offensive team bats while standing by home base. A run is scored when an offensive player
crosses home plate after advancing in order through the other bases.

among others. For data scientists of all application domains, creativity, domain knowledge,
and technical ability are absolutely essential.

1.3 Datasets
There are many data sets used in this book. The smaller ones are available through either
the mdsr (see Appendix A) or mosaic packages for R. Some other data used in this book are
pulled directly from the Internet—URLs for these data are embedded in the text. There
a few larger, more complicated data sets that we use repeatedly and that warrant some
explication here.

Airline Delays The U.S. Bureau of Transportation Statistics has collected data on more
than 169 million domestic flights dating back to October 1987. We have developed the
airlines package to allow R users to download and process these data with minimal
hassle. (Instructions as to how to set up a database can be found in Appendix F.)
These data were originally used for the 2009 ASA Data Expo [213]. The nycflights13
package contains a subset of these data (only flights leaving the three most prominent
New York City airports in 2013).

Baseball The Lahman database is maintained by Sean Lahman, a self-described database


journalist. Compiled by a team of volunteers, it contains complete seasonal records
going back to 1871 and is usually updated yearly. It is available for download both
as a pre-packaged SQL file and as an R package [80].

Baby Names The babynames package for R provides data about the popularity of indi-
vidual baby names from the U.S. Social Security Administration [221]. These data
can be used, for example, to track the popularity of certain names over time.

Federal Election Commission The fec package provides access to campaign spending
data for recent federal elections maintained by the Federal Election Commission.
These data include contributions by individuals to committees, spending by those
committees on behalf, or against individual candidates for president, the Senate, and
the House of Representatives, as well information about those committees and candi-
dates.

MacLeish The Ada and Archibald MacLeish Field Station is a 260-acre plot of land owned
and operated by Smith College. It is used by faculty, students, and members of the
local community for environmental research, outdoor activities, and recreation. The
macleish R package allows you to download and process weather data (as a time
series) from the MacLeish Field Station using the etl framework. It also contains
shapefiles for contextualizing spatial information.

Movies The Internet Movie Database is a massive repository of information about movies
[117]. The easiest way to get the IMDb data into SQL is by using the open-source
IMDbPY Python package [1].

Restaurant Violations The mdsr package contains data on restaurant health inspections
made by the New York City Health Department.

Twitter The micro-blogging social networking service Twitter has an application program-
ming interface (API), accessed using the twitteR package, that can be used to retrieve
short 140-character messages (called tweets) along with retweets and responses.
Approximately 500 million tweets are shared daily on the service.
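As a hedged sketch of how these packages are typically accessed (assuming they have been installed from CRAN, and using only the object names documented by each package), one might load a few of them and take a quick look:

library(dplyr)
library(mdsr)          # the package that accompanies this book
library(babynames)     # baby name popularity data
library(nycflights13)  # flights leaving the New York City airports in 2013

glimpse(babynames)     # one row per name, sex, and year
nrow(flights)          # number of flights in the 2013 subset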

1.4 Further resources


Each chapter features a list of additional resources that provide further depth or serve as a
definitive reference for a given topic. Other definitions of data science and analytics can be
found in [158, 64, 57, 109, 95, 77, 160, 54].
Chapter 2

Data visualization

Data graphics provide one of the most accessible, compelling, and expressive modes to
investigate and depict patterns in data. This chapter will motivate the importance of well-
designed data graphics and describe a taxonomy for understanding their composition. If
you are seeing this material for the first time, you will never look at data graphics the same
way again—yours will soon be a more critical lens.

2.1 The 2012 federal election cycle


Every four years, the presidential election draws an enormous amount of interest in the
United States. The most prominent candidates announce their candidacy nearly two years
before the November elections, beginning the process of raising the hundreds of millions
of dollars necessary to orchestrate a national campaign. In many ways, the experience
of running a successful presidential campaign is in itself evidence of the leadership and
organizational skills necessary to be commander-in-chief.
Voices from all parts of the political spectrum are critical of the influence of money
upon political campaigns. While the contributions from individual citizens to individual
candidates are limited in various ways, the Supreme Court’s decision in Citizens United v.
Federal Election Commission allows unlimited political spending by corporations (non-profit
or otherwise). This has resulted in a system of committees (most notably, political action
committees (PACs)) that can accept unlimited contributions and spend them on behalf of
(or against) a particular candidate or set of candidates. Unraveling the complicated network
of campaign spending is a subject of great interest.
To perform that unraveling is an exercise in data science. The Federal Election Commis-
sion (FEC) maintains a website with logs of not only all of the ($200 or more) contributions
made by individuals to candidates and committees, but also of spending by committees
on behalf of (and against) candidates. Of course, the FEC also maintains data on which
candidates win elections, and by how much. These data sources are separate and it requires
some ingenuity to piece them together. We will develop these skills in Chapters 4 and 5,
but for now, we will focus on graphical displays of the information that can be gleaned
from these data. Our emphasis at this stage is on making intelligent decisions about how
to display certain data, so that a clear (and correct) message is delivered.
Among the most basic questions is: How much money did each candidate raise? How-
ever, the convoluted campaign finance network makes even this simple question difficult to
answer, and—perhaps more importantly—less meaningful than we might think. A better
question is: On whose candidacy was the most money spent? In Figure 2.1, we show a bar



Figure 2.1: Amount of money spent on individual candidates in the general election phase
of the 2012 federal election cycle, in millions of dollars. Candidacies with at least four
million dollars in spending are depicted.

graph of the amount of money (in millions of dollars) that were spent by committees on
particular candidates during the general election phase of the 2012 federal election cycle.
This includes candidates for president, the Senate, and the House of Representatives. Only
candidates on whose campaign at least $4 million was spent are included in Figure 2.1.
It seems clear from Figure 2.1 that President Barack Obama’s re-election campaign spent
far more money than any other candidate, in particular more than doubling the amount
of money spent by his Republican challenger, Mitt Romney. However, committees are not
limited to spending money in support of a candidate—they can also spend money against
a particular candidate (i.e., on attack ads). In Figure 2.2 we separate the same spending
shown in Figure 2.1 by whether the money was spent for or against the candidate.
In these elections, most of the money was spent against each candidate, and in particular,
$251 million of the $274 million spent on President Obama’s campaign was spent against
his candidacy. Similarly, most of the money spent on Mitt Romney’s campaign was against
him, but the percentage of negative spending on Romney’s campaign (70%) was lower than
that of Obama (92%).
The difference between Figure 2.1 and Figure 2.2 is that in the latter we have used color
to bring a third variable (type of spending) into the plot. This allows us to make a clear
comparison that importantly changes the conclusions we might draw from the former plot.
In particular, Figure 2.1 makes it appear as though President Obama’s war chest dwarfed
that of Romney, when in fact the opposite was true.
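Although the ggplot2 grammar is not introduced until Chapter 3, a minimal sketch may help convey how color brings the third variable into the plot. The tiny data frame below is invented for illustration (loosely echoing the totals quoted in the text) and is not the actual FEC data behind Figures 2.1 and 2.2:

library(ggplot2)

# illustrative spending totals in millions of USD (not the real FEC values)
spending <- data.frame(
  candidate = rep(c("Obama", "Romney"), each = 2),
  type = rep(c("against", "supporting"), times = 2),
  millions = c(251, 23, 95, 40)
)

# mapping fill to the type of spending adds a third variable via color
ggplot(spending, aes(x = candidate, y = millions, fill = type)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Money Spent (millions of USD)")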

2.1.1 Are these two groups different?


Since so much more money was spent attacking Obama’s campaign than Romney’s, you
might conclude from Figure 2.2 that Republicans were more successful in fundraising during
this election cycle. In Figure 2.3 we can confirm that this was indeed the case, since more
money was spent supporting Republican candidates than Democrats, and more money was
spent attacking Democratic candidates than Republicans. It also seems clear from Figure 2.3
that nearly all of the money was spent on either Democrats or Republicans.


Figure 2.2: Amount of money spent on individual candidates in the general election phase
of the 2012 federal election cycle, in millions of dollars, broken down by type of spending.
Candidacies with at least four million dollars in spending are depicted.

However, the question of whether the money spent on candidates really differed by party
affiliation is a bit thornier. As we saw above, the presidential election dominated the political
donations in this election cycle. Romney faced a serious disadvantage in trying to unseat
an incumbent president. In this case, the office being sought is a confounding variable. By
further subdividing the contributions in Figure 2.3 by the office being sought, we can see
in Figure 2.4 that while more money was spent supporting Republican candidates for all
three houses of government, it was only in the presidential election that more money was
spent attacking Democratic candidates. In fact, slightly more money was spent attacking
Republican House and Senate candidates.
Note that Figures 2.3 and 2.4 display the same data. In Figure 2.4 we have an additional
variable that provides an important clue into the mystery of campaign finance. Our choice
to include that variable results in Figure 2.4 conveying substantially more meaning than
Figure 2.3, even though both figures are “correct.” In this chapter, we will begin to develop
a framework for creating principled data graphics.

2.1.2 Graphing variation


One theme that arose during the presidential election was the allegation that Romney’s
campaign was supported by a few rich donors, whereas Obama’s support came from people
across the economic spectrum. If this were true, then we would expect to see a difference in
the distribution of donation amounts between the two candidates. In particular, we would
expect to see this in the histograms shown in Figure 2.5, which summarize the more than
one million donations made by individuals to the two major committees that supported
each candidate (for Obama, Obama for America, and the Obama Victory Fund 2012; for
Romney, Romney for President, and Romney Victory 2012). We do see some evidence for
this claim in Figure 2.5: Obama did appear to receive more small donations, but the
evidence is far from conclusive. One problem is that both candidates received many small
donations but just a few larger donations; the scale on the horizontal axis makes it difficult
to actually see what is going on. Secondly, the histograms are hard to compare in a side-


Figure 2.3: Amount of money spent on individual candidacies by political party affiliation
during the general election phase of the 2012 federal election cycle.

by-side placement. Finally, we have lumped all of the donations from both phases of the
presidential election (i.e., primary vs. general) together.
In Figure 2.6, we remedy these issues by (1) using density curves instead of histograms,
so that we can compare the distributions directly, (2) plotting the logarithm of the donation
amount on the horizontal scale to focus on the data that are important, and (3) separating
the donations by the phase of the election. Figure 2.6 allows us to make more nuanced
conclusions. The right panel supports the allegation that Obama’s donations came from
a broader base during the primary election phase. It does appear that more of Obama’s
donations came in smaller amounts during this phase of the election. However, in the
general phase, there is virtually no difference in the distribution of donations made to
either campaign.
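A hedged ggplot2 sketch of this strategy appears below. Because the individual FEC donation records are not reproduced here, it uses simulated amounts and invented column names (amount, candidate, phase), but the three remedies map directly onto the code:

library(ggplot2)

# simulated donation amounts standing in for the individual FEC records
set.seed(1)
donations <- data.frame(
  amount = exp(rnorm(4000, mean = log(100), sd = 1.5)),
  candidate = rep(c("Obama", "Romney"), each = 2000),
  phase = rep(c("General", "Primary"), times = 2000)
)

# (1) density curves, (2) a logarithmic horizontal scale,
# (3) small multiples for the phase of the election
ggplot(donations, aes(x = amount, color = candidate)) +
  geom_density() +
  scale_x_log10() +
  facet_wrap(~ phase) +
  labs(x = "Amount of Donation (USD)", y = "density")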

2.1.3 Examining relationships among variables


Naturally, the biggest questions raised by the Citizens United decision are about the in-
fluence of money in elections. If campaign spending is unlimited, does this mean that the
candidate who generates the most spending on their behalf will earn the most votes? One
way that we might address this question is to compare the amount of money spent on each
candidate in each election with the number of votes that candidate earned. Statisticians
will want to know the correlation between these two quantities—when one is high, is the
other one likely to be high as well?
Since all 435 members of the United States House of Representatives are elected every
two years, and the districts contain roughly the same number of people, House elections
provide a nice data set to make this type of comparison. In Figure 2.7, we show a simple
scatterplot relating the number of dollars spent on behalf of the Democratic candidate
against the number of votes that candidate earned for each of the House elections.
The relationship between the two quantities depicted in Figure 2.7 is very weak. It does
not appear that candidates who benefited more from campaign spending earned more votes.
However, the comparison in Figure 2.7 is misleading. On both axes, it is not the amount that
is important, but the percentage. Although the population of each congressional district is


Figure 2.4: Amount of money spent on individual candidacies by political party affiliation
during the general election phase of the 2012 federal election cycle, broken down by office
being sought.

similar, they are not the same, and voter turnout will vary based on a variety of factors. By
comparing the percentage of the vote, we can control for the size of the voting population in
each district. Similarly, it makes less sense to focus on the total amount of money spent, as
opposed to the percentage of money spent. In Figure 2.8 we present the same comparison,
but with both axes scaled to percentages.
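The sketch below uses simulated district-level values rather than the actual spending and vote totals, but it illustrates how Figure 2.8 layers two additional visual cues onto the scatterplot: dot size for total spending and transparency for turnout.

library(ggplot2)

# simulated stand-in for the 435 House districts (not the real FEC data)
set.seed(2)
districts <- data.frame(
  pct_money_dem = runif(435),
  pct_votes_dem = runif(435),
  total_spent = rexp(435, rate = 1 / 1e6),
  total_votes = round(runif(435, 1e5, 4e5))
)

# position encodes the two percentages; size and alpha add two more variables
ggplot(districts, aes(x = pct_money_dem, y = pct_votes_dem)) +
  geom_point(aes(size = total_spent, alpha = total_votes)) +
  labs(x = "Percentage of Money supporting Democratic candidate",
       y = "Percentage of Votes Earned by Democratic candidate")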
Figure 2.8 captures many nuances that were impossible to see in Figure 2.7. First,
there does appear to be a positive association between the percentage of money supporting
a candidate and the percentage of votes that they earn. However, that relationship is
of greatest interest towards the center of the plot, where elections are actually contested.
Outside of this region, one candidate wins more than 55% of the vote. In this case, there is
usually very little money spent. These are considered “safe” House elections—you can see
these points on the plot because most of them are close to x = 0 or x = 1, and the dots are
very small. For example, in the lower right corner is the 8th district in Ohio, which was won
by the then-current Speaker of the House John Boehner, who ran unopposed. The election
in which the most money was spent (over $11 million) was also in Ohio. In the 16th district,
Republican incumbent Jim Renacci narrowly defeated Democratic challenger Betty Sutton,
who was herself an incumbent from the 13th district. This battle was made possible through
decennial redistricting (see Chapter 14). Of the money spent in this election, 51.2% was in
support of Sutton but she earned only 48.0% of the votes.
In the center of the plot, the dots are bigger, indicating that more money is being spent
on these contested elections. Of course this makes sense, since candidates who are fighting
for their political lives are more likely to fundraise aggressively. Nevertheless, the evidence
that more financial support correlates with more votes in contested elections is relatively
weak.

2.1.4 Networks
Not all relationships among variables are sensibly expressed by a scatterplot. Another way
in which variables can be related is in the form of a network (we will discuss these in more


Figure 2.5: Donations made by individuals to the PACs supporting the two major presi-
dential candidates in the 2012 election.

detail in Chapter 16). In this case, campaign funding has a network structure in which
individuals donate money to committees, and committees then spend money on behalf of
candidates. While the national campaign funding network is far too complex to show here,
in Figure 2.9 we display the funding network for candidates from Massachusetts.
In Figure 2.9, we see that the two campaigns that benefited the most from committee
spending were Republicans Mitt Romney and Scott Brown. This is not surprising, since
Romney was running for president, and received massive donations from the Republican
National Committee, while Brown was running to keep his Senate seat in a heavily Demo-
cratic state against a strong challenger, Elizabeth Warren. Both men lost their elections.
The constellation of blue dots are the congressional delegation from Massachusetts, all of
whom are Democrats.

2.2 Composing data graphics


Former New York Times intern and FlowingData.com creator Nathan Yau makes the anal-
ogy that creating data graphics is like cooking: Anyone can learn to type graphical com-
mands and generate plots on the computer. Similarly, anyone can heat up food in a mi-
crowave. What separates a high-quality visualization from a plain one are the same elements
that separate great chefs from novices: mastery of their tools, knowledge of their ingredients,
insight, and creativity [243]. In this section, we present a framework—rooted in scientific
research—for understanding data graphics. Our hope is that by internalizing these ideas
you will refine your data graphics palette.

2.2.1 A taxonomy for data graphics


The taxonomy presented in [243] provides a systematic way of thinking about how data
graphics convey specific pieces of information, and how they could be improved. A com-
plementary grammar of graphics [238] is implemented by Hadley Wickham in the ggplot2
graphics package [212], albeit using slightly different terminology. For clarity, we will post-


Figure 2.6: Donations made by individuals to the PACs supporting the two major presi-
dential candidates in the 2012 election, separated by election phase.

pone discussion of ggplot2 until Chapter 3. (To extend our cooking analogy, you must
learn to taste before you can learn to cook well.)
In this framework, data graphics can be understood in terms of four basic elements:
visual cues, coordinate system, scale, and context. In what follows we explicate this vision
and append a few additional items (facets and layers). This section should equip the careful
reader with the ability to systematically break down data graphics, enabling a more critical
analysis of their content.

Visual Cues
Visual cues are graphical elements that draw the eye to what you want your audience to
focus upon. They are the fundamental building blocks of data graphics, and the choice of
which visual cues to use to represent which quantities is the central question for the data
graphic composer. Yau identifies nine distinct visual cues, for which we also list whether
that cue is used to encode a numerical or categorical quantity:
Position (numerical) where in relation to other things?
Length (numerical) how big (in one dimension)?
Angle (numerical) how wide? parallel to something else?
Direction (numerical) at what slope? In a time series, going up or down?
Shape (categorical) belonging to which group?
Area (numerical) how big (in two dimensions)?
Volume (numerical) how big (in three dimensions)?
Shade (either) to what extent? how severely?
Color (either) to what extent? how severely? Beware of red/green color blindness (see
Section 2.2.2)


Figure 2.7: Scatterplot illustrating the relationship between number of dollars spent sup-
porting and number of votes earned by Democrats in 2012 elections for the House of Rep-
resentatives.


Figure 2.8: Scatterplot illustrating the relationship between percentage of dollars spent sup-
porting and percentage of votes earned by Democrats in the 2012 House of Representatives
elections. Each dot represents one district. The size of each dot is proportional to the total
spending in that election, and the alpha transparency of each dot is proportional to the
total number of votes in that district.
Figure 2.9: Campaign funding network for candidates from Massachusetts, 2012 federal
elections. Each edge represents a contribution from a PAC to a candidate.

Research into graphical perception (dating back to the mid-1980s) has shown that human
beings’ ability to perceive differences in magnitude accurately descends in this order [55].
That is, humans are quite good at accurately perceiving differences in position (e.g., how
much taller one bar is than another), but not as good at perceiving differences in angles.
This is one reason why many people prefer bar charts to pie charts. Our relatively poor
ability to perceive differences in color is a major factor in the relatively low opinion of heat
maps that many data scientists have.
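As a hedged illustration of how several of these cues correspond to aesthetic mappings in ggplot2 (the grammar itself is developed in Chapter 3), the sketch below uses the built-in mtcars data purely for demonstration:

library(ggplot2)

# position (x, y), area (size), and color each encode a different variable
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp, color = factor(cyl))) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       size = "Horsepower", color = "Cylinders")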

Coordinate systems

How are the data points organized? While any number of coordinate systems are possible,
three are most common:

Cartesian This is the familiar (x, y)-rectangular coordinate system with two perpendicular
axes.

Polar The radial analog of the Cartesian system with points identified by their radius ρ
and angle θ.

Geographic This is the increasingly important system in which we have locations on the
curved surface of the Earth, but we are trying to represent these locations in a flat
two-dimensional plane. We will discuss such spatial analyses in Chapter 14.

An appropriate choice for a coordinate system is critical in representing one’s data
accurately, since, for example, displaying spatial data like airline routes on a flat Cartesian
plane can lead to gross distortions of reality (see Section 14.3.2).

Scale

Scales translate values into visual cues. The choice of scale is often crucial. The central
question is how does distance in the data graphic translate into meaningful differences in
quantity? Each coordinate axis can have its own scale, for which we have three different
choices:

Numeric A numeric quantity is most commonly set on a linear, logarithmic, or percentage
scale. Note that a logarithmic scale does not have the property that, say, a
one-centimeter difference in position corresponds to an equal difference in quantity
anywhere on the scale.

Categorical A categorical variable may have no ordering (e.g., Democrat, Republican, or


Independent), or it may be ordinal (e.g., never, former, or current smoker).

Time Time is a numeric quantity that has some special properties. First, because of the
calendar, it can be demarcated by a series of different units (e.g., year, month, day,
etc.). Second, it can be considered periodically (or cyclically) as a “wrap-around”
scale. Time is also so commonly used and misused that it warrants careful consider-
ation.

Misleading with scale is easy, since it has the potential to completely distort the relative
positions of data points in any graphic.
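A quick numeric illustration of the logarithmic-scale property described above (the numbers are arbitrary): on a linear scale, equal distances correspond to equal differences, whereas on a log scale they correspond to equal ratios.

log10(100) - log10(10)    # 1 unit of distance: a factor of 10
log10(1000) - log10(100)  # also 1 unit: again a factor of 10
log10(20) - log10(10)     # about 0.30 units: only a factor of 2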

Context
The purpose of data graphics is to help the viewer make meaningful comparisons, but a
bad data graphic can do just the opposite: It can instead focus the viewer’s attention on
meaningless artifacts, or ignore crucial pieces of relevant but external knowledge. Context
can be added to data graphics in the form of titles or subtitles that explain what is being
shown, axis labels that make it clear how units and scale are depicted, or reference points
or lines that contribute relevant external information. While one should avoid cluttering
up a data graphic with excessive annotations, it is necessary to provide proper context.

Small multiples and layers


One of the fundamental challenges of creating data graphics is condensing multivariate
information into a two-dimensional image. While three-dimensional images are occasionally
useful, they are often more confusing than anything else. Instead, here are three common
ways of incorporating more variables into a two-dimensional data graphic:

Small multiples Also known as facets, a single data graphic can be composed of several
small multiples of the same basic plot, with one (discrete) variable changing in each
of the small sub-images.
Layers It is sometimes appropriate to draw a new layer on top of an existing data graphic.
This new layer can provide context or comparison, but there is a limit to how many
layers humans can reliably parse.
Animation If time is the additional variable, then an animation can sometimes effectively
convey changes in that variable. Of course, this doesn’t work on the printed page,
and makes it impossible for the user to see all the data at once.

2.2.2 Color
Color is one of the flashiest, but most misperceived and misused visual cues. In making color
choices, there are a few key ideas that are important for any data scientist to understand.
First, as we saw above, color and its monochromatic cousin shade are two of the most
poorly perceived visual cues. Thus, while potentially useful for a small number of levels
of a categorical variable, color and shade are not particularly faithful ways to represent
numerical variables—especially if small differences in those quantities are important to
distinguish. This means that while color can be visually appealing to humans, it often
isn’t as informative as we might hope. For two numeric variables, it is hard to think of
examples where color and shade would be more useful than position. Where color can be
most effective is to represent a third or fourth numeric quantity on a scatterplot—once the
two position cues have been exhausted.
Second, approximately 8 percent of the population—most of whom are men—have some
form of color blindness. Most commonly, this renders them incapable of seeing colors accu-
rately, most notably of distinguishing between red and green. Compounding the problem,
many of these people do not know that they are color-blind. Thus, for professional graphics
it is worth thinking carefully about which colors to use. The NFL famously failed to account
for this in a 2015 game in which the Buffalo Bills wore all-red jerseys and the New York
Jets wore all-green, leaving colorblind fans unable to distinguish one team from the other!

Pro Tip: Avoid contrasting red with green in data graphics (Bonus: your plots won’t
seem Christmas-y).

[Figure 2.10 plot: swatch of the RdBu (divergent) palette.]

Figure 2.10: Diverging red-blue color palette.

Thankfully, we have been freed from the burden of having to create such intelligent
palettes by the research of Cynthia Brewer, creator of the ColorBrewer website (and R
package). Brewer has created colorblind-safe palettes in a variety of hues for three different
types of numeric data in a single variable:

Sequential The ordering of the data has only one direction. Positive integers are sequential
because they can only go up: they can't go below 0. (Thus, if 0 is encoded as white,
then any darker shade of gray indicates a larger number.)

Diverging The ordering of the data has two directions. In an election forecast, we com-
monly see states colored based on how they are expected to vote for the president.
Since red is associated with Republicans and blue with Democrats, states that are
solidly red or blue are on opposite ends of the scale. But “swing states” that could go
either way may appear purple, white, or some other neutral color that is “between”
red and blue (see Figure 2.10).

Qualitative There is no ordering of the data, and we simply need color to differentiate
different categories.

The RColorBrewer package provides functionality to use these palettes directly in R. Fig-
ure 2.11 illustrates the sequential, qualitative, and diverging palettes built into RColorBrewer.
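For example, here is a minimal sketch of how these palettes can be pulled into R; the palette name and the number of colors requested below are arbitrary choices.

library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)  # preview only the colorblind-safe palettes
brewer.pal(n = 7, name = "RdBu")               # seven colors from the diverging RdBu palette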

Pro Tip: Take the extra time to use a well-designed color palette. Accept that those who
work with color for a living will probably choose better colors than you.

2.2.3 Dissecting data graphics


With a little practice, one can learn to dissect data graphics in terms of the taxonomy
outlined above. For example, your basic scatterplot uses position in the Cartesian plane
with linear scales to show the relationship between two variables. In what follows, we
identify the visual cues, coordinate system, and scale in a series of simple data graphics.

1. The bar graph in Figure 2.12 displays the average score on the math portion of the
1994–1995 SAT (with possible scores ranging from 200 to 800) among states for whom
at least two-thirds of the students took the SAT.
This plot uses the visual cue of position to represent the math SAT score on the vertical
axis with a linear scale. The categorical variable of state is arrayed on the horizontal
axis. Although the states are ordered alphabetically, it would not be appropriate to
consider the state variable to be ordinal, since the ordering is not meaningful in the
context of math SAT scores. The coordinate system is Cartesian, although as noted
previously, the horizontal coordinate is meaningless. Context is provided by the axis
labels and title. Note also that since 200 is the minimum score possible on each section
of the SAT, the vertical axis has been constrained to start at 200.

[Figure 2.11 plot: color swatches for the sequential palettes (Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd), the qualitative palettes (Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3), and the diverging palettes (BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral).]

Figure 2.11: Palettes available through the RColorBrewer package.

2. Next, we consider a time series that shows the progression of the world record times
in the 100-meter freestyle swimming event for men and women. Figure 2.13 displays
the times as a function of the year in which the new record was set.
At some level this is simply a scatterplot that uses position on both the vertical and
horizontal axes to indicate swimming time and chronological time, respectively, in a
Cartesian plane. The numeric scale on the vertical axis is linear, in units of seconds,
while the scale on the horizontal axis is also linear, measured in years. But there is
more going on here. Color is being used as a visual cue to distinguish the categorical
variable sex. Furthermore, since the points are connected by lines, direction is being
used to indicate the progression of the record times. (In this case, the records can
only get faster, so the direction is always down.) One might even argue that angle is
being used to compare the descent of the world records across time and/or gender. In
fact, in this case shape is also being used to distinguish sex.

3. Next, we present two pie charts in Figure 2.14 indicating the different substance
of abuse for subjects in the Health Evaluation and Linkage to Primary Care (HELP)
clinical trial. Each subject was identified with involvement with one primary substance
(alcohol, cocaine, or heroin). On the right, we see the distribution of substance for
housed (no nights in shelter or on the street) participants is fairly evenly distributed,
while on the left, we see the same distribution for those who were homeless one or
more nights (more likely to have alcohol as their primary substance of abuse).

[Figure 2.12 plot: "Average SAT math score, 1994-1995"; average SAT score (vertical axis, starting at 200) for each of ten states (horizontal axis).]

Figure 2.12: Bar graph of average SAT scores among states with at least two-thirds of
students taking the test.
This graphic uses a radial coordinate system and the visual cue of color to distinguish
the three levels of the categorical variable substance. The visual cue of angle is being
used to quantify the differences in the proportion of patients using each substance.
Are you able to accurately identify these percentages from the figure? The actual
percentages are shown below.

Pro Tip: Don’t use pie charts, except perhaps in small multiples.

substance   homeless   housed
alcohol       0.4928   0.3033
cocaine       0.2823   0.3811
heroin        0.2249   0.3156

This is a case where a simple table of these proportions is more effective at commu-
nicating the true differences than this—and probably any—data graphic. Note that
there are only six data points presented, so any graphic is probably gratuitous.
4. Finally, in Figure 2.15 we present a choropleth map showing the population of Mas-
sachusetts by the 2010 Census tracts.
Clearly, we are using a geographic coordinate system here, with latitude and longitude
on the vertical and horizontal axes, respectively. (This plot is not projected: More
information about projection systems is provided in Chapter 14.) Shade is once
again being used to represent the quantity population, but here the scale is more
complicated. The ten shades of blue have been mapped to the deciles of the census
tract populations, and since the distribution of population across these tracts is right-
skewed, each shade does not correspond to a range of people of the same width, but
rather to the same number of tracts that have a population in that range. Helpful
context is provided by the title, subtitles, and legend.

[Figure 2.13 plot: "World Record time in 100 m Freestyle"; Time (s) versus Year, with points colored and shaped by sex (F, M) and connected by lines.]

Figure 2.13: Scatterplot of world record time in 100-meter freestyle swimming.

2.3 Importance of data graphics: Challenger


On January 27th, 1986, engineers at Morton Thiokol, who supplied solid rocket motors
(SRMs) to NASA for the space shuttle, recommended that NASA delay the launch of the
space shuttle Challenger due to concerns that the cold weather forecast for the next day’s
launch would jeopardize the stability of the rubber O-rings that held the rockets together.
These engineers provided 13 charts that were reviewed over a two-hour conference call
involving the engineers, their managers, and NASA. The engineers’ recommendation was
overruled due to a lack of persuasive evidence, and the launch proceeded on schedule. The
O-rings failed in exactly the manner the engineers had feared 73 seconds after launch,
Challenger exploded, and all seven astronauts on board died [195].
In addition to the tragic loss of life, the incident was a devastating blow to NASA and the
United States space program. The hand-wringing that followed included a two-and-a-half
year hiatus for NASA and the formation of the Rogers Commission to study the disaster.
What became clear is that the Morton Thiokol engineers had correctly identified the key
causal link between temperature and O-ring damage. They did this using statistical data
analysis combined with a plausible physical explanation: in short, that the rubber O-rings
became brittle in low temperatures. (This link was famously demonstrated by legendary
physicist and Rogers Commission member Richard Feynman during the hearings, using
a glass of water and some ice cubes [195].) Thus, the engineers were able to identify the
critical weakness using their domain knowledge—in this case, rocket science—and their data
analysis. Their failure—and its horrific consequences—was one of persuasion: They simply
did not present their evidence in a convincing manner to the NASA officials who ultimately
made the decision to proceed with the launch. More than 30 years later this tragedy remains
critically important. The evidence brought to the discussions about whether to launch was
in the form of hand-written data tables (or “charts”) but none were graphical. In his
sweeping critique of the incident, Edward Tufte creates a powerful scatterplot similar to
the one shown in Figure 2.17, which can be derived from data that the engineers had at the
time, but in a far more effective presentation [195].

[Figure 2.14 plot: two pie charts (homeless and housed) showing the proportions of alcohol, cocaine, and heroin.]

Figure 2.14: Pie charts showing the breakdown of substance of abuse among HELP study
participants, faceted by homeless status.
Figure 2.16 indicates a clear relationship between the ambient temperature and O-ring
damage on the solid rocket motors. To demonstrate the dramatic extrapolation made to
the predicted temperature on January 27th, 1986, Tufte extended the horizontal axis in his
scatterplot (Figure 2.17) to include the forecasted temperature. The huge gap makes plain
the problem with extrapolation.
Tufte provided a full critique of the engineers’ failures [195], many of which are instruc-
tive for data scientists.

Lack of authorship There were no names on any of the charts. This creates a lack of
accountability. No single person was willing to take responsibility for the data con-
tained in any of the charts. It is much easier to refute an argument made by a group
of nameless people than one made by a single named person or a named group.
Univariate analysis The engineers provided several data tables, but all were essentially
univariate. That is, they presented data on a single variable, but did not illustrate the
relationship between two variables. Note that while Figure 2.18a does show data for
two different variables, it is very hard to see the connection between the two in tabular
form. Since the crucial connection here was between temperature and O-ring damage,
this lack of bivariate analysis was probably the single most damaging omission in the
engineers’ presentation.
Anecdotal evidence With such a small sample size, anecdotal evidence can be particu-
larly challenging to refute. In this case, a bogus comparison was made based on two
observations. While the engineers argued that SRM-15 had the most damage on the
coldest previous launch date (see Figure 2.17), NASA officials were able to counter
that SRM-22 had the second-most damage on one of the warmer launch dates. These
anecdotal pieces of evidence fall apart when all of the data are considered in context—
in Figure 2.17 it is clear that SRM-22 is an outlier that deviates from the general
pattern—but the engineers never presented all of the data in context.
[Figure 2.15 plot: "2010 Massachusetts Census Tracts by Population"; map shaded by Population Count in ten decile bins from (0,2347] to (6713,12079], labeled "Quantiles (equal frequency)".]

Figure 2.15: Choropleth map of population among Massachusetts Census tracts, based on
2010 U.S. Census.

[Figure 2.16 plot: Tufte's O-ring damage index versus temperature (degrees F) of field joints at time of launch, with SRM 15 and SRM 22 labeled.]

Figure 2.16: A scatterplot with smoother demonstrating the relationship between temper-
ature and O-ring damage on solid rocket motors. The dots are semi-transparent, so that
darker dots indicate multiple observations with the same values.

[Figure 2.17 plot: Tufte's O-ring damage index versus temperature (degrees F) of field joints at time of launch, with SRM 15 and SRM 22 labeled and an annotation marking the 26-29 degree range of forecasted temperatures (as of January 27th, 1986) for the launch of space shuttle Challenger on January 28th.]

Figure 2.17: A recreation of Tufte's scatterplot demonstrating the relationship between
temperature and O-ring damage on solid rocket motors.

Omitted data For some reason, the engineers chose not to present data from 22 other
flights, which collectively represented 92% of launches. This may have been due to
time constraints. This dramatic reduction in the accumulated evidence played a role
in enabling the anecdotal evidence outlined above.
Confusion No doubt working against the clock, and most likely working in tandem, the
engineers were not always clear about two different types of damage: erosion and
blow-by. A failure to clearly define these terms may have hindered understanding on
the part of NASA officials.

Extrapolation Most forcefully, the failure to include a simple scatterplot of the full data
obscured the “stupendous extrapolation” [195] necessary to justify the launch. The
bottom line was that the forecasted launch temperatures (between 26 and 29 degrees
Fahrenheit) were so much colder than anything that had occurred previously, any
model for O-ring damage as a function of temperature would be untested.

Pro Tip: When more than a handful of observations are present, data graphics are
often more revealing than tables. Always consider alternative representations to improve
communication.

Pro Tip: Always ensure that graphical displays are clearly described with appropriate
axis labels, additional text descriptions, and a caption.

Tufte notes that the cardinal sin of the engineers was a failure to frame the data in
relation to what? The notion that certain data may be understood in relation to something
is perhaps the fundamental and defining characteristic of statistical reasoning. We will follow
this thread throughout the book.
We present this tragic episode in this chapter as motivation for a careful study of data
visualization. It illustrates a critical truism for practicing data scientists: Being right isn’t
enough—you have to be convincing. Note that Figure 2.18b contains the same data that are
present in Figure 2.17, but in a far less suggestive format. It just so happens that for most
human beings, graphical explanations are particularly persuasive. Thus, to be a successful
data analyst, one must master at least the basics of data visualization.

(a) One of the original 13 charts presented by Morton Thiokol engineers to NASA on the
conference call the night before the Challenger launch. This is one of the more data-intensive
charts. (b) Evidence presented during the congressional hearings after the Challenger
explosion. This is a classic example of "chartjunk."

Figure 2.18: Reprints of two Morton Thiokol data graphics. [195]

2.4 Creating effective presentations


Giving effective presentations is an important skill for a data scientist. Whether these
presentations are in academic conferences, in a classroom, in a boardroom, or even on stage,
the ability to communicate to an audience is of immeasurable value. While some people
may be naturally more comfortable in the limelight, everyone can improve the quality of
their presentations.
A few pieces of general advice are warranted [136]:

Budget your time You only have x minutes to talk, and usually 1 or 2 minutes to answer
questions. If your talk runs too short or too long, it makes you seem unprepared.
Rehearse your talk several times in order to get a better feel for your timing. Note
also that you may have a tendency to talk faster during your actual talk than you will
during your rehearsal. Talking faster in order to speed up is not a good strategy—you
are much better off simply cutting material ahead of time. You will probably have a
hard time getting through x slides in x minutes.

Pro Tip: Talking faster in order to speed up is not a good strategy—you are much better
off simply cutting material ahead of time or moving to a key slide or conclusion.

Don’t write too much on each slide You don’t want people to have to read your slides,
because if the audience is reading your slides, then they aren’t listening to you. You
want your slides to provide visual cues to the points that you are making—not sub-
stitute for your spoken words. Concentrate on graphical displays and bullet-pointed
lists of ideas.

Put your problem in context Remember that (in most cases) most of your audience
will have little or no knowledge of your subject matter. The easiest way to lose people
is to dive right into technical details that require prior domain knowledge. Spend a
few minutes at the beginning of your talk introducing your audience to the most basic
aspects of your topic and presenting some motivation for what you are studying.

Speak loudly and clearly Remember that (in most cases) you know more about your
topic than anyone else in the room, so speak and act with confidence!

Tell a story, but not necessarily the whole story It is unrealistic to expect that you
can tell your audience everything that you know about your topic in x minutes. You
should strive to convey the big ideas in a clear fashion, but not dwell on the details.
Your talk will be successful if your audience is able to walk away with an understanding
of what your research question was, how you addressed it, and what the implications
of your findings are.

2.5 The wider world of data visualization


Thus far our discussion of data visualization has been limited to static, two-dimensional
data graphics. However, there are many additional ways to visualize data. While Chapter 3
focuses on static data graphics, Chapter 11 presents several cutting-edge tools for making
interactive data visualizations. Even more broadly, the field of visual analytics is concerned
with the science behind building interactive visual interfaces that enhance one’s ability to
reason about data. Finally, we have data art.
You can do many things with data. On one end of the spectrum, you might be focused on
predicting the outcome of a specific response variable. In such cases, your goal is very well-
defined and your success can be quantified. On the other end of the spectrum are projects
called data art, wherein the meaning of what you are doing with the data is elusive, but
the experience of viewing the data in a new way is in itself meaningful.
Consider Memo Akten and Quayola’s Forms, which was inspired by the physical move-
ment of athletes in the Commonwealth Games. Through video analysis, these movements
were translated into 3D digital objects shown in Figure 2.19. Note how the image in the
upper-left is evocative of a swimmer surfacing after a dive. When viewed as a movie, Forms
is an arresting example of data art.
Successful data art projects require both artistic talent and technical ability. Before Us is
the Salesman’s House is a live, continuously-updating exploration of the online marketplace
eBay. This installation was created by statistician Mark Hansen and digital artist Jer
Thorpe and is projected on a big screen as you enter eBay’s campus. The display begins
by pulling up Arthur Miller’s classic play Death of a Salesman, and “reading” the text
of the first chapter. Along the way, several nouns are plucked from the text (e.g., flute,
refrigerator, chair, bed, trophy, etc.). For each in succession, the display then shifts to a
geographic display of where things with that noun in the description are currently being
sold on eBay, replete with price and auction information. (Note that these descriptions are
not always perfect. In the video, a search for “refrigerator” turns up a T-shirt of former
Chicago Bears defensive end William “Refrigerator” Perry). Next, one city where such an
item is being sold is chosen, and any classic books of American literature being sold nearby
are collected. One is chosen, and the cycle returns to the beginning by “reading” the first
page of that book. This process continues indefinitely. When describing the exhibit, Hansen
spoke of “one data set reading another.” It is this interplay of data and literature that makes
such data art projects so powerful.

Figure 2.19: Still images from Forms, by Memo Akten and Quayola. Each image represents
an athletic movement made by a competitor at the Commonwealth Games, but reimagined
as a collection of moving 3D digital objects. Reprinted with permission.

Finally, we consider another Mark Hansen collaboration, this time with Ben Rubin and
Michele Gorman. In Shakespeare Machine, 37 digital LCD blades—each corresponding to
one of Shakespeare’s plays—are arrayed in a circle. The display on each blade is a pattern
of words culled from the text of these plays. First, pairs of hyphenated words are shown.
Next, Boolean pairs (e.g., “good or bad”) are found. Third, articles and adjectives modifying
nouns (e.g., “the holy father”). In this manner, the artistic masterpieces of Shakespeare are
shattered into formulaic chunks. In Chapter 15 we will learn how to use regular expressions
to find the data for Shakespeare Machine.

2.6 Further resources


While issues related to data visualization pervade this entire text, they will be the particular
focus of Chapters 3 (Data visualization II), 11 (Data visualization III), and 14 (Spatial data).
No education in data graphics is complete without reading Tufte’s Visual Display of
Quantitative Information [196], which also contains a description of John Snow’s cholera
map (see Chapter 14). For a full description of the Challenger incident, see Visual Expla-
nations [195]. Tufte has also published two other landmark books [194, 198], as well as
reasoned polemics about the shortcomings of PowerPoint [197]. Bill Cleveland’s work on
visual perception [55] provides the foundation for Yau’s taxonomy [243]. Yau’s text [242]
provides many examples of thought-provoking data visualizations, particularly data art.
The grammar of graphics was first described by Wilkinson [238]. Hadley Wickham imple-
mented ggplot2 based on this formulation [212].
Many important data graphics were developed by John Tukey [199]. Andrew Gel-
man [87] has also written persuasively about data graphics in statistical journals. Gelman
discusses a set of canonical data graphics as well as Tufte’s suggested modifications to them.
Nolan and Perrett discuss data visualization assignments and rubrics that can be used to
grade them [147]. Steven J. Murdoch has created some R functions for drawing the kind
of modified diagrams that Tufte describes in [196]. These also appear in the ggthemes
package [9].
Cynthia Brewer’s color palettes are available at http://colorbrewer2.org and through
the RColorBrewer package for R. Her work is described in more detail in [38, 39]. Wick-
ham and others created the whimsical color palette that evokes Wes Anderson’s distinctive
movies [173].
Technically Speaking (Denison University) is an NSF-funded project for presentation
advice that contains instructional videos for students [136].

2.7 Exercises

Exercise 2.1
What would a Cartesian plot that used colors to convey categorical values look like?

Exercise 2.2
Consider the two graphics related to The New York Times “Taxmageddon” article at
http://www.nytimes.com/2012/04/15/sunday-review/coming-soon-taxmageddon.html.
The first is “Whose Tax Rates Rose or Fell” and the second is “Who Gains Most From Tax
Breaks.”
1. Examine the two graphics carefully. Discuss what you think they convey. What story
do the graphics tell?

2. Evaluate both graphics in terms of the taxonomy described in this chapter. Are the
scales appropriate? Consistent? Clearly labelled? Do variable dimensions exceed data
dimensions?
3. What, if anything, is misleading about these graphics?

Exercise 2.3
Choose one of the data graphics listed at http://mdsr-book.github.io/exercises.
html#exercise_23 and answer the following questions. Be sure to indicate which graphical
display you picked.

1. Identify the visual cues, coordinate system, and scale(s).


2. How many variables are depicted in the graphic? Explicitly link each variable to a
visual cue that you listed above.
3. Critique this data graphic using the taxonomy described in this chapter.

Exercise 2.4
Answer the following questions for each of the following collections of data graphics
listed at http://mdsr-book.github.io/exercises.html#exercise_24.
Briefly (one paragraph) critique the designer’s choices. Would you have made different
choices? Why or why not?
Note: Each link contains a collection of many data graphics, and we don’t expect (or
want) you to write a dissertation on each individual graphic. But each collection shares
some common stylistic elements. You should comment on a few things that you notice
about the design of the collection.

Exercise 2.5
Consider one of the more complicated data graphics listed at http://mdsr-book.
github.io/exercises.html#exercise_25.

1. What story does the data graphic tell? What is the main message that you take away
from it?
2. Can the data graphic be described in terms of the taxonomy presented in this chapter?
If so, list the visual cues, coordinate system, and scales(s) as you did in Problem 2(a).
If not, describe the feature of this data graphic that lies outside of that taxonomy.
3. Critique and/or praise the visualization choices made by the designer. Do they work?
Are they misleading? Thought-provoking? Brilliant? Are there things that you would
have done differently? Justify your response.

Exercise 2.6
Consider the data graphic (http://tinyurl.com/nytimes-unplanned) about birth con-
trol methods.
1. What quantity is being shown on the y-axis of each plot?
2. List the variables displayed in the data graphic, along with the units and a few typical
values for each.

3. List the visual cues used in the data graphic and explain how each visual cue is linked
to each variable.
4. Examine the graphic carefully. Describe, in words, what information you think the
data graphic conveys. Do not just summarize the data—interpret the data in the
context of the problem and tell us what it means.
Chapter 3

A grammar for graphics

In Chapter 2, we presented a taxonomy for understanding data graphics. In this chapter,


we illustrate how the ggplot2 package can be used to create data graphics. Other packages
for creating static, two-dimensional data graphics in R include base graphics and the lat-
tice system. We employ the ggplot2 system because it provides a unifying framework—a
grammar—for describing and specifying graphics. The grammar for specifying graphics will
allow the creation of custom data graphics that support visual display in a purposeful way.
We note that while the terminology used in ggplot2 is not the same as the taxonomy we
outlined in Chapter 2, there are many close parallels, which we will make explicit.

3.1 A grammar for data graphics


The ggplot2 package is one of the many creations of prolific R programmer Hadley Wick-
ham. It has become one of the most widely-used R packages, in no small part because of
the way it builds data graphics incrementally from small pieces of code.
In the grammar of ggplot2, an aesthetic is an explicit mapping between a variable
and the visual cues that represent its values. A glyph is the basic graphical element that
represents one case (other terms used include “mark” and “symbol”). In a scatterplot,
the positions of a glyph on the plot—in both the horizontal and vertical senses—are the
visual cues that help the viewer understand how big the corresponding quantities are. The
aesthetic is the mapping that defines these correspondences. When more than two variables
are present, additional aesthetics can marshal additional visual cues. Note also that some
visual cues (like direction in a time series) are implicit and do not have a corresponding
aesthetic.
For many of the chapters in this book, the first step in following these examples will be
to load the mdsr package for R, which contains all of the data sets referenced in this book.
In particular, loading mdsr also loads the mosaic package, which in turn loads dplyr and
ggplot2. (For more information about the mdsr package see Appendix A. If you are using
R for the first time, please see Appendix B for an introduction.)

library(mdsr)

Pro Tip: If you want to learn how to use a particular command, we highly recommend
running the example code on your own.


We begin with a data set that includes measures that are relevant to answer questions
about economic productivity. The CIACountries data table contains seven variables col-
lected for each of 236 countries: population (pop), area (area), gross domestic product
(gdp), percentage of GDP spent on education (educ), length of roadways per unit area
(roadways), Internet use as a fraction of the population (net_users), and the number of
barrels of oil produced per day (oil_prod). Table 3.1 displays a selection of variables for
the first six countries.

country          oil_prod        gdp  educ  roadways  net_users
Afghanistan          0.00    1900.00          0.06         >5%
Albania          20510.00   11900.00  3.30   0.63         >35%
Algeria        1420000.00   14500.00  4.30   0.05         >15%
American Samoa       0.00   13000.00          1.21
Andorra                     37200.00          0.68         >60%
Angola         1742000.00    7300.00  3.50   0.04         >15%

Table 3.1: A selection of variables from the first six rows of the CIACountries data table.

3.1.1 Aesthetics
In the simple scatterplot shown in Figure 3.1, we employ the grammar of graphics to build
a multivariate data graphic. In ggplot2, a plot is created with the ggplot() command, and
any arguments to that function are applied across any subsequent plotting directives. In
this case, this means that any variables mentioned anywhere in the plot are understood to
be within the CIACountries data frame, since we have specified that in the data argument.
Graphics in ggplot2 are built incrementally by elements. In this case, the only elements are
points, which are plotted using the geom_point() function. The arguments to geom_point()
specify where and how the points are drawn. Here, the two aesthetics (aes()) map the
vertical (y) coordinate to the gdp variable, and the horizontal (x) coordinate to the educ
variable. The size argument to geom_point() changes the size of all of the glyphs. Note
that here, every dot is the same size. Thus, size is not an aesthetic, since it does not map
a variable to a visual cue. Since each case (i.e., row in the data frame) is a country, each
dot represents one country.
In Figure 3.1 the glyphs are simple. Only position in the frame distinguishes one glyph
from another. The shape, size, etc. of all of the glyphs are identical—there is nothing about
the glyph itself that identifies the country.
However, it is possible to use a glyph with several attributes. We can define additional
aesthetics to create new visual cues. In Figure 3.2, we have extended the previous example
by mapping the color of each dot to the categorical net_users variable.
Changing the glyph is as simple as changing the function that draws that glyph—the
aesthetic can often be kept exactly the same. In Figure 3.3, we plot text instead of a dot.
Of course, we can employ multiple aesthetics. There are four aesthetics in Figure 3.4.
Each of the four aesthetics is set in correspondence with a variable—we say the variable is
mapped to the aesthetic. Educational attainment is being mapped to horizontal position,
GDP to vertical position, Internet connectivity to color, and length of roadways to size.
Thus, we encode four variables (gdp, educ, net_users, and roadways) using the visual
cues of position, position, color, and area, respectively.
A data table provides the basis for drawing a data graphic. The relationship between
a data table and a graphic is simple: Each case in the data table becomes a mark in the
graph (we will return to the notion of glyph-ready data in Chapter 5).

g <- ggplot(data = CIACountries, aes(y = gdp, x = educ))


g + geom_point(size = 3)


Figure 3.1: Scatterplot using only the position aesthetic for glyphs.

g + geom_point(aes(color = net_users), size = 3)


Figure 3.2: Scatterplot in which net_users is mapped to color.



g + geom_text(aes(label = country, color = net_users), size = 3)


Figure 3.3: Scatterplot using both location and label as aesthetics.

g + geom_point(aes(color = net_users, size = roadways))


Figure 3.4: Scatterplot in which net_users is mapped to color and roadways mapped to size.
Compare this graphic to Figure 3.6, which displays the same data using facets.

As the designer of the graphic, you choose which variables the graphic will display and
how each variable is to be represented graphically: position, size, color, and so on.

3.1.2 Scale
Compare Figure 3.4 to Figure 3.5. In the former, it is hard to discern differences in GDP
due to its right-skewed distribution and the choice of a linear scale. In the latter, the
logarithmic scale on the vertical axis makes the scatterplot more readable. Of course, this
makes interpreting the plot more complex, so we must be very careful when doing so. Note
that the only difference in the code is the addition of the coord_trans() directive.

g + geom_point(aes(color = net_users, size = roadways)) +
  coord_trans(y = "log10")


Figure 3.5: Scatterplot using a logarithmic transformation of GDP that helps to mitigate
visual clustering caused by the right-skewed distribution of GDP among countries.

Scales can also be manipulated in ggplot2 using any of the scale functions. For
example, instead of using the coord_trans() function as we did above, we could have
achieved a similar plot through the use of the scale_y_continuous() function, as illustrated
below. In either case, the points will be drawn in the same location—the difference in the
two plots is how and where the major tick marks and axis labels are drawn. We prefer to use
coord_trans() in Figure 3.5 because it draws attention to the use of the log scale. Similarly
named functions (e.g., scale_x_continuous(), scale_x_discrete(), scale_color(), etc.)
perform analogous operations on different aesthetics.

g + geom_point(aes(color = net_users, size = roadways)) +
  scale_y_continuous(name = "Gross Domestic Product", trans = "log10")

Not all scales are about position. For instance, in Figure 3.4, net_users is translated
to color. Similarly, roadways is translated to size: the largest dot corresponds to a value of
five roadways per unit area.

3.1.3 Guides
Context is provided by guides (more commonly called legends). A guide helps a human
reader to understand the meaning of the visual cues by providing context.
For position visual cues, the most common sort of guide is the familiar axis with its
tick marks and labels. But other guides exist. In Figures 3.4 and 3.5, legends relate how
dot color corresponds to Internet connectivity, and how dot size corresponds to length of
roadways (note the use of a log scale). The geom_text() and annotate() functions
can also be used to provide specific textual annotations on the plot. Examples of how to
use these functions for annotations are provided in Section 3.3.
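For instance, a minimal sketch of an annotation layer on the plot object g defined earlier; the coordinates and label text here are arbitrary and would need to be tailored to the plot.

g + geom_point(size = 3) +
  annotate("text", x = 11, y = 100000, label = "a hypothetical annotation")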

3.1.4 Facets
Using multiple aesthetics such as shape, color, and size to display multiple variables can
produce a confusing, hard-to-read graph. Facets—multiple side-by-side graphs used to
display levels of a categorical variable—provide a simple and effective alternative. Figure
3.6 uses facets to show different levels of Internet connectivity, providing a better view than
Figure 3.4. There are two functions that create facets: facet_wrap() and facet_grid().
The former creates a facet for each level of a single categorical variable, whereas the latter
creates a facet for each combination of two categorical variables, arranging them in a grid.
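As a sketch of the latter, the HELPrct data (used again later in this chapter) contains two categorical variables, homeless and substance, that can be crossed in a grid of facets; the binwidth below is an arbitrary choice.

ggplot(data = HELPrct, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  facet_grid(homeless ~ substance)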

g + geom_point(alpha = 0.9, aes(size = roadways)) + coord_trans(y = "log10") +
  facet_wrap(~net_users, nrow = 1) + theme(legend.position = "top")


Figure 3.6: Scatterplot using facets for different ranges of Internet connectivity.

3.1.5 Layers
On occasion, data from more than one data table are graphed together. For example,
the MedicareCharges and MedicareProviders data tables provide information about the
average cost of each medical procedure in each state. If you live in New Jersey, you might
wonder how providers in your state charge for different medical procedures. However, you
will certainly want to understand those averages in the context of the averages across all
states. In the MedicareCharges table, each row represents a different medical procedure
(drg) with its associated average cost in each state. We also create a second data table
called ChargesNJ, which contains only those rows corresponding to providers in the state
of New Jersey. Do not worry if these commands aren’t familiar—we will learn these in
Chapter 4.

data(MedicareCharges)
ChargesNJ <- MedicareCharges %>% filter(stateProvider == "NJ")

The first few rows from the data table for New Jersey are shown in Table 3.2. This glyph-
ready table (see Chapter 5) can be translated to a chart (Figure 3.7) using bars to represent
the average charges for different medical procedures in New Jersey. The geom_bar() function
creates a separate bar for each of the 100 different medical procedures.

drg stateProvider num_charges mean_charge
039 NJ 31 35103.81
057 NJ 55 45692.07
064 NJ 55 87041.64
065 NJ 59 59575.74
066 NJ 56 45819.13
069 NJ 61 41916.70
074 NJ 41 42992.81
101 NJ 58 42314.18
149 NJ 50 34915.54
176 NJ 36 58940.98

Table 3.2: Glyph-ready data for the barplot layer in Figure 3.7.

How do the charges in New Jersey compare to those in other states? The two data
tables, one for New Jersey and one for the whole country, can be plotted with different
glyph types: bars for New Jersey and dots for the states across the whole country as in
Figure 3.8. With the context provided by the individual states, it is easy to see that the
charges in New Jersey are among the highest in the country for each medical procedure.

3.2 Canonical data graphics in R


Over time, statisticians have developed standard data graphics for specific use cases [199].
While these data graphics are not always mesmerizing, they are hard to beat for simple
effectiveness. Every data scientist should know how to make and interpret these canonical
data graphics—they are ignored at your peril.

3.2.1 Univariate displays


It is generally useful to understand how a single variable is distributed. If that variable is
numeric, then its distribution is commonly summarized graphically using a histogram or
density plot. Using the ggplot2 package, we can display either plot for the math variable
in the SAT_2010 data frame by binding the math variable to the x aesthetic.

g <- ggplot(data = SAT_2010, aes(x = math))



p <- ggplot(data = ChargesNJ,
  aes(x = reorder(drg, mean_charge), y = mean_charge)) +
  geom_bar(fill = "gray", stat = "identity") +
  ylab("Statewide Average Charges ($)") + xlab("Medical Procedure (DRG)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p


Figure 3.7: Bar graph of average charges for medical procedures in New Jersey.

p + geom_point(data = MedicareCharges, size = 1, alpha = 0.3)



Figure 3.8: Bar graph adding a second layer to provide a comparison of New Jersey to other
states. Each dot represents one state, while the bars represent New Jersey.

g + geom_histogram(binwidth = 10)


Figure 3.9: Histogram showing the distribution of Math SAT scores by state.

Then we only need to choose either geom_histogram() or geom_density(). Both Figures
3.9 and 3.10 convey the same information, but whereas the histogram uses pre-defined bins
to create a discrete distribution, a density plot uses a kernel smoother to make a continuous
curve.
Note that the binwidth argument is being used to specify the width of bins in the
histogram. Here, each bin contains a ten–point range of SAT scores. In general, the
appearance of a histogram can vary considerably based on the choice of bins, and there is
no one “best” choice. You will have to decide what bin width is most appropriate for your
data.
Similarly, in the density plot shown in Figure 3.10 we use the adjust argument to modify
the bandwidth being used by the kernel smoother. In the taxonomy defined above, a density
plot uses position and direction in a Cartesian plane with a horizontal scale defined by the
units in the data.
If your variable is categorical, it doesn’t make sense to think about the values as having a
continuous density. Instead, we can use bar graphs to display the distribution of a categorical
variable. To make a simple bar graph for math, identifying each bar by the label state,
we use the geom_bar() command, as displayed in Figure 3.11. Note that we add a few
wrinkles to this plot. First, we use the head() function to display only the first 10 states
(in alphabetical order). Second, we use the reorder() function to sort the state names in
order of their average math SAT score. Third, we set the stat argument to identity to
force ggplot2 to use the y aesthetic, which is mapped to math.
As noted earlier, we recommend against the use of pie charts to display the distribution
of a categorical variable since, in most cases, a table of frequencies is more informative.
An informative graphical display can be achieved using a stacked bar plot, such as the one
shown in Figure 3.12. Note that we have used the coord_flip() function to display the
bars horizontally instead of vertically.
This method of graphical display enables a more direct comparison of proportions than
would be possible using two pie charts. In this case, it is clear that homeless participants
were more likely to identify as being involved with alcohol as their primary substance of
abuse.

g + geom_density(adjust = 0.3)


Figure 3.10: Density plot showing the distribution of Math SAT scores by state.

ggplot(data = head(SAT_2010, 10), aes(x = reorder(state, math), y = math)) +
  geom_bar(stat = "identity")

Figure 3.11: A bar plot showing the distribution of Math SAT scores for a selection of
states.

ggplot(data = HELPrct, aes(x = homeless)) +
  geom_bar(aes(fill = substance), position = "fill") +
  coord_flip()


Figure 3.12: A stacked bar plot showing the distribution of substance of abuse for partici-
pants in the HELP study. Compare this to Figure 2.14.

However, like pie charts, bar charts are sometimes criticized for having a low data-to-ink
ratio. That is, they use a comparatively large amount of ink to depict relatively few
data points.

3.2.2 Multivariate displays


Multivariate displays are the most effective way to convey the relationship between more
than one variable. The venerable scatterplot remains an excellent way to display observa-
tions of two quantitative (or numerical) variables. The scatterplot is provided in ggplot2
by the geom_point() command. The main purpose of a scatterplot is to show the relation-
ship between two variables across many cases. Most often, there is a Cartesian coordinate
system in which the x-axis represents one variable and the y-axis the value of a second
variable.

g <- ggplot(data = SAT_2010, aes(x = expenditure, y = math)) + geom_point()

We will also add a smooth trend line and some more specific axis labels.

g <- g + geom_smooth(method = "lm", se = 0) +
  xlab("Average expenditure per student ($1000)") +
  ylab("Average score on math SAT")

In Figures 3.13 and 3.14 we plot the relationship between the average SAT math score
and the expenditure per pupil (in thousands of United States dollars) among states in 2010.
A third (categorical) variable can be added through faceting and/or layering. In this case,
we use the mutate() function (see Chapter 4) to create a new variable called SAT_rate that
places states into bins (e.g., high, medium, low) based on the percentage of students taking
the SAT. Additionally, in order to include that new variable in our plots, we use the %+%
operator to update the data frame that is bound to our plot.

SAT_2010 <- SAT_2010 %>%
  mutate(SAT_rate = cut(sat_pct, breaks = c(0, 30, 60, 100),
    labels = c("low", "medium", "high")))
g <- g %+% SAT_2010

In Figure 3.13, we use the color aesthetic to separate the data by SAT_rate on a single
plot (i.e., layering). Compare this with Figure 3.14 where we add a facet_wrap() mapped
to SAT_rate to separate by facet.

g + aes(color = SAT_rate)


Figure 3.13: Scatterplot using the color aesthetic to separate the relationship between two
numeric variables by a third categorical variable.

Note for these two plots we have used the geom_smooth() function in order to plot the
simple linear regression line (method = "lm") through those points (see Section 7.6 and
Appendix E).
The NHANES data table provides medical, behavioral, and morphometric measurements
of individuals. The scatterplot in Figure 3.15 shows the relationship between two of the
variables, height and age. Each dot represents one person and the position of that dot
signifies the value of the two variables for that person. Scatterplots are useful for visualizing
a simple relationship between two variables. For instance, you can see in Figure 3.15 the
familiar pattern of growth in height from birth to the late teens.
Some scatterplots have special meanings. A time series—such as the one shown in
Figure 3.16—is just a scatterplot with time on the horizontal axis and points connected by
lines to indicate temporal continuity. In Figure 3.16, the temperature at a weather station
in western Massachusetts is plotted over the course of the year. The familiar fluctuations
based on the seasons are evident. Be especially aware of dubious causality in these plots:
Is time really a good explanatory variable?

g + facet_wrap(~ SAT_rate)


Figure 3.14: Scatterplot using a facet_wrap() to separate the relationship between two
numeric variables by a third categorical variable.

library(NHANES)
ggplot(data = sample_n(NHANES, size = 1000),
aes(x = Age, y = Height, color = Gender)) +
geom_point() + geom_smooth() + xlab("Age (years)") + ylab("Height (cm)")


Figure 3.15: A scatterplot for 1,000 random individuals from the NHANES study. Note how
mapping gender to color illuminates the differences in height between men and women.

library(macleish)
ggplot(data = whately_2015, aes(x = when, y = temperature)) +
geom_line(color = "darkgray") + geom_smooth() +
xlab(NULL) + ylab("Temperature (degrees Fahrenheit)")


Figure 3.16: A time series showing the change in temperature at the MacLeish field station
in 2015.

For displaying a numerical response variable against a categorical explanatory variable,
a common choice is a box-and-whisker (or box) plot, as shown in Figure 3.17. It may be
easiest to think about this as simply a graphical depiction of the five-number summary
(minimum, Q1, median, Q3, and maximum).

favstats(length ~ sex, data = KidsFeet)

sex min Q1 median Q3 max mean sd n missing
1 B 22.9 24.35 24.95 25.8 27.5 25.11 1.217 20 0
2 G 21.6 23.65 24.20 25.1 26.7 24.32 1.330 19 0

When both the explanatory and response variables are categorical (or binned), points
and lines don’t work as well. How likely is a person to have diabetes, based on their age
and BMI (body mass index)? In the mosaicplot (or eikosogram) shown in Figure 3.18 the
number of observations in each cell is proportional to the area of the box. Thus, you can see
that diabetes tends to be more common for older people as well as for those who are obese,
since the blue shaded regions are larger than expected under an independence model while
the pink are less than expected. These provide a more accurate depiction of the intuitive
notions of probability familiar from Venn diagrams [152].
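The mosaic plot in Figure 3.18 was built outside of ggplot2. As a rough sketch of the idea
(this is not the exact code behind the figure, which also incorporates the binned BMI
variable), a shaded mosaic plot of diabetes status by age decade can be drawn directly
from the NHANES data:

# A rough sketch (not the exact code behind Figure 3.18): a shaded mosaic plot
# of diabetes status by age decade, using the base graphics mosaicplot() function.
library(NHANES)
mosaicplot(~ AgeDecade + Diabetes, data = NHANES, shade = TRUE,
  main = "Diabetes by age decade")

Setting shade = TRUE colors each cell according to its standardized residual under an
independence model, which is what produces the blue and pink shading described above.
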
In Table 3.3 we summarize the use of ggplot2 plotting commands and their relationship
to canonical data graphics. Note that the mosaicplot() function is not part of ggplot2,
but rather is available through the built-in graphics system.

ggplot(data = KidsFeet, aes(x = sex, y = length)) + geom_boxplot()


Figure 3.17: A box-and-whisker plot showing the distribution of foot length by gender for
39 children.

[Figure: panels show Diabetes (No/Yes) within each AgeDecade, split by BMI_WHO category; cells are shaded by standardized residuals.]

Figure 3.18: Mosaic plot (eikosogram) of diabetes by age and weight status (BMI).

response (y)   explanatory (x)   plot type            ggplot2 geom()
numeric                          histogram, density   geom_histogram(), geom_density()
categorical                      stacked bar          geom_bar()
numeric        numeric           scatter              geom_point()
numeric        categorical       box                  geom_boxplot()
categorical    categorical       mosaic               graphics::mosaicplot()

Table 3.3: Table of canonical data graphics and their corresponding ggplot2 commands.
Note that mosaicplot() is not part of the ggplot2 package.

[Figure legend: oil production (bbl/day) shaded in bins of >1,000, >10,000, >100,000, >1 million, and >10 million, with NA for missing data.]

Figure 3.19: A choropleth map displaying oil production by countries around the world in
barrels per day.

3.2.3 Maps
Using a map to display data geographically helps both to identify particular cases and
to show spatial patterns and discrepancies. In Figure 3.19, the shading of each country
represents its oil production. This sort of map, where the fill color of each region reflects
the value of a variable, is sometimes called a choropleth map. We will learn more about
mapping and how to work with spatial data in Chapter 14.
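The code for Figure 3.19 is not shown here. As a rough sketch of the idea, a choropleth can
be assembled in ggplot2 by joining a value of interest onto map polygons and mapping it to
the fill aesthetic. The oil data frame below is invented for illustration, and map_data()
requires the maps package to be installed.

# A rough sketch of a choropleth (not the code behind Figure 3.19). The oil
# data frame is made up for illustration; real values would come from an
# external source.
library(ggplot2)
library(dplyr)
oil <- data.frame(region = c("Canada", "Norway", "Mexico"),
  barrels_per_day = c(4e6, 2e6, 2.5e6), stringsAsFactors = FALSE)
world <- map_data("world")           # world polygons (requires the maps package)
ggplot(left_join(world, oil, by = "region"),
  aes(x = long, y = lat, group = group, fill = barrels_per_day)) +
  geom_polygon(color = "white", size = 0.1)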

3.2.4 Networks
A network is a set of connections, called edges, between nodes, called vertices. A vertex
represents an entity. The edges indicate pairwise relationships between those entities.
The NCI60 data set is about the genetics of cancer. The data set contains more than
40,000 probes for the expression of genes, in each of 60 cancers. In the network displayed in
Figure 3.20, a vertex is a given cell line, and each is depicted as a dot. The dot’s color and
label gives the type of cancer involved. These are ovarian, colon, central nervous system,
melanoma, renal, breast, and lung cancers. The edges between vertices show pairs of cell
lines that had a strong correlation in gene expression.
The network shows that the melanoma cell lines (ME) are closely related to each other
but not so much to other cell lines. The same is true for colon cancer cell lines (CO) and
for central nervous system (CN) cell lines. Lung cancers, on the other hand, tend to have
associations with multiple other types of cancers. We will explore the topic of network
science in greater depth in Chapter 16. The code for Figure 3.20 is not reproduced here,
but the basic recipe (convert a matrix of pairwise correlations into a graph whose edges
are the strongly correlated pairs) can be sketched with the igraph package, as shown below.
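The expression matrix and threshold in this sketch are simulated and chosen only for
illustration:

# A rough sketch of the idea behind Figure 3.20 (not the code used there).
# Simulate an expression matrix, keep only pairs of columns whose correlation
# exceeds a (toy) threshold, and plot the resulting graph with igraph.
library(igraph)
set.seed(20)
expr <- matrix(rnorm(50 * 60), ncol = 60)   # 50 samples of 60 simulated "cell lines"
adjacency <- (cor(expr) > 0.25) * 1         # edges: strongly correlated pairs
diag(adjacency) <- 0                        # no self-loops
g <- graph_from_adjacency_matrix(adjacency, mode = "undirected")
plot(g, vertex.size = 4, vertex.label = NA)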

3.3 Extended example: Historical baby names


For many of us, there are few things that are more personal than our names. It is impossible
to remember a time when you didn’t have your name, and you carry it with you wherever
you go. You instinctively react when you hear it. And yet, you didn’t choose your name—
your parents did (unless you’ve legally changed your name).


Figure 3.20: A network diagram displaying the relationship between types of cancer cell
lines.

How do parents go about choosing names? Clearly, there seem to be both short and
long-term trends in baby names. The popularity of the name “Bella” spiked after the lead
character in Twilight became a cultural phenomenon. Other once-popular names seem to
have fallen out of favor—writers at FiveThirtyEight asked, “where have all the Elmers
gone?”
Using data from the babynames package, which uses public data from the Social Security
Administration (SSA), we can re-create many of the plots presented in the FiveThirtyEight
blog post, and in the process learn how to use ggplot2 to make production-quality data
graphics.
In Figure 3.21, we have reprinted an informative, annotated FiveThirtyEight data
graphic that shows the relative ages of American males named “Joseph.” Drawing on what
you have learned in Chapter 2, take a minute to jot down the visual cues, coordinate system,
scales, and context present in this plot. This diagnosis will facilitate our use of ggplot2 to
re-construct it.
The key insight of the FiveThirtyEight work is the estimation of the number of people
with each name who are currently alive. The lifetables table from the babynames package
contains actuarial estimates of the number of people per 100,000 who are alive at age x, for
every 0 ≤ x ≤ 114. The make_babynames_dist() function in the mdsr package adds some
more convenient variables and filters for only the data that is relevant to people alive in
2014.1

library(babynames)
BabynamesDist <- make_babynames_dist()
head(BabynamesDist, 2)

# A tibble: 2 9
  year sex name n prop alive_prob count_thousands age_today
  <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 1900 F Mary 16707 0.05257 0 16.707 114
2 1900 F Helen 6343 0.01996 0 6.343 114
# ... with 1 more variables: est_alive_today <dbl>

1 See the SSA documentation https://www.ssa.gov/oact/NOTES/as120/LifeTables_Body.html for more information.

Figure 3.21: Popularity of the name “Joseph” as constructed by FiveThirtyEight.

To find information about a specific name, we can just use the filter() function.

BabynamesDist %>% filter(name == "Benjamin")

3.3.1 Percentage of people alive today


What was your diagnosis of Figure 3.21? There are two main data elements in that plot:
a thick black line indicating the number of Josephs born each year, and the thin light blue
bars indicating the number of Josephs born in each year that are expected to still be alive
today. In both cases, the vertical axis corresponds to the number of people (in thousands),
and the horizontal axis corresponds to the year of birth.
We can compose a similar plot in ggplot2. First we take the relevant subset of the
data and set up the initial ggplot2 object. The data frame joseph is bound to the plot,
since this contains all of the data that we need for this plot, but we will be using it with
multiple geoms. Moreover, the year variable is mapped to the x-axis as an aesthetic. This
will ensure that everything will line up properly.
3.3. EXTENDED EXAMPLE: HISTORICAL BABY NAMES 51

joseph <- BabynamesDist %>%
  filter(name == "Joseph" & sex == "M")
name_plot <- ggplot(data = joseph, aes(x = year))

Next, we will add the bars.

name_plot <- name_plot +
  geom_bar(stat = "identity", aes(y = count_thousands * alive_prob),
    fill = "#b2d7e9", colour = "white")

The geom_bar() function adds bars, which are filled with a light blue color and a white
border. The height of the bars is an aesthetic that is mapped to the estimated number of
people alive today who were born in each year. The stat argument is set to identity,
since we want the actual y values to be used—not a count of the number of cases (which is
the default). The black line is easily added using the geom_line() function.

name_plot <- name_plot + geom_line(aes(y = count_thousands), size = 2)

Adding an informative label for the vertical axis and removing an uninformative label
for the horizontal axis will improve the readability of our plot.

name_plot <- name_plot +
  ylab("Number of People (thousands)") + xlab(NULL)

Inspecting the summary() of our plot at this point can help us keep things straight. Does
this accord with what you jotted down previously?

summary(name_plot)

data: year, sex, name, n, prop, alive_prob, count_thousands,
  age_today, est_alive_today [111x9]
mapping: x = year
faceting: <ggproto object: Class FacetNull, Facet>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map: function
map_data: function
params: list
render_back: function
render_front: function
render_panels: function
setup_data: function
setup_params: function
shrink: TRUE
train: function
train_positions: function
train_scales: function
vars: function
super: <ggproto object: Class FacetNull, Facet>
-----------------------------------
mapping: y = count_thousands * alive_prob
geom_bar: width = NULL, na.rm = FALSE
stat_identity: na.rm = FALSE
position_stack

mapping: y = count_thousands
geom_line: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity

The final data-driven element of Figure 3.21 is a darker blue bar indicating the median
year of birth. We can compute this with the wtd.quantile() function in the Hmisc package.
Setting the probs argument to 0.5 will give us the median year of birth, weighted by the
number of people estimated to be alive today (est_alive_today).

wtd.quantile <- Hmisc::wtd.quantile
median_yob <-
  with(joseph, wtd.quantile(year, est_alive_today, probs = 0.5))
median_yob

50%
1975

We can then overplot a single bar in a darker shade of blue. Here, we are using the
ifelse() function cleverly. If the year is equal to the median year of birth, then the height
of the bar is the estimated number of Josephs alive today. Otherwise, the height of the bar
is zero (so you can’t see it at all). In this manner we plot only the one darker blue bar that
we want to highlight.

name_plot <- name_plot +
  geom_bar(stat = "identity", colour = "white", fill = "#008fd5",
    aes(y = ifelse(year == median_yob, est_alive_today / 1000, 0)))

Lastly, Figure 3.21 contains many contextual elements specific to the name Joseph. We
can add a title, annotated text, and an arrow providing focus to a specific element of the
plot. Figure 3.22 displays our reproduction of Figure 3.21. There are a few differences in the
presentation of fonts, title, etc. These can be altered using ggplot2’s theming framework,
but we won’t explore these subtleties here (see Section 11.4).2

name_plot +
  ggtitle("Age Distribution of American Boys Named Joseph") +
  geom_text(x = 1935, y = 40, label = "Number of Josephs\nborn each year") +
  geom_text(x = 1915, y = 13, label =
    "Number of Josephs\nborn each year\nestimated to be alive\non 1/1/2014",
    colour = "#b2d7e9") +
  geom_text(x = 2003, y = 40,
    label = "The median\nliving Joseph\nis 37 years old",
    colour = "darkgray") +
  geom_curve(x = 1995, xend = 1974, y = 40, yend = 24,
    arrow = arrow(length = unit(0.3, "cm")), curvature = 0.5) + ylim(0, 42)

2 You may note that our numbers of births per year are lower than FiveThirtyEight’s beginning in about 1940. It is explained in a footnote in their piece that some of the SSA records are incomplete for privacy reasons, and thus they pro-rated their data based on United States Census estimates for the early years of the century. We have omitted this step, but the births table in the babynames package will allow you to perform it.


Figure 3.22: Recreation of the age distribution of “Joseph” plot.

Notice that we did not update the name_plot object with this contextual information.
This was intentional, since we can update the data argument of name plot and obtain an
analogous plot for another name. This functionality makes use of the special %+% operator.
As shown in Figure 3.23, the name “Josephine” enjoyed a spike in popularity around 1920
that later subsided.

name_plot %+% filter(BabynamesDist, name == "Josephine" & sex == "F")

While some names are almost always associated with a particular gender, many are not.
More interestingly, the proportion of people assigned male or female with a given name
often varies over time. These data were presented nicely by Nathan Yau at FlowingData.
We can compare how our name_plot differs by gender for a given name using a facet.
To do this, we will simply add a call to the facet_wrap() function, which will create small
multiples based on a single categorical variable, and then feed a new data frame to the
plot that contains data for both sexes. In Figure 3.24, we show how the prevalence of “Jessie”
has changed for the two sexes.


Figure 3.23: Age distribution of American girls named “Josephine”.

names_plot <- name_plot + facet_wrap(~sex)
names_plot %+% filter(BabynamesDist, name == "Jessie")

Figure 3.24: Comparison of the name “Jessie” across two genders.

The plot at FlowingData shows the 35 most common “unisex” names—that is, the
names that have historically had the greatest balance between males and females. We can
use a facet_grid() to compare the gender breakdown for a few of the most common of
these, as shown in Figures 3.25 and 3.26.

many_names_plot <- name_plot + facet_grid(name ~ sex)
mnp <- many_names_plot %+% filter(BabynamesDist, name %in%
  c("Jessie", "Marion", "Jackie"))
mnp

Figure 3.25: Gender breakdown for the three most “unisex” names.

Reversing the order of the variables in the call to facet_grid() flips the orientation of
the facets.

mnp + facet_grid(sex ~ name)

Figure 3.26: Gender breakdown for the three most “unisex” names, oriented vertically.

3.3.2 Most common women’s names


A second interesting data graphic from the same FiveThirtyEight article is shown in Fig-
ure 3.27. Take a moment to analyze this data graphic. What are the visual cues? What are
the variables? How are the variables being mapped to the visual cues? What geom()s are
present?
To recreate this data graphic, we need to collect the right data. We need to figure out
what the 25 most common female names are among those estimated to be alive today. We
can do this by counting the estimated number of people alive today for each name, filtering
for women, sorting by the number estimated to be alive, and then taking the top 25 results.
We also need to know the median age, as well as the first and third quartiles for age among
people having each name.

com_fem <- BabynamesDist %>%
  filter(sex == "F") %>%
  group_by(name) %>%
  summarise(
    N = n(), est_num_alive = sum(est_alive_today),
    q1_age = wtd.quantile(age_today, est_alive_today, probs = 0.25),
    median_age = wtd.quantile(age_today, est_alive_today, probs = 0.5),
    q3_age = wtd.quantile(age_today, est_alive_today, probs = 0.75)) %>%
  arrange(desc(est_num_alive)) %>%
  head(25)

This data graphic is a bit trickier than the previous one. We’ll start by binding the
data, and defining the x and y aesthetics. Contrary to Figure 3.27, we put the names on
the x-axis and the median age on the y—the reasons for doing so will be made clearer later.
We will also define the title of the plot, and remove the x-axis label, since it is self-evident.

w_plot <- ggplot(data = com_fem, aes(x = reorder(name, -median_age),
  y = median_age)) + xlab(NULL) + ylab("Age (in years)") +
  ggtitle("Median ages for females with the 25 most common names")

The next elements to add are the gold rectangles. To do this, we use the geom_linerange()
function. It may help to think of these not as rectangles, but as really thick lines. Because
we have already mapped the names to the x-axis, we only need to specify the mappings
for ymin and ymax. These are mapped to the first and third quartiles, respectively. We
will also make these lines very thick and color them appropriately. geom_linerange() only
understands ymin and ymax—there is not a corresponding function with xmin and xmax.
This is the reason that we are drawing our plot transposed to Figure 3.27. However, we
will fix this later. We have also added a slight alpha transparency to allow the gridlines to
be visible underneath the gold rectangles.

w_plot <- w_plot + geom_linerange(aes(ymin = q1_age, ymax = q3_age),
  color = "#f3d478", size = 10, alpha = 0.8)

There is a red dot indicating the median age for each of these names. If you look carefully,
you can see a white border around each red dot. The default glyph for geom_point() is a
solid dot, which is shape 19. By changing it to shape 21, we can use both the fill and
colour arguments.

Figure 3.27: FiveThirtyEight’s depiction of the age ranges for the 25 most common female
names.

w_plot <- w_plot +
  geom_point(fill = "#ed3324", colour = "white", size = 4, shape = 21)

It remains only to add the context and flip our plot around so the orientation matches
that of Figure 3.27. The coord_flip() function does exactly that.

w_plot +
geom_point(aes(y = 55, x = 24), fill = "#ed3324", colour = "white",
size = 4, shape = 21) +
geom_text(aes(y = 58, x = 24, label = "median")) +
geom_text(aes(y = 26, x = 16, label = "25th")) +
geom_text(aes(y = 51, x = 16, label = "75th percentile")) +
geom_point(aes(y = 24, x = 16), shape = 17) +
geom_point(aes(y = 56, x = 16), shape = 17) +
coord_flip()

You will note that the name “Anna” was fifth most common in Figure 3.27 but did not
appear in Figure 3.28. This appears to be a result of that name’s extraordinarily large range
and the pro-rating that FiveThirtyEight did to their data. The “older” names—including
Anna—were more affected by this alteration. Anna was the 47th most popular name by
our calculations.

3.4 Further resources


The grammar of graphics was created by Wilkinson [238], and implemented in ggplot2
by Wickham [212]. Version 2.0.0 of the ggplot2 package was released in late 2015 and a
second edition of the ggplot2 book is forthcoming. The ggplot2 cheat sheet produced by
RStudio is an excellent reference for understanding the various features of ggplot2.

3.5 Exercises

Exercise 3.1
Using the famous Galton data set from the mosaicData package:

library(mosaic)
head(Galton)

family father mother sex height nkids
1 1 78.5 67.0 M 73.2 4
2 1 78.5 67.0 F 69.2 4
3 1 78.5 67.0 F 69.0 4
4 1 78.5 67.0 F 69.0 4
5 2 75.5 66.5 M 73.5 4
6 2 75.5 66.5 M 72.5 4

1. Create a scatterplot of each person’s height against their father’s height

2. Separate your plot into facets by sex



3. Add regression lines to all of your facets

Recall that you can find out more about the data set by running the command ?Galton.

Exercise 3.2
Using the RailTrail data set from the mosaicData package:

library(mosaic)
head(RailTrail)

hightemp lowtemp avgtemp spring summer fall cloudcover precip volume
1 83 50 66.5 0 1 0 7.6 0.00 501
2 73 49 61.0 0 1 0 6.3 0.29 419
3 74 52 63.0 1 0 0 7.5 0.32 397
4 95 61 78.0 0 1 0 2.6 0.00 385
5 44 52 48.0 1 0 0 10.0 0.14 200
6 69 54 61.5 1 0 0 6.6 0.02 375
weekday
1 1
2 1
3 1
4 0
5 1
6 1

1. Create a scatterplot of the number of crossings per day (volume) against the high
temperature that day
2. Separate your plot into facets by weekday
3. Add regression lines to the two facets

Exercise 3.3
Angelica Schuyler Church (1756–1814) was the daughter of New York Governor Philip
Schuyler and sister of Elizabeth Schuyler Hamilton. Angelica, New York was named after
her. Generate a plot of the reported proportion of babies born with the name Angelica over
time and interpret the figure.

Exercise 3.4
The following questions use the Marriage data set from the mosaicData package.

library(mosaic)
head(Marriage, 2)

bookpageID appdate ceremonydate delay officialTitle person dob
1 B230p539 10/29/96 11/9/96 11 CIRCUIT JUDGE Groom 4/11/64
2 B230p677 11/12/96 11/12/96 0 MARRIAGE OFFICIAL Groom 8/6/64
age race prevcount prevconc hs college dayOfBirth sign
1 32.60 White 0 <NA> 12 7 102 Aries
2 32.29 White 1 Divorce 12 0 219 Leo

1. Create an informative and meaningful data graphic.

2. Identify each of the visual cues that you are using, and describe how they are related
to each variable.

3. Create a data graphic with at least five variables (either quantitative or categori-
cal). For the purposes of this exercise, do not worry about making your visualization
meaningful—just try to encode five variables into one plot.

Exercise 3.5

The MLB_teams data set in the mdsr package contains information about Major League
Baseball teams in the past four seasons. There are several quantitative and a few categorical
variables present. See how many variables you can illustrate on a single plot in R. The
current record is 7. (Note: This is not good graphical practice—it is merely an exercise to
help you understand how to use visual cues and aesthetics!)

library(mdsr)
head(MLB_teams, 4)

# A tibble: 4 11
yearID teamID lgID W L WPct attendance normAttend payroll
<int> <chr> <fctr> <int> <int> <dbl> <int> <dbl> <int>
1 2008 ARI NL 82 80 0.5062 2509924 0.5839 66202712
2 2008 ATL NL 72 90 0.4444 2532834 0.5892 102365683
3 2008 BAL AL 68 93 0.4224 1950075 0.4536 67196246
4 2008 BOS AL 95 67 0.5864 3048250 0.7091 133390035
# ... with 2 more variables: metroPop <dbl>, name <chr>

Exercise 3.6

Use the MLB_teams data in the mdsr package to create an informative data graphic that
illustrates the relationship between winning percentage and payroll in context.

Exercise 3.7

Use the make_babynames_dist() function in the mdsr package to recreate the “Deadest
Names” graphic from FiveThirtyEight (http://tinyurl.com/zcbcl9o).

library(mdsr)
babynames_dist <- make_babynames_dist()

babynames_dist

# A tibble: 1,639,368 9
year sex name n prop alive_prob count_thousands
<dbl> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 1900 F Mary 16707 0.05257 0 16.707
2 1900 F Helen 6343 0.01996 0 6.343
3 1900 F Anna 6114 0.01924 0 6.114
4 1900 F Margaret 5306 0.01670 0 5.306
5 1900 F Ruth 4765 0.01499 0 4.765
6 1900 F Elizabeth 4096 0.01289 0 4.096
7 1900 F Florence 3920 0.01234 0 3.920
8 1900 F Ethel 3896 0.01226 0 3.896
9 1900 F Marie 3856 0.01213 0 3.856
10 1900 F Lillian 3414 0.01074 0 3.414
# ... with 1,639,358 more rows, and 2 more variables: age_today <dbl>,
# est_alive_today <dbl>

Exercise 3.8
The macleish package contains weather data collected every ten minutes in 2015 from
two weather stations in Whately, MA.

library(macleish)
head(whately_2015)

# A tibble: 6 8
when temperature wind_speed wind_dir rel_humidity
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2015-01-01 00:00:00 -9.32 1.399 225.4 54.55
2 2015-01-01 00:10:00 -9.46 1.506 248.2 55.38
3 2015-01-01 00:20:00 -9.44 1.620 258.3 56.18
4 2015-01-01 00:30:00 -9.30 1.141 243.8 56.41
5 2015-01-01 00:40:00 -9.32 1.223 238.4 56.87
6 2015-01-01 00:50:00 -9.34 1.090 241.7 57.25
# ... with 3 more variables: pressure <int>, solar_radiation <dbl>,
# rainfall <int>

Using ggplot2, create a data graphic that displays the average temperature over each
10-minute interval (temperature) as a function of time (when).

Exercise 3.9
Using data from the nasaweather package, create a scatterplot between wind and
pressure, with color being used to distinguish the type of storm.

Exercise 3.10
Using data from the nasaweather package, use the geom_path() function to plot the
path of each tropical storm in the storms data table. Use color to distinguish the storms
from one another, and use faceting to plot each year in its own panel.


Figure 3.28: Recreation of FiveThirtyEight’s plot of the age distributions for the 25 most
common women’s names.
Chapter 4

Data wrangling

This chapter introduces basics of how to wrangle data in R. Wrangling skills will provide
an intellectual and practical foundation for working with modern data.

4.1 A grammar for data wrangling


In much the same way that ggplot2 presents a grammar for data graphics, the dplyr
package presents a grammar for data wrangling [234]. Hadley Wickham, one of the authors
of dplyr, has identified five verbs for working with data in a data frame:

select()     take a subset of the columns (i.e., features, variables)
filter()     take a subset of the rows (i.e., observations)
mutate()     add or modify existing columns
arrange()    sort the rows
summarize()  aggregate the data across rows (e.g., group it according to some criteria)

Each of these functions takes a data frame as its first argument, and returns a data
frame. Thus, these five verbs can be used in conjunction with each other to provide a
powerful means to slice-and-dice a single table of data. As with any grammar, what these
verbs mean on their own is one thing, but being able to combine these verbs with nouns
(i.e., data frames) creates an infinite space for data wrangling. Mastery of these five verbs
can make the computation of most any descriptive statistic a breeze and facilitate further
analysis. Wickham’s approach is inspired by his desire to blur the boundaries between
R and the ubiquitous relational database querying syntax SQL. When we revisit SQL in
Chapter 12, we will see the close relationship between these two computing paradigms. A
related concept more popular in business settings is the OLAP (online analytical processing)
hypercube, which refers to the process by which multidimensional data is “sliced-and-diced.”

4.1.1 select() and filter()


The two simplest of the five verbs are filter() and select(), which allow you to return
only a subset of the rows or columns of a data frame, respectively. Generally, if we have a
data frame that consists of n rows and p columns, Figures 4.1 and 4.2 illustrate the effect of
filtering this data frame based on a condition on one of the columns, and selecting a subset
of the columns, respectively.

63


Figure 4.1: The filter() function. At left, a data frame that contains matching entries
in a certain column for only a subset of the rows. At right, the resulting data frame after
filtering.


Figure 4.2: The select() function. At left, a data frame, from which we retrieve only a
few of the columns. At right, the resulting data frame after selecting those columns.

Specifically, we will demonstrate the use of these functions on the presidential data
frame (from the ggplot2 package), which contains p = 4 variables about the terms of n = 11
recent U.S. Presidents.

library(mdsr)
presidential

# A tibble: 11 4
name start end party
<chr> <date> <date> <chr>
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Kennedy 1961-01-20 1963-11-22 Democratic
3 Johnson 1963-11-22 1969-01-20 Democratic
4 Nixon 1969-01-20 1974-08-09 Republican
5 Ford 1974-08-09 1977-01-20 Republican
6 Carter 1977-01-20 1981-01-20 Democratic
7 Reagan 1981-01-20 1989-01-20 Republican
8 Bush 1989-01-20 1993-01-20 Republican
9 Clinton 1993-01-20 2001-01-20 Democratic
10 Bush 2001-01-20 2009-01-20 Republican
11 Obama 2009-01-20 2017-01-20 Democratic

To retrieve only the names and party affiliations of these presidents, we would use
select(). The first argument to the select() function is the data frame, followed by an
arbitrarily long list of column names, separated by commas. Note that it is not necessary
to wrap the column names in quotation marks.

select(presidential, name, party)

# A tibble: 11 2
name party
<chr> <chr>
1 Eisenhower Republican
2 Kennedy Democratic
3 Johnson Democratic
4 Nixon Republican
5 Ford Republican
6 Carter Democratic
7 Reagan Republican
8 Bush Republican
9 Clinton Democratic
10 Bush Republican
11 Obama Democratic

Similarly, the first argument to filter() is a data frame, and subsequent arguments are
logical conditions that are evaluated on any involved columns. Thus, if we want to retrieve
only those rows that pertain to Republican presidents, we need to specify that the value of
the party variable is equal to Republican.

filter(presidential, party == "Republican")

# A tibble: 6 4
name start end party
<chr> <date> <date> <chr>
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Nixon 1969-01-20 1974-08-09 Republican
3 Ford 1974-08-09 1977-01-20 Republican
4 Reagan 1981-01-20 1989-01-20 Republican
5 Bush 1989-01-20 1993-01-20 Republican
6 Bush 2001-01-20 2009-01-20 Republican

Note that the == is a test for equality. If we were to use only a single equal sign here,
we would be asserting that the value of party was Republican. This would cause all of the
rows of presidential to be returned, since we would have overwritten the actual values of
the party variable. Note also the quotation marks around Republican are necessary here,
since Republican is a literal value, and not a variable name.
Naturally, combining the filter() and select() commands enables one to drill down to
very specific pieces of information. For example, we can find which Democratic presidents
served since Watergate.

select(filter(presidential, start > 1973 & party == "Democratic"), name)

# A tibble: 3 1
name
<chr>
1 Carter
2 Clinton
3 Obama


Figure 4.3: The mutate() function. At left, a data frame. At right, the resulting data frame
after adding a new column.

In the syntax demonstrated above, the filter() operation is nested inside the select()
operation. As noted above, each of the five verbs takes and returns a data frame, which
makes this type of nesting possible. Shortly, we will see how these verbs can be chained
together to make rather long expressions that can become very difficult to read. Instead, we
recommend the use of the %>% (pipe) operator. Pipe-forwarding is an alternative to nesting
that yields code that can be easily read from top to bottom. With the pipe, we can write
the same expression as above in this more readable syntax.

presidential %>%
filter(start > 1973 & party == "Democratic") %>%
select(name)

# A tibble: 3 1
name
<chr>
1 Carter
2 Clinton
3 Obama

This expression is called a pipeline. Notice how the expression

dataframe %>% filter(condition)

is equivalent to filter(dataframe, condition). In later examples we will see how this
operator can make our code more readable and efficient, particularly for complex operations
on large data sets.

4.1.2 mutate() and rename()


Frequently, in the process of conducting our analysis, we will create, re-define, and rename
some of our variables. The functions mutate() and rename() provide these capabilities. A
graphical illustration of the mutate() operation is shown in Figure 4.3.
While we have the raw data on when each of these presidents took and relinquished
office, we don’t actually have a numeric variable giving the length of each president’s term.
Of course, we can derive this information from the dates given, and add the result as a
new column to our data frame. This date arithmetic is made easier through the use of the
lubridate package, which we use to compute the number of exact years (eyears(1))
that elapsed during the interval() from the start until the end of each president’s
term.

In this situation, it is generally considered good style to create a new object rather
than clobbering the one that comes from an external source. To preserve the existing
presidential data frame, we save the result of mutate() as a new object called mypresidents.

library(lubridate)
mypresidents <- presidential %>%
mutate(term.length = interval(start, end) / eyears(1))
mypresidents

# A tibble: 11 5
name start end party term.length
<chr> <date> <date> <chr> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84
3 Johnson 1963-11-22 1969-01-20 Democratic 5.17
4 Nixon 1969-01-20 1974-08-09 Republican 5.55
5 Ford 1974-08-09 1977-01-20 Republican 2.45
6 Carter 1977-01-20 1981-01-20 Democratic 4.00
7 Reagan 1981-01-20 1989-01-20 Republican 8.01
8 Bush 1989-01-20 1993-01-20 Republican 4.00
9 Clinton 1993-01-20 2001-01-20 Democratic 8.01
10 Bush 2001-01-20 2009-01-20 Republican 8.01
11 Obama 2009-01-20 2017-01-20 Democratic 8.01

The mutate() function can also be used to modify the data in an existing column.
Suppose that we wanted to add to our data frame a variable containing the year in which
each president was elected. Our first naïve attempt is to assume that every president was
elected in the year before he took office. Note that mutate() returns a data frame, so if we
want to modify our existing data frame, we need to overwrite it with the results.

mypresidents <- mypresidents %>% mutate(elected = year(start) - 1)
mypresidents

# A tibble: 11 6
name start end party term.length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01 1952
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
3 Johnson 1963-11-22 1969-01-20 Democratic 5.17 1962
4 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
5 Ford 1974-08-09 1977-01-20 Republican 2.45 1973
6 Carter 1977-01-20 1981-01-20 Democratic 4.00 1976
7 Reagan 1981-01-20 1989-01-20 Republican 8.01 1980
8 Bush 1989-01-20 1993-01-20 Republican 4.00 1988
9 Clinton 1993-01-20 2001-01-20 Democratic 8.01 1992
10 Bush 2001-01-20 2009-01-20 Republican 8.01 2000
11 Obama 2009-01-20 2017-01-20 Democratic 8.01 2008

Some aspects of this data set are wrong, because presidential elections are only held every
four years. Lyndon Johnson assumed the office after President Kennedy was assassinated in
1963, and Gerald Ford took over after President Nixon resigned in 1974. Thus, there were no
presidential elections in 1962 or 1973, as suggested in our data frame. We should overwrite

these values with NA’s—which is how R denotes missing values. We can use the ifelse()
function to do this. Here, if the value of elected is either 1962 or 1973, we overwrite that
value with NA.1 Otherwise, we overwrite it with the same value that it currently has. In
this case, instead of checking to see whether the value of elected equals 1962 or 1973, for
brevity we can use the %in% operator to check to see whether the value of elected belongs
to the vector consisting of 1962 and 1973.

mypresidents <- mypresidents %>%
  mutate(elected = ifelse((elected %in% c(1962, 1973)), NA, elected))
mypresidents

# A tibble: 11 6
name start end party term.length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01 1952
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
3 Johnson 1963-11-22 1969-01-20 Democratic 5.17 NA
4 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
5 Ford 1974-08-09 1977-01-20 Republican 2.45 NA
6 Carter 1977-01-20 1981-01-20 Democratic 4.00 1976
7 Reagan 1981-01-20 1989-01-20 Republican 8.01 1980
8 Bush 1989-01-20 1993-01-20 Republican 4.00 1988
9 Clinton 1993-01-20 2001-01-20 Democratic 8.01 1992
10 Bush 2001-01-20 2009-01-20 Republican 8.01 2000
11 Obama 2009-01-20 2017-01-20 Democratic 8.01 2008

Finally, it is considered bad practice to use periods in the name of functions, data frames,
and variables in R. Ill-advised periods could conflict with R’s use of generic functions (i.e., R’s
mechanism for method overloading). Thus, we should change the name of the term.length
column that we created earlier. In this book, we will use snake case for function and variable
names. We can achieve this using the rename() function.

Pro Tip: Don’t use periods in the names of functions, data frames, or variables, as this
can conflict with R’s programming model.

mypresidents <- mypresidents %>% rename(term_length = term.length)
mypresidents

# A tibble: 11 6
name start end party term_length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01 1952
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
3 Johnson 1963-11-22 1969-01-20 Democratic 5.17 NA
4 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
5 Ford 1974-08-09 1977-01-20 Republican 2.45 NA
6 Carter 1977-01-20 1981-01-20 Democratic 4.00 1976
7 Reagan 1981-01-20 1989-01-20 Republican 8.01 1980
8 Bush 1989-01-20 1993-01-20 Republican 4.00 1988
9 Clinton 1993-01-20 2001-01-20 Democratic 8.01 1992
10 Bush 2001-01-20 2009-01-20 Republican 8.01 2000
11 Obama 2009-01-20 2017-01-20 Democratic 8.01 2008

1 Incidentally, Johnson was elected in 1964 as an incumbent.

Figure 4.4: The arrange() function. At left, a data frame with an ordinal variable. At
right, the resulting data frame after sorting the rows in descending order of that variable.

4.1.3 arrange()
The function sort() will sort a vector, but not a data frame. The function that will sort a
data frame is called arrange(), and its behavior is illustrated in Figure 4.4.
In order to use arrange() on a data frame, you have to specify the data frame, and the
column by which you want it to be sorted. You also have to specify the direction in which
you want it to be sorted. Specifying multiple sort conditions will result in any ties being
broken. Thus, to sort our presidential data frame by the length of each president’s term,
we specify that we want the column term_length in descending order.

mypresidents %>% arrange(desc(term_length))

# A tibble: 11 6
name start end party term_length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01 1952
2 Reagan 1981-01-20 1989-01-20 Republican 8.01 1980
3 Clinton 1993-01-20 2001-01-20 Democratic 8.01 1992
4 Bush 2001-01-20 2009-01-20 Republican 8.01 2000
5 Obama 2009-01-20 2017-01-20 Democratic 8.01 2008
6 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
7 Johnson 1963-11-22 1969-01-20 Democratic 5.17 NA
8 Carter 1977-01-20 1981-01-20 Democratic 4.00 1976
9 Bush 1989-01-20 1993-01-20 Republican 4.00 1988
10 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
11 Ford 1974-08-09 1977-01-20 Republican 2.45 NA

A number of presidents completed either one or two full terms, and thus have the exact
same term length (4 or 8 years, respectively). To break these ties, we can further sort by
party and elected.

mypresidents %>% arrange(desc(term_length), party, elected)

# A tibble: 11 6
name start end party term_length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Clinton 1993-01-20 2001-01-20 Democratic 8.01 1992
2 Obama 2009-01-20 2017-01-20 Democratic 8.01 2008
3 Eisenhower 1953-01-20 1961-01-20 Republican 8.01 1952
4 Reagan 1981-01-20 1989-01-20 Republican 8.01 1980
5 Bush 2001-01-20 2009-01-20 Republican 8.01 2000
6 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
7 Johnson 1963-11-22 1969-01-20 Democratic 5.17 NA
8 Carter 1977-01-20 1981-01-20 Democratic 4.00 1976
9 Bush 1989-01-20 1993-01-20 Republican 4.00 1988
10 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
11 Ford 1974-08-09 1977-01-20 Republican 2.45 NA

Note that the default sort order is ascending order, so we do not need to specify an order
if that is what we want.

Figure 4.5: The summarize() function. At left, a data frame. At right, the resulting data
frame after aggregating three of the columns.

4.1.4 summarize() with group_by()


Our last of the five verbs for single-table analysis is summarize(), which is nearly always
used in conjunction with group_by(). The previous four verbs provided us with means to
manipulate a data frame in powerful and flexible ways. But the extent of the analysis we
can perform with these four verbs alone is limited. On the other hand, summarize() with
group_by() enables us to make comparisons.
When used alone, summarize() collapses a data frame into a single row. This is illus-
trated in Figure 4.5. Critically, we have to specify how we want to reduce an entire column
of data into a single value. The method of aggregation that we specify controls what will
appear in the output.

mypresidents %>%
summarize(
N = n(), first_year = min(year(start)), last_year = max(year(end)),
num_dems = sum(party == "Democratic"),
years = sum(term_length),
avg_term_length = mean(term_length))

# A tibble: 1 6
N first_year last_year num_dems years avg_term_length
<int> <dbl> <dbl> <int> <dbl> <dbl>
1 11 1953 2017 5 64 5.82

The first argument to summarize() is a data frame, followed by a list of variables that
will appear in the output. Note that every variable in the output is defined by operations
performed on vectors—not on individual values. This is essential, since if the specification
of an output variable is not an operation on a vector, there is no way for R to know how to
collapse each column.
In this example, the function n() simply counts the number of rows. This is almost
always useful information.

Pro Tip: To help ensure that data aggregation is being done correctly, use n() every time
you use summarize().

The next two variables determine the first year that one of these presidents assumed
office. This is the smallest year in the start column. Similarly, the most recent year is the
largest year in the end column. The variable num_dems simply counts the number of rows
in which the value of the party variable was Democratic. Finally, the last two variables
compute the sum and average of the term_length variable. Thus, we can quickly see that
5 of the 11 presidents who served from 1953 to 2017 were Democrats, and the average term
length over these 64 years was about 5.8 years.
This raises the question of whether Democratic or Republican presidents served a longer
average term during this time period. To figure this out, we can just execute summarize()
again, but this time, instead of the first argument being the data frame mypresidents, we
will specify that the rows of the mypresidents data frame should be grouped by the values
of the party variable. In this manner, the same computations as above will be carried out
for each party separately.

mypresidents %>%
group_by(party) %>%
summarize(
N = n(), first_year = min(year(start)), last_year = max(year(end)),
num_dems = sum(party == "Democratic"),
years = sum(term_length),
avg_term_length = mean(term_length))

# A tibble: 2 7
party N first_year last_year num_dems years avg_term_length
<chr> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 Democratic 5 1961 2017 5 28 5.6
2 Republican 6 1953 2009 0 36 6.0

This provides us with the valuable information that the six Republican presidents served
an average of 6 years in office, while the five Democratic presidents served an average of
only 5.6. As with all of the dplyr verbs, the final output is a data frame.

Pro Tip: In this chapter we are using the dplyr package. The most common way to
extract data from data tables is with SQL (structured query language). We’ll introduce
SQL in Chapter 12. The dplyr package provides a new interface that fits more smoothly
into an overall data analysis workflow and is, in our opinion, easier to learn. Once you

understand data wrangling with dplyr, it’s straightforward to learn SQL if needed. And
dplyr can work as an interface to many systems that use SQL internally.

4.2 Extended example: Ben’s time with the Mets


In this extended example, we will continue to explore Sean Lahman’s historical baseball
database, which contains complete seasonal records for all players on all Major League
Baseball teams going back to 1871. These data are made available in R via the Lahman
package [80]. Here again, while domain knowledge may be helpful, it is not necessary to
follow the example. To flesh out your understanding, try reading the Wikipedia entry on
Major League Baseball.

library(Lahman)
dim(Teams)

[1] 2805 48

The Teams table contains the seasonal results of every major league team in every season
since 1871. There are 2805 rows and 48 columns in this table, which is far too much to show
here, and would make for a quite unwieldy spreadsheet. Of course, we can take a peek at
what this table looks like by printing the first few rows of the table to the screen with the
head() command, but we won’t print that on the page of this book.
Ben worked for the New York Mets from 2004 to 2012. How did the team do during
those years? We can use filter() and select() to quickly identify only those pieces of
information that we care about.

mets <- Teams %>% filter(teamID == "NYN")
myMets <- mets %>% filter(yearID %in% 2004:2012)
myMets %>% select(yearID, teamID, W, L)

yearID teamID W L
1 2004 NYN 71 91
2 2005 NYN 83 79
3 2006 NYN 97 65
4 2007 NYN 88 74
5 2008 NYN 89 73
6 2009 NYN 70 92
7 2010 NYN 79 83
8 2011 NYN 77 85
9 2012 NYN 74 88

Notice that we have broken this down into three steps. First, we filter the rows of the
Teams data frame into only those teams that correspond to the New York Mets.2 There are
54 of those, since the Mets joined the National League in 1962.

nrow(mets)

[1] 54
2 The teamID value of NYN stands for the New York National League club.

Next, we filtered these data so as to include only those seasons in which Ben worked
for the team—those with yearID between 2004 and 2012. Finally, we printed to the screen
only those columns that were relevant to our question: the year, the team’s ID, and the
number of wins and losses that the team had.
While this process is logical, the code can get unruly, since two ancillary data frames
(mets and myMets) were created during the process. It may be the case that we’d like to
use these data frames later in the analysis. But if not, they are just cluttering our workspace,
and eating up memory. A more streamlined way to achieve the same result would be to
nest these commands together.

select(filter(Teams, teamID == "NYN" & yearID %in% 2004:2012),
  yearID, teamID, W, L)

yearID teamID W L
1 2004 NYN 71 91
2 2005 NYN 83 79
3 2006 NYN 97 65
4 2007 NYN 88 74
5 2008 NYN 89 73
6 2009 NYN 70 92
7 2010 NYN 79 83
8 2011 NYN 77 85
9 2012 NYN 74 88

This way, no additional data frames were created. However, it is easy to see that as we
nest more and more of these operations together, this code could become difficult to read.
To maintain readability, we instead chain these operations, rather than nest them (and get
the same exact results).

Teams %>%
select(yearID, teamID, W, L) %>%
filter(teamID == "NYN" & yearID %in% 2004:2012)

This piping syntax (introduced in Section 4.1.1) is provided by the dplyr package. It
retains the step-by-step logic of our original code, while being easily readable, and efficient
with respect to memory and the creation of temporary data frames. In fact, there are also
performance enhancements under the hood that make this the most efficient way to do
these kinds of computations. For these reasons we will use this syntax whenever possible
throughout the book. Note that we only have to type Teams once—it is implied by the
pipe operator (%>%) that the subsequent command takes the previous data frame as its first
argument. Thus, df %>% f(y) is equivalent to f(df, y).
We’ve answered the simple question of how the Mets performed during the time that
Ben was there, but since we are data scientists, we are interested in deeper questions. For
example, some of these seasons were subpar—the Mets had more losses than wins. Did the
team just get unlucky in those seasons? Or did they actually play as badly as their record
indicates?
In order to answer this question, we need a model for expected winning percentage. It
turns out that one of the most widely used contributions to the field of baseball analytics
(courtesy of Bill James) is exactly that. This model translates the number of runs3 that
a team scores and allows over the course of an entire season into an expectation for how
many games they should have won. The simplest version of this model is this:

\[ \widehat{WPct} = \frac{1}{1 + \left( \frac{RA}{RS} \right)^2}, \]

where RA is the number of runs the team allows, RS is the number of runs that the team
scores, and \(\widehat{WPct}\) is the team’s expected winning percentage. Luckily for us, the runs scored
and allowed are present in the Teams table, so let’s grab them and save them in a new data
frame.

3 In baseball, a team scores a run when a player traverses the bases and returns to home plate. The team
with the most runs in each game wins, and no ties are allowed.

metsBen <- Teams %>% select(yearID, teamID, W, L, R, RA) %>%
  filter(teamID == "NYN" & yearID %in% 2004:2012)
metsBen

yearID teamID W L R RA
1 2004 NYN 71 91 684 731
2 2005 NYN 83 79 722 648
3 2006 NYN 97 65 834 731
4 2007 NYN 88 74 804 750
5 2008 NYN 89 73 799 715
6 2009 NYN 70 92 671 757
7 2010 NYN 79 83 656 652
8 2011 NYN 77 85 718 742
9 2012 NYN 74 88 650 709

First, note that the runs-scored variable is called R in the Teams table, but to stick with
our notation we want to rename it RS.

metsBen <- metsBen %>% rename(RS = R) # new name = old name
metsBen

yearID teamID W L RS RA
1 2004 NYN 71 91 684 731
2 2005 NYN 83 79 722 648
3 2006 NYN 97 65 834 731
4 2007 NYN 88 74 804 750
5 2008 NYN 89 73 799 715
6 2009 NYN 70 92 671 757
7 2010 NYN 79 83 656 652
8 2011 NYN 77 85 718 742
9 2012 NYN 74 88 650 709

Next, we need to compute the team’s actual winning percentage in each of these seasons.
Thus, we need to add a new column to our data frame, and we do this with the mutate()
command.

metsBen <- metsBen %>% mutate(WPct = W / (W + L))
metsBen

yearID teamID W L RS RA WPct
1 2004 NYN 71 91 684 731 0.438
2 2005 NYN 83 79 722 648 0.512
3 2006 NYN 97 65 834 731 0.599
4 2007 NYN 88 74 804 750 0.543
5 2008 NYN 89 73 799 715 0.549
6 2009 NYN 70 92 671 757 0.432
7 2010 NYN 79 83 656 652 0.488
8 2011 NYN 77 85 718 742 0.475
9 2012 NYN 74 88 650 709 0.457

We also need to compute the model estimates for winning percentage.

metsBen <- metsBen %>% mutate(WPct_hat = 1 / (1 + (RA/RS)^2))
metsBen

yearID teamID W L RS RA WPct WPct_hat
1 2004 NYN 71 91 684 731 0.438 0.467
2 2005 NYN 83 79 722 648 0.512 0.554
3 2006 NYN 97 65 834 731 0.599 0.566
4 2007 NYN 88 74 804 750 0.543 0.535
5 2008 NYN 89 73 799 715 0.549 0.555
6 2009 NYN 70 92 671 757 0.432 0.440
7 2010 NYN 79 83 656 652 0.488 0.503
8 2011 NYN 77 85 718 742 0.475 0.484
9 2012 NYN 74 88 650 709 0.457 0.457

The expected number of wins is then the expected winning percentage times the number
of games.

metsBen <- metsBen %>% mutate(W_hat = WPct_hat * (W + L))
metsBen

yearID teamID W L RS RA WPct WPct_hat W_hat
1 2004 NYN 71 91 684 731 0.438 0.467 75.6
2 2005 NYN 83 79 722 648 0.512 0.554 89.7
3 2006 NYN 97 65 834 731 0.599 0.566 91.6
4 2007 NYN 88 74 804 750 0.543 0.535 86.6
5 2008 NYN 89 73 799 715 0.549 0.555 90.0
6 2009 NYN 70 92 671 757 0.432 0.440 71.3
7 2010 NYN 79 83 656 652 0.488 0.503 81.5
8 2011 NYN 77 85 718 742 0.475 0.484 78.3
9 2012 NYN 74 88 650 709 0.457 0.457 74.0

In this case, the Mets’ fortunes were better than expected in three of these seasons, and
worse than expected in the other six.

filter(metsBen, W >= W_hat)

yearID teamID W L RS RA WPct WPct_hat W_hat
1 2006 NYN 97 65 834 731 0.599 0.566 91.6
2 2007 NYN 88 74 804 750 0.543 0.535 86.6
3 2012 NYN 74 88 650 709 0.457 0.457 74.0

filter(metsBen, W < W_hat)

yearID teamID W L RS RA WPct WPct_hat W_hat
1 2004 NYN 71 91 684 731 0.438 0.467 75.6
2 2005 NYN 83 79 722 648 0.512 0.554 89.7
3 2008 NYN 89 73 799 715 0.549 0.555 90.0
4 2009 NYN 70 92 671 757 0.432 0.440 71.3
5 2010 NYN 79 83 656 652 0.488 0.503 81.5
6 2011 NYN 77 85 718 742 0.475 0.484 78.3

Naturally, the Mets experienced ups and downs during Ben’s time with the team. Which
seasons were best? To figure this out, we can simply sort the rows of the data frame.

arrange(metsBen, desc(WPct))

yearID teamID W L RS RA WPct WPct_hat W_hat
1 2006 NYN 97 65 834 731 0.599 0.566 91.6
2 2008 NYN 89 73 799 715 0.549 0.555 90.0
3 2007 NYN 88 74 804 750 0.543 0.535 86.6
4 2005 NYN 83 79 722 648 0.512 0.554 89.7
5 2010 NYN 79 83 656 652 0.488 0.503 81.5
6 2011 NYN 77 85 718 742 0.475 0.484 78.3
7 2012 NYN 74 88 650 709 0.457 0.457 74.0
8 2004 NYN 71 91 684 731 0.438 0.467 75.6
9 2009 NYN 70 92 671 757 0.432 0.440 71.3

In 2006, the Mets had the best record in baseball during the regular season and nearly
made the World Series. But how do these seasons rank in terms of the team’s performance
relative to our model?

metsBen %>%
mutate(Diff = W - W_hat) %>%
arrange(desc(Diff))

yearID teamID W L RS RA WPct WPct_hat W_hat Diff


1 2006 NYN 97 65 834 731 0.599 0.566 91.6 5.3840
2 2007 NYN 88 74 804 750 0.543 0.535 86.6 1.3774
3 2012 NYN 74 88 650 709 0.457 0.457 74.0 0.0199
4 2008 NYN 89 73 799 715 0.549 0.555 90.0 -0.9605
5 2009 NYN 70 92 671 757 0.432 0.440 71.3 -1.2790
6 2011 NYN 77 85 718 742 0.475 0.484 78.3 -1.3377
7 2010 NYN 79 83 656 652 0.488 0.503 81.5 -2.4954
8 2004 NYN 71 91 684 731 0.438 0.467 75.6 -4.6250
9 2005 NYN 83 79 722 648 0.512 0.554 89.7 -6.7249

So 2006 was the Mets’ most fortunate year—since they won five more games than our
model predicts—but 2005 was the least fortunate—since they won almost seven games fewer
than our model predicts. This type of analysis helps us understand how the Mets performed
in individual seasons, but we know that any randomness that occurs in individual years is
likely to average out over time. So while it is clear that the Mets performed well in some
seasons and poorly in others, what can we say about their overall performance?

We can easily summarize a single variable with the favstats() command from the
mosaic package.

favstats(~ W, data = metsBen)

min Q1 median Q3 max mean sd n missing


70 74 79 88 97 80.9 9.1 9 0
This tells us that the Mets won nearly 81 games on average during Ben’s tenure, which
corresponds almost exactly to a 0.500 winning percentage, since there are 162 games in a
regular season. But we may be interested in aggregating more than one variable at a time.
To do this, we use summarize().

metsBen %>%
summarize(
num_years = n(), total_W = sum(W), total_L = sum(L),
total_WPct = sum(W) / sum(W + L), sum_resid = sum(W - W_hat))

num_years total_W total_L total_WPct sum_resid


1 9 728 730 0.499 -10.6

In these nine years, the Mets had a combined record of 728 wins and 730 losses, for
an overall winning percentage of .499. Just one extra win would have made them exactly
0.500! (If we could pick which game, we would definitely pick the final game of the 2007
season. A win there would have resulted in a playoff berth.) However, we’ve also learned
that the team under-performed relative to our model by a total of 10.6 games over those
nine seasons.
Usually, when we are summarizing a data frame like we did above, it is interesting to
consider different groups. In this case, we can discretize these years into three chunks:
one for each of the three general managers under whom Ben worked. Jim Duquette was
the Mets’ general manager in 2004, Omar Minaya from 2005 to 2010, and Sandy Alderson
from 2011 to 2012. We can define these eras using two nested ifelse() functions (the
case_when() function in the dplyr package is helpful in such a setting).

metsBen <- metsBen %>%
mutate(
gm = ifelse(yearID == 2004, "Duquette",
ifelse(yearID >= 2011, "Alderson", "Minaya")))
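
For comparison, the same recoding can be written with case_when(), which scales more gracefully as the number of eras grows. The following sketch (using the dplyr package already loaded) produces the same gm variable:

metsBen %>%
  mutate(gm = case_when(
    yearID == 2004 ~ "Duquette",
    yearID >= 2011 ~ "Alderson",
    TRUE ~ "Minaya"))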

Next, we use the gm variable to define these groups with the group_by() operator. Combining
group_by() with summarize() in this way is very powerful. Note that while the
Mets were far more successful during Minaya's regime (i.e., many more wins than losses),
they did not meet expectations in any of the three periods.

metsBen %>%
group_by(gm) %>%
summarize(
num_years = n(), total_W = sum(W), total_L = sum(L),
total_WPct = sum(W) / sum(W + L), sum_resid = sum(W - W_hat)) %>%
arrange(desc(sum_resid))

# A tibble: 3 x 6

gm num_years total_W total_L total_WPct sum_resid


<chr> <int> <int> <int> <dbl> <dbl>
1 Alderson 2 151 173 0.466 -1.32
2 Duquette 1 71 91 0.438 -4.63
3 Minaya 6 506 466 0.521 -4.70

The full power of the chaining operator is revealed below, where we do all the analysis
at once, but retain the step-by-step logic.

Teams %>%
select(yearID, teamID, W, L, R, RA) %>%
filter(teamID == "NYN" & yearID %in% 2004:2012) %>%
rename(RS = R) %>%
mutate(
WPct = W / (W + L), WPct_hat = 1 / (1 + (RA/RS)^2),
W_hat = WPct_hat * (W + L),
gm = ifelse(yearID == 2004, "Duquette",
ifelse(yearID >= 2011, "Alderson", "Minaya"))) %>%
group_by(gm) %>%
summarize(
num_years = n(), total_W = sum(W), total_L = sum(L),
total_WPct = sum(W) / sum(W + L), sum_resid = sum(W - W_hat)) %>%
arrange(desc(sum_resid))

# A tibble: 3 x 6
gm num_years total_W total_L total_WPct sum_resid
<chr> <int> <int> <int> <dbl> <dbl>
1 Alderson 2 151 173 0.466 -1.32
2 Duquette 1 71 91 0.438 -4.63
3 Minaya 6 506 466 0.521 -4.70

More generally, we might be interested in how the Mets performed relative to our model
in the context of all teams during that nine-year period. All we need to do is
remove the teamID filter and group by franchise (franchID) instead.

Teams %>% select(yearID, teamID, franchID, W, L, R, RA) %>%
filter(yearID %in% 2004:2012) %>%
rename(RS = R) %>%
mutate(
WPct = W / (W + L), WPctHat = 1 / (1 + (RA/RS)^2),
WHat = WPctHat * (W + L)) %>%
group_by(franchID) %>%
summarize(
numYears = n(), totalW = sum(W), totalL = sum(L),
totalWPct = sum(W) / sum(W + L), sumResid = sum(W - WHat)) %>%
arrange(sumResid) %>%
print(n = 6)

# A tibble: 30 x 6
franchID numYears totalW totalL totalWPct sumResid
<fctr> <int> <int> <int> <dbl> <dbl>
1 TOR 9 717 740 0.492 -29.2
2 ATL 9 781 677 0.536 -24.0
3 COL 9 687 772 0.471 -22.7
4 CHC 9 706 750 0.485 -14.5
5 CLE 9 710 748 0.487 -13.9
6 NYM 9 728 730 0.499 -10.6
# ... with 24 more rows

We can see now that only five other teams fared worse than the Mets relative to our
model during this time period. (Note that whereas the teamID that corresponds to the Mets
is NYN, the value of the franchID variable is NYM.) Perhaps they are cursed!

4.3 Combining multiple tables


In the previous section, we illustrated how the five verbs can be chained to perform opera-
tions on a single table. This single table is reminiscent of a single well-organized spreadsheet.
But in the same way that a workbook can contain multiple spreadsheets, we will often work
with multiple tables. In Chapter 12, we will describe how multiple tables related by unique
identifiers called keys can be organized into a relational database management system.
It is more efficient for the computer to store and search tables in which “like is stored
with like.” Thus, a database maintained by the Bureau of Transportation Statistics on
the arrival times of U.S. commercial flights will consist of multiple tables, each of which
contains data about different things. For example, the nycflights13 package contains one
table about flights—each row in this table is a single flight. As there are many flights, you
can imagine that this table will get very long—hundreds of thousands of rows per year. But
there are other related kinds of information that we will want to know about these flights.
We would certainly be interested in the particular airline to which each flight belonged. It
would be inefficient to store the complete name of the airline (e.g., American Airlines
Inc.) in every row of the flights table. A simple code (e.g., AA) would take up less space on
disk. For small tables, the savings of storing two characters instead of 25 is insignificant,
but for large tables, it can add up to noticeable savings both in terms of the size of data
on disk, and the speed with which we can search it. However, we still want to have the
full names of the airlines available if we need them. The solution is to store the data about
airlines in a separate table called airlines, and to provide a key that links the data in the
two tables together.

4.3.1 inner_join()


If we examine the first few rows of the flights table, we observe that the carrier column
contains a two-character string corresponding to the airline.

library(nycflights13)
head(flights, 3)

# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923


# ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

In the airlines table, we have those same two-character strings, but also the full names
of the airline.

head(airlines, 3)

# A tibble: 3 x 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.

In order to retrieve a list of flights and the full names of the airlines that managed each
flight, we need to match up the rows in the flights table with those rows in the airlines
table that have the corresponding values for the carrier column in both tables. This is
achieved with the function inner_join().

flightsJoined <- flights %>%
inner_join(airlines, by = c("carrier" = "carrier"))
glimpse(flightsJoined)

Observations: 336,776
Variables: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
$ name <chr> "United Air Lines Inc.", "United Air Lines Inc....

Notice that the flightsJoined data frame now has an additional variable called name.
This is the column from airlines that is now attached to our combined data frame. Now
we can view the full names of the airlines instead of the cryptic two-character codes.

flightsJoined %>%
select(carrier, name, flight, origin, dest) %>%
head(3)

# A tibble: 3 x 5
carrier name flight origin dest
<chr> <chr> <int> <chr> <chr>
1 UA United Air Lines Inc. 1545 EWR IAH
2 UA United Air Lines Inc. 1714 LGA IAH
3 AA American Airlines Inc. 1141 JFK MIA
In an inner_join(), the result set contains only those rows that have matches in both
tables. In this case, all of the rows in flights have exactly one corresponding entry in
airlines, so the number of rows in flightsJoined is the same as the number of rows in
flights (this will not always be the case).

nrow(flights)

[1] 336776

nrow(flightsJoined)

[1] 336776

Pro Tip: It is always a good idea to carefully check that the number of rows returned by
a join operation is what you expected. In particular, you often want to check for rows in
one table that matched to more than one row in the other table.
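
For example, one such check (a sketch using dplyr's count()) is to confirm that no key value is duplicated in the lookup table before joining:

airlines %>%
  count(carrier) %>%
  filter(n > 1)

A result with zero rows indicates that each carrier code matches at most one airline, so the join cannot increase the number of rows.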

4.3.2 left_join()


Another commonly used type of join is a left_join(). Here the rows of the first table are
always returned, regardless of whether there is a match in the second table.
Suppose that we are only interested in flights from the NYC airports to the West Coast.
Specifically, we’re only interested in airports in the Pacific Time Zone. Thus, we filter the
airports data frame to only include those 152 airports.

airportsPT <- filter(airports, tz == -8)


nrow(airportsPT)

[1] 152

Now, if we perform an inner_join() on flights and airportsPT, matching the desti-
nations in flights to the FAA codes in airports, we retrieve only those flights that flew
to our airports in the Pacific Time Zone.

nycDestsPT <- flights %>% inner_join(airportsPT, by = c("dest" = "faa"))


nrow(nycDestsPT)

[1] 46324

However, if we use a left_join() with the same conditions, we retrieve all of the rows
of flights. NA’s are inserted into the columns where no matched data was found.

nycDests <- flights %>% left_join(airportsPT, by = c("dest" = "faa"))


nrow(nycDests)

[1] 336776

sum(is.na(nycDests$name))

[1] 290452

Left joins are particularly useful in databases in which referential integrity is broken (not
all of the keys are present—see Chapter 12).
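
A related tool is anti_join(), which returns the rows of the first table that have no match in the second. A sketch such as the following would list the destination codes in flights that do not appear in airports (output not shown):

flights %>%
  anti_join(airports, by = c("dest" = "faa")) %>%
  count(dest, sort = TRUE)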

4.4 Extended example: Manny Ramirez


In the context of baseball and the Lahman package, multiple tables are used to store informa-
tion. The batting statistics of players are stored in one table (Batting), while information
about people (most of whom are players) is in a different table (Master).
Every row in the Batting table contains the statistics accumulated by a single player
during a single stint for a single team in a single year. Thus, a player like Manny Ramirez
has many rows in the Batting table (21, in fact).

manny <- filter(Batting, playerID == "ramirma02")


nrow(manny)

[1] 21

Using what we’ve learned, we can quickly tabulate Ramirez’s most common career of-
fensive statistics. For those new to baseball, some additional background may be helpful.
A hit (H) occurs when a batter reaches base safely. A home run (HR) occurs when the ball is
hit out of the park or the runner advances through all of the bases during that play. Barry
Bonds has the record for most home runs (762) hit in a career. A player’s batting average
(BA) is the ratio of the number of hits to the number of eligible at-bats. The highest career
batting average in major league baseball history of 0.366 was achieved by Ty Cobb—season
averages above 0.300 are impressive. Finally, runs batted in (RBI) is the number of runners
(including the batter in the case of a home run) that score during that batter’s at-bat. Hank
Aaron has the record for most career RBIs with 2,297.

manny %>% summarize(
span = paste(min(yearID), max(yearID), sep = "-"),
numYears = n_distinct(yearID), numTeams = n_distinct(teamID),
BA = sum(H)/sum(AB), tH = sum(H), tHR = sum(HR), tRBI = sum(RBI))

span numYears numTeams BA tH tHR tRBI


1 1993-2011 19 5 0.312 2574 555 1831

Notice how we have used the paste() function to combine results from multiple variables
into a new variable, and how we have used the n_distinct() function to count the number
of distinct rows. In his 19-year career, Ramirez hit 555 home runs, which puts him in the
top 20 among all Major League players.

However, we also see that Ramirez played for five teams during his career. Did he
perform equally well for each of them? Breaking his statistics down by team, or by league,
is as easy as adding an appropriate group_by() command.

manny %>%
group_by(teamID) %>%
summarize(
span = paste(min(yearID), max(yearID), sep = "-"),
numYears = n_distinct(yearID), numTeams = n_distinct(teamID),
BA = sum(H)/sum(AB), tH = sum(H), tHR = sum(HR), tRBI = sum(RBI)) %>%
arrange(span)

# A tibble: 5 x 8
teamID span numYears numTeams BA tH tHR tRBI
<fctr> <chr> <int> <int> <dbl> <int> <int> <int>
1 CLE 1993-2000 8 1 0.3130 1086 236 804
2 BOS 2001-2008 8 1 0.3117 1232 274 868
3 LAN 2008-2010 3 1 0.3224 237 44 156
4 CHA 2010-2010 1 1 0.2609 18 1 2
5 TBA 2011-2011 1 1 0.0588 1 0 1
While Ramirez was very productive for Cleveland, Boston, and the Los Angeles Dodgers,
his brief tours with the Chicago White Sox and Tampa Bay Rays were less than stellar. In
the pipeline below, we can see that Ramirez spent the bulk of his career in the American
League.

manny %>%
group_by(lgID) %>%
summarize(
span = paste(min(yearID), max(yearID), sep = "-"),
numYears = n_distinct(yearID), numTeams = n_distinct(teamID),
BA = sum(H)/sum(AB), tH = sum(H), tHR = sum(HR), tRBI = sum(RBI)) %>%
arrange(span)

# A tibble: 2 x 8
lgID span numYears numTeams BA tH tHR tRBI
<fctr> <chr> <int> <int> <dbl> <int> <int> <int>
1 AL 1993-2011 18 4 0.311 2337 511 1675
2 NL 2008-2010 3 1 0.322 237 44 156
If Ramirez played in only 19 different seasons, why were there 21 rows attributed to him?
Notice that in 2008, he was traded from the Boston Red Sox to the Los Angeles Dodgers,
and thus played for both teams. Similarly, in 2010 he played for both the Dodgers and
the Chicago White Sox. When summarizing data, it is critically important to understand
exactly how the rows of your data frame are organized. To see what can go wrong here,
suppose we were interested in tabulating the number of seasons in which Ramirez hit at
least 30 home runs. The simplest solution is:

manny %>%
filter(HR >= 30) %>%
nrow()

[1] 11

But this answer is wrong, because in 2008, Ramirez hit 20 home runs for Boston before
being traded and then 17 more for the Dodgers afterwards. Neither of those rows were
counted, since they were both filtered out. Thus, the year 2008 does not appear among the
11 that we counted in the previous pipeline. Recall that each row in the manny data frame
corresponds to one stint with one team in one year. On the other hand, the question asks
us to consider each year, regardless of team. In order to get the right answer, we have to
aggregate the rows by year. Thus, the correct solution is:

manny %>%
group_by(yearID) %>%
summarize(tHR = sum(HR)) %>%
filter(tHR >= 30) %>%
nrow()

[1] 12

Note that the filter() operation is applied to tHR, the total number of home runs in a
season, and not HR, the number of home runs in a single stint for a single team in a single
season. (This distinction between filtering the rows of the original data versus the rows of
the aggregated results will appear again in Chapter 12.)
We began this exercise by filtering the Batting table for the player with playerID equal
to ramirma02. How did we know to use this identifier? This player ID is known as a key,
and in fact, playerID is the primary key defined in the Master table. That is, every row
in the Master table is uniquely identified by the value of playerID. Thus there is exactly
one row in that table for which playerID is equal to ramirma02.
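
A quick way to verify this (a sketch; output not shown) is to confirm that no playerID value is repeated in Master:

Master %>%
  count(playerID) %>%
  filter(n > 1)

If playerID is indeed a primary key, this returns zero rows.
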
But how did we know that this ID corresponds to Manny Ramirez? We can search the
Master table. The data in this table include characteristics about Manny Ramirez that do
not change across multiple seasons (with the possible exception of his weight).

Master %>% filter(nameLast == "Ramirez" & nameFirst == "Manny")

playerID birthYear birthMonth birthDay birthCountry birthState


1 ramirma02 1972 5 30 D.R. Distrito Nacional
birthCity deathYear deathMonth deathDay deathCountry deathState
1 Santo Domingo NA NA NA <NA> <NA>
deathCity nameFirst nameLast nameGiven weight height bats throws
1 <NA> Manny Ramirez Manuel Aristides 225 72 R R
debut finalGame retroID bbrefID deathDate birthDate
1 1993-09-02 2011-04-06 ramim002 ramirma02 <NA> 1972-05-30

The playerID column forms a primary key in the Master table, but it does not in
the Batting table, since as we saw previously, there were 21 rows with that playerID. In
the Batting table, the playerID column is known as a foreign key, in that it references a
primary key in another table. For our purposes, the presence of this column in both tables
allows us to link them together. This way, we can combine data from the Batting table
with data in the Master table. We do this with inner_join() by specifying the two tables
that we want to join, and the corresponding columns in each table that provide the link.
Thus, if we want to display Ramirez’s name in our previous result, as well as his age, we
must join the Batting and Master tables together.

Batting %>%
filter(playerID == "ramirma02") %>%
inner_join(Master, by = c("playerID" = "playerID")) %>%
group_by(yearID) %>%
summarize(
Age = max(yearID - birthYear), numTeams = n_distinct(teamID),
BA = sum(H)/sum(AB), tH = sum(H), tHR = sum(HR), tRBI = sum(RBI)) %>%
arrange(yearID)

# A tibble: 19 x 7
yearID Age numTeams BA tH tHR tRBI
<int> <int> <int> <dbl> <int> <int> <int>
1 1993 21 1 0.1698 9 2 5
2 1994 22 1 0.2690 78 17 60
3 1995 23 1 0.3079 149 31 107
4 1996 24 1 0.3091 170 33 112
5 1997 25 1 0.3280 184 26 88
6 1998 26 1 0.2942 168 45 145
7 1999 27 1 0.3333 174 44 165
8 2000 28 1 0.3508 154 38 122
9 2001 29 1 0.3062 162 41 125
10 2002 30 1 0.3486 152 33 107
11 2003 31 1 0.3251 185 37 104
12 2004 32 1 0.3081 175 43 130
13 2005 33 1 0.2924 162 45 144
14 2006 34 1 0.3207 144 35 102
15 2007 35 1 0.2961 143 20 88
16 2008 36 2 0.3315 183 37 121
17 2009 37 1 0.2898 102 19 63
18 2010 38 2 0.2981 79 9 42
19 2011 39 1 0.0588 1 0 1

Pro Tip: Always specify the by argument that defines the join condition. Don’t rely on
the defaults.

Notice that even though Ramirez’s age is a constant for each season, we have to use a
vector operation (i.e., max()) in order to reduce any potential vector to a single number.
Which season was Ramirez’s best as a hitter? One relatively simple measurement of
batting prowess is OPS, or On-Base Plus Slugging Percentage, which is the simple sum
of two other statistics: On-Base Percentage (OBP) and Slugging Percentage (SLG). The
former basically measures the percentage of time that a batter reaches base safely, whether
it comes via a hit (H), a base on balls (BB), or from being hit by the pitch (HBP). The latter
measures the average number of bases advanced per at-bat (AB), where a single is worth
one base, a double (X2B) is worth two, a triple (X3B) is worth three, and a home run (HR)
is worth four. (Note that every hit is exactly one of a single, double, triple, or home run.)
Let’s add this statistic to our results and use it to rank the seasons.
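
In formula form, the two components (mirroring the code below) are

OBP = (H + BB + HBP) / (AB + BB + SF + HBP)
SLG = (H + X2B + 2*X3B + 3*HR) / AB
OPS = OBP + SLG

where the numerator of SLG is the total number of bases (every hit counts once, with extra bases added for doubles, triples, and home runs).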

mannyBySeason <- Batting %>%
filter(playerID == "ramirma02") %>%
inner_join(Master, by = c("playerID" = "playerID")) %>%
group_by(yearID) %>%
summarize(
Age = max(yearID - birthYear), numTeams = n_distinct(teamID),
BA = sum(H)/sum(AB), tH = sum(H), tHR = sum(HR), tRBI = sum(RBI),
OBP = sum(H + BB + HBP) / sum(AB + BB + SF + HBP),
SLG = sum(H + X2B + 2*X3B + 3*HR) / sum(AB)) %>%
mutate(OPS = OBP + SLG) %>%
arrange(desc(OPS))
mannyBySeason

# A tibble: 19 x 10
yearID Age numTeams BA tH tHR tRBI OBP SLG OPS
<int> <int> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
1 2000 28 1 0.3508 154 38 122 0.4568 0.6970 1.154
2 1999 27 1 0.3333 174 44 165 0.4422 0.6628 1.105
3 2002 30 1 0.3486 152 33 107 0.4498 0.6468 1.097
4 2006 34 1 0.3207 144 35 102 0.4391 0.6192 1.058
5 2008 36 2 0.3315 183 37 121 0.4297 0.6014 1.031
6 2003 31 1 0.3251 185 37 104 0.4271 0.5870 1.014
7 2001 29 1 0.3062 162 41 125 0.4048 0.6087 1.014
8 2004 32 1 0.3081 175 43 130 0.3967 0.6127 1.009
9 2005 33 1 0.2924 162 45 144 0.3877 0.5939 0.982
10 1996 24 1 0.3091 170 33 112 0.3988 0.5818 0.981
11 1998 26 1 0.2942 168 45 145 0.3771 0.5989 0.976
12 1995 23 1 0.3079 149 31 107 0.4025 0.5579 0.960
13 1997 25 1 0.3280 184 26 88 0.4147 0.5383 0.953
14 2009 37 1 0.2898 102 19 63 0.4176 0.5312 0.949
15 2007 35 1 0.2961 143 20 88 0.3884 0.4928 0.881
16 1994 22 1 0.2690 78 17 60 0.3571 0.5207 0.878
17 2010 38 2 0.2981 79 9 42 0.4094 0.4604 0.870
18 1993 21 1 0.1698 9 2 5 0.2000 0.3019 0.502
19 2011 39 1 0.0588 1 0 1 0.0588 0.0588 0.118

We see that Ramirez’s OPS was highest in 2000. But 2000 was the height of the steroid
era, when many sluggers were putting up tremendous offensive numbers. As data scientists,
we know that it would be more instructive to put Ramirez’s OPS in context by comparing
it to the league average OPS in each season—the resulting ratio is often called OPS+. To
do this, we will need to compute those averages. Because there is missing data in some of
these columns in some of these years, we need to invoke the na.rm argument to ignore that
data.

mlb <- Batting %>%
filter(yearID %in% 1993:2011) %>%
group_by(yearID) %>%
summarize(lgOPS =
sum(H + BB + HBP, na.rm = TRUE) / sum(AB + BB + SF + HBP, na.rm = TRUE) +
sum(H + X2B + 2*X3B + 3*HR, na.rm = TRUE) / sum(AB, na.rm = TRUE))

Next, we need to match these league average OPS values to the corresponding entries
for Ramirez. We can do this by joining these tables together, and computing the ratio of
Ramirez’s OPS to that of the league average.

mannyRatio <- mannyBySeason %>%
inner_join(mlb, by = c("yearID" = "yearID")) %>%
mutate(OPSplus = OPS / lgOPS) %>%
select(yearID, Age, OPS, lgOPS, OPSplus) %>%
arrange(desc(OPSplus))
mannyRatio

# A tibble: 19 x 5
yearID Age OPS lgOPS OPSplus
<int> <int> <dbl> <dbl> <dbl>
1 2000 28 1.154 0.782 1.475
2 2002 30 1.097 0.748 1.466
3 1999 27 1.105 0.778 1.420
4 2006 34 1.058 0.768 1.377
5 2008 36 1.031 0.749 1.376
6 2003 31 1.014 0.755 1.344
7 2001 29 1.014 0.759 1.336
8 2004 32 1.009 0.763 1.323
9 2005 33 0.982 0.749 1.310
10 1998 26 0.976 0.755 1.292
11 1996 24 0.981 0.767 1.278
12 1995 23 0.960 0.755 1.272
13 2009 37 0.949 0.751 1.264
14 1997 25 0.953 0.756 1.261
15 2010 38 0.870 0.728 1.194
16 2007 35 0.881 0.758 1.162
17 1994 22 0.878 0.763 1.150
18 1993 21 0.502 0.736 0.682
19 2011 39 0.118 0.720 0.163

In this case, 2000 still ranks as Ramirez’s best season relative to his peers, but notice
that his 1999 season has fallen from 2nd to 3rd. Since by definition a league-average batter
has an OPS+ of 1, Ramirez posted 17 consecutive seasons with an OPS that was at least 15%
better than the average across the major leagues—a truly impressive feat.
Finally, not all joins are the same. An inner_join() requires corresponding entries in
both tables. Conversely, a left_join() returns at least as many rows as there are in the first
table, regardless of whether there are matches in the second table. Thus, an inner_join() is
bidirectional, whereas in a left_join(), the order in which you specify the tables matters.
Consider the career of Cal Ripken, who played in 21 seasons from 1981 to 2001. His
career overlapped with Ramirez’s in the nine seasons from 1993 to 2001, so for those, the
league averages we computed before are useful.

ripken <- Batting %>% filter(playerID == "ripkeca01")


nrow(inner_join(ripken, mlb, by = c("yearID" = "yearID")))

[1] 9

nrow(inner_join(mlb, ripken, by = c("yearID" = "yearID"))) #same

[1] 9

For seasons when Ramirez did not play, NA’s will be returned.

ripken %>%
left_join(mlb, by = c("yearID" = "yearID")) %>%
select(yearID, playerID, lgOPS) %>%
head(3)

yearID playerID lgOPS


1 1981 ripkeca01 NA
2 1982 ripkeca01 NA
3 1983 ripkeca01 NA

Conversely, by reversing the order of the tables in the join, we return the 19 seasons
for which we have already computed the league averages, regardless of whether there is a
match for Ripken (results not displayed).

mlb %>%
left_join(ripken, by = c("yearID" = "yearID")) %>%
select(yearID, playerID, lgOPS)

4.5 Further resources


Hadley Wickham is an enormously influential innovator in the field of statistical comput-
ing. Along with his colleagues at RStudio and other organizations, he has made significant
contributions to improve data wrangling in R. These packages are sometimes called the
“Hadleyverse” or the “tidyverse,” and are now manageable through a single tidyverse [231]
package. His papers and vignettes describing widely used packages such as dplyr [234] and
tidyr [230] are highly recommended reading. In particular, his paper on tidy data [218]
builds upon notions of normal forms—common to database designers from computer science—
to describe a process of thinking about how data should be stored and formatted. Finzer [77]
writes of a “data habit of mind” that needs to be inculcated among data scientists. The
RStudio data wrangling cheat sheet is a useful reference.
Sean Lahman, a self-described “database journalist,” has long curated his baseball data
set, which feeds the popular website baseball-reference.com. Michael Friendly maintains the
Lahman R package [80]. For the baseball enthusiast, Cleveland Indians analyst Max Marchi
and Jim Albert have written an excellent book on analyzing baseball data in R [140]. Albert
has also written a book describing how baseball can be used as a motivating example for
teaching statistics [2].

4.6 Exercises

Exercise 4.1
Each of these tasks can be performed using a single data verb. For each task, say which
verb it is:

1. Find the average of one of the variables.

2. Add a new column that is the ratio between two variables.

3. Sort the cases in descending order of a variable.



4. Create a new data table that includes only those cases that meet a criterion.

5. From a data table with three categorical variables A, B, and C, and a quantitative
variable X, produce a data frame that has the same cases but only the variables A
and X.

Exercise 4.2
Use the nycflights13 package and the flights data frame to answer the following
questions: What month had the highest proportion of cancelled flights? What month had
the lowest? Interpret any seasonal patterns.

Exercise 4.3
Use the nycflights13 package and the flights data frame to answer the following
question: What plane (specified by the tailnum variable) traveled the most times from
New York City airports in 2013? Plot the number of trips per week over the year.

Exercise 4.4
Use the nycflights13 package and the flights and planes tables to answer the fol-
lowing questions: What is the oldest plane (specified by the tailnum variable) that flew
from New York City airports in 2013? How many airplanes that flew from New York City
are included in the planes table?

Exercise 4.5
Use the nycflights13 package and the flights and planes tables to answer the fol-
lowing questions: How many planes have a missing date of manufacture? What are the five
most common manufacturers? Has the distribution of manufacturer changed over time as
reflected by the airplanes flying from NYC in 2013? (Hint: you may need to recode the
manufacturer name and collapse rare vendors into a category called Other.)

Exercise 4.6
Use the nycflights13 package and the weather table to answer the following questions:
What is the distribution of temperature in July, 2013? Identify any important outliers in
terms of the wind speed variable. What is the relationship between dewp and humid? What
is the relationship between precip and visib?

Exercise 4.7
Use the nycflights13 package and the weather table to answer the following questions:
On how many days was there precipitation in the New York area in 2013? Were there
differences in the mean visibility (visib) based on the day of the week and/or month of
the year?

Exercise 4.8
Define two new variables in the Teams data frame from the Lahman package: batting
average (BA) and slugging percentage (SLG). Batting average is the ratio of hits (H) to
at-bats (AB), and slugging percentage is total bases divided by at-bats. To compute total
bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

Exercise 4.9

Plot a time series of SLG since 1954 conditioned by lgID. Is slugging percentage typically
higher in the American League (AL) or the National League (NL)? Can you think of why
this might be the case?

Exercise 4.10
Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat
this using teams since 1969.

Exercise 4.11
The Angels have at times been called the California Angels (CAL), the Anaheim Angels
(ANA), and the Los Angeles Angels of Anaheim (LAA). Find the 10 most successful seasons
in Angels history. Have they ever won the World Series?

Exercise 4.12
Create a factor called election that divides the yearID into four-year blocks that
correspond to U.S. presidential terms. During which term have the most home runs been
hit?

Exercise 4.13
Name every player in baseball history who has accumulated at least 300 home runs (HR)
and at least 300 stolen bases (SB).

Exercise 4.14
Name every pitcher in baseball history who has accumulated at least 300 wins (W) and
at least 3,000 strikeouts (SO).

Exercise 4.15
Identify the name and year of every player who has hit at least 50 home runs in a single
season. Which player had the lowest batting average in that season?

Exercise 4.16
The Relative Age Effect is an attempt to explain anomalies in the distribution of birth
month among athletes. Briefly, the idea is that children born just after the age cut-off for
participation will be as much as 11 months older than their fellow athletes, which is enough
of a disparity to give them an advantage. That advantage will then be compounded over
the years, resulting in notably more professional athletes born in these months. Display the
distribution of birth months of baseball players who batted during the decade of the 2000s.
How are they distributed over the calendar year? Does this support the notion of a relative
age effect?

Exercise 4.17
The Violations data set in the mdsr package contains information regarding the out-
come of health inspections of restaurants in New York City. Use these data to calculate the
median violation score by zip code for zip codes in Manhattan with 50 or more inspections.
What pattern do you see between the number of inspections and the median score?

Exercise 4.18
Download data on the number of deaths by firearm from the Florida Department of Law
Enforcement. Wrangle these data and use ggplot2 to re-create Figure 6.1.
Chapter 5

Tidy data and iteration

In this chapter, we will continue to develop data wrangling skills. In particular, we will
discuss tidy data, how to automate iterative processes, common file formats, and techniques
for scraping and cleaning data, especially dates. Together with the material from Chapter 4,
these skills will provide facility with wrangling data that is foundational for data science.

5.1 Tidy data


5.1.1 Motivation
One popular source of data is Gapminder [180], the brainchild of Swedish physician and
public health researcher Hans Rosling. Gapminder contains data about countries over time
for a variety of different variables such as the prevalence of HIV (human immunodeficiency
virus) among adults aged 15 to 49 and other health and economic indicators. These data are
stored in Google Spreadsheets, or one can download them as Microsoft Excel workbooks.
The typical presentation of a small subset of such data is shown below, where we have used
the googlesheets package to pull these data directly into R.

library(mdsr)
library(googlesheets)
hiv_key <- "pyj6tScZqmEfbZyl0qjbiRQ"
hiv <- gs_key(hiv_key, lookup = FALSE) %>%
gs_read(ws = "Data", range = cell_limits(c(1, 1), c(276, 34)))
names(hiv)[1] <- "Country"
hiv %>%
filter(Country %in% c("United States", "France", "South Africa")) %>%
select(Country, `1979`, `1989`, `1999`, `2009`)

# A tibble: 3 x 5
Country `1979` `1989` `1999` `2009`
<chr> <dbl> <lgl> <dbl> <dbl>
1 France NA NA 0.3 0.4
2 South Africa NA NA 14.8 17.2
3 United States 0.0318 NA 0.5 0.6

The data set has the form of a two-dimensional array where each of the n = 3 rows
represents a country and each of the p = 4 columns is a year. Each entry represents the
percentage of adults aged 15 to 49 living with HIV in the ith country in the jth year. This
presentation of the data has some advantages. First, it is possible (with a big enough
monitor) to see all of the data. One can quickly follow the trend over time for a particular
country, and one can also estimate quite easily the percentage of data that is missing (e.g.,
NA). Thus, if visual inspection is the primary analytical technique, this spreadsheet-style
presentation can be convenient.
Alternatively, consider this presentation of those same data.

library(tidyr)
hiv_long <- hiv %>% gather(key = Year, value = hiv_rate, -Country)
hiv_long %>%
filter(Country %in% c("United States", "France", "South Africa")) %>%
filter(Year %in% c(1979, 1989, 1999, 2009))

# A tibble: 12 x 3
Country Year hiv_rate
<chr> <chr> <dbl>
1 France 1979 NA
2 South Africa 1979 NA
3 United States 1979 0.0318
4 France 1989 NA
5 South Africa 1989 NA
6 United States 1989 NA
7 France 1999 0.3000
8 South Africa 1999 14.8000
9 United States 1999 0.5000
10 France 2009 0.4000
11 South Africa 2009 17.2000
12 United States 2009 0.6000

While our data can still be represented by a two-dimensional array, it now has np = 12
rows and just three columns. Visual inspection of the data is now more difficult, since our
data are long and very narrow—the aspect ratio is not similar to that of our screen.
It turns out that there are substantive reasons to prefer the long (or tall), narrow
version of these data. With multiple tables (see Chapter 12), it is a more efficient way
for the computer to store and retrieve the data. It is more convenient for the purpose
of data analysis. And it is more scalable, in that the addition of a second variable simply
contributes another column, whereas to add another variable to the spreadsheet presentation
would require a confusing three-dimensional view, multiple tabs in the spreadsheet, or worse,
merged cells.
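
As one small illustration of that convenience, a sketch like the following computes the fraction of missing values for each country directly from the long table hiv_long created above (output not shown):

hiv_long %>%
  group_by(Country) %>%
  summarize(prop_missing = mean(is.na(hiv_rate)))
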
These gains come at a cost: we have relinquished our ability to see all the data at
once. When data sets are small, being able to see them all at once can be useful, and even
comforting. But in this era of big data, a quest to see all the data at once in a spreadsheet
layout is a fool’s errand. Learning to manage data via programming frees us from the click-
and-drag paradigm popularized by spreadsheet applications, allows us to work with data
of arbitrary size, and reduces errors. Recording our data management operations in code
also makes them reproducible (see Appendix D)—an increasingly necessary trait in this era
of collaboration. It enables us to fully separate the raw data from our analysis, which is
difficult to achieve using a spreadsheet.

Pro Tip: Always keep your raw data and your analysis in separate files. Store the
uncorrected data file (with errors and problems) and make corrections with a script (see
Appendix D) file that transforms the raw data into the data that will actually be analyzed.
This process will maintain the provenance of your data and allow analyses to be updated
with new data without having to start data wrangling from scratch.

The long, narrow format for the Gapminder data that we have outlined above is called
tidy [218]. In what follows we will further expand upon this notion, and develop more
sophisticated techniques for wrangling data.

5.1.2 What are tidy data?


Data can be as simple as a column of numbers in a spreadsheet file or as complex as the
electronic medical records collected by a hospital. A newcomer to working with data may
expect each source of data to be organized in a unique way and to require unique techniques.
The expert, however, has learned to operate with a small set of standard tools. As you’ll
see, each of the standard tools performs a comparatively simple task. Combining those
simple tasks in appropriate ways is the key to dealing with complex data.
One reason the individual tools can be simple is that each tool gets applied to data
arranged in a simple but precisely defined pattern called tidy data. Tidy data exists in
systematically defined data tables (e.g., the rectangular arrays of data seen previously), but
not all data tables are tidy.
To illustrate, Table 5.1 shows a handful of entries from a large United States Social
Security Administration tabulation of names given to babies. In particular, the table shows
how many babies of each sex were given each name in each year.

year sex name n


1955 F Judine 5
2002 M Kadir 6
1935 F Jerre 11
1935 F Elynor 12
1910 M Bertram 33
1985 F Kati 212
1942 M Grafton 22

Table 5.1: A data table showing how many babies were given each name in each year in the
U.S., for a few names.

Table 5.1 shows that there were 6 boys named Kadir born in the U.S. in 2002 and 12
girls named Elynor born in 1935. As a whole, the babynames data table covers the years
1880 through 2014 and includes a total of 337,135,426 individuals, somewhat larger than
the current population of the U.S.
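
Those totals can be computed directly from the tidy table. A sketch (assuming the babynames package is loaded) is shown below; the exact numbers depend on the version of the package installed.

babynames %>%
  summarize(
    num_rows = n(),
    num_people = sum(n),
    num_years = n_distinct(year))
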
The data in Table 5.1 are tidy because they are organized according to two simple rules.

1. The rows, called cases or observations, each refer to a specific, unique, and similar
sort of thing, e.g., girls named Elynor in 1935.
2. The columns, called variables, each have the same sort of value recorded for each row.
For instance, n gives the number of babies for each case; sex tells which gender was
assigned at birth.
When data are in tidy form, it is relatively straightforward to transform the data into
arrangements that are more useful for answering interesting questions. For instance, you
might wish to know which were the most popular baby names over all the years. Even
though Table 5.1 contains the popularity information implicitly, we need to re-arrange
these data by adding up the counts for a name across all the years before the popularity
becomes obvious, as in Table 5.2.

popular_names <- babynames %>%
group_by(sex, name) %>%
summarize(total_births = sum(n)) %>%
arrange(desc(total_births))

sex name total_births


1 M James 5105919
2 M John 5084943
3 M Robert 4796695
4 M Michael 4309198
5 F Mary 4115282
6 M William 4055473
7 M David 3577704
8 M Joseph 2570095
9 M Richard 2555330
10 M Charles 2364332

Table 5.2: The most popular baby names across all years.

The process of transforming information that is implicit in a data table into another
data table that gives the information explicitly is called data wrangling. The wrangling
itself is accomplished by using data verbs that take a tidy data table and transform it into
another tidy data table in a different form. In Chapter 4, you were introduced to several
data verbs.
Table 5.3 displays results from the Minneapolis mayoral election. Unlike babynames, it
is not in tidy form, though the display is attractive and neatly laid out. There are helpful
labels and summaries that make it easy for a person to read and draw conclusions. (For
instance, Ward 1 had a higher voter turnout than Ward 2, and both wards were lower than
the city total.)
However, being neat is not what makes data tidy. Table 5.3 violates the first rule for
tidy data.
1. Rule 1: The rows, called cases, each must represent the same underlying attribute,
that is, the same kind of thing.
That’s not true in Table 5.3. For most of the table, the rows represent a single precinct.
But other rows give ward or city-wide totals. The first two rows are captions describing
the data, not cases.
2. Rule 2: Each column is a variable containing the same type of value for each case.
That’s mostly true in Table 5.3, but the tidy pattern is interrupted by labels that
are not variables. For instance, the first two cells in row 15 are the label “Ward 1
Subtotal,” which is different from the ward/precinct identifiers that are the values in
most of the first column.
Table 5.3: Ward and precinct votes cast in the 2013 Minneapolis mayoral election.

Conforming to the rules for tidy data simplifies summarizing and analyzing data. For
instance, in the tidy babynames table, it is easy (for a computer) to find the total number
of babies: just add up all the numbers in the n variable. It is similarly easy to find the
number of cases: just count the rows. And if you want to know the total number of Ahmeds
or Sherinas across the years, there is an easy way to do that.
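
For example, a sketch of that last computation (output not shown) is:

babynames %>%
  filter(name %in% c("Ahmed", "Sherina")) %>%
  group_by(name) %>%
  summarize(total = sum(n))
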
In contrast, it would be more difficult in the Minneapolis election data to find, say, the
total number of ballots cast. If you take the seemingly obvious approach and add up the
numbers in column I of Table 5.3 (labelled “Total Ballots Cast”), the result will be three
times the true number of ballots, because some of the rows contain summaries, not cases.
Indeed, if you wanted to do calculations based on the Minneapolis election data, you
would be far better off to put it in a tidy form.
The tidy form in Table 5.4 is, admittedly, not as attractive as the form published by
the Minneapolis government. But it is much easier to use for the purpose of generating
summaries and analyses.
Once data are in a tidy form, you can present them in ways that can be more effective
than a formatted spreadsheet. For example, the data graphic in Figure 5.1 presents the
turnout in each ward in a way that makes it easy to see how much variation there is within
and among precincts.
The tidy format also makes it easier to bring together data from different sources. For
instance, to explain the variation in voter turnout, you might want to consider variables
such as party affiliation, age, income, etc. Such data might be available on a ward-by-ward
basis from other records, such as public voter registration logs and census records. Tidy
data can be wrangled into forms that can be connected to one another (i.e., using the
inner_join() function from Chapter 4). This task would be difficult if you had to deal
with an idiosyncratic format for each different source of data.

ward precinct registered voters absentee total turnout


1 1 28 492 27 0.27
1 4 29 768 26 0.37
1 7 47 291 8 0.16
2 1 63 1011 39 0.36
2 4 53 117 3 0.07
2 7 39 138 7 0.14
2 10 87 196 5 0.07
3 3 71 893 101 0.37
3 6 102 927 71 0.35

Table 5.4: A selection from the Minneapolis election data in tidy form.

[Figure: Voter Turnout (%) plotted against Precinct for each ward.]

Figure 5.1: A graphical depiction of voter turnout in the different wards.

Variables

In data science, the word variable has a different meaning than in mathematics. In algebra,
a variable is an unknown quantity. In data, a variable is known—it has been measured.
Rather, the word variable refers to a specific quantity or quality that can vary from case to
case. There are two major types of variables:

• Categorical variables record type or category and often take the form of a word.

• Quantitative variables record a numerical attribute. A quantitative variable is just


what it sounds like: a number.

A categorical variable tells into which category or group a case falls. For instance, in
the baby names data table, sex is a categorical variable with two levels F and M, standing
for female and male. Similarly, the name variable is categorical. It happens that there are
93,889 different levels for name, ranging from Aaron, Ab, and Abbie to Zyhaire, Zylis, and
Zymya.

Precinct First Second Third Ward


6 P-04 undervote undervote undervote W-6
2 P-06 BOB FINE MARK ANDREW undervote W-10
10 P-02D NEAL BAXTER BETSY HODGES DON SAMUELS W-7
5 P-01 DON SAMUELS undervote undervote W-5
27 P-03 CAM WINTON DON SAMUELS OLE SAVIOR W-1

Table 5.5: Individual ballots in the Minneapolis election. Each voter votes in one ward in
one precinct. The ballot marks the voter’s first three choices for mayor.

Cases and what they represent


As noted previously, a row of a tidy data table refers to a case. To this point, you may have
little reason to prefer the word case to row. When working with a data table, it is important
to keep in mind what a case stands for in the real world. Sometimes the meaning is obvious.
For instance, Table 5.5 is a tidy data table showing the ballots in the Minneapolis mayoral
election in 2013. Each case is an individual voter’s ballot. (The voters were directed to mark
their ballot with their first choice, second choice, and third choice among the candidates.
This is part of a procedure called ranked choice voting.)
The case in Table 5.5 is a different sort of thing than the case in Table 5.4. In Table 5.4,
a case is a ward in a precinct. But in Table 5.5, the case is an individual ballot. Similarly,
in the baby names data (Table 5.1), a case is a name and sex and year while in Table 5.2
the case is a name and sex.
When thinking about cases, ask this question: What description would make every
case unique? In the vote summary data, a precinct does not uniquely identify a case.
Each individual precinct appears in several rows. But each precinct and ward combination
appears once and only once. Similarly, in Table 5.1, name and sex do not specify a unique
case. Rather, you need the combination of name-sex-year to identify a unique row.
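
One way to check such a claim (a sketch; output not shown) is to count the rows for each combination and confirm that none is repeated:

babynames %>%
  group_by(name, sex, year) %>%
  summarize(num_rows = n()) %>%
  filter(num_rows > 1)

If name, sex, and year together identify each case, this returns zero rows.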

Runners and races


Table 5.6 displays some of the results from a 10-mile running race held each year in Wash-
ington, D.C.
What is the meaning of a case here? It is tempting to think that a case is a person.
After all, it is people who run road races. But notice that individuals appear more than
once: Jane Poole ran each year from 2003 to 2007. (Her times improved consistently as she
got older!) Jane Smith ran in the races from 1999 to 2006, missing only the year 2000 race.
This suggests that the case is a runner in one year’s race.

Codebooks
Data tables do not necessarily display all the variables needed to figure out what makes
each row unique. For such information, you sometimes need to look at the documentation
of how the data were collected and what the variables mean.
The codebook is a document—separate from the data table—that describes various
aspects of how the data were collected, what the variables mean and what the different
levels of categorical variables refer to. The word codebook comes from the days when data
was encoded for the computer in ways that make it hard for a human to read. A codebook
should include information about how the data were collected and what constitutes a case.
Figure 5.2 shows part of the codebook for the HELPrct data table from the mosaicData
package. In R, codebooks for data tables are available from the help() function.

name.yob sex age year gun


1 jane polanek 1974 F 32 2006 114.50
2 jane poole 1948 F 55 2003 92.72
3 jane poole 1948 F 56 2004 87.28
4 jane poole 1948 F 57 2005 85.05
5 jane poole 1948 F 58 2006 80.75
6 jane poole 1948 F 59 2007 78.53
7 jane schultz 1964 F 35 1999 91.37
8 jane schultz 1964 F 37 2001 79.13
9 jane schultz 1964 F 38 2002 76.83
10 jane schultz 1964 F 39 2003 82.70
11 jane schultz 1964 F 40 2004 87.92
12 jane schultz 1964 F 41 2005 91.47
13 jane schultz 1964 F 42 2006 88.43
14 jane smith 1952 F 47 1999 90.60
15 jane smith 1952 F 49 2001 97.87

Table 5.6: An excerpt of runners’ performance over time in a 10-mile race.

help(HELPrct)

For the runners data in Table 5.6, a codebook should tell you that the meaning of the
gun variable is the time from when the start gun went off to when the runner crosses the
finish line and that the unit of measurement is minutes. It should also state what might
be obvious: that age is the person’s age in years and sex has two levels, male and female,
represented by M and F.

Multiple tables
It is often the case that creating a meaningful display of data involves combining data
from different sources and about different kinds of things. For instance, you might want
your analysis of the runners’ performance data in Table 5.6 to include temperature and
precipitation data for each year’s race. Such weather data is likely contained in a table of
daily weather measurements.
In many circumstances, there will be multiple tidy tables, each of which contains in-
formation relative to your analysis but has a different kind of thing as a case. We saw
in Chapter 4 how the inner_join() and left_join() functions can be used to combine
multiple tables, and in Chapter 12 we will further develop skills for working with relational
databases. For now, keep in mind that being tidy is not about shoving everything into one
table.

5.2 Reshaping data


Each row of a tidy data table is an individual case. It is often useful to re-organize the
same data in a such a way that a case has a different meaning. This can make it easier to
perform wrangling tasks such as comparisons, joins, and the inclusion of new data.
Consider the format of BP_wide shown in Table 5.7, in which each case is a research study
subject and there are separate variables for the measurement of systolic blood pressure
(SBP) before and after exposure to a stressful environment.

Description: The HELP study was a clinical trial for adult inpatients recruited from a
detoxification unit. Patients with no primary care physician were randomized to
receive a multidisciplinary assessment and a brief motivational intervention or usual
care, with the goal of linking them to primary medical care.
Usage: data(HELPrct)
Format: Data frame with 453 observations on the following variables.

age: subject age at baseline (in years)


anysub: use of any substance post-detox: a factor with levels no yes
cesd: Center for Epidemiologic Studies Depression measure at baseline (possible
range 0-60: high scores indicate more depressive symptoms)
d1: lifetime number of hospitalizations for medical problems (measured at baseline)
daysanysub: time (in days) to first use of any substance post-detox
...

Details: Eligible subjects were adults, who spoke Spanish or English, reported alcohol,
heroin or cocaine as their first or second drug of choice, resided in proximity to the
primary care clinic to which they would be referred or were homeless. Patients with
established primary care relationships they planned to continue, significant dementia,
specific plans to leave the Boston area that would prevent research participation, fail-
ure to provide contact information for tracking purposes, or pregnancy were excluded.
Source: http://nhorton.people.amherst.edu/help

Figure 5.2: Part of the codebook for the HELPrct data table from the mosaicData package.

Exactly the same data can be presented in the format of the BP_narrow data table
(Table 5.8), where the case is an individual occasion for blood-pressure measurement.

subject before after


BHO 160 115
GWB 120 135
WJC 105 145

Table 5.7: BP_wide: a data table in a wide format

Each of the formats BP_wide and BP_narrow has its advantages and its disadvantages.
For example, it is easy to find the before-and-after change in blood pressure using BP_wide.

BP_wide %>% mutate(change = after - before)

On the other hand, a narrow format is more flexible for including additional variables,
for example the date of the measurement or the diastolic blood pressure as in Table 5.9.
The narrow format also makes it feasible to add in additional measurement occasions. For
instance, Table 5.9 shows several “after” measurements for subject WJC. (Such repeated
measures are a common feature of scientific studies.) A simple strategy allows you to get
the benefits of either format: convert from wide to narrow or from narrow to wide as suits
your purpose.

subject when sbp


BHO before 160
GWB before 120
WJC before 105
BHO after 115
GWB after 135
WJC after 145

Table 5.8: BP_narrow: a tidy data table in a narrow format.

subject when sbp dbp date


BHO before 160 69 13683.00
GWB before 120 54 10337.00
BHO before 155 65 13095.00
WJC after 145 75 12006.00
WJC after NA 65 14694.00
WJC after 130 60 15963.00
GWB after 135 NA 14372.00
WJC before 105 60 7533.00
BHO after 115 78 17321.00

Table 5.9: A data table extending the information in Tables 5.8 and 5.7 to include additional
variables and repeated measurements. The narrow format facilitates including new cases or
variables.

5.2.1 Data verbs for converting wide to narrow and vice versa
Transforming a data table from wide to narrow is the action of the gather() data verb:
A wide data table is the input and a narrow data table is the output. The reverse task,
transforming from narrow to wide, involves the data verb spread(). Both functions are
implemented in the tidyr package.

5.2.2 Spreading
The spread() function converts a data table from narrow to wide. Carrying out this
operation involves specifying some information in the arguments to the function. The
value is the variable in the narrow format that is to be divided up into multiple variables
in the resulting wide format. The key is the name of the variable in the narrow format that
identifies for each case individually which column in the wide format will receive the value.
For instance, in the narrow form of BP_narrow (Table 5.8) the value variable is sbp. In
the corresponding wide form, BP_wide (Table 5.7), the information in sbp will be spread
between two variables: before and after. The key variable in BP_narrow is when. Note
that the different categorical levels in when specify which variable in BP_wide will be the
destination for the sbp value of each case. Only the key and value variables are involved
in the transformation from narrow to wide. Other variables in the narrow table, such as
subject in BP_narrow, are used to define the cases. Thus, to translate from BP_narrow to
BP_wide we would write this code:

BP_narrow %>% spread(key = when, value = sbp)



5.2.3 Gathering
Now consider how to transform BP_wide into BP_narrow. The names of the variables to
be gathered together, before and after, will become the categorical levels in the narrow
form. That is, they will make up the key variable in the narrow form. The data analyst has
to invent a name for this variable. There are all sorts of sensible possibilities, for instance
before_or_after. In gathering BP_wide into BP_narrow, the concise variable name when
was chosen.
Similarly, a name must be specified for the variable that is to hold the values in the
variables being gathered. Again, there are many reasonable possibilities. It is sensible to
choose a name that reflects the kind of thing those values are, in this case systolic blood
pressure. So, sbp is a good choice.
Finally, the analyst needs to specify which variables are to be gathered. For instance, it
hardly makes sense to gather subject with the other variables; it will remain as a separate
variable in the narrow result. Values in subject will be repeated as necessary to give each
case in the narrow format its own correct value of subject. In summary, to convert BP_wide
into BP_narrow, we run the following command.

BP_wide %>% gather(key = when, value = sbp, before, after)

The names of the key and value arguments are given as arguments. These are the names
invented by the data analyst; those names are not part of the wide input to gather(). The
arguments after the key and value are the names of the variables to be gathered.
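
To make the round trip concrete, here is a minimal sketch that rebuilds the wide table from
the values shown in Table 5.7, gathers it into the narrow form, and then spreads it back to
the wide form. (The data.frame() call is only there to make the example self-contained.)

library(dplyr)
library(tidyr)
# Re-create the wide table shown in Table 5.7
BP_wide <- data.frame(
  subject = c("BHO", "GWB", "WJC"),
  before  = c(160, 120, 105),
  after   = c(115, 135, 145),
  stringsAsFactors = FALSE
)
# Wide to narrow ...
BP_narrow <- BP_wide %>% gather(key = when, value = sbp, before, after)
# ... and back to wide again (the column order may differ, but the content matches)
BP_narrow %>% spread(key = when, value = sbp)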

5.2.4 Example: Gender-neutral names


In “A Boy Named Sue” country singer Johnny Cash famously told the story of a boy
toughened in life—eventually reaching gratitude—by being given a girl’s name. The conceit
is of course the rarity of being a boy with the name Sue, and indeed, Sue is given to about
300 times as many girls as boys (at least being recorded in this manner: Data entry errors
may account for some of these names).

babynames %>%
filter(name == "Sue") %>%
group_by(name, sex) %>%
summarise(total = sum(n))

Source: local data frame [2 x 3]


Groups: name [?]

name sex total


<chr> <chr> <int>
1 Sue F 144424
2 Sue M 519

On the other hand, some names that are predominantly given to girls are also commonly
given to boys. Although only 15% of people named Robin are male, it is easy to think of
a few famous men with that name: the actor Robin Williams, the singer Robin Gibb, and
the basketball player Robin Lopez (not to mention Batman’s sidekick) come to mind.

babynames %>%
filter(name == "Robin") %>%
group_by(name, sex) %>%
summarise(total = sum(n))

Source: local data frame [2 x 3]


Groups: name [?]

name sex total


<chr> <chr> <int>
1 Robin F 288636
2 Robin M 44026

This computational paradigm (e.g., filtering) works well if you want to look at gender
balance in one name at a time, but suppose you want to find the most gender-neutral names
from all 93,889 names in babynames? For this, it would be useful to have the results in a
wide format, like the one shown below.

babynames %>%
filter(name %in% c("Sue", "Robin", "Leslie")) %>%
group_by(name, sex) %>%
summarise(total = sum(n)) %>%
spread(key = sex, value = total, fill=0)

Source: local data frame [3 x 3]


Groups: name [3]

name F M
* <chr> <dbl> <dbl>
1 Leslie 264054 112533
2 Robin 288636 44026
3 Sue 144424 519

The spread() function can help us generate the wide format. Note that the sex variable
is the key used in the conversion. A fill of zero is appropriate here: For a name like Aaban
or Aadam, where there are no females, the entry for F should be zero.

BabyWide <- babynames %>%
  group_by(sex, name) %>%
  summarize(total = sum(n)) %>%
  spread(key = sex, value = total, fill = 0)
head(BabyWide, 3)

# A tibble: 3 × 3
name F M
<chr> <dbl> <dbl>
1 Aaban 0 72
2 Aabha 21 0
3 Aabid 0 5

One way to define “approximately the same” is to take the smaller of the ratios M/F
and F/M. If females greatly outnumber males, then F/M will be large, but M/F will be
small. If the sexes are about equal, then both ratios will be near one. The smaller will
never be greater than one, so the most balanced names are those with the smaller of the
ratios near one.
The code to identify the most balanced gender-neutral names out of the names with
more than 50,000 babies of each sex is shown below. Remember, a ratio of one means
exactly balanced; a ratio of 0.5 means two to one in favor of one sex; 0.33 means three
to one. (The pmin() transformation function returns the smaller of the two arguments for
each individual case.)

BabyWide %>%
filter(M > 50000, F > 50000) %>%
mutate(ratio = pmin(M / F, F / M) ) %>%
arrange(desc(ratio)) %>%
head(3)

# A tibble: 3 × 4
name F M ratio
<chr> <dbl> <dbl> <dbl>
1 Riley 81605 87494 0.933
2 Jackie 90337 78148 0.865
3 Casey 75060 108595 0.691

Riley has been the most gender-balanced name, followed by Jackie. Where does your
name fall on this list?
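
If you want to check a particular name, a small sketch along the same lines will do it; the
name "Jordan" below is just an arbitrary example, so substitute your own.

BabyWide %>%
  mutate(ratio = pmin(M / F, F / M)) %>%
  filter(name == "Jordan")   # "Jordan" is an arbitrary example name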

5.3 Naming conventions


Like any language, R has some rules that you cannot break, but also many conventions that
you can—but should not—break. There are a few simple rules that apply when creating a
name for an object:

• The name cannot start with a digit. So you cannot assign the name 100NCHS to a
data frame, but NCHS100 is fine. This rule is to make it easy for R to distinguish
between object names and numbers. It also helps you avoid mistakes such as writing
2pi when you mean 2*pi.

• The name cannot contain any punctuation symbols other than . and _. So ?NCHS
or N*Hanes are not legitimate names. However, you can use . and _ in a name. For
reasons that will be explained later, the use of . in function names has a specific
meaning, but should otherwise be avoided. The use of _ is preferred.

• The case of the letters in the name matters. So NCHS, nchs, Nchs, and nChs, etc., are
all different names that only look similar to a human reader, not to R.
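
A quick illustration of the last rule (the object names here are arbitrary): nchs and NCHS
are two entirely separate objects as far as R is concerned.

nchs <- 10
NCHS <- 20
nchs + NCHS   # 30: the two names refer to different objects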

Pro Tip: Do not use . in function names, to avoid conflicting with internal functions.

One of R’s strengths is its modularity—many people have contributed many packages
that do many different things. However, this decentralized paradigm has resulted in many
different people writing code using many different conventions. The resulting lack of uni-
formity can make code harder to read. We suggest adopting a style guide and sticking
to it—we have attempted to do that in this book. However, the inescapable use of other
people’s code results in inevitable deviations from that style.
Two public style guides for R are widely adopted and influential: Google’s R Style Guide
and the Style Guide in Hadley Wickham’s Advanced R book [220]. Needless to say, they
don’t always agree. In this book, we follow the latter as closely as possible. This means:

• We use underscores (_) in variable and function names. The use of periods (.) in
function names is restricted to S3 methods.

• We use spaces liberally and prefer multiline, narrow blocks of code to single lines of
wide code (although we have relaxed this in many of our examples to save space).

• We use CamelCase for the names of data tables. This means that each “word” in a
name starts with a capital letter, but there are no spaces (e.g., Teams, MedicareCharges,
WorldCities, etc.).

5.4 Automation and iteration


Calculators free human beings from having to perform arithmetic computations by hand.
Similarly, programming languages free humans from having to perform iterative computa-
tions by re-running chunks of code, or worse, copying-and-pasting a chunk of code many
times, while changing just one or two things in each chunk.
For example, in Major League Baseball there are 30 teams, and the game has been
played for over 100 years. There are a number of natural questions that we might want to
ask about each team (e.g., which player has accrued the most hits for that team?) or about
each season (e.g., which seasons had the highest levels of scoring?). If we can write a chunk
of code that will answer these questions for a single team or a single season, then we should
be able to generalize that chunk of code to work for all teams or seasons. Furthermore, we
should be able to do this without having to re-type that chunk of code. In this section, we
present a variety of techniques for automating these types of iterative operations.

5.4.1 Vectorized operations


In every programming language that we can think of, there is a way to write a loop. For
example, you can write a for() loop in R the same way you can with most programming
languages. Recall that the Teams data frame contains one row for each team in each MLB
season.

library(Lahman)
names(Teams)

[1] "yearID" "lgID" "teamID" "franchID"


[5] "divID" "Rank" "G" "Ghome"
[9] "W" "L" "DivWin" "WCWin"
[13] "LgWin" "WSWin" "R" "AB"
[17] "H" "X2B" "X3B" "HR"
[21] "BB" "SO" "SB" "CS"
[25] "HBP" "SF" "RA" "ER"
[29] "ERA" "CG" "SHO" "SV"
[33] "IPouts" "HA" "HRA" "BBA"
[37] "SOA" "E" "DP" "FP"
[41] "name" "park" "attendance" "BPF"


[45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"

What might not be immediately obvious is that columns 15 through 40 of this data
frame contain numerical data about how each team performed in that season. To see this,
you can execute the str() command to see the structure of the data frame, but we suppress
that output here. For data frames, a similar alternative that is a little cleaner is glimpse().

str(Teams)
glimpse(Teams)

Regardless of your prior knowledge of baseball, you might be interested in computing the
averages of these 26 numeric columns. However, you don’t want to have to type the names
of each of them, or re-type the mean() command 26 times. Thus, most programmers will
immediately identify this as a situation in which a loop is a natural and efficient solution.

averages <- NULL
for (i in 15:40) {
  averages[i - 14] <- mean(Teams[, i], na.rm = TRUE)
}
names(averages) <- names(Teams)[15:40]
averages

R AB H X2B X3B HR BB SO
681.946 5142.492 1346.273 227.625 47.104 101.137 473.649 737.949
SB CS HBP SF RA ER ERA CG
112.272 48.766 56.096 44.677 681.946 570.895 3.815 50.481
SHO SV IPouts HA HRA BBA SOA E
9.664 23.668 4022.383 1346.084 101.137 474.011 731.229 186.337
DP FP
140.186 0.962

This certainly works. However, it is almost always possible (and usually preferable) to
perform such operations in R without explicitly defining a loop. R programmers prefer to
use the concept of applying an operation to each element in a vector. This often requires
only one line of code, with no appeal to indices.
It is important to understand that the fundamental architecture of R is based on vectors.
That is, in contrast to general-purpose programming languages like C++ or Python that
distinguish between single items—like strings and integers—and arrays of those items, in R
a “string” is just a character vector of length 1. There is no special kind of atomic object.
Thus, if you assign a single “string” to an object, R still stores it as a vector.

a <- "a string"


class(a)

[1] "character"

length(a)

[1] 1

As a consequence of this construction, R is highly optimized for vectorized operations
(see Appendix B for more detailed information about R internals). Loops, by their nature,
do not take advantage of this optimization. Thus, R provides several tools for performing
loop-like operations without actually writing a loop. This can be a challenging conceptual
hurdle for those who are used to more general-purpose programming languages.
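
To make the idea concrete before introducing those tools, here is a small sketch (the vector
is arbitrary): arithmetic and many built-in functions operate on every element of a vector
at once, with no loop in sight.

x <- 1:5
x^2       # 1 4 9 16 25: each element is squared
sqrt(x)   # square root of each element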

Pro Tip: Try to avoid writing for() loops in R, even when it seems like the easiest
solution.

5.4.2 The apply() family of functions


To apply a function to the rows or columns of a matrix or data frame, use apply(). In this
example, we calculate the mean of each of the statistics defined above, all at once. Compare
this to the for() loop written above.

Teams %>%
select(15:40) %>%
apply(MARGIN = 2, FUN = mean, na.rm = TRUE)

R AB H X2B X3B HR BB SO
681.946 5142.492 1346.273 227.625 47.104 101.137 473.649 737.949
SB CS HBP SF RA ER ERA CG
112.272 48.766 56.096 44.677 681.946 570.895 3.815 50.481
SHO SV IPouts HA HRA BBA SOA E
9.664 23.668 4022.383 1346.084 101.137 474.011 731.229 186.337
DP FP
140.186 0.962

The first argument to apply() is the matrix or data frame that you want to do something
to. The second argument specifies whether you want to apply the function FUN to the rows
or the columns of the matrix. Any further arguments are passed as options to FUN. Thus,
this command applies the mean() function to the 15th through the 40th columns of the
Teams data frame, while removing any NAs that might be present in any of those columns.
Note that the row-wise averages have no meaning in this case, but you could calculate
them by setting the MARGIN argument to 1 instead of 2:

Teams %>%
select(15:40) %>%
apply(MARGIN = 1, FUN = mean, na.rm = TRUE)

Of course, we began by taking the subset of the columns that were all numeric values.
If you tried to take the mean() of a non-numeric vector, you would get a warning (and a
value of NA).

Teams %>%
select(teamID) %>%
apply(MARGIN = 2, FUN = mean, na.rm = TRUE)

Warning in mean.default(x, ..., na.rm = na.rm):


argument is not numeric or logical: returning NA

teamID
NA
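
If you would rather not rely on column positions at all, one option, assuming a version of
dplyr that provides the select_if() helper, is to pick out the numeric columns
programmatically. This is only a sketch; note that it keeps every numeric column (including
yearID and Rank), so the result covers a slightly broader set of columns than the 15:40
subset used above.

library(dplyr)
library(Lahman)
Teams %>%
  select_if(is.numeric) %>%
  apply(MARGIN = 2, FUN = mean, na.rm = TRUE)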

sapply() and lapply()


Often you will want to apply a function to each element of a vector or list. For example,
the franchise now known as the Los Angeles Angels of Anaheim has gone by several names
during its time in MLB.

angels <- Teams %>%
  filter(franchID == "ANA") %>%
  group_by(teamID, name) %>%
  summarise(began = first(yearID), ended = last(yearID)) %>%
  arrange(began)
angels

Source: local data frame [4 x 4]


Groups: teamID [3]

teamID name began ended


<fctr> <chr> <int> <int>
1 LAA Los Angeles Angels 1961 1964
2 CAL California Angels 1965 1996
3 ANA Anaheim Angels 1997 2004
4 LAA Los Angeles Angels of Anaheim 2005 2015

The franchise began as the Los Angeles Angels (LAA) in 1961, then became the California
Angels (CAL) in 1965, the Anaheim Angels (ANA) in 1997, before taking their current name
(LAA again) in 2005. This situation is complicated by the fact that the teamID LAA was
re-used. This sort of schizophrenic behavior is unfortunately common in many data sets.
Now, suppose we want to find the length, in number of characters, of each of those team
names. We could check each one manually using the function nchar():

angels_names <- angels$name


nchar(angels_names[1])

[1] 18

nchar(angels_names[2])

[1] 17

nchar(angels_names[3])

[1] 14

nchar(angels_names[4])

[1] 29

But this would grow tiresome if we had many names. It would be simpler, more efficient,
more elegant, and scalable to apply the function nchar() to each element of the vector
angels_names. We can accomplish this using either sapply() or lapply().

sapply(angels_names, FUN = nchar)

Los Angeles Angels California Angels


18 17
Anaheim Angels Los Angeles Angels of Anaheim
14 29

lapply(angels_names, FUN = nchar)

[[1]]
[1] 18

[[2]]
[1] 17

[[3]]
[1] 14

[[4]]
[1] 29

The key difference between sapply() and lapply() is that the former will try to return a
vector or matrix, whereas the latter will always return a list. Recall that the main difference
between lists and data.frames is that the elements (columns) of a data.frame have to
have the same length, whereas the elements of a list are arbitrary. So while lapply() is
more versatile, we usually find sapply() to be more convenient when it is appropriate.

Pro Tip: Use sapply() whenever you want to do something to each element of a vector,
and get a vector in return.

One of the most powerful uses of these iterative functions is that you can apply any
function, including a function that you have defined (see Appendix C for a discussion of
how to write user-defined functions). For example, suppose we want to display the top 5
seasons in terms of wins for each of the Angels teams.

top5 <- function(x, teamname) {
  x %>%
    filter(name == teamname) %>%
    select(teamID, yearID, W, L, name) %>%
    arrange(desc(W)) %>%
    head(n = 5)
}

We can now do this for each element of our vector with a single call to lapply().

angels_list <- lapply(angels_names, FUN = top5, x = Teams)
angels_list

[[1]]
teamID yearID W L name
1 LAA 1962 86 76 Los Angeles Angels
2 LAA 1964 82 80 Los Angeles Angels


3 LAA 1961 70 91 Los Angeles Angels
4 LAA 1963 70 91 Los Angeles Angels

[[2]]
teamID yearID W L name
1 CAL 1982 93 69 California Angels
2 CAL 1986 92 70 California Angels
3 CAL 1989 91 71 California Angels
4 CAL 1985 90 72 California Angels
5 CAL 1979 88 74 California Angels

[[3]]
teamID yearID W L name
1 ANA 2002 99 63 Anaheim Angels
2 ANA 2004 92 70 Anaheim Angels
3 ANA 1998 85 77 Anaheim Angels
4 ANA 1997 84 78 Anaheim Angels
5 ANA 2000 82 80 Anaheim Angels

[[4]]
teamID yearID W L name
1 LAA 2008 100 62 Los Angeles Angels of Anaheim
2 LAA 2014 98 64 Los Angeles Angels of Anaheim
3 LAA 2009 97 65 Los Angeles Angels of Anaheim
4 LAA 2005 95 67 Los Angeles Angels of Anaheim
5 LAA 2007 94 68 Los Angeles Angels of Anaheim

Finally, we can collect the results into a data frame by passing the resulting list to the
bind_rows() function. Below, we do this and then compute the average number of wins across
the top five seasons for each Angels team name. Based on these data, the Los Angeles Angels of
Anaheim has been the most successful incarnation of the franchise, when judged by average
performance in the best five seasons.

angels_list %>% bind_rows() %>%
  group_by(teamID, name) %>%
  summarize(N = n(), mean_wins = mean(W)) %>%
  arrange(desc(mean_wins))

Source: local data frame [4 x 4]


Groups: teamID [3]

teamID name N mean_wins


<fctr> <chr> <int> <dbl>
1 LAA Los Angeles Angels of Anaheim 5 96.8
2 CAL California Angels 5 90.8
3 ANA Anaheim Angels 5 88.4
4 LAA Los Angeles Angels 4 77.0

Once you’ve read Chapter 12, think about how you might do this operation in SQL. It
is not that easy!

5.4.3 Iteration over subgroups with dplyr::do()


In Chapter 4 we introduced data verbs that could be chained to perform very powerful data
wrangling operations. These functions—which come from the dplyr package—operate on
data frames and return data frames. The do() function in dplyr allows you to apply an
arbitrary function to the groups of a data frame. That is, you will first define a grouping
using the group_by() function, and then apply a function to all of those groups. Note that
this is similar to sapply(), in that you are mapping a function over a collection of values,
but whereas the values used in sapply() are individual elements of a vector, in dplyr::do()
they are groups defined on a data frame.
One of the more enduring models in sabermetrics is Bill James’s formula for estimating
a team’s expected winning percentage, given knowledge only of the team’s runs scored and
runs allowed to date (recall that the team that scores the most runs wins a given game).
This statistic is known—unfortunately—as Pythagorean Winning Percentage, even though
it has nothing to do with Pythagoras. The formula is simple, but non-linear:

$$WPct = \frac{RS^2}{RS^2 + RA^2} = \frac{1}{1 + (RA/RS)^2},$$
where RS and RA are the number of runs the team has scored and allowed, respectively.
If we define x = RS/RA to be the team's run ratio, then this is a function of one variable
having the form $f(x) = \frac{1}{1 + (1/x)^2}$.
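
For a concrete sense of what the formula implies (using illustrative numbers rather than
actual Lahman data), a team that scores 800 runs while allowing 700 has a run ratio of
800/700 ≈ 1.14 and an expected winning percentage of 800^2/(800^2 + 700^2) =
640,000/1,130,000 ≈ 0.566, or roughly 92 wins over a 162-game season.

# Illustrative numbers only: 800 runs scored, 700 runs allowed
800^2 / (800^2 + 700^2)   # 0.566
1 / (1 + (700/800)^2)     # the same value, via the second form of the formula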

This model seems to fit quite well upon visual inspection—in Figure 5.3 we show the
data since 1954, along with a line representing the model. Indeed, this model has also been
successful in other sports, albeit with wholly different exponents.

exp_wpct <- function(x) {
  return(1 / (1 + (1/x)^2))
}
TeamRuns <- Teams %>%
filter(yearID >= 1954) %>%
rename(RS = R) %>%
mutate(WPct = W / (W + L), run_ratio = RS/RA) %>%
select(yearID, teamID, lgID, WPct, run_ratio)
ggplot(data = TeamRuns, aes(x = run_ratio, y = WPct)) +
geom_vline(xintercept = 1, color= "darkgray", linetype = 2) +
geom_hline(yintercept = 0.5, color= "darkgray", linetype = 2) +
geom_point(alpha = 0.3) +
stat_function(fun = exp_wpct, size = 2, color = "blue") +
xlab("Ratio of Runs Scored to Runs Allowed") + ylab("Winning Percentage")

However, the exponent of 2 was posited by James. One can imagine having the exponent
become a parameter k, and trying to find the optimal fit. Indeed, researchers have found
that in baseball, the optimal value of k is not 2, but something closer to 1.85 [208]. It is
easy enough for us to find the optimal value using the fitModel() function from the mosaic
package.

exWpct <- fitModel(WPct ~ 1/(1 + (1/run_ratio)^k), data = TeamRuns)


coef(exWpct)

k
1.84

Figure 5.3: Fit for the Pythagorean Winning Percentage model for all teams since 1954.

Furthermore, researchers investigating this model have found that the optimal value
of the exponent differs based on the era during which the model is fit. We can use the
dplyr::do() function to do this for all decades in baseball history. First, we must write a
short function that will return a data frame containing the optimal exponent.

fit_k <- function(x) {
  mod <- fitModel(formula = WPct ~ 1/(1 + (1/run_ratio)^k), data = x)
  return(data.frame(k = coef(mod)))
}

Note that this function will return the optimal value of the exponent over any time
period.

fit_k(TeamRuns)

k
k 1.84

Finally, we compute the decade for each year, and apply fit_k() to those decades. In
the code below, the . refers to the result of the previous command, which in this case is
the data frame containing the information for a single decade.

TeamRuns %>%
mutate(decade = yearID %/% 10 * 10) %>%
group_by(decade) %>%
do(fit_k(x = .))

Source: local data frame [7 x 2]


Groups: decade [7]

decade k
<dbl> <dbl>
1 1950 1.69
2 1960 1.90
3 1970 1.74
4 1980 1.93
5 1990 1.88
6 2000 1.94
7 2010 1.78

Note the variation in the optimal value of k. Even though the exponent is not the same
in each decade, it varies within a fairly narrow range, from 1.69 to 1.94.
As a second example, consider the problem of identifying the team in each season that
led their league in home runs. We can easily write a function that will, for a specific year
and league, return a data frame with one row that contains the team with the most home
runs.

hr_leader <- function(x) {
  # x is a subset of Teams for a single year and league
  x %>%
    select(yearID, lgID, teamID, HR) %>%
    arrange(desc(HR)) %>%
    head(n = 1)
}

We can verify that in 1961, the New York Yankees led the American League in home
runs.

Teams %>%
filter(yearID == 1961 & lgID == "AL") %>%
hr_leader()

yearID lgID teamID HR


1 1961 AL NYA 240

We can use dplyr::do() to quickly find all the teams that led their league in home runs.

hr_leaders <- Teams %>%
  group_by(yearID, lgID) %>%
  do(hr_leader(.))
head(hr_leaders, 4)

Source: local data frame [4 x 4]


Groups: yearID, lgID [4]

yearID lgID teamID HR


<int> <fctr> <fctr> <int>
1 1871 NA CH1 10
2 1872 NA BL1 14
3 1873 NA BS1 13
4 1874 NA BS1 18

Figure 5.4: Number of home runs hit by the team with the most home runs, 1916–2014.
Note how the AL has consistently bested the NL since the introduction of the designated
hitter (DH) in 1973.

In this manner, we can compute the average number of home runs hit in a season by
the team that hit the most.

mean(HR ~ lgID, data = hr_leaders)

AA AL FL NA NL PL UA
40.6 153.3 51.0 13.8 126.1 66.0 32.0

mean(HR ~ lgID, data = filter(hr_leaders, yearID >= 1916))

AA AL FL NA NL PL UA
NaN 171 NaN NaN 158 NaN NaN

In Figure 5.4 we show how this number has changed over time. We restrict our attention
to the years since 1916, during which only the AL and NL leagues have existed. We note
that while the top HR-hitting teams were comparable across the two leagues until the
mid-1970s, the AL teams have dominated since their league adopted the designated hitter rule
in 1973.

hr_leaders %>%
filter(yearID >= 1916) %>%
ggplot(aes(x = yearID, y = HR, color = lgID)) + geom_line() +
geom_point() + geom_smooth(se = 0) + geom_vline(xintercept = 1973) +
annotate("text", x=1974, y=25, label="AL adopts DH", hjust="left")

5.4.4 Iteration with mosaic::do


In the previous section we learned how to repeat operations while iterating over the ele-
ments of a vector. It can also be useful to simply repeat an operation many times and

Figure 5.5: Distribution of best-fitting exponent across single seasons from 1961–2014.

collect the results. Obviously, if the result of the operation is deterministic (i.e., you get
the same answer every time) then this is pointless. On the other hand, if this operation
involves randomness, then you won’t get the same answer every time, and understanding
the distribution of values that your random operation produces can be useful. We will flesh
out these ideas further in Chapter 10.
For example, in our investigation into the expected winning percentage in baseball, we
determined that the optimal exponent fit to the 61 seasons worth of data from 1954 to
2014 was 1.85. However, we also found that if we fit this same model separately for each
decade, the optimal exponent varies from 1.69 to 1.94. This gives us a rough sense of the
variability in this exponent—we observed values between 1.6 and 2, which may give some
insights as to plausible values for the exponent.
Nevertheless, our choice to stratify by decade was somewhat arbitrary. A more natural
question might be: What is the distribution of optimal exponents fit to a single-season’s
worth of data? How confident should we be in that estimate of 1.85?
We can use dplyr::do() and the function we wrote previously to compute the 61 actual
values. The resulting distribution is summarized in Figure 5.5.

k_actual <- TeamRuns %>%
  group_by(yearID) %>%
  do(fit_k(.))
favstats(~ k, data = k_actual)

min Q1 median Q3 max mean sd n missing


1.31 1.69 1.89 1.97 2.33 1.85 0.19 62 0

ggplot(data = k_actual, aes(x = k)) + geom_density() +
  xlab("Best fit exponent for a single season")

Since we only have 61 samples, we might obtain a better understanding of the sampling
distribution of the mean k by resampling—sampling with replacement—from these 61 values.
(This is a statistical technique known as the bootstrap, which we describe further in
Chapter 7.) A simple way to do this is with the do() function in the mosaic package.

Figure 5.6: Bootstrap distribution of mean optimal exponent.

bstrap <- do(1000) * mean(~ k, data = resample(k_actual))


head(bstrap, 3)

mean
1 1.85
2 1.84
3 1.85

civals <- qdata(~ mean, c(0.025, .975), data = bstrap)


civals

quantile p
2.5% 1.81 0.025
97.5% 1.89 0.975

After repeating the resampling 1,000 times, we found that 95% of the resampled expo-
nents were between 1.805 and 1.893, with our original estimate of 1.85 lying somewhere
near the center of that distribution. This distribution, along with the boundaries of the
middle 95%, is depicted in Figure 5.6.

ggplot(data = bstrap, aes(x = mean)) + geom_density() +
  xlab("Distribution of resampled means") +
  geom_vline(data = civals, aes(xintercept = quantile), color = "red",
             linetype = 3)

5.5 Data intake


Every easy data format is alike. Every difficult data format is difficult in its
own way. —inspired by Leo Tolstoy and Hadley Wickham

The tools that we develop in this book allow one to work with data in R. However, most
data sets are not available in R to begin with—they are often stored in a different file format.
While R has sophisticated abilities for reading data in a variety of formats, it is not without
limits. For data that are not in a file, one common form of data intake is Web scraping, in
which data from the Internet are processed as (structured) text and converted into data.
Such data often have errors that stem from blunders in data entry or from deficiencies in
the way data are stored or coded. Correcting such errors is called data cleaning.
The native file format for R is usually given the suffix .Rda (or sometimes, .RData). Any
object in your R environment can be written to this file format using the save() command.
Using the compress argument will make these files smaller.

save(hr_leaders, file = "hr_leaders.rda", compress = "xz")

This file format is usually an efficient means for storing data, but it is not the most
portable. To load a stored object into your R environment, use the load() command.

load(file = "hr_leaders.rda")

Pro Tip: Maintaining the provenance of data from beginning to the end of an analysis
is an important part of a reproducible workflow. This can be facilitated by creating one R
Markdown file or notebook that undertakes the data wrangling and generates an analytic
data set (using save()) that can be read (using load()) into a second R Markdown file.
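
A minimal sketch of that workflow follows; the object and file names are hypothetical, and
hr_leaders simply stands in for the result of the wrangling steps.

# In the wrangling R Markdown file: build the analytic data set, then save it
analytic_data <- hr_leaders   # stand-in for the real data wrangling
save(analytic_data, file = "analytic_data.rda", compress = "xz")

# In the analysis R Markdown file: load the saved object and proceed
load(file = "analytic_data.rda")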

5.5.1 Data-table friendly formats


Many formats for data are essentially equivalent to data tables. When you come across
data in a format that you don’t recognize, it is worth checking whether it is one of the
data-table friendly formats. Sometimes the filename extension provides an indication. Here
are several, each with a brief description:

CSV: a non-proprietary comma separated text format that is widely used for data ex-
change between different software packages. CSVs are easy to understand, but are
not compressed, and therefore can take up more space on disk than other formats.

Pro Tip: Be careful with date and time variables in CSV format: these can sometimes
be formatted in inconsistent ways that make it more challenging to ingest.

Software-package-specific formats: some common examples include:


Octave (and through that, MATLAB): widely used in engineering and physics
Stata: commonly used for economic research
SPSS: commonly used for social science research
Minitab: often used in business applications
SAS: often used for large data sets


Epi: used by the Centers for Disease Control (CDC) for health and epidemiology
data

Relational databases: the form that much of institutional, actively-updated data are
stored in. This includes business transaction records, government records, Web logs,
and so on. (See Chapter 12 for a discussion of relational database management sys-
tems.)
Excel: a set of proprietary spreadsheet formats heavily used in business. Watch out,
though. Just because something is stored in an Excel format doesn’t mean it is a
data table. Excel is sometimes used as a kind of tablecloth for writing down data
with no particular scheme in mind.
Web-related: For example:
• HTML (hypertext markup language): <table> format
• XML (extensible markup language) format, a tree-based document structure
• JSON (JavaScript Object Notation) is an increasingly common data format that
breaks the “rows-and-columns” paradigm (see Section 17.2.4)
• Google spreadsheets: published as HTML
• Application programming interfaces (API)

The procedure for reading data in one of these formats varies depending on the format.
For Excel or Google spreadsheet data, it is sometimes easiest to use the application software
to export the data as a CSV file. There are also R packages for reading directly from
either (readxl and googlesheets, respectively), which are useful if the spreadsheet is being
updated frequently. For the technical software package formats, the foreign R package
provides useful reading and writing functions. For relational databases, even if they are
on a remote server, there are several useful R packages that allow you to connect to these
databases directly, most notably dplyr and DBI. CSV and HTML <table> formats are
frequently encountered sources for data scraping. The next subsections give a bit more
detail about how to read them into R.

CSV (comma separated value) files


This text format can be read with a huge variety of software. It has a data table format,
with the values of variables in each case separated by commas. Here is an example of the
first several lines of a CSV file:

"year","sex","name","n","prop"
1880,"F","Mary",7065,0.0723835869064085
1880,"F","Anna",2604,0.0266789611187951
1880,"F","Emma",2003,0.0205214896777829
1880,"F","Elizabeth",1939,0.0198657855642641
1880,"F","Minnie",1746,0.0178884278469341
1880,"F","Margaret",1578,0.0161672045489473

The top row usually (but not always) contains the variable names. Quotation marks are
often used at the start and end of character strings—these quotation marks are not part of
the content of the string, but are useful if, say, you want to include a comma in the text of
a field. CSV files are often named with the .csv suffix; it is also common for them to be
named with .txt, .dat, or other things. You will also see characters other than commas
being used to delimit the fields: Tabs and vertical bars are particularly common.
Since reading from a CSV file is so common, several implementations are available. The
read.csv() function in the base package is perhaps the most widely used, but the more
recent read_csv() function in the readr package is noticeably faster for large CSVs. CSV
files need not exist on your local hard drive. For example, here is a way to access a .csv
file over the Internet using a URL (universal resource locator).

myURL <- "http://tiny.cc/dcf/houses-for-sale.csv"


Houses <- readr::read_csv(myURL)
head(Houses, 3)

# A tibble: 3 × 16
price lot_size waterfront age land_value construction air_cond fuel
<int> <dbl> <int> <int> <int> <int> <int> <int>
1 132500 0.09 0 42 50000 0 0 3
2 181115 0.92 0 0 22300 0 0 2
3 109000 0.19 0 133 7300 0 0 2
# ... with 8 more variables: heat <int>, sewer <int>, living_area <int>,
# pct_college <int>, bedrooms <int>, fireplaces <int>, bathrooms <dbl>,
# rooms <int>

Just as reading a data file from the Internet uses a URL, reading a file on your computer
uses a complete name, called a path to the file. Although many people are used to using a
mouse-based selector to access their files, being specific about the full path to your files is
important to ensure the reproducibility of your code (see Appendix D).
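
For fields delimited by tabs or vertical bars rather than commas, the readr package provides
analogous functions. Here is a short sketch; the file names are hypothetical.

library(readr)
Grades <- read_tsv("grades.txt")                   # tab-delimited file
Grades <- read_delim("grades.txt", delim = "|")    # vertical-bar-delimited file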

HTML tables
Web pages are HTML documents, which are then translated by a browser to the formatted
content that users see. HTML includes facilities for presenting tabular content. The HTML
<table> markup is often the way human-readable data is arranged.
When you have the URL of a page containing one or more tables, it is sometimes easy
to read them into R as data tables. Since they are not CSVs, we can’t use read csv().
Instead, we use functionality in the rvest package to ingest the HTML as a data structure
in R. Once you have the content of the Web page, you can translate any tables in the page
from HTML to data table format.
In this brief example, we will investigate the progression of the world record time in the
mile run, as detailed on Wikipedia. This page (see Figure 5.7) contains several tables,
each of which contains a list of new world records for a different class of athlete (e.g., men,
women, amateur, professional, etc.).

library(rvest)
library(methods)
url <- "http://en.wikipedia.org/wiki/Mile_run_world_record_progression"
tables <- url %>%
read_html() %>%
html_nodes("table")

The result, tables, is not a data table. Instead, it is a list (see Appendix B) of the
tables found in the Web page. Use length() to find how many items there are in the list
of tables.

Figure 5.7: Part of a page on mile-run world records from Wikipedia. Two separate data
tables are visible. You can’t tell from this small part of the page, but there are seven tables
altogether on the page. These two tables are the third and fourth in the page.

length(tables)

[1] 7

You can access any of those tables using the [[() operator. The first table is tables[[1]],
the second table is tables[[2]], and so on. The third table—which corresponds to amateur
men up until 1862—is shown in Table 5.10.

Table3 <- html_table(tables[[3]])

Time Athlete Nationality Date Venue


4:52 Cadet Marshall United Kingdom 2 September 1852 Addiscome
4:45 Thomas Finch United Kingdom 3 November 1858 Oxford
4:45 St. Vincent Hammick United Kingdom 15 November 1858 Oxford
4:40 Gerald Surman United Kingdom 24 November 1859 Oxford
4:33 George Farran United Kingdom 23 May 1862 Dublin

Table 5.10: The third table embedded in the Wikipedia page on running records.

Likely of greater interest is the information in the fourth table, which corresponds to
the current era of International Amateur Athletics Federation world records. The first few
rows of that table are shown in Table 5.11. The last row of that table (not shown) contains
the current world record of 3:43.13, which was set by Hicham El Guerrouj of Morocco in
Rome on July 7th, 1999.

Table4 <- html_table(tables[[4]])


Table4 <- select(Table4, -Auto) # remove unwanted column

Time Athlete Nationality Date Venue


4:14.4 John Paul Jones United States 31 May 1913[5] Allston, Mass.
4:12.6 Norman Taber United States 16 July 1915[5] Allston, Mass.
4:10.4 Paavo Nurmi Finland 23 August 1923[5] Stockholm
4:09.2 Jules Ladoumègue France 4 October 1931[5] Paris
4:07.6 Jack Lovelock New Zealand 15 July 1933[5] Princeton, N.J.
4:06.8 Glenn Cunningham United States 16 June 1934[5] Princeton, N.J.

Table 5.11: The fourth table embedded in the Wikipedia page on running records.

5.5.2 APIs
An application programming interface (API) is a protocol for interacting with a computer
program that you can’t control. It is a set of agreed-upon instructions for using a “black-
box”—not unlike the manual for a television’s remote control. APIs provide access to
massive troves of public data on the Web, from a vast array of different sources. Not all
APIs are the same, but by learning how to use them, you can dramatically increase your
ability to pull data into R without having to “scrape” it.
If you want to obtain data from a public source, it is a good idea to check to see whether:
a) the company has a public API; b) someone has already written an R package for said
interface. These packages don’t provide the actual data—they simply provide a series of R
functions that allow you to access the actual data. The documentation for each package
will explain how to use it to collect data from the original source.

5.5.3 Cleaning data


A person somewhat knowledgeable about running would have little trouble interpreting
Tables 5.10 and 5.11 correctly. The Time is in minutes and seconds. The Date gives the
day on which the record was set. When the data table is read into R, both Time and Date
are stored as character strings. Before they can be used, they have to be converted into
a format that the computer can process like a date and time. Among other things, this
requires dealing with the footnote (listed as [5]) at the end of the date information.
Data cleaning refers to taking the information contained in a variable and transforming
it to a form in which that information can be used.
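
As a small sketch of one such cleaning step, the footnote marker can be stripped with gsub()
before the date is parsed with the dmy() function from the lubridate package (introduced
later in this section); the example value comes from Table 5.11.

library(lubridate)
dmy(gsub("\\[5\\]", "", "31 May 1913[5]"))
# [1] "1913-05-31"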

Recoding
Table 5.12 displays a few variables from the Houses data table we downloaded earlier. It
describes 1,728 houses for sale in Saratoga, NY.1 The full table includes additional variables
such as living area, price, bedrooms, and bathrooms. The data on house systems such
as sewer type and heat type have been stored as numbers, even though they are really
categorical.
There is nothing fundamentally wrong with using integers to encode, say, fuel type,
though it may be confusing to interpret results. What is worse is that the numbers imply
a meaningful order to the categories when there is none.
1 The example comes from Richard De Veaux at Williams College.

fuel heat sewer construction


3 4 2 0
2 3 2 0
2 3 3 0
2 2 2 0
2 2 3 1

Table 5.12: Four of the variables from the houses-for-sale.csv file giving features of the
Saratoga houses stored as integer codes. Each case is a different house.

To translate the integers to a more informative coding, you first have to find out what
the various codes mean. Often, this information comes from the codebook, but sometimes
you will need to contact the person who collected the data. Once you know the translation,
you can use spreadsheet software to enter them into a data table, like this one for the houses:

Translations <- readr::read_csv("http://tiny.cc/dcf/house_codes.csv")


Translations %>% head(5)

# A tibble: 5 × 3
code system_type meaning
<int> <chr> <chr>
1 0 new_const no
2 1 new_const yes
3 1 sewer_type none
4 2 sewer_type private
5 3 sewer_type public

Translations describes the codes in a format that makes it easy to add new code values
as the need arises. The same information can also be presented a wide format as in Table
5.13.

CodeVals <- Translations %>%
  spread(key = system_type, value = meaning, fill = "invalid")

code central_air fuel_type heat_type new_const sewer_type
0 no invalid invalid no invalid
1 yes invalid invalid yes none
2 invalid gas hot air invalid private
3 invalid electric hot water invalid public
4 invalid oil electric invalid invalid

Table 5.13: The Translations data table rendered in a wide format.

In CodeVals, there is a column for each system type that translates the integer code to a
meaningful term. In cases where the integer has no corresponding term, invalid has been
entered. This provides a quick way to distinguish between incorrect entries and missing
entries. To carry out the translation, we join each variable, one at a time, to the data table
of interest. Note how the by argument changes for each variable:

Houses <- Houses %>%
  left_join(CodeVals %>% select(code, fuel_type), by = c(fuel = "code")) %>%
  left_join(CodeVals %>% select(code, heat_type), by = c(heat = "code")) %>%
  left_join(CodeVals %>% select(code, sewer_type), by = c(sewer = "code"))

Table 5.14 shows the re-coded data. We can compare this to the previous display in
Table 5.12.

fuel type heat type sewer type


1 electric electric private
2 gas hot water private
3 gas hot water public
4 gas hot air private
5 gas hot air public
6 gas hot air private

Table 5.14: The Houses data with re-coded categorical variables.

From strings to numbers


You have seen two major types of variables: quantitative and categorical. You are used
to using quoted character strings as the levels of categorical variables, and numbers for
quantitative variables.
Often, you will encounter data tables that have variables whose meaning is numeric but
whose representation is a character string. This can occur when one or more cases is given
a non-numeric value, e.g., not available.
The as.numeric() function will translate character strings with numerical content into
numbers. But as.character() goes the other way. For example, in the OrdwayBirds data,
the Month, Day, and Year variables are all being stored as character vectors, even though
their evident meaning is numeric.

OrdwayBirds %>%
select(Timestamp, Year, Month, Day) %>%
glimpse()

Observations: 15,829
Variables: 4
$ Timestamp <chr> "4/14/2010 13:20:56", "", "5/13/2010 16:00:30", "5/1...
$ Year <chr> "1972", "", "1972", "1972", "1972", "1972", "1972", ...
$ Month <chr> "7", "", "7", "7", "7", "7", "7", "7", "7", "7", "7"...
$ Day <chr> "16", "", "16", "16", "16", "16", "16", "16", "16", ...

We can convert the strings to numbers using mutate() and parse_number(). Note how
the empty strings (i.e., "") in those fields are automatically converted into NA’s, since they
cannot be converted into valid numbers.

library(readr)
OrdwayBirds <- OrdwayBirds %>%
mutate(Month = parse_number(Month), Year = parse_number(Year),
Day = parse_number(Day))
OrdwayBirds %>%
select(Timestamp, Year, Month, Day) %>%
glimpse()

Observations: 15,829
Variables: 4
$ Timestamp <chr> "4/14/2010 13:20:56", "", "5/13/2010 16:00:30", "5/1...
$ Year <dbl> 1972, NA, 1972, 1972, 1972, 1972, 1972, 1972, 1972, ...
$ Month <dbl> 7, NA, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
$ Day <dbl> 16, NA, 16, 16, 16, 16, 16, 16, 16, 16, 17, 18, 18, ...
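
For comparison, base R's as.numeric() behaves similarly on strings it cannot interpret,
returning NA with a warning; the values below are illustrative only.

as.numeric(c("16", "not available"))
# Warning: NAs introduced by coercion
# [1] 16 NA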

Dates

Unfortunately, dates are often recorded as character strings (e.g., 29 October 2014). Among
other important properties, dates have a natural order. When you plot values such as 16
December 2015 and 29 October 2016, you expect the December date to come after the
October date, even though this is not true alphabetically of the string itself.
When plotting a value that is numeric, you expect the axis to be marked with a few
round numbers. A plot from 0 to 100 might have ticks at 0, 20, 40, 60, 100. It is similar for
dates. When you are plotting dates within one month, you expect the day of the month to
be shown on the axis. If you are plotting a range of several years, it would be appropriate
to show only the years on the axis.
When you are given dates stored as a character vector, it is usually necessary to convert
them to a data type designed specifically for dates. For instance, in the OrdwayBirds data,
the Timestamp variable refers to the time the data were transcribed from the original lab
notebook to the computer file. This variable is currently stored as a character string, but
we can translate it into a genuine date using functions from the lubridate package.
These dates are written in a format showing month/day/year hour:minute:second.
The mdy_hms() function from the lubridate package converts strings in this format to a
date. Note that the data type of the When variable is now time.

library(lubridate)
WhenAndWho <- OrdwayBirds %>%
mutate(When = mdy_hms(Timestamp)) %>%
select(Timestamp, Year, Month, Day, When, DataEntryPerson) %>%
glimpse()

Observations: 15,829
Variables: 6
$ Timestamp <chr> "4/14/2010 13:20:56", "", "5/13/2010 16:00:30"...
$ Year <dbl> 1972, NA, 1972, 1972, 1972, 1972, 1972, 1972, ...
$ Month <dbl> 7, NA, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ...
$ Day <dbl> 16, NA, 16, 16, 16, 16, 16, 16, 16, 16, 17, 18...
$ When <dttm> 2010-04-14 13:20:56, NA, 2010-05-13 16:00:30,...
$ DataEntryPerson <chr> "Jerald Dosch", "Caitlin Baker", "Caitlin Bake...

With the When variable now recorded as a timestamp, we can create a sensible plot
showing when each of the transcribers completed their work, as in Figure 5.8.

Figure 5.8: The transcribers of OrdwayBirds from lab notebooks worked during different
time intervals.

WhenAndWho %>% ggplot(aes(x = When, y = DataEntryPerson)) +


geom_point(alpha = 0.1, position = "jitter")

Many of the same operations that apply to numbers can be used on dates. For example,
the range of dates that each transcriber worked can be calculated as a difference in times
(i.e., an interval()), and shown in Table 5.15. This makes it clear that Jolani worked on
the project for nearly a year (329 days), while Abby’s first transcription was also her last.

WhenAndWho %>%
group_by(DataEntryPerson) %>%
summarize(start = first(When), finish = last(When)) %>%
mutate(duration = interval(start, finish) / ddays(1))

DataEntryPerson start finish duration


Abby Colehour 2011-04-23 15:50:24 2011-04-23 15:50:24 0.00
Brennan Panzarella 2010-09-13 10:48:12 2011-04-10 21:58:56 209.47
Emily Merrill 2010-06-08 09:10:01 2010-06-08 14:47:21 0.23
Jerald Dosch 2010-04-14 13:20:56 2010-04-14 13:20:56 0.00
Jolani Daney 2010-06-08 09:03:00 2011-05-03 10:12:59 329.05
Keith Bradley-Hewitt 2010-09-21 11:31:02 2011-05-06 17:36:38 227.25
Mary Catherine Muiz 2012-02-02 08:57:37 2012-04-30 14:06:27 88.21

Table 5.15: Starting and ending dates for each transcriber involved in the OrdwayBirds
project.

There are many similar lubridate functions for converting strings in different formats
into dates, e.g., ymd(), dmy(), and so on. There are also functions like hour(), yday(), etc.
for extracting certain pieces of variables encoded as dates.
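
A few illustrative calls are shown below; the date is the one used as an example earlier in
this section.

library(lubridate)
d <- dmy("29 October 2014")
ymd("2014-10-29")   # the same date, parsed from year-month-day order
yday(d)             # day of the year: 302
month(d)            # 10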

Internally, R uses several different classes to represent dates and times. For timestamps
(also referred to as datetimes), these classes are POSIXct and POSIXlt. For most purposes,
you can treat these as being the same, but internally, they are stored differently. A POSIXct
object is stored as the number of seconds since the UNIX epoch (1970-01-01), whereas
POSIXlt objects are stored as a list of year, month, day, etc. character strings.

now()

[1] "2016-11-23 11:19:59 EST"

class(now())

[1] "POSIXct" "POSIXt"

class(as.POSIXlt(now()))

[1] "POSIXlt" "POSIXt"

For dates that do not include times, the Date class is most commonly used.

as.Date(now())

[1] "2016-11-23"

Factors or strings?
R was designed with a special data type for holding categorical data: factor. Factors store
categorical data efficiently and provide a means to put the categorical levels in whatever or-
der is desired. Unfortunately, factors also make cleaning data more confusing. The problem
is that it is easy to mistake a factor for a character string, but they have different properties
when it comes to converting to a numeric or date form. This is especially problematic when
using the character processing techniques in Chapter 15.
By default, readr::read_csv() will interpret character strings as strings and not as
factors. Other functions such as read.csv() convert character strings into factors by de-
fault. Cleaning such data often requires converting them back to a character format using
as.character(). Failing to do this when needed can result in completely erroneous results
without any warning.
For this reason, the data tables used in this book have been stored with categorical or text
data in character format. Be aware that data provided by other packages do not necessarily
follow this convention. If you get mysterious results when working with such data, consider
the possibility that you are working with factors rather than character vectors. Recall that
summary(), glimpse(), and str() will all reveal the data types of each variable in a data
frame.
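
A small illustration of the pitfall (the values are arbitrary): converting a factor directly
with as.numeric() returns the internal level codes, not the numbers the text appears to
contain.

f <- factor(c("10", "20", "30"))
as.numeric(f)                 # 1 2 3: the underlying level codes
as.numeric(as.character(f))   # 10 20 30: the values that were intended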

Pro Tip: It’s always a good idea to carefully check all variables and data wrangling
operations to ensure that reasonable values are generated.

CSV files in this book are typically read with read_csv() provided by the readr package.
If, for some reason, you prefer to use the read.csv() function, we recommend setting its
stringsAsFactors argument to FALSE to ensure that text data are stored as character
strings.

Figure 5.9: Screenshot of Wikipedia’s list of Japanese nuclear reactors.

5.5.4 Example: Japanese nuclear reactors


Dates and times are an important aspect of many analyses. In the example below, the vector
example contains human-readable datetimes stored as character by R. The ymd_hms()
function from lubridate will convert this into POSIXct—a datetime format. This makes
it possible for R to do date arithmetic.

library(lubridate)
example <- c("2017-04-29 06:00:00", "2017-12-31 12:00:00")
str(example)

chr [1:2] "2017-04-29 06:00:00" "2017-12-31 12:00:00"

converted <- ymd_hms(example)


str(converted)

POSIXct[1:2], format: "2017-04-29 06:00:00" "2017-12-31 12:00:00"

converted

[1] "2017-04-29 06:00:00 UTC" "2017-12-31 12:00:00 UTC"

converted[2] - converted[1]

Time difference of 246 days

We will use this functionality to analyze data on nuclear reactors in Japan. Figure 5.9
displays the first part of this table as of the summer of 2016.

my_html <-
read_html("http://en.wikipedia.org/wiki/List_of_nuclear_reactors")
tables <- my_html %>% html_nodes(css = "table")
relevant_tables <- tables[grep("Fukushima Daiichi", tables)]
reactors <- html_table(relevant_tables[[1]], fill = TRUE)
names(reactors)[c(3,4,6,7)] <- c("Reactor Type",
"Reactor Model", "Capacity Net", "Capacity Gross")
reactors <- reactors[-1,]

We see that the first entries are the ill-fated Fukushima Daiichi reactors. The mutate()
function can be used in conjunction with the dmy() function from the lubridate package to
wrangle these data into a better form. (Note the back ticks used to specify variable names
that include space or special characters.)

library(readr)
reactors <- reactors %>%
rename(capacity_net=`Capacity Net`, capacity_gross=`Capacity Gross`) %>%
mutate(plantstatus = ifelse(grepl("Shut down", reactors$Status),
"Shut down", "Not formally shut down"),
capacity_net = parse_number(capacity_net),
construct_date = dmy(`Construction Start Date`),
operation_date = dmy(`Commercial Operation Date`),
closure_date = dmy(Closure))

How have these plants evolved over time? It seems likely that as nuclear technology has
progressed, plants should see an increase in capacity. A number of these reactors have been
shut down in recent years. Are there changes in capacity related to the age of the plant?
Figure 5.10 displays the data.

ggplot(data = reactors,
aes(x = construct_date, y = capacity_net, color = plantstatus)) +
geom_point() + geom_smooth() +
xlab("Date of Plant Construction") + ylab("Net Plant Capacity (MW)")

Indeed, reactor capacity has tended to increase over time, while the older reactors were
more likely to have been formally shut down. While it would have been straightforward to
code these data by hand, automating data ingestion for larger and more complex tables
is more efficient and less error-prone.

5.6 Further resources


The tidyr package, and in particular, the Tidy Data [230] paper provide principles for
tidy data. We provide further statistical justification for resampling-based techniques in
Chapter 7. The feather package provides an efficient mechanism for storing data frames
that can be read and written by both R and Python.
There are many R packages that do nothing other than provide access to a public API
from within R. There are far too many API packages to list here, but a fair number of them
are maintained by the rOpenSci group. In fact, several of the packages referenced in this
book, including the twitteR and aRxiv packages in Chapter 15, and the plotly package
in Chapter 11, are APIs. The CRAN task view on Web Technologies lists hundreds more

Figure 5.10: Distribution of capacity of Japanese nuclear power plants over time.

packages, including Rfacebook, instaR, Rflickr, tumblR, and Rlinkedin. The RSocrata
package facilitates the use of Socrata, which is itself an API for querying—among other
things—the NYC Open Data platform.

5.7 Exercises

Exercise 5.1
Consider the number of home runs hit (HR) and home runs allowed (HRA) for the Chicago
Cubs (CHN ) baseball team. Reshape the Teams data from the Lahman package into long
format and plot a time series conditioned on whether the HRs that involved the Cubs were
hit by them or allowed by them.

Exercise 5.2
Write a function called count_seasons() that, when given a teamID, will count the
number of seasons the team played in the Teams data frame from the Lahman package.

Exercise 5.3
The team IDs corresponding to Brooklyn baseball teams from the Teams data frame
from the Lahman package are listed below. Use sapply() to find the number of seasons in
which each of those teams played.

bk_teams <- c("BR1", "BR2", "BR3", "BR4", "BRO", "BRP", "BRF")

Exercise 5.4
In the Marriage data set included in mosaicData, the appdate, ceremonydate, and dob
variables are encoded as factors, even though they are dates. Use lubridate to convert
those three columns into a date format.

library(mosaic)
Marriage %>%
  select(appdate, ceremonydate, dob) %>%
  glimpse()

Observations: 98
Variables: 3
$ appdate <fctr> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, ...
$ ceremonydate <fctr> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, ...
$ dob <fctr> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/2...

Exercise 5.5
Consider the values returned by the as.numeric() and readr::parse_number() func-
tions when applied to the following vectors. Describe the results and their implication.

x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")
x2 <- as.factor(x1)

Exercise 5.6
An analyst wants to calculate the pairwise differences between the Treatment and Con-
trol values for a small data set from a crossover trial (all subjects received both treatments)
that consists of the following observations.

id group vals
1 1 T 4.00
2 2 T 6.00
3 3 T 8.00
4 1 C 5.00
5 2 C 6.00
6 3 C 10.00
They use the following code to create the new diff variable.

Treat <- filter(ds1, group=="T")
Control <- filter(ds1, group=="C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all

Verify that this code works for this example and generates the correct values of -1, 0,
and -2. Describe two problems that might arise if the data set is not sorted in a particular
order or if one of the observations is missing for one of the subjects. Provide an alternative
approach to generate this variable that is more robust (hint: use tidyr::spread()).

Exercise 5.7
Generate the code to convert the following data frame to wide format.

grp sex meanL sdL meanR sdR
1 A F 0.22 0.11 0.34 0.08
2 A M 0.47 0.33 0.57 0.33
3 B F 0.33 0.11 0.40 0.07
4 B M 0.55 0.31 0.65 0.27

The result should look like the following display.

grp F.meanL F.meanR F.sdL F.sdR M.meanL M.meanR M.sdL M.sdR
1 A 0.22 0.34 0.11 0.08 0.47 0.57 0.33 0.33
2 B 0.33 0.40 0.11 0.07 0.55 0.65 0.31 0.27

Hint: use gather() in conjunction with spread().

Exercise 5.8
Use the dplyr::do() function and the HELPrct data frame from the mosaicData package
to fit a regression model predicting cesd as a function of age separately for each of the levels
of the substance variable. Generate a table of results (estimates and confidence intervals)
for each level of the grouping variable.

Exercise 5.9
Use the dplyr::do() function and the Lahman data to replicate one of these baseball
records plots (http://tinyurl.com/nytimes-records) from The New York Times.

Exercise 5.10
Use the fec package to download the Federal Election Commission data for 2012. Re-
create Figure 2.1 and Figure 2.2 using ggplot2.

Exercise 5.11
Using the same FEC data as the previous exercise, re-create Figure 2.8.

Exercise 5.12
Using the approach described in Section 5.5.4, find another table in Wikipedia that can
be scraped and visualized. Be sure to interpret your graphical display.

Exercise 5.13
Replicate the wrangling to create the house_elections table in the fec package from
the original Excel source file.

Exercise 5.14
Replicate the functionality of make_babynames_dist() from the mdsr package to wrangle
the original tables from the babynames package.
Chapter 6

Professional Ethics

6.1 Introduction
Work in data analytics involves expert knowledge, understanding, and skill. In much of your
work, you will be relying on the trust and confidence that your clients place in you. The
term professional ethics describes the special responsibilities not to take unfair advantage
of that trust. This involves more than being thoughtful and using common sense; there are
specific professional standards that should guide your actions.
The best known professional standards are those in the Hippocratic Oath for physicians,
which were originally written in the 5th century B.C. Three of the eight principles in the
modern version of the oath [237] are presented here because of similarity to standards for
data analytics.

• “I will not be ashamed to say ‘I know not,’ nor will I fail to call in my colleagues when
the skills of another are needed for a patient’s recovery.”
• “I will respect the privacy of my patients, for their problems are not disclosed to me
that the world may know.”
• “I will remember that I remain a member of society, with special obligations to all
my fellow human beings, those sound of mind and body as well as the infirm.”

Depending on the jurisdiction, these principles are extended and qualified by law. For
instance, notwithstanding the need to “respect the privacy of my patients,” health-care
providers in the United States are required by law to report to appropriate government
authorities evidence of child abuse or infectious diseases such as botulism, chicken pox, and
cholera.
This chapter introduces principles of professional ethics for data analytics and gives
examples of legal obligations as well as guidelines issued by professional societies. There is
no data analyst’s oath—only guidelines. Reasonable people can disagree about what actions
are best, but the existing guidelines provide a description of the ethical expectations on
which your clients can reasonably rely. As a consensus statement of professional ethics, the
guidelines also establish standards of accountability.

6.2 Truthful falsehoods


The single best-selling book with “statistics” in the title is How to Lie with Statistics by
Darrell Huff [114]. Written in the 1950s, the book shows graphical ploys to fool people

even with accurate data. A general method is to violate conventions and tacit expectations
that readers rely on when interpreting graphs. One way to think of How to Lie is as a text
that shows the general public what these tacit expectations are and gives tips for detecting
when the trick is being played on them. The book's title, while compelling, has wrongly
tarred the field of statistics. The "statistics" of the title are really just "numbers." The
misleading graphical techniques are employed by politicians, journalists, and businessmen,
not statisticians. More accurate titles would be "How to Lie with Numbers" or "Don't be
misled by graphics.”
Some of the graphical tricks in “How to Lie ...” are still in use. Consider these two
recent examples.
In 2005, the Florida legislature passed the controversial “Stand Your Ground” law that
broadened the situations in which citizens can use lethal force to protect themselves against
perceived threats. Advocates believed that the new law would ultimately reduce crime;
opponents feared an increase in the use of lethal force. What was the actual outcome?
The graphic in Figure 6.1 is a reproduction of one published by the news service Reuters
showing the number of firearm murders in Florida over the years (see Exercise 4.18). Upon
first glance, the graphic gives the visual impression that right after the passage of the 2005
law, the number of murders decreased substantially. However, the numbers tell a different
story.
Figure 6.1: Reproduction of a data graphic reporting the number of gun deaths in Florida
over time. The original image was published by Reuters. (In the original graphic, the y-axis
runs from 0 at the top to 1,000 at the bottom, an annotation marks Florida's 2005 'Stand
Your Ground' law, and the source is given as the Florida Department of Law Enforcement.)

The convention in data graphics is that up corresponds to increasing values. This is
not an obscure convention—rather, it's a standard part of the secondary school curriculum.
Close inspection reveals that the y-axis in Figure 6.1 has been flipped upside down—the
number of gun deaths increased sharply after 2005.
Figure 6.2 shows another example of misleading graphics: a tweet by the news magazine
National Review on the subject of climate change. The dominant visual impression of the
graphic is that global temperature has hardly changed at all.

Figure 6.2: A tweet by National Review on December 14, 2015 showing the change in global
temperature over time.

There is a tacit graphical convention that the coordinate scales on which the data are
plotted are relevant to an informed interpretation of the data. The x-axis follows the
convention—1880 to 2015 is a reasonable choice when considering the relationship between
human industrial activity and climate. The y-axis, however, is utterly misleading. The scale
goes from -10 to 110 degrees Fahrenheit. While this is a relevant scale for showing season-
to-season variation in temperature, that is not the salient issue with respect to climate
change. The concern with climate change is about rising ocean levels, intensification of
storms, ecological and agricultural disruption, etc. These are the anticipated results of a
change in global average temperature on the order of 5 degrees Fahrenheit. The National
Review graphic has obscured the data by showing them on an irrelevant scale where the
actual changes in temperature are practically invisible. By graying out the numbers on the
y-axis, the National Review makes it even harder to see the trick that’s being played.
The examples in Figures 6.1 and 6.2 are not about lying with statistics. Statistical
methodology doesn’t enter into them. It’s the professional ethics of journalism that the
graphics violate, aided and abetted by an irresponsible ignorance of statistical methodology.
Insofar as both graphics concern matters of political controversy, they can be seen as part of
the blustering and bloviating of politics. While politics may be a profession, it’s a profession
without any comprehensive standard of professional ethics.

6.3 Some settings for professional ethics


Common sense is a good starting point for evaluating the ethics of a situation. Tell the
truth. Don’t steal. Don’t harm innocent people. But professional ethics also require a
neutral, unemotional, and informed assessment. A dramatic illustration of this comes from
legal ethics: a situation where the lawyers for an accused murderer found the bodies of
two victims whose deaths were unknown to authorities and to the victims’ families. The
responsibility to maintain confidentiality for their client precluded the lawyers from following their
hearts and reporting the discovery. The lawyers’ careers were destroyed by the public and
political recriminations that followed, yet courts and legal scholars have confirmed that the
lawyers were right to do what they did, and have even held them up as heroes for their
ethical behavior.
Such extreme drama is rare. This section describes in brief six situations that raise
questions of the ethical course of action. Some are drawn from the authors’ personal expe-
rience, others from court cases and other reports. The purpose of these short case reports
is to raise questions. Principles for addressing those questions are the subject of the next
section.

6.3.1 The chief executive officer


One of us once worked as a statistical consultant for a client who wanted a proprietary
model to predict commercial outcomes. After reviewing the literature, an existing multiple
linear regression model was found that matched the scenario well and available public data
were used to fit the parameters of the model. The client’s staff were pleased with the result,
but the CEO wanted a model that would give a competitive advantage. After all, their
competitors could easily follow the same process to the same model, so what advantage
would the client’s company have? The CEO asked the statistical consultant whether the
coefficients in the model could be “tweaked” to reflect the specific values of his company.
The consultant suggested that this would not be appropriate, that the fitted coefficients
best match the data and to change them arbitrarily would be “playing God.” In response,
the CEO rose from his chair and asserted, “I want to play God.”
How should the consultant respond?

6.3.2 Employment discrimination


One of us works with legal cases arising from audits of employers, conducted by the United
States Office of Federal Contract Compliance Programs (OFCCP). In a typical case, the
OFCCP asks for hiring and salary data from a company that has a contract with the
United States government. The company usually complies, sometimes unaware that the
OFCCP applies a method to identify “discrimination” through a two-standard-deviation
test outlined in the Uniform Guidelines on Employee Selection Procedures (UGESP). A
company that does not discriminate has some risk of being labeled as discriminating by
the OFCCP method [41]. By using a questionable statistical method, is the OFCCP acting
unethically?

6.3.3 Data scraping


In May 2016, the online OpenPsych Forum published a paper titled “The OkCupid data
set: A very large public data set of dating site users”. The resulting data set contained
2,620 variables—including usernames, gender, and dating preferences—from 68,371 people
scraped from the OkCupid dating website. The ostensible purpose of the data dump was
to provide an interesting open public data set to fellow researchers. These data might be
used to answer questions such as this one suggested in the abstract of the paper: whether
the Zodiac sign of each user was associated with any of the other variables (spoiler alert: it
wasn’t).
The data scraping did not involve any illicit technology such as breaking passwords.
Nonetheless, the author received many comments on the OpenPsych Forum challenging the
work as an ethical breach in doxing people by releasing personal data. Does the work raise
actual ethical issues?

6.3.4 Reproducible spreadsheet analysis


In 2010, Harvard economists Carmen Reinhart and Kenneth Rogoff published a report en-
titled Growth in a Time of Debt [177], which argued that countries which pursued austerity
measures did not necessarily suffer from slow economic growth. These ideas influenced the
thinking of policymakers—notably United States Congressman Paul Ryan—during the time
of the European debt crisis.
Graduate student Thomas Herndon requested access to the data and analysis contained
in the paper. After receiving the original spreadsheet from Reinhart, Herndon found several
errors.

“I clicked on cell L51, and saw that they had only averaged rows 30 through
44, instead of rows 30 through 49.” —Thomas Herndon [179]

In a critique [100] of the paper, Herndon, Ash, and Pollin point out coding errors, selec-
tive inclusion of data, and odd weighting of summary statistics that shaped the conclusions
of the Reinhart/Rogoff paper.
Does publishing a flawed analysis raise ethical questions?

6.3.5 Drug dangers


In September 2004, drug company Merck withdrew from the market a popular product
Vioxx because of evidence that the drug increases the risk of myocardial infarction (MI),
a major type of heart attack. Approximately 20 million Americans had taken Vioxx up to
that point. The leading medical journal Lancet later reported an estimate that Vioxx use
resulted in 88,000 Americans having heart attacks, of whom 38,000 died.
Vioxx had been approved in May 1999 by the United States Food and Drug Adminis-
tration based on tests involving 5,400 subjects. Slightly more than a year after the FDA
approval, a study [36] of 8,076 patients published in another leading medical journal, The
New England Journal of Medicine, established that Vioxx reduced the incidence of severe
gastro-intestinal events substantially compared to the standard treatment, naproxen. That’s
good for Vioxx. In addition, the abstract reports these findings regarding heart attacks:

“The incidence of myocardial infarction was lower among patients in the
naproxen group than among those in the [Vioxx] group (0.1 percent vs. 0.4 percent;
relative risk, 0.2; 95% confidence interval, 0.1 to 0.7); the overall mortality rate
and the rate of death from cardiovascular causes were similar in the two groups.”

Read the abstract again carefully. The Vioxx group had a much higher rate of MI than
the group taking the standard treatment. This influential report identified the high risk
soon after the drug was approved for use. Yet Vioxx was not withdrawn for another three
years. Something clearly went wrong here. Did it involve an ethical lapse?

6.3.6 Legal negotiations


Lawyers sometimes retain statistical experts to help plan negotiations. In a common sce-
nario, the defense lawyer will be negotiating the amount of damages in a case with the
plaintiff’s attorney. Plaintiffs will ask the statistician to estimate the amount of damages,
with a clear but implicit directive that the estimate should reflect the plaintiff’s interests.
Similarly, the defense will ask their own expert to construct a framework that produces an
estimate at a lower level.
Is this a game statisticians should play?

6.4 Some principles to guide ethical action


As noted previously, lying, cheating, and stealing are common and longstanding unethical
behaviors. To guide professional action, however, more nuance and understanding is needed.
For instance, an essential aspect of the economy is that firms compete. As a natural part
of such competition, firms hurt one another; they take away business that the competitor
would otherwise have. We don’t consider competition to be unethical, although there are
certainly limits to ethical competition.
As a professional, you possess skills that are not widely available. A fundamental notion
of professional ethics is to avoid using those skills in a way that is effectively lying—leading
others to believe one thing when in fact something different is true. In every professional
action you take, there is an implicit promise that you can be relied on—that you will use
appropriate methods and draw appropriate conclusions. Non-professionals are not always
in a position to make an informed judgment about whether your methods and conclusions
are appropriate. Part of acting in a professionally ethical way is making sure that your
methods and conclusions are indeed appropriate.
It is necessary to believe that your methods and conclusions are appropriate, but not
sufficient. First, it’s easy to mislead yourself, particularly in the heat and excitement of
satisfying your client or your research team. Second, it’s usually not a matter of absolutes:
It’s not always certain that a method is appropriate. Instead, there is almost always a risk
that something is wrong.
An important way to deal with these issues is to draw on generally recognized pro-
fessional standards. Some examples: Use software systems that have been vetted by the
community. Check that your data are what you believe them to be. Don’t use analytical
methods that would not pass scrutiny by professional colleagues.
Note that the previous paragraph says “draw on” rather than “scrupulously follow.”
Inevitably there will be parts of your work that are not and cannot be vetted by the
community. You write your own data wrangling statements: They aren’t always vetted. In
special circumstances you might reasonably choose to use software that is new or created
just for the purpose at hand. You can look for internal consistency in your data, but it
would be unreasonable in most circumstances to insist on tracking everything back to the
original point at which it was measured.
Another important approach is to be open and honest. Don’t overstate your confidence
in results. Point out to clients substantial risks of error or unexpected outcome. If you
would squirm if some aspect or another of your work came under expert scrutiny, it’s likely
that you should draw attention to that aspect yourself.
Still, there are limits. You generally can’t usefully inform your clients of every possible
risk and methodological limitation. The information would overwhelm them. And you
usually will not have the resources—time, money, data—that you would need to make
every aspect of your work perfect. You have to use good professional judgment to identify
the most salient risks and to ensure that your work is good enough even if it’s not perfect.
You have a professional responsibility to particular stakeholders. It’s important that you
consider and recognize all the various stakeholders to whom you have this responsibility.
These vary depending on the circumstances. Sometimes, your main responsibility is simply
to your employer or your client. In other circumstances, you will have a responsibility to
the general public or to subjects in your study or individuals represented in your data. You
may have a special responsibility to the research community or to your profession itself.
The legal system can also impose responsibilities; there are laws that are relevant to your
work. Expert witnesses in court cases have a particular responsibility to the court itself.
Another concern is the potential for a conflict of interest. A conflict of interest is not
itself unethical. We all have such conflicts: We want to do work that will advance us
professionally, which instills a temptation to satisfy the expectations of our employers or
colleagues or the marketplace. The conflict refers to the potential that our personal goals
may cloud or bias or otherwise shape our professional judgment.
Many professional fields have rules that govern actions in the face of a conflict of interest.
Judges recuse themselves when they have a prior involvement in a case. Lawyers and law
firms should not represent different clients whose interests are at odds with each other.
Clear protocols and standards for analysis regulated by the FDA help ensure that potential
conflicts of interest for researchers working for drug companies do not distort results. There’s
always a basic professional obligation to disclose potential conflicts of interest to your clients,
to journals, etc.
For concreteness, here is a list of professional ethical precepts. It’s simplistic; it’s not
feasible to capture every nuance in a brief exposition.

1. Do your work well by your own standards and by the standards of your profession.

2. Recognize the parties to whom you have a special professional obligation.

3. Report results and methods honestly and respect your responsibility to identify and
report flaws and shortcomings in your work.

6.4.1 Applying the precepts


Let’s explore how these precepts play out in the several scenarios outlined in the previous
section.

The CEO
You’ve been asked by a company CEO to modify model coefficients from the correct values,
that is, from the values found by a generally accepted method. The stakeholder in this
setting is the company. If your work will involve a method that’s not generally accepted by
the professional community, you’re obliged to point this out to the company.

Remember that your client also has substantial knowledge of how their business works.
Statistical purity is not the issue. Your work is a tool for your client to use; they can use
it as they want. Going a little further, it’s important to realize that your client’s needs
may not map well onto a particular statistical methodology. The consultant should work
genuinely to understand the client’s whole set of interests. Often the problem that clients
identify is not really the problem that needs to be solved when seen from an expert statistical
perspective.

Employment discrimination
The procedures adopted by the OFCCP are stated using statistical terms like “standard
deviation” that themselves suggest that they are part of a legitimate statistical method.
Yet the methods raise significant questions, since by construction they will sometimes label
a company that is not discriminating as a discriminator. OFCCP and others might argue
that they are not a statistical organization. They are enforcing a law, not participating in
research. The OFCCP has a responsibility to the courts. The courts themselves, including
the United States Supreme Court, have not developed or even called for a coherent approach
to the use of statistics (although in 1977 the Supreme Court labeled differences greater than
two or three standard deviations as too large to attribute solely to chance).

Data scraping
OkCupid provides public access to data. A researcher uses legitimate means to acquire
those data. What could be wrong?
There is the matter of the stakeholders. The collection of data was intended to support
psychological research. The ethics of research involving humans requires that the human
not be exposed to any risk for which consent has not been explicitly given. The OkCupid
members did not provide such consent. Since the data contain information that makes it
possible to identify individual humans, there is a realistic risk of the release of potentially
embarrassing information, or worse, information that jeopardizes the physical safety of
certain users.
Another stakeholder is OkCupid itself. Many information providers, like OkCupid, have
terms of use that restrict how the data may be legitimately used. Such terms of use (see
Section 6.5.3) form an explicit agreement between the service and the users of that service.
They cannot ethically be disregarded.

Reproducible spreadsheet analysis


The scientific community as a whole is a stakeholder in public research. Insofar as the
research is used to inform public policy, the public as a whole is a stakeholder. Researchers
have an obligation to be truthful in their reporting of research. This is not just a matter of
being honest, but also of participating in the process by which scientific work is challenged or
confirmed. Reinhart and Rogoff honored this professional obligation by providing reasonable
access to their software and data.
Note that it is not an ethical obligation to reach correct research results. The obligation
is to do everything feasible to ensure that the conclusions faithfully reflect the data and the
theoretical framework in which the data are analyzed. Scientific findings are often subject
to dispute, reinterpretation, and refinement.
Since this book is specifically about data science, it can be helpful to examine the
Reinhart and Rogoff findings with respect to the professional standards of data science.
Note that these can be different from the professional standards of economics, which might
reasonably be the ones that economists like Reinhart and Rogoff adopt. So the following is
not a criticism of them, per se, but an opportunity to delineate standards relevant to data
scientists.
Seen from the perspective of data science, Microsoft Excel, the tool used by Reinhart
and Rogoff, is an unfortunate choice. It mixes the data with the analysis. It works at a low
level of abstraction, so it’s difficult to program in a concise and readable way. Commands
are customized to a particular size and organization of data, so it’s hard to apply to a new
or modified data set. One of the major strategies in debugging is to work on a data set
where the answer is known; this is impractical in Excel. Programming and revision in Excel
generally involves lots of click-and-drag copying, which is itself an error-prone operation.
Data science professionals have an ethical obligation to use tools that are reliable, veri-
fiable, and conducive to reproducible data analysis (see Appendix D). This is a good reason
for professionals to eschew Excel.

Drug dangers

When something goes wrong on a large scale, it’s tempting to look for a breach of ethics.
This may indeed identify an offender, but we must also beware of creating scapegoats.
With Vioxx, there were many claims, counterclaims, and lawsuits. The researchers failed to
incorporate some data that were available and provided a misleading summary of results.
The journal editors also failed to highlight the very substantial problem of the increased
rate of myocardial infarction with Vioxx.
To be sure, it’s unethical not to include data that undermines the conclusion presented in
a paper. The Vioxx researchers were acting according to their original research protocol—a
solid professional practice.
What seems to have happened with Vioxx is that the researchers had a theory that
the higher rate of infarction was not due to Vioxx, per se, but to an aspect of the study
protocol that excluded subjects who were being treated with aspirin to reduce the risk of
heart attacks. The researchers believed with some justification that the drug to which Vioxx
was being compared, naproxen, was acting as a substitute for aspirin. They were wrong, as
subsequent research showed.
Professional ethics dictate that professional standards be applied in work. Incidents
like Vioxx should remind us to work with appropriate humility and to be vigilant to the
possibility that our own explanations are misleading us.

Legal negotiations

In legal cases such as the one described earlier in the chapter, the data scientist has ethical
obligations to their client. Depending on the circumstances, they may also have obligations
to the court.
As always, you should be forthright with your client. Usually you will be using methods
that you deem appropriate, but on occasion you will be directed to use a method that you
think is inappropriate. For instance, we’ve seen occasions when the client requested that the
time period of data included in the analysis be limited in some way to produce a “better”
result. We’ve had clients ask us to subdivide the data (in employment discrimination cases,
say, by job title) in order to change p-values. Although such subdivision may be entirely
legitimate, the decision about subdividing—seen from a purely statistical point of view—
ought to be based on the situation, not the desired outcome (see the discussion of the
“garden of forking paths” in Section 7.7).
Your client is entitled to make such requests. Whether or not you think the method
being asked for is the right one doesn’t enter into it. Your professional obligation is to
inform the client what the flaws in the proposed method are and how and why you think
another method would be better. (See the major exception that follows.)
The legal system in countries such as the U.S. is an adversarial system. Lawyers are
allowed to frame legal arguments that may be dismissed: They are entitled to enter some
facts and not others into evidence. Of course, the opposing legal team is entitled to create
their own legal arguments and to cross-examine the evidence to show how it is incomplete
and misleading. When you are working with a legal team as a data scientist, you are part
of the team. The lawyers on the team are the experts about what negotiation strategies
and legal theories to use, how to define the limits of the case (such as damages), and how
to present their case or negotiate with the other party.
It is a different matter when you are presenting to the court. This might take the form
of filing an expert report to the court, testifying as an expert witness, or being deposed.
A deposition is when you are questioned, under oath, outside of the court room. You are
obliged to answer all questions honestly. (Your lawyer may, however, direct you not to
answer a question about privileged communications.)
If you are an expert witness or filing an expert report, the word “expert” is significant. A
court will certify you as an expert in a case giving you permission to express your opinions.
Now you have professional ethical obligations to apply your expertise honestly and openly
in forming those opinions.
When working on a legal case, you should get advice from a legal authority, which might
be your client. Remember that if you do shoddy work, or fail to reply honestly to the other
side’s criticisms of your work, your credibility as an expert will be imperiled.

6.5 Data and disclosure


6.5.1 Reidentification and disclosure avoidance
The ability to link multiple data sets and to use public information to identify individuals
is a growing problem. A glaring example of this occurred in 1996 when then-Governor of
Massachusetts William Weld collapsed while attending a graduation ceremony at Bentley
College. An MIT graduate student used information from a public data release by the
Massachusetts Group Insurance Commission to identify Weld’s subsequent hospitalization
records. The disclosure of this information was highly publicized and led to many changes
in data releases. This was a situation where the right balance was not struck between
disclosure (to help improve health care and control costs) and nondisclosure (to help ensure
private information is not made public). There are many challenges to ensure disclosure
avoidance [244, 151]: This remains an active and important area of research.
The Health Insurance Portability and Accountability Act (HIPAA) was passed by the
United States Congress in 1996—the same year as Weld’s illness. The law augmented and
clarified the role that researchers and medical care providers had in maintaining protected
health information (PHI). The HIPAA regulations developed since then specify procedures
to ensure that individually identifiable PHI is protected when it is transferred, received,
handled, analyzed, or shared. As an example, detailed geographic information (e.g., home
or office location) is not allowed to be shared unless there is an overriding need. For research
purposes, geographic information might be limited to state or territory, though for certain
rare diseases or characterist