
Deep R Programming

Marek Gagolewski

v1.0.0.9004
Prof. Marek Gagolewski
Warsaw University of Technology, Poland
Systems Research Institute, Polish Academy of Sciences
https://www.gagolewski.com/

Copyright (C) 2022–2024 by Marek Gagolewski. Some rights reserved.


This open-access textbook is an independent, non-profit project. It is published under
the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
License (CC BY-NC-ND 4.0). Please spread the word about it.
This project received no funding, administrative, technical, or editorial support from
Warsaw University of Technology, Polish Academy of Sciences, Deakin University, or
any other source.

A little peculiar is the world some people decided to immerse themselves in, so here is
a message stating the obvious. Every effort has been made in the preparation of this
book to ensure the accuracy of the information presented. However, the information
contained in this book is provided without warranty, either express or implied. The
author will, of course, not be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Any bug reports/corrections/feature requests are welcome. To make this textbook even
better, please file them at https://github.com/gagolews/deepr.
Typeset with XeLaTeX. Please be understanding: it was an algorithmic process. Hence,
the results are ∈ [good enough, perfect).

Homepage: https://deepr.gagolewski.com/
Datasets: https://github.com/gagolews/teaching-data

Release: v1.0.0.9004 (2024-06-10T11:27:30+0200)


ISBN: 978-0-6455719-2-9 (v1.0; 2023; Melbourne: Marek Gagolewski)
DOI: 10.5281/zenodo.7490464
Contents

Preface xiii
0.1 To R, or not to R . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
0.2 R (GNU S) as a language and an environment . . . . . . . . . . . . xiii
0.3 Aims, scope, and design philosophy . . . . . . . . . . . . . . . . . xv
0.4 Classification of R data types and book structure . . . . . . . . . . xvi
0.5 About the author . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
0.6 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . xix
0.7 You can make this book better . . . . . . . . . . . . . . . . . . . xx

I Deep 1
1 Introduction 3
1.1 Hello, world! . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Setting up the development environment . . . . . . . . . . . . . . 4
1.2.1 Installing R . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Interactive mode . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Batch mode: Working with R scripts (**) . . . . . . . . . . . 5
1.2.4 Weaving: Automatic report generation (**) . . . . . . . . . 5
1.2.5 Semi-interactive modes (Jupyter Notebooks, sending code to the associated R console, etc.) . . . 6
1.3 Atomic vectors at a glance . . . . . . . . . . . . . . . . . . . . . 8
1.4 Getting help . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Numeric vectors 13
2.1 Creating numeric vectors . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Numeric constants . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Concatenating vectors with c . . . . . . . . . . . . . . . . 14
2.1.3 Repeating entries with rep . . . . . . . . . . . . . . . . . 14
2.1.4 Generating arithmetic progressions with seq and `:` . . . . 16
2.1.5 Generating pseudorandom numbers . . . . . . . . . . . . 17
2.1.6 Reading data with scan . . . . . . . . . . . . . . . . . . . 19
2.2 Creating named objects . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Vectorised mathematical functions . . . . . . . . . . . . . . . . . 23
2.3.1 abs and sqrt . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Natural exponential function and logarithm . . . . . . . . . 25

2.3.4 Probability distributions (*) . . . . . . . . . . . . . . . . . 25


2.3.5 Special functions (*) . . . . . . . . . . . . . . . . . . . . . 29
2.4 Arithmetic operations . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Vectorised arithmetic operators . . . . . . . . . . . . . . . 30
2.4.2 Recycling rule . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.3 Operator precedence . . . . . . . . . . . . . . . . . . . . 32
2.4.4 Accumulating . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.5 Aggregating . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3 Logical vectors 39
3.1 Creating logical vectors . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Comparing elements . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Vectorised relational operators . . . . . . . . . . . . . . . 39
3.2.2 Testing for NA, NaN, and Inf . . . . . . . . . . . . . . . . . 40
3.2.3 Dealing with round-off errors (*) . . . . . . . . . . . . . . 41
3.3 Logical operations . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Vectorised logical operators . . . . . . . . . . . . . . . . . 43
3.3.2 Operator precedence revisited . . . . . . . . . . . . . . . . 44
3.3.3 Dealing with missingness . . . . . . . . . . . . . . . . . . 45
3.3.4 Aggregating with all, any, and sum . . . . . . . . . . . . . 45
3.3.5 Simplifying predicates . . . . . . . . . . . . . . . . . . . 46
3.4 Choosing elements with ifelse . . . . . . . . . . . . . . . . . . 47
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4 Lists and attributes 51


4.1 Type hierarchy and conversion . . . . . . . . . . . . . . . . . . . 51
4.1.1 Explicit type casting . . . . . . . . . . . . . . . . . . . . . 52
4.1.2 Implicit conversion (coercion) . . . . . . . . . . . . . . . . 52
4.2 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Creating lists . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2 Converting to and from lists . . . . . . . . . . . . . . . . . 56
4.3 NULL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Object attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 Developing perceptual indifference to most attributes . . . . 58
4.4.2 But there are a few use cases . . . . . . . . . . . . . . . . 58
4.4.3 Special attributes . . . . . . . . . . . . . . . . . . . . . . 60
4.4.4 Labelling vector elements with the names attribute . . . . . 61
4.4.5 Altering and removing attributes . . . . . . . . . . . . . . 63
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5 Vector indexing 67
5.1 head and tail . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Subsetting and extracting from vectors . . . . . . . . . . . . . . . 68
5.2.1 Nonnegative indexes . . . . . . . . . . . . . . . . . . . . 68
5.2.2 Negative indexes . . . . . . . . . . . . . . . . . . . . . . 70
5.2.3 Logical indexer . . . . . . . . . . . . . . . . . . . . . . . 71

5.2.4 Character indexer . . . . . . . . . . . . . . . . . . . . . . 72


5.3 Replacing elements . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.1 Modifying atomic vectors . . . . . . . . . . . . . . . . . . 74
5.3.2 Modifying lists . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.3 Inserting new elements . . . . . . . . . . . . . . . . . . . 77
5.4 Functions related to indexing . . . . . . . . . . . . . . . . . . . . 78
5.4.1 Matching elements in another vector . . . . . . . . . . . . 78
5.4.2 Assigning numbers into intervals . . . . . . . . . . . . . . 79
5.4.3 Splitting vectors into subgroups . . . . . . . . . . . . . . . 79
5.4.4 Ordering elements . . . . . . . . . . . . . . . . . . . . . 82
5.4.5 Identifying duplicates . . . . . . . . . . . . . . . . . . . . 84
5.4.6 Counting index occurrences . . . . . . . . . . . . . . . . . 85
5.5 Preserving and losing attributes . . . . . . . . . . . . . . . . . . 85
5.5.1 c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.2 as.something . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.3 Subsetting . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.4 Vectorised functions . . . . . . . . . . . . . . . . . . . . 87
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6 Character vectors 93
6.1 Creating character vectors . . . . . . . . . . . . . . . . . . . . . 93
6.1.1 Inputting individual strings . . . . . . . . . . . . . . . . . 93
6.1.2 Many strings, one object . . . . . . . . . . . . . . . . . . 95
6.1.3 Concatenating character vectors . . . . . . . . . . . . . . 96
6.1.4 Formatting objects . . . . . . . . . . . . . . . . . . . . . 96
6.1.5 Reading text data from files . . . . . . . . . . . . . . . . . 97
6.2 Pattern searching . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.1 Comparing whole strings . . . . . . . . . . . . . . . . . . 98
6.2.2 Partial matching . . . . . . . . . . . . . . . . . . . . . . 98
6.2.3 Matching anywhere within a string . . . . . . . . . . . . . 99
6.2.4 Using regular expressions (*) . . . . . . . . . . . . . . . . 99
6.2.5 Locating pattern occurrences . . . . . . . . . . . . . . . . 100
6.2.6 Replacing pattern occurrences . . . . . . . . . . . . . . . 103
6.2.7 Splitting strings into tokens . . . . . . . . . . . . . . . . . 103
6.3 Other string operations . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.1 Extracting substrings . . . . . . . . . . . . . . . . . . . . 104
6.3.2 Translating characters . . . . . . . . . . . . . . . . . . . . 105
6.3.3 Ordering strings . . . . . . . . . . . . . . . . . . . . . . 105
6.4 Other atomic vector types (*) . . . . . . . . . . . . . . . . . . . . 106
6.4.1 Integer vectors (*) . . . . . . . . . . . . . . . . . . . . . . 106
6.4.2 Raw vectors (*) . . . . . . . . . . . . . . . . . . . . . . . 107
6.4.3 Complex vectors (*) . . . . . . . . . . . . . . . . . . . . . 108
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7 Functions 111
7.1 Creating and invoking functions . . . . . . . . . . . . . . . . . . 112
7.1.1 Anonymous functions . . . . . . . . . . . . . . . . . . . . 112

7.1.2 Named functions . . . . . . . . . . . . . . . . . . . . . . 113


7.1.3 Passing arguments to functions . . . . . . . . . . . . . . . 114
7.1.4 Grouping expressions with curly braces, `{` . . . . . . . . . 115
7.2 Functional programming . . . . . . . . . . . . . . . . . . . . . . 117
7.2.1 Functions are objects . . . . . . . . . . . . . . . . . . . . 117
7.2.2 Calling on precomputed arguments with do.call . . . . . . 119
7.2.3 Common higher-order functions . . . . . . . . . . . . . . 120
7.2.4 Vectorising functions with Map . . . . . . . . . . . . . . . 120
7.3 Accessing third-party functions . . . . . . . . . . . . . . . . . . . 123
7.3.1 Using R packages . . . . . . . . . . . . . . . . . . . . . . 123
Default packages . . . . . . . . . . . . . . . . . . . . . . 124
Source vs binary packages (*) . . . . . . . . . . . . . . . . 125
Managing dependencies (*) . . . . . . . . . . . . . . . . . 126
7.3.2 Calling external programs . . . . . . . . . . . . . . . . . . 127
7.3.3 Interfacing C, C++, Fortran, Python, Java, etc. (**) . . . . . . 128
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

8 Flow of execution 133


8.1 Conditional evaluation . . . . . . . . . . . . . . . . . . . . . . . 133
8.1.1 Return value . . . . . . . . . . . . . . . . . . . . . . . . 134
8.1.2 Nested ifs . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.1.3 Condition: Either TRUE or FALSE . . . . . . . . . . . . . . . 136
8.1.4 Short-circuit evaluation . . . . . . . . . . . . . . . . . . . 137
8.2 Exception handling . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.3 Repeated evaluation . . . . . . . . . . . . . . . . . . . . . . . . 139
8.3.1 while . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.3.2 for . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.3.3 break and next . . . . . . . . . . . . . . . . . . . . . . . 143
8.3.4 return . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3.5 Time and space complexity of algorithms (*) . . . . . . . . . 145
8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

II Deeper 149
9 Designing functions 151
9.1 Managing data flow . . . . . . . . . . . . . . . . . . . . . . . . 151
9.1.1 Checking input data integrity and argument handling . . . . 151
9.1.2 Putting outputs into context . . . . . . . . . . . . . . . . 155
9.2 Organising and maintaining functions . . . . . . . . . . . . . . . 157
9.2.1 Function libraries . . . . . . . . . . . . . . . . . . . . . . 157
9.2.2 Writing R packages (*) . . . . . . . . . . . . . . . . . . . . 157
Package structure (*) . . . . . . . . . . . . . . . . . . . . 158
Building and installing (*) . . . . . . . . . . . . . . . . . . 158
Documenting (*) . . . . . . . . . . . . . . . . . . . . . . 159
9.2.3 Writing standalone programs (**) . . . . . . . . . . . . . . 159
9.2.4 Assuring quality code . . . . . . . . . . . . . . . . . . . . 161
Managing changes and working collaboratively . . . . . . . 161

Test-driven development and continuous integration . . . . 161


Debugging . . . . . . . . . . . . . . . . . . . . . . . . . 162
Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.3 Special functions: Syntactic sugar . . . . . . . . . . . . . . . . . . 162
9.3.1 Backticks . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.3.2 Dollar, `$` (*) . . . . . . . . . . . . . . . . . . . . . . . . 163
9.3.3 Curly braces, `{` . . . . . . . . . . . . . . . . . . . . . . 164
9.3.4 `if` . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.3.5 Operators are functions . . . . . . . . . . . . . . . . . . . 165
Calling built-in operators as functions . . . . . . . . . . . . 165
Defining binary operators . . . . . . . . . . . . . . . . . . 166
9.3.6 Replacement functions . . . . . . . . . . . . . . . . . . . 167
Creating replacement functions . . . . . . . . . . . . . . . 167
Substituting parts of vectors . . . . . . . . . . . . . . . . 168
Replacing attributes . . . . . . . . . . . . . . . . . . . . . 168
Compositions of replacement functions (*) . . . . . . . . . 169
9.4 Arguments and local variables . . . . . . . . . . . . . . . . . . . 172
9.4.1 Call by “value” . . . . . . . . . . . . . . . . . . . . . . . . 172
9.4.2 Variable scope . . . . . . . . . . . . . . . . . . . . . . . . 172
9.4.3 Closures (*) . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.4.4 Default arguments . . . . . . . . . . . . . . . . . . . . . 174
9.4.5 Lazy vs eager evaluation . . . . . . . . . . . . . . . . . . . 174
9.4.6 Ellipsis, `...` . . . . . . . . . . . . . . . . . . . . . . . . 175
9.4.7 Metaprogramming (*) . . . . . . . . . . . . . . . . . . . . 177
9.5 Principles of sustainable design (*) . . . . . . . . . . . . . . . . . 179
9.5.1 To write or abstain . . . . . . . . . . . . . . . . . . . . . 179
9.5.2 To pamper or challenge . . . . . . . . . . . . . . . . . . . 180
9.5.3 To build or reuse . . . . . . . . . . . . . . . . . . . . . . 181
9.5.4 To revolt or evolve . . . . . . . . . . . . . . . . . . . . . . 182
9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

10 S3 classes 185
10.1 Object type vs class . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2 Generics and method dispatching . . . . . . . . . . . . . . . . . 189
10.2.1 Generics, default, and custom methods . . . . . . . . . . . 189
10.2.2 Creating generics . . . . . . . . . . . . . . . . . . . . . . 191
10.2.3 Built-in generics . . . . . . . . . . . . . . . . . . . . . . 193
10.2.4 First-argument dispatch and calling S3 methods directly . . 194
10.2.5 Multi-class-ness . . . . . . . . . . . . . . . . . . . . . . 197
10.2.6 Operator overloading . . . . . . . . . . . . . . . . . . . . 199
10.3 Common built-in S3 classes . . . . . . . . . . . . . . . . . . . . . 202
10.3.1 Date, time, etc. . . . . . . . . . . . . . . . . . . . . . . . 202
10.3.2 Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.3.3 Ordered factors . . . . . . . . . . . . . . . . . . . . . . . 208
10.3.4 Formulae (*) . . . . . . . . . . . . . . . . . . . . . . . . 208
10.4 (Over)using the forward pipe operator, `|>` (*) . . . . . . . . . . . 209
10.5 S4 classes (*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

10.5.1 Defining S4 classes . . . . . . . . . . . . . . . . . . . . . 213


10.5.2 Accessing slots . . . . . . . . . . . . . . . . . . . . . . . 214
10.5.3 Defining methods . . . . . . . . . . . . . . . . . . . . . . 214
10.5.4 Defining constructors . . . . . . . . . . . . . . . . . . . . 216
10.5.5 Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

11 Matrices and other arrays 219


11.1 Creating arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.1.1 matrix and array . . . . . . . . . . . . . . . . . . . . . . 219
11.1.2 Promoting and stacking vectors . . . . . . . . . . . . . . . 221
11.1.3 Simplifying lists . . . . . . . . . . . . . . . . . . . . . . . 222
11.1.4 Beyond numeric arrays . . . . . . . . . . . . . . . . . . . 224
11.1.5 Internal representation . . . . . . . . . . . . . . . . . . . 225
11.2 Array indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
11.2.1 Arrays are built on basic vectors . . . . . . . . . . . . . . . 227
11.2.2 Selecting individual elements . . . . . . . . . . . . . . . . 228
11.2.3 Selecting rows and columns . . . . . . . . . . . . . . . . . 228
11.2.4 Dropping dimensions . . . . . . . . . . . . . . . . . . . . 229
11.2.5 Selecting submatrices . . . . . . . . . . . . . . . . . . . . 229
11.2.6 Selecting elements based on logical vectors . . . . . . . . . 230
11.2.7 Selecting based on two-column numeric matrices . . . . . . 231
11.2.8 Higher-dimensional arrays . . . . . . . . . . . . . . . . . 232
11.2.9 Replacing elements . . . . . . . . . . . . . . . . . . . . . 233
11.3 Common operations . . . . . . . . . . . . . . . . . . . . . . . . 233
11.3.1 Matrix transpose . . . . . . . . . . . . . . . . . . . . . . 233
11.3.2 Vectorised mathematical functions . . . . . . . . . . . . . 234
11.3.3 Aggregating rows and columns . . . . . . . . . . . . . . . 234
11.3.4 Binary operators . . . . . . . . . . . . . . . . . . . . . . 235
11.4 Numerical matrix algebra (*) . . . . . . . . . . . . . . . . . . . . 238
11.4.1 Matrix multiplication . . . . . . . . . . . . . . . . . . . . 238
11.4.2 Solving systems of linear equations . . . . . . . . . . . . . 239
11.4.3 Norms and metrics . . . . . . . . . . . . . . . . . . . . . 240
11.4.4 Eigenvalues and eigenvectors . . . . . . . . . . . . . . . . 241
11.4.5 QR decomposition . . . . . . . . . . . . . . . . . . . . . 243
11.4.6 SVD decomposition . . . . . . . . . . . . . . . . . . . . . 244
11.4.7 A note on the Matrix package . . . . . . . . . . . . . . . . 245
11.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

12 Data frames 251


12.1 Creating data frames . . . . . . . . . . . . . . . . . . . . . . . . 251
12.1.1 data.frame and as.data.frame . . . . . . . . . . . . . . . 251
12.1.2 cbind.data.frame and rbind.data.frame . . . . . . . . . 255
12.1.3 Reading data frames . . . . . . . . . . . . . . . . . . . . 257
12.1.4 Interfacing relational databases and querying with SQL (*) . 258
12.1.5 Strings as factors? . . . . . . . . . . . . . . . . . . . . . . 259
12.1.6 Internal representation . . . . . . . . . . . . . . . . . . . 260

12.2 Data frame subsetting . . . . . . . . . . . . . . . . . . . . . . . 262


12.2.1 Data frames are lists . . . . . . . . . . . . . . . . . . . . 262
12.2.2 Data frames are matrix-like . . . . . . . . . . . . . . . . . 265
12.3 Common operations . . . . . . . . . . . . . . . . . . . . . . . . 268
12.3.1 Ordering rows . . . . . . . . . . . . . . . . . . . . . . . 269
12.3.2 Handling duplicated rows . . . . . . . . . . . . . . . . . . 271
12.3.3 Joining (merging) data frames . . . . . . . . . . . . . . . . 271
12.3.4 Aggregating and transforming columns . . . . . . . . . . . 272
12.3.5 Handling missing values . . . . . . . . . . . . . . . . . . 274
12.3.6 Reshaping data frames . . . . . . . . . . . . . . . . . . . 274
12.3.7 Aggregating data in groups . . . . . . . . . . . . . . . . . 277
12.3.8 Transforming data in groups . . . . . . . . . . . . . . . . 285
12.3.9 Metaprogramming-based techniques (*) . . . . . . . . . . 287
12.3.10 A note on the dplyr (tidyverse) and data.table packages (*) 290
12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

13 Graphics 297
13.1 Graphics primitives . . . . . . . . . . . . . . . . . . . . . . . . . 297
13.1.1 Symbols (points) . . . . . . . . . . . . . . . . . . . . . . 299
13.1.2 Line segments . . . . . . . . . . . . . . . . . . . . . . . 301
13.1.3 Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . 302
13.1.4 Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.1.5 Raster images (bitmaps) (*) . . . . . . . . . . . . . . . . . 304
13.2 Graphics settings . . . . . . . . . . . . . . . . . . . . . . . . . . 305
13.2.1 Colours . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
13.2.2 Plot margins and clipping regions . . . . . . . . . . . . . . 308
13.2.3 User coordinates and axes . . . . . . . . . . . . . . . . . . 308
13.2.4 Plot dimensions (*) . . . . . . . . . . . . . . . . . . . . . 311
13.2.5 Many figures on one page (subplots) . . . . . . . . . . . . . 312
13.2.6 Graphics devices . . . . . . . . . . . . . . . . . . . . . . 313
13.3 Higher-level functions . . . . . . . . . . . . . . . . . . . . . . . 315
13.3.1 Scatter and function plots with plot.default and matplot . 315
13.3.2 Bar plots and histograms . . . . . . . . . . . . . . . . . . 319
13.3.3 Box-and-whisker plots . . . . . . . . . . . . . . . . . . . 324
13.3.4 Contour plots and heat maps . . . . . . . . . . . . . . . . 325
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328

III Deepest 331


14 Interfacing compiled code (**) 333
14.1 C and C++ code in R . . . . . . . . . . . . . . . . . . . . . . . . 334
14.1.1 Source files for compiled code in R packages . . . . . . . . . 334
14.1.2 R CMD SHLIB . . . . . . . . . . . . . . . . . . . . . . . . 337
14.2 Handling basic types . . . . . . . . . . . . . . . . . . . . . . . . 340
14.2.1 SEXPTYPEs . . . . . . . . . . . . . . . . . . . . . . . . . 340
14.2.2 Accessing elements in simple atomic vectors . . . . . . . . . 342
14.2.3 Representation of missing values . . . . . . . . . . . . . . 343

14.2.4 Memory allocation . . . . . . . . . . . . . . . . . . . . . 345


14.2.5 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
14.2.6 Character vectors and individual strings (*) . . . . . . . . . 351
14.2.7 Calling R functions from C (**) . . . . . . . . . . . . . . . 352
14.2.8 External pointers (**) . . . . . . . . . . . . . . . . . . . . 352
14.3 Dealing with compound types . . . . . . . . . . . . . . . . . . . 355
14.3.1 Reading and setting attributes . . . . . . . . . . . . . . . 355
14.3.2 Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
14.3.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 357
14.3.4 Data frames . . . . . . . . . . . . . . . . . . . . . . . . . 358
14.4 Using existing function libraries . . . . . . . . . . . . . . . . . . 359
14.4.1 Checking for user interrupts . . . . . . . . . . . . . . . . 359
14.4.2 Generating pseudorandom numbers . . . . . . . . . . . . 359
14.4.3 Mathematical functions from the R API . . . . . . . . . . . 360
14.4.4 Header files from other R packages (*) . . . . . . . . . . . . 360
14.4.5 Specifying compiler and linker flags (**) . . . . . . . . . . . 360
14.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361

15 Unevaluated expressions (*) 363


15.1 Expressions at a glance . . . . . . . . . . . . . . . . . . . . . . . 364
15.2 Language objects . . . . . . . . . . . . . . . . . . . . . . . . . . 365
15.3 Calls as combinations of expressions . . . . . . . . . . . . . . . . 367
15.3.1 Browsing parse trees . . . . . . . . . . . . . . . . . . . . 367
15.3.2 Manipulating calls . . . . . . . . . . . . . . . . . . . . . 369
15.4 Inspecting function definition and usage . . . . . . . . . . . . . . 369
15.4.1 Getting the body and formal arguments . . . . . . . . . . . 369
15.4.2 Getting the expression passed as an actual argument . . . . 370
15.4.3 Checking if an argument is missing . . . . . . . . . . . . . 371
15.4.4 Determining how a function was called . . . . . . . . . . . 372
15.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374

16 Environments and evaluation (*) 375


16.1 Frames: Environments as object containers . . . . . . . . . . . . . 375
16.1.1 Printing . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
16.1.2 Environments vs named lists . . . . . . . . . . . . . . . . 376
16.1.3 Hash maps: Fast element lookup by name . . . . . . . . . . 377
16.1.4 Call by value, copy on demand: Not for environments . . . . 378
16.1.5 A note on reference classes (**) . . . . . . . . . . . . . . . 381
16.2 The environment model of evaluation . . . . . . . . . . . . . . . . 381
16.2.1 Getting the current environment . . . . . . . . . . . . . . 382
16.2.2 Enclosures, enclosures thereof, etc. . . . . . . . . . . . . . 383
16.2.3 Missing names are sought in enclosing environments . . . . 384
16.2.4 Looking for functions . . . . . . . . . . . . . . . . . . . . 386
16.2.5 Inspecting the search path . . . . . . . . . . . . . . . . . 387
16.2.6 Attaching to and detaching from the search path . . . . . . 389
16.2.7 Masking (shadowing) objects from down under . . . . . . . 391
16.3 Closures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393

16.3.1 Local environment . . . . . . . . . . . . . . . . . . . . . 394


16.3.2 Lexical scope and function closures . . . . . . . . . . . . . 394
16.3.3 Application: Function factories . . . . . . . . . . . . . . . 397
16.3.4 Accessing the calling environment . . . . . . . . . . . . . . 399
16.3.5 Package namespaces (*) . . . . . . . . . . . . . . . . . . . 399
16.3.6 S3 method lookup by UseMethod (*) . . . . . . . . . . . . . 402
16.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

17 Lazy evaluation (**) 409


17.1 Evaluation of function arguments . . . . . . . . . . . . . . . . . 409
17.2 Evaluation of default arguments . . . . . . . . . . . . . . . . . . 413
17.3 Ellipsis revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 415
17.4 on.exit (*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
17.5 Metaprogramming and laziness in action: Examples (*) . . . . . . . 418
17.5.1 match.arg . . . . . . . . . . . . . . . . . . . . . . . . . 419
17.5.2 curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
17.5.3 with and within . . . . . . . . . . . . . . . . . . . . . . 421
17.5.4 transform . . . . . . . . . . . . . . . . . . . . . . . . . 422
17.5.5 subset . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
17.5.6 Forward pipe operator . . . . . . . . . . . . . . . . . . . 424
17.5.7 Other ideas (**) . . . . . . . . . . . . . . . . . . . . . . . 425
17.6 Processing formulae, `~` (*) . . . . . . . . . . . . . . . . . . . . . 426
17.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
17.8 Outro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429

Changelog 431

References 433

Deep R Programming is a comprehensive and in-depth introductory course on one of
the most popular languages for data science. It equips ambitious students, profession-
als, and researchers with the knowledge and skills to become independent users of
this potent environment so that they can tackle any problem related to data wrangling
and analytics, numerical computing, statistics, and machine learning.
For many students around the world, educational resources are hardly affordable.
Therefore, I have decided that this book should remain an independent, non-profit,
open-access project (available both in PDF1 and HTML2 forms). Whilst, for some
people, the presence of a “designer tag” from a major publisher might still be a proxy
for quality, it is my hope that this publication will prove useful to those seeking know-
ledge for knowledge’s sake.
Any bug/typo reports/fixes are appreciated. Please submit them via this project's GitHub repository3. Thank you.
Citation: Gagolewski M. (2024), Deep R Programming, Melbourne, DOI: 10.5281/zenodo.74904644, ISBN: 978-0-6455719-2-9, URL: https://deepr.gagolewski.com/.
Make sure to check out Minimalist Data Wrangling with Python5 [27] too.

1 https://deepr.gagolewski.com/deepr.pdf
2 https://deepr.gagolewski.com/
3 https://github.com/gagolews/deepr/issues
4 https://dx.doi.org/10.5281/zenodo.7490464
5 https://datawranglingpy.gagolewski.com/
0
Preface

0.1 To R, or not to R
R has been named the eleventh most dreaded programming language in the 2022
StackOverflow Developer Survey6 .
Also, it is a free app, so there must be something wrong with it, right?
But whatever, R is deprecated anyway; the modern way is to use tidyverse.
Or we should all just switch to Python7 .

Yeah, nah.

0.2 R (GNU S) as a language and an environment


Let’s get one8 thing straight: R is not just a statistical package. It is a general-purpose,
high-level programming language that happens to be very powerful for numerical,
data-intense computing activities of any kind. It offers extensive support for statist-
ical, machine learning, data analysis, data wrangling, and data visualisation applica-
tions, but there is much more.
As we detail below, R has a long history. It is an open-source version of the S environ-
ment, which was written for statisticians, by statisticians. Therefore, it is a free, yet
often more capable alternative to other software (but without any strings attached).
Unlike in some of them, in R, a spreadsheet-like GUI is not the main gateway for per-
forming computations on data. Here, we must write code to get things done. Despite
the beginning of the learning curve’s being a little steeper for non-programmers, in
the long run, R empowers us more because we are not limited to tackling the most
common scenarios. If some functionality is missing or does not suit our needs, we
can easily (re)implement it ourselves.
6 https://survey.stackoverflow.co/2022
7 https://datawranglingpy.gagolewski.com/
8 Also, we must not confuse RStudio with R. The former is merely one of many development environments for our language. We program in R, not in RStudio.



R is thus very convenient for rapid prototyping. It helps turn our ideas into fully op-
erational code that can be battle-tested, extended, polished, run in production, and
otherwise enjoyed. As an interpreted language, it can be executed not only in an inter-
active read-eval-print loop (command–result, question–answer, …), but also in batch
mode (running standalone scripts).
Therefore, we would rather position R amongst such environments for numerical or
scientific computing as Python with numpy and pandas, Julia, GNU Octave, Scilab, and
MATLAB. However, it is more specialised in data science applications than any of them.
Hence, it provides a much smoother experience. This is why, over the years, R has be-
come the de facto standard in statistics and related fields.

Important R is a whole ecosystem. Apart from the R language interpreter, it features
advanced:
• graphics capabilities (see Chapter 13),
• a consistent, well-integrated help system (Section 1.4),
• ways for convenient interfacing with compiled code (Chapter 14),
• a package system and centralised package repositories (such as CRAN and Biocon-
ductor; Section 7.3.1),
• a lively community of users and developers – curious and passionate people, like
you and yours cordially.

Note R [70] is a dialect of the very popular S system designed in the mid-1970s by
Rick A. Becker, John M. Chambers, and Allan R. Wilks at Bell Labs; see [3, 4, 5, 6] for
historical notes and [7, 9, 13, 57] for works on newer versions of S. Quoting from [4]:

The design goal for S is, most broadly stated, to enable and encourage good data
analysis, that is, to provide users with specific facilities and a general environ-
ment that helps them quickly and conveniently look at many displays, summar-
ies, and models for their data, and to follow the kind of iterative, exploratory path
that most often leads to a thorough analysis. The system is designed for interact-
ive use with simple but general expressions for the user to type, and immediate,
informative feedback from the system, including graphic output on any of a vari-
ety of graphical devices.

S became popular because it offered greater flexibility than the standalone statistical
packages. It was praised for its high interactivity and array-centrism that was taken
from APL, the familiar syntax of the C language involving {curly braces}, the ability
to treat code as data known from Lisp (Chapter 15), the notion of lazy arguments
(Chapter 17), and the ease of calling external C and Fortran routines (Chapter 14). Its
newer versions were also somewhat object-orientated (Chapter 10).
However, S was a proprietary and closed-source system. To address this, Robert Gentleman
and Ross Ihaka of the Statistics Department, University of Auckland developed
R in the 1990s9. They were later joined by many contributors10. It was decided
that R will be distributed under the terms of the free GNU General Public License,
version 2.
In essence, R was supposed to be backwards-compatible with S, but some design
choices led to their evaluation models’ being slightly different. In Chapter 16, we dis-
cuss that R’s design was inspired by the Scheme language [1].

0.3 Aims, scope, and design philosophy


Many users were introduced to R by means of some very advanced operations in-
volving data frames, formulae, and functions that rely on nonstandard evaluation
(metaprogramming), like:
lm(
    Ozone~Solar.R+Temp,
    data=subset(airquality, Temp>60, select=-(Month:Day))
) |> summary()

This is horrible.
Another cohort was isolated from base R through a thick layer of popular third-party
packages that introduce an overwhelming number of functions (every operation, re-
gardless of its complexity, has a unique name). They often duplicate the core function-
ality, and might not be fully compatible with our traditional system.
Both user families ought to be fine, as long as they limit themselves to solving only the
most common data processing problems.
But we yearn for more. We do not want hundreds of prefabricated recipes for popular
dishes that we can mindlessly apply without much understanding.
Our aim is to learn the fundamentals of base R, which constitutes the lingua franca of
all R users. We want to be able to indite code that everybody should understand; code
that will work without modifications in the next decades too.

9 See [12, 37] for historical notes. R version 0.49 released in April 1997 (the first whose source code is available on CRAN; see https://cloud.r-project.org/src/base/R-0), was already fairly feature-rich. In particular, it implemented S3 methods, formulae, and data frames that were introduced in the 1991 version of S [13].
10 The beauty of the employed open-source model is that all the contributors are real human beings, not anonymous contractors working for soulless corporations; see https://www.r-project.org/contributors.html.
We want to be able to tackle any data-intense problem. Furthermore, we want to de-
velop transferable skills so that learning new tools such as Python with numpy and pan-
das (e.g., [27, 47]) or Julia will be much easier later. After all, R is not the only notable
environment out there.
Anyway, enough preaching. This graduate11-level textbook is for readers who:
• would like to experience the joy of solving problems by programming,
• want to become independent users of the R environment,
• can appreciate a more cohesively and comprehensively12 organised material,
• do not mind a slightly steeper learning curve at the beginning,
• do not want to be made obsolete by artificial “intelligence” in the future.
Some readers will benefit from its being their first introduction to R (yet, without all
the pampering). For others13 , this will be a fine course from intermediate to advanced
(do not skip the first chapters, though).
Either way, we should not forget to solve all the prescribed exercises.
Good luck!

0.4 Classification of R data types and book structure


The most commonly used R data types can be classified as follows; see also Figure 1.
1. Basic types are discussed in the first part of the book:
• atomic vectors represent whole sequences of values, where every element is of
the same type:
– logical (Chapter 3) includes items that are TRUE (“yes”, “present”),
FALSE (“no”, “absent”), or NA (“not available”, “missing”);

11 The author taught similar courses for his wonderfully ambitious undergraduate data/computer science and mathematics students at the Warsaw University of Technology, where our approach has proven not difficult whatsoever.
12 Yours truly has not chosen to play a role of a historian, a stenographer, nor a grammarian. Thus, he has made a few noninvasive idealisations for didactic purposes. Languages evolve over time, R is now different from what it used to be, and we can shape it (slowly; we value its stable API) to become something even better in the future.
13 It might also happen that for certain readers, this will not be an appropriate course at all, either at this stage of their career (come back later) or in general (no dramas). This is a non-profit, open-access project, but it does not mean it is ideal for everyone. We recommend giving other sources a try, e.g., [8, 10, 15, 45, 58, 61, 62, 69], etc. Some of them are freely available.

(Figure 1 is a diagram whose node labels read: NULL, logical, numeric, character, list, function, ..., factor, matrix, array, data.frame, formula, Date, kmeans, ...)

Figure 1. An overview of the most prevalent R data types; see Figure 17.2 for a more
comprehensive list.

– numeric (Chapter 2) represents real numbers, such as 1, 3.14, -0.0000001, etc.;

– character (Chapter 6) contains strings of characters, e.g., "groß", "123", or "Добрий день";

• function (Chapter 7) is used to group a series of expressions (code lines) so that they can be applied on miscellaneous input data to generate the (hopefully) desired outcomes, for instance, cat, print, plot, sample, and sum;
• list (generic vector; Chapter 4) can store elements of mixed types.
The above will be complemented with a discussion on vector indexing (Chapter 5)
and ways to control the program flow (Chapter 8).
2. Compound types are mostly discussed in the second part. They are wrappers around
objects of basic types that might behave unlike the underlying primitives thanks
to the dedicated operations overloaded for them. For instance:
• factor (Section 10.3.2) is a vector-like object that represents qualitative data
(on a nominal or an ordered scale);
• matrix (Chapter 11) stores tabular data, i.e., arranged into rows and columns,
where each cell is usually of the same type;
• data.frame (Chapter 12) is also used for depositing tabular data, but this
time such that each column can be of a different type;
• formula (Section 17.6) is utilised by some functions to specify supervised
learning models or define operations to be performed within data sub-
groups, amongst others;

• and many more, which we can arbitrarily define using the principles of S3-
style object-orientated programming (Chapter 10).
In this part of the book, we also discuss the principles of sustainable coding
(Chapter 9) as well as introduce ways to prepare publication-quality graphics
(Chapter 13).
3. More advanced material is discussed in the third part. For most readers, it should
be of theoretical interest only. However, it can help gain a complete understanding
of and control over our environment. This includes the following data types:
• symbol (name), call, expression (Chapter 15) are objects representing un-
evaluated R expressions that can be freely manipulated and executed if
needed;
• environment (Chapter 16) stores named objects in hash maps and provides
the basis for the environment model of evaluation;
• externalptr (Section 14.2.8) provides the ability to maintain any dynamic-
ally allocated C/C++ objects between function calls.
We should not be surprised that we did not list any data types defined by a few trendy14
third-party packages. We will later see that we can most often do without them. If that
is not the case, we will become skilled enough to learn them quickly ourselves.

0.5 About the author


I, Marek Gagolewski15 (pronounced like Maa’rek (Mark) Gong-o-leaf-ski), am cur-
rently an Associate Professor in Data Science at the Faculty of Mathematics and In-
formation Science, Warsaw University of Technology.
My research interests are related to data science, in particular: modelling complex phe-
nomena, developing usable, general-purpose algorithms, studying their analytical
properties, and finding out how people use, misuse, understand, and misunderstand
methods of data analysis in research, commercial, and decision-making settings. I am
an author of over 95 publications, including journal papers in outlets such as Proceed-
ings of the National Academy of Sciences (PNAS), Journal of Statistical Software, The R Journal,
Information Fusion, International Journal of Forecasting, Statistical Modelling, Physica A: Stat-
istical Mechanics and its Applications, Information Sciences, Knowledge-Based Systems, IEEE
Transactions on Fuzzy Systems, and Journal of Informetrics.
In my "spare" time, I write books for my students: check out my Minimalist Data
Wrangling with Python16 [27]. I also develop17 open-source software for data analysis,
such as stringi18 (one of the most often downloaded R packages) and genieclust19
(a fast and robust clustering algorithm in both Python and R).

14 Which does not automatically mean good. For instance, sugar, salt, and some drugs are very popular, but it does not make them healthy.
15 https://www.gagolewski.com/
16 https://datawranglingpy.gagolewski.com/
17 https://github.com/gagolews

0.6 Acknowledgements
R, and its predecessor S, is the result of a collaborative effort of many program-
mers20 . Without their generous intellectual contributions, the landscape of data ana-
lysis would not be as beautiful as it is now. R is distributed under the terms of the GNU
General Public License version 2. We occasionally display fragments of its source code
for didactic purposes.
We describe and use R version 4.3.1 (2023-06-16). However, we expect 99.9% of the
material covered here to be valid in future releases (consider filing a bug report if you
discover this is not the case).
Deep R Programming is based on the author’s experience as an R user (since ~2003),
developer of open-source packages, tutor/lecturer (since ~2008), and an author of a
quite successful Polish textbook Programowanie w języku R [25] which was published by
PWN (1st ed. 2014, 2nd ed. 2016). Even though the current book is an entirely different
work, its predecessor served as an excellent testbed for many ideas conveyed here.
In particular, the teaching style exercised in this book has proven successful in many
similar courses that yours truly was responsible for, including at Warsaw University of
Technology, Data Science Retreat (Berlin), and Deakin University (Melbourne). I thank
all my students and colleagues for the feedback given over the last 15-odd years.
This work received no funding, administrative, technical, or editorial support from
Deakin University, Warsaw University of Technology, Polish Academy of Sciences, or
any other source.
This book was prepared in a Markdown superset called MyST21 , Sphinx22 , and TeX
(XeLaTeX). Code chunks were processed with the R package knitr [64]. All fig-
ures were plotted with the low-level graphics package using the author’s own style
template. A little help from Makefiles, custom shell scripts, and Sphinx plugins
(sphinxcontrib-bibtex23 , sphinxcontrib-proof24 ) dotted the j’s and crossed the f ’s.
The Ubuntu Mono25 font is used for the display of code. The typesetting of the main text
relies on the Alegreya26 typeface.
18 https://stringi.gagolewski.com/
19 https://genieclust.gagolewski.com/
20 https://www.r-project.org/contributors.html
21 https://myst-parser.readthedocs.io/en/latest/index.html
22 https://www.sphinx-doc.org/
23 https://pypi.org/project/sphinxcontrib-bibtex
24 https://pypi.org/project/sphinxcontrib-proof
25 https://design.ubuntu.com/font
26 https://www.huertatipografica.com/en

0.7 You can make this book better


When it comes to quality assurance, open, non-profit projects have to resort to the
generosity of the readers’ community.
If you find a typo, a bug, or a passage that could be rewritten or extended for better
readability/clarity, do not hesitate to report it via the Issues tracker available at
https://github.com/gagolews/deepr. New feature requests are welcome as well.
Part I

Deep
1
Introduction

1.1 Hello, world!


Traditionally, every programming journey starts by printing a “Hello, world”-like
greeting. Let’s then get it over with asap:
cat("My hovercraft is full of eels.\n") # `\n` == newline
## My hovercraft is full of eels.

By calling (invoking) the cat function, we printed out a given character string that we
enclosed in double-quote characters.
Documenting code is a good development practice. It is thus worth knowing that any
text following a hash sign (that is not part of a string) is a comment. It is ignored by the
interpreter.
# This is a comment.
# This is another comment.
cat("I cannot wait", "till lunchtime.\n") # two arguments (another comment)
## I cannot wait till lunchtime.
cat("# I will not buy this record.\n# It is scratched.\n")
## # I will not buy this record.
## # It is scratched.

By convention, in this book, the textual outputs generated by R itself are always pre-
ceded by two hashes. This makes copy-pasting all code chunks easier in case we would
like to experiment with them (which is always highly encouraged).
Whenever a call to a function is to be made, the round brackets are obligatory. All objects
within the parentheses (they are separated by commas) constitute the input data to be
consumed by the operation. Thus, the syntax is: a_function_to_be_called(argument1,
argument2, etc.).
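For instance (an illustrative pair of calls; sum, which adds up all of its arguments, will be properly introduced later, so treat this merely as a teaser):
sum(1, 2, 3)                   # one function call, three arguments
## [1] 6
cat("one", "two", "three\n")   # arguments are separated by commas
## one two three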

1.2 Setting up the development environment


1.2.1 Installing R
It is quite natural to pine for the ability to execute the foregoing code ourselves; to
learn programming without getting our hands dirty is impossible.
The official precompiled binary distributions of R can be downloaded from https://
cran.r-project.org/.
For serious programming work1 , we recommend, sooner rather than later, switch-
ing to2 one of the UNIX-like operating systems. This includes the free, open-source
(== good) variants of GNU/Linux, amongst others, or the proprietary (== not so good)
m**OS. In such a case, we can employ our favourite package manager (e.g., apt, dnf,
pacman, or Homebrew) to install R.

Other users (e.g., of Wi***ws) might consider installing Anaconda or Miniconda, es-
pecially if they would like to work with Jupyter (Section 1.2.5) or Python as well.

Below we review several ways in which we can write and execute R code. It is up to
the benign readers to research, set up, and learn the development environment that
suits their needs. As usual in real life, there is no single universal approach that always
works best in all scenarios.

1.2.2 Interactive mode


Whenever we would like to compute something quickly, e.g., determine basic aggreg-
ates of a few numbers entered by hand or evaluate a mathematical expression like
“2+2”, R’s read-eval-print loop (REPL) can give us instant gratification.
How to start the R console varies from system to system. For instance, the users
of UNIX-like boxes can simply execute R from the terminal (shell, command line).
Wi***ws folks can fire up RGui from the Start menu.

Important When working interactively, the default3 command prompt, “>”, means: I
am awaiting orders. Moreover, “+” denotes: Please continue. In the latter case, we should
either complete the unfinished expression or cancel the operation by pressing ESC or
CTRL+C (depending on the operating system).

> cat("And now


+ for something
(continues on next page)

1 For instance, when interoperability with other programming languages/environments is required or

when we think about scheduling jobs on Linux-based computing/container clusters.


2 Or at least trying out – by installing a copy of GNU/Linux on a virtual machine (VM).
3 It can be changed; see help("options").
1 INTRODUCTION 5

(continued from previous page)


+ completely different
+
+
+ it is an unfinished expression...
+ awaiting another double quote character and then the closing bracket...
+
+ press ESC or CTRL+C to abort input
>

For readability, we never print out the command prompt characters in this book.

1.2.3 Batch mode: Working with R scripts (**)


The interactive mode of operation is unsuitable for more complicated tasks, though.
The users of UNIX-like operating systems will be interested in another extreme, which
involves writing standalone R scripts that can be executed line by line without any user
intervention.
To do so, in the terminal, we can invoke:
Rscript file.R

where file.R is the path to a source file.
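For instance, a minimal script (a hypothetical test.R consisting of nothing more than two calls to the already-familiar cat function) might read:
cat("This is a standalone R script.\n")
cat("2+2 is", 2+2, "\n")
Invoking Rscript test.R in the terminal should then print both messages, one after the other.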


Exercise 1.1 (**) In your favourite text editor (e.g., Notepad++, Kate, vi, Emacs, RStudio,
or VSCodium), create a file named test.R. Write a few calls to the cat function. Then, execute
this script from the terminal through Rscript.

1.2.4 Weaving: Automatic report generation (**)


Reproducible data analysis4 requires us to keep the results (text, tables, plots, auxiliary
files) synchronised with the code and data that generate them.
utils::Sweave (the Sweave function from the utils package) and knitr [64] are two
example template processors that evaluate R code chunks within documents written
in LaTeX, HTML, or other markup languages. The chunks are replaced by the outputs
they yield.
This book is a showcase of such an approach: all the results, including Figure 2.3 and
the message about busy hovercrafts, were generated programmatically. Thanks to its
being written in the highly universal Markdown5 language, it could be converted to a
single PDF document6 as well as the whole website7. This was facilitated by tools like
pandoc and docutils.

4 The idea dates back to Knuth's literate programming concept; see [40].
5 https://daringfireball.net/projects/markdown

Exercise 1.2 (**) Call install.packages("knitr") in R. Then, create a text file named
test.Rmd with the following content:

# Hello, Markdown!

This is my first automatically generated report,
where I print messages and stuff.

```{r}
print("G'day!")
print(2+2)
plot((1:10)^2)
```

Thank you for your attention.

Assuming that the file is located in the current working directory (compare Section 7.3.2), call
knitr::knit("test.Rmd") from the R console, or run in the terminal:

Rscript -e 'knitr::knit("test.Rmd")'

Inspect the generated Markdown file, test.md.


Furthermore, if you have the pandoc tool installed, to generate a standalone HTML file, execute
in the terminal:
pandoc test.md --standalone -o test.html

Alternatively, see Section 7.3.2 for ways to call external programs from R.
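For instance, one possibility (a sketch only; it relies on the base system function, which that section discusses in more detail, and assumes pandoc is available on the search path) is:
system("pandoc test.md --standalone -o test.html")  # call an external program from R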

1.2.5 Semi-interactive modes (Jupyter Notebooks, sending code to the associated R console, etc.)
The nature of the most frequent use cases of R encourages a semi-interactive work-
flow, where we quickly progress with prototyping by trial and error. In this mode,
we compose a series of short code fragments inside a standalone R script. Each frag-
ment implements a simple, well-defined task, such as loading data files, data cleans-
ing, feature visualisation, computations of information aggregates, etc. Importantly,
any code chunk can be sent to the associated R console and executed there. This way,
we can inspect the result it generates. If we are not happy with the outcome, we can
apply the necessary corrections.
There are quite a few integrated development environments that enable such a workflow,
including JupyterLab, Emacs, RStudio, and VSCodium. Some of them require
additional plugins for R.

6 https://deepr.gagolewski.com/deepr.pdf
7 https://deepr.gagolewski.com/
Executing an individual code line or a whole text selection is usually done by pressing
(configurable) keyboard shortcuts such as Ctrl+Enter or Shift+Enter.
Exercise 1.3 (*) JupyterLab8 is a development environment that runs in a web browser. It was
programmed in Python, but supports many programming languages. Thanks to IRkernel9 , we
can use it with R.
1. Install JupyterLab and IRkernel (for instance, if you use Anaconda, run conda install
-c r r-essentials).

2. From the File menu, select Create a new R source file and save it as, e.g., test.R.
3. Click File and select Create a new console for the editor running the R kernel.
4. Input a few print “Hello, world”-like calls.
5. Press Shift+Enter (whilst working in the editor) to send different code fragments to the
console and execute them. Inspect the results.
See Figure 1.1 for an illustration. Note that issuing options(jupyter.rich_display=FALSE)
may be necessary to disable rich HTML outputs and make them look more like ones in this book.

Figure 1.1. JupyterLab: A source file editor and the associated R console, where we can
run arbitrary code fragments.

Example 1.4 (*) JupyterLab also handles dedicated Notebooks, where editable and execut-
able code chunks and results they generate can be kept together in a single .ipynb (JSON) file;
8 https://jupyterlab.readthedocs.io/en/stable
9 https://irkernel.github.io/

see Figure 1.2 for an illustration and Chapter 1 of [27] for a quick introduction (from the Python
language kernel perspective).
This environment is convenient for live coding (e.g., for teachers) or performing exploratory data
analyses. However, for more serious programming work, the code can get messy. Luckily, there is
always an option to export a notebook to an executable, plain text R script.

Figure 1.2. An example Jupyter Notebook, where we can keep code and results to-
gether.

1.3 Atomic vectors at a glance


After printing “Hello, world”, a typical programming course would normally proceed
with the discussion on basic data types for storing individual numeric or logical values.
Next, we would be introduced to arithmetic and relational operations on such scalars,
followed by the definition of whole arrays or other collections of values, complemented
by the methods to iterate over them, one element after another.
In R, no separate types representing individual values have been defined. Instead,
what seems to be a single datum, is already a vector (sequence, array) of length one.
2.71828 # input a number; here: the same as print(2.71828)
## [1] 2.7183
length(2.71828) # it is a vector with one element
## [1] 1

To create a vector of any length, we can call the c function, which combines given ar-
guments into a single sequence:
c(1, 2, 3) # three values combined
## [1] 1 2 3
length(c(1, 2, 3)) # indeed, it is a vector of length three
## [1] 3

In Chapter 2, Chapter 3, and Chapter 6, we will discuss the most prevalent types of
atomic vectors: numeric, logical, and character ones, respectively.
c(0, 1, -3.14159, 12345.6) # four numbers
## [1] 0.0000 1.0000 -3.1416 12345.6000
c(TRUE, FALSE) # two logical values
## [1] TRUE FALSE
c("spam", "bacon", "spam") # three character strings
## [1] "spam" "bacon" "spam"

We call them atomic for they can only group together values of the same type. Lists,
which we will discuss in Chapter 4, are, on the other hand, referred to as generic vectors.
They can be used for storing items of mixed types: other lists as well.

Note Not having separate scalar types greatly simplifies the programming of numer-
ical computing tasks. Vectors are prevalent in our main areas of interest: statistics,
simulations, data science, machine learning, and all other data-orientated comput-
ing. For example, columns and rows in tables (characteristics of clients, ratings of
items given by users) or time series (stock market prices, readings from temperature
sensors) are all best represented by means of such sequences.
The fact that vectors are the core part of the R language makes their use very natural,
as opposed to the languages that require special add-ons for vector processing, e.g.,
numpy for Python [34]. By learning different ways to process them as a whole (instead of
one element at a time), we will ensure that our ideas can quickly be turned into opera-
tional code. For instance, computing summary statistics such as, say, the mean abso-
lute deviation of a sequence x, will be as effortless as writing mean(abs(x-mean(x))).
Such code is not only easy to read and maintain, but it is also fast to run.
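For instance (a quick illustration on a small made-up vector; any numeric data would do):
x <- c(1, 5, 2, 8, 4)   # some arbitrary data
mean(abs(x - mean(x)))  # mean absolute deviation from the mean
## [1] 2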

1.4 Getting help


Our aim is to become independent, advanced R programmers.
Independent, however, does not mean omniscient. The R help system is the authorit-
ative source of knowledge about specific functions or more general topics. To open a
help page, we call:

help("topic") # equivalently: ?"topic"

Exercise 1.5 Sight (without going into detail) the manual on the length function by calling
help("length"). Note that most help pages are structured as follows:

1. Header: package:base means that the function is a base one (see Section 7.3.1 for more
details on the R package system);
2. Title;
3. Description: a short description of what the function does;
4. Usage: the list of formal arguments (parameters) to the function;
5. Arguments: the meaning of each formal argument explained;
6. Details: technical information;
7. Value: return value explained;
8. References: further reading;
9. See Also: links to other help pages;
10. Examples: R code that is worth inspecting.
We can also search within all the installed help pages by calling:
help.search("vague topic") # equivalently: ??"vague topic"

This way, we will be able to find answers to our questions more reliably than when
asking DuckDuckGo or G**gle, which commonly return many low-quality, irrelevant,
or distracting results from splogs. We do not want to lose the sacred code writer’s flow!
It is a matter of personal hygiene and good self discipline.

Important All code chunks, including code comments and textual outputs, form an
integral part of this book’s text. They should not be skipped by the reader. On the con-
trary, they must become objects of our intense reflection and thorough investigation.
For instance, whenever we introduce a function, it may be a clever idea to look it up
in the help system. Moreover, playing with the presented code (running, modifying,
experimenting, etc.) is also very beneficial. We should develop the habit of asking
ourselves questions like “What would happen if…”, and then finding the answers on
our own.

We are now ready to discuss the most significant operations on numeric vectors,
which constitute the main theme of the next chapter. See you there.

1.5 Exercises
Exercise 1.6 What are the three most important types of atomic vectors?
Exercise 1.7 According to the classification of the R data types we introduced in the previous
chapter, are atomic vectors basic or compound types?
2 Numeric vectors

In this chapter, we discuss the uttermost common operations on numeric vectors.


They are so fundamental that we will also find them in other scientific computing
environments, including Python with numpy or tensorflow, Julia, MATLAB, GNU
Octave, or Scilab.
At first blush, the number of functions we are going to explore may seem quite large.
Still, the reader is kindly asked to place some trust (a rare thing these days) in yours
truly. It is because our selection is comprised only of the most representative and edu-
cational amongst the plethora of possible choices. More complex building blocks can
often be reduced to a creative combination of the former or be easily found in a num-
ber of additional packages or libraries (e.g., GNU GSL [28]).
A solid understanding of base R programming is crucial for dealing with popular pack-
ages (such as data.table, dplyr, or caret). Most importantly, base R’s API is stable.
Hence, the code we compose today will most likely work the same way in ten years. It
is often not the case when we rely on external add-ons.
In the sequel, we will be advocating a minimalist, keep-it-simple approach to the art of
programming data processing pipelines, one that is a healthy balance between “doing
it all by ourselves”, “minimising the information overload”, “being lazy”, and “standing
on the shoulders of giants”.

Note The exercises that we suggest in the sequel are all self-contained, unless expli-
citly stated otherwise. The use of language constructs that are yet to be formally intro-
duced (in particular, if, for, and while explained in Chapter 8) is not just unneces-
sary: it is discouraged. Moreover, we recommend against taking shortcuts by looking
up partial solutions on the internet. Rather, to get the most out of this course, we
should be seeking relevant information within the current and preceding chapters as
well as the R help system.

2.1 Creating numeric vectors


2.1.1 Numeric constants
The simplest numeric vectors are those of length one:

-3.14
## [1] -3.14
1.23e-4
## [1] 0.000123

The latter is in what we call scientific notation, which is a convenient means of entering
numbers of very large or small orders of magnitude. Here, “e” stands for “… times 10 to
the power of…”. Therefore, 1.23e-4 is equal to 1.23×10−4 = 0.000123. In other words,
given 1.23, we move the decimal separator by four digits towards the left, adding zer-
oes if necessary.
In real life, some information items may be inherently or temporarily missing, un-
known, or Not Available. As R is orientated towards data processing, it was equipped
with a special indicator:
NA_real_ # numeric NA (missing value)
## [1] NA

It is similar to the Null marker in database query languages such as SQL. Note that
NA_real_ is displayed simply as “NA”, chiefly for readability.

Moreover, Inf denotes infinity, ∞, i.e., an element that is larger than the largest representable double-precision (64 bit) floating point value. Also, NaN stands for not-a-number, which is returned as the result of some illegal operations, e.g., 0/0 or ∞ − ∞.
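For example (these operations do not raise errors; they merely yield the special values just mentioned):
c(1/0, -1/0, 0/0, Inf - Inf)
## [1]  Inf -Inf  NaN  NaN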

Let’s provide a few ways to create numeric vectors with possibly more than one ele-
ment.

2.1.2 Concatenating vectors with c


First, the c function can be used to combine (concatenate) many numeric vectors, each
of any length. It results in a single object:
c(1, 2, 3) # three vectors of length one –> one vector of length three
## [1] 1 2 3
c(1, c(2, NA_real_, 4), 5, c(6, c(7, Inf)))
## [1] 1 2 NA 4 5 6 7 Inf

Note Running help("c"), we will see that its usage is like c(...). In the current
context, this means that the c function takes an arbitrary number of arguments. In
Section 9.4.6, we will study the dot-dot-dot (ellipsis) parameter in more detail.

2.1.3 Repeating entries with rep


Second, rep replicates the elements in a vector a given number of times.

rep(1, 5)
## [1] 1 1 1 1 1
rep(c(1, 2, 3), 4)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3

In the second case, the whole vector (1, 2, 3) has been recycled (tiled) four times. Inter-
estingly, if the second argument is a vector of the same length as the first one, the
behaviour will be different:
rep(c(1, 2, 3), c(2, 1, 4))
## [1] 1 1 2 3 3 3 3
rep(c(1, 2, 3), c(4, 4, 4))
## [1] 1 1 1 1 2 2 2 2 3 3 3 3

Here, each element is repeated the corresponding number of times.


Calling help("rep"), we find that the function’s usage is like rep(x, ...). It is rather
peculiar. However, reading further, we discover that the ellipsis (dot-dot-dot) may be
fed with one of the following parameters:
• times,
• length.out1 ,
• each.
So far, we have been playing with times, which is listed second in the parameter list
(after x, the vector whose elements are to be repeated).

Important It turns out that the undermentioned function calls are all equivalent:
rep(c(1, 2, 3), 4) # positional matching of arguments: `x`, then `times`
rep(c(1, 2, 3), times=4) # `times` is the second argument
rep(x=c(1, 2, 3), times=4) # keyword arguments of the form name=value
rep(times=4, x=c(1, 2, 3)) # keyword arguments can be given in any order
rep(times=4, c(1, 2, 3)) # mixed positional and keyword arguments

We can also pass each or length.out, but their names must be mentioned explicitly:
rep(c(1, 2, 3), length.out=7)
## [1] 1 2 3 1 2 3 1
rep(c(1, 2, 3), each=3)
## [1] 1 1 1 2 2 2 3 3 3
rep(c(1, 2, 3), length.out=7, each=3)
## [1] 1 1 1 2 2 2 3

Note Whether it was a good programming practice to actually implement a range


1 A dot has no special meaning in R; see Section 2.2.

of varied behaviours inside a single function is a matter of taste. On the one hand, in
all of the preceding examples, we do repeat the input elements somehow, so remem-
bering just one function name is really convenient. Nevertheless, a drastic change in
the repetition pattern depending, e.g., on the length of the times argument can be
bug-prone. Anyway, we have been warned2 .

Zero-length vectors are possible too:


rep(c(1, 2, 3), 0)
## numeric(0)

Even though their handling might be a little tricky, we will later see that they are in-
dispensable in contexts like “create an empty data frame with a specific column struc-
ture”.
Also, note that R often allows for partial matching of named arguments, but its use is
a bad programming practice; see Section 15.4.4 for more details.
rep(c(1, 2, 3), len=7) # not recommended (see later)
## Warning in rep(c(1, 2, 3), len = 7): partial argument match of 'len' to
## 'length.out'
## [1] 1 2 3 1 2 3 1

We see the warning only because we have set options(warnPartialMatchArgs=TRUE)


in our environment. It is not used by default.

2.1.4 Generating arithmetic progressions with seq and `:`


Third, we can call the seq function to create a sequence of equally-spaced numbers on
a linear scale, i.e., an arithmetic progression.
seq(1, 15, 2)
## [1] 1 3 5 7 9 11 13 15

From the function’s help page, we discover that seq accepts the from, to, by, and
length.out arguments, amongst others. Thus, the preceding call is equivalent to:

seq(from=1, to=15, by=2)


## [1] 1 3 5 7 9 11 13 15

Note that to actually means “up to”:


2 Some “caring” R users might be tempted to introduce two new functions now, one for generating (1,

2, 3, 1, 2, 3, …) only and the other outputting patterns like (1, 1, 1, 2, 2, 2, …). They would most likely wrap
them in a new package and announce that on social media. But this is nothing else than a multiplication of
entities without actual necessity. This way, we would end up with three functions. First is the original one,
rep, which everyone ought to know anyway because it is part of the standard library. Second and third are
the two redundant procedures whose user-friendliness is only illusory. See also Chapter 9 for a discussion
on the design of functions.

seq(from=1, to=16, by=2)


## [1] 1 3 5 7 9 11 13 15

We can also pass length.out instead of by. In such a case, the increments or decre-
ments will be computed via the formula ((to - from)/(length.out - 1)). This
default value is reported in the Usage section of help("seq").
seq(1, 0, length.out=5)
## [1] 1.00 0.75 0.50 0.25 0.00
seq(length.out=5) # default `from` is 1
## [1] 1 2 3 4 5

Arithmetic progressions with steps equal to 1 or -1 can also be generated via the `:`
operator.
1:10 # seq(1, 10) or seq(1, 10, 1)
## [1] 1 2 3 4 5 6 7 8 9 10
-1:10 # seq(-1, 10) or seq(-1, 10, 1)
## [1] -1 0 1 2 3 4 5 6 7 8 9 10
-1:-10 # seq(-1, -10) or seq(-1, -10, -1)
## [1] -1 -2 -3 -4 -5 -6 -7 -8 -9 -10

Let’s highlight the order of precedence of this operator: -1:10 means (-1):10, and
not -(1:10); compare Section 2.4.3.
Exercise 2.1 Take a look at the manual page of seq_along and seq_len and determine
whether we can do without them, having seq3 at hand.

2.1.5 Generating pseudorandom numbers


We can also generate sequences drawn independently from a range of univariate prob-
ability distributions.
runif(7) # uniform U(0, 1)
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556 0.528105
rnorm(7) # normal N(0, 1)
## [1] 1.23950 -0.10897 -0.11724 0.18308 1.28055 -1.72727 1.69018

These correspond to seven pseudorandom deviates from the uniform distribution on


the unit interval (i.e., (0, 1)) and the standard normal distribution (i.e., with expecta-
tion 0 and standard deviation 1), respectively; compare Figure 2.3.
For more named distribution classes that frequently occur in various real-world statistical modelling exercises, see Section 2.3.4.
Another worthwhile function picks items from a given vector, either with or without
replacement:
3 Certain configurations of seq and its variants might return vectors of the type integer instead of

double, some of them in a compact (ALTREP) form; see Section 6.4.1.



sample(1:10, 20, replace=TRUE) # 20 with replacement (allow repetitions)


## [1] 3 3 10 2 6 5 4 6 9 10 5 3 9 9 9 3 8 10 7 10
sample(1:10, 5, replace=FALSE) # 5 without replacement (do not repeat)
## [1] 9 3 4 6 1

The distribution of the sampled values does not need to be uniform; the prob argument
may be fed with a vector of the corresponding probabilities. For example, here are 20
independent realisations of the random variable 𝑋 such that Pr(𝑋 = 0) = 0.9 (the
probability that we obtain 0 is equal to 90%) and Pr(𝑋 = 1) = 0.1:
sample(0:1, 20, replace=TRUE, prob=c(0.9, 0.1))
## [1] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1

Note If n is a single number (a numeric vector of length 1), then sample(n, ...)
is equivalent to sample(1:n, ...). Similarly, seq(n) is a synonym for seq(1, n)
or seq(1, length(n)), depending on the length of n. This is a dangerous behaviour
that can occasionally backfire and lead to bugs (check what happens when n is, e.g.,
0). Nonetheless, we have been warned. From now on, we are going to be extra careful
(but are we really?). Read more at help("sample") and help("seq").
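For instance, a quick look at the deterministic part of this behaviour (compare Exercise 2.1 on seq_len):
seq(5)       # here, like seq(1, 5)
## [1] 1 2 3 4 5
seq(0)       # beware: like seq(1, 0), not an empty vector
## [1] 1 0
seq_len(0)   # safe when 0 is a legitimate input size
## integer(0)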

Let’s stress that the numbers we obtain are merely pseudorandom because they are gen-
erated algorithmically. R uses the Mersenne-Twister MT19937 method [46] by default;
see help("RNG") and [21, 29, 42]. By setting the seed of the random number generator,
i.e., resetting its state to a given one, we can obtain results that are reproducible.
set.seed(12345) # seeds are specified with integers
sample(1:10, 5, replace=TRUE) # a,b,c,d,e
## [1] 3 10 8 10 8
sample(1:10, 5, replace=TRUE) # f,g,h,i,j
## [1] 2 6 6 7 10

Setting the seed to the one used previously gives:


set.seed(12345)
sample(1:10, 5, replace=TRUE) # a,b,c,d,e
## [1] 3 10 8 10 8

We did not(?) expect that! And now for something completely different:
set.seed(12345)
sample(1:10, 10, replace=TRUE) # a,b,c,d,e,f,g,h,i,j
## [1] 3 10 8 10 8 2 6 6 7 10

Reproducibility is a crucial feature of each truly scientific experiment. The same initial
condition (here: the same seed) leads to exactly the same outcomes.

Note Some claim that the only unsuspicious seed is 42 but in matters of taste, there

can be no disputes. Everyone can use their favourite picks: yours truly savours 123, 1234,
and 12345 as well.
When performing many runs of Monte Carlo experiments, it may also be a clever idea
to call set.seed(i) in the 𝑖-th iteration of a simulation we are trying to program.
We should ensure that our seed settings are applied consistently across all our scripts.
Otherwise, we might be accused of tampering with evidence. For instance, here is the
ultimate proof that we are very lucky today:
set.seed(1679619) # totally unsuspicious, right?
sample(0:1, 20, replace=TRUE) # so random
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

This is exactly why reproducible scripts and auxiliary data should be published along-
side all research reports or papers. Only open, transparent science can be fully trust-
worthy.

If set.seed is not called explicitly, and the random state is not restored from the pre-
viously saved R session (see Chapter 16), then the random generator is initialised based
on the current wall time and the identifier of the running R instance (PID). This may
justify the impression that the numbers we generate appear surprising.
To understand the “pseudo” part of the said randomness better, in Section 8.3, we will
build a very simple random generator ourselves.

2.1.6 Reading data with scan


An example text file named euraud-20200101-20200630.csv4 gives the EUR to AUD
exchange rates (how many Australian Dollars can we buy for 1 Euro) from 1 January to
30 June 2020 (remember COVID-19?). Let’s preview the first couple of lines:
# EUR/AUD Exchange Rates
# Source: Statistical Data Warehouse of the European Central Bank System
# https://www.ecb.europa.eu/stats/policy_and_exchange_rates/
# (provided free of charge)
NA
1.6006
1.6031
NA

The four header lines that begin with “#” merely serve as comments for us humans.
They should be ignored by the interpreter. The first “real” value, NA, corresponds to
1 January (Wednesday, New Year’s Day; Forex markets were closed, hence a missing
observation).
We can invoke the scan function to read all the inputs and convert them to a single
numeric vector:
4 https://github.com/gagolews/teaching-data/raw/master/marek/euraud-20200101-20200630.csv

scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
## [1] NA 1.6006 1.6031 NA NA 1.6119 1.6251 1.6195 1.6193 1.6132
## [11] NA NA 1.6117 1.6110 1.6188 1.6115 1.6122 NA NA 1.6154
## [21] 1.6177 1.6184 1.6149 1.6127 NA NA 1.6291 1.6290 1.6299 1.6412
## [31] 1.6494 NA NA 1.6521 1.6439 1.6299 1.6282 1.6417 NA NA
## [41] 1.6373 1.6260 1.6175 1.6138 1.6151 NA NA 1.6129 1.6195 1.6142
## [51] 1.6294 1.6363 NA NA 1.6384 1.6442 1.6565 1.6672 1.6875 NA
## [61] NA 1.6998 1.6911 1.6794 1.6917 1.7103 NA NA 1.7330 1.7377
## [71] 1.7389 1.7674 1.7684 NA NA 1.8198 1.8287 1.8568 1.8635 1.8226
## [81] NA NA 1.8586 1.8315 1.7993 1.8162 1.8209 NA NA 1.8021
## [91] 1.7967 1.8053 1.7970 1.8004 NA NA 1.7790 1.7578 1.7596
## [ reached getOption("max.print") -- omitted 83 entries ]

We used the paste0 function (Section 6.1.3) to concatenate two long strings (too long
to fit a single line of code) and form a single URL.
We can also read the files located on our computer. For example:
scan("~/Projects/teaching-data/marek/euraud-20200101-20200630.csv",
comment.char="#")

It used an absolute file path that starts at the user's home directory, denoted "~". In yours truly's case, it is /home/gagolews.

Note For portability reasons, we suggest slashes, “/”, as path separators; see also
help("file.path") and help(".Platform"). They are recognised by all UNIX-like
boxes as well as by other popular operating systems, including W*****ws. Note that
URLs, such as https://deepr.gagolewski.com/, consist of slashes too.

Paths can also be relative to the current working directory, denoted “.”, which can be
read via a call to getwd. Usually, it is the directory from where the R session has been
started.
For instance, if the working directory was /home/gagolews/Projects/teaching-data/marek, we could write the file path equivalently as ./euraud-20200101-20200630.csv or even euraud-20200101-20200630.csv.

On a side note, ".." marks the parent directory of the current working directory. In the above example, ../r/iris.csv is equivalent to /home/gagolews/Projects/teaching-data/r/iris.csv.
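As a simple sketch (the value returned by getwd depends on where the session was started, hence it is not shown; file.path glues path components using the portable separator):
# getwd()  # the current working directory
file.path("..", "r", "iris.csv")
## [1] "../r/iris.csv"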

Exercise 2.2 Read the help page about scan. Take note of the following formal arguments and
their meaning: dec, sep, what, comment.char, and na.strings.
Later we will discuss the read.table and read.csv functions. They are wrappers around scan that read structured data. Also, write exports an atomic vector's contents
to a text file.

Example 2.3 Figure 2.1 shows the graph of the aforementioned exchange rates, which was gen-
erated by calling:
plot(scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#"),
xlab="Day", ylab="EUR/AUD")

Figure 2.1. EUR/AUD exchange rates from 2020-01-01 (day 1) to 2020-06-30 (day 182).

Somewhat misleadingly (and for reasons that will become apparent later), the documentation of
plot can be accessed by calling help("plot.default"). Read about, and experiment with,
different values of the main, xlab, ylab, type, col, pch, cex, lty, and lwd arguments. More
plotting routines will be discussed in Chapter 13.

2.2 Creating named objects


The objects we bring forth will often need to be memorised so that they can be referred
to in further computations. The assignment operator, `<-`, can be used for this pur-
pose:
x <- 1:3 # creates a numeric vector and binds the name `x` to it

The now-named object can be recalled5 and dealt with as we please:

5 Name bindings are part of environment frames; see Chapter 16.



print(x) # or just `x` in the R console


## [1] 1 2 3
sum(x) # example operation: compute the sum of all elements in `x`
## [1] 6

Important In R, all names are case-sensitive. Hence, x and X can coexist peacefully:
when set, they refer to two different objects. If we tried calling Print(x), print(X),
or PRINT(x), we would get an error.

Typically, we will be using syntactic names. In help("make.names"), we read: A syntactic-


ally valid name consists of letters, numbers and the dot or underline characters and starts with
a letter or the dot not followed by a number. Names such as .2way are not valid, and neither are
the reserved words such as if, for, function, next, and TRUE, but see Section 9.3.1 for
an exception.
A fine name is self-explanatory and thus reader-friendly: patients, mean, and aver-
age_scores are way better (if they are what they claim they are) than xyz123, crap, or
spam. Also, it might not be such a bad idea to get used to denoting:

• vectors by x, y, z,
• matrices (and matrix-like objects) by A, B, …, X, Y, Z,
• integer indexes by letters i, j, k, l,
• object sizes by n, m, d, p or nx, ny, etc.,
especially when they are only of temporary nature (for storing auxiliary results, iter-
ating over collections of objects, etc.).
There are numerous naming conventions that we can adopt, but most often they are
a matter of taste; snake_case, lowerCamelCase, UpperCamelCase, flatcase, or dot.
case are equally sound as long as they are used coherently (for instance, some use
snake_case for vectors and UpperCamelCase for functions). Occasionally, we have
little choice but to adhere to the naming conventions of the project we are about to
contribute to.

Note Generally, a dot, “.”, has no special meaning6 ; na.omit is as appropriate a name
as na_omit, naOmit, NAOMIT, naomit, and NaOmit. Readers who know other program-
ming languages will need to habituate themselves to this convention.

R, as a dynamic language, allows for introducing new variables at any time. Moreover,
existing names can be bound to new values. For instance:

6 See Section 10.2 and Section 16.2.1 for a few asterisks.



(y <- "spam") # bracketed expression – printing not suppressed


## [1] "spam"
x <- y # overwrites the previous `x`
print(x)
## [1] "spam"

Now x refers to a verbatim copy of y.

Note Objects are automatically destroyed when we cannot access them anymore. By
now, the garbage collector is likely to have got rid of the foregoing 1:3 vector (to which
the name x was bound previously).

2.3 Vectorised mathematical functions


Mathematically, we will be denoting a given vector 𝒙 of length 𝑛 by (𝑥1 , 𝑥2 , … , 𝑥𝑛 ). In
other words, 𝑥𝑖 is its 𝑖-th element. Let’s review a few operations that are ubiquitous in
numerical computing.

2.3.1 abs and sqrt


R implements vectorised versions of the most popular mathematical functions, e.g.,
abs (absolute value, |𝑥|) and sqrt (square root, √𝑥).

abs(c(2, -1, 0, -3, NA_real_))


## [1] 2 1 0 3 NA

Here, vectorised means that instead of being defined to act on a single numeric value,
they are applied on each element in a vector. The 𝑖-th resulting item is a transformed
version of the 𝑖-th input:

|𝒙| = (|𝑥1 |, |𝑥2 |, … , |𝑥𝑛 |).

Moreover, if an input is a missing value, the corresponding output will be marked as


unknown as well.
Another example:
x <- c(4, 2, -1)
(y <- sqrt(x))
## Warning in sqrt(x): NaNs produced
## [1] 2.0000 1.4142 NaN

To attract our attention to the fact that computing the square root of a negative value is
a reckless act, R generated an informative warning. However, a warning is not an error:

the result is being produced as usual. In this case, the ill value is marked as not-a-
number.
Also, the fact that the irrational √2 is displayed7 as 1.4142 does not mean that it is such
a crude approximation to 1.414213562373095048801688724209698.... It was roun-
ded when printing purely for aesthetic reasons. In fact, in Section 3.2.3, we will point
out that the computer’s floating-point arithmetic has roughly 16 decimal digits preci-
sion (but we shall see that the devil is in the detail).
print(y, digits=16) # display more significant figures
## [1] 2.000000000000000 1.414213562373095 NaN

2.3.2 Rounding
The following functions drop all or portions of fractional parts of numbers:
• floor(x) (rounds down to the nearest integer, denoted ⌊𝑥⌋),
• ceiling(x) (rounds up, denoted ⌈𝑥⌉ = −⌊−𝑥⌋),
• trunc(x) (rounds towards zero),
• round(x, digits=0) (rounds to the nearest number with digits decimal digits).
For instance:
x <- c(7.0001, 6.9999, -4.3149, -5.19999, 123.4567, -765.4321, 0.5, 1.5, 2.5)
floor(x)
## [1] 7 6 -5 -6 123 -766 0 1 2
ceiling(x)
## [1] 8 7 -4 -5 124 -765 1 2 3
trunc(x)
## [1] 7 6 -4 -5 123 -765 0 1 2

Note When we write that a function’s usage is like round(x, digits=0), compare
help("round"), we mean that the digits parameter is equipped with the default value
of 0. In other words, if rounding to 0 decimal digits is what we need, the second argu-
ment can be omitted.

round(x) # the same as round(x, 0); round half to even


## [1] 7 7 -4 -5 123 -765 0 2 2
round(x, 1) # round to tenths (nearest 0.1s)
## [1] 7.0 7.0 -4.3 -5.2 123.5 -765.4 0.5 1.5 2.5
round(x, -2) # round to hundreds (nearest 100s)
## [1] 0 0 0 0 100 -800 0 0 0

7 There are a couple of settings in place that control the default behaviour of the print function; see
width, digits, max.print, OutDec, scipen, etc. in help("options").

2.3.3 Natural exponential function and logarithm


Moreover:
• exp(x) outputs the natural exponential function, 𝑒^𝑥, where Euler's number 𝑒 ≃ 2.718,
• log(x, base=exp(1)) computes, by default, the natural logarithm of 𝑥, log_𝑒 𝑥 (which is most often denoted simply by log 𝑥).
Recall that if 𝑥 = 𝑒^𝑦, then log_𝑒 𝑥 = 𝑦, i.e., one is the inverse of the other.
log(c(0, 1, 2.7183, 7.3891, 20.0855)) # grows slowly
## [1] -Inf 0 1 2 3
exp(c(0, 1, 2, 3)) # grows fast
## [1] 1.0000 2.7183 7.3891 20.0855

These functions enjoy a number of very valuable identities and inequalities. In particular, we should know from school that log(𝑥 ⋅ 𝑦) = log 𝑥 + log 𝑦, log(𝑥^𝑦) = 𝑦 log 𝑥, and 𝑒^(𝑥+𝑦) = 𝑒^𝑥 ⋅ 𝑒^𝑦.
For the logarithm to a different base, say, log10 𝑥, we can call:
log(c(0, 1, 10, 100, 1000, 1e10), 10) # or log(..., base=10)
## [1] -Inf 0 1 2 3 10

Recall that if log_𝑏 𝑥 = 𝑦, then 𝑥 = 𝑏^𝑦, for any 1 ≠ 𝑏 > 0.
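We can verify such identities empirically, e.g., with all.equal, which tests for approximate equality (exact comparisons of floating-point results can be deceptive; see Section 3.2.3). The values of x and y below are arbitrary positive numbers:
x <- 2.5
y <- 4
all.equal(log(x*y), log(x) + log(y))
## [1] TRUE
all.equal(log(x^y), y*log(x))
## [1] TRUE
all.equal(log(x, 10), log(x)/log(10))  # change of base
## [1] TRUE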


Example 2.4 Commonly, a logarithmic scale is used for variables that grow rapidly when ex-
pressed as functions of each other; see Figure 2.2.
x <- seq(0, 10, length.out=1001)
par(mfrow=c(1, 2)) # two plots in one figure (one row, two columns)
plot(x, exp(x), type="l") # left subplot
plot(x, exp(x), type="l", log="y") # log-scale on the y-axis; right subplot

Let's highlight that 𝑒^𝑥 on the log-scale is nothing more than a straight line. Such a transformation of the axes can only be applied in the case of values strictly greater than 0.

2.3.4 Probability distributions (*)


It should come as no surprise that R offers extensive support for many univariate prob-
ability distribution families, including:
• continuous distributions, i.e., those whose support is comprised of uncountably
many real numbers (e.g., some interval or the whole real line):
– *unif (uniform),
– *norm (normal),
– *exp (exponential),


Figure 2.2. Linear- vs log-scale on the y-axis.

– *gamma (gamma, Γ),


– *beta (beta, B),
– *lnorm (log-normal),
– *t (Student),
– *cauchy (Cauchy–Lorentz),
– *chisq (chi-squared, 𝜒 2 ),
– *f (Snedecor–Fisher),
– *weibull (Weibull);
with the prefix “*” being one of:
– d (probability density function, PDF),
– p (cumulative distribution function, CDF; or survival function, SF),
– q (quantile function, being the inverse of the CDF),
– r (generation of random deviates; already mentioned above);
• discrete distributions, i.e., those whose possible outcomes can easily be enumer-
ated (e.g., some integers):
– *binom (binomial),
– *geom (geometric),
– *pois (Poisson),

– *hyper (hypergeometric),
– *nbinom (negative binomial);
prefixes “p” and “r” retain their meaning, however:
– d now gives the probability mass function (PMF),
– q brings about the quantile function, defined as a generalised inverse of the
CDF.
Each distribution is characterised by a set of underlying parameters. For instance, a
normal distribution N(𝜇, 𝜎) can be pinpointed by setting its expected value 𝜇 ∈ ℝ
and standard deviation 𝜎 > 0. In R, these two have been named mean and sd, re-
spectively; see help("dnorm"). Therefore, e.g., dnorm(x, 1, 2) computes the PDF of
N(1, 2) at x.
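For instance, here is a small illustration of the d, p, and q prefixes at work for the standard normal distribution N(0, 1) (the evaluation points are arbitrary):
dnorm(c(-1, 0, 1))  # PDF of N(0, 1) at three points
## [1] 0.24197 0.39894 0.24197
pnorm(1.96)         # CDF: P(X <= 1.96) for X following N(0, 1)
## [1] 0.975
qnorm(0.975)        # quantile function, i.e., the inverse of the CDF
## [1] 1.96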

Note The parametrisations assumed in R can be subtly different from what we know
from statistical textbooks or probability courses. For example, the normal distribu-
tion can be identified based on either standard deviation or variance, and the expo-
nential distribution can be defined via expected value or its reciprocal. We thus advise
the reader to study carefully the documentation of help("dnorm"), help("dunif"),
help("dexp"), help("dbinom"), and the like.

It is also worth knowing the typical use cases of each of the distributions listed, e.g.,
a Poisson distribution can describe the probability of observing the number of in-
dependent events in a fixed time interval (e.g., the number of users downloading a
copy of R from CRAN per hour), and an exponential distribution can model the time
between such events; compare [23].
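As a tentative illustration (the rate parameters below are, of course, made up):
round(dpois(0:4, 2), 3)  # P(K=0), ..., P(K=4) for a Poisson variable with rate 2
## [1] 0.135 0.271 0.271 0.180 0.090
pexp(1, 2)               # P(T <= 1) for an exponential waiting time with rate 2
## [1] 0.86466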

Exercise 2.5 A call to hist(x) draws a histogram, which can serve as an estimator of the un-
derlying continuous probability density function of a given sample; see Figure 2.3 for an illustra-
tion.
par(mfrow=c(1, 2)) # two plots in one figure
# left subplot: uniform U(0, 1)
hist(runif(10000, 0, 1), col="white", probability=TRUE, main="")
x <- seq(0, 1, length.out=101)
lines(x, dunif(x, 0, 1), lwd=2) # draw the true density function (PDF)
# right subplot: normal N(0, 1)
hist(rnorm(10000, 0, 1), col="white", probability=TRUE, main="")
x <- seq(-4, 4, length.out=101)
lines(x, dnorm(x, 0, 1), lwd=2) # draw the PDF

Draw a histogram of some random samples of different sizes n from the following distributions:
• rnorm(n, µ, σ) – normal N(𝜇, 𝜎) with expected values 𝜇 ∈ {−1, 0, 5} (i.e., 𝜇 being
equal to either −1, 0, or 5; read “∈” as “belongs to the given set” or “in”) and standard devi-
ations 𝜎 ∈ {0.5, 1, 5};

Figure 2.3. Example histograms of some pseudorandom samples and the true under-
lying probability density functions: the uniform distribution on the unit interval (left)
and the standard normal distribution (right).

• runif(n, a, b) – uniform U(𝑎, 𝑏) on the interval (𝑎, 𝑏) with 𝑎 = 0 and 𝑏 = 1 as well


as 𝑎 = −1 and 𝑏 = 1;
• rbeta(n, α, β) – beta B(𝛼, 𝛽) with 𝛼, 𝛽 ∈ {0.5, 1, 2};
• rexp(n, λ) – exponential E(𝜆) with rates 𝜆 ∈ {0.5, 1, 10};
Moreover, read about and play with the breaks, main, xlab, ylab, xlim, ylim, and col para-
meters; see help("hist").
Example 2.6 We roll a six-sided dice twelve times. Let 𝐶 be a random variable describing the
number of cases where the “1” face is thrown. 𝐶 follows a binomial distribution Bin(𝑛, 𝑝) with
parameters 𝑛 = 12 (the number of Bernoulli trials) and 𝑝 = 1/6 (the probability of success
in a single roll).
The probability mass function, dbinom, represents the probabilities that the number of “1”s rolled
is equal to 0, 1, …, or 12, i.e., 𝑃(𝐶 = 0), 𝑃(𝐶 = 1), …, or 𝑃(𝐶 = 12), respectively:
round(dbinom(0:12, 12, 1/6), 2) # PMF at 13 different points
## [1] 0.11 0.27 0.30 0.20 0.09 0.03 0.01 0.00 0.00 0.00 0.00 0.00 0.00

On the other hand, the probability that we throw no more than three “1”s, 𝑃(𝐶 ≤ 3), can be
determined by means of the cumulative distribution function, pbinom:
pbinom(3, 12, 1/6) # P(C <= 3); cf. lower.tail=FALSE, which gives P(C > 3)
## [1] 0.87482

The smallest 𝑐 such that 𝑃(𝐶 ≤ 𝑐) ≥ 0.95 can be computed based on the quantile function:

qbinom(0.95, 12, 1/6)


## [1] 4
pbinom(3:4, 12, 1/6) # for comparison: 0.95 is in-between
## [1] 0.87482 0.96365

In other words, at least 95% of the time, we will be observing no more than four successes.
Also, here are 30 pseudorandom realisations (simulations) of the random variable 𝐶:
rbinom(30, 12, 1/6) # how many successes in 12 trials, repeated 30 times
## [1] 1 3 2 4 4 0 2 4 2 2 4 2 3 2 0 4 1 0 1 4 4 3 2 6 2 3 2 2 1 1

2.3.5 Special functions (*)


Within mathematical formulae and across assorted application areas, certain func-
tions appear more frequently than others. Hence, for the sake of notational brevity
and computational precision, many of them have been assigned special names. For
instance, the following functions are mentioned in the definitions related to a few
probability distributions:

• gamma(x) for 𝑥 > 0 computes Γ(𝑥) = ∫₀^∞ 𝑡^(𝑥−1) 𝑒^(−𝑡) d𝑡,
• beta(a, b) for 𝑎, 𝑏 > 0 yields 𝐵(𝑎, 𝑏) = Γ(𝑎)Γ(𝑏)/Γ(𝑎+𝑏) = ∫₀^1 𝑡^(𝑎−1) (1 − 𝑡)^(𝑏−1) d𝑡.
Why do we have beta if it is merely a mix of gammas? A specific, tailored function is
expected to be faster and more precise than its DIY version; its underlying implement-
ation does not have to involve any calls to gamma.
beta(0.25, 250) # okay
## [1] 0.91213
gamma(0.25)*gamma(250)/gamma(250.25) # not okay
## [1] NaN

The Γ function grows so rapidly that already gamma(172) gives rise to Inf. It is due to
the fact that a computer’s arithmetic is not infinitely precise; compare Section 3.2.3.
Special functions are plentiful; see the open-access NIST Digital Library of Mathematical
Functions [51] for one of the most definitive references (and also [2] for its predecessor).
R package gsl [33] provides a vectorised interface to the GNU GSL [28] library, which
implements many of such routines.
Exercise 2.7 The Pochhammer symbol, (𝑎)𝑥 = Γ(𝑎 + 𝑥)/Γ(𝑎), can be computed via a call to
gsl::poch(a, x), i.e., the poch function from the gsl package:

# call install.packages("gsl") first


library("gsl") # load the package
poch(10, 3:6) # calls gsl_sf_poch() from GNU GSL
## [1] 1320 17160 240240 3603600

Read the documentation of the corresponding gsl_sf_poch function in the GNU GSL manual8 .
And when you are there, do not hesitate to go through the list of all functions, including those
related to statistics, permutations, combinations, and so forth.
Many functions also have their logarithm-of versions; see, e.g., lgamma and lbeta.
Also, for instance, dnorm and dbeta have the log parameter. Their classical use case
is the (numerical) maximum likelihood estimation, which involves the sums of the
logarithms of densities.
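For example (a small sketch; lbeta and lgamma are base R functions, and the sample z below is arbitrary, hence the log-likelihood itself is not printed):
all.equal(lbeta(0.25, 250), lgamma(0.25) + lgamma(250) - lgamma(250.25))
## [1] TRUE
z <- rnorm(100)                           # some hypothetical data
loglik <- sum(dnorm(z, 0, 1, log=TRUE))   # log-likelihood of N(0, 1)
# more robust numerically than log(prod(dnorm(z, 0, 1))), where the product
# of many small densities can easily underflow to zero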

2.4 Arithmetic operations


2.4.1 Vectorised arithmetic operators
R features the following binary arithmetic operators:
• `+` (addition) and `-` (subtraction),
• `*` (multiplication) and `/` (division),
• `%/%` (integer division) and `%%` (modulo, division remainder),
• `^` (exponentiation; synonym: `**`).
They are all vectorised: they take two vectors on input and produce another vector on
output.
c(1, 2, 3) * c(10, 100, 1000)
## [1] 10 200 3000

The operation was performed in an elementwise fashion on the corresponding pairs of ele-
ments from both vectors. The first element in the left sequence was multiplied by the
corresponding element in the right vector, and the result was stored in the first element
of the output. Then, the second element in the left… all right, we get it.
Other operators behave similarly:
0:10 + seq(0, 1, 0.1)
## [1] 0.0 1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11.0
0:7 / rep(3, length.out=8) # division by 3
## [1] 0.00000 0.33333 0.66667 1.00000 1.33333 1.66667 2.00000 2.33333
0:7 %/% rep(3, length.out=8) # integer division
## [1] 0 0 0 1 1 1 2 2
0:7 %% rep(3, length.out=8) # division remainder
## [1] 0 1 2 0 1 2 0 1

Operations involving missing values also yield NAs:


8 https://www.gnu.org/software/gsl/doc/html

c(1, NA_real_, 3, NA_real_) + c(NA_real_, 2, 3, NA_real_)


## [1] NA NA 6 NA

2.4.2 Recycling rule


Some of the preceding statements can be written more concisely. When the operands
are of different lengths, the shorter one is recycled as many times as necessary, as in
rep(y, length.out=length(x)). For example:

0:7 / 3
## [1] 0.00000 0.33333 0.66667 1.00000 1.33333 1.66667 2.00000 2.33333
1:10 * c(-1, 1)
## [1] -1 2 -3 4 -5 6 -7 8 -9 10
2 ^ (0:10)
## [1] 1 2 4 8 16 32 64 128 256 512 1024

We call this the recycling rule.


If an operand cannot be recycled in its entirety, a warning9 is generated, but the output
is still available.
c(1, 10, 100) * 1:8
## Warning in c(1, 10, 100) * 1:8: longer object length is not a multiple of
## shorter object length
## [1] 1 20 300 4 50 600 7 80

Vectorisation and the recycling rule are perhaps most fruitful when applying binary
operators on sequences of identical lengths or when performing vector-scalar (i.e., a
sequence vs a single value) operations. However, there is much more: schemes like
“every 𝑘-th element” appear in Taylor series expansions (multiply by c(-1, 1)), 𝑘-fold
cross-validation, etc.; see also Section 11.3.4 for use cases in matrix/tensor processing.
Also, pmin and pmax return the parallel minimum and maximum of the corresponding
elements of the input vectors. Their behaviour is the same as the arithmetic operators,
but we call them as ordinary functions:
pmin(c(1, 2, 3, 4), c(4, 2, 3, 1))
## [1] 1 2 3 1
pmin(3, 1:5)
## [1] 1 2 3 3 3
pmax(0, pmin(1, c(0.25, -2, 5, -0.5, 0, 1.3, 0.99))) # clipping to [0, 1]
## [1] 0.25 0.00 1.00 0.00 0.00 1.00 0.99

Note Some functions can be very deeply vectorised, i.e., with respect to multiple ar-
guments. For example:
9 A few functions do not warn us whatsoever when they perform incomplete recycling (e.g., paste) or

can even give an error (e.g., as.data.frame.list). Consider this inconsistency an annoying bug and hope
it will be fixed, in the next decade or so.

runif(3, c(10, 20, 30), c(11, 22, 33))


## [1] 10.288 21.577 31.227

generates three random numbers uniformly distributed over the intervals (10, 11),
(20, 22), and (30, 33), respectively.

2.4.3 Operator precedence


Expressions involving multiple operators need a set of rules governing the order
of computations (unless we enforce it using round brackets). We have said that
-1:10 means (-1):10 rather than -(1:10). But what about, say, 1+1+1+1+1*0 or
3*2^0:5+10?

Let’s list the operators mentioned so far in their order of precedence, from the least to the
most binding (see also help("Syntax")):
1. `<-` (right to left),
2. `+` and `-` (binary),
3. `*` and `/`,
4. `%%` and `%/%`,
5. `:`,
6. `+` and `-` (unary),
7. `^` (right to left).
Hence, -2^2/3+3*4 means ((-(2^2))/3)+(3*4) and not, e.g., -((2^(2/(3+3)))*4).
Notice that `+` and `-`, `*` and `/`, as well as `%%` and `%/%` have the same priority.
Expressions involving a series of operations in the same group are evaluated left to
right, with the exception of `^` and `<-`, which are performed the other way around.
Therefore:
• 2*3/4*5 is equivalent to ((2*3)/4)*5,
• 2^3^4 is 2^(3^4) because, mathematically, we would write it as 2^(3^4) = 2^81,

• “x <- y <- 4*3%%8/2” binds both y and x to 6, not x to the previous value of y
and then y to 6.
When in doubt, we can always bracket a subexpression to ensure it is executed in the
intended order. It can also increase the readability of our code.
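As a quick sanity check of the above rules (both sides of each comparison perform exactly the same arithmetic, so testing with `==` is safe here; compare Section 3.2.3):
-2^2/3+3*4 == ((-(2^2))/3)+(3*4)
## [1] TRUE
2^3^4 == 2^(3^4)
## [1] TRUE
2*3/4*5 == ((2*3)/4)*5
## [1] TRUE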

2.4.4 Accumulating
The `+` and `*` operators, as well as the pmin and pmax functions, implement element-
wise operations that are applied on the corresponding elements taken from two given

vectors. For instance:


$$
\left(\begin{array}{c} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{array}\right)
+
\left(\begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{array}\right)
=
\left(\begin{array}{c} x_1 + y_1 \\ x_2 + y_2 \\ x_3 + y_3 \\ \vdots \\ x_n + y_n \end{array}\right).
$$

However, we can also scan through all the values in a single vector and combine the
successive elements that we inspect using the corresponding operation:
• cumsum(x) gives the cumulative sum of the elements in a vector,
• cumprod(x) computes the cumulative product,
• cummin(x) yields the cumulative minimum,
• cummax(x) breeds the cumulative maximum.
The 𝑖-th element in the output vector will consist of the sum/product/min/max of the
first 𝑖 inputs. For example:

$$
\mathrm{cumsum}\left(\begin{array}{c} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{array}\right)
=
\left(\begin{array}{c} x_1 \\ x_1 + x_2 \\ x_1 + x_2 + x_3 \\ \vdots \\ x_1 + x_2 + x_3 + \dots + x_n \end{array}\right).
$$

cumsum(1:8)
## [1] 1 3 6 10 15 21 28 36
cumprod(1:8)
## [1] 1 2 6 24 120 720 5040 40320
cummin(c(3, 2, 4, 5, 1, 6, 0))
## [1] 3 2 2 2 1 1 0
cummax(c(3, 2, 4, 5, 1, 6, 0))
## [1] 3 3 4 5 5 6 6

Example 2.8 On a side note, diff can be considered an inverse to cumsum. It computes the it-
erated difference: subtracts the first two elements, then the second from the third one, the third
from the fourth, and so on. In other words, diff(x) gives 𝒚 such that 𝑦𝑖 = 𝑥𝑖+1 − 𝑥𝑖 .
x <- c(-2, 3, 6, 2, 15)
diff(x)
## [1] 5 3 -4 13
cumsum(diff(x))
## [1] 5 8 4 17
cumsum(c(-2, diff(x))) # recreates x
## [1] -2 3 6 2 15

Thanks to diff, we can compute the daily changes to the EUR/AUD forex rates studied earlier;
see Figure 2.4.

aud <- scan(paste0("https://github.com/gagolews/teaching-data/raw/",


"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
aud_all <- na.omit(aud) # remove all missing values
plot(diff(aud_all), type="s", ylab="Daily change [EUR/AUD]") # "steps"
abline(h=0, lty="dotted") # draw a horizontal line at y=0

Figure 2.4. Iterated differences of the exchange rates (non-missing values only).

2.4.5 Aggregating
If we are only concerned with the last cumulant, which summarises all the inputs, we
have the following10 functions at our disposal:
• sum(x) computes the sum of elements in a vector, ∑_{𝑖=1}^{𝑛} 𝑥𝑖 = 𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛,
• prod(x) outputs the product of all elements, ∏_{𝑖=1}^{𝑛} 𝑥𝑖 = 𝑥1 𝑥2 ⋯ 𝑥𝑛,
• min(x) determines the minimum,
• max(x) reckons the greatest value.
sum(1:8)
## [1] 36
prod(1:8)
## [1] 40320
min(c(3, 2, 4, 5, 1, 6, 0))
## [1] 0
max(c(3, 2, 4, 5, 1, 6, 0))
## [1] 6

10 Chapter 7 will discuss the Reduce function, which generalises the above by allowing any binary operation to be propagated over a given vector.

The foregoing functions form the basis for the popular summary statistics11 (sample
aggregates) such as:
• mean(x) gives the arithmetic mean, sum(x)/length(x),
• var(x) yields the (unbiased) sample variance, sum((x-mean(x))^2)/(length(x)-1),
• sd(x) is the standard deviation, sqrt(var(x)).
Furthermore, median(x) computes the sample median, i.e., the middle value in the
sorted12 version of x.
For instance:
x <- runif(1000)
c(min(x), mean(x), median(x), max(x), sd(x))
## [1] 0.00046535 0.49727780 0.48995025 0.99940453 0.28748391
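We can verify the foregoing formulae on the x vector defined above, e.g., using all.equal, a test for near-equality (see Section 3.2.3):
all.equal(var(x), sum((x - mean(x))^2)/(length(x) - 1))
## [1] TRUE
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE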

Exercise 2.9 Let 𝒙 be any vector of length 𝑛 with positive elements. Compute its geometric and
harmonic mean, which are given by, respectively,
$$
\sqrt[n]{\prod_{i=1}^{n} x_i} = e^{\frac{1}{n} \sum_{i=1}^{n} \log x_i}
\qquad \text{and} \qquad
\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}.
$$

When solving exercises like this one, it does not really matter what data you apply these functions
on. We are being abstract in the sense that the 𝒙 vector can be anything: from the one that features
very accurate socioeconomic predictions that will help make this world less miserable, through
the data you have been collecting for the last ten years in relation to your super important PhD
research, whatever your company asked you to crunch today, to something related to the hobby
project that you enjoy doing after hours. But you can also just test the above on something like “x
<- runif(10)”, and move on.

All aggregation functions return a missing value if any of the input elements is unavail-
able. Luckily, they are equipped with the na.rm parameter, on behalf of which we can
request the removal of NAs.
aud <- scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
c(min(aud), mean(aud), max(aud))
## [1] NA NA NA
c(min(aud, na.rm=TRUE), mean(aud, na.rm=TRUE), max(aud, na.rm=TRUE))
## [1] 1.6006 1.6775 1.8635

11 Actually, var and median, amongst others, are defined by the stats package. But this one is automat-

ically loaded by default, so let’s not make a fuss about it now.


12 min, median, and max are generalised by the quantile function. We will discuss it much later (Sec-

tion 4.4.3), because it returns a named vector.



Otherwise, we could have called, e.g., mean(na.omit(x)).

Note In the documentation, we read that the usage of sum, prod, min, and max is like
sum(..., na.rm=FALSE), etc. In this context, it means that they accept any number
of input vectors, and each of them can be of arbitrary length. Therefore, min(1, 2,
3), min(c(1, 2, 3)) as well as min(c(1, 2), 3) all return the same result.

However, we also read that we have mean(x, trim=0, na.rm=FALSE, ...). This time,
only one vector can be aggregated, and any further arguments (except trim and na.rm)
are ignored.
The extra flexibility (which we do not have to rely on, ever) of the former group is due to
their being associative operations. We have, e.g., (2+3)+4 = 2+(3+4). Hence, these
operations can be performed in any order, in any group. They are primitive operations:
it is mean that is based on sum, not vice versa.
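For instance:
min(1, 2, 3)      # several arguments...
## [1] 1
min(c(1, 2), 3)   # ...or vectors of different lengths: the same result
## [1] 1
sum(1:3, 4:5, 6)  # 1+2+3+4+5+6
## [1] 21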

2.5 Exercises
Exercise 2.10 Answer the following questions.
• What is the meaning of the dot-dot-dot parameter in the definition of the c function?
• We say that the round function is vectorised. What does that mean?
• What is wrong with a call to c(sqrt(1), sqrt(2), sqrt(3))?
• What do we mean by saying that multiplication operates element by element?
• How does the recycling rule work when applying `+`?
• How to (and why) set the seed of the pseudorandom number generator?
• What is the difference between NA_real_ and NaN?
• How are default arguments specified in the manual of, e.g., the round function?
• Is a call to rep(times=4, x=1:5) equivalent to rep(4, 1:5)?
• List a few ways to generate a sequence like (-1, -0.75, -0.5, …, 0.75, 1).
• Is -3:5 the same as -(3:5)? What about the precedence of operators in expressions such as
2^3/4*5^6, 5*6+4/17%%8, and 1+-2^3:4-1?

• If x is a numeric vector of length 𝑛 (for some 𝑛 ≥ 0), how many values will sample(x)
output?
• Does scan support reading directly from compressed archives, e.g., .csv.gz files?
When in doubt, refer back to the material discussed in this chapter or the R manual.

Exercise 2.11 Thanks to vectorisation, implementing an example graph of arcsine and arccosine is straightforward.
x <- seq(-1, 1, length.out=11) # increase length.out for a smoother curve
plot(x, asin(x), # asin() computed for 11 points
type="l", # lines
ylim=c(-pi/2, pi), # y axis limits like c(y_min, y_max)
ylab="asin(x), acos(x)") # y axis label
lines(x, acos(x), col="red", lty="dashed") # adds to the current plot
legend("topright", c("asin(x)", "acos(x)"),
lty=c("solid", "dashed"), col=c("black", "red"), bg="white")

Thusly inspired, plot the following functions: |sin 𝑥^2|, |sin |𝑥||, √⌊𝑥⌋, and 1/(1 + 𝑒^(−𝑥)). Recall
that the documentation of plot can be accessed by calling help("plot.default").
Exercise 2.12 The expression:
$$
4 \sum_{i=1}^{n} \frac{(-1)^{i+1}}{2i-1} = 4 \left( \frac{1}{1} - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \dots \right)
$$

slowly converges to 𝜋 as 𝑛 approaches ∞. Calculate it for 𝑛 = 1 000 000 and 𝑛 =


1 000 000 000 using the vectorised functions and operators discussed in this chapter, making
use of the recycling rule as much as possible.
Exercise 2.13 Let x and y be two vectors of identical lengths 𝑛, say:
x <- rnorm(100)
y <- 2*x+10+rnorm(100, 0, 0.5)

Compute the Pearson linear correlation coefficient given by:


$$
r = \frac{
  \sum_{i=1}^{n} \left( x_i - \frac{1}{n} \sum_{j=1}^{n} x_j \right) \left( y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \right)
}{
  \sqrt{ \sum_{i=1}^{n} \left( x_i - \frac{1}{n} \sum_{j=1}^{n} x_j \right)^2 }
  \sqrt{ \sum_{i=1}^{n} \left( y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \right)^2 }
}.
$$

To make sure you have come up with a correct implementation, compare your result to a call to
cor(x, y).

Exercise 2.14 (*) Find an R package providing a function to compute moving (rolling) averages
and medians of a given vector. Apply them on the EUR/AUD currency exchange data. Draw thus
obtained smoothened versions of the time series.
Exercise 2.15 (**) Use a call to convolve(..., type="filter") to compute the 𝑘-moving
average of a numeric vector.

In the next chapter, we will study operations that involve logical values.
3 Logical vectors

3.1 Creating logical vectors


R defines three(!) logical constants: TRUE, FALSE, and NA, which represent “yes”, “no”,
and “???”, respectively. Each of them, when instantiated, is an atomic vector of length
one.
Some of the functions we introduced in the previous chapter can be used to generate
logical vectors as well:
c(TRUE, FALSE, FALSE, NA, TRUE, FALSE)
## [1] TRUE FALSE FALSE NA TRUE FALSE
rep(c(TRUE, FALSE, NA), each=2)
## [1] TRUE TRUE FALSE FALSE NA NA
sample(c(TRUE, FALSE), 10, replace=TRUE, prob=c(0.8, 0.2))
## [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE

Note By default, T is a synonym for TRUE and F stands for FALSE. However, these are
not reserved keywords and can be reassigned to any other values. Therefore, we advise
against relying on them: they are not used anywhere in this book.
Also, notice that the logical missing value is spelled simply as NA, and not NA_logical_.
Both the logical NA and the numeric NA_real_ are, for the sake of our widely-conceived wellbeing, printed as "NA" on the R console. This, however, does not mean they are
identical; see Section 4.1 for discussion.

3.2 Comparing elements


3.2.1 Vectorised relational operators
Logical vectors frequently come into being as a result of various testing activities. In
particular, the binary operators:
• `<` (less than),

• `<=` (less than or equal),


• `>` (greater than),
• `>=` (greater than or equal)
• `==` (equal),
• `!=` (not equal),
compare the corresponding elements of two numeric vectors and output a logical vector.
1 < 3
## [1] TRUE
c(1, 2, 3, 4) == c(2, 2, 3, 8)
## [1] FALSE TRUE TRUE FALSE
1:10 <= 10:1
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

Thus, they operate in an elementwise manner. Moreover, the recycling rule is applied
if necessary:
3 < 1:5 # c(3, 3, 3, 3, 3) < c(1, 2, 3, 4, 5)
## [1] FALSE FALSE FALSE TRUE TRUE
c(1, 4) == 1:4 # c(1, 4, 1, 4) == c(1, 2, 3, 4)
## [1] TRUE FALSE FALSE TRUE

Therefore, we can say that they are vectorised in the same manner as the arithmetic
operators `+`, `*`, etc.; compare Section 2.4.1.

3.2.2 Testing for NA, NaN, and Inf


Comparisons against missing values and not-numbers yield NAs. Instead of the incor-
rect “x == NA_real_”, testing for missingness should rather be performed via a call to
the vectorised is.na function.
is.na(c(NA_real_, Inf, -Inf, NaN, -1, 0, 1))
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE
is.na(c(TRUE, FALSE, NA, TRUE)) # works for logical vectors too
## [1] FALSE FALSE TRUE FALSE

Moreover, is.finite is noteworthy since it returns FALSE on Infs, NA_real_s and


NaNs.

is.finite(c(NA_real_, Inf, -Inf, NaN, -1, 0, 1))


## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE

See also the more specific is.nan and is.infinite.



3.2.3 Dealing with round-off errors (*)


In mathematics, real numbers are merely an idealisation. In practice, however, it is
impossible to store them with infinite precision (think 𝜋 = 3.141592653589793...):
computer memory is limited, and our time is precious.
Therefore, a consensus had to be reached. In R, we rely on the double-precision floating
point format. The floating point part means that the numbers can be both small (close to
zero like ±2.23 × 10^(−308)) and large (e.g., ±1.79 × 10^308).

Note
2.23e-308 == 0.00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
0000000223

1.79e308 == 179000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000

These two are pretty distant.

Every numeric value takes 8 bytes (or, equivalently, 64 bits) of memory. We are, how-
ever, able to store only about 15-17 decimal digits:
print(0.12345678901234567890123456789012345678901234, digits=22) # 22 is max
## [1] 0.1234567890123456773699

which limits the precision of our computations. The about part is, unfortunately, due
to the numbers’ being written in the computer-friendly binary, not the human-aligned
decimal base. This can lead to unexpected outcomes.
In particular:
• 0.1 cannot be represented exactly, for it cannot be written as a finite series of
reciprocals of powers of 2 (we have 0.1 = 2⁻⁴ + 2⁻⁵ + 2⁻⁸ + 2⁻⁹ + …). This leads
to surprising results such as:
0.1 + 0.1 + 0.1 == 0.3
## [1] FALSE

Quite strikingly, what follows does not show anything suspicious:
c(0.1, 0.1 + 0.1 + 0.1, 0.3)
## [1] 0.1 0.3 0.3

Printing involves rounding. In the above context, it is misleading. Actually, we
experience something more like:
print(c(0.1, 0.1 + 0.1 + 0.1, 0.3), digits=22)
## [1] 0.1000000000000000055511 0.3000000000000000444089
## [3] 0.2999999999999999888978

• All integers between −2⁵³ and 2⁵³ are stored exactly. This is good news. However,
the next integer is beyond the representable range:
2^53 + 1 == 2^53
## [1] TRUE

• The order of operations may matter. In particular, the associativity property can
be violated when dealing with numbers of contrasting orders of magnitude:
2^53 + 2^-53 - 2^53 - 2^-53 # should be == 0.0
## [1] -1.1102e-16

• Some numbers may just be too large, too small, or too close to zero to be repres-
ented exactly:
c(sum(2^((1023-52):1023)), sum(2^((1023-53):1023)))
## [1] 1.7977e+308 Inf
c(2^(-1022-52), 2^(-1022-53))
## [1] 4.9407e-324 0.0000e+00

Important The double-precision floating point format (IEEE 754) is not specific to R.
It is used by most other computing environments, including Python and C++.
For discussion, see [32, 35, 42], and the more statistically-orientated [31].

Can we do anything about these issues?


Firstly, dealing with integers of a reasonable order of magnitude (e.g., various resource
or case IDs in our datasets) is safe. Their comparison, addition, subtraction, and mul-
tiplication are always precise.
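For example (the operands below are integers of a magnitude nowhere near the limits discussed above, so the results are guaranteed to be exact):
123456789 + 987654321 == 1111111110
## [1] TRUE
(2^30) * (2^20) == 2^50  # still an exactly representable integer
## [1] TRUE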
In all other cases (including applying other operations on integers, e.g., division or
sqrt), we need to be very careful with comparisons, especially involving testing for
equality via `==`. The sole fact that sin 𝜋 = 0, mathematically speaking, does not
mean that we should expect that:
sin(pi) == 0
## [1] FALSE

Instead, they are so close that we can treat the difference between them as negligible. Thus,
in practice, instead of testing if 𝑥 = 𝑦, we will be considering:
• |𝑥 − 𝑦| (absolute error), or
• |𝑥 − 𝑦|/|𝑦| (relative error; which takes the order of magnitude of the numbers into
account but obviously cannot be applied if 𝑦 is very close to 0),
and determining if these are less than an assumed error margin, 𝜀 > 0, say, 10⁻⁸ or
2⁻²⁶. For example:
abs(sin(pi) - 0) < 2^-26
## [1] TRUE
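Similarly, here is a sketch of a relative-error test; the margin 2⁻²⁶ is, again, merely an assumed example threshold:
u <- exp(log(100))  # mathematically, exactly 100
abs(u - 100)/abs(100) < 2^-26  # is the relative error negligible?
## [1] TRUE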

Note Rounding can sometimes have a similar effect as testing for almost equality in
terms of the absolute error.
round(sin(pi), 8) == 0
## [1] TRUE

Important The foregoing recommendations are valid for the most popular applications
of R, i.e., statistical and, more generally, scientific computing¹. Our datasets
usually do not represent accurate measurements. Bah, the world itself is far from
ideal! Therefore, we do not have to lose sleep over our not being able to precisely
pinpoint the exact solutions.
¹ However, in financial applications, we would rather rely on base-10 numbers (compare the aforementioned issue with 0.1). There are some libraries implementing higher-precision floating-point numbers or even interval arithmetic that keeps track of error propagation in operation chains.

3.3 Logical operations


3.3.1 Vectorised logical operators
The relational operators such as `==` and `>` accept only two arguments. Their chain-
ing is forbidden. A test that we would mathematically write as 0 ≤ 𝑥 ≤ 1 (or
𝑥 ∈ [0, 1]) cannot be expressed as “0 <= x <= 1” in R. Therefore, we need a way
to combine two logical conditions so as to be able to state that “𝑥 ≥ 0 and, at the same
time, 𝑥 ≤ 1”.
In such situations, the following logical operators and functions come in handy:
• `!` (not, negation; unary),
• `&` (and, conjunction; are both predicates true?),
• `|` (or, alternation; is at least one true?),
• xor (exclusive-or, exclusive disjunction, either-or; is one and only one of the predicates true?).
They again act elementwisely and implement the recycling rule if necessary (and ap-
plicable).
x <- c(-10, -1, -0.25, 0, 0.5, 1, 5, 100)
(x >= 0) & (x <= 1)
## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
(x < 0) | (x > 1)
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
!((x < 0) | (x > 1))
## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
xor(x >= -1, x <= 1)
## [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE

Important The vectorised `&` and `|` operators should not be confused with their
scalar, short-circuit counterparts, `&&` and `||`; see Section 8.1.4.

3.3.2 Operator precedence revisited


The operators introduced in this chapter have lower precedence than the arithmetic
ones, including the binary `+` and `-`. Calling help("Syntax") reveals that we can
extend our listing from Section 2.4.3 as follows:
1. `<-` (right to left; least binding),
2. `|`,
3. `&`,
4. `!` (unary),
5. `<`, `>`, `<=`, `>=`, `==`, and `!=`,
6. `+` and `-` (binary),
7. `*` and `/`,
8. …
The order of precedence is intuitive, e.g., “x+1 <= y & y <= z-1 | x <= z” means
“(((x+1) <= y) & (y <= (z-1))) | (x <= z)”.
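As a quick sanity check, we can compare the two forms on a couple of example values (chosen arbitrarily; this illustrates, but of course does not prove, the equivalence):
x <- 2
y <- 3
z <- 5
(x+1 <= y & y <= z-1 | x <= z) == ((((x+1) <= y) & (y <= (z-1))) | (x <= z))
## [1] TRUE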

3.3.3 Dealing with missingness


Operations involving missing values follow the principles of Łukasiewicz’s three-
valued logic, which is based on common sense. For instance, “NA | TRUE” is TRUE
because the alternative (or) needs at least one argument to be TRUE to generate a posit-
ive result. On the other hand, “NA | FALSE” is NA since the outcome would be different
depending on what we substituted NA for.
Let’s contemplate the logical operations’ truth tables for all the possible combinations
of inputs:
u <- c(TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA)
v <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA)
!u
## [1] FALSE TRUE NA FALSE TRUE NA FALSE TRUE NA
u & v
## [1] TRUE FALSE NA FALSE FALSE FALSE NA FALSE NA
u | v
## [1] TRUE TRUE TRUE TRUE FALSE NA TRUE NA NA
xor(u, v)
## [1] FALSE TRUE NA TRUE FALSE NA NA NA NA

3.3.4 Aggregating with all, any, and sum


Just like in the case of numeric vectors, we can summarise the contents of logical se-
quences. all tests whether every element in a logical vector is equal to TRUE. any de-
termines if there exists an element that is TRUE.
x <- runif(10000)
all(x <= 0.2) # are all values in x <= 0.2?
## [1] FALSE
any(x <= 0.2) # is there at least one element in x that is <= 0.2?
## [1] TRUE
any(c(NA, FALSE, TRUE))
## [1] TRUE
all(c(TRUE, TRUE, NA))
## [1] NA

Note all will frequently be used in conjunction with `==` as the latter is itself vector-
ised: it does not test whether a vector as a whole is equal to another one.
z <- c(1, 2, 3)
z == 1:3 # elementwise equal
## [1] TRUE TRUE TRUE
all(z == 1:3) # elementwise equal summarised
## [1] TRUE

However, let's keep in mind the warning about the testing for exact equality of
floating-point numbers stated in Section 3.2.3. Sometimes, considering absolute or
relative errors might be more appropriate.
z <- sin((0:10)*pi) # sin(0), sin(pi), sin(2*pi), ..., sin(10*pi)
all(z == 0.0) # danger zone! please don't...
## [1] FALSE
all(abs(z - 0.0) < 1e-8) # are the absolute errors negligible?
## [1] TRUE

We can also call sum on a logical vector. Taking into account that it interprets TRUE as
numeric 1 and FALSE as 0 (more on this in Section 4.1), it will give us the number of
elements equal to TRUE.
sum(x <= 0.2) # how many elements in x are <= 0.2?
## [1] 1998

Also, by computing sum(x)/length(x), we can obtain the proportion (fraction) of values
equal to TRUE in x. Equivalently:
mean(x <= 0.2) # proportion of elements <= 0.2
## [1] 0.1998

Naturally, we expect mean(runif(n) <= 0.2) to be equal to 0.2 (20%), but with ran-
domness, we can never be sure.

3.3.5 Simplifying predicates


Each aspiring programmer needs to become fluent with the rules governing the trans-
formations of logical conditions, e.g., that the negation of “(x >= 0) & (x < 1)” is
equivalent to “(x < 0) | (x >= 1)”. Such rules are called tautologies. Here are a few
of them:
• !(!p) is equivalent to p (double negation),
• !(p & q) holds if and only if !p | !q (De Morgan’s law),
• !(p | q) is !p & !q (another De Morgan’s law),
• all(p) is equivalent to !any(!p).
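For instance, we can empirically convince ourselves that the negation mentioned at the beginning of this section is correct (a few test cases only illustrate the rule, they do not prove it):
x <- c(-1, 0, 0.5, 1, 2)  # some example values
all((!(x >= 0 & x < 1)) == (x < 0 | x >= 1))
## [1] TRUE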
Various combinations thereof are, of course, possible. Further simplifications are en-
abled by other properties of the binary operations:
• commutativity (symmetry), e.g., 𝑎 + 𝑏 = 𝑏 + 𝑎, 𝑎 ∗ 𝑏 = 𝑏 ∗ 𝑎,
• associativity, e.g., (𝑎 + 𝑏) + 𝑐 = 𝑎 + (𝑏 + 𝑐), max(max(𝑎, 𝑏), 𝑐) =
max(𝑎, max(𝑏, 𝑐)),
• distributivity, e.g., 𝑎 ∗ 𝑏 + 𝑎 ∗ 𝑐 = 𝑎 ∗ (𝑏 + 𝑐), min(max(𝑎, 𝑏), max(𝑎, 𝑐)) =
max(𝑎, min(𝑏, 𝑐)),

and relations, including:


• transitivity, e.g., if 𝑎 ≤ 𝑏 and 𝑏 ≤ 𝑐, then surely 𝑎 ≤ 𝑐.
Exercise 3.1 Assuming that a, b, and c are numeric vectors, simplify the following expressions:
• !(b>a & b<c),
• !(a>=b & b>=c & a>=c),
• a>b & a<c | a<c & a>d,
• a>b | a<=b,
• a<=b & a>c | a>b & a<=c,
• a<=b & (a>c | a>b) & a<=c,
• !all(a > b & b < c).

3.4 Choosing elements with ifelse


The ifelse function is a vectorised version of the scalar if…else conditional statement,
whose discussion we postpone until Chapter 8. It permits us to select an
element from one of two vectors based on some logical condition.
A call to ifelse(l, t, f), where l is a logical vector, returns a vector y such that:
𝑦𝑖 = 𝑡𝑖 if 𝑙𝑖 is TRUE, and 𝑦𝑖 = 𝑓𝑖 if 𝑙𝑖 is FALSE.
In other words, the 𝑖-th element of the result vector is equal to 𝑡𝑖 if 𝑙𝑖 is TRUE and to 𝑓𝑖
otherwise. For example:
(z <- rnorm(6)) # example vector
## [1] -0.560476 -0.230177 1.558708 0.070508 0.129288 1.715065
ifelse(z >= 0, z, -z) # like abs(z)
## [1] 0.560476 0.230177 1.558708 0.070508 0.129288 1.715065

or:
(x <- rnorm(6)) # example vector
## [1] 0.46092 -1.26506 -0.68685 -0.44566 1.22408 0.35981
(y <- rnorm(6)) # example vector
## [1] 0.40077 0.11068 -0.55584 1.78691 0.49785 -1.96662
ifelse(x >= y, x, y) # like pmax(x, y)
## [1] 0.46092 0.11068 -0.55584 1.78691 1.22408 0.35981

We should not be surprised anymore that the recycling rule is fired up when necessary:

ifelse(x > 0, x^2, 0) # squares of positive xs and 0 otherwise


## [1] 0.21244 0.00000 0.00000 0.00000 1.49838 0.12947

Note All arguments are evaluated in their entirety before deciding on which elements
are selected. Therefore, the following call generates a warning:
ifelse(z >= 0, log(z), NA_real_)
## Warning in log(z): NaNs produced
## [1] NA NA 0.44386 -2.65202 -2.04571 0.53945

This is because, with log(z), we compute the logarithms of negative values anyway.
To fix this, we can write:
log(ifelse(z >= 0, z, NA_real_))
## [1] NA NA 0.44386 -2.65202 -2.04571 0.53945

In case we yearn for an if…else if…else-type expression, the calls to ifelse can
naturally be nested.
Example 3.2 A version of pmax(pmax(x, y), z) can be written as:
ifelse(x >= y,
ifelse(z >= x, z, x),
ifelse(z >= y, z, y)
)
## [1] 0.46092 0.11068 1.55871 1.78691 1.22408 1.71506

However, determining three intermediate logical vectors is not necessary. We can save one call to
`>=` by introducing an auxiliary variable:
xy <- ifelse(x >= y, x, y)
ifelse(z >= xy, z, xy)
## [1] 0.46092 0.11068 1.55871 1.78691 1.22408 1.71506

Exercise 3.3 Figure 3.1 depicts a realisation of the mixture 𝑍 = 0.2𝑋 + 0.8𝑌 of two normal
distributions 𝑋 ∼ N(−2, 0.5) and 𝑌 ∼ N(3, 1).
n <- 100000
z <- ifelse(runif(n) <= 0.2, rnorm(n, -2, 0.5), rnorm(n, 3, 1))
hist(z, breaks=101, probability=TRUE, main="", col="white")

In other words, we generated a variate from the normal distribution that has the expected value
of −2 with probability 20%, and from the one with the expectation of 3 otherwise. Thus inspired,
generate the Gaussian mixtures:
• (2/3) 𝑋 + (1/3) 𝑌, where 𝑋 ∼ N(100, 16) and 𝑌 ∼ N(116, 8),
• 0.3𝑋 + 0.4𝑌 + 0.3𝑍, where 𝑋 ∼ N(−10, 2), 𝑌 ∼ N(0, 2), and 𝑍 ∼ N(10, 2).
Figure 3.1. A mixture of two Gaussians generated with ifelse.

(*) On a side note, knowing that if 𝑋 follows N(0, 1), then the scaled-shifted 𝜎𝑋 + 𝜇 is distrib-
uted N(𝜇, 𝜎), the above can be equivalently written as:
w <- (runif(n) <= 0.2)
z <- rnorm(n, 0, 1)*ifelse(w, 0.5, 1) + ifelse(w, -2, 3)

3.5 Exercises
Exercise 3.4 Answer the following questions.
• Why is the statement "The Earth is flat or the smallpox vaccine is proven effective" obviously
true?
• What is the difference between NA and NA_real_?
• Why is “FALSE & NA” equal to FALSE, but “TRUE & NA” is NA?
• Why has ifelse(x>=0, sqrt(x), NA_real_) a tendency to generate warnings and
how to rewrite it so as to prevent that from happening?
• What is the interpretation of mean(x >= 0 & x <= 1)?
• For some integer 𝑥 and 𝑦, how to verify whether 0 < 𝑥 < 100, 0 < 𝑦 < 100, and 𝑥 < 𝑦,
all at the same time?
• Mathematically, for all real 𝑥, 𝑦 > 0, we have log 𝑥𝑦 = log 𝑥 + log 𝑦. Why, then, can
all(log(x*y) == log(x)+log(y)) sometimes return FALSE? How to fix this?

• Is x/y/z always equal to x/(y/z)? How to fix this?


• What is the purpose of very specific functions such as log1p and expm1 (see their help page)
and many others listed in, e.g., the GNU GSL library [28]? Is our referring to them a viola-
tion of the beloved “do not multiply entities without necessity” rule?
• If we know that 𝑥 may be subject to error, how to test whether 𝑥 > 0 in a robust manner?
• Is “y<-5” the same as “y <- 5” or rather “y < -5”?
Exercise 3.5 What is the difference between all and isTRUE? What about `==`, identical,
and all.equal? Is the last one properly vectorised?
Exercise 3.6 Compute the cross-entropy loss between a numeric vector 𝒑 with values in the inter-
val (0, 1) and a logical vector 𝒚, both of length 𝑛 (you can generate them randomly or manually,
it does not matter, it is just an exercise):

ℒ(𝒑, 𝒚) = (1/𝑛) ∑ᵢ₌₁ⁿ ℓ𝑖,
where ℓ𝑖 = − log 𝑝𝑖 if 𝑦𝑖 is TRUE, and ℓ𝑖 = − log(1 − 𝑝𝑖) if 𝑦𝑖 is FALSE.

Interpretation: in classification problems, 𝑦𝑖 ∈ {FALSE, TRUE} denotes the true class of the 𝑖-
th object (say, whether the 𝑖-th hospital patient is symptomatic) and 𝑝𝑖 ∈ (0, 1) is a machine
learning algorithm’s confidence that 𝑖 belongs to class TRUE (e.g., how sure a decision tree model
is that the corresponding person is unwell). Ideally, if 𝑦𝑖 is TRUE, 𝑝𝑖 should be close to 1 and to 0
otherwise. The cross-entropy loss quantifies by how much a classifier differs from the omniscient
one. The use of the logarithm penalises strong beliefs in the wrong answer.

By the way, if we have solved any of the exercises encountered so far by referring to
if statements, for loops, vector indexing like x[...], or any external R package, we
recommend going back and rewrite our code. Let’s keep things simple (effective, read-
able) by only using base R’s vectorised operations that we have introduced.
4 Lists and attributes

After two brain-teasing chapters, it is time to cool it down a little. In this more tech-
nical part, we will introduce lists, which serve as universal containers for R objects of
any size and type. Moreover, we will also show that each R object can be equipped
with a number of optional attributes. Thanks to them, we will be able to label elements
in any vector, and, in Chapter 10, introduce new complex data types such as matrices
and data frames.

4.1 Type hierarchy and conversion


So far, we have been playing with three types of atomic vectors:
1. logical (Chapter 3),
2. numeric (Chapter 2),
3. character (which we have barely touched upon yet, but rest assured that they will
be covered in detail very soon; see Chapter 6).
To determine the type of an object programmatically, we can call the typeof function.
typeof(c(1, 2, 3))
## [1] "double"
typeof(c(TRUE, FALSE, TRUE, NA))
## [1] "logical"
typeof(c("spam", "spam", "bacon", "eggs", "spam"))
## [1] "character"

It turns out that we can easily convert between these types, either on our explicit de-
mand (type casting) or on-the-fly (coercion, when we perform an operation that expects
something different from the kind of input it was fed with).

Note (*) Numeric vectors are reported as being either of the type double (double-
precision floating-point numbers) or integer (32-bit; it is a subset of double); see
Section 6.4.1. In most practical cases, this is a technical detail that we can risklessly
ignore; compare also the mode function.

4.1.1 Explicit type casting


We can use functions such as as.logical, as.numeric1 , and as.character to convert
given objects to the corresponding types.
as.numeric(c(TRUE, FALSE, NA, TRUE, NA, FALSE)) # synonym: as.double
## [1] 1 0 NA 1 NA 0
as.logical(c(-2, -1, 0, 1, 2, 3, NA_real_, -Inf, NaN))
## [1] TRUE TRUE FALSE TRUE TRUE TRUE NA TRUE NA

Important The rules are:


• TRUE → 1,
• FALSE → 0,
• NA → NA_real_,
and:
• 0 → FALSE,
• NA_real_ and NaN → NA,
• anything else → TRUE.
The distinction between zero and non-zero is commonly applied in other program-
ming languages as well.

Moreover, in the case of the conversion involving character strings, we have:


as.character(c(TRUE, FALSE, NA, TRUE, NA, FALSE))
## [1] "TRUE" "FALSE" NA "TRUE" NA "FALSE"
as.character(c(-2, -1, 0, 1, 2, 3, NA_real_, -Inf, NaN))
## [1] "-2" "-1" "0" "1" "2" "3" NA "-Inf" "NaN"
as.logical(c("TRUE", "True", "true", "T",
"FALSE", "False", "false", "F",
"anything other than these", NA_character_))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA NA
as.numeric(c("0", "-1.23e4", "pi", "2+2", "NaN", "-Inf", NA_character_))
## Warning: NAs introduced by coercion
## [1] 0 -12300 NA NA NaN -Inf NA

4.1.2 Implicit conversion (coercion)


Recall that we referred to the three vector types as atomic ones. They can only be used
to store elements of the same type.
1 (*) as.numeric is a built-in generic function identical to (synonymous with) as.double; see Section 10.2.3. is.numeric is generic too, and is more universal than is.double, which only verifies whether typeof returns "double". For instance, vectors of the type integer, which we mention later, are considered numeric as well.

If we make an attempt at composing an object of mixed types with c, the common type
will be determined in such a way that data are stored without information loss:
c(-1, FALSE, TRUE, 2, "three", NA)
## [1] "-1" "FALSE" "TRUE" "2" "three" NA
c("zero", TRUE, NA)
## [1] "zero" "TRUE" NA
c(-1, FALSE, TRUE, 2, NA)
## [1] -1 0 1 2 NA

Hence, we see that logical is the most specialised of the three, whereas character is
the most general.

Note The logical NA is converted to NA_real_ and NA_character_ in the preceding


examples. R users tend to rely on implicit type conversion when they write c(1, 2,
NA, 4) rather than c(1, 2, NA_real_, 4). In most cases, this is fine, but it might
make us less vigilant.
However, occasionally, it will be wiser to be more unequivocal. For instance,
rep(NA_real_, 1e9) preallocates a long numeric vector instead of a logical one.

Some functions that expect vectors of specific types can apply coercion by themselves
(or act as if they do so):
c(NA, FALSE, TRUE) + 10 # implicit conversion logical –> numeric
## [1] NA 10 11
c(-1, 0, 1) & TRUE # implicit conversion numeric –> logical
## [1] TRUE FALSE TRUE
sum(c(TRUE, TRUE, FALSE, TRUE, FALSE)) # same as sum(as.numeric(...))
## [1] 3
cumsum(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 2 2 3 3
cummin(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 1 0 0 0

Exercise 4.1 In one of the previous exercises, we computed the cross-entropy loss between a
logical vector 𝒚 ∈ {0, 1}ⁿ and a numeric vector 𝒑 ∈ (0, 1)ⁿ. This measure can be equivalently
defined as:
ℒ(𝒑, 𝒚) = −(1/𝑛) ∑ᵢ₌₁ⁿ (𝑦𝑖 log(𝑝𝑖) + (1 − 𝑦𝑖) log(1 − 𝑝𝑖)).
Using vectorised operations, but not relying on ifelse this time, implement this formula. Then,
compute the cross-entropy loss between, for instance, “y <- sample(c(FALSE, TRUE), n)”
and “p <- runif(n)” for some n. Note how seamlessly we translate between FALSE/TRUEs and
0/1s in the above equation (in particular, where 1 − 𝑦𝑖 means the logical negation of 𝑦𝑖 ).

4.2 Lists
Lists are generalised vectors. They can be comprised of R objects of any kind, including other
lists. This is why we classify them as recursive (and not atomic) objects. They are especially
useful wherever there is a need to handle some multitude as a single entity.

4.2.1 Creating lists


The most straightforward way to create a list is by means of the list function:
list(1, 2, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

Notice that it is not the same as c(1, 2, 3). We got a sequence that wraps three
numeric vectors, each of length one. More examples:
list(1:3, 4, c(TRUE, FALSE, NA, TRUE), "and so forth") # different types
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] TRUE FALSE NA TRUE
##
## [[4]]
## [1] "and so forth"
list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3)) # a list of lists
## [[1]]
## [[1]][[1]]
## [1] TRUE FALSE NA TRUE
##
## [[1]][[2]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
##
## [[2]]
## [[2]][[1]]
## [1] 1 2 3

The display of lists is (un)pretty bloated. However, the str function prints any R object
in a more concise fashion:
str(list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3)))
## List of 2
## $ :List of 2
## ..$ : logi [1:4] TRUE FALSE NA TRUE
## ..$ : chr [1:26] "a" "b" "c" "d" ...
## $ :List of 1
## ..$ : int [1:3] 1 2 3

Note In Section 4.1, we said that the c function, when fed with arguments of mixed
types, tries to determine the common type that retains the sense of data. If coercion
to an atomic vector is not possible, the result will be a list.
c(1, "two", identity) # `identity` is an object of the type "function"
## [[1]]
## [1] 1
##
## [[2]]
## [1] "two"
##
## [[3]]
## function (x)
## x
## <environment: namespace:base>

Thus, the c function can also be used to concatenate lists:


c(list(1), list(2), list(3)) # three lists –> one list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

Lists can be repeated using rep:


rep(list(1:11, LETTERS), 2)
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11
##
## [[2]]
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
##
## [[3]]
## [1] 1 2 3 4 5 6 7 8 9 10 11
##
## [[4]]
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

4.2.2 Converting to and from lists


The conversion of an atomic vector to a list of vectors of length one can be done via a
call to as.list:
as.list(c(1, 2, 3)) # vector of length 3 –> list of 3 vectors of length 1
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

Unfortunately, calling, say, as.numeric on a list raises an error, even if it is comprised
of numeric vectors only. We can try flattening it to an atomic sequence by calling unlist:
unlist(list(list(1, 2), list(3, list(4:8)), 9))
## [1] 1 2 3 4 5 6 7 8 9
unlist(list(list(1, 2), list(3, list(4:8)), "spam"))
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "spam"

Note (*) Chapter 11 will mention the simplify2array function, which generalises
unlist in a way that can sometimes give rise to a matrix.

4.3 NULL
NULL, being the one and only instance of the eponymous type, can be used as a place-
holder for an R object or designate the absence of any entities whatsoever.
list(NULL, NULL, month.name)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"

NULL is different from a vector of length zero because the latter has a type. However,
NULL sometimes behaves like a zero-length vector. In particular, length(NULL) returns
0. Also, c called with no arguments returns NULL.
Testing for NULL-ness can be done with a call to is.null.
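For example (note that a missing value is not the null object):
is.null(NULL)
## [1] TRUE
is.null(c())  # c() called with no arguments returns NULL
## [1] TRUE
is.null(NA)
## [1] FALSE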

Important NULL is not the same as NA. The former cannot be emplaced in an atomic
vector.
c(1, NA, 3, NULL, 5) # here, NULL behaves like a zero-length vector
## [1] 1 NA 3 5

They have very distinct semantics (no value vs a missing value).

Later we will see that some functions return NULL invisibly when they have nothing
interesting to report. This is the case of print or plot, which are called because of
their side effects (printing and plotting).
Furthermore, in certain contexts, replacing content with NULL will actually result in
its removal, e.g., when subsetting a list.

4.4 Object attributes


Lists can embrace many entities in the form of a single item sequence. Attributes, on
the other hand, give means to inject extra data into an object.
Attributes are unordered key=value pairs, where key is a single string, and value
is any R object except NULL. They can be introduced by calling, amongst others (e.g., the
replacement versions of the attr and attributes functions; see Section 9.3.6), the
structure function:

x_simple <- 1:10
x <- structure(
    x_simple,  # the object to be equipped with attributes
    attribute1="value1",
    attribute2=c(6, 100, 324)
)

4.4.1 Developing perceptual indifference to most attributes


Let’s see how the foregoing x is reported on the console:
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] 6 100 324

The object of concern, 1:10, was displayed first. We need to get used to that. Most of
the time, we suggest treating the "attr…" parts of the display as if they were printed in
tiny font.
Equipping an object with attributes does not usually change its nature; see, however,
Chapter 10 for a few exceptions. The above x is still treated as an ordinary sequence
of numbers by most functions:
sum(x) # the same as sum(1:10); `sum` does not care about any attributes
## [1] 55
typeof(x) # just a numeric vector, but with some perks
## [1] "integer"

Important Attributes are generally ignored by most functions unless they have spe-
cifically been programmed to pay attention to them.

4.4.2 But there are a few use cases


Some R functions add attributes to the return value to sneak extra information that
might be useful, just in case.
For instance, na.omit, whose main aim is to remove missing values from an atomic
vector, yields:


y <- c(10, 20, NA, 40, 50, NA, 70)


(y_na_free <- na.omit(y))
## [1] 10 20 40 50 70
## attr(,"na.action")
## [1] 3 6
## attr(,"class")
## [1] "omit"

We can enjoy the NA-free version of y in any further computations:


mean(y_na_free)
## [1] 38

Additionally, the na.action attribute tells us where the missing observations were:
attr(y_na_free, "na.action") # read the attribute value
## [1] 3 6
## attr(,"class")
## [1] "omit"

We ignore the class part until Chapter 10.


As another example, gregexpr discussed in Chapter 6 searches for a given pattern in
a character vector:
needle <- "spam|durian" # pattern to search for: spam OR durian
haystack <- c("spam, bacon, and durian-flavoured spam", "spammer") # text
(pos <- gregexpr(needle, haystack, perl=TRUE))
## [[1]]
## [1] 1 18 35
## attr(,"match.length")
## [1] 4 6 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

We sought all occurrences of the pattern within two character strings. As their number
may vary from string to string, wrapping the results in a list was a good design choice.
Each list element gives the starting positions where matches can be found: there are
three and one match(es), respectively. Moreover, every vector of positions has a designated
match.length attribute (amongst others), in case we need it.
Exercise 4.2 Create a list with EUR/AUD, EUR/GBP, and EUR/USD exchange rates read
from the euraud-*.csv, eurgbp-*.csv, and eurusd-*.csv files in our data repository
(https://github.com/gagolews/teaching-data/tree/master/marek).
Each of its three elements should be a numeric vector storing the currency exchange rates. Further-
more, equip them with currency_from, currency_to, date_from, and date_to attributes.
For example:
## [1] NA 1.6006 1.6031 NA NA 1.6119 1.6251 1.6195 1.6193 1.6132
## [11] NA NA 1.6117 1.6110 1.6188 1.6115 1.6122 NA
## attr(,"currency_from")
## [1] "EUR"
## attr(,"currency_to")
## [1] "AUD"
## attr(,"date_from")
## [1] "2020-01-01"
## attr(,"date_to")
## [1] "2020-06-30"

Such an additional piece of information could be stored in a few separate variables (other vectors),
but then it would not be as convenient to use as the above representation.

4.4.3 Special attributes


Attributes have great potential which is somewhat wasted, for R users rarely know:
• that attributes exist (pessimistic scenario), or
• how to handle them (realistic scenario).
But we know now.
What is more, certain attributes have been predestined to play a unique role in R.
Namely, the most prevalent amongst the special attributes are:
• names, row.names, and dimnames are used to label the elements of atomic and gen-
eric vectors (see below) as well as rows and columns in matrices (Chapter 11) and
data frames (Chapter 12),
• dim turns flat vectors into matrices and other tensors (Chapter 11),
• levels labels the underlying integer codes in factor objects (Section 10.3.2),
• class can be used to bring forth new complex data structures based on basic types
(Chapter 10).
We call them special because:
• they cannot be assigned arbitrary values; for instance, we will soon see that names
can accept a character vector of a specific length,

• they can be accessed via designated functions, e.g., names, class, dim, dimnames,
levels, etc.,

• they are widely recognised by many other functions.


However, in spite of the above, special attributes can still be managed as ordinary ones.
Exercise 4.3 comment is perhaps the most rarely used special attribute. Create an object
(whatever) equipped with the comment attribute. Verify that assigning to it anything other than
a character vector leads to an error. Read its value by calling the comment function. Display the
object equipped with this attribute. Note that the print function ignores its existence altogether:
this is how special it is.

Important (*) The accessor functions such as names or class might return meaningful
values, even if the corresponding attribute is not set explicitly; see, e.g., Section 11.1.5
for an example.

4.4.4 Labelling vector elements with the names attribute


The special attribute names labels atomic vectors’ and lists’ elements.
(x <- structure(c(13, 2, 6), names=c("spam", "sausage", "celery")))
## spam sausage celery
## 13 2 6

The labels may improve the expressivity and readability of our code and data.
Exercise 4.4 Verify that the above x is still an ordinary numeric vector by calling typeof and
sum on it.

Let's stress that we can ignore the names attribute altogether. If we apply any operation
discussed in Chapter 2, we will garner the same result regardless of whether such
extra information is present or not.
It is just the print function that changed its behaviour slightly. After all, it is a special
attribute. Instead of reporting:
## [1] 13 2 6
## attr(,"names")
## [1] "spam" "sausage" "celery"

we got a nicely formatted table-like display. Non-special attributes are still printed in
the standard way:
structure(x, additional_attribute=1:10)
## spam sausage celery
## 13 2 6
## attr(,"additional_attribute")
## [1] 1 2 3 4 5 6 7 8 9 10

Note Chapter 5 will also mention that some operations (such as indexing) gain super-
powers in the presence of the names attribute.

This attribute can be read by calling:


attr(x, "names") # just like any other attribute
## [1] "spam" "sausage" "celery"
names(x) # because it is so special
## [1] "spam" "sausage" "celery"

Named vectors can be easily created with the c and list functions as well:
c(a=1, b=2)
## a b
## 1 2
list(a=1, b=2)
## $a
## [1] 1
##
## $b
## [1] 2
c(a=c(x=1, y=2), b=3, c=c(z=4)) # this is smart
## a.x a.y b c.z
## 1 2 3 4

Let’s contemplate how a named list is printed on the console. Again, it is still a list, but
with some extras.
Exercise 4.5 A whole lot of functions return named vectors. Evaluate the following expressions
and read the corresponding pages in their documentation:
• quantile(runif(100)),
• hist(runif(100), plot=FALSE),
• options() (take note of digits, scipen, max.print, and width),
• capabilities().

Note (*) Most of the time, lists are used merely as containers for other R objects. This
is a dull yet essential role. However, let’s just mention here that every data frame is,
in fact, a generic vector (see Chapter 12). Each column corresponds to a named list
element:
(df <- head(iris)) # some data frame
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
typeof(df) # it is just a list (with extras that will be discussed later)
## [1] "list"
unclass(df) # how it is represented exactly (without the extras)
## $Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
##
## $Sepal.Width
## [1] 3.5 3.0 3.2 3.1 3.6 3.9
##
## $Petal.Length
## [1] 1.4 1.4 1.3 1.5 1.4 1.7
##
## $Petal.Width
## [1] 0.2 0.2 0.2 0.2 0.2 0.4
##
## $Species
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
##
## attr(,"row.names")
## [1] 1 2 3 4 5 6

Therefore, the functions we discuss in this chapter are of use in processing such struc-
tured data too.

4.4.5 Altering and removing attributes


We know that a single attribute can be read by calling attr. Their whole list is gener-
ated with a call to attributes.
(x <- structure(c("some", "object"), names=c("X", "Y"),
attribute1="value1", attribute2="value2", attribute3="value3"))
## X Y
## "some" "object"
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "value3"
attr(x, "attribute1") # reads a single attribute, returns NULL if unset
## [1] "value1"
attributes(x) # returns a named list with all attributes of an object
## $names
## [1] "X" "Y"
##
## $attribute1
## [1] "value1"
##
## $attribute2
## [1] "value2"
##
## $attribute3
## [1] "value3"

We can alter an attribute’s value or add further attributes by referring to the struc-
ture function once again. Moreover, setting an attribute’s value to NULL gets rid of it
completely.
structure(x, attribute1=NULL, attribute4="added", attribute3="modified")
## X Y
## "some" "object"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "modified"
## attr(,"attribute4")
## [1] "added"

As far as the names attribute is concerned, we may generate an unnamed copy of an


object by calling:
unname(x)
## [1] "some" "object"
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "value3"

In Section 9.3.6, we will introduce replacement functions. They will enable us to modify or
remove an object’s attribute by calling “attr(x, "some_attribute") <- new_value”.
Moreover, Section 5.5 highlights that certain operations (such as vector indexing, ele-
mentwise arithmetic operations, and coercion) might not preserve all attributes of the
objects that were given as their inputs.

4.5 Exercises
Exercise 4.6 Provide an answer to the following questions.
• What is the meaning of c(TRUE, FALSE)*1:10?
• What does sum(as.logical(x)) compute when x is a numeric vector?
• We said that atomic vectors of the type character are the most general ones. Therefore, is
as.numeric(as.character(x)) the same as as.numeric(x), regardless of the type of
x?

• What is the meaning of as.logical(x+y) if x and y are logical vectors? What about as.
logical(x*y), as.logical(1-x), and as.logical(x!=y)?

• What is the meaning of the following when x is a logical vector?


– cummin(x) and cummin(!x),
– cummax(x) and cummax(!x),
– cumsum(x) and cumsum(!x),
– cumprod(x) and cumprod(!x).
• Let x be a named numeric vector, e.g., “x <- quantile(runif(100))”. What is the
result of 2*x, mean(x), and round(x, 2)?
• What is the meaning of x == NULL?
• Give two ways to create a named character vector.
• Give two ways (discussed above; there are more) to remove the names attribute from an ob-
ject.
Exercise 4.7 There are a few peculiarities when joining or coercing lists. Compare the results
generated by the following pairs of expressions:
# 1)
as.character(list(list(1, 2), list(3, list(4)), 5))
as.character(unlist(list(list(1, 2), list(3, list(4)), 5)))
# 2)
as.numeric(list(list(1, 2), list(3, list(4)), 5))
as.numeric(unlist(list(list(1, 2), list(3, list(4)), 5)))
# 3)
unlist(list(list(1, 2), sd))
list(1, 2, sd)
# 4)
c(list(c(1, 2), 3), 4, 5)
c(list(c(1, 2), 3), c(4, 5))

Exercise 4.8 Given numeric vectors x, y, z, and w, how to combine x, y, and list(z, w) so as
to obtain list(x, y, z, w)? More generally, given a set of atomic vectors and lists of atomic

vectors, how to combine them to obtain a single list of atomic vectors (not a list of atomic vectors
and lists, not atomic vectors unwound, etc.)?
Exercise 4.9 saveRDS serialises R objects and writes their snapshots to disk so that they can
be restored via a call to readRDS at a later time. Verify that this procedure preserves object
attributes. Also, check out dput and dget, which work with objects' textual representation in
the form of executable R code.
Exercise 4.10 (*) Use jsonlite::fromJSON to read a JSON file in the form of a named list.

In the extremely unlikely event of our finding the current chapter boring, let’s rejoice:
some of the exercises and remarks that we will encounter in the next part, which is
devoted to vector indexing, will definitely be deliciously stimulating!
5 Vector indexing

We now know plenty of ways to process vectors in their entirety, but how to extract and
replace their specific parts? We will be collectively referring to such activities as index-
ing. This is because they are often performed through the index operator, `[`.

5.1 head and tail


Let’s begin with something more lightweight, though. The head function fetches a few
elements from the beginning of a vector.
x <- 1:10
head(x) # head(x, 6)
## [1] 1 2 3 4 5 6
head(x, 3) # get the first three
## [1] 1 2 3
head(x, -3) # skip the last three
## [1] 1 2 3 4 5 6 7

Similarly, tail extracts a couple of items from the end of a sequence.


tail(x) # tail(x, 6)
## [1] 5 6 7 8 9 10
tail(x, 3) # get the last three
## [1] 8 9 10
tail(x, -3) # skip the first three
## [1] 4 5 6 7 8 9 10

Both functions work on lists too1 . They are useful for previewing the contents of really
big objects. Also, they never complain about our trying to fetch too many elements:
head(x, 100) # no more than the first 100 elements
## [1] 1 2 3 4 5 6 7 8 9 10
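As far as the claim about lists is concerned, here is a quick check on an example list (any other list would do):
head(list(1, 11:12, 21:23), 2)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 11 12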

1 head and tail are actually S3 generics defined in the utils package. We are able to call them on matrices and data frames; see Chapter 10.

5.2 Subsetting and extracting from vectors


Given a vector x, x[i] returns its subset comprised of elements indicated by the in-
dexer i, which can be a single vector of:
• nonnegative integers (gives the positions of elements to retrieve),
• negative integers (gives the positions to omit),
• logical values (states which items should be fetched or skipped),
• character strings (locates the elements with specific names).

5.2.1 Nonnegative indexes


Consider example vectors:
(x <- seq(10, 100, 10))
## [1] 10 20 30 40 50 60 70 80 90 100
(y <- list(1, 11:12, 21:23))
## [[1]]
## [1] 1
##
## [[2]]
## [1] 11 12
##
## [[3]]
## [1] 21 22 23

The first element in a vector is at index 1. Hence:


x[1] # the first element
## [1] 10
x[length(x)] # the last element
## [1] 100

Important We might have wondered why “[1]” is displayed each time we print out
an atomic vector on the console:
print((1:51)*10)
## [1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170
## [18] 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340
## [35] 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510

It is merely a visual hint indicating which vector element we output at the beginning
in each line.

Vectorisation is a universal feature of R. It comes as no surprise that the indexer can


also be of length greater than one.
x[c(1, length(x))] # the first and the last
## [1] 10 100
x[1:3] # the first three
## [1] 10 20 30

Take note of the boundary cases:


x[c(1, 2, 1, 0, 3, NA_real_, 1, 11)] # repeated, 0, missing, out of bound
## [1] 10 20 10 30 NA 10 NA
x[c()] # indexing by an empty vector
## numeric(0)

When applied on lists, the index operator always returns a list as well, even if we ask
for a single element:
y[2] # a list that includes the second element
## [[1]]
## [1] 11 12
y[c(1, 3)] # not the same as x[1, 3] (a different story)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 21 22 23

If we want to extract a component, i.e., to dig into what is inside a list at a specific
location, we can refer to `[[`:
y[[2]] # extract the second element
## [1] 11 12

This is exactly why R displays “[[1]]”, “[[2]]”, etc. when lists are printed.
On a side note, calling x[[i]] on an atomic vector, where i is a single value, has almost
the same effect as x[i] (see also Section 5.5 for the discussion on the preservation of
object attributes). However, `[[` generates an error if the subscript is out of bounds.

Important Let’s reflect on the operators’ behaviour in the case of nonexistent items:
c(1, 2, 3)[4]
## [1] NA
list(1, 2, 3)[4]
## [[1]]
## NULL
c(1, 2, 3)[[4]]
## Error in c(1, 2, 3)[[4]]: subscript out of bounds
list(1, 2, 3)[[4]]
## Error in list(1, 2, 3)[[4]]: subscript out of bounds

Note (*) `[[` also supports multiple indexers.


y[[c(1, 3)]]
## Error in y[[c(1, 3)]]: subscript out of bounds

Its meaning is different from y[c(1, 3)], though; we are about to extract a single
value, remember? Here, indexing is applied recursively. Namely, the above is equivalent
to y[[1]][[3]]. We got an error because y[[1]] is of a length smaller than three.
More examples:
y[[c(3, 1)]] # y[[3]][[1]]
## [1] 21
list(list(7))[[c(1, 1)]] # 7, not list(7)
## [1] 7

5.2.2 Negative indexes


The indexer can also be a vector of negative integers. This way, we can exclude the ele-
ments at given positions:
y[-1] # all but the first
## [[1]]
## [1] 11 12
##
## [[2]]
## [1] 21 22 23
x[-(1:3)] # all but the first three
## [1] 40 50 60 70 80 90 100
x[-c(1, 0, 2, 1, 1, 8:100)] # 0, repeated, out of bound indexes
## [1] 30 40 50 60 70

Note Negative and positive indexes cannot be mixed.


x[-1:3] # recall that -1:3 == (-1):3
## Error in x[-1:3]: only 0's may be mixed with negative subscripts

Also, NA indexes cannot be mixed with negative ones.



5.2.3 Logical indexer


A vector can also be subsetted by means of a logical vector. If they both are of identical
lengths, the consecutive elements in the latter indicate whether the corresponding
elements of the indexed vector are supposed to be selected (TRUE) or omitted (FALSE).
# 1*** 2 3 4 5*** 6*** 7 8*** 9? 10***
x[c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, NA, TRUE)]
## [1] 10 50 60 80 NA 100

In other words, x[l], where l is a logical vector, returns all x[i] with i such that l[i]
is TRUE. We thus extracted the elements at indexes 1, 5, 6, 8, and 10.

Important Be careful: if the element selector is NA, we will get a missing value (for
atomic vectors) or NULL (for lists).
c("one", "two", "three")[c(NA, TRUE, FALSE)]
## [1] NA "two"
list("one", "two", "three")[c(NA, TRUE, FALSE)]
## [[1]]
## NULL
##
## [[2]]
## [1] "two"

This, lamentably, comes with no warning, which might be problematic when indexers
are generated programmatically. As a remedy, we sometimes pass the logical indexer
to the which function first. It returns the indexes of the elements equal to TRUE, ignor-
ing the missing ones.
which(c(NA, TRUE, FALSE, TRUE, FALSE, NA, TRUE))
## [1] 2 4 7
c("one", "two", "three")[which(c(NA, TRUE, FALSE))]
## [1] "two"

Recall that in Chapter 3, we discussed ample vectorised operations that generate lo-
gical vectors. Anything that yields a logical vector of the same length as x can be passed
as an indexer.
x > 60 # yes, it is a perfect indexer candidate
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
x[x > 60] # select elements in `x` that are greater than 60
## [1] 70 80 90 100
x[x < 30 | 70 < x] # elements not between 30 and 70
## [1] 10 20 80 90 100
x[x < mean(x)] # elements smaller than the mean
## [1] 10 20 30 40 50
x[x^2 > 7777 | log10(x) <= 1.6] # indexing via a transformed version of `x`
## [1] 10 20 30 90 100
(z <- round(runif(length(x)), 2)) # ten pseudorandom numbers
## [1] 0.29 0.79 0.41 0.88 0.94 0.05 0.53 0.89 0.55 0.46
x[z <= 0.5] # indexing based on `z`, not `x`: no problem
## [1] 10 30 60 100

The indexer is always evaluated first and then passed to the subsetting operation. The
index operator does not care how an indexer is generated.
Furthermore, the recycling rule is applied when necessary:
x[c(FALSE, TRUE)] # every second element
## [1] 20 40 60 80 100
y[c(TRUE, FALSE)] # interestingly, there is no warning here
## [[1]]
## [1] 1
##
## [[2]]
## [1] 21 22 23

Exercise 5.1 Consider a simple database about six people, their favourite dishes, and birth
years.
name <- c("Graham", "John", "Terry", "Eric", "Michael", "Terry")
food <- c("bacon", "spam", "spam", "eggs", "spam", "beans")
year <- c( 1941, 1939, 1942, 1943, 1943, 1940 )

The consecutive elements in different vectors correspond to each other, e.g., Graham was born in
1941, and his go-to food was bacon.
• List the names of people born in 1941 or 1942.
• List the names of those who like spam.
• List the names of those who like spam and were born after 1940.
• Compute the average birth year of the lovers of spam.
• Give the average age, in 1969, of those who didn’t find spam utmostly delicious.
The answers must be provided programmatically, i.e., do not just write "Eric" and "Graham".
Make the code generic enough so that it works in the case of any other database of this kind, no
matter its size.
Exercise 5.2 Remove missing values from a given vector without referring to na.omit.

5.2.4 Character indexer


Let’s consider a vector equipped with the names attribute:

x <- structure(x, names=letters[1:10]) # add names


print(x)
## a b c d e f g h i j
## 10 20 30 40 50 60 70 80 90 100

These labels can be referred to when extracting the elements. To do this, we use an
indexer that is a character vector:
x[c("a", "f", "a", "g", "z")]
## a f a g <NA>
## 10 60 10 70 NA

Important We have said that special object attributes add extra functionality on top
of the existing ones. Therefore, indexing by means of positive, negative, and logical
vectors is still available:
x[1:3]
## a b c
## 10 20 30
x[-(1:5)]
## f g h i j
## 60 70 80 90 100
x[x > 70]
## h i j
## 80 90 100

Lists can also be subsetted this way.


(y <- structure(y, names=c("first", "second", "third")))
## $first
## [1] 1
##
## $second
## [1] 11 12
##
## $third
## [1] 21 22 23
y[c("first", "second")]
## $first
## [1] 1
##
## $second
## [1] 11 12
y["third"] # result is a list
## $third
## [1] 21 22 23
y[["third"]] # result is the specific element unwrapped
## [1] 21 22 23

Important Labels do not have to be unique. When we have repeated names, the first
matching element is extracted:
structure(c(1, 2, 3), names=c("a", "b", "a"))["a"]
## a
## 1

There is no direct way to select all but given names, just like with negative integer in-
dexers. For a workaround, see Section 5.4.1.
Exercise 5.3 Rewrite the solution to Exercise 5.1 assuming that we now have three features
wrapped inside a list.
(people <- list(
Name=c("Graham", "John", "Terry", "Eric", "Michael", "Terry", "Steve"),
Food=c("bacon", "spam", "spam", "eggs", "spam", "beans", "spam"),
Year=c( 1941, 1939, 1942, 1943, 1943, 1940, NA_real_)
))
## $Name
## [1] "Graham" "John" "Terry" "Eric" "Michael" "Terry" "Steve"
##
## $Food
## [1] "bacon" "spam" "spam" "eggs" "spam" "beans" "spam"
##
## $Year
## [1] 1941 1939 1942 1943 1943 1940 NA

Do not refer to name, food, and year directly. Instead, use the full people[["Name"]] etc. ac-
cessors. There is no need to pout: it is just a tiny bit of extra work. Also, notice that Steve has joined
the group; hello, Steve.

5.3 Replacing elements


5.3.1 Modifying atomic vectors
There are also replacement versions of the aforementioned indexing schemes. They al-
low us to substitute some new content for the old one.
(x <- 1:12)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
x[length(x)] <- 42 # modify the last element
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10 11 42

The principles of vectorisation, recycling rule, and implicit coercion are all in place:
x[c(TRUE, FALSE)] <- c("a", "b", "c")
print(x)
## [1] "a" "2" "b" "4" "c" "6" "a" "8" "b" "10" "c" "42"

Long story long: first, to ensure that the new content can be poured into the old wine-
skin, R coerced the numeric vector to a character one. Then, every second element
therein, a total of six items, was replaced by a recycled version of the replacement se-
quence of length three. Finally, the name x was rebound to such a brought-forth object
and the previous one became forgotten.

Note For more details on replacement functions in general, see Section 9.3.6. Such
operations alter the state of the object they are called on (quite rare a behaviour in
functional languages).

Exercise 5.4 Replace missing values in a given numeric vector with the arithmetic mean of its
well-defined observations.

5.3.2 Modifying lists


List contents can be altered as well. For modifying individual elements, the safest op-
tion is to use the replacement version of the `[[` operator:
y <- list(a=1, b=1:2, c=1:3)
y[[1]] <- 100:110
y[["c"]] <- -y[["c"]]
print(y)
## $a
## [1] 100 101 102 103 104 105 106 107 108 109 110
##
## $b
## [1] 1 2
##
## $c
## [1] -1 -2 -3

The replacement version of `[` modifies a whole sub-list:


y[1:3] <- list(1, c("a", "b", "c"), c(TRUE, FALSE))
print(y)
## $a
## [1] 1
##
## $b
## [1] "a" "b" "c"
##
## $c
## [1] TRUE FALSE

Moreover:
y[1] <- list(1:10) # replace one element with one object
y[-1] <- 10:11 # replace two elements with two singletons
print(y)
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 10
##
## $c
## [1] 11

Note Let i be a vector of positive indexes of elements to be modified. Overall, calling


“y[i] <- z” behaves as if we wrote:
1. y[[ i[1] ]] <- z[[1]],
2. y[[ i[2] ]] <- z[[2]],
3. y[[ i[3] ]] <- z[[3]],
and so forth.
Furthermore, z (but not i) will be recycled when necessary. In other words, for consecutive
js from 1 to the length of i, we take z[[(j - 1) %% length(z) + 1]].
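Here is a small illustration of these rules on an example list (w is introduced here only for demonstration purposes):
w <- list(1, 2, 3, 4)
w[c(1, 3)] <- list("a", "b")  # w[[1]] <- "a"; w[[3]] <- "b"
w[c(2, 4)] <- list("z")       # list("z") is recycled: both elements become "z"
unlist(w)
## [1] "a" "z" "b" "z"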

Exercise 5.5 Reflect on the results of the following expressions:


• y[1] <- c("a", "b", "c"),
• y[[1]] <- c("a", "b", "c"),
• y[[1]] <- list(c("a", "b", "c")),
• y[1:3] <- c("a", "b", "c"),
• y[1:3] <- list(c("a", "b", "c")),
• y[1:3] <- "a",
• y[1:3] <- list("a"),

• y[c(1, 2, 1)] <- c("a", "b", "c").

Important Setting a list item to NULL removes it from the list completely.
y <- list(1, 2, 3, 4)
y[1] <- NULL # removes the first element (i.e., 1)
y[[1]] <- NULL # removes the first element (i.e., now 2)
y[1] <- list(NULL) # sets the first element (i.e., now 3) to NULL
print(y)
## [[1]]
## NULL
##
## [[2]]
## [1] 4

The same notation convention is used for dropping object attributes; see Section 9.3.6.

5.3.3 Inserting new elements


New elements can be pushed at the end of the vector easily³.
³ And often cheaply; see Section 8.3.5 for performance notes. Still, a warning can be generated on each size extension if the "check.bounds" flag is set; see help("options").
(x <- 1:5)
## [1] 1 2 3 4 5
x[length(x)+1] <- 6 # insert at the end
print(x)
## [1] 1 2 3 4 5 6
x[10] <- 10 # insert at the end but add more items
print(x)
## [1] 1 2 3 4 5 6 NA NA NA 10

The elements to be inserted can be named as well:


x["a"] <- 11 # still inserts at the end
x["z"] <- 12
x["c"] <- 13
x["z"] <- 14 # z is already there; replace
print(x)
## a z c
## 1 2 3 4 5 6 NA NA NA 10 11 14 13

Note that x was not equipped with the names attribute before. The unlabelled elements
were assigned blank labels (empty strings).
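We can confirm this by peeking at a few of the labels:
names(x)[1:3]  # the first few labels are indeed empty strings
## [1] "" "" ""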

Note It is not possible to insert new elements at the beginning or in the middle of a
sequence, at least not with the index operator. By writing “x[3:4] <- 1:5” we do not
replace two elements in the middle with five other ones. However, we can always use
the c function to slice parts of the vector and intertwine them with some new content:
x <- seq(10, 100, 10)
x <- c(x[1:2], 1:5, x[5:7])
print(x)
## [1] 10 20 1 2 3 4 5 50 60 70

5.4 Functions related to indexing


Let’s review a few operations which pinpoint interesting elements in a vector (or func-
tions based on these).

5.4.1 Matching elements in another vector


We know that the `==` operator acts in an elementwise manner. It compares each ele-
ment in a vector on its left side to the corresponding element in a vector on the right
side. Thus, missing values and the recycling rule aside, if “z <- (x == y)”, then z[i]
is TRUE if and only if x[i] is equal to y[i].
The `%in%` operator (a fantastic name; see Section 9.3.5) is vectorised differently. It checks whether each element on the
left-hand side matches one of the elements on the right. Given “z <- (x %in% y)”,
z[i] is TRUE whenever x[i] is equal to y[j] for some j.

c("spam", "bacon", "spam", "eggs", "spam") %in% c("eggs", "spam", "ham")


## [1] TRUE FALSE TRUE TRUE TRUE

Example 5.6 Here is how we can remove the elements of a vector that have been assigned spe-
cified labels.
(x <- structure(1:12, names=month.abb)) # example vector
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 2 3 4 5 6 7 8 9 10 11 12
x[!(names(x) %in% c("Jan", "May", "Sep", "Oct"))] # get rid of some elements
## Feb Mar Apr Jun Jul Aug Nov Dec
## 2 3 4 6 7 8 11 12

More generally, match(x, y) gives us the index of the element in y that matches each
x[i].

match(c("spam", "bacon", "spam", "eggs", "spam"), c("eggs", "spam", "ham"))


## [1] 2 NA 2 1 2

match(month.abb, c("Jan", "May", "Sep", "Oct")) # is the month on the list?
## [1] 1 NA NA NA 2 NA NA NA 3 4 NA NA
match(c("Jan", "May", "Sep", "Oct"), month.abb) # which month is it?
## [1] 1 5 9 10

By default, a missing value denotes a no-match.


Exercise 5.7 Check out the documentation of `%in%` to see how this operator is reduced to a call
to match. Also, verify that it treats missing values as well-defined ones.
If the elements in y are not unique, the smallest index j such that x[i] == y[j] is
returned. Therefore, for example, match(TRUE, l) fetches the index of the first occur-
rence of a positive value in a logical vector l.
(x <- round(runif(10), 2)) # example vector
## [1] 0.29 0.79 0.41 0.88 0.94 0.05 0.53 0.89 0.55 0.46
match(TRUE, x>0.8) # index of the first value > 0.8 (from the left)
## [1] 4

5.4.2 Assigning numbers into intervals


findInterval can come in handy where the assignment of numeric values to real intervals
is needed. Namely, z <- findInterval(x, y) for increasing y gives z[i] equal to
the index j such that x[i] is between y[j] (by default, inclusive) and y[j+1]
(by default, exclusive).
For example, a sequence of five knots 𝒚 = (−∞, 0.25, 0.5, 0.75, ∞) splits the real line
into four intervals:
(1) [−∞, 0.25),   (2) [0.25, 0.5),   (3) [0.5, 0.75),   (4) [0.75, ∞)

Hence, for instance:


findInterval(c(0, 0.2, 0.25, 0.4, 0.66, 1), c(-Inf, 0.25, 0.5, 0.75, Inf))
## [1] 1 1 2 2 3 4

Exercise 5.8 Refer to the manual of findInterval to verify the function’s behaviour when we
do not include ±∞ as endpoints and how to make ∞ classified as a member of the fourth interval.
Exercise 5.9 Using a call to findInterval, compose a statement that generates a logical vec-
tor whose 𝑖-th element indicates whether x[i] is in the interval [0.25, 0.5]. Was this easier to
write than an expression involving `<=` and `>=`?

5.4.3 Splitting vectors into subgroups


split(x, z) can take the output of match or findInterval (and many other opera-
tions) and divide the elements in a vector x into subgroups corresponding to identical
zs.

For instance, we can assign people into groups determined by their favourite dish:
name <- c("Graham", "John", "Terry", "Eric", "Michael", "Terry")
food <- c("bacon", "spam", "spam", "eggs", "spam", "beans")
split(name, food) # group names with respect to food
## $bacon
## [1] "Graham"
##
## $beans
## [1] "Terry"
##
## $eggs
## [1] "Eric"
##
## $spam
## [1] "John" "Terry" "Michael"

The result is a named list with labels determined by the unique elements in the second
vector.
Here is another example, where we pigeonhole some numbers into the four previously
mentioned intervals:
x <- c(0, 0.2, 0.25, 0.4, 0.66, 1)
split(x, findInterval(x, c(-Inf, 0.25, 0.5, 0.75, Inf)))
## $`1`
## [1] 0.0 0.2
##
## $`2`
## [1] 0.25 0.40
##
## $`3`
## [1] 0.66
##
## $`4`
## [1] 1

Items in the first argument that correspond to missing values in the grouping vector
will be ignored. Also, unsurprisingly, the recycling rule is applied when necessary.
We can also split x into groups defined by a combination of levels of two or more vari-
ables z1, z2, etc., by calling split(x, list(z1, z2, ...)).
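As a quick illustration of the first two remarks (a toy example):
split(c(10, 20, 30, 40), c("a", NA)) # the grouping vector is recycled to a, NA, a, NA
## $a
## [1] 10 30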
Example 5.10 The ToothGrowth dataset is a named list (more precisely, a data frame; see
Chapter 12) that represents the results of an experimental study involving 60 guinea pigs. The
experiment’s aim was to measure the effect of different vitamin C supplement types and doses
on the growth of the rodents’ teeth:
ToothGrowth <- as.list(ToothGrowth) # it is a list, but with extra attribs
ToothGrowth[["supp"]] <- as.character(ToothGrowth[["supp"]]) # was: factor
print(ToothGrowth)
## $len
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3
## [15] 22.5 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5
## [29] 23.3 29.5 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7 19.7 23.3
## [43] 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9
## [57] 26.4 27.3 29.4 23.0
##
## $supp
## [1] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
## [15] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
## [29] "VC" "VC" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
## [43] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
## [57] "OJ" "OJ" "OJ" "OJ"
##
## $dose
## [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [18] 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5
## [35] 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0
## [52] 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0

We can split len with respect to the combinations of supp and dose (also called interactions)
by calling:
split(ToothGrowth[["len"]], ToothGrowth[c("supp", "dose")], sep="_")
## $OJ_0.5
## [1] 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7
##
## $VC_0.5
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0
##
## $OJ_1
## [1] 19.7 23.3 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3
##
## $VC_1
## [1] 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5
##
## $OJ_2
## [1] 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0
##
## $VC_2
## [1] 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5

Other synonyms are, of course, possible, e.g., split(ToothGrowth[[1]], ToothGrowth[-1])


and split(ToothGrowth[[1]], list(ToothGrowth[[2]], ToothGrowth[[3]])).
We recommend meditating upon our conscious use of double vs single square brackets here.
Functions such as Map (Section 7.2) will soon enable us to compute any summary statistics within
groups, e.g., the group averages like those obtained by executing “SELECT AVG(len) FROM
ToothGrowth GROUP BY supp, dose” in SQL. As an appetiser, let’s pass a list of vectors to
the boxplot function; see Figure 5.1.
boxplot(split(ToothGrowth[["len"]], ToothGrowth[c("supp", "dose")], sep="_"))

Figure 5.1. Box-and-whisker plots of len split by supp and dose in ToothGrowth.

Note unsplit revokes the effects of split. Later, we will get used to calling unsplit(Map(some_transformation, split(x, z)), z) to modify the values in x independently in each group defined by z (e.g., standardise the variables within each class separately).
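As a quick illustration of the first claim (reusing the four intervals from the preceding example):
x <- c(0, 0.2, 0.25, 0.4, 0.66, 1)
z <- findInterval(x, c(-Inf, 0.25, 0.5, 0.75, Inf))
unsplit(split(x, z), z) # we get the original vector back, in the original order
## [1] 0.00 0.20 0.25 0.40 0.66 1.00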

5.4.4 Ordering elements


The order function finds the ordering permutation of a given vector, i.e., a sequence
of indexes that leads to a sorted version thereof.
x <- c(1024, 7, 42, 666, 0, 32787)
(o <- order(x)) # the ordering permutation of x
## [1] 5 2 3 4 1 6
x[o] # ordered version of x
## [1] 0 7 42 666 1024 32787

Note that o[1] is the index of the smallest element in x, o[2] is the position of the
second smallest, …, and o[length(o)] is the index of the greatest value. Hence, e.g.,
x[o[1]] is equivalent to min(x).
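A quick sanity check of these claims (with the x and o defined above):
x[o[1]] == min(x)
## [1] TRUE
x[o[length(o)]] == max(x)
## [1] TRUE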

Another example:
x <- c("b", "a", "abs", "bass", "aaargh", "aargh", "aaaargh")
(o <- order(x))
## [1] 2 7 5 6 3 1 4
x[o]
## [1] "a" "aaaargh" "aaargh" "aargh" "abs" "b" "bass"

Here, as x is a character vector, the ordering is lexicographical (like in a dictionary).


This is exactly how `<=` on strings works.

Note The ordering permutation that order returns is unique (that is why we call it the
permutation), even for inputs containing duplicated elements. Owing to the use of a
stable sorting algorithm, ties (repeated elements) will be listed in the order of occur-
rence.
order(c(10, 20, 40, 10, 10, 30, 20, 10, 10))
## [1] 1 4 5 8 9 2 7 6 3

We have, e.g., five 10s at positions 1, 4, 5, 8, and 9. These five indexes are guaranteed
to be listed in this very order.

Ordering can also be performed in a nonincreasing manner:


x[order(x, decreasing=TRUE)]
## [1] "bass" "b" "abs" "aargh" "aaargh" "aaaargh" "a"

Note A call to sort(x) is equivalent to x[order(x)], but the former function can be
faster in certain scenarios. For instance, one of its arguments can induce a partially
sorted vector which can be helpful if we only seek a few order statistics (e.g., the seven
smallest values). Speed is rarely a bottleneck in the case of sorting (when it is, we have
a problem!). This is why we will not bother ourselves with such topics until the last part
of this pleasant book. Currently, we aim at expanding our skill repertoire so that we
can implement anything we can think of.

Exercise 5.11 is.unsorted(x) determines if the elements in x are… not sorted with respect
to `<=`. Write an R expression that generates the same result by referring to the order function.
Also, assuming that x is numeric, do the same by means of a call to diff.
order also accepts one or more arguments via the dot-dot-dot parameter, `...`. This
way, we can sort a vector with respect to many criteria. If there are ties in the first
variable, they will be resolved by the order of elements in the second variable. This is
most useful for rearranging rows of a data frame, which we will exercise in Chapter 12.
x <- c( 10, 20, 30, 40, 50, 60)
y1 <- c("a", "b", "a", "a", "b", "b")
y2 <- c("w", "w", "v", "u", "u", "v")
x[order(y1)]
## [1] 10 30 40 20 50 60
x[order(y2)]
## [1] 40 50 30 60 10 20
x[order(y1, y2)]
## [1] 40 30 10 50 60 20
x[order(y2, y1)]
## [1] 40 50 30 60 10 20

Note (*) Calling order on a permutation (a vector that is an arbitrary arrangement of 𝑛 consecutive natural numbers) determines its inverse.
x <- c(10, 30, 40, 20, 10, 10, 50, 30)
order(x)
## [1] 1 5 6 4 2 8 3 7
order(order(x)) # the inverse of the above permutation
## [1] 1 5 7 4 2 3 8 6
(x[order(x)])[order(order(x))] # we get x again
## [1] 10 30 40 20 10 10 50 30

Note that order(order(x)) can be considered as a way to rank all the elements in x.
For instance, the third value in x, 40, is assigned rank 7: it is the seventh smallest value
in this vector. This breaks the ties on a first-come, first-served basis. But we can also
write:
order(order(x, runif(length(x)))) # ranks with ties broken at random
## [1] 2 5 7 4 3 1 8 6

For different variations of these, see the rank function.
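For instance, here is a quick cross-check against the x defined a moment ago:
rank(x, ties.method="first") # the same as order(order(x))
## [1] 1 5 7 4 2 3 8 6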

Exercise 5.12 Recall that sample(x) returns a pseudorandom permutation of elements of a


given vector unless x is a single positive number. Write an expression that always produces a
proper rearrangement, regardless of the size of x.

5.4.5 Identifying duplicates


Whether any element in a vector was already listed in the previous part of the sequence
can be verified by calling:
x <- c(10, 20, 30, 20, 40, 50, 50, 50, 20, 20, 60)
duplicated(x)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE

duplicated can thus be used to remove repeated observations; see also unique. Note that the latter returns the distinct values in the order of their first occurrence, i.e., its result is not guaranteed to be sorted (unlike in some other languages/libraries).
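For instance, the repeated values in the above x can be dropped as follows:
unique(x) # distinct values, in the order of their first occurrence
## [1] 10 20 30 40 50 60
x[!duplicated(x)] # equivalent
## [1] 10 20 30 40 50 60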

Exercise 5.13 What can be the use case of a call to match(x, unique(x))?
Exercise 5.14 Given two named lists x and y, which we treat as key-value pairs, determine their
set-theoretic union (with respect to the keys). For example:
x <- list(a=1, b=2)
y <- list(c=3, a=4)
z <- ...to.do... # combine x and y
str(z)
## List of 3
## $ a: num 4
## $ b: num 2
## $ c: num 3

5.4.6 Counting index occurrences


tabulate takes a vector of values from a set of small positive integers (e.g., indexes)
and determines their number of occurrences:
x <- c(2, 4, 6, 2, 2, 2, 3, 6, 6, 3)
tabulate(x)
## [1] 0 4 2 1 0 3

In other words, there are 0 ones, 4 twos, …, and 3 sixes.


Exercise 5.15 Using a call to tabulate (amongst others), return a named vector with the num-
ber of occurrences of each unique element in a character vector. For example:
y <- c("a", "b", "a", "c", "a", "d", "e", "e", "g", "g", "c", "c", "g")
result <- ...to.do...
print(result)
## a b c d e g
## 3 1 3 1 2 3

5.5 Preserving and losing attributes


Attributes are conceived of as extra data. It is thus up to a function’s authors what they
will decide to do with them. Generally, it is safe to assume that much thought has been
put into the design of base R functions. Oftentimes, they behave fairly reasonably. This
is why we are now going to spend some time exploring their approaches to the
handling of attributes.
Namely, for functions and operators that aim at transforming vectors passed as their
inputs, the assumed strategy may be to:
• ignore the input attributes completely,

• equip the output object with the same set of attributes, or


• take care only of a few special attributes, such as names, if that makes sense.
Below we explore some common patterns; see also Section 1.3 of [68].

5.5.1 c
First, c drops5 all attributes except names:
(x <- structure(1:4, names=c("a", "b", "c", "d"), attrib1="<3"))
## a b c d
## 1 2 3 4
## attr(,"attrib1")
## [1] "<3"
c(x) # only `names` are preserved
## a b c d
## 1 2 3 4

We can therefore end up calling this function chiefly for this nice side effect. Also, recall
that unname drops the labels.
unname(x)
## [1] 1 2 3 4
## attr(,"attrib1")
## [1] "<3"

5.5.2 as.something
as.vector, as.numeric, and similar drop all attributes in the case where the output is an atomic vector, but they might not necessarily do so in other cases (because they are S3 generics; see Chapter 10).
as.vector(x) # drops all attributes if x is atomic
## [1] 1 2 3 4
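On a side note, here is one such other case (a small experiment worth running): for list inputs, the names seem to be retained.
as.vector(list(a=1, b=2)) # names survive
## $a
## [1] 1
##
## $b
## [1] 2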

5.5.3 Subsetting
Subsetting with `[` (unless the indexer is omitted) drops all attributes but names (as well as dim and dimnames; see Chapter 11), which are adjusted accordingly:

x[1] # subset of labels


## a
## 1
x[[1]] # this always drops the labels (makes sense, right?)
## [1] 1

5 To be precise, we mean the default S3 method of c here; compare Section 10.2.4.



The replacement version of the index operator modifies the values in an existing vector
whilst preserving all the attributes. In particular, skipping the indexer replaces all the
elements:
y <- x
y[] <- c("u", "v") # note that c("u", "v") has no attributes
print(y)
## a b c d
## "u" "v" "u" "v"
## attr(,"attrib1")
## [1] "<3"

5.5.4 Vectorised functions


Vectorised unary functions tend to copy all the attributes.
round(x)
## a b c d
## 1 2 3 4
## attr(,"attrib1")
## [1] "<3"

Binary operations are expected to get the attributes from the longer input. If they are
of equal sizes, the first argument is preferred to the second.
y <- structure(c(1, 10), names=c("f", "g"), attrib1=":|", attrib2=":O")
y * x # x is longer
## a b c d
## 1 20 3 40
## attr(,"attrib1")
## [1] "<3"
y[c("h", "i")] <- c(100, 1000) # add two new elements at the end
y * x
## f g h i
## 1 20 300 4000
## attr(,"attrib1")
## [1] ":|"
## attr(,"attrib2")
## [1] ":O"
x * y
## a b c d
## 1 20 300 4000
## attr(,"attrib1")
## [1] "<3"
## attr(,"attrib2")
## [1] ":O"

Also, Section 9.3.6 mentions a way to copy all attributes from one object to another.

Important Even in base R, the foregoing rules are not enforced strictly. We con-
sider them inconsistencies that should be, for the time being, treated as features (with
which we need to learn to live as they have not been fixed for years, but hope springs
eternal).
As far as third-party extension packages are concerned, suffice it to say that a lot of R
programmers do not know what attributes are whatsoever. It is always best to refer to
the documentation, perform a few experiments, and/or manually ensure the preser-
vation of the data we care about.

Exercise 5.16 Check what attributes are preserved by ifelse.

5.6 Exercises
Exercise 5.17 Answer the following questions (contemplate first, then use R to find the answer).
• What is the result of x[c()]? Is it the same as x[]?
• Is x[c(1, 1, 1)] equivalent to x[1]?
• Is x[1] equivalent to x["1"]?
• Is x[c(-1, -1, -1)] equivalent to x[-1]?
• What does x[c(0, 1, 2, NA)] do?
• What does x[0] return?
• What does x[1, 2, 3] do?
• What about x[c(0, -1, -2)] and x[c(-1, -2, NA)]?
• Why is x[NA] so significantly different from x[c(1, NA)]?
• What is x[c(FALSE, TRUE, 2)]?
• What will we obtain by calling x[x<min(x)]?
• What about x[length(x)+1]?
• Why is x[min(y)] most probably a mistake? What could it mean? How can it be fixed?
• Why can we not mix indexes of different types and write x[c(1, "b", "c", 4)]? Or can we?
• Why would we call as.vector(na.omit(x)) instead of just na.omit(x)?
• What is the difference between sort and order?
• What is the type and the length of the object returned by a call to split(a, u)? What about
split(a, c(u, v))?

• How to get rid of the seventh element from a list of ten elements?
• How to get rid of the seventh, eighth, and ninth elements from a list with ten elements?
• How to get rid of the seventh element from an atomic vector of ten elements?
• If y is a list, by how many elements will “y[c(length(y)+1, length(y)+1, length(y)+1)] <- list(1, 2, 3)” extend it?

• What is the difference between x[x>0] and x[which(x>0)]?


Exercise 5.18 If x is an atomic vector of length n, x[5:n] obviously extracts everything from
the fifth element to the end. Does it, though? Check what happens when x is of length less than
five, including 0. List different ways to correct this expression so that it makes (some) sense in the
case of shorter vectors.
Exercise 5.19 Similarly, x[length(x) + 1 - 5:1] is supposed to return the last five ele-
ments in x. Propose a few alternatives that are correct also for short xs.
Exercise 5.20 Given a numeric vector, fetch its five largest elements. Ensure the code works for
vectors of length less than five.
Exercise 5.21 We can compute a trimmed mean of some x by setting the trim argument to
the mean function. Compute a similar robust estimator of location – the 𝑝-winsorised mean, 𝑝 ∈
[0, 0.5] defined as the arithmetic mean of all elements in x clipped to the [𝑄𝑝 , 𝑄1−𝑝 ] interval,
where 𝑄𝑝 is the vector’s 𝑝-quantile; see quantile. For example, if x is (8, 5, 2, 9, 7, 4, 6, 1, 3),
we have 𝑄0.25 = 3 and 𝑄0.75 = 7 and hence the 0.25-winsorised mean will be equal to the
arithmetic mean of (7, 5, 3, 7, 7, 4, 6, 3, 3).
Exercise 5.22 Let x and y be two vectors of the same length, 𝑛, and no ties. Implement the for-
mula for the Spearman rank correlation coefficient:
\[
\varrho(\mathbf{x}, \mathbf{y}) = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)},
\]
where 𝑑𝑖 = 𝑟𝑖 − 𝑠𝑖 , 𝑖 = 1, … , 𝑛, and 𝑟𝑖 and 𝑠𝑖 denote the rank of 𝑥𝑖 and 𝑦𝑖 , respectively; see also
cor.

Exercise 5.23 (*) Given two vectors x and y both of length 𝑛, a call to approx(x, y, ...
) can be used to interpolate linearly between the points (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ). We
can use it to generate new 𝑦s for previously unobserved 𝑥s (somewhere “in-between” the data we
already have). Moreover, spline(x, y, ...) can perform a cubic spline interpolation, which
is smoother; see Figure 5.2.
x <- c(1, 3, 5, 7, 10)
y <- c(1, 15, 25, 6, 0)
x_new <- seq(1, 10, by=0.25)
y_new1 <- approx(x, y, xout=x_new)[["y"]]
y_new2 <- spline(x, y, xout=x_new)[["y"]]
plot(x, y, ylim=c(-10, 30)) # the points to interpolate between
lines(x_new, y_new1, col="black", lty="solid") # linear interpolation
lines(x_new, y_new2, col="darkred", lty="dashed") # cubic interpolation
legend("topright", legend=c("linear", "cubic"),
lty=c("solid", "dashed"), col=c("black", "darkred"), bg="white")
Figure 5.2. Piecewise linear and cubic spline interpolation.

Using a call to one of the above, impute missing data in euraud-20200101-20200630.csv6 ,


e.g., the blanks in (0.60, 0.62, NA, 0.64, NA, NA, 0.58) should be filled so as to obtain
(0.60, 0.62, 0.63, 0.64, 0.62, 0.60, 0.58).

Exercise 5.24 Given some 1 ≤ from ≤ to ≤ n, use findInterval to generate a logical vector of
length n with TRUE elements only at indexes between from and to, inclusive.
Exercise 5.25 Implement expressions that give rise to the same results as calls to which,
which.min, which.max, and rev functions. What is the difference between x[x>y] and
x[which(x>y)]? What about which.min(x) vs which(x == min(x))?

Exercise 5.26 Given two equal-length vectors x and y, fetch the value from the former that cor-
responds to the smallest value in the latter. Write three versions of such an expression, each deal-
ing with potential ties in y differently. For example:
x <- c("a", "b", "c", "d", "e", "f")
y <- c( 3, 1, 2, 1, 1, 4)

It should choose the first ("b"), last ("e"), or random element from x fulfilling the above property
("b", "d", or "e" with equal probability). Make sure your code works for x being of the type
character or numeric as well as an empty vector.

6 https://github.com/gagolews/teaching-data/raw/master/marek/euraud-20200101-20200630.csv

Exercise 5.27 Implement an expression that yields the same result as duplicated(x) for a
numeric vector x, but using diff and order.
Exercise 5.28 Based on match and unique, implement your versions of union(x, y), intersect(x, y), setdiff(x, y), is.element(x, y), and setequal(x, y) for x and y being nonempty numeric vectors.
6 Character vectors

Text is a universal, portable, economical, and efficient means of interacting between


humans and computers as well as exchanging data between programs or APIs. This
book is 99% made of text. And, wow, how much valuable knowledge is in it, innit?

6.1 Creating character vectors


6.1.1 Inputting individual strings
Specific character strings are delimited by a pair of either double or single quotes (apo-
strophes).
"a string"
## [1] "a string"
'another string' # and, of course, neither 'like this" nor "like this'
## [1] "another string"

The only difference between these two is that we cannot directly include, e.g., an apo-
strophe in a single quote-delimited string. On the other hand, "'tis good ol' spam"
and 'I "love" bacon' are both okay.
However, to embrace characters whose inclusion might otherwise be difficult or im-
possible, we may always employ the so-called escape sequences.
R uses the backslash, “\”, as the escape character. In particular:
• \" inputs a double quote,
• \' generates a single quote,
• \\ includes a backslash,
• \n endows a new line.
(x <- "I \"love\" bacon\n\\\"/")
## [1] "I \"love\" bacon\n\\\"/"

The print function (which was implicitly called to display the above object) does not
reveal the special meaning of the escape sequences. Instead, print outputs strings in the same way that we ourselves would input them. The number of
characters in x is 18, and not 23:
nchar(x)
## [1] 18

To display the string as-it-really-is, we call cat:


cat(x, sep="\n")
## I "love" bacon
## \"/

In raw character constants, the backslash character’s special meaning is disabled.


They can be entered using the notation like r"(...)", r"{...}", or r"[...]";
see help("Quotes"). These can be useful when inputting regular expressions (Sec-
tion 6.2.4).
x <- r"(spam\n\\\"maps)" # also: r"-(...)-", r"--(...)--", etc.
print(x)
## [1] "spam\\n\\\\\\\"maps"
cat(x, sep="\n")
## spam\n\\\"maps

Furthermore, the string version of the missing value marker is NA_character_.

Note (*) The Unicode standard 15.0 (version dated September 2022) defines 149 186
characters, i.a., letters from different scripts, mathematical symbols, and emojis.
Each is assigned a unique numeric identifier; see the Unicode Character Code Charts1 .
For example, the inverted exclamation mark (see the Latin-1 Supplement section therein)
has been mapped to the hexadecimal code 0xA1 (or 161 decimally). Knowing this magic
number permits us to specify a Unicode code point using one of the following escape
sequences:
• \uxxxx – codes using four hexadecimal digits,
• \Uxxxxxxxx – codes using eight hexadecimal digits.
For instance:
cat("!\u00a1!\U000000a1!", sep="\n")
## !¡!¡!

All R installations allow for working with Unicode strings. More precisely, they sup-
port dealing with UTF-8, being a super-encoding that is native to most UNIX-like
boxes, including GNU/Linux and m**OS. Other operating systems may use some 8-
bit encoding as the system one (e.g., latin1 or cp1252), but they can be mixed with
Unicode seamlessly; see help("Encoding"), help("iconv"), and [26] for discussion.
1 https://www.unicode.org/charts

Nevertheless, certain output devices (web browsers, LaTeX renderers, text terminals)
might be unable to display every possible Unicode character, e.g., due to some fonts’
being missing. However, as far as processing character data is concerned, this does
not matter because R does it with its eyes closed. For example:
cat("\U0001f642\u2665\u0bb8\U0001f923\U0001f60d\u2307", sep="\n")
## ������

In the PDF version2 of this adorable book, the Unicode glyphs are not rendered cor-
rectly for some reason. However, its HTML variant3 , generated from the same source
files, should be displayed by most web browsers properly.

Note (*) Some output devices may support the following codes that control the posi-
tion of the caret (text cursor):
• \b inserts a backspace (moves cursor one column to the left),
• \t implants a tabulator (advances to the next tab stop, e.g., a multiply of four or
eight text columns),
• \r injects a carriage return (move to the beginning of the current line).
cat("abc\bd\tef\rg\nhij", sep="\n")
## gbd ef
## hij

These can be used on unbuffered outputs like stderr to display the status of the cur-
rent operation, for instance, an animated textual progress bar, the print-out of the
ETA, or the percentage of work completed.
Further, certain terminals can also understand the ECMA-48/ANSI-X3.64 escape se-
quences4 of the form \u001b[... to control the cursor’s position, text colour, and
even style. For example, \u001b[1;31m outputs red text in bold font and \u001b[0m
resets the settings to default. We recommend giving, e.g., cat("\u001b[1;31mspam\
u001b[0m") or cat("\u001b[5;36m\u001b[Abacon\u001b[Espam\u001b[0m") a try.

6.1.2 Many strings, one object


Less trivial character vectors (meaning, of length greater than one) can be constructed
by means of, e.g., c or rep5.

2 https://deepr.gagolewski.com/deepr.pdf
3 https://deepr.gagolewski.com/
4 https://en.wikipedia.org/wiki/ANSI_escape_code
5 Internally, there is a string cache (a hash table). Multiple clones of the same string do not occupy more RAM than necessary.

(x <- c(rep("spam", 3), "bacon", NA_character_, "spam"))


## [1] "spam" "spam" "spam" "bacon" NA "spam"

Thus, a character vector is, in fact, a sequence of sequences of characters6 . As usual,


the total number of strings can be fetched via the length function. However, the
length of each string may be read with the vectorised nchar.
length(x) # how many strings?
## [1] 6
nchar(x) # the number of code points in each string
## [1] 4 4 4 5 NA 4

6.1.3 Concatenating character vectors


paste can be used to concatenate (join) the corresponding elements of two or more
character vectors:
paste(c("a", "b", "c"), c("1", "2", "3")) # sep=" " by default
## [1] "a 1" "b 2" "c 3"
paste(c("a", "b", "c"), c("1", "2", "3"), sep="") # see also paste0
## [1] "a1" "b2" "c3"

The function is deeply vectorised:


paste(c("a", "b", "c"), 1:6, c("!", "?")) # coercion of numeric to character
## [1] "a 1 !" "b 2 ?" "c 3 !" "a 4 ?" "b 5 !" "c 6 ?"

We can also collapse (flatten, aggregate) a sequence of strings into a single string:
paste(c("a", "b", "c", "d"), collapse=",")
## [1] "a,b,c,d"
paste(c("a", "b", "c", "d"), 1:2, sep="", collapse="")
## [1] "a1b2c1d2"

Perhaps for convenience, alas, paste treats missing values differently from most other
vectorised functions:
paste(c("A", NA_character_, "B"), "!", sep="")
## [1] "A!" "NA!" "B!"

6.1.4 Formatting objects


Strings can also arise by converting other-typed R objects into text. For example, the
quite customisable (see Chapter 10) format function prepares data for display in dy-
namically generated reports.
6 (*) Chapter 14 will mention that objects of the type character are internally represented as objects with SEXPTYPE of STRSXP. They are arrays with elements whose SEXPTYPE is CHARSXP, each of which is a string of characters (char*).

x <- c(123456.789, -pi, NaN)


format(x)
## [1] "123456.7890" " -3.1416" " NaN"
cat(format(x, digits=8, scientific=FALSE, drop0trailing=TRUE), sep="\n")
## 123456.789
## -3.1415927
## NaN

Moreover, sprintf is a workhorse for turning possibly many atomic vectors into
strings. Its first argument is a format string. Special escape sequences starting with
the per cent sign, “%”, serve as placeholders for the actual values. For instance, “%s” is
replaced with a string and “%f” with a floating point value taken from further argu-
ments.
sprintf("%s%s", "a", c("X", "Y", "Z")) # like paste(...)
## [1] "aX" "aY" "aZ"
sprintf("key=%s, value=%f", c("spam", "eggs"), c(100000, 0))
## [1] "key=spam, value=100000.000000" "key=eggs, value=0.000000"

The numbers’ precision, strings’ widths and justification, etc., can be customised, e.g.,
“%6.2f” is a number that, when converted to text, will occupy six text columns7 , with
two decimal digits of precision.
sprintf("%10s=%6.2f%%", "rate", 2/3*100) # "%%" renders the per cent sign
## [1] " rate= 66.67%"
sprintf("%.*f", 1:5, pi) # variable precision
## [1] "3.1" "3.14" "3.142" "3.1416" "3.14159"

Also, e.g., “%1$s”, “%2$s”, … inserts the first, second, … argument as text.
sprintf("%1$s, %2$s, %1$s, and %1$s", "spam", "bacon") # numbered argument
## [1] "spam, bacon, spam, and spam"

Exercise 6.1 Read help("sprintf") (highly recommended!).

7 This is only true for 8-bit native encodings or ASCII; see also sprintf from the stringx package, which takes the text width and not the number of bytes into account.

6.1.5 Reading text data from files


Given a raw text file, readLines loads it into memory and represents it as a character
vector, with each line stored in a separate string.
head(readLines(
"https://github.com/gagolews/teaching-data/raw/master/README.md"
))
## [1] "# Dr [Marek](https://www.gagolewski.com)'s Data for Teaching"
## [2] ""
## [3] "> *See the comment lines within the files themselves for"
## [4] "> a detailed description of each dataset.*"
## [5] ""
## [6] "*Good* datasets are actually hard to find!"

writeLines is its counterpart. There is also an option to read or write parts of files at
a time using file connections which we mention in Section 8.3.5. Moreover, cat(...,
append=TRUE) can be used to create a text file incrementally.
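For instance, a minimal sketch (writing to a temporary file so as not to clutter the disk):
f <- tempfile()
writeLines(c("spam", "bacon"), f)
cat("eggs\n", file=f, append=TRUE) # add one more line at the end
readLines(f)
## [1] "spam"  "bacon" "eggs"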

6.2 Pattern searching


6.2.1 Comparing whole strings
We have already reviewed a couple of ways to compare strings as a whole. For instance,
the `==` operator implements elementwise testing:
c("spam", "spam", "bacon", "eggs") == c("spam", "eggs") # recycling rule
## [1] TRUE FALSE FALSE TRUE

In Section 5.4.1, we introduced the match function and its derivative, the `%in%` oper-
ator. They are vectorised in a different way:
match(c("spam", "spam", "bacon", "eggs"), c("spam", "eggs"))
## [1] 1 1 NA 2
c("spam", "spam", "bacon", "eggs") %in% c("spam", "eggs")
## [1] TRUE TRUE FALSE TRUE

Note (*) match relies on a simple, bytewise comparison of the corresponding code
points. It might not be valid in natural language processing activities, e.g., where
the German word groß should be equivalent to gross [18]. Moreover, in the rare situ-
ations where we read Unicode-unnormalised data, canonically equivalent strings may
be considered different; see [17].

6.2.2 Partial matching


When only a consideration of the initial part of each string is required, we can call:
startsWith(c("s", "spam", "spamtastic", "spontaneous", "spoon"), "spam")
## [1] FALSE TRUE TRUE FALSE FALSE

If we provide many prefixes, the above function will be applied elementwisely, just like
the `==` operator.
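For example (a quick illustration):
startsWith(c("spam", "eggs", "sausage"), c("sp", "ba", "sa"))
## [1]  TRUE FALSE  TRUE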
On the other hand, charmatch performs a partial matching of strings. It is an each-vs-all version of startsWith:
charmatch(c("s", "sp", "spam", "spams", "eggs", "bacon"), c("spam", "eggs"))
## [1] 1 1 1 NA 2 NA
charmatch(c("s", "sp", "spam", "spoo", "spoof"), c("spam", "spoon"))
## [1] 0 0 1 2 NA

Note that 0 designates that there was an ambiguous match.

Note (*) In Section 9.4.7, we discuss match.arg, which a few R functions rely on when
they need to select a value from a range of possible choices. Furthermore, Section 9.3.2
and Section 15.4.4 mention the (discouraged) partial matching of list labels and func-
tion argument names.

6.2.3 Matching anywhere within a string


Fixed patterns can also be searched for anywhere within character strings using grepl:
x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
grepl("spam", x, fixed=TRUE) # fixed patterns, as opposed to regexes below
## [1] TRUE TRUE FALSE FALSE

Important The order of arguments is like grepl(needle, haystack), not vice versa.
Also, this function is not vectorised with respect to the first argument.

Exercise 6.2 How can the calls to grep(y, x, value=FALSE) and grep(y, x, value=TRUE) be implemented based on grepl and other operations we are already familiar with?

Note (*) As a curiosity, agrepl performs approximate matching, which can account
for a smöll nmber of tpyos.
agrepl("spam", x)
## [1] TRUE TRUE FALSE TRUE
agrepl("ham", x, ignore.case=TRUE)
## [1] TRUE TRUE TRUE TRUE

It is based on Levenshtein’s edit distance that measures the number of character inser-
tions, deletions, or substitutions required to turn one string into another.

6.2.4 Using regular expressions (*)


Setting perl=TRUE allows for identifying occurrences of patterns specified by regular
expressions (regexes).
grepl("^spam", x, perl=TRUE) # strings that begin with `spam`
## [1] TRUE FALSE FALSE FALSE
grepl("(?i)^spam|spam$", x, perl=TRUE) # begin or end; case ignored
## [1] TRUE TRUE TRUE FALSE

Note For more details on regular expressions in general, see, e.g., [24]. The ultimate
reference on the PCRE2 pattern syntax is the Unix man page pcre2pattern(3)8 . From
now on, we assume that the reader is familiar with it.
Apart from the Perl-compatible regexes, R also gives access to the TRE library (ERE-
like), which is the default one; see help("regex"). However, we discourage its use
because it is feature-poorer.

8 http://www.pcre.org/current/doc/html/pcre2pattern.html

Exercise 6.3 The list.files function generates the list of file names in a given directory that
match a given regular expression. For instance, the following gives all CSV files in a folder:
list.files("~/Projects/teaching-data/r/", "\\.csv$")
## [1] "air_quality_1973.csv" "anscombe.csv" "iris.csv"
## [4] "titanic.csv" "tooth_growth.csv" "trees.csv"
## [7] "world_phones.csv"

Write a single regular expression that matches file names ending with “.csv” or “.csv.gz”.
Also, scribble a regex that matches CSV files whose names do not begin with “eurusd”.

6.2.5 Locating pattern occurrences


regexpr finds the first occurrence of a pattern in each string:

regexpr("spam", x, fixed=TRUE)
## [1] 1 3 -1 -1
## attr(,"match.length")
## [1] 4 4 -1 -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

In particular, there is a pattern occurrence starting at the third code point of the
second string in x. Moreover, the last string has no pattern match, which is denoted
by -1.
The match.length attribute is generally more informative when searching with regu-
lar expressions.
To locate all the matches, i.e., globally, we use gregexpr:
# `spam` followed by 0 or more letters, case insensitively
gregexpr("(?i)spam\\p{L}*", x, perl=TRUE)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 3 12
## attr(,"match.length")
## [1] 8 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 7
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

As we noted in Section 4.4.2, wrapping the results in a list was a clever choice because the number of matches can obviously vary between strings.
In Section 7.2, we will look at the Map function, which, along with substring intro-
duced below, can aid in getting the most out of such data. Meanwhile, let’s just men-
tion that regmatches extracts the matching substrings:
regmatches(x, gregexpr("(?i)spam\\p{L}*", x, perl=TRUE))
## [[1]]
## [1] "spam"
##
## [[2]]
## [1] "spammite" "spam"
##
## [[3]]
## [1] "SPAM"
##
## [[4]]
## character(0)

Note (*) Consider what happens when a regular expression contains parenthesised
subexpressions (capture groups).
r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"

This regex consists of two capture groups separated by a dot. The first one is labelled
“basename”. It comprises several arbitrary characters except for spaces and dots. The
second group, named “extension”, is a substring consisting of anything but spaces.
Such a pattern can be used for unpacking space-delimited lists of file names.
z <- "dataset.csv.gz something_else.txt spam"
regexpr(r, z, perl=TRUE)
## [1] 1
## attr(,"match.length")
## [1] 14
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
## basename extension
## [1,] 1 9
## attr(,"capture.length")
## basename extension
## [1,] 7 6
## attr(,"capture.names")
## [1] "basename" "extension"
gregexpr(r, z, perl=TRUE)
## [[1]]
## [1] 1 16
## attr(,"match.length")
## [1] 14 18
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
## basename extension
## [1,] 1 9
## [2,] 16 31
## attr(,"capture.length")
## basename extension
## [1,] 7 6
## [2,] 14 3
## attr(,"capture.names")
## [1] "basename" "extension"

The capture.* attributes give us access to the matches to the individual capture
groups, i.e., the basename and the extension.
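For example, here is a quick sketch of how they can be combined with substring (to be introduced in Section 6.3.1); the helper names m, s, and l below are ad hoc:
m <- regexpr(r, z, perl=TRUE)
s <- attr(m, "capture.start")
l <- attr(m, "capture.length")
substring(z, s, s+l-1) # the first match's capture groups
## [1] "dataset" "csv.gz"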

Exercise 6.4 (*) Check out the difference between the results generated by regexec and reg-
expr as well as between the outputs of gregexec and gregexpr.

6.2.6 Replacing pattern occurrences


sub and gsub can replace the first and all, respectively, matches to a pattern:

x <- c("spam", "y spammite spam", "yummy SPAM", "sram")


sub("spam", "ham", x, fixed=TRUE)
## [1] "ham" "y hammite spam" "yummy SPAM" "sram"
gsub("spam", "ham", x, fixed=TRUE)
## [1] "ham" "y hammite ham" "yummy SPAM" "sram"

Note (*) If a regex defines capture groups, matches thereto can be mentioned not only
in the pattern itself but also in the replacement string:
gsub("(\\p{L})\\p{L}\\1", "\\1", "aha egg gag NaN spam", perl=TRUE)
## [1] "a egg g N spam"

Matched are, in the following order: a letter (it is a capture group), another letter, and
the former letter again. Each such palindrome of length three is replaced with just the
repeated letter.

Exercise 6.5 (*) Display the source code of glob2rx by calling print(glob2rx) and study
how this function converts wildcards such as file???.* or *.csv to regular expressions that
can be passed to, e.g., list.files.

6.2.7 Splitting strings into tokens


strsplit divides each string in a character vector into chunks.
strsplit(c("spam;spam;eggs;;bacon", "spam"), ";", fixed=TRUE)
## [[1]]
## [1] "spam" "spam" "eggs" "" "bacon"
##
## [[2]]
## [1] "spam"

Note that this time the search pattern specifying the token delimiter is given as the
second argument (an inconsistency).

6.3 Other string operations


6.3.1 Extracting substrings
substring extracts parts of strings between given character position ranges.

substring("spammity spam", 1, 4) # from the first to the fourth character


## [1] "spam"
substring("spammity spam", 10) # from the tenth to end
## [1] "spam"
substring("spammity spam", c(1, 10), c(4, 14)) # vectorisation
## [1] "spam" "spam"
substring(c("spammity spam", "bacon and eggs"), 1, c(4, 5))
## [1] "spam" "bacon"

Note There is also a replacement (compare Section 9.3.6) version of the foregoing:
x <- "spam, spam, bacon, and spam"
substring(x, 7, 11) <- "eggs"
print(x)
## [1] "spam, eggs, bacon, and spam"

Unfortunately, the number of characters in the replacement string should not exceed
the length of the part being substituted (try "chickpeas" instead of "eggs"). However,
substring replacement can be written as a composition of substring extraction and
concatenation:
paste(substring(x, 1, 6), "chickpeas", substring(x, 11), sep="")
## [1] "spam, chickpeas, bacon, and spam"

Exercise 6.6 Take the output generated by regexpr and apply substring to extract the pat-
tern occurrences. If there is no match in a string, the corresponding output should be NA.

6.3.2 Translating characters


tolower and toupper converts between lower and upper case:

toupper("spam")
## [1] "SPAM"

Note Like many other string operations in base R, these functions per