Deep R Programming
v1.0.0.9004
Prof. Marek Gagolewski
Warsaw University of Technology, Poland
Systems Research Institute, Polish Academy of Sciences
https://www.gagolewski.com/
A little peculiar is the world some people decided to immerse themselves in, so here is
a message stating the obvious. Every effort has been made in the preparation of this
book to ensure the accuracy of the information presented. However, the information
contained in this book is provided without warranty, either express or implied. The
author will, of course, not be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Any bug reports/corrections/feature requests are welcome. To make this textbook even
better, please file them at https://github.com/gagolews/deepr.
Typeset with XeLaTeX. Please be understanding: it was an algorithmic process. Hence, the results are ∈ [good enough, perfect).
Homepage: https://deepr.gagolewski.com/
Datasets: https://github.com/gagolews/teaching-data
Preface
0.1 To R, or not to R
0.2 R (GNU S) as a language and an environment
0.3 Aims, scope, and design philosophy
0.4 Classification of R data types and book structure
0.5 About the author
0.6 Acknowledgements
0.7 You can make this book better

I Deep

1 Introduction
1.1 Hello, world!
1.2 Setting up the development environment
1.2.1 Installing R
1.2.2 Interactive mode
1.2.3 Batch mode: Working with R scripts (**)
1.2.4 Weaving: Automatic report generation (**)
1.2.5 Semi-interactive modes (Jupyter Notebooks, sending code to the associated R console, etc.)
1.3 Atomic vectors at a glance
1.4 Getting help
1.5 Exercises

2 Numeric vectors
2.1 Creating numeric vectors
2.1.1 Numeric constants
2.1.2 Concatenating vectors with c
2.1.3 Repeating entries with rep
2.1.4 Generating arithmetic progressions with seq and `:`
2.1.5 Generating pseudorandom numbers
2.1.6 Reading data with scan
2.2 Creating named objects
2.3 Vectorised mathematical functions
2.3.1 abs and sqrt
2.3.2 Rounding
2.3.3 Natural exponential function and logarithm

3 Logical vectors
3.1 Creating logical vectors
3.2 Comparing elements
3.2.1 Vectorised relational operators
3.2.2 Testing for NA, NaN, and Inf
3.2.3 Dealing with round-off errors (*)
3.3 Logical operations
3.3.1 Vectorised logical operators
3.3.2 Operator precedence revisited
3.3.3 Dealing with missingness
3.3.4 Aggregating with all, any, and sum
3.3.5 Simplifying predicates
3.4 Choosing elements with ifelse
3.5 Exercises

5 Vector indexing
5.1 head and tail
5.2 Subsetting and extracting from vectors
5.2.1 Nonnegative indexes
5.2.2 Negative indexes
5.2.3 Logical indexer

6 Character vectors
6.1 Creating character vectors
6.1.1 Inputting individual strings
6.1.2 Many strings, one object
6.1.3 Concatenating character vectors
6.1.4 Formatting objects
6.1.5 Reading text data from files
6.2 Pattern searching
6.2.1 Comparing whole strings
6.2.2 Partial matching
6.2.3 Matching anywhere within a string
6.2.4 Using regular expressions (*)
6.2.5 Locating pattern occurrences
6.2.6 Replacing pattern occurrences
6.2.7 Splitting strings into tokens
6.3 Other string operations
6.3.1 Extracting substrings
6.3.2 Translating characters
6.3.3 Ordering strings
6.4 Other atomic vector types (*)
6.4.1 Integer vectors (*)
6.4.2 Raw vectors (*)
6.4.3 Complex vectors (*)
6.5 Exercises

7 Functions
7.1 Creating and invoking functions
7.1.1 Anonymous functions

II Deeper

9 Designing functions
9.1 Managing data flow
9.1.1 Checking input data integrity and argument handling
9.1.2 Putting outputs into context
9.2 Organising and maintaining functions
9.2.1 Function libraries
9.2.2 Writing R packages (*)
    Package structure (*)
    Building and installing (*)
    Documenting (*)
9.2.3 Writing standalone programs (**)
9.2.4 Assuring quality code
    Managing changes and working collaboratively

10 S3 classes
10.1 Object type vs class
10.2 Generics and method dispatching
10.2.1 Generics, default, and custom methods
10.2.2 Creating generics
10.2.3 Built-in generics
10.2.4 First-argument dispatch and calling S3 methods directly
10.2.5 Multi-class-ness
10.2.6 Operator overloading
10.3 Common built-in S3 classes
10.3.1 Date, time, etc.
10.3.2 Factors
10.3.3 Ordered factors
10.3.4 Formulae (*)
10.4 (Over)using the forward pipe operator, `|>` (*)
10.5 S4 classes (*)

13 Graphics
13.1 Graphics primitives
13.1.1 Symbols (points)
13.1.2 Line segments
13.1.3 Polygons
13.1.4 Text
13.1.5 Raster images (bitmaps) (*)
13.2 Graphics settings
13.2.1 Colours
13.2.2 Plot margins and clipping regions
13.2.3 User coordinates and axes
13.2.4 Plot dimensions (*)
13.2.5 Many figures on one page (subplots)
13.2.6 Graphics devices
13.3 Higher-level functions
13.3.1 Scatter and function plots with plot.default and matplot
13.3.2 Bar plots and histograms
13.3.3 Box-and-whisker plots
13.3.4 Contour plots and heat maps
13.4 Exercises

Changelog
References
0 Preface
0.1 To R, or not to R
R has been named the eleventh most dreaded programming language in the 2022 StackOverflow Developer Survey6.
Also, it is a free app, so there must be something wrong with it, right?
But whatever, R is deprecated anyway; the modern way is to use tidyverse.
Or we should all just switch to Python7.
Yeah, nah.
R is thus very convenient for rapid prototyping. It helps turn our ideas into fully operational code that can be battle-tested, extended, polished, run in production, and otherwise enjoyed. As an interpreted language, it can be executed not only in an interactive read-eval-print loop (command–result, question–answer, …), but also in batch mode (running standalone scripts).
Therefore, we would rather position R amongst such environments for numerical or scientific computing as Python with numpy and pandas, Julia, GNU Octave, Scilab, and MATLAB. However, it is more specialised in data science applications than any of them. Hence, it provides a much smoother experience. This is why, over the years, R has become the de facto standard in statistics and related fields.
Note R [70] is a dialect of the very popular S system designed in the mid-1970s by Rick A. Becker, John M. Chambers, and Allan R. Wilks at Bell Labs; see [3, 4, 5, 6] for historical notes and [7, 9, 13, 57] for works on newer versions of S. Quoting from [4]:
The design goal for S is, most broadly stated, to enable and encourage good data analysis, that is, to provide users with specific facilities and a general environment that helps them quickly and conveniently look at many displays, summaries, and models for their data, and to follow the kind of iterative, exploratory path that most often leads to a thorough analysis. The system is designed for interactive use with simple but general expressions for the user to type, and immediate, informative feedback from the system, including graphic output on any of a variety of graphical devices.
S became popular because it offered greater flexibility than the standalone statistical packages. It was praised for its high interactivity and array-centrism that was taken from APL, the familiar syntax of the C language involving {curly braces}, the ability to treat code as data known from Lisp (Chapter 15), the notion of lazy arguments (Chapter 17), and the ease of calling external C and Fortran routines (Chapter 14). Its newer versions were also somewhat object-orientated (Chapter 10).
However, S was a proprietary and closed-source system. To address this, Robert Gentleman and Ross Ihaka of the Statistics Department, University of Auckland, developed R in the 1990s9. They were later joined by many contributors10. It was decided that R would be distributed under the terms of the free GNU General Public License, version 2.

In essence, R was supposed to be backwards-compatible with S, but some design choices led to their evaluation models' being slightly different. In Chapter 16, we discuss how R's design was inspired by the Scheme language [1].
This is horrible.
Another cohort was isolated from base R through a thick layer of popular third-party packages that introduce an overwhelming number of functions (every operation, regardless of its complexity, has a unique name). They often duplicate the core functionality, and might not be fully compatible with our traditional system.
Both user families ought to be fine, as long as they limit themselves to solving only the
most common data processing problems.
But we yearn for more. We do not want hundreds of prefabricated recipes for popular
dishes that we can mindlessly apply without much understanding.
Our aim is to learn the fundamentals of base R, which constitutes the lingua franca of all R users. We want to be able to indite code that everybody should understand; code that will work without modifications in the next decades too.

9 See [12, 37] for historical notes; R version 0.49 was released in April 1997.
We want to be able to tackle any data-intense problem. Furthermore, we want to develop transferable skills so that learning new tools such as Python with numpy and pandas (e.g., [27, 47]) or Julia will be much easier later. After all, R is not the only notable environment out there.
Anyway, enough preaching. This graduate11-level textbook is for readers who:
• would like to experience the joy of solving problems by programming,
• want to become independent users of the R environment,
• can appreciate more cohesively and comprehensively12 organised material,
• do not mind a slightly steeper learning curve at the beginning,
• do not want to be made obsolete by artificial “intelligence” in the future.
Some readers will benefit from its being their first introduction to R (yet, without all the pampering). For others13, this will be a fine course from intermediate to advanced (do not skip the first chapters, though).
Either way, we should not forget to solve all the prescribed exercises.
Good luck!
11 The author taught similar courses for his wonderfully ambitious undergraduate data/computer science and mathematics students at the Warsaw University of Technology, where our approach has proven not difficult whatsoever.
12 Yours truly has not chosen to play the role of a historian, a stenographer, or a grammarian. Thus, he has made a few noninvasive idealisations for didactic purposes. Languages evolve over time, R is now different from what it used to be, and we can shape it (slowly; we value its stable API) to become something even better in the future.
13 It might also happen that for certain readers, this will not be an appropriate course at all, either at this
stage of their career (come back later) or in general (no dramas). This is a non-profit, open-access project,
but it does not mean it is ideal for everyone. We recommend giving other sources a try, e.g., [8, 10, 15, 45,
58, 61, 62, 69], etc. Some of them are freely available.
[Figure 1: An overview of the most prevalent R data types, including NULL, logical, numeric, character, list, function, factor, matrix, array, data.frame, formula, Date, and kmeans; see Figure 17.2 for a more comprehensive list.]
• and many more, which we can arbitrarily define using the principles of S3-style object-orientated programming (Chapter 10).
In this part of the book, we also discuss the principles of sustainable coding
(Chapter 9) as well as introduce ways to prepare publication-quality graphics
(Chapter 13).
3. More advanced material is discussed in the third part. For most readers, it should be of theoretical interest only. However, it can help gain a complete understanding of and control over our environment. This includes the following data types:
• symbol (name), call, expression (Chapter 15) are objects representing unevaluated R expressions that can be freely manipulated and executed if needed;
• environment (Chapter 16) stores named objects in hash maps and provides the basis for the environment model of evaluation;
• externalptr (Section 14.2.8) provides the ability to maintain any dynamically allocated C/C++ objects between function calls.
We should not be surprised that we did not list any data types defined by a few trendy14 third-party packages. We will later see that we can most often do without them. If that is not the case, we will become skilled enough to learn them quickly ourselves.
such as stringi18 (one of the most often downloaded R packages) and genieclust19
(a fast and robust clustering algorithm in both Python and R).
0.6 Acknowledgements
R, and its predecessor S, is the result of a collaborative effort of many programmers20. Without their generous intellectual contributions, the landscape of data analysis would not be as beautiful as it is now. R is distributed under the terms of the GNU General Public License version 2. We occasionally display fragments of its source code for didactic purposes.
We describe and use R version 4.3.1 (2023-06-16). However, we expect 99.9% of the
material covered here to be valid in future releases (consider filing a bug report if you
discover this is not the case).
Deep R Programming is based on the author's experience as an R user (since ~2003), developer of open-source packages, tutor/lecturer (since ~2008), and author of a quite successful Polish textbook, Programowanie w języku R [25], which was published by PWN (1st ed. 2014, 2nd ed. 2016). Even though the current book is an entirely different work, its predecessor served as an excellent testbed for many ideas conveyed here.
In particular, the teaching style exercised in this book has proven successful in many
similar courses that yours truly was responsible for, including at Warsaw University of
Technology, Data Science Retreat (Berlin), and Deakin University (Melbourne). I thank
all my students and colleagues for the feedback given over the last 15-odd years.
This work received no funding, administrative, technical, or editorial support from
Deakin University, Warsaw University of Technology, Polish Academy of Sciences, or
any other source.
This book was prepared in a Markdown superset called MyST21 and typeset with Sphinx22 and TeX (XeLaTeX). Code chunks were processed with the R package knitr [64]. All figures were plotted with the low-level graphics package using the author's own style template. A little help from Makefiles, custom shell scripts, and Sphinx plugins (sphinxcontrib-bibtex23, sphinxcontrib-proof24) dotted the j's and crossed the f's.
The Ubuntu Mono25 font is used for the display of code. The typesetting of the main text
relies on the Alegreya26 typeface.
18 https://stringi.gagolewski.com/
19 https://genieclust.gagolewski.com/
20 https://www.r-project.org/contributors.html
21 https://myst-parser.readthedocs.io/en/latest/index.html
22 https://www.sphinx-doc.org/
23 https://pypi.org/project/sphinxcontrib-bibtex
24 https://pypi.org/project/sphinxcontrib-proof
25 https://design.ubuntu.com/font
26 https://www.huertatipografica.com/en
I Deep

1 Introduction
By calling (invoking) the cat function, we printed out a given character string that we
enclosed in double-quote characters.
Documenting code is a good development practice. It is thus worth knowing that any
text following a hash sign (that is not part of a string) is a comment. It is ignored by the
interpreter.
# This is a comment.
# This is another comment.
cat("I cannot wait", "till lunchtime.\n") # two arguments (another comment)
## I cannot wait till lunchtime.
cat("# I will not buy this record.\n# It is scratched.\n")
## # I will not buy this record.
## # It is scratched.
By convention, in this book, the textual outputs generated by R itself are always preceded by two hashes. This makes copy-pasting all code chunks easier in case we would like to experiment with them (which is always highly encouraged).
Whenever a call to a function is to be made, the round brackets are obligatory. All objects within the parentheses (they are separated by commas) constitute the input data to be consumed by the operation. Thus, the syntax is: a_function_to_be_called(argument1, argument2, etc.).
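To give a couple more examples in the same spirit (a small sketch; the functions below are all part of base R):

```r
# A function name, then parentheses enclosing zero or more
# comma-separated arguments:
sum(1, 2, 3)     # three arguments
## [1] 6
max(7, 2, 9, 4)  # the largest of the given values
## [1] 9
Sys.time()       # a call with no arguments at all (brackets still required)
```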
Other users (e.g., of Wi***ws) might consider installing Anaconda or Miniconda, especially if they would like to work with Jupyter (Section 1.2.5) or Python as well.
Below we review several ways in which we can write and execute R code. It is up to
the benign readers to research, set up, and learn the development environment that
suits their needs. As usual in real life, there is no single universal approach that always
works best in all scenarios.
Important When working interactively, the default3 command prompt, “>”, means: I
am awaiting orders. Moreover, “+” denotes: Please continue. In the latter case, we should
either complete the unfinished expression or cancel the operation by pressing ESC or
CTRL+C (depending on the operating system).
For readability, we never print out the command prompt characters in this book.
single PDF document6 as well as the whole website7. This was facilitated by tools like pandoc and docutils.
Exercise 1.2 (**) Call install.packages("knitr") in R. Then, create a text file named
test.Rmd with the following content:
# Hello, Markdown!
```{r}
print("G'day!")
print(2+2)
plot((1:10)^2)
```
Assuming that the file is located in the current working directory (compare Section 7.3.2), call
knitr::knit("test.Rmd") from the R console, or run in the terminal:
Rscript -e 'knitr::knit("test.Rmd")'
Alternatively, see Section 7.3.2 for ways to call external programs from R.
flow, including JupyterLab, Emacs, RStudio, and VSCodium. Some of them require
additional plugins for R.
Executing an individual code line or a whole text selection is usually done by pressing
(configurable) keyboard shortcuts such as Ctrl+Enter or Shift+Enter.
Exercise 1.3 (*) JupyterLab8 is a development environment that runs in a web browser. It was programmed in Python, but supports many programming languages. Thanks to IRkernel9, we can use it with R.
1. Install JupyterLab and IRkernel (for instance, if you use Anaconda, run conda install
-c r r-essentials).
2. From the File menu, select Create a new R source file and save it as, e.g., test.R.
3. Click File and select Create a new console for the editor running the R kernel.
4. Input a few print “Hello, world”-like calls.
5. Press Shift+Enter (whilst working in the editor) to send different code fragments to the
console and execute them. Inspect the results.
See Figure 1.1 for an illustration. Note that issuing options(jupyter.rich_display=FALSE)
may be necessary to disable rich HTML outputs and make them look more like ones in this book.
Figure 1.1. JupyterLab: A source file editor and the associated R console, where we can
run arbitrary code fragments.
Example 1.4 (*) JupyterLab also handles dedicated Notebooks, where editable and executable code chunks and the results they generate can be kept together in a single .ipynb (JSON) file; see Figure 1.2 for an illustration and Chapter 1 of [27] for a quick introduction (from the Python language kernel perspective).

This environment is convenient for live coding (e.g., for teachers) or performing exploratory data analyses. However, for more serious programming work, the code can get messy. Luckily, there is always an option to export a notebook to an executable, plain text R script.

8 https://jupyterlab.readthedocs.io/en/stable
9 https://irkernel.github.io/
Figure 1.2. An example Jupyter Notebook, where we can keep code and results together.
To create a vector of any length, we can call the c function, which combines given arguments into a single sequence:
c(1, 2, 3) # three values combined
## [1] 1 2 3
length(c(1, 2, 3)) # indeed, it is a vector of length three
## [1] 3
In Chapter 2, Chapter 3, and Chapter 6, we will discuss the most prevalent types of
atomic vectors: numeric, logical, and character ones, respectively.
c(0, 1, -3.14159, 12345.6) # four numbers
## [1] 0.0000 1.0000 -3.1416 12345.6000
c(TRUE, FALSE) # two logical values
## [1] TRUE FALSE
c("spam", "bacon", "spam") # three character strings
## [1] "spam" "bacon" "spam"
We call them atomic for they can only group together values of the same type. Lists, which we will discuss in Chapter 4, are, on the other hand, referred to as generic vectors. They can be used for storing items of mixed types, including other lists.
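As a quick preview of this distinction (a sketch; lists are treated in detail in Chapter 4):

```r
# Atomic vectors hold values of one type only; mixing types
# triggers coercion to the most general type present:
c(1, "spam", TRUE)
## [1] "1"    "spam" "TRUE"
# A list (generic vector) preserves each element's original type:
x <- list(1, "spam", TRUE)
length(x)
## [1] 3
```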
Note Not having separate scalar types greatly simplifies the programming of numerical computing tasks. Vectors are prevalent in our main areas of interest: statistics, simulations, data science, machine learning, and all other data-orientated computing. For example, columns and rows in tables (characteristics of clients, ratings of items given by users) or time series (stock market prices, readings from temperature sensors) are all best represented by means of such sequences.
The fact that vectors are the core part of the R language makes their use very natural, as opposed to the languages that require special add-ons for vector processing, e.g., numpy for Python [34]. By learning different ways to process them as a whole (instead of one element at a time), we will ensure that our ideas can quickly be turned into operational code. For instance, computing summary statistics such as, say, the mean absolute deviation of a sequence x, will be as effortless as writing mean(abs(x-mean(x))). Such code is not only easy to read and maintain, but it is also fast to run.
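To preview this vectorised style (a small sketch; the data below are made up for illustration):

```r
x <- c(2, 4, 4, 6, 9)    # an example numeric vector
mean(x)                  # the arithmetic mean
## [1] 5
abs(x - mean(x))         # elementwise absolute deviations from the mean
## [1] 3 1 1 1 4
mean(abs(x - mean(x)))   # the mean absolute deviation, all in one go
## [1] 2
```

No explicit loop over the elements was needed: the subtraction and abs act on the whole vector at once.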
Exercise 1.5 Sight (without going into detail) the manual on the length function by calling
help("length"). Note that most help pages are structured as follows:
1. Header: package:base means that the function is a base one (see Section 7.3.1 for more
details on the R package system);
2. Title;
3. Description: a short description of what the function does;
4. Usage: the list of formal arguments (parameters) to the function;
5. Arguments: the meaning of each formal argument explained;
6. Details: technical information;
7. Value: return value explained;
8. References: further reading;
9. See Also: links to other help pages;
10. Examples: R code that is worth inspecting.
We can also search within all the installed help pages by calling:
help.search("vague topic") # equivalently: ??"vague topic"
This way, we will be able to find answers to our questions more reliably than when asking DuckDuckGo or G**gle, which commonly return many low-quality, irrelevant, or distracting results from splogs. We do not want to lose the sacred code writer's flow! It is a matter of personal hygiene and good self-discipline.
Important All code chunks, including code comments and textual outputs, form an integral part of this book's text. They should not be skipped by the reader. On the contrary, they must become objects of our intense reflection and thorough investigation.
For instance, whenever we introduce a function, it may be a clever idea to look it up
in the help system. Moreover, playing with the presented code (running, modifying,
experimenting, etc.) is also very beneficial. We should develop the habit of asking
ourselves questions like “What would happen if…”, and then finding the answers on
our own.
We are now ready to discuss the most significant operations on numeric vectors,
which constitute the main theme of the next chapter. See you there.
1.5 Exercises
Exercise 1.6 What are the three most important types of atomic vectors?
Exercise 1.7 According to the classification of the R data types we introduced in the previous
chapter, are atomic vectors basic or compound types?
2 Numeric vectors
Note The exercises that we suggest in the sequel are all self-contained, unless explicitly stated otherwise. The use of language constructs that are yet to be formally introduced (in particular, if, for, and while explained in Chapter 8) is not just unnecessary: it is discouraged. Moreover, we recommend against taking shortcuts by looking up partial solutions on the internet. Rather, to get the most out of this course, we should be seeking relevant information within the current and preceding chapters as well as the R help system.
-3.14
## [1] -3.14
1.23e-4
## [1] 0.000123
The latter is in what we call scientific notation, which is a convenient means of entering numbers of very large or small orders of magnitude. Here, "e" stands for "… times 10 to the power of…". Therefore, 1.23e-4 is equal to 1.23×10⁻⁴ = 0.000123. In other words, given 1.23, we move the decimal separator by four digits towards the left, adding zeroes if necessary.
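A few more instances (a brief sketch; these are all ordinary numeric constants):

```r
1.23e-4 == 0.000123   # the two notations denote exactly the same number
## [1] TRUE
5e3                   # 5 times 10 to the power of 3
## [1] 5000
-1e-2                 # minus 1/100
## [1] -0.01
```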
In real life, some information items may be inherently or temporarily missing, unknown, or Not Available. As R is orientated towards data processing, it was equipped with a special indicator:
NA_real_ # numeric NA (missing value)
## [1] NA
It is similar to the Null marker in database query languages such as SQL. Note that
NA_real_ is displayed simply as “NA”, chiefly for readability.
Moreover, Inf denotes infinity, ∞, i.e., an element that is larger than the largest representable double-precision (64-bit) floating point value. Also, NaN stands for not-a-number, which is returned as the result of some illegal operations, e.g., 0/0 or ∞ − ∞.
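These special values can be produced and tested for directly (a quick sketch; the predicates used here are all base R):

```r
1/0          # overflows to infinity
## [1] Inf
0/0          # an indeterminate form yields not-a-number
## [1] NaN
Inf - Inf
## [1] NaN
is.nan(0/0)  # testing for NaN
## [1] TRUE
is.finite(c(1, Inf, NaN, NA_real_))  # only ordinary numbers are finite
## [1]  TRUE FALSE FALSE FALSE
```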
Let's provide a few ways to create numeric vectors with possibly more than one element.
Note Running help("c"), we will see that its usage is like c(...). In the current
context, this means that the c function takes an arbitrary number of arguments. In
Section 9.4.6, we will study the dot-dot-dot (ellipsis) parameter in more detail.
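For instance (a short illustration of ours, not part of the original text), the most basic way to create a vector is to list its elements as arguments to c; any vectors passed to c are flattened:

```r
c(1, 2, 3)        # combine three single values
## [1] 1 2 3
c(1, c(2, 3), 4)  # nested vectors are flattened
## [1] 1 2 3 4
```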
rep(1, 5)
## [1] 1 1 1 1 1
rep(c(1, 2, 3), 4)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
In the second case, the whole vector (1, 2, 3) has been recycled (tiled) four times. Interestingly, if the second argument is a vector of the same length as the first one, the behaviour will be different:
rep(c(1, 2, 3), c(2, 1, 4))
## [1] 1 1 2 3 3 3 3
rep(c(1, 2, 3), c(4, 4, 4))
## [1] 1 1 1 1 2 2 2 2 3 3 3 3
Important It turns out that the undermentioned function calls are all equivalent:
rep(c(1, 2, 3), 4) # positional matching of arguments: `x`, then `times`
rep(c(1, 2, 3), times=4) # `times` is the second argument
rep(x=c(1, 2, 3), times=4) # keyword arguments of the form name=value
rep(times=4, x=c(1, 2, 3)) # keyword arguments can be given in any order
rep(times=4, c(1, 2, 3)) # mixed positional and keyword arguments
We can also pass each or length.out, but their names must be mentioned explicitly:
rep(c(1, 2, 3), length.out=7)
## [1] 1 2 3 1 2 3 1
rep(c(1, 2, 3), each=3)
## [1] 1 1 1 2 2 2 3 3 3
rep(c(1, 2, 3), length.out=7, each=3)
## [1] 1 1 1 2 2 2 3
Whether the support of varied behaviours inside a single function is desirable is a matter of taste. On the one hand, in all of the preceding examples, we do repeat the input elements somehow, so remembering just one function name is really convenient. Nevertheless, a drastic change in the repetition pattern depending, e.g., on the length of the times argument can be bug-prone. Anyway, we have been warned².
Even though their handling might be a little tricky, we will later see that they are indispensable in contexts like “create an empty data frame with a specific column structure”.
Also, note that R often allows for partial matching of named arguments, but its use is
a bad programming practice; see Section 15.4.4 for more details.
rep(c(1, 2, 3), len=7) # not recommended (see later)
## Warning in rep(c(1, 2, 3), len = 7): partial argument match of 'len' to
## 'length.out'
## [1] 1 2 3 1 2 3 1
From the function’s help page, we discover that seq accepts the from, to, by, and
length.out arguments, amongst others. Thus, the preceding call is equivalent to:
² Some readers might be tempted to implement two separate functions instead: one outputting patterns like (1, 2, 3, 1, 2, 3, …) only and the other outputting patterns like (1, 1, 1, 2, 2, 2, …). They would most likely wrap them in a new package and announce that on social media. But this is nothing else than a multiplication of entities without actual necessity. This way, we would end up with three functions. First is the original one, rep, which everyone ought to know anyway because it is part of the standard library. Second and third are the two redundant procedures whose user-friendliness is only illusory. See also Chapter 9 for a discussion on the design of functions.
We can also pass length.out instead of by. In such a case, the increments or decrements will be computed via the formula ((to - from)/(length.out - 1)). This default value is reported in the Usage section of help("seq").
seq(1, 0, length.out=5)
## [1] 1.00 0.75 0.50 0.25 0.00
seq(length.out=5) # default `from` is 1
## [1] 1 2 3 4 5
Arithmetic progressions with steps equal to 1 or -1 can also be generated via the `:`
operator.
1:10 # seq(1, 10) or seq(1, 10, 1)
## [1] 1 2 3 4 5 6 7 8 9 10
-1:10 # seq(-1, 10) or seq(-1, 10, 1)
## [1] -1 0 1 2 3 4 5 6 7 8 9 10
-1:-10 # seq(-1, -10) or seq(-1, -10, -1)
## [1] -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
Let’s highlight the order of precedence of this operator: -1:10 means (-1):10, and
not -(1:10); compare Section 2.4.3.
Exercise 2.1 Take a look at the manual page of seq_along and seq_len and determine whether we can do without them, having seq at hand.
The distribution of the sampled values does not need to be uniform; the prob argument
may be fed with a vector of the corresponding probabilities. For example, here are 20
independent realisations of the random variable 𝑋 such that Pr(𝑋 = 0) = 0.9 (the
probability that we obtain 0 is equal to 90%) and Pr(𝑋 = 1) = 0.1:
sample(0:1, 20, replace=TRUE, prob=c(0.9, 0.1))
## [1] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
Note If n is a single number (a numeric vector of length 1), then sample(n, ...)
is equivalent to sample(1:n, ...). Similarly, seq(n) is a synonym for seq(1, n)
or seq(1, length(n)), depending on the length of n. This is a dangerous behaviour
that can occasionally backfire and lead to bugs (check what happens when n is, e.g.,
0). Nonetheless, we have been warned. From now on, we are going to be extra careful
(but are we really?). Read more at help("sample") and help("seq").
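To spell out the pitfall (a short illustration of ours): since seq(n) acts like seq(1, n) for a length-one numeric n, we obtain a perhaps unexpected result for n equal to 0, whereas seq_len behaves more predictably:

```r
n <- 0
seq(n)      # like seq(1, 0), i.e., a decreasing sequence!
## [1] 1 0
seq_len(n)  # an empty vector, as arguably expected
## integer(0)
```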
Let’s stress that the numbers we obtain are merely pseudorandom because they are generated algorithmically. R uses the Mersenne-Twister MT19937 method [46] by default; see help("RNG") and [21, 29, 42]. By setting the seed of the random number generator, i.e., resetting its state to a given one, we can obtain results that are reproducible.
set.seed(12345) # seeds are specified with integers
sample(1:10, 5, replace=TRUE) # a,b,c,d,e
## [1] 3 10 8 10 8
sample(1:10, 5, replace=TRUE) # f,g,h,i,j
## [1] 2 6 6 7 10
We did not(?) expect that! And now for something completely different:
set.seed(12345)
sample(1:10, 10, replace=TRUE) # a,b,c,d,e,f,g,h,i,j
## [1] 3 10 8 10 8 2 6 6 7 10
Reproducibility is a crucial feature of each truly scientific experiment. The same initial
condition (here: the same seed) leads to exactly the same outcomes.
Note Some claim that the only unsuspicious seed is 42 but in matters of taste, there can be no disputes. Everyone can use their favourite picks: yours truly savours 123, 1234, and 12345 as well.
When performing many runs of Monte Carlo experiments, it may also be a clever idea
to call set.seed(i) in the 𝑖-th iteration of a simulation we are trying to program.
We should ensure that our seed settings are applied consistently across all our scripts.
Otherwise, we might be accused of tampering with evidence. For instance, here is the
ultimate proof that we are very lucky today:
set.seed(1679619) # totally unsuspicious, right?
sample(0:1, 20, replace=TRUE) # so random
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
This is exactly why reproducible scripts and auxiliary data should be published alongside all research reports or papers. Only open, transparent science can be fully trustworthy.
If set.seed is not called explicitly, and the random state is not restored from the previously saved R session (see Chapter 16), then the random generator is initialised based on the current wall time and the identifier of the running R instance (PID). This may justify the impression that the numbers we generate appear surprising.
To understand the “pseudo” part of the said randomness better, in Section 8.3, we will
build a very simple random generator ourselves.
The four header lines that begin with “#” merely serve as comments for us humans.
They should be ignored by the interpreter. The first “real” value, NA, corresponds to
1 January (Wednesday, New Year’s Day; Forex markets were closed, hence a missing
observation).
We can invoke the scan function to read all the inputs and convert them to a single
numeric vector:
4 https://github.com/gagolews/teaching-data/raw/master/marek/euraud-20200101-20200630.csv
scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
## [1] NA 1.6006 1.6031 NA NA 1.6119 1.6251 1.6195 1.6193 1.6132
## [11] NA NA 1.6117 1.6110 1.6188 1.6115 1.6122 NA NA 1.6154
## [21] 1.6177 1.6184 1.6149 1.6127 NA NA 1.6291 1.6290 1.6299 1.6412
## [31] 1.6494 NA NA 1.6521 1.6439 1.6299 1.6282 1.6417 NA NA
## [41] 1.6373 1.6260 1.6175 1.6138 1.6151 NA NA 1.6129 1.6195 1.6142
## [51] 1.6294 1.6363 NA NA 1.6384 1.6442 1.6565 1.6672 1.6875 NA
## [61] NA 1.6998 1.6911 1.6794 1.6917 1.7103 NA NA 1.7330 1.7377
## [71] 1.7389 1.7674 1.7684 NA NA 1.8198 1.8287 1.8568 1.8635 1.8226
## [81] NA NA 1.8586 1.8315 1.7993 1.8162 1.8209 NA NA 1.8021
## [91] 1.7967 1.8053 1.7970 1.8004 NA NA 1.7790 1.7578 1.7596
## [ reached getOption("max.print") -- omitted 83 entries ]
We used the paste0 function (Section 6.1.3) to concatenate two long strings (too long
to fit a single line of code) and form a single URL.
We can also read the files located on our computer. For example:
scan("~/Projects/teaching-data/marek/euraud-20200101-20200630.csv",
comment.char="#")
It used an absolute file path that starts at the user’s home directory, denoted “~”. In yours truly’s case, it is /home/gagolews.
Note For portability reasons, we suggest slashes, “/”, as path separators; see also
help("file.path") and help(".Platform"). They are recognised by all UNIX-like
boxes as well as by other popular operating systems, including W*****ws. Note that
URLs, such as https://deepr.gagolewski.com/, consist of slashes too.
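For example (an illustration of ours), file.path glues path components using the portable separator:

```r
file.path("~", "Projects", "teaching-data", "marek", "euraud-20200101-20200630.csv")
## [1] "~/Projects/teaching-data/marek/euraud-20200101-20200630.csv"
```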
Paths can also be relative to the current working directory, denoted “.”, which can be
read via a call to getwd. Usually, it is the directory from where the R session has been
started.
For instance, if the working directory was /home/gagolews/Projects/teaching-data/marek, we could write the file path equivalently as ./euraud-20200101-20200630.csv or even euraud-20200101-20200630.csv.
On a side note, “..” marks the parent directory of the current working directory. In the above example, ../r/iris.csv is equivalent to /home/gagolews/Projects/teaching-data/r/iris.csv.
Exercise 2.2 Read the help page about scan. Take note of the following formal arguments and
their meaning: dec, sep, what, comment.char, and na.strings.
Later, we will discuss the read.table and read.csv functions. They are wrappers around scan that read structured data. Also, write exports an atomic vector’s contents to a text file.
Example 2.3 Figure 2.1 shows the graph of the aforementioned exchange rates, which was gen-
erated by calling:
plot(scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#"),
xlab="Day", ylab="EUR/AUD")
Figure 2.1. EUR/AUD exchange rates from 2020-01-01 (day 1) to 2020-06-30 (day 182).
Somewhat misleadingly (and for reasons that will become apparent later), the documentation of
plot can be accessed by calling help("plot.default"). Read about, and experiment with,
different values of the main, xlab, ylab, type, col, pch, cex, lty, and lwd arguments. More
plotting routines will be discussed in Chapter 13.
Important In R, all names are case-sensitive. Hence, x and X can coexist peacefully:
when set, they refer to two different objects. If we tried calling Print(x), print(X),
or PRINT(x), we would get an error.
It is customary to denote, e.g.:
• vectors by x, y, z,
• matrices (and matrix-like objects) by A, B, …, X, Y, Z,
• integer indexes by letters i, j, k, l,
• object sizes by n, m, d, p or nx, ny, etc.,
especially when they are only of temporary nature (for storing auxiliary results, iterating over collections of objects, etc.).
There are numerous naming conventions that we can adopt, but most often they are a matter of taste; snake_case, lowerCamelCase, UpperCamelCase, flatcase, or dot.case are equally sound as long as they are used coherently (for instance, some use snake_case for vectors and UpperCamelCase for functions). Occasionally, we have little choice but to adhere to the naming conventions of the project we are about to contribute to.
Note Generally, a dot, “.”, has no special meaning; na.omit is as appropriate a name as na_omit, naOmit, NAOMIT, naomit, and NaOmit. Readers who know other programming languages will need to habituate themselves to this convention.
R, as a dynamic language, allows for introducing new variables at any time. Moreover,
existing names can be bound to new values. For instance:
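The original listing is not reproduced in this excerpt; judging by the note that follows, it was along the lines of (our sketch):

```r
x <- 1:3           # bind the name x to a newly created vector
print(x)
## [1] 1 2 3
x <- c(0.25, 0.5)  # rebind x to a brand-new object
print(x)
## [1] 0.25 0.50
```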
Note Objects are automatically destroyed when we cannot access them anymore. By
now, the garbage collector is likely to have got rid of the foregoing 1:3 vector (to which
the name x was bound previously).
Here, vectorised means that instead of being defined to act on a single numeric value,
they are applied on each element in a vector. The 𝑖-th resulting item is a transformed
version of the 𝑖-th input:
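The code listing itself is absent from this excerpt; based on the outputs discussed next, it presumably resembled (a reconstruction, not necessarily the author’s original):

```r
y <- sqrt(c(4, 2, -1))  # the square root of each element
## Warning in sqrt(c(4, 2, -1)): NaNs produced
y
## [1] 2.0000 1.4142    NaN
```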
To attract our attention to the fact that computing the square root of a negative value is a reckless act, R generated an informative warning. However, a warning is not an error: the result is being produced as usual. In this case, the ill value is marked as not-a-number.
Also, the fact that the irrational √2 is displayed⁷ as 1.4142 does not mean that it is such a crude approximation to 1.414213562373095048801688724209698.... It was rounded when printing purely for aesthetic reasons. In fact, in Section 3.2.3, we will point out that the computer’s floating-point arithmetic has roughly 16 decimal digits precision (but we shall see that the devil is in the detail).
print(y, digits=16) # display more significant figures
## [1] 2.000000000000000 1.414213562373095 NaN
2.3.2 Rounding
The following functions drop all or portions of fractional parts of numbers:
• floor(x) (rounds down to the nearest integer, denoted ⌊𝑥⌋),
• ceiling(x) (rounds up, denoted ⌈𝑥⌉ = −⌊−𝑥⌋),
• trunc(x) (rounds towards zero),
• round(x, digits=0) (rounds to the nearest number with digits decimal digits).
For instance:
x <- c(7.0001, 6.9999, -4.3149, -5.19999, 123.4567, -765.4321, 0.5, 1.5, 2.5)
floor(x)
## [1] 7 6 -5 -6 123 -766 0 1 2
ceiling(x)
## [1] 8 7 -4 -5 124 -765 1 2 3
trunc(x)
## [1] 7 6 -4 -5 123 -765 0 1 2
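round itself deserves a remark (our illustration): in R, halfway cases are rounded to the nearest even number (the IEEE 754 recommendation), not always upwards as the school rule would have it:

```r
round(x)                      # note the halves above: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2
## [1]    7    7   -4   -5  123 -765    0    2    2
round(c(0.5, 1.5, 2.5, 3.5))  # round half to even
## [1] 0 2 2 4
```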
Note When we write that a function’s usage is like round(x, digits=0), compare help("round"), we mean that the digits parameter is equipped with the default value of 0. In other words, if rounding to 0 decimal digits is what we need, the second argument can be omitted.
7 There are a couple of settings in place that control the default behaviour of the print function; see
width, digits, max.print, OutDec, scipen, etc. in help("options").
These functions enjoy a number of very valuable identities and inequalities. In particular, we should know from school that log(𝑥 ⋅ 𝑦) = log 𝑥 + log 𝑦, log(𝑥ʸ) = 𝑦 log 𝑥, and 𝑒ˣ⁺ʸ = 𝑒ˣ ⋅ 𝑒ʸ.
For the logarithm to a different base, say, log10 𝑥, we can call:
log(c(0, 1, 10, 100, 1000, 1e10), 10) # or log(..., base=10)
## [1] -Inf 0 1 2 3 10
Let’s highlight that 𝑒ˣ on the log-scale is nothing more than a straight line. Such a transformation of the axes can only be applied in the case of values strictly greater than 0.
(Figure: the graph of exp(x) for x ∈ [0, 10], drawn in the linear (left) and the logarithmic (right) scale.)
– *hyper (hypergeometric),
– *nbinom (negative binomial);
prefixes “p” and “r” retain their meaning, however:
– d now gives the probability mass function (PMF),
– q brings about the quantile function, defined as a generalised inverse of the
CDF.
Each distribution is characterised by a set of underlying parameters. For instance, a normal distribution N(𝜇, 𝜎) can be pinpointed by setting its expected value 𝜇 ∈ ℝ and standard deviation 𝜎 > 0. In R, these two have been named mean and sd, respectively; see help("dnorm"). Therefore, e.g., dnorm(x, 1, 2) computes the PDF of N(1, 2) at x.
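As a sanity check (ours, not the book’s), we can compare dnorm against the textbook formula for the normal density, f(x) = exp(−(x − μ)²/(2σ²))/(σ√(2π)):

```r
mu <- 1; sigma <- 2
x <- c(-1, 0, 1, 3)
d1 <- dnorm(x, mu, sigma)                              # built-in PDF
d2 <- exp(-(x - mu)^2/(2*sigma^2))/(sigma*sqrt(2*pi))  # computed by hand
all.equal(d1, d2)
## [1] TRUE
```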
Note The parametrisations assumed in R can be subtly different from what we know from statistical textbooks or probability courses. For example, the normal distribution can be identified based on either standard deviation or variance, and the exponential distribution can be defined via expected value or its reciprocal. We thus advise the reader to study carefully the documentation of help("dnorm"), help("dunif"), help("dexp"), help("dbinom"), and the like.
It is also worth knowing the typical use cases of each of the distributions listed, e.g., a Poisson distribution can describe the probability of observing the number of independent events in a fixed time interval (e.g., the number of users downloading a copy of R from CRAN per hour), and an exponential distribution can model the time between such events; compare [23].
Exercise 2.5 A call to hist(x) draws a histogram, which can serve as an estimator of the underlying continuous probability density function of a given sample; see Figure 2.3 for an illustration.
par(mfrow=c(1, 2)) # two plots in one figure
# left subplot: uniform U(0, 1)
hist(runif(10000, 0, 1), col="white", probability=TRUE, main="")
x <- seq(0, 1, length.out=101)
lines(x, dunif(x, 0, 1), lwd=2) # draw the true density function (PDF)
# right subplot: normal N(0, 1)
hist(rnorm(10000, 0, 1), col="white", probability=TRUE, main="")
x <- seq(-4, 4, length.out=101)
lines(x, dnorm(x, 0, 1), lwd=2) # draw the PDF
Draw a histogram of some random samples of different sizes n from the following distributions:
• rnorm(n, µ, σ) – normal N(𝜇, 𝜎) with expected values 𝜇 ∈ {−1, 0, 5} (i.e., 𝜇 being equal to either −1, 0, or 5; read “∈” as “belongs to the given set” or “in”) and standard deviations 𝜎 ∈ {0.5, 1, 5};
Figure 2.3. Example histograms of some pseudorandom samples and the true underlying probability density functions: the uniform distribution on the unit interval (left) and the standard normal distribution (right).
On the other hand, the probability that we throw no more than three “1”s, 𝑃(𝐶 ≤ 3), can be determined by means of the cumulative distribution function, pbinom:
pbinom(3, 12, 1/6) # or: 1 - pbinom(3, 12, 1/6, lower.tail=FALSE)
## [1] 0.87482
The smallest 𝑐 such that 𝑃(𝐶 ≤ 𝑐) ≥ 0.95 can be computed based on the quantile function:
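The corresponding listing is missing from this excerpt; given the conclusion below, it was presumably a call like (a reconstruction):

```r
qbinom(0.95, 12, 1/6)  # the smallest c such that P(C <= c) >= 0.95
## [1] 4
```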
In other words, at least 95% of the time, we will be observing no more than four successes.
Also, here are 30 pseudorandom realisations (simulations) of the random variable 𝐶:
rbinom(30, 12, 1/6) # how many successes in 12 trials, repeated 30 times
## [1] 1 3 2 4 4 0 2 4 2 2 4 2 3 2 0 4 1 0 1 4 4 3 2 6 2 3 2 2 1 1
The Γ function grows so rapidly that already gamma(172) gives rise to Inf. It is due to
the fact that a computer’s arithmetic is not infinitely precise; compare Section 3.2.3.
Special functions are plentiful; see the open-access NIST Digital Library of Mathematical
Functions [51] for one of the most definitive references (and also [2] for its predecessor).
R package gsl [33] provides a vectorised interface to the GNU GSL [28] library, which
implements many of such routines.
Exercise 2.7 The Pochhammer symbol, (𝑎)ₓ = Γ(𝑎 + 𝑥)/Γ(𝑎), can be computed via a call to gsl::poch(a, x), i.e., the poch function from the gsl package:
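The example itself is not reproduced here. Readers without the gsl package can emulate it in base R via the log-gamma function (our sketch, relying on the identity above):

```r
poch <- function(a, x) exp(lgamma(a + x) - lgamma(a))  # (a)_x = Gamma(a+x)/Gamma(a)
poch(5, 3)  # 5*6*7
## [1] 210
```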
Read the documentation of the corresponding gsl_sf_poch function in the GNU GSL manual.
And when you are there, do not hesitate to go through the list of all functions, including those
related to statistics, permutations, combinations, and so forth.
Many functions also have their logarithm-of versions; see, e.g., lgamma and lbeta.
Also, for instance, dnorm and dbeta have the log parameter. Their classical use case
is the (numerical) maximum likelihood estimation, which involves the sums of the
logarithms of densities.
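For instance (our illustration), the log-likelihood of a sample under the standard normal model is better computed as a sum of log-densities than as the logarithm of a product, which can underflow for longer samples:

```r
z <- c(-0.5, 0.1, 0.8)
sum(dnorm(z, 0, 1, log=TRUE))  # numerically stable log-likelihood
log(prod(dnorm(z, 0, 1)))      # mathematically the same, but underflow-prone
```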
The operation was performed in an elementwise fashion on the corresponding pairs of elements from both vectors. The first element in the left sequence was multiplied by the corresponding element in the right vector, and the result was stored in the first element of the output. Then, the second element in the left… all right, we get it.
Other operators behave similarly:
0:10 + seq(0, 1, 0.1)
## [1] 0.0 1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11.0
0:7 / rep(3, length.out=8) # division by 3
## [1] 0.00000 0.33333 0.66667 1.00000 1.33333 1.66667 2.00000 2.33333
0:7 %/% rep(3, length.out=8) # integer division
## [1] 0 0 0 1 1 1 2 2
0:7 %% rep(3, length.out=8) # division remainder
## [1] 0 1 2 0 1 2 0 1
0:7 / 3
## [1] 0.00000 0.33333 0.66667 1.00000 1.33333 1.66667 2.00000 2.33333
1:10 * c(-1, 1)
## [1] -1 2 -3 4 -5 6 -7 8 -9 10
2 ^ (0:10)
## [1] 1 2 4 8 16 32 64 128 256 512 1024
Vectorisation and the recycling rule are perhaps most fruitful when applying binary
operators on sequences of identical lengths or when performing vector-scalar (i.e., a
sequence vs a single value) operations. However, there is much more: schemes like
“every 𝑘-th element” appear in Taylor series expansions (multiply by c(-1, 1)), 𝑘-fold
cross-validation, etc.; see also Section 11.3.4 for use cases in matrix/tensor processing.
Also, pmin and pmax return the parallel minimum and maximum of the corresponding elements of the input vectors. They behave like the arithmetic operators, but we call them as ordinary functions:
pmin(c(1, 2, 3, 4), c(4, 2, 3, 1))
## [1] 1 2 3 1
pmin(3, 1:5)
## [1] 1 2 3 3 3
pmax(0, pmin(1, c(0.25, -2, 5, -0.5, 0, 1.3, 0.99))) # clipping to [0, 1]
## [1] 0.25 0.00 1.00 0.00 0.00 1.00 0.99
Note Some functions can be very deeply vectorised, i.e., with respect to multiple arguments. For example:
9 A few functions do not warn us whatsoever when they perform incomplete recycling (e.g., paste) or
can even give an error (e.g., as.data.frame.list). Consider this inconsistency an annoying bug and hope
it will be fixed, in the next decade or so.
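The listing is absent from this excerpt; judging by the sentence that follows, it was a call of the form (a reconstruction):

```r
runif(3, c(10, 20, 30), c(11, 22, 33))  # min and max vary from element to element
```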
generates three random numbers uniformly distributed over the intervals (10, 11),
(20, 22), and (30, 33), respectively.
Let’s list the operators mentioned so far in their order of precedence, from the least to the
most binding (see also help("Syntax")):
1. `<-` (right to left),
2. `+` and `-` (binary),
3. `*` and `/`,
4. `%%` and `%/%`,
5. `:`,
6. `+` and `-` (unary),
7. `^` (right to left).
Hence, -2^2/3+3*4 means ((-(2^2))/3)+(3*4) and not, e.g., -((2^(2/(3+3)))*4).
Notice that `+` and `-`, `*` and `/`, as well as `%%` and `%/%` have the same priority.
Expressions involving a series of operations in the same group are evaluated left to
right, with the exception of `^` and `<-`, which are performed the other way around.
Therefore:
• 2*3/4*5 is equivalent to ((2*3)/4)*5,
• 2^3^4 is 2^(3^4) because, mathematically, we would write it as 2^(3^4) = 2^81,
• “x <- y <- 4*3%%8/2” binds both y and x to 6, not x to the previous value of y
and then y to 6.
When in doubt, we can always bracket a subexpression to ensure it is executed in the
intended order. It can also increase the readability of our code.
2.4.4 Accumulating
The `+` and `*` operators, as well as the pmin and pmax functions, implement elementwise operations that are applied on the corresponding elements taken from two given vectors. However, we can also scan through all the values in a single vector and combine the successive elements that we inspect using the corresponding operation:
• cumsum(x) gives the cumulative sum of the elements in a vector,
• cumprod(x) computes the cumulative product,
• cummin(x) yields the cumulative minimum,
• cummax(x) breeds the cumulative maximum.
The 𝑖-th element in the output vector will consist of the sum/product/min/max of the
first 𝑖 inputs. For example:
cumsum((𝑥₁, 𝑥₂, 𝑥₃, …, 𝑥ₙ)) = (𝑥₁, 𝑥₁ + 𝑥₂, 𝑥₁ + 𝑥₂ + 𝑥₃, …, 𝑥₁ + 𝑥₂ + 𝑥₃ + ⋯ + 𝑥ₙ).
cumsum(1:8)
## [1] 1 3 6 10 15 21 28 36
cumprod(1:8)
## [1] 1 2 6 24 120 720 5040 40320
cummin(c(3, 2, 4, 5, 1, 6, 0))
## [1] 3 2 2 2 1 1 0
cummax(c(3, 2, 4, 5, 1, 6, 0))
## [1] 3 3 4 5 5 6 6
Example 2.8 On a side note, diff can be considered an inverse to cumsum. It computes the iterated difference: it subtracts the first element from the second one, the second from the third, the third from the fourth, and so on. In other words, diff(x) gives 𝒚 such that 𝑦ᵢ = 𝑥ᵢ₊₁ − 𝑥ᵢ.
x <- c(-2, 3, 6, 2, 15)
diff(x)
## [1] 5 3 -4 13
cumsum(diff(x))
## [1] 5 8 4 17
cumsum(c(-2, diff(x))) # recreates x
## [1] -2 3 6 2 15
Thanks to diff, we can compute the daily changes to the EUR/AUD forex rates studied earlier;
see Figure 2.4.
Figure 2.4. Iterated differences of the exchange rates (non-missing values only).
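The plotting code is not shown in this excerpt; a similar picture can be obtained, e.g., as follows (a sketch of ours; the NAs have to be dropped before differencing):

```r
aud <- scan(paste0("https://github.com/gagolews/teaching-data/raw/",
    "master/marek/euraud-20200101-20200630.csv"), comment.char="#")
plot(diff(aud[!is.na(aud)]), type="l", xlab="Index", ylab="")
```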
2.4.5 Aggregating
If we are only concerned with the last cumulant, which summarises all the inputs, we have the following¹⁰ functions at our disposal:
• sum(x) computes the sum of elements in a vector, ∑ᵢ₌₁ⁿ 𝑥ᵢ = 𝑥₁ + 𝑥₂ + ⋯ + 𝑥ₙ,
• prod(x) outputs the product of all elements, ∏ᵢ₌₁ⁿ 𝑥ᵢ = 𝑥₁ 𝑥₂ ⋯ 𝑥ₙ,
• min(x) determines the minimum,
• max(x) reckons the greatest value.
sum(1:8)
## [1] 36
prod(1:8)
## [1] 40320
min(c(3, 2, 4, 5, 1, 6, 0))
## [1] 0
10 Chapter 7 will discuss the Reduce function, which generalises the above by allowing any binary operation.
The foregoing functions form the basis for the popular summary statistics¹¹ (sample
aggregates) such as:
• mean(x) gives the arithmetic mean, sum(x)/length(x),
• var(x) yields the (unbiased) sample variance, sum((x-mean(x))^2)/(length(x)-1),
• sd(x) is the standard deviation, sqrt(var(x)).
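These identities are easy to verify (our illustration):

```r
x <- c(1, 2, 5, 10)
c(mean(x), sum(x)/length(x))  # two ways, one result
## [1] 4.5 4.5
all.equal(var(x), sum((x - mean(x))^2)/(length(x) - 1))
## [1] TRUE
```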
Furthermore, median(x) computes the sample median, i.e., the middle value in the sorted version of x.
For instance:
x <- runif(1000)
c(min(x), mean(x), median(x), max(x), sd(x))
## [1] 0.00046535 0.49727780 0.48995025 0.99940453 0.28748391
Exercise 2.9 Let 𝒙 be any vector of length 𝑛 with positive elements. Compute its geometric and harmonic mean, which are given by, respectively,

(∏ᵢ₌₁ⁿ 𝑥ᵢ)^(1/𝑛) = 𝑒^((1/𝑛) ∑ᵢ₌₁ⁿ log 𝑥ᵢ)    and    𝑛 / ∑ᵢ₌₁ⁿ (1/𝑥ᵢ).
When solving exercises like this one, it does not really matter what data you apply these functions
on. We are being abstract in the sense that the 𝒙 vector can be anything: from the one that features
very accurate socioeconomic predictions that will help make this world less miserable, through
the data you have been collecting for the last ten years in relation to your super important PhD
research, whatever your company asked you to crunch today, to something related to the hobby
project that you enjoy doing after hours. But you can also just test the above on something like “x <- runif(10)”, and move on.
All aggregation functions return a missing value if any of the input elements is unavailable. Luckily, they are equipped with the na.rm parameter, on behalf of which we can request the removal of NAs.
aud <- scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
c(min(aud), mean(aud), max(aud))
## [1] NA NA NA
c(min(aud, na.rm=TRUE), mean(aud, na.rm=TRUE), max(aud, na.rm=TRUE))
## [1] 1.6006 1.6775 1.8635
11 Actually, var and median, amongst others, are defined by the stats package. But this one is automatically loaded when an R session begins.
Note In the documentation, we read that the usage of sum, prod, min, and max is like
sum(..., na.rm=FALSE), etc. In this context, it means that they accept any number
of input vectors, and each of them can be of arbitrary length. Therefore, min(1, 2,
3), min(c(1, 2, 3)) as well as min(c(1, 2), 3) all return the same result.
However, we also read that we have mean(x, trim=0, na.rm=FALSE, ...). This time,
only one vector can be aggregated, and any further arguments (except trim and na.rm)
are ignored.
The extra flexibility (which we do not have to rely on, ever) of the former group is due to
their being associative operations. We have, e.g., (2+3)+4 = 2+(3+4). Hence, these
operations can be performed in any order, in any group. They are primitive operations:
it is mean that is based on sum, not vice versa.
2.5 Exercises
Exercise 2.10 Answer the following questions.
• What is the meaning of the dot-dot-dot parameter in the definition of the c function?
• We say that the round function is vectorised. What does that mean?
• What is wrong with a call to c(sqrt(1), sqrt(2), sqrt(3))?
• What do we mean by saying that multiplication operates element by element?
• How does the recycling rule work when applying `+`?
• How to (and why) set the seed of the pseudorandom number generator?
• What is the difference between NA_real_ and NaN?
• How are default arguments specified in the manual of, e.g., the round function?
• Is a call to rep(times=4, x=1:5) equivalent to rep(4, 1:5)?
• List a few ways to generate a sequence like (-1, -0.75, -0.5, …, 0.75, 1).
• Is -3:5 the same as -(3:5)? What about the precedence of operators in expressions such as
2^3/4*5^6, 5*6+4/17%%8, and 1+-2^3:4-1?
• If x is a numeric vector of length 𝑛 (for some 𝑛 ≥ 0), how many values will sample(x)
output?
• Does scan support reading directly from compressed archives, e.g., .csv.gz files?
When in doubt, refer back to the material discussed in this chapter or the R manual.
Exercise 2.11 Thanks to vectorisation, implementing an example graph of arcsine and arccosine is straightforward.
x <- seq(-1, 1, length.out=11) # increase length.out for a smoother curve
plot(x, asin(x), # asin() computed for 11 points
type="l", # lines
ylim=c(-pi/2, pi), # y axis limits like c(y_min, y_max)
ylab="asin(x), acos(x)") # y axis label
lines(x, acos(x), col="red", lty="dashed") # adds to the current plot
legend("topright", c("asin(x)", "acos(x)"),
lty=c("solid", "dashed"), col=c("black", "red"), bg="white")
Thusly inspired, plot the following functions: |sin 𝑥²|, |sin |𝑥||, √⌊𝑥⌋, and 1/(1 + 𝑒⁻ˣ). Recall that the documentation of plot can be accessed by calling help("plot.default").
Exercise 2.12 The expression:

4 ∑ᵢ₌₁ⁿ (−1)ⁱ⁺¹/(2𝑖 − 1) = 4 (1/1 − 1/3 + 1/5 − 1/7 + ⋯)
To make sure you have come up with a correct implementation, compare your result to a call to
cor(x, y).
Exercise 2.14 (*) Find an R package providing a function to compute moving (rolling) averages and medians of a given vector. Apply them on the EUR/AUD currency exchange data. Draw the smoothened versions of the time series thus obtained.
Exercise 2.15 (**) Use a call to convolve(..., type="filter") to compute the 𝑘-moving
average of a numeric vector.
In the next chapter, we will study operations that involve logical values.
3
Logical vectors
Note By default, T is a synonym for TRUE and F stands for FALSE. However, these are not reserved keywords and can be reassigned to any other values. Therefore, we advise against relying on them: they are not used throughout this course.
Also, notice that the logical missing value is spelled simply as NA, and not NA_logical_.
Both the logical NA and the numeric NA_real_ are, for the sake of our widely-conceived
wellbeing, both printed as “NA” on the R console. This, however, does not mean they are
identical; see Section 4.1 for discussion.
Thus, they operate in an elementwise manner. Moreover, the recycling rule is applied
if necessary:
3 < 1:5 # c(3, 3, 3, 3, 3) < c(1, 2, 3, 4, 5)
## [1] FALSE FALSE FALSE TRUE TRUE
c(1, 4) == 1:4 # c(1, 4, 1, 4) == c(1, 2, 3, 4)
## [1] TRUE FALSE FALSE TRUE
Therefore, we can say that they are vectorised in the same manner as the arithmetic
operators `+`, `*`, etc.; compare Section 2.4.1.
Note
2.23e-308 == 0.00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
0000000223
1.79e308 == 179000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000
Every numeric value takes 8 bytes (or, equivalently, 64 bits) of memory. We are, however, able to store only about 15–17 decimal digits:
print(0.12345678901234567890123456789012345678901234, digits=22) # 22 is max
## [1] 0.1234567890123456773699
which limits the precision of our computations. The about part is, unfortunately, due
to the numbers’ being written in the computer-friendly binary, not the human-aligned
decimal base. This can lead to unexpected outcomes.
In particular:
• 0.1 cannot be represented exactly, for it cannot be written as a finite series of re-
ciprocals of powers of 2 (we have 0.1 = 2^(-4) + 2^(-5) + 2^(-8) + 2^(-9) + …). This leads
to surprising results such as:
0.1 + 0.1 + 0.1 == 0.3
## [1] FALSE
• All integers between −2^53 and 2^53 are stored exactly. This is good news. However,
the next integer is beyond the representable range:
2^53 + 1 == 2^53
## [1] TRUE
• The order of operations may matter. In particular, the associativity property can
be violated when dealing with numbers of contrasting orders of magnitude:
2^53 + 2^-53 - 2^53 - 2^-53 # should be == 0.0
## [1] -1.1102e-16
• Some numbers may just be too large, too small, or too close to zero to be repres-
ented exactly:
c(sum(2^((1023-52):1023)), sum(2^((1023-53):1023)))
## [1] 1.7977e+308 Inf
c(2^(-1022-52), 2^(-1022-53))
## [1] 4.9407e-324 0.0000e+00
Important The double-precision floating point format (IEEE 754) is not specific to R.
It is used by most other computing environments, including Python and C++.
For discussion, see [32, 35, 42], and the more statistically-orientated [31].
Instead, they are so close that we can treat the difference between them as negligible. Thus,
in practice, instead of testing if 𝑥 = 𝑦, we will be considering:
• |𝑥 − 𝑦| (absolute error), or
• |𝑥 − 𝑦|/|𝑦| (relative error; it takes the order of magnitude of the numbers into
account but obviously cannot be applied if 𝑦 is very close to 0),
and determining if these are less than an assumed error margin, 𝜀 > 0, say, 10−8 or
2−26 . For example:
abs(sin(pi) - 0) < 2^-26
## [1] TRUE
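Base R also ships all.equal, which performs precisely this kind of tolerance-based comparison (by default, the mean relative difference is tested against circa 1.5 × 10⁻⁸); a quick illustration:

```r
all.equal(0.1 + 0.1 + 0.1, 0.3)  # difference within the default tolerance
## [1] TRUE
isTRUE(all.equal(1, 2))  # on mismatch, all.equal returns a message string,
## [1] FALSE             # hence the isTRUE wrapper
```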
Note Rounding can sometimes have a similar effect as testing for almost equality in
terms of the absolute error.
round(sin(pi), 8) == 0
## [1] TRUE
Important The foregoing recommendations are valid for the most popular applic-
ations of R, i.e., statistical and, more generally, scientific computing1 . Our datasets
usually do not represent accurate measurements. Bah, the world itself is far from
ideal! Therefore, we do not have to lose sleep over our not being able to precisely pin-
point the exact solutions.
tioned issue with 0.1). There are some libraries implementing higher precision floating-point numbers or
even interval arithmetic that keeps track of error propagation in operation chains.
44 I DEEP
Important The vectorised `&` and `|` operators should not be confused with their
scalar, short-circuit counterparts, `&&` and `||`; see Section 8.1.4.
Note all will frequently be used in conjunction with `==` as the latter is itself vector-
ised: it does not test whether a vector as a whole is equal to another one.
z <- c(1, 2, 3)
z == 1:3 # elementwise equal
## [1] TRUE TRUE TRUE
all(z == 1:3) # elementwise equal summarised
## [1] TRUE
However, let’s keep in mind the warning about the testing for exact equality of floating-
point numbers.
We can also call sum on a logical vector. Taking into account that it interprets TRUE as
numeric 1 and FALSE as 0 (more on this in Section 4.1), it will give us the number of
elements equal to TRUE.
sum(x <= 0.2) # how many elements in x are <= 0.2?
## [1] 1998
Naturally, we expect mean(runif(n) <= 0.2) to be equal to 0.2 (20%), but with ran-
domness, we can never be sure.
𝑦_𝑖 = { 𝑡_𝑖   if 𝑙_𝑖 is TRUE,
      { 𝑓_𝑖   if 𝑙_𝑖 is FALSE.
In other words, the 𝑖-th element of the result vector is equal to 𝑡𝑖 if 𝑙𝑖 is TRUE and to 𝑓𝑖
otherwise. For example:
(z <- rnorm(6)) # example vector
## [1] -0.560476 -0.230177 1.558708 0.070508 0.129288 1.715065
ifelse(z >= 0, z, -z) # like abs(z)
## [1] 0.560476 0.230177 1.558708 0.070508 0.129288 1.715065
or:
(x <- rnorm(6)) # example vector
## [1] 0.46092 -1.26506 -0.68685 -0.44566 1.22408 0.35981
(y <- rnorm(6)) # example vector
## [1] 0.40077 0.11068 -0.55584 1.78691 0.49785 -1.96662
ifelse(x >= y, x, y) # like pmax(x, y)
## [1] 0.46092 0.11068 -0.55584 1.78691 1.22408 0.35981
We should not be surprised anymore that the recycling rule is fired up when necessary:
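For instance (a sketch on made-up data; both the scalar and the two-element vector are recycled to the length of the test vector):

```r
z <- c(-1.5, -2.5, 0.5, 3.5)
ifelse(z >= 0, z, 0)  # the scalar 0 is recycled
## [1] 0.0 0.0 0.5 3.5
ifelse(z >= 0, z, c(-100, -200))  # c(-100, -200) is recycled too
## [1] -100.0 -200.0    0.5    3.5
```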
Note All arguments are evaluated in their entirety before deciding on which elements
are selected. Therefore, the following call generates a warning:
ifelse(z >= 0, log(z), NA_real_)
## Warning in log(z): NaNs produced
## [1] NA NA 0.44386 -2.65202 -2.04571 0.53945
This is because, with log(z), we compute the logarithms of negative values anyway.
To fix this, we can write:
log(ifelse(z >= 0, z, NA_real_))
## [1] NA NA 0.44386 -2.65202 -2.04571 0.53945
In case we yearn for an if…else if…else-type expression, the calls to ifelse can
naturally be nested.
Example 3.2 A version of pmax(pmax(x, y), z) can be written as:
ifelse(x >= y,
ifelse(z >= x, z, x),
ifelse(z >= y, z, y)
)
## [1] 0.46092 0.11068 1.55871 1.78691 1.22408 1.71506
However, determining three intermediate logical vectors is not necessary. We can save one call to
`>=` by introducing an auxiliary variable:
xy <- ifelse(x >= y, x, y)
ifelse(z >= xy, z, xy)
## [1] 0.46092 0.11068 1.55871 1.78691 1.22408 1.71506
Exercise 3.3 Figure 3.1 depicts a realisation of the mixture 𝑍 = 0.2𝑋 + 0.8𝑌 of two normal
distributions 𝑋 ∼ N(−2, 0.5) and 𝑌 ∼ N(3, 1).
n <- 100000
z <- ifelse(runif(n) <= 0.2, rnorm(n, -2, 0.5), rnorm(n, 3, 1))
hist(z, breaks=101, probability=TRUE, main="", col="white")
In other words, we generated a variate from the normal distribution that has the expected value
of −2 with probability 20%, and from the one with the expectation of 3 otherwise. Thus inspired,
generate the Gaussian mixtures:
• (2/3) 𝑋 + (1/3) 𝑌, where 𝑋 ∼ N(100, 16) and 𝑌 ∼ N(116, 8),
• 0.3𝑋 + 0.4𝑌 + 0.3𝑍, where 𝑋 ∼ N(−10, 2), 𝑌 ∼ N(0, 2), and 𝑍 ∼ N(10, 2).
(Figure 3.1: a histogram of z.)
(*) On a side note, knowing that if 𝑋 follows N(0, 1), then the scaled-shifted 𝜎𝑋 + 𝜇 is distrib-
uted N(𝜇, 𝜎), the above can be equivalently written as:
w <- (runif(n) <= 0.2)
z <- rnorm(n, 0, 1)*ifelse(w, 0.5, 1) + ifelse(w, -2, 3)
3.5 Exercises
Exercise 3.4 Answer the following questions.
• Why is the statement “The Earth is flat or the smallpox vaccine is proven effective” obviously
true?
• What is the difference between NA and NA_real_?
• Why is “FALSE & NA” equal to FALSE, but “TRUE & NA” is NA?
• Why does ifelse(x>=0, sqrt(x), NA_real_) have a tendency to generate warnings, and
how can we rewrite it so as to prevent that from happening?
• What is the interpretation of mean(x >= 0 & x <= 1)?
• For some integer 𝑥 and 𝑦, how to verify whether 0 < 𝑥 < 100, 0 < 𝑦 < 100, and 𝑥 < 𝑦,
all at the same time?
• Mathematically, for all real 𝑥, 𝑦 > 0, we have log 𝑥𝑦 = log 𝑥 + log 𝑦. Why, then, can
all(log(x*y) == log(x)+log(y)) sometimes return FALSE? How can we fix this?
ℒ(𝒑, 𝒚) = (1/𝑛) ∑_{𝑖=1}^{𝑛} ℓ_𝑖,
where
ℓ_𝑖 = { −log 𝑝_𝑖        if 𝑦_𝑖 is TRUE,
      { −log(1 − 𝑝_𝑖)   if 𝑦_𝑖 is FALSE.
Interpretation: in classification problems, 𝑦𝑖 ∈ {FALSE, TRUE} denotes the true class of the 𝑖-
th object (say, whether the 𝑖-th hospital patient is symptomatic) and 𝑝𝑖 ∈ (0, 1) is a machine
learning algorithm’s confidence that 𝑖 belongs to class TRUE (e.g., how sure a decision tree model
is that the corresponding person is unwell). Ideally, if 𝑦𝑖 is TRUE, 𝑝𝑖 should be close to 1 and to 0
otherwise. The cross-entropy loss quantifies by how much a classifier differs from the omniscient
one. The use of the logarithm penalises strong beliefs in the wrong answer.
By the way, if we have solved any of the exercises encountered so far by referring to
if statements, for loops, vector indexing like x[...], or any external R package, we
recommend going back and rewriting the code. Let’s keep things simple (effective, read-
able) by only using base R’s vectorised operations that we have introduced.
4
Lists and attributes
After two brain-teasing chapters, it is time to cool it down a little. In this more tech-
nical part, we will introduce lists, which serve as universal containers for R objects of
any size and type. Moreover, we will also show that each R object can be equipped
with a number of optional attributes. Thanks to them, we will be able to label elements
in any vector, and, in Chapter 10, introduce new complex data types such as matrices
and data frames.
It turns out that we can easily convert between these types, either on our explicit de-
mand (type casting) or on-the-fly (coercion, when we perform an operation that expects
something different from the kind of input it was fed with).
Note (*) Numeric vectors are reported as being either of the type double (double-
precision floating-point numbers) or integer (32-bit; it is a subset of double); see
Section 6.4.1. In most practical cases, this is a technical detail that we can risklessly
ignore; compare also the mode function.
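For the curious, here is a glimpse of this distinction (values chosen arbitrarily):

```r
typeof(c(1, 2, 3))  # no suffix: double-precision floating-point numbers
## [1] "double"
typeof(1:3)  # sequences generated by `:` are integers
## [1] "integer"
is.numeric(1:3)  # integer vectors are numeric nonetheless
## [1] TRUE
```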
tion 10.2.3. is.numeric is generic too, and is more universal than is.double, which only verifies whether
typeof returns "double". For instance, vectors of the type integer which we mention later are considered
numeric as well.
If we make an attempt at composing an object of mixed types with c, the common type
will be determined in such a way that data are stored without information loss:
c(-1, FALSE, TRUE, 2, "three", NA)
## [1] "-1" "FALSE" "TRUE" "2" "three" NA
c("zero", TRUE, NA)
## [1] "zero" "TRUE" NA
c(-1, FALSE, TRUE, 2, NA)
## [1] -1 0 1 2 NA
Hence, we see that logical is the most specialised of the three, whereas character is
the most general.
Some functions that expect vectors of specific types can apply coercion by themselves
(or act as if they do so):
c(NA, FALSE, TRUE) + 10 # implicit conversion logical -> numeric
## [1] NA 10 11
c(-1, 0, 1) & TRUE # implicit conversion numeric -> logical
## [1] TRUE FALSE TRUE
sum(c(TRUE, TRUE, FALSE, TRUE, FALSE)) # same as sum(as.numeric(...))
## [1] 3
cumsum(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 2 2 3 3
cummin(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 1 0 0 0
Exercise 4.1 In one of the previous exercises, we computed the cross-entropy loss between a lo-
gical vector 𝒚 ∈ {0, 1}𝑛 and a numeric vector 𝒑 ∈ (0, 1)𝑛 . This measure can be equivalently
defined as:
ℒ(𝒑, 𝒚) = −(1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 log(𝑝_𝑖) + (1 − 𝑦_𝑖) log(1 − 𝑝_𝑖)).
Using vectorised operations, but not relying on ifelse this time, implement this formula. Then,
compute the cross-entropy loss between, for instance, “y <- sample(c(FALSE, TRUE), n)”
and “p <- runif(n)” for some n. Note how seamlessly we translate between FALSE/TRUEs and
0/1s in the above equation (in particular, where 1 − 𝑦𝑖 means the logical negation of 𝑦𝑖 ).
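As a hint (not a complete solution), recall that arithmetic operations coerce logical vectors to 0s and 1s on the fly:

```r
y <- c(TRUE, FALSE, TRUE)
y * 1  # coerced to numeric
## [1] 1 0 1
1 - y  # the arithmetic counterpart of the logical negation !y
## [1] 0 1 0
```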
4.2 Lists
Lists are generalised vectors. They can be comprised of R objects of any kind, also other
lists. It is why we classify them as recursive (and not atomic) objects. They are especially
useful wherever there is a need to handle some multitude as a single entity.
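For instance, the list function wraps its arguments (here, three numeric scalars) in such a container:

```r
list(1, 2, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
```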
Notice that it is not the same as c(1, 2, 3). We got a sequence that wraps three
numeric vectors, each of length one. More examples:
list(1:3, 4, c(TRUE, FALSE, NA, TRUE), "and so forth") # different types
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] TRUE FALSE NA TRUE
##
## [[4]]
## [1] "and so forth"
list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3)) # a list of lists
## [[1]]
## [[1]][[1]]
## [1] TRUE FALSE NA TRUE
##
## [[1]][[2]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
##
## [[2]]
The display of lists is (un)pretty bloated. However, the str function prints any R object
in a more concise fashion:
str(list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3)))
## List of 2
## $ :List of 2
## ..$ : logi [1:4] TRUE FALSE NA TRUE
## ..$ : chr [1:26] "a" "b" "c" "d" ...
## $ :List of 1
## ..$ : int [1:3] 1 2 3
Note In Section 4.1, we said that the c function, when fed with arguments of mixed
types, tries to determine the common type that retains the sense of data. If coercion
to an atomic vector is not possible, the result will be a list.
c(1, "two", identity) # `identity` is an object of the type "function"
## [[1]]
## [1] 1
##
## [[2]]
## [1] "two"
##
## [[3]]
## function (x)
## x
## <environment: namespace:base>
Note (*) Chapter 11 will mention the simplify2array function, which generalises
unlist in a way that can sometimes give rise to a matrix.
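To recall, unlist flattens a list into an atomic vector, determining the common type just like c does; for example:

```r
unlist(list(1:2, 3, c(TRUE, FALSE)))  # common type: double
## [1] 1 2 3 1 0
```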
4.3 NULL
NULL, being the one and only instance of the eponymous type, can be used as a place-
holder for an R object or designate the absence of any entities whatsoever.
list(NULL, NULL, month.name)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"
NULL is different from a vector of length zero because the latter has a type. However,
NULL sometimes behaves like a zero-length vector. In particular, length(NULL) returns
0. Also, c called with no arguments returns NULL.
Testing for NULL-ness can be done with a call to is.null.
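For example:

```r
is.null(NULL)
## [1] TRUE
is.null(NA)  # NA is a missing value, not the absence of an object
## [1] FALSE
length(NULL)
## [1] 0
c()  # no arguments
## NULL
```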
Important NULL is not the same as NA. The former cannot be emplaced in an atomic
vector.
c(1, NA, 3, NULL, 5) # here, NULL behaves like a zero-length vector
## [1] 1 NA 3 5
Later we will see that some functions return NULL invisibly when they have nothing
interesting to report. This is the case of print or plot, which are called because of
their side effects (printing and plotting).
Furthermore, in certain contexts, replacing content with NULL will actually result in
its removal, e.g., when subsetting a list.
An attribute’s value is any R object except NULL. Attributes can be introduced by calling,
amongst others2 , the structure function:
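For example (the attribute name and value below are merely illustrative):

```r
(x <- structure(1:10, some_attribute="some value"))
##  [1]  1  2  3  4  5  6  7  8  9 10
## attr(,"some_attribute")
## [1] "some value"
```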
The object of concern, 1:10, was displayed first. We need to get used to that. Most of
the time, we suggest treating the “attr…” parts of the display as if they were printed in
a tiny font.
Equipping an object with attributes does not usually change its nature; see, however,
Chapter 10 for a few exceptions. The above x is still treated as an ordinary sequence
of numbers by most functions:
sum(x) # the same as sum(1:10); `sum` does not care about any attributes
## [1] 55
typeof(x) # just a numeric vector, but with some perks
## [1] "integer"
Important Attributes are generally ignored by most functions unless they have spe-
cifically been programmed to pay attention to them.
2 Other ways include the replacement versions of the attr and attributes functions; see Section 9.3.6.
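For instance, na.omit removes missing values from a vector; a sketch on made-up data, chosen to be consistent with the attribute discussed next:

```r
y <- c(10, 20, NA, 40, 50, NA)
(y_na_free <- na.omit(y))
## [1] 10 20 40 50
## attr(,"na.action")
## [1] 3 6
## attr(,"class")
## [1] "omit"
```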
Additionally, the na.action attribute tells us where the missing observations were:
attr(y_na_free, "na.action") # read the attribute value
## [1] 3 6
## attr(,"class")
## [1] "omit"
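The call being discussed here is in the spirit of (the pattern and the strings are only illustrative):

```r
gregexpr("spam", c("spam, spam, bacon, and spam", "eggs and spam"))
```

The result is a list of two integer vectors giving the match positions (three matches in the first string, one in the second), each equipped with, i.a., a match.length attribute.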
We sought all occurrences of the pattern within two character strings. As their number
may vary from string to string, wrapping the results in a list was a good design choice.
Each list element gives the starting positions where matches can be found: there are
three and one match(es), respectively. Moreover, every vector of positions has a desig-
nated match.length attribute (amongst others), in case we need it.
Exercise 4.2 Create a list with EUR/AUD, EUR/GBP, and EUR/USD exchange rates read
from the euraud-*.csv, eurgbp-*.csv, and eurusd-*.csv files in our data repository3 .
Each of its three elements should be a numeric vector storing the currency exchange rates. Further-
more, equip them with currency_from, currency_to, date_from, and date_to attributes.
For example:
## [1] NA 1.6006 1.6031 NA NA 1.6119 1.6251 1.6195 1.6193 1.6132
## [11] NA NA 1.6117 1.6110 1.6188 1.6115 1.6122 NA
## attr(,"currency_from")
## [1] "EUR"
## attr(,"currency_to")
## [1] "AUD"
## attr(,"date_from")
## [1] "2020-01-01"
## attr(,"date_to")
## [1] "2020-06-30"
Such an additional piece of information could be stored in a few separate variables (other vectors),
but then it would not be as convenient to use as the above representation.
• they can be accessed via designated functions, e.g., names, class, dim, dimnames,
levels, etc.,
Important (*) The accessor functions such as names or class might return meaningful
values, even if the corresponding attribute is not set explicitly; see, e.g., Section 11.1.5
for an example.
The labels may improve the expressivity and readability of our code and data.
Exercise 4.4 Verify that the above x is still an ordinary numeric vector by calling typeof and
sum on it.
Let’s stress that we can ignore the names attribute altogether. If we apply any
operation discussed in Chapter 2, we will garner the same result regardless of whether
such extra information is present or not.
It is just the print function that changed its behaviour slightly. After all, it is a special
attribute. Instead of reporting:
## [1] 13 2 6
## attr(,"names")
## [1] "spam" "sausage" "celery"
we got a nicely formatted table-like display. Non-special attributes are still printed in
the standard way:
structure(x, additional_attribute=1:10)
## spam sausage celery
## 13 2 6
## attr(,"additional_attribute")
## [1] 1 2 3 4 5 6 7 8 9 10
Note Chapter 5 will also mention that some operations (such as indexing) gain super-
powers in the presence of the names attribute.
Named vectors can be easily created with the c and list functions as well:
c(a=1, b=2)
## a b
## 1 2
list(a=1, b=2)
## $a
## [1] 1
##
## $b
## [1] 2
c(a=c(x=1, y=2), b=3, c=c(z=4)) # this is smart
## a.x a.y b c.z
## 1 2 3 4
Let’s contemplate how a named list is printed on the console. Again, it is still a list, but
with some extras.
Exercise 4.5 A whole lot of functions return named vectors. Evaluate the following expressions
and read the corresponding pages in their documentation:
• quantile(runif(100)),
• hist(runif(100), plot=FALSE),
• options() (take note of digits, scipen, max.print, and width),
• capabilities().
Note (*) Most of the time, lists are used merely as containers for other R objects. This
is a dull yet essential role. However, let’s just mention here that every data frame is,
in fact, a generic vector (see Chapter 12). Each column corresponds to a named list
element:
(df <- head(iris)) # some data frame
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
Therefore, the functions we discuss in this chapter are of use in processing such struc-
tured data too.
We can alter an attribute’s value or add further attributes by referring to the struc-
ture function once again. Moreover, setting an attribute’s value to NULL gets rid of it
completely.
structure(x, attribute1=NULL, attribute4="added", attribute3="modified")
## X Y
## "some" "object"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "modified"
## attr(,"attribute4")
## [1] "added"
In Section 9.3.6, we will introduce replacement functions. They will enable us to modify or
remove an object’s attribute by calling “attr(x, "some_attribute") <- new_value”.
Moreover, Section 5.5 highlights that certain operations (such as vector indexing, ele-
mentwise arithmetic operations, and coercion) might not preserve all attributes of the
objects that were given as their inputs.
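A quick taste of the latter (the attribute name below is illustrative):

```r
x <- structure(c(a=1, b=2, c=3), some_attribute="some value")
x[1:2]  # subsetting preserves the names, but drops the other attribute
## a b
## 1 2
attributes(x[1:2])
## $names
## [1] "a" "b"
```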
4.5 Exercises
Exercise 4.6 Provide an answer to the following questions.
• What is the meaning of c(TRUE, FALSE)*1:10?
• What does sum(as.logical(x)) compute when x is a numeric vector?
• We said that atomic vectors of the type character are the most general ones. Therefore, is
as.numeric(as.character(x)) the same as as.numeric(x), regardless of the type of
x?
• What is the meaning of as.logical(x+y) if x and y are logical vectors? What about
as.logical(x*y), as.logical(1-x), and as.logical(x!=y)?
Exercise 4.8 Given numeric vectors x, y, z, and w, how to combine x, y, and list(z, w) so as
to obtain list(x, y, z, w)? More generally, given a set of atomic vectors and lists of atomic
vectors, how to combine them to obtain a single list of atomic vectors (not a list of atomic vectors
and lists, not atomic vectors unwound, etc.)?
Exercise 4.9 saveRDS serialises R objects and writes their snapshots to disk so that they can
be restored via a call to readRDS at a later time. Verify that this round trip preserves object attrib-
utes. Also, check out dput and dget, which work with objects’ textual representation in the form
of executable R code.
Exercise 4.10 (*) Use jsonlite::fromJSON to read a JSON file in the form of a named list.
In the extremely unlikely event of our finding the current chapter boring, let’s rejoice:
some of the exercises and remarks that we will encounter in the next part, which is
devoted to vector indexing, will definitely be deliciously stimulating!
5
Vector indexing
We now know plenty of ways to process vectors in their entirety, but how to extract and
replace their specific parts? We will be collectively referring to such activities as index-
ing. This is because they are often performed through the index operator, `[`.
Both functions work on lists too1 . They are useful for previewing the contents of really
big objects. Also, they never complain about our trying to fetch too many elements:
head(x, 100) # no more than the first 100 elements
## [1] 1 2 3 4 5 6 7 8 9 10
1 head and tail are actually S3 generics defined in the utils package. We are able to call them on
Important We might have wondered why “[1]” is displayed each time we print out
an atomic vector on the console:
print((1:51)*10)
## [1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170
## [18] 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340
## [35] 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510
It is merely a visual hint indicating which element of the vector is shown at the
beginning of each line.
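In the list-indexing examples below, let’s assume a short list defined along the lines of (its definition is a reconstruction, chosen to be consistent with the outputs shown):

```r
y <- list(1, 11:12, 21:23)  # an example list of three numeric vectors
```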
When applied on lists, the index operator always returns a list as well, even if we ask
for a single element:
y[2] # a list that includes the second element
## [[1]]
## [1] 11 12
y[c(1, 3)] # not the same as x[1, 3] (a different story)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 21 22 23
If we want to extract a component, i.e., to dig into what is inside a list at a specific
location, we can refer to `[[`:
y[[2]] # extract the second element
## [1] 11 12
This is exactly why R displays “[[1]]”, “[[2]]”, etc. when lists are printed.
On a side note, calling x[[i]] on an atomic vector, where i is a single value, has almost2
the same effect as x[i]. However, `[[` generates an error if the subscript is out of
bounds.
Important Let’s reflect on the operators’ behaviour in the case of nonexistent items:
c(1, 2, 3)[4]
## [1] NA
list(1, 2, 3)[4]
## [[1]]
## NULL
2 See also Section 5.5 for the discussion on the preservation of object attributes.
Its meaning is different from y[c(1, 3)], though; we are about to extract a single
value, remember? Here, indexing is applied recursively. Namely, the above is equivalent
to y[[1]][[3]]. We got an error because y[[1]] is of a length smaller than three.
More examples:
y[[c(3, 1)]] # y[[3]][[1]]
## [1] 21
list(list(7))[[c(1, 1)]] # 7, not list(7)
## [1] 7
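For instance (the vector and the indexer below are chosen to match the description that follows):

```r
x <- seq(10, 100, 10)
l <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
x[l]
## [1]  10  50  60  80 100
```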
In other words, x[l], where l is a logical vector, returns all x[i] with i such that l[i]
is TRUE. We thus extracted the elements at indexes 1, 5, 6, 8, and 10.
Important Be careful: if the element selector is NA, we will get a missing value (for
atomic vectors) or NULL (for lists).
c("one", "two", "three")[c(NA, TRUE, FALSE)]
## [1] NA "two"
list("one", "two", "three")[c(NA, TRUE, FALSE)]
## [[1]]
## NULL
##
## [[2]]
## [1] "two"
This, lamentably, comes with no warning, which might be problematic when indexers
are generated programmatically. As a remedy, we sometimes pass the logical indexer
to the which function first. It returns the indexes of the elements equal to TRUE, ignor-
ing the missing ones.
which(c(NA, TRUE, FALSE, TRUE, FALSE, NA, TRUE))
## [1] 2 4 7
c("one", "two", "three")[which(c(NA, TRUE, FALSE))]
## [1] "two"
Recall that in Chapter 3, we discussed ample vectorised operations that generate lo-
gical vectors. Anything that yields a logical vector of the same length as x can be passed
as an indexer.
x > 60 # yes, it is a perfect indexer candidate
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
x[x > 60] # select elements in `x` that are greater than 60
## [1] 70 80 90 100
x[x < 30 | 70 < x] # elements not between 30 and 70
## [1] 10 20 80 90 100
x[x < mean(x)] # elements smaller than the mean
## [1] 10 20 30 40 50
x[x^2 > 7777 | log10(x) <= 1.6] # indexing via a transformed version of `x`
The indexer is always evaluated first and then passed to the subsetting operation. The
index operator does not care how an indexer is generated.
Furthermore, the recycling rule is applied when necessary:
x[c(FALSE, TRUE)] # every second element
## [1] 20 40 60 80 100
y[c(TRUE, FALSE)] # interestingly, there is no warning here
## [[1]]
## [1] 1
##
## [[2]]
## [1] 21 22 23
Exercise 5.1 Consider a simple database about six people, their favourite dishes, and birth
years.
name <- c("Graham", "John", "Terry", "Eric", "Michael", "Terry")
food <- c("bacon", "spam", "spam", "eggs", "spam", "beans")
year <- c( 1941, 1939, 1942, 1943, 1943, 1940 )
The consecutive elements in different vectors correspond to each other, e.g., Graham was born in
1941, and his go-to food was bacon.
• List the names of people born in 1941 or 1942.
• List the names of those who like spam.
• List the names of those who like spam and were born after 1940.
• Compute the average birth year of the lovers of spam.
• Give the average age, in 1969, of those who didn’t find spam utmostly delicious.
The answers must be provided programmatically, i.e., do not just write "Eric" and "Graham".
Make the code generic enough so that it works in the case of any other database of this kind, no
matter its size.
Exercise 5.2 Remove missing values from a given vector without referring to na.omit.
These labels can be referred to when extracting the elements. To do this, we use an
indexer that is a character vector:
x[c("a", "f", "a", "g", "z")]
## a f a g <NA>
## 10 60 10 70 NA
Important We have said that special object attributes add extra functionality on top
of the existing ones. Therefore, indexing by means of positive, negative, and logical
vectors is still available:
x[1:3]
## a b c
## 10 20 30
x[-(1:5)]
## f g h i j
## 60 70 80 90 100
x[x > 70]
## h i j
## 80 90 100
Important Labels do not have to be unique. When we have repeated names, the first
matching element is extracted:
structure(c(1, 2, 3), names=c("a", "b", "a"))["a"]
## a
## 1
There is no direct way to select all but given names, just like with negative integer in-
dexers. For a workaround, see Section 5.4.1.
Exercise 5.3 Rewrite the solution to Exercise 5.1 assuming that we now have three features
wrapped inside a list.
(people <- list(
Name=c("Graham", "John", "Terry", "Eric", "Michael", "Terry", "Steve"),
Food=c("bacon", "spam", "spam", "eggs", "spam", "beans", "spam"),
Year=c( 1941, 1939, 1942, 1943, 1943, 1940, NA_real_)
))
## $Name
## [1] "Graham" "John" "Terry" "Eric" "Michael" "Terry" "Steve"
##
## $Food
## [1] "bacon" "spam" "spam" "eggs" "spam" "beans" "spam"
##
## $Year
## [1] 1941 1939 1942 1943 1943 1940 NA
Do not refer to name, food, and year directly. Instead, use the full people[["Name"]] etc. ac-
cessors. There is no need to pout: it is just a tiny bit of extra work. Also, notice that Steve has joined
the group; hello, Steve.
The principles of vectorisation, recycling rule, and implicit coercion are all in place:
x[c(TRUE, FALSE)] <- c("a", "b", "c")
print(x)
## [1] "a" "2" "b" "4" "c" "6" "a" "8" "b" "10" "c" "42"
Long story long: first, to ensure that the new content can be poured into the old wine-
skin, R coerced the numeric vector to a character one. Then, every second element
therein, a total of six items, was replaced by a recycled version of the replacement se-
quence of length three. Finally, the name x was rebound to such a brought-forth object
and the previous one became forgotten.
Note For more details on replacement functions in general, see Section 9.3.6. Such
operations alter the state of the object they are called on (quite a rare behaviour in
functional languages).
Exercise 5.4 Replace missing values in a given numeric vector with the arithmetic mean of its
well-defined observations.
Moreover:
y[1] <- list(1:10) # replace one element with one object
y[-1] <- 10:11 # replace two elements with two singletons
print(y)
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] 10
##
## $c
## [1] 11
Important Setting a list item to NULL removes it from the list completely.
y <- list(1, 2, 3, 4)
y[1] <- NULL # removes the first element (i.e., 1)
y[[1]] <- NULL # removes the first element (i.e., now 2)
y[1] <- list(NULL) # sets the first element (i.e., now 3) to NULL
print(y)
## [[1]]
## NULL
##
## [[2]]
## [1] 4
The same notation convention is used for dropping object attributes; see Section 9.3.6.
Note that x was not equipped with the names attribute before. The unlabelled elements
were assigned blank labels (empty strings).
Note It is not possible to insert new elements at the beginning or in the middle of a
sequence, at least not with the index operator. By writing “x[3:4] <- 1:5” we do not
replace two elements in the middle with five other ones. However, we can always use
the c function to slice parts of the vector and intertwine them with some new content:
3 And often cheaply; see Section 8.3.5 for performance notes. Still, a warning can be generated on each
x <- seq(10, 100, 10)
x <- c(x[1:2], 1:5, x[5:7])
print(x)
## [1] 10 20 1 2 3 4 5 50 60 70
Example 5.6 Here is how we can remove the elements of a vector that have been assigned spe-
cified labels.
(x <- structure(1:12, names=month.abb)) # example vector
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 2 3 4 5 6 7 8 9 10 11 12
x[!(names(x) %in% c("Jan", "May", "Sep", "Oct"))] # get rid of some elements
## Feb Mar Apr Jun Jul Aug Nov Dec
## 2 3 4 6 7 8 11 12
More generally, match(x, y) gives us the index of the element in y that matches each
x[i].
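For example:

```r
match(c("spam", "eggs", "spam"), c("bacon", "eggs", "spam"))
## [1] 3 2 3
```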
Exercise 5.8 Refer to the manual of findInterval to verify the function’s behaviour when we
do not include ±∞ as endpoints and how to make ∞ classified as a member of the fourth interval.
Exercise 5.9 Using a call to findInterval, compose a statement that generates a logical vec-
tor whose 𝑖-th element indicates whether x[i] is in the interval [0.25, 0.5]. Was this easier to
write than an expression involving `<=` and `>=`?
For instance, we can assign people into groups determined by their favourite dish:
name <- c("Graham", "John", "Terry", "Eric", "Michael", "Terry")
food <- c("bacon", "spam", "spam", "eggs", "spam", "beans")
split(name, food) # group names with respect to food
## $bacon
## [1] "Graham"
##
## $beans
## [1] "Terry"
##
## $eggs
## [1] "Eric"
##
## $spam
## [1] "John" "Terry" "Michael"
The result is a named list with labels determined by the unique elements in the second
vector.
Here is another example, where we pigeonhole some numbers into the four previously
mentioned intervals:
x <- c(0, 0.2, 0.25, 0.4, 0.66, 1)
split(x, findInterval(x, c(-Inf, 0.25, 0.5, 0.75, Inf)))
## $`1`
## [1] 0.0 0.2
##
## $`2`
## [1] 0.25 0.40
##
## $`3`
## [1] 0.66
##
## $`4`
## [1] 1
Items in the first argument that correspond to missing values in the grouping vector
will be ignored. Also, unsurprisingly, the recycling rule is applied when necessary.
We can also split x into groups defined by a combination of levels of two or more vari-
ables z1, z2, etc., by calling split(x, list(z1, z2, ...)).
Example 5.10 The ToothGrowth dataset is a named list (more precisely, a data frame; see
Chapter 12) that represents the results of an experimental study involving 60 guinea pigs. The
experiment’s aim was to measure the effect of different vitamin C supplement types and doses
on the growth of the rodents’ teeth lengths:
ToothGrowth <- as.list(ToothGrowth) # it is a list, but with extra attribs
ToothGrowth[["supp"]] <- as.character(ToothGrowth[["supp"]]) # was: factor
We can split len with respect to the combinations of supp and dose (also called interactions)
by calling:
split(ToothGrowth[["len"]], ToothGrowth[c("supp", "dose")], sep="_")
## $OJ_0.5
## [1] 15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7
##
## $VC_0.5
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0
##
## $OJ_1
## [1] 19.7 23.3 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3
##
## $VC_1
## [1] 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5
##
## $OJ_2
## [1] 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0
##
## $VC_2
## [1] 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
This is akin to writing “SELECT len FROM ToothGrowth GROUP BY supp, dose” in SQL. As an appetiser, let’s pass a list of vectors to the boxplot function; see Figure 5.1.
boxplot(split(ToothGrowth[["len"]], ToothGrowth[c("supp", "dose")], sep="_"))
Figure 5.1. Box-and-whisker plots of len split by supp and dose in ToothGrowth.
Note unsplit revokes the effects of split. Later, we will get used to calling unsplit(Map(some_transformation, split(x, z)), z) to modify the values in x independently in each group defined by z (e.g., standardise the variables within each
class separately).
Note that o[1] is the index of the smallest element in x, o[2] is the position of the
second smallest, …, and o[length(o)] is the index of the greatest value. Hence, e.g.,
x[o[1]] is equivalent to min(x).
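For instance (with made-up data; the parenthesised assignment also prints the result):

```r
x <- c(30, 10, 50, 20, 40)
(o <- order(x))  # o[1] is the index of the smallest element, etc.
## [1] 2 4 1 5 3
x[o]             # hence, this is equivalent to sort(x)
## [1] 10 20 30 40 50
```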
Note The ordering permutation that order returns is unique (that is why we call it the
permutation), even for inputs containing duplicated elements. Owing to the use of a
stable sorting algorithm, ties (repeated elements) will be listed in the order of occur-
rence.
order(c(10, 20, 40, 10, 10, 30, 20, 10, 10))
## [1] 1 4 5 8 9 2 7 6 3
We have, e.g., five 10s at positions 1, 4, 5, 8, and 9. These five indexes are guaranteed
to be listed in this very order.
Note A call to sort(x) is equivalent to x[order(x)], but the former function can be
faster in certain scenarios. For instance, one of its arguments can induce a partially
sorted vector which can be helpful if we only seek a few order statistics (e.g., the seven
smallest values). Speed is rarely a bottleneck in the case of sorting (when it is, we have
a problem!). This is why we will not bother ourselves with such topics until the last part
of this pleasant book. Currently, we aim at expanding our skill repertoire so that we
can implement anything we can think of.
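The argument in question is partial (a sketch; see help("sort") for the exact guarantees):

```r
x <- c(40, 10, 50, 30, 20, 60)
# partial=1:3 guarantees only that the first three positions hold the
# three smallest values in order; the remaining ones may stay unsorted
sort(x, partial=1:3)[1:3]
## [1] 10 20 30
```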
Exercise 5.11 is.unsorted(x) determines if the elements in x are… not sorted with respect
to `<=`. Write an R expression that generates the same result by referring to the order function.
Also, assuming that x is numeric, do the same by means of a call to diff.
order also accepts one or more arguments via the dot-dot-dot parameter, `...`. This
way, we can sort a vector with respect to many criteria. If there are ties in the first
variable, they will be resolved by the order of elements in the second variable. This is
most useful for rearranging rows of a data frame, which we will exercise in Chapter 12.
x <- c( 10, 20, 30, 40, 50, 60)
y1 <- c("a", "b", "a", "a", "b", "b")
y2 <- c("w", "w", "v", "u", "u", "v")
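For example (restating the three vectors for self-containedness), ties in y1 are resolved by the corresponding values in y2:

```r
x  <- c( 10,  20,  30,  40,  50,  60)
y1 <- c("a", "b", "a", "a", "b", "b")
y2 <- c("w", "w", "v", "u", "u", "v")
x[order(y1, y2)]  # sort by y1, break ties by y2
## [1] 40 30 10 50 60 20
```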
Note that order(order(x)) can be considered as a way to rank all the elements in x.
For instance, the third value in x, 40, is assigned rank 7: it is the seventh smallest value
in this vector. This breaks the ties on a first-come, first-served basis. But we can also
write:
order(order(x, runif(length(x)))) # ranks with ties broken at random
## [1] 2 5 7 4 3 1 8 6
This function can be used to remove repeated observations; see also unique. Its return value is not guaranteed to be sorted (unlike in some other languages/libraries).
Exercise 5.13 What can be the use case of a call to match(x, unique(x))?
Exercise 5.14 Given two named lists x and y, which we treat as key-value pairs, determine their
set-theoretic union (with respect to the keys). For example:
x <- list(a=1, b=2)
y <- list(c=3, a=4)
z <- ...to.do... # combine x and y
str(z)
## List of 3
## $ a: num 4
## $ b: num 2
## $ c: num 3
5.5.1 c
First, c drops all attributes except names:
(x <- structure(1:4, names=c("a", "b", "c", "d"), attrib1="<3"))
## a b c d
## 1 2 3 4
## attr(,"attrib1")
## [1] "<3"
c(x) # only `names` are preserved
## a b c d
## 1 2 3 4
We can therefore end up calling this function chiefly for this nice side effect. Also, recall
that unname drops the labels.
unname(x)
## [1] 1 2 3 4
## attr(,"attrib1")
## [1] "<3"
5.5.2 as.something
as.vector, as.numeric, and similar functions drop all attributes when the output is an atomic vector, but they might not necessarily do so in other cases (because they are S3 generics; see Chapter 10).
as.vector(x) # drops all attributes if x is atomic
## [1] 1 2 3 4
5.5.3 Subsetting
Subsetting with `[` (except where the indexer is not given) drops all attributes but
names (as well as dim and dimnames; see Chapter 11), which is adjusted accordingly:
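For instance (restating x from the beginning of this section):

```r
x <- structure(1:4, names=c("a", "b", "c", "d"), attrib1="<3")
x[c(1, 2)]  # names are subsetted accordingly; attrib1 is dropped
## a b
## 1 2
```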
The replacement version of the index operator modifies the values in an existing vector
whilst preserving all the attributes. In particular, skipping the indexer replaces all the
elements:
y <- x
y[] <- c("u", "v") # note that c("u", "v") has no attributes
print(y)
## a b c d
## "u" "v" "u" "v"
## attr(,"attrib1")
## [1] "<3"
Binary operations are expected to get the attributes from the longer input. If they are
of equal sizes, the first argument is preferred to the second.
y <- structure(c(1, 10), names=c("f", "g"), attrib1=":|", attrib2=":O")
y * x # x is longer
## a b c d
## 1 20 3 40
## attr(,"attrib1")
## [1] "<3"
y[c("h", "i")] <- c(100, 1000) # add two new elements at the end
y * x
## f g h i
## 1 20 300 4000
## attr(,"attrib1")
## [1] ":|"
## attr(,"attrib2")
## [1] ":O"
x * y
## a b c d
## 1 20 300 4000
## attr(,"attrib1")
## [1] "<3"
## attr(,"attrib2")
## [1] ":O"
Also, Section 9.3.6 mentions a way to copy all attributes from one object to another.
Important Even in base R, the foregoing rules are not enforced strictly. We con-
sider them inconsistencies that should be, for the time being, treated as features (with
which we need to learn to live as they have not been fixed for years, but hope springs
eternal).
As far as third-party extension packages are concerned, suffice it to say that a lot of R
programmers do not know what attributes are whatsoever. It is always best to refer to
the documentation, perform a few experiments, and/or manually ensure the preser-
vation of the data we care about.
5.6 Exercises
Exercise 5.17 Answer the following questions (contemplate first, then use R to find the answer).
• What is the result of x[c()]? Is it the same as x[]?
• Is x[c(1, 1, 1)] equivalent to x[1]?
• Is x[1] equivalent to x["1"]?
• Is x[c(-1, -1, -1)] equivalent to x[-1]?
• What does x[c(0, 1, 2, NA)] do?
• What does x[0] return?
• What does x[1, 2, 3] do?
• What about x[c(0, -1, -2)] and x[c(-1, -2, NA)]?
• Why is x[NA] so significantly different from x[c(1, NA)]?
• What is x[c(FALSE, TRUE, 2)]?
• What will we obtain by calling x[x<min(x)]?
• What about x[length(x)+1]?
• Why is x[min(y)] most probably a mistake? What could it mean? How can it be fixed?
• Why can we not mix indexes of different types and write x[c(1, "b", "c", 4)]? Or can we?
• Why would we call as.vector(na.omit(x)) instead of just na.omit(x)?
• What is the difference between sort and order?
• What is the type and the length of the object returned by a call to split(a, u)? What about
split(a, c(u, v))?
• How to get rid of the seventh element from a list of ten elements?
• How to get rid of the seventh, eighth, and ninth elements from a list with ten elements?
• How to get rid of the seventh element from an atomic vector of ten elements?
• If y is a list, by how many elements “y[c(length(y)+1, length(y)+1, length(y)+1)]
<- list(1, 2, 3)” will extend it?
Exercise 5.23 (*) Given two vectors x and y both of length 𝑛, a call to approx(x, y, ...
) can be used to interpolate linearly between the points (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ). We
can use it to generate new 𝑦s for previously unobserved 𝑥s (somewhere “in-between” the data we
already have). Moreover, spline(x, y, ...) can perform a cubic spline interpolation, which
is smoother; see Figure 5.2.
x <- c(1, 3, 5, 7, 10)
y <- c(1, 15, 25, 6, 0)
x_new <- seq(1, 10, by=0.25)
y_new1 <- approx(x, y, xout=x_new)[["y"]]
y_new2 <- spline(x, y, xout=x_new)[["y"]]
plot(x, y, ylim=c(-10, 30)) # the points to interpolate between
lines(x_new, y_new1, col="black", lty="solid") # linear interpolation
lines(x_new, y_new2, col="darkred", lty="dashed") # cubic interpolation
Figure 5.2. Linear (solid) and cubic spline (dashed) interpolation between the given points.
Exercise 5.24 Given some 1 ≤ from ≤ to ≤ n, use findInterval to generate a logical vector of
length n with TRUE elements only at indexes between from and to, inclusive.
Exercise 5.25 Implement expressions that give rise to the same results as calls to which,
which.min, which.max, and rev functions. What is the difference between x[x>y] and
x[which(x>y)]? What about which.min(x) vs which(x == min(x))?
Exercise 5.26 Given two equal-length vectors x and y, fetch the value from the former that cor-
responds to the smallest value in the latter. Write three versions of such an expression, each deal-
ing with potential ties in y differently. For example:
x <- c("a", "b", "c", "d", "e", "f")
y <- c( 3, 1, 2, 1, 1, 4)
It should choose the first ("b"), last ("e"), or random element from x fulfilling the above property
("b", "d", or "e" with equal probability). Make sure your code works for x being of the type
character or numeric as well as an empty vector.
Exercise 5.27 Implement an expression that yields the same result as duplicated(x) for a
numeric vector x, but using diff and order.
Exercise 5.28 Based on match and unique, implement your versions of union(x, y), intersect(x, y), setdiff(x, y), is.element(x, y), and setequal(x, y) for x and y being nonempty numeric vectors.
6
Character vectors
The only difference between double- and single-quote-delimited strings is that we cannot directly include, e.g., an apostrophe in the latter. On the other hand, "'tis good ol' spam" and 'I "love" bacon' are both okay.
However, to embrace characters whose inclusion might otherwise be difficult or im-
possible, we may always employ the so-called escape sequences.
R uses the backslash, “\”, as the escape character. In particular:
• \" inputs a double quote,
• \' generates a single quote,
• \\ includes a backslash,
• \n endows a new line.
(x <- "I \"love\" bacon\n\\\"/")
## [1] "I \"love\" bacon\n\\\"/"
The print function (which was implicitly called to display the above object) does not
reveal the special meaning of the escape sequences. Instead, print outputs strings in the same way that we ourselves would input them. The number of
characters in x is 18, and not 23:
nchar(x)
## [1] 18
Note (*) The Unicode standard 15.0 (version dated September 2022) defines 149 186
characters, i.a., letters from different scripts, mathematical symbols, and emojis.
Each is assigned a unique numeric identifier; see the Unicode Character Code Charts1 .
For example, the inverted exclamation mark (see the Latin-1 Supplement section therein)
has been mapped to the hexadecimal code 0xA1 (or 161 decimally). Knowing this magic
number permits us to specify a Unicode code point using one of the following escape
sequences:
• \uxxxx – codes using four hexadecimal digits,
• \Uxxxxxxxx – codes using eight hexadecimal digits.
For instance:
cat("!\u00a1!\U000000a1!", sep="\n")
## !¡!¡!
All R installations allow for working with Unicode strings. More precisely, they sup-
port dealing with UTF-8, being a super-encoding that is native to most UNIX-like
boxes, including GNU/Linux and m**OS. Other operating systems may use some 8-
bit encoding as the system one (e.g., latin1 or cp1252), but they can be mixed with
Unicode seamlessly; see help("Encoding"), help("iconv"), and [26] for discussion.
1 https://www.unicode.org/charts
Nevertheless, certain output devices (web browsers, LaTeX renderers, text terminals)
might be unable to display every possible Unicode character, e.g., due to some fonts’
being missing. However, as far as processing character data is concerned, this does
not matter because R does it with its eyes closed. For example:
cat("\U0001f642\u2665\u0bb8\U0001f923\U0001f60d\u2307", sep="\n")
## ������
In the PDF version2 of this adorable book, the Unicode glyphs are not rendered cor-
rectly for some reason. However, its HTML variant3 , generated from the same source
files, should be displayed by most web browsers properly.
Note (*) Some output devices may support the following codes that control the posi-
tion of the caret (text cursor):
• \b inserts a backspace (moves cursor one column to the left),
• \t implants a tabulator (advances to the next tab stop, e.g., a multiple of four or eight text columns),
• \r injects a carriage return (moves to the beginning of the current line).
cat("abc\bd\tef\rg\nhij", sep="\n")
## gbd ef
## hij
These can be used on unbuffered outputs like stderr to display the status of the cur-
rent operation, for instance, an animated textual progress bar, the print-out of the
ETA, or the percentage of work completed.
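For instance, a bare-bones animated progress indicator could be sketched as follows (the message and the timings are, naturally, made up):

```r
for (i in seq(0, 100, by=20)) {
    # "\r" moves back to the start of the line so that each message
    # overwrites the previous one; stderr is not buffered
    cat(sprintf("\rprocessing... %3d%%", i), file=stderr())
    Sys.sleep(0.1)  # pretend we are doing some hard work here
}
cat("\rprocessing... done.\n", file=stderr())
```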
Further, certain terminals can also understand the ECMA-48/ANSI-X3.64 escape se-
quences4 of the form \u001b[... to control the cursor’s position, text colour, and
even style. For example, \u001b[1;31m outputs red text in bold font and \u001b[0m
resets the settings to default. We recommend giving, e.g., cat("\u001b[1;31mspam\
u001b[0m") or cat("\u001b[5;36m\u001b[Abacon\u001b[Espam\u001b[0m") a try.
2https://deepr.gagolewski.com/deepr.pdf
3https://deepr.gagolewski.com/
4 https://en.wikipedia.org/wiki/ANSI_escape_code
We can also collapse (flatten, aggregate) a sequence of strings into a single string:
paste(c("a", "b", "c", "d"), collapse=",")
## [1] "a,b,c,d"
paste(c("a", "b", "c", "d"), 1:2, sep="", collapse="")
## [1] "a1b2c1d2"
Perhaps for convenience, alas, paste treats missing values differently from most other
vectorised functions:
paste(c("A", NA_character_, "B"), "!", sep="")
## [1] "A!" "NA!" "B!"
Moreover, sprintf is a workhorse for turning possibly many atomic vectors into
strings. Its first argument is a format string. Special escape sequences starting with
the per cent sign, “%”, serve as placeholders for the actual values. For instance, “%s” is
replaced with a string and “%f” with a floating point value taken from further argu-
ments.
sprintf("%s%s", "a", c("X", "Y", "Z")) # like paste(...)
## [1] "aX" "aY" "aZ"
sprintf("key=%s, value=%f", c("spam", "eggs"), c(100000, 0))
## [1] "key=spam, value=100000.000000" "key=eggs, value=0.000000"
The numbers’ precision, strings’ widths and justification, etc., can be customised, e.g.,
“%6.2f” is a number that, when converted to text, will occupy six text columns7 , with
two decimal digits of precision.
sprintf("%10s=%6.2f%%", "rate", 2/3*100) # "%%" renders the per cent sign
## [1] " rate= 66.67%"
sprintf("%.*f", 1:5, pi) # variable precision
## [1] "3.1" "3.14" "3.142" "3.1416" "3.14159"
Also, e.g., “%1$s”, “%2$s”, … inserts the first, second, … argument as text.
sprintf("%1$s, %2$s, %1$s, and %1$s", "spam", "bacon") # numbered argument
## [1] "spam, bacon, spam, and spam"
7 This is only true for 8-bit native encodings or ASCII; see also sprintf from the stringx package, which
takes the text width and not the number of bytes into account.
writeLines is its counterpart. There is also an option to read or write parts of files at
a time using file connections which we mention in Section 8.3.5. Moreover, cat(...,
append=TRUE) can be used to create a text file incrementally.
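For example (using a throwaway temporary file):

```r
f <- tempfile()                      # a throwaway file path
cat("spam\n", file=f)                # creates (or truncates) the file
cat("bacon\n", file=f, append=TRUE)  # appends at the end
readLines(f)
## [1] "spam"  "bacon"
unlink(f)  # tidy up
```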
In Section 5.4.1, we introduced the match function and its derivative, the `%in%` oper-
ator. They are vectorised in a different way:
match(c("spam", "spam", "bacon", "eggs"), c("spam", "eggs"))
## [1] 1 1 NA 2
c("spam", "spam", "bacon", "eggs") %in% c("spam", "eggs")
## [1] TRUE TRUE FALSE TRUE
Note (*) match relies on a simple, bytewise comparison of the corresponding code
points. It might not be valid in natural language processing activities, e.g., where
the German word groß should be equivalent to gross [18]. Moreover, in the rare situ-
ations where we read Unicode-unnormalised data, canonically equivalent strings may
be considered different; see [17].
If we provide many prefixes, the above function will be applied elementwisely, just like
the `==` operator.
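Assuming the function in question is startsWith (its introduction fell just outside this passage), a brief sketch:

```r
startsWith("spam and eggs", c("spam", "eggs", "sp"))  # one string, many prefixes
## [1]  TRUE FALSE  TRUE
```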
Note (*) In Section 9.4.7, we discuss match.arg, which a few R functions rely on when
they need to select a value from a range of possible choices. Furthermore, Section 9.3.2
and Section 15.4.4 mention the (discouraged) partial matching of list labels and func-
tion argument names.
Important The order of arguments is like grepl(needle, haystack), not vice versa.
Also, this function is not vectorised with respect to the first argument.
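For instance (with a hypothetical haystack; fixed=TRUE means the needle is taken literally, not as a regex):

```r
grepl("spam", c("spam machine", "bacon", "minced spam"), fixed=TRUE)
## [1]  TRUE FALSE  TRUE
```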
Exercise 6.2 How can the calls grep(y, x, value=FALSE) and grep(y, x, value=TRUE) be implemented based on grepl and other operations we are already familiar with?
Note (*) As a curiosity, agrepl performs approximate matching, which can account
for a smöll nmber of tpyos.
agrepl("spam", x)
## [1] TRUE TRUE FALSE TRUE
agrepl("ham", x, ignore.case=TRUE)
## [1] TRUE TRUE TRUE TRUE
It is based on Levenshtein’s edit distance that measures the number of character inser-
tions, deletions, or substitutions required to turn one string into another.
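Base R's adist computes this edit distance directly; for instance, turning "kitten" into "sitting" takes three edits:

```r
adist("kitten", "sitting")  # substitute k->s, substitute e->i, insert g
##      [,1]
## [1,]    3
```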
Note For more details on regular expressions in general, see, e.g., [24]. The ultimate
reference on the PCRE2 pattern syntax is the Unix man page pcre2pattern(3)8 . From
now on, we assume that the reader is familiar with it.
Apart from the Perl-compatible regexes, R also gives access to the TRE library (ERE-
like), which is the default one; see help("regex"). However, we discourage its use
because it is feature-poorer.
Exercise 6.3 The list.files function generates the list of file names in a given directory that
match a given regular expression. For instance, the following gives all CSV files in a folder:
list.files("~/Projects/teaching-data/r/", "\\.csv$")
## [1] "air_quality_1973.csv" "anscombe.csv" "iris.csv"
## [4] "titanic.csv" "tooth_growth.csv" "trees.csv"
## [7] "world_phones.csv"
Write a single regular expression that matches file names ending with “.csv” or “.csv.gz”.
Also, scribble a regex that matches CSV files whose names do not begin with “eurusd”.
regexpr("spam", x, fixed=TRUE)
## [1] 1 3 -1 -1
## attr(,"match.length")
## [1] 4 4 -1 -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
In particular, there is a pattern occurrence starting at the third code point of the
second string in x. Moreover, the last string has no pattern match, which is denoted
by -1.
The match.length attribute is generally more informative when searching with regu-
lar expressions.
To locate all the matches, i.e., globally, we use gregexpr:
8 http://www.pcre.org/current/doc/html/pcre2pattern.html
As we noted in Section 4.4.2, wrapping the results in a list was a clever choice for the
number of matches can obviously vary between strings.
In Section 7.2, we will look at the Map function, which, along with substring intro-
duced below, can aid in getting the most out of such data. Meanwhile, let’s just men-
tion that regmatches extracts the matching substrings:
regmatches(x, gregexpr("(?i)spam\\p{L}*", x, perl=TRUE))
## [[1]]
## [1] "spam"
##
Note (*) Consider what happens when a regular expression contains parenthesised
subexpressions (capture groups).
r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"
This regex consists of two capture groups separated by a dot. The first one is labelled
“basename”. It comprises several arbitrary characters except for spaces and dots. The
second group, named “extension”, is a substring consisting of anything but spaces.
Such a pattern can be used for unpacking space-delimited lists of file names.
z <- "dataset.csv.gz something_else.txt spam"
regexpr(r, z, perl=TRUE)
## [1] 1
## attr(,"match.length")
## [1] 14
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
## basename extension
## [1,] 1 9
## attr(,"capture.length")
## basename extension
## [1,] 7 6
## attr(,"capture.names")
## [1] "basename" "extension"
gregexpr(r, z, perl=TRUE)
## [[1]]
## [1] 1 16
## attr(,"match.length")
## [1] 14 18
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
The capture.* attributes give us access to the matches to the individual capture
groups, i.e., the basename and the extension.
Exercise 6.4 (*) Check out the difference between the results generated by regexec and regexpr as well as between the outputs of gregexec and gregexpr.
Note (*) If a regex defines capture groups, matches thereto can be mentioned not only
in the pattern itself but also in the replacement string:
gsub("(\\p{L})\\p{L}\\1", "\\1", "aha egg gag NaN spam", perl=TRUE)
## [1] "a egg g N spam"
Matched are, in the following order: a letter (it is a capture group), another letter, and
the former letter again. Each such palindrome of length three is replaced with just the
repeated letter.
Exercise 6.5 (*) Display the source code of glob2rx by calling print(glob2rx) and study
how this function converts wildcards such as file???.* or *.csv to regular expressions that
can be passed to, e.g., list.files.
Note that this time the search pattern specifying the token delimiter is given as the
second argument (an inconsistency).
Note There is also a replacement (compare Section 9.3.6) version of the foregoing:
x <- "spam, spam, bacon, and spam"
substring(x, 7, 11) <- "eggs"
print(x)
## [1] "spam, eggs, bacon, and spam"
Unfortunately, the number of characters in the replacement string should not exceed
the length of the part being substituted (try "chickpeas" instead of "eggs"). However,
substring replacement can be written as a composition of substring extraction and
concatenation:
paste(substring(x, 1, 6), "chickpeas", substring(x, 11), sep="")
## [1] "spam, chickpeas, bacon, and spam"
Exercise 6.6 Take the output generated by regexpr and apply substring to extract the pat-
tern occurrences. If there is no match in a string, the corresponding output should be NA.
toupper("spam")
## [1] "SPAM"
Note Like many other string operations in base R, these functions per