Chapter 07: Reproducible Data Analysis
Mark Andrews
Contents
Introduction
Git
    Installation
    Configuration
    Creating and initiating a Git repository
    Adding and editing files
    Using remote repositories
References
Introduction
The end product of any data analysis is usually a set of tables, figures, and the seemingly countless statistics
and other quantities that specify the results of the statistical modelling or testing that was performed.
These results are then usually communicated in reports, including and especially peer-reviewed scientific
articles, or through talks and presentations, or through mainstream or social media, and so on. These results
are the end product of an often long and arduous process, something we will call the data analysis pipeline,
that began initially with just the raw data. For example, the original raw data might have been in the form of
a set of .xlsx files that were downloaded from a website. These might have been wrangled and transformed
repeatedly by a long series of operations, such as those described in Chapter 3, to produce various “tidy”
data-sets, which themselves might then have been repeatedly and iteratively visualized and statistically
analysed and modelled. It is not an exaggeration to say that this whole process might have involved hundreds
or even thousands of hours of work, taking place intermittently over the course of months or even years, and
involving many different people at many different stages. We can view the data analysis pipeline as akin to a
factory: raw materials, in the form of the raw data, go in; these are worked on and turned into something
new and valuable by the combined efforts of human labour and machines (computers); and finally the end
products are produced that are to be consumed by others.
The aim of reproducible data analysis, at its most general, is to make the data analysis factory or pipeline as
open and transparent as possible, and to allow all others, including our future selves, to be able to exactly
reproduce any of the results that were produced by it. In other words, the aim is to make it possible for
anyone to determine exactly where and how any given table, figure, or statistical quantity was obtained, and
to be able to reproduce any of these results exactly. If the pipeline is suitably reproducible, anyone ought to
be able to take the original raw data and reproduce all the final results, or alternatively ought to be able to
take the raw data and analyse them in new and different ways, thus producing new and interestingly different
results.
Doing reproducible data analysis is often motivated by a general commitment to doing open science. Since
the origin of modern science in the 17th century, part of its defining ethos (see Merton 1973) has been the
unrestricted sharing of the fruits of research, and also a full disclosure of all details of any research so that
they may be scrutinized by others. Thus, openness and transparency are core ethical principles in all science.
Recently, it has become apparent that these ethical principles, although widely endorsed, are
not usually followed in practice, and that in fact there is a widespread culture of not sharing data (see, for
example, Tenopir et al. 2011; Houtkoop et al. 2018; Fecher, Friesike, and Hebing 2015), not sharing data
analysis and other computer code (see, for example, Stodden, Guo, and Ma 2013; Shamir et al. 2013), and
that there is a widespread general lack of research transparency in science (see, for example, Iqbal et al.
2016). This has led to repeated calls for major cultural changes related to the sharing of data and code
and general research transparency under the banner of doing open science (for example, Nosek et al. 2015;
Munafò et al. 2017; Gorgolewski and Poldrack 2016; Ioannidis 2015).
Reproducible data analysis can also be motivated simply as a means of doing more high quality and robust
data analysis, even when this analysis is not being done as part of scientific research per se, such as with
confidential analyses done in business and industry. In these contexts, the raw data and analysis
pipeline may always remain confidential and never be shared publicly. Nonetheless, doing this analysis using
reproducible research practices allows for essential quality control. It allows the analysts themselves, those they
do the analysis for, or future analysts who inherit the project or who join it later, to scrutinize
and double-check every detail of the analysis and reproduce every result. These are essential measures to
identify errors, increase rigour, and verify the final results and conclusions. Even, or especially, outside
of academic or scientific research, there can be enormous practical and financial incentives for minimizing
errors and increasing analytical rigour in data analysis. As an example, it has been argued that a lack of
reproducible data analysis techniques was partly to blame for a $9bn loss at the investment bank JPMorgan
in 2012 (see Hern 2013).
This chapter is about doing open, transparent, and reproducible research using R and related software tools.
In particular, we will focus on RMarkdown and knitr for making reproducible reports, and Git for version
control of all our files. There are other important software tools for reproducible research that we could cover,
but we omit them here to keep our coverage relatively brief and introductory. These tools include Docker
for creating lightweight virtual operating systems, R packages for packaging and distributing code and data,
build automation tools like GNU Make or the R package drake (Landau 2018), and continuous integration tools
like Jenkins (https://www.jenkins.io/) or Travis CI (https://travis-ci.com/).
Before we proceed, however, let us briefly discuss some terminology. The terms open, transparent, and
reproducible are a collection of adjectives that describe a set of data analysis practices. These terms are
obviously not identical, but they are related. It is, however, not trivial to state the extent to which any one
depends upon or requires any other. While it is not necessary to be pedantic about the definitions and scope
of these terms, we will briefly outline our understanding of each term.
Open Open data analysis, like open source software or open science generally, is data analysis where all the
data, code, and any other required materials are fully disclosed to and shared with others, usually by
being made publicly available. Being publicly available, and in an unrestricted manner, is therefore
usually taken to be a defining feature of open data analysis.
Transparent Transparent data analysis is analysis where, as mentioned above, it is possible to determine
exactly where and how any given table, figure, or statistical result was obtained. Making data and
code and all other material open is usually a sufficient condition for the analysis to be transparent,
but it is possible to conceive of situations where data and code are open and available but are poorly
written and organized, or are obfuscated, or require undocumented manual intervention, and so lack
transparency. Likewise, as also mentioned above, it is not necessary for data to be open, at least in the
sense of publicly available, for it to be transparent.
Reproducible Reproducible data analysis is any data analysis where an independent analyst can exactly
reproduce all the results. For an analysis to be reproducible it is necessary that all the data and
code is available and in full working condition. Strongly implied, however, is that running the code is
essentially a turnkey operation. Software tools, such as build automation tools like GNU make, are
often used, especially with larger projects. For reports and articles, and their myriad tables and figures
and in-text quantities, literate programming tools, particularly rmarkdown with R, are used. Version
control software, such as git, is often used to organize and keep a log of the development of all the
analysis code and scripts. To allow the code to be used across different operating systems and with
the correct dependencies, virtual or containerized operating systems using tools such as docker are
sometimes used. Other tools, such as the checkpoint::checkpoint() function in R, can be used to
ensure that the correct versions of R package dependencies are used.
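As a brief sketch of the last of these, the checkpoint package installs and uses the package versions that were on CRAN at a specified snapshot date; the date used below is purely illustrative:

```r
# Install the checkpoint package if necessary, then pin all
# package dependencies to the versions that were on CRAN at
# the given snapshot date. The date here is an arbitrary example.
install.packages("checkpoint")
library(checkpoint)
checkpoint("2019-10-25")
```

After this call, library() and install.packages() in the project use the pinned snapshot, so re-running the analysis later uses the same package versions.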
Markdown Markdown was introduced by John Gruber (Gruber 2004) primarily as a means of allowing web users to create formatted posts, including links,
images, lists, etc., on online discussion forums. Markdown consists of normal text, just as you would
write in an email or any other document, as well as some minimal syntax that indicates how this text
should be formatted when it is rendered into some output document, such as an html page, pdf, or Word
document.
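For illustration, the following is a minimal, hypothetical fragment of Markdown source using some of its most common syntax:

```markdown
# A section header

Some *italicized* text, some **bold** text, and a
[link](https://example.com).

- A bullet point
- Another bullet point
```

When rendered, the `#` line becomes a section header, the asterisks produce italics and bold, and the hyphenated lines become a bulleted list.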
Knitr Knitr (see Xie 2017) is a general tool for dynamic document generation using R. In brief, and put
very simply, knitr extracts R code from RMarkdown documents, runs this code, and then, by default,
inserts copies of this code and the code’s output into a Markdown document. Knitr then runs pandoc
to create a document, such as a pdf or MS Word document, from this Markdown file.
Pandoc Pandoc (MacFarlane 2006) is, in general, a document converter that can convert a large number of
input document types to an equally large number of output document types. For present purposes, it
is the means by which Markdown documents generated by knitr are converted into their final output
formats such as pdf, MS Word, html, etc.
LaTeX LaTeX (Lamport 1994) is a document preparation system that is specialized for creating high
quality technical and scientific documents, especially those containing mathematical formulas and
technical diagrams. It is widely used for creating academic and research manuscripts in mathematically
oriented fields such as statistics, computer science, physics, etc. LaTeX documents are created by first
writing a .tex source code file, which is a mixture of LaTeX and TeX code, markup syntax, and plain
text. This .tex file is then rendered to, usually, a pdf using a LaTeX rendering engine, of which there
are many, though the most widely used is the default pdflatex engine.
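As a minimal sketch, the following is a complete .tex file that an engine such as pdflatex could render to a pdf:

```latex
\documentclass{article}
\begin{document}

A minimal example with an inline formula, $y = x^2$, and a
displayed formula:
\[
  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.
\]

\end{document}
```

As we will see, when using RMarkdown, files like this are generated for us behind the scenes, so we rarely need to write them by hand.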
Installation
If using RStudio, the necessary R packages including rmarkdown and knitr are automatically installed.
Likewise, the external pandoc program is also automatically installed when rmarkdown is installed. While
this set of tools will allow you to create html and MS Word documents, LaTeX must be installed to create
pdf outputs. Installation and configuration of LaTeX, because it is a large external program, may not always
be straightforward. It may seem, therefore, that this LaTeX installation step is not worthwhile, especially
given that many people unquestionably and happily use MS Word to write their manuscripts. However,
we highly recommend installing LaTeX and using the LaTeX-ed pdf document output over MS Word for
writing manuscripts. At the very least, the typesetting quality of the resulting LaTeX-ed document will be
considerably higher than that of an MS Word document. Moreover, by using LaTeX with RMarkdown, we can avail of
all the power of LaTeX to create the final document. This includes the use of all LaTeX packages for creating
and styling mathematical and technical content, including technical diagrams; fine control of the look, feel,
and placement of figures and tables; internal cross-referencing to individual pages, sections, figures, etc.;
automatic index generation; and much more. Even if these features seem unnecessary at
the beginning, in our opinion, it is nonetheless worthwhile to use LaTeX with RMarkdown from the
beginning, so that all these features can be used as one’s experience with RMarkdown grows.
To install LaTeX for the purpose of using it with RMarkdown, we can install the R package tinytex.
install.packages("tinytex")
tinytex::install_tinytex()
This installation will take some time. After it completes, you should restart RStudio. After you restart, type
the following command (note the three :’s after tinytex):
tinytex:::is_tinytex()
After this installation completes, we can test that rmarkdown (which will already have been installed, as
described above) will render pdf documents using LaTeX with the following code.
writeLines("Hello $x^2$", 'test.Rmd')
rmarkdown::render("test.Rmd", output_format = "pdf_document")
The writeLines command creates a tiny .Rmd file named test.Rmd. The rmarkdown::render command then attempts
to render this as a pdf document, which requires that LaTeX and related tools be properly installed and working.
If this works as expected, we will have a pdf document named test.pdf in our working directory.
As a simple example, consider the following RMarkdown file, which we will assume is saved as example.Rmd.
```{r}
x <- c(5, 12, 103)
y <- x ^ 2
```
The *mean* of `x` is `r mean(x)`, and the **minimum** value of `y` is `r min(y)`.
This example contains one R chunk. This is the portion of R code between the opening ```{r} and the
closing ```. It also has two pieces of inline R code. These are the code segments between the opening `r and
closing `. The remainder of the code in this example is plain Markdown code.
As we will see, we could open this file in the RStudio editor and then knit it to, for example, a pdf document.
This can be accomplished by clicking on the small arrow to the right of the Knit icon in the RMarkdown file
editor, and choosing Knit to PDF. Alternatively, in the console, we could run the following command, assuming that
example.Rmd is in the current working directory.
rmarkdown::render("example.Rmd", output_format = "pdf_document")
A pdf document named example.pdf will then be produced.
The key features to notice here are that the Markdown code within the RMarkdown file is formatted according
to its instructions. For example, the appropriate words are italicized and emboldened. Also, the R code in
the chunk is appropriately formatted as code, using a monospaced font, and is syntax highlighted. Finally,
and most interestingly in this example, the values returned by the R commands in the two pieces of inline
code are inserted at the locations of, and thus replace, the original two pieces of inline code. In other words,
in the final document, the values of 40 and 25 replace `r mean(x)` and `r min(y)`, respectively.
Although this is a very simple example, a lot has happened to produce the final pdf output. The procedure is
roughly as follows:
• First, knitr extracts the R code from example.Rmd and runs it in a separate R session. The
code is run in the order in which it appears in the document. So, first, the R chunk is run, which
creates two vectors, x and y. Then, the R commands mean(x) and min(y) are executed.
• Next, all the Markdown code in example.Rmd is extracted and inserted into a new temporary
file, example.md. Likewise, by default, copies of the R code in the chunk, though not the inline code,
are inserted into this example.md Markdown file. Any R code that is inserted into example.md is
marked-up as R code so that it will be properly rendered, including with syntax highlighting, in the
final output document.
• Next, any output from the R code, whether from the R chunk or the inline R code, is inserted into the
example.md file. In the example above, there is no output from the R chunk. However, the
commands mean(x) and min(y) both return numbers, and so these two numbers are inserted into the
example.md Markdown file at the exact locations where the two pieces of inline R code occurred.
• Now, knitr calls pandoc. Pandoc first converts the example.md Markdown file into a LaTeX source code
file named example.tex. This file is essentially just another script whose contents would be easily
understandable to anyone who knows LaTeX.
• Next, pandoc calls a LaTeX rendering engine, such as pdflatex, which converts the LaTeX source code file
example.tex into the pdf document example.pdf.
• Finally, the intermediate files, including example.md and example.tex, are by default removed, so
that the only remaining files are the original RMarkdown source file example.Rmd and the final output
document example.pdf.
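The first stage of this pipeline can also be run manually, which can be helpful for understanding or debugging it; the pandoc command shown in the comment is illustrative:

```r
# Run the R code in example.Rmd and write the resulting
# Markdown file (this is the knitr stage of the pipeline).
knitr::knit("example.Rmd", output = "example.md")

# The remaining stages are normally handled by pandoc, e.g.:
# pandoc example.md -o example.pdf
```

Ordinarily, however, rmarkdown::render performs all of these stages in one step.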
---
title: "Data Analysis: A Report"
author: "Mark Andrews"
date: "October 25, 2019"
output: pdf_document
---

# Introduction

First, we will load the `tidyverse` packages, and read
in the data from a `.csv` file.

```{r load_packages_data}
library(tidyverse)
data_df <- read_csv('example.csv')
```

# Analysis

Here, we do a Pearson's correlation analysis.

```{r analysis}
(corr_model <- cor.test(~ x + y, data = data_df))
```

The correlation coefficient is
`r round(corr_model$estimate, 3)`.

# Visualization

The scatterplot between `x` and `y` is shown
in Figure 1.

```{r vis, echo=F, fig.cap='A scatterplot.'}
ggplot(data_df, aes(x, y)) +
  geom_point() +
  theme_classic()
```

In the corresponding pdf output, the analysis chunk is followed by its output, and the value 0.297 replaces the inline code:

##
##  Pearson's product-moment correlation
##
## data:  x and y
## t = 2.1583, df = 48, p-value = 0.03593
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02080288 0.53175304
## sample estimates:
##       cor
## 0.2974284

Figure 1: An example of an RMarkdown file and its corresponding pdf output. The pdf output is generated by knitting the RMarkdown file, which can be accomplished by rmarkdown::render("example-2.Rmd"), where example-2.Rmd is the RMarkdown file.
In Figure 1, we show the code of an RMarkdown file together with its corresponding pdf output. This
example illustrates many of the features of a typical RMarkdown document. We will discuss these
features by looking through each section of the RMarkdown code.
The YAML header
The first few lines of the document, specifically those delimited by the lines with the three
dashes (i.e., ---), constitute its YAML header.
---
title: "Data Analysis: A Report"
author: "Mark Andrews"
date: "October 25, 2019"
output: pdf_document
---
YAML (a recursive acronym for YAML Ain’t Markup Language) is itself a minimal markup language that
is now often used for software configuration files. As is perhaps clear from this example, in this header we
specify the title, author, date, and output format of the document. This information is used when rendering
the output document. At its simplest, YAML consists of key-value mappings where the term on the left
of the : is the key, and the term or statement on the right is the value. Thus, author: "Mark Andrews"
indicates that the author is "Mark Andrews", and when the final document is being generated, "Mark
Andrews" will be inserted as the value of the author in the output document’s template. Note that although we
use double quotation marks for the values of the title, author, and date keys, neither single nor double
quotation marks are always necessary. In this example, we would only require the quotation marks for the value of
title. This is to ensure that the colon within Data Analysis: A Report is not mistakenly parsed by the
YAML parser as indicating a key-value mapping.
We specify the document output type in the header by output: pdf_document. As mentioned above, when
writing manuscripts, we recommend always using the pdf based output document option as this will be
generated by LATEX. However, had we preferred to create an MS Word output document, we would state
output: word_document. Likewise, had we preferred an html output document, which can be
viewed in any browser or used to create a webpage, we would state output: html_document.
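It is also possible to list more than one output format under the output key, as in the following sketch of a YAML header:

```yaml
---
title: "Data Analysis: A Report"
author: "Mark Andrews"
output:
  pdf_document: default
  word_document: default
---
```

With a header like this, the first listed format is used by default, and rmarkdown::render("example.Rmd", output_format = "all") renders all of the listed formats.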
knitr::opts_chunk$set(message = F,
                      warning = F,
                      out.width = "45%",
                      fig.align = 'center')
Here, we set the values of the opts_chunk list of global options used by knitr. For example, in this case, we
start by indicating that, by default, all subsequent chunks in the RMarkdown document should have the
options message = FALSE and warning = FALSE (because these are R statements, we can use F for FALSE).
Setting message = FALSE entails that the messages produced by R commands, which can be verbose, are not
shown. As an example of this, consider the usual output when we load a package, in this case lme4.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
Therefore, globally setting message = FALSE can considerably reduce clutter in the output document.
Likewise, it is often preferable to globally suppress R warnings to avoid clutter in the final document.
Although not used in this example, another commonly set global option at this point is echo=FALSE. Often,
especially in manuscripts for peer-reviewed articles, we do not wish to display all, or even any, of the R code
used in the analysis. Rather, we just want to display the results of the analysis in tables, figures, etc, and so
globally setting echo=FALSE avoids having to set echo=FALSE on each subsequent chunk.
The next two settings, out.width = "45%" and fig.align='center', pertain to how figures should be
displayed by default. The out.width = "45%" indicates that the figure should occupy 45% of the width of
the page, and fig.align='center' indicates that it should be centered on the page.
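Any of these global defaults can be overridden in an individual chunk by setting options in that chunk’s header; the chunk label and option values in the following sketch are illustrative:

````markdown
```{r big_scatterplot, out.width = "80%", fig.align = 'left'}
ggplot(data_df, aes(x, y)) + geom_point()
```
````

Here, this one chunk’s figure would be wider and left-aligned, while all other chunks would continue to use the global settings.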
Markdown sections
The use of the # followed by some text at the start of a line in the Markdown language indicates a section
header. Thus, in our case, the following line indicates we should have a section entitled Introduction in our
document.
# Introduction
While the single # is for a section, subsections are indicated by ##, and subsubsections are indicated by ###.
Here is an example with three levels of sections.
# This is a section
## This is a subsection
### This is a subsubsection
R chunk outputs
Just as we see the output of R commands when those commands are run in the console, the output of R
commands will appear in the output document unless we request otherwise. Consider the following
lines from our example RMarkdown file.
```{r analysis}
(corr_model <- cor.test(~ x + y, data = data_df))
```
In this case, the assignment statement is surrounded by ( and ); in R, this causes the value of
the expression in the statement to be printed. In other words, the output that will be shown will be identical
to what would be shown had we just typed the following.
cor.test(~ x + y, data = data_df)
By default, each line of this output will begin with ## (because these lines are part of the R chunk’s output,
they are not interpreted as Markdown subsection headers). Should we wish to use alternative symbols at the beginning of the
output, we can indicate this by setting the value of the comment chunk option. For example, if we wanted all
R output to start with >, we set comment = '>', as in the following code.
```{r, echo=T, comment='>'}
rnorm(5) %>% round(2)
```
This code would then be rendered as follows.
rnorm(5) %>% round(2)
> [1] -1.10 0.64 -0.83 0.61 0.86
If $\Phi = \phi_1, \phi_2 \ldots \phi_k \ldots \phi_K$, where each $0 \leq \phi_k \leq 1$, and $\sum_{k=1}^K \phi_k = 1$, then $\Phi$ is a probability mass function.
In these examples, all the LaTeX code occurs in its inline mode, which is obtained by using $ delimiters. In
addition to inline mode, there is LaTeX display mode, which is obtained by using $$ delimiters. LaTeX
display mode is used to display mathematical formulae and other notation on their own lines. Consider the following
example.
The probability of the observed data is as follows:
$$
\mathrm{P}(x_1 \ldots x_n \vert \mu, \sigma^2)
= \prod_{i=1}^n \mathrm{P}(x_i \vert \mu, \sigma^2),
$$
where $\mu$ and $\sigma$ are parameters.
This would then be rendered with the product formula displayed, and centered, on its own line.
LaTeX provides a very large number of mathematical symbols, notations, and formulae. It is beyond the scope
of this chapter to provide an overview of all of these features. However, we will provide an overview of
some of the more widely used examples in a later section of this chapter.
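As a brief taster, the following inline LaTeX fragments cover some commonly needed notation:

```latex
% Greek letters, subscripts, and superscripts
$\alpha, \beta, \mu, \sigma^2, x_i, x_{ij}$

% Fractions, roots, sums, products, and integrals
$\frac{a}{b}, \quad \sqrt{x}, \quad \sum_{i=1}^n x_i,
\quad \prod_{i=1}^n x_i, \quad \int_0^1 f(x)\,dx$

% Relations and distributions
$x \leq y, \quad x \neq y, \quad x \sim \mathrm{N}(\mu, \sigma^2)$
```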
Figures
If the R code in a chunk generates a figure, for example by using ggplot, then that figure is inserted
into the document immediately after the location of the chunk. These figures can optionally have
captions, by setting the value of fig.cap in the chunk header. When fig.cap is set, at least when using
the pdf_document output, the figure then floats. A floating figure is one whose placement in the
final document is determined by an algorithm that attempts to satisfy a large number of constraints, such
as the amount of whitespace on the page, the number of figures on one page, and so on.
In our example file, we have the following lines.
```{r vis, echo=F, fig.cap='A scatterplot.'}
ggplot(data_df, aes(x, y)) +
geom_point() +
theme_classic()
```
This chunk will create a ggplot figure that will float. Because of the global settings of out.width = "45%"
and fig.align = 'center', the figure will be relatively small and centered on the page. Because there is
sufficient space, the figure will be placed at the bottom of the page. However, were we to increase the width
of the figure to even out.width = "50%", the default settings of the float placement algorithm would place
the figure alone on the following page. Often, the choices made by LaTeX’s float placement algorithm are
initially unappealing. Our recommendation, however, is to not attempt any fine control until the document
is complete. As each new block of text, figure, or table is added to the document, the floats are relocated.
When the document is complete, it is possible to influence float placement by changing the parameter settings of
the float placement algorithm.
It should be noted that if figures do not float, they are placed exactly where the chunk’s output would
otherwise occur. In some cases, especially if the figures are relatively small, this can be satisfactory.
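One way to influence placement is the knitr chunk option fig.pos, which passes a placement specifier to the underlying LaTeX figure environment; the following is a sketch using the chunk from our example:

````markdown
```{r vis, echo=F, fig.cap='A scatterplot.', fig.pos='!htb'}
ggplot(data_df, aes(x, y)) +
  geom_point() +
  theme_classic()
```
````

Here, !htb asks LaTeX to try placing the figure here, then at the top of a page, then at the bottom, relaxing its usual aesthetic constraints.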
Figure 2: An additional RMarkdown file example. The rendered pdf output document is shown in Figure 3.
The YAML header of this second example file contains the following lines.

output:
  pdf_document:
    keep_tex: yes
header-includes:
  - \usepackage{booktabs}
bibliography: refs.bib
biblio-style: apalike

The last two lines relate to the bibliography, and we will return to these below. The header-includes lines indicate
that the LaTeX command \usepackage{booktabs} should be included in the header of the resulting .tex
file prior to it being rendered to pdf by the LaTeX engine. Recall that, as mentioned above, the rendering process
to produce the final .pdf document is that knitr takes the .Rmd file and creates a .md file, with R code and
R code output inserted into it. If the final output is to be pdf, pandoc converts the .md file to a .tex file,
and that is then rendered to .pdf by a LaTeX engine such as pdflatex. When pandoc creates the .tex file
from the .md file, it uses .tex templates that load some of the more widely used LaTeX packages. However,
if any other LaTeX packages are required to create the final pdf document, they may be listed, as we have
done here, in the YAML header using the header-includes option. It should be noted that there are many
thousands of additional LaTeX packages that can be installed.
Another option for including additional LaTeX functionality in your RMarkdown document is to include a block of LaTeX code
in the header. For example, we might have a file named include.tex with the following LaTeX command.
\newcommand{\Prob}[1]{\mathrm{P}( #1 )}
If we had the following YAML header in our RMarkdown document, the commands in include.tex would
then be available throughout the document.

output:
  pdf_document:
    includes:
      in_header: include.tex
Figure 3: The pdf document output corresponding to the RMarkdown code shown in Figure 2. The rendered report, entitled Data Analysis: Report II, presents descriptive statistics for the weights and heights of 6068 participants (4082 males, 1986 females), a varying intercepts normal linear model relating weight to height and gender, and a bibliography citing Gelman and Hill (2007).
For example, in this case, we could now write \Prob{X = x} = 0.25 in our RMarkdown file, and this would
eventually be rendered as P(X = x) = 0.25. The file include.tex can have an extensive number of custom
commands, including numerous \usepackage{} statements.
Formatted tables
When writing reports, we often wish to present results of statistical analyses in tables. Naturally, we would
ideally like these tables to be formatted to a high standard. As we’ve seen, we can always simply display
the standard output of R commands that produce tables and data frames, but the resulting unformatted
monospaced font text is not of an acceptable standard for most manuscripts and reports, especially those
that are intended for eventual publication. As an example, consider the table produced by the following R
commands, which use the built-in swiss data set.
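The commands in question are not shown in full above; the following is a sketch of commands that would produce a table like the one below (the object name swiss_df and the choice of the first five rows and first three columns are assumptions based on the table and on the later kable calls):

```r
library(knitr)

# Take the first five rows of three columns of the built-in
# swiss data set, then format as a table with kable.
swiss_df <- head(swiss[, c("Fertility", "Agriculture", "Examination")], 5)
kable(swiss_df, booktabs = TRUE, align = 'c')
```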
Fertility Agriculture Examination
Courtelary 80.2 17.0 15
Delemont 83.1 45.1 6
Franches-Mnt 92.5 39.7 5
Moutier 85.8 36.5 12
Neuveville 76.9 43.5 17
Here, we use booktabs = TRUE and, in the YAML header, as mentioned, we ensured that the booktabs
LaTeX package is loaded. By loading the booktabs package, the typesetting of tables in LaTeX is often
improved over the default options. The align = 'c' option ensures that the values in each column in the
table are centered. If we want to center the table produced by the above kable command, we can use the
kable_styling function in the kableExtra package as follows.
library(kableExtra)
swiss_df %>%
kable(format = 'latex',
booktabs = TRUE,
align = 'c') %>%
kable_styling(position = 'center')
Display maths
Here, we provide some examples of display mode mathematics. As mentioned, this is where
mathematical statements appear on their own lines. In these examples, we use the aligned environment in LaTeX,
which allows us to align multiple mathematical statements. Specifically, the first example is as follows.
$$
\begin{aligned}
y_i &\sim N(\mu_i, \sigma^2),\\
\mu_i &= \beta_0 + \beta_1 x_i + \beta_2 z_i,
\end{aligned}
$$
This is rendered as follows.

y_i ∼ N(μ_i, σ²),
μ_i = β_0 + β_1 x_i + β_2 z_i,
The & symbol on each line is used to align the lines, so that the ∼ on the first line is aligned with the =
on the second line. Here, we need to add \\ at the end of the first line to force a line break. The aligned
environment is widely used in LaTeX, especially for showing algebraic derivations. For example, the following
$$
\begin{aligned}
y &= \frac{e^x}{1+e^x},\\
&= \frac{e^x}{e^x (e^{-x} + 1)},\\
&= \frac{1}{1 + e^{-x}}
\end{aligned}
$$
is rendered as follows.

y = e^x / (1 + e^x)
  = e^x / (e^x (e^(−x) + 1))
  = 1 / (1 + e^(−x)).
As a further example of display mode mathematics, we can produce formulae such as the following:

$$
\int_{-\infty}^{1} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = 0.841
$$
Note also that when reporting p-values, we can avail of the R function format.pval. This function can be used to
return an inequality when the p-value is lower than a certain threshold, named eps. For example, consider
the following.
p <- c(0.05, 0.02, 0.011, 0.005, 0.001)
format.pval(p, eps = 0.01)
## [1] "0.050" "0.020" "0.011" "<0.01" "<0.01"
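In an RMarkdown report, format.pval is typically used in inline R code when reporting results, as in the following sketch (the object p_value here is hypothetical):

```markdown
The effect was statistically significant,
p = `r format.pval(p_value, eps = 0.001)`.
```

When knitted, the inline code is replaced by either the formatted p-value or, if it falls below eps, the corresponding inequality (e.g., <0.001).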
Bibliography
In the YAML header, we have included the following statements.
bibliography: refs.bib
biblio-style: apalike
The first line tells RMarkdown to use the bibliographic information contained in the file refs.bib, which is
assumed to be present in the working directory. The second line tells RMarkdown to format the citations
and the references in the bibliography using APA style.
The refs.bib file is just a plain text file, and can be named anything. In this example, it is minimal,
containing just the following entry.
@book{GelmanHill:2007,
address = {New York},
author = {Gelman, Andrew and Hill, Jennifer},
title = {Data Analysis Using Regression and Multilevel/Hierarchical Models},
publisher = {Cambridge University Press},
year = 2007
}
The @book{ .... } content is a BibTEX bibliographic entry for a book, with BibTEX being the primary
bibliography manager used with LATEX. As this is a simple text file, these bibliographic entries can be relatively
easily created. However, because BibTEX is such a widely used bibliography manager, BibTEX bibliographic
entries are provided by many scholarly article databases, including Google Scholar. It is therefore very easy
to build up a personal bibliography file, which can then be re-used in all of one’s reports.
Having defined a bibliography in the YAML header as we have done, we can now use the BibTEX keys to
perform citations. The key for a BibTEX entry is the text after the first { and before the first comma. Consider
the following three BibTEX entries.
@book{xie2017dynamic,
title={Dynamic Documents with R and knitr},
author={Xie, Yihui},
year={2017},
publisher={Chapman and Hall/CRC}
}
@article{knuth1984literate,
title={Literate programming},
author={Knuth, Donald Ervin},
journal={The Computer Journal},
volume={27},
number={2},
pages={97--111},
year={1984},
publisher={Oxford University Press}
}
@book{wickham:2019,
Author = {Hadley Wickham},
Title = {Advanced R},
Publisher = {Chapman and Hall/CRC},
Year = {2019},
edition = 2
}
The keys here are xie2017dynamic, knuth1984literate, and wickham:2019. Should we wish to refer to,
say, Wickham’s book in our RMarkdown file, we would simply write @wickham:2019. For example, the
following statement in RMarkdown
As is described in @wickham:2019, R is a functional programming language.
would result in “As is described in Wickham (2019), R is a functional programming language.” in the output
document. Alternatively, had we written
R is a functional programming language [@wickham:2019].
this would result in “R is a functional programming language (Wickham 2019).” in the output document.
In either case, the following line would then be inserted at the end of the output document.

Wickham, Hadley. 2019. Advanced R. 2nd ed. Chapman and Hall/CRC.

If we had multiple citations, they would all be listed at the end of the output document in alphabetical order.
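Pandoc’s citation syntax, which RMarkdown uses, also allows several citation keys inside a single bracket, separated by semicolons. As an illustrative sketch reusing the keys defined above, the following line

Multilevel models are discussed at length in the literature [@GelmanHill:2007; @xie2017dynamic].

would render with both citations inside a single set of parentheses, and both works would be listed in the bibliography.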
By default in RMarkdown, this list of bibliographic references is not given a section name. Therefore, as
we have done in this example RMarkdown document, we simply end the document with a section named
References by using the following line as the last line.
# References
All the references will now be inserted after this section header. Should we prefer the name Bibliography, or
any other option here, we simply change the name of this section.
Headers
We have already seen section headers above. Section headers begin with # at the start of a new line,
followed by the section title. These lines are preceded and followed by a blank line. The number of # symbols
indicates the level of the section: # indicates a section, ## a subsection, ### a subsubsection, and so
on. Note that there should be a space after the # symbol or symbols at the start of the line. For example,
# Introduction will create a section entitled Introduction, but #Introduction will produce a line with
#Introduction on it. In Figure 4, on the left, we see a Markdown file with multiple sections and subsections.
On the right, we see this file rendered as a pdf document.
The Markdown source shown in the figure is as follows.

# Introduction

This is a sentence in the Introduction.

# Analysis

This is a sentence in the Analysis section.

## Exploratory analysis

Figure 4: An example of a Markdown file on the left, and its corresponding pdf output on the right. The
Markdown file has sections, indicated by lines beginning with #, and subsections, indicated by lines beginning
with ##.
Code
Often in technical documents, we need to display computer code. If we simply surround the code block by
```, we will obtain monospace typed text. For example, the following Markdown code shows Python code.
```
for x_i in x:
y.append(x_i)
```
This is rendered as follows.
for x_i in x:
y.append(x_i)
However, ideally we would prefer the code to be syntax highlighted. We can do so by indicating the code’s
language after the initial ```. In the following Markdown code, we state that the code is Python.
```python
for x_i in x:
y.append(x_i)
```
This is then rendered as follows.
for x_i in x:
y.append(x_i)
When we wish to display R code, we have another option. We can use a normal R code chunk, but set the
chunk parameter eval to FALSE, and set echo to TRUE, assuming it is not globally set to TRUE.
```{r, echo=TRUE, eval=FALSE}
n <- 10
x <- rnorm(n)
y <- 2.25 * x + rnorm(n)
M <- lm(y ~ x)
```
Lists
There are three types of lists possible with the pdf document output: itemized lists, enumerated
lists, and definition lists.
Itemized lists In itemized lists, also known as unordered lists or bullet-point lists, each item begins
on a new line that begins with * followed by a space. The list must be both preceded and followed by a
blank line. For example, the following Markdown code

* Apple
* Orange
* Blueberry

renders as a bulleted list of the three items.
Items in a list do not need to be written using only one line of Markdown code. A single item may be spread
over multiple lines as in the following example.
* If we start our item
on one line,
and continue to another
line with no break,
it will still appear as one item.
* The same thing happens
if you continue
on the next line after
an indentation.

Both items render as single, continuous items.
We can also create nested lists by using indented lines that themselves begin with *, as in the following
example.
* Farm animal
    * Cow
    * Pig
    * Sheep
* Wild animal
    * Fox
    * Wolf

This renders as a two-level bulleted list, with the nested items marked by dashes.
In general in Markdown, the indentation should be made with four spaces or a tab. In some cases, fewer
than four spaces will suffice, but it is generally better to use four. Also, beware that some editors,
including the RStudio editor, map the Tab key to two spaces.
Enumerated lists Enumerated lists are defined and behave just like itemized lists, except that instead
of an item or sub-item being defined by a line beginning with a * (followed by a space), it is defined by a line
beginning with a number followed by a . and then a space. In the following example, each item begins with 1.
followed by a space.

1. Potato
1. Broccoli
1. Cabbage

This renders as a list numbered 1, 2, 3.
We do not need to always use 1. to obtain an enumerated list. Any other numbers will suffice, as in the
following example.
1. Potato
4. Broccoli
10. Cabbage

This also renders as a list numbered 1, 2, 3.
However, if we use a number greater than 1 as the first item, then that will be used as the starting value of
the numbering of the list as in the following example.
12. Potato
4. Broccoli
10. Cabbage

This renders as a list numbered 12, 13, 14.
Enumerated lists can be nested just like itemized lists.
1. Tree
    1. Fir
    1. Pine
    1. Oak
1. Flower
    1. Rose
    1. Tulip
    1. Daffodil

This renders as an enumerated list in which the sub-items under each top-level item are themselves numbered
1, 2, 3.
We may also mix enumerated and itemized lists when using nestings as in the following example.
* Tree
    1. Fir
    1. Pine
    1. Oak
* Flower
    1. Rose
    1. Tulip
    1. Daffodil

This renders as a bulleted list whose sub-items are numbered 1, 2, 3.
Definition lists Definition lists, when rendered, begin with an emboldened term followed by a definition
or description. This can be useful for definitions per se, but also when we need to elaborate on or describe
certain terms. For example, they could be used to describe the meaning of different variables in a data set.
We create a definition list by beginning a new line with some text, which is usually brief, such as a name. The
subsequent line then begins with a : followed by the definition or description. For example, in the following
code, we provide a definition list elaborating upon the variables in a dataset.
participant-id
: An integer that uniquely
codes each participant.

gender
: A binary variable with
values `female` and `male`.

age
: A numeric variable giving
the participant's age in years.

This renders with each variable name emboldened, followed by its description.
In what follows, we provide a brief and minimal introduction to the main mathematical typesetting commands
that are available in RMarkdown. To be able to appreciate all of the mathematical typesetting options in
RMarkdown, we would have to provide a thorough introduction to mathematical typesetting in LATEX.
This is well beyond the scope of this book, but there are many books that provide extensive details about
mathematical typesetting in LATEX. For example, one highly recommended, comprehensive, and up-to-date
book is Grätzer (2016).
For example, $x$ becomes x, $A$ becomes A, etc. It is also very common to use the upper- and lowercase
letters of the Greek alphabet. The lowercase Greek symbols are given in the following table.
\alpha α      \iota ι       \sigma σ
\beta β       \kappa κ      \tau τ
\gamma γ      \lambda λ     \upsilon υ
\delta δ      \mu µ         \phi φ
\epsilon ε    \nu ν         \chi χ
\zeta ζ       \xi ξ         \psi ψ
\eta η        \pi π         \omega ω
\theta θ      \rho ρ
Given that some uppercase Greek letters are identical to uppercase English letters, only those that differ
from English letters have, or indeed need, LATEX commands to render them.
These are shown in the following table.
\Gamma Γ \Lambda Λ \Sigma Σ \Psi Ψ
\Delta ∆ \Xi Ξ \Upsilon Υ \Omega Ω
\Theta Θ \Pi Π \Phi Φ
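As a brief illustration (a sketch, not an example from the text), these Greek letter commands are used inside math mode like any other symbols. For example,

$y \sim N(\mu, \sigma^2)$ and $\Delta = \mu_1 - \mu_2$

renders as y ∼ N(µ, σ²) and ∆ = µ1 − µ2.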
Calligraphic or blackboard bold fonts of English uppercase letters commonly appear too. The following
calligraphic uppercase English letters are often used in mathematical formulas and notation.
\mathcal{A} A \mathcal{G} G \mathcal{M} M \mathcal{S} S \mathcal{Y} Y
\mathcal{B} B \mathcal{H} H \mathcal{N} N \mathcal{T} T \mathcal{Z} Z
\mathcal{C} C \mathcal{I} I \mathcal{O} O \mathcal{U} U
\mathcal{D} D \mathcal{J} J \mathcal{P} P \mathcal{V} V
\mathcal{E} E \mathcal{K} K \mathcal{Q} Q \mathcal{W} W
\mathcal{F} F \mathcal{L} L \mathcal{R} R \mathcal{X} X
Likewise, the following blackboard bold fonts are widely used.
\mathbb{A} A \mathbb{G} G \mathbb{M} M \mathbb{S} S \mathbb{Y} Y
\mathbb{B} B \mathbb{H} H \mathbb{N} N \mathbb{T} T \mathbb{Z} Z
\mathbb{C} C \mathbb{I} I \mathbb{O} O \mathbb{U} U
\mathbb{D} D \mathbb{J} J \mathbb{P} P \mathbb{V} V
\mathbb{E} E \mathbb{K} K \mathbb{Q} Q \mathbb{W} W
\mathbb{F} F \mathbb{L} L \mathbb{R} R \mathbb{X} X
As examples of where these fonts are used, \mathbb{N}, \mathbb{Z}, \mathbb{Q}, \mathbb{R}, and \mathbb{C} are used to denote the natural numbers, integers,
rationals, reals, and complex numbers, respectively.
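For instance (an illustrative sketch), these blackboard bold fonts typically appear in statements of set membership, as in

$x \in \mathbb{R}$, $n \in \mathbb{N}$, $z \in \mathbb{C}$

which renders as x ∈ R, n ∈ N, z ∈ C, with the sets typeset in blackboard bold.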
It should be noted that not using the braces here would simply lead to a “! Double superscript” or “! Double
subscript” error.
It is also possible to mix sub- and superscripts. For example, $2^i_j$ gives 2 with a superscript i and a
subscript j, and $2_j^i$ gives the same result.
$$
\frac{1}{2} + \frac{3}{4} = \frac{4+6}{8} = \frac{5}{4}
$$
While it is perhaps more common to use \frac{}{} in display mode, it may be used in inline mode as well,
as in the following example.
The result is $x = \frac{a + b}{c + d}$.
This is rendered as follows: “The result is x = (a + b)/(c + d).”
$$
\frac{a}{b + \frac{1}{c}}
$$
$$
\sum_{i=1}^{n} x_i
$$
The limits of the sum are not strictly necessary. Thus, we could omit them as follows.
$$
\sum x_i
$$
This appears as follows.

∑ x_i
However, this summation notation is ambiguous at best, and so including the limits of the sum is highly
recommended at all times.
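As a further illustrative sketch (not an example from the text), the limits notation extends naturally to nested sums.

$$
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij}
$$

This renders as a double summation over i from 1 to n and j from 1 to m.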
We write products using the uppercase Pi symbol, but again, we should not use \Pi but use \prod instead.
Just like \sum, \prod should always be used with limits, as in the following example.
$$
\prod_{i=1}^n x_i
$$
This appears as follows:
∏_{i=1}^{n} x_i
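As an illustrative aside (not taken from the text), \sum and \prod often appear together, as in the following identity.

$$
\log \prod_{i=1}^{n} x_i = \sum_{i=1}^{n} \log x_i
$$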
$$
\int_{0}^{1} f(x) \, dx
$$
When limits on the integral are given, it is known as a definite integral. In the absence of limits, we have an
indefinite integral, which is the integral over all values of the variable of integration.
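As an illustrative sketch of an indefinite integral (with C the usual arbitrary constant of integration), we could write the following.

$$
\int x^2 \, dx = \frac{x^3}{3} + C
$$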
Roots
We obtain the square root symbol with the \sqrt command. To use this properly, unless the expression is
a single digit, we must enclose the expression in {} after the command. In other words, to obtain
√2, we can use $\sqrt 2$, but to obtain √42, we must write $\sqrt{42}$. Had we written $\sqrt 42$ or
even $\sqrt(42)$, the root would cover only the first character, giving √4 2 or √( 42), which is not the
desired result. Note that \sqrt produces a square root that stretches to enclose the expression to which it
applies, as we see in the following example, which gives the formula for the sample standard deviation.
$$
s = \sqrt{\frac{\sum^n_{i=1} (x_i - \bar{x})^2}{n-1}}
$$
This will appear as follows.
s = √( ∑_{i=1}^{n} (x_i − x̄)² / (n − 1) )
To obtain the n-th root, we place the value of n in square brackets after \sqrt and before the {}, as we see
in the following example, which displays the cubed root.
$$
\sqrt[3]{125} = 5
$$
This will appear as follows.
∛125 = 5
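As a further illustrative sketch (not from the text), the optional argument of \sqrt combines with other constructs such as \frac.

$$
\sqrt[3]{\frac{8}{27}} = \frac{2}{3}
$$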
$$
\begin{aligned}
f &= (x + y)(x + y),\\
&= x^2 + 2xy + y^2
\end{aligned}
$$
Note that similarly to the case of the matrix environment, we use & for alignment and \\ for newlines.
We may use the aligned environment even when we do not need alignment per se, but just require multiline
equations, as in the following example.
$$
\begin{aligned}
x = a + b\\
y = c + d + e
\end{aligned}
$$
This appears as follows.
x = a + b
y = c + d + e
Git
Version control software (vcs) is an essential tool for the efficient management and organization of source
code. In the case of data analysis using R, the relevant source code files will primarily include .R and .Rmd
scripts, but even in small and routine projects, there are many other possibilities too. vcs allows us to keep
track of all the versions or revisions to a set of files in an efficient and orderly manner. As a simple example
to motivate the use of a vcs system, let us say that we are working on a relatively small data analysis
project initially involving some .R and .Rmd scripts, with names like preprocessing.R, exploration.R,
analysis.R, and report.Rmd. Let’s say that we work with these files, adding new code, editing or deleting
old code, etc., every few days in a normal R session. If we were to simply save the files after each session, we
would obviously only ever have their most recent versions. All the previous versions would be lost. In order
to avoid loss of previous versions in case they are needed, we could periodically save as, creating versions
like preprocessing_v1.R, analysis_oct23.R, and so on. As time goes by, it is highly likely too that new
files will appear. Some of these may have been only intended to be temporary files, but others might be
intended to be vital parts of the project. Some new files might be branches of other files, where we copy the
original and work on some new feature of the code in the copy, with the intention of merging the changes
back if and when necessary. By proceeding in this manner, there is usually an eventual proliferation of new
files and different versions of files with sometimes ambiguous or inscrutable names like analysis_v1_tmp.R,
analysis_v1_tmp_new.R, preprocessing_tmp_foo.R, and so on. If files are being copied between different
devices or to cloud based storage, and edited on different devices, the situation can get ever more disorganized,
with files of similar names but perhaps slightly different contents or different time-stamps across different
machines. At this point, especially if we return to this work after a period of time, it is usually not
clear even what each file does or where the latest version of any file is, not to mention what all the previous
or temporary versions contain and when and why they were made. If we collaborate with others, things
usually become even worse. First, we must decide on a means of sharing files, with email attachments still
probably being the default and most widely used method of doing so. Sending emails back and forth with
modifications creates yet more versions to manage, and multiple people working independently on the same
files introduces conflicts that need to be manually resolved. Eventually, we have multiple files and versions,
on multiple devices, being edited independently by multiple different people. Knowledge of what each file and
version is or does, and of who did what, when, and where, is usually lost as a result.
This level of disorganization is frustrating, wasteful of time and effort, and obviously bad for reproducibility.
The authors themselves may find it difficult or impossible to pick through their files to recover and reproduce
all the steps involved in any analyses. Moreover, even if they were only working on one file such as an
RMarkdown file from which their final report was generated, the proliferation of versions across different
devices and owners would still occur, making it difficult to pick up and resume their work after a period
of inactivity. In addition, they may lose track of which version of the .Rmd produced which version of the
rendered manuscript. It is all very well knowing that a manuscript was produced by knitting a .Rmd, and
hence that all its reported results are reproducible in principle, but if we have lost track of the .Rmd that
produced it, it is obviously no longer reproducible in practice.
vcs systems allow us to manage our source files in an orderly and efficient manner. There are many vcs
systems available, both proprietary and open source, and while precise information on usage worldwide is hard
to establish definitively, almost all surveys of vcs usage show that Git is now by far the most popular and
widely used vcs system. Git is open-source software that was originally developed in 2005 for version control
of the development of the Linux operating system, something for which it is still used. It gradually became
more widely used in the open-source community and within a few years had become more popular than the
previously very widely used open-source Subversion vcs system. With the growing popularity of Git hosting
sites like GitHub, which currently hosts over 100 million Git-based projects, Git is now the most widely used
vcs system worldwide.
In what follows, we aim to provide a brief introduction to some of the main features of Git. Obviously, it is
beyond the scope of this section to provide a comprehensive introduction to Git. Here, we just provide an
introduction to installing and configuring Git, initializing a Git repository, adding files and editing files in the
repository, and using a remote repository such as GitHub. These are the must-knows for getting up and
running with Git at the start. We do not cover important topics such as reverting or resetting changes,
branching, merging, rebasing, and so on. These topics, and many others, are covered in books such as Chacon
and Straub (2014), which is available in its entirety online at https://git-scm.com/book/en/v2.
Installation
Git is available for Windows, MacOS, and Linux. Git is first and foremost a command-driven software. There
are graphical interfaces (GUIs) to Git, but we will not consider them here, nor do we particularly recommend
them, given that there are only a small number of core commands to learn, and that these commands allow
Git to be used both efficiently and identically across all different devices.
For Windows, we highly recommend installing and using the Git Bash shell available from https://gitforwindows.org/.
This provides a Bash Unix shell1 from which Git can be used just as it would be used on other
Unix systems like Linux2 and MacOS.
For MacOS, Git comes preinstalled on recent versions of the operating system. While this may be perfectly
adequate, the preinstalled Git on MacOS is based on a build of Git by Apple and is usually not up to date
with the latest version of Git. More recent versions of Git for MacOS are available elsewhere, such as at
https://git-scm.com/download/mac.
For Linux, given the role of Git in the development of Linux, Git is seen as a vital Linux tool. It is easily
installed using the package managers of any Linux distribution, see https://git-scm.com/download/linux.
Once Git is installed, it is available via the command git in an operating system terminal. For
Windows, this means that it is available in the DOS shell using the command git. However, if Git is installed,
as we recommended, with the Git Bash shell, then that Bash shell should always be used instead of the DOS shell.
For MacOS and Linux, Git will be available in the system terminal, and will work identically in the Bash, sh,
zsh, etc., shells.
For what follows, we will assume users are using a Unix shell, and so all of this will be equally applicable to
users of Linux, MacOS, and Windows, assuming Windows users use Git Bash.
To establish that Git has been successfully installed and is available for use, type the following
git --version
The output of this command on a relatively up to date (as of July, 2022) version could appear as follows.
git version 2.25.1
Configuration
Before we start using Git, we need to perform some minimal configuration. Specifically, we first need to set
our name and email address. Git requires that each time we commit to the repository, as we will see below,
1 https://en.wikipedia.org/wiki/Bash_(Unix_shell)
2 Strictly speaking, Linux is not Unix, but rather a *nix or Unix-like operating system. However, it is essentially a free and
open-source re-implementation of Unix, and so can be seen as Unix for all practical purposes.
26
we have the name and email address of the person doing the committing. This information could be set on a
per commit basis. However, it is more common and easier to set this information as a global configuration
setting and then it will be used whenever the user performs a commit. This can be done using git config
as in the following code.
git config --global user.name 'Mark Andrews'
git config --global user.email '[email protected]'
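To check that these settings took effect, we can read a configuration value back with git config. The following is an illustrative sketch (assuming git is installed); it uses a throwaway HOME directory so that it does not touch your real ~/.gitconfig.

```shell
# Use a throwaway HOME so the real ~/.gitconfig is untouched.
export HOME=/tmp/git_config_demo
mkdir -p "$HOME"

# Set the name and email, as above.
git config --global user.name 'Mark Andrews'
git config --global user.email '[email protected]'

# Read a value back to confirm it was stored.
git config --global user.name
```

The final command prints the stored name, Mark Andrews.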
It is also recommended that we set the text editor that we will use for writing our commit messages,
something we will also see below. By default, the text editor is the vi or vim editor. These editors
are standard Unix editors. They are loved by some and loathed by others. To the uninitiated, these editors
are likely to seem difficult and probably annoying to use. In any case, it is certainly not necessary to use
them, as any text editor can be used instead. If a user already has a preference, they should use that. If not,
one recommended editor, which is open source and available across the Windows, MacOS, and Linux
platforms, is the Atom editor; see https://atom.io/. Atom can be set as the default editor to be used with
Git as follows.
git config --global core.editor "atom --wait"
Once set, this configuration information is stored in a dot file named .gitconfig in the user’s home directory.
Dot files are a standard Unix file system feature. They are simply files, or possibly even directories, that begin
with a dot. They are hidden by default in file listings, and are primarily intended to be used for configuration
information. The .gitconfig file can be edited at any time to change global configuration settings.
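For illustration, after running the configuration commands above, the .gitconfig file would contain entries along the following lines (a sketch; the exact layout can vary slightly across Git versions).

[user]
    name = Mark Andrews
    email = [email protected]
[core]
    editor = atom --wait

The file can be edited directly with any text editor, or modified with further git config --global commands.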
Initialized empty Git repository in /home/rstudio/Documents/project101/.git/
From this, we see that Git has initialized a repository in project101, and created a directory therein named
.git. We may now look at the contents of project101 again.
ls -1aF
./
../
.git/
We see now that there is a hidden or dot directory named .git. The presence of the .git subdirectory is a
necessary and sufficient condition for a directory to be a Git repository.
We can now run the git status command to see what is the current state of the repository.
On branch master
No commits yet
Let us now put two files in project101: an R script named script.R, which begins with the line
library(tidyverse), and a readme.md file containing the following line.

The project contains an R script, `script.R`, for processing the `Titanic` data set.
After we have put these files in project101, we can do a file listing as we did above.
ls -1aF
./
../
.git/
readme.md
script.R
We see that script.R and readme.md are there. Next, we can do a git status as above.
git status
On branch master
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
readme.md
script.R
nothing added to commit but untracked files present (use "git add" to track)
Much of the information here is as it was previously, but now we are told that there are two Untracked
files, namely readme.md and script.R. Files listed as untracked are present in the directory but are not yet
being tracked or managed by Git. As the output indicates, however, we can use the command
git add to get Git to track them, as we do in the following command.
git add readme.md script.R
Now, let us check the status of the repository again.
git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: readme.md
new file: script.R
We see from the output that there are two new files, readme.md and script.R, that can be committed. This
is an important point: the files are not yet committed to the repository; they are staged for committal.
When files are staged, they are in an intermediate area, somewhat like a departure lounge in an airport. To
commit them, we must run the git commit command as follows.

git commit
This will open your editor, and it will contain the following text exactly.
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# On branch master
#
# Initial commit
#
# Changes to be committed:
# new file: readme.md
# new file: script.R
#
In other words, the lines beginning with # provide information to you as you write your commit message, but
they will not be part of the message itself. You write your message above these # lines. It is conventional
and recommended3 that the first line of this message is no more than 50 characters long and is followed
by a blank line and then followed by more elaboration. The first line is treated as a subject line. Its first
character should be capitalized, and it should not end in a full-stop/period. It is also recommended that
this subject line be written in imperative tense and not past tense. For example, it is recommended that
we write something like “Add new function . . . ” rather than “Added new function . . . ”. Admittedly, this is
probably initially an unnatural way to write for most people. After the subject line, there must be a blank
line. Without it, some features of Git can be affected. Then, a more elaborate message can be written. Here,
the imperative tense is no longer necessary. In fact, it is not necessary to have a body at all, but it is highly
recommended to provide a message body and to use it to provide details about what the code being committed
does. It will be helpful for others, including your future self, to understand what was being added and why.
The body text is recommended to be wrapped at 72 characters per line. Most Git-aware editors, like Atom,
will indicate if the subject line is over 50 characters, or if there is not a blank line after the subject, and will
wrap the body at 72 characters.
In this example, because this commit is our first commit, our subject line should acknowledge that this is the
beginning of the project. As such, a message like the following can be used.
Initialize the repository
Let us assume, therefore, that this is what we have typed into the editor that opened after we ran the git
commit command. After we save this file and quit the editor, the commit is completed.
Before proceeding, we should note that it is also possible to write one-liner commit messages using the option
-m as follows.
git commit -m 'Do something to something'
Here, no editor opens and there is no body to the message. For simplicity, we will sometimes use this method
below. In general, however, this method is probably best avoided as it will, by necessity, lead to minimal
and probably poorly-thought-out commit messages.
We can now view the Git log of the repository with the following command.
git log
The output of git log would be something like this.
commit 203d4a176cc455e186460d0c32154fa160ad2003
Author: Mark Andrews <[email protected]>
Date: Sun Jul 17 09:38:43 2022 +0000

    Initialize the repository
The long hexadecimal string after the word commit is a fingerprint of the commit: not an arbitrarily assigned
identifier, but one that is calculated using a hashing algorithm from the contents of the commit. It can
therefore be used to uniquely identify the commit and to do a file integrity check of its contents.
Having performed the commit, we can now again check the status of the repository.
git status
On branch master
nothing to commit, working tree clean
As we can see, now the repository is in a clean state. All the original files have been committed. There are as
yet no new files in the directory, and no edits to the existing files yet either.
As an interim summary, thus far, we have seen a number of essential and regularly used Git commands.
git init Create a Git repository in the current working directory of the shell.
git status Report the current working status of the repository: specifically, whether there are untracked
files in the repository or files in the staging area that have not yet been committed.
git add <file> ... Add the files <file> ... to the staging area. The staging area is like the departure
lounge of an airport: the files therein are scheduled for committal, though they may also be taken back
out of the staging area.
git commit Commit the files in the staging area. This command opens an editor and a commit message is
entered there.
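The essential cycle just described can be sketched end to end in a throwaway directory, so that it does not touch any real project. This is an illustrative sketch: it assumes git is installed, supplies the author identity per command with -c rather than relying on a global .gitconfig, and uses the -m option purely for brevity.

```shell
# Work in a throwaway directory.
rm -rf /tmp/project101_demo
mkdir -p /tmp/project101_demo
cd /tmp/project101_demo

# Create the repository: this makes the .git/ subdirectory.
git init -q

# Create a new, as-yet untracked, file and stage it.
echo 'x <- rnorm(10)' > script.R
git add script.R

# Commit the staged file, supplying the identity per command.
git -c user.name='Mark Andrews' \
    -c user.email='[email protected]' \
    commit -q -m 'Initialize the repository'

# The log now shows a single commit.
git log --oneline
```

After this runs, git status reports a clean working tree, just as in the example above.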
In addition, let us edit the script.R file by changing the function sum to mean in the apply function.
Now, let us check the working state of the repository.
git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: script.R
Untracked files:
(use "git add <file>..." to include in what will be committed)
models.R
no changes added to commit (use "git add" and/or "git commit -a")
From this, we see that there is, as expected, one untracked file, models.R, and one modified file, script.R.
We may now add both of these files and then commit them both to the repository at the same time using a
single commit. Alternatively, we could add and commit them individually. Whether we proceed one way or
the other should be based on whether the commit is atomic, i.e. has one main unitary function or purpose.
In this example, because the new file and the change to script.R are not related to one another, we will
perform two separate commits. First, we will add the modified script.R to staging.
git add script.R
Before we proceed, let us check the working state of the repository.
git status
On branch master
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: script.R
Untracked files:
(use "git add <file>..." to include in what will be committed)
models.R
As we can see, script.R is now in the staging area, ready to be committed. We can now perform the commit.
git commit
This will bring up our editor with the following contents.
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# On branch master
# Changes to be committed:
# modified: script.R
#
# Untracked files:
# models.R
#
As before, we should write our commit message above the lines beginning with #, following the conventions
and recommendations for good Git commit messages. We will add the following text to the commit message in
the editor.
Change function used in the apply functional
The functional now calculates the mean number of men and
women who survived the Titanic, not the total number.
Then, as before, when we save and quit, the staged changes are committed. If we look at the logs, we will
now see the following.
git log
commit bdbf2b32ba526694945a0e2cda9bfa5fe2d6538b
Author: Mark Andrews <[email protected]>
Date: Sun Jul 17 10:38:43 2022 +0000
commit 203d4a176cc455e186460d0c32154fa160ad2003
Author: Mark Andrews <[email protected]>
Date: Sun Jul 17 09:38:43 2022 +0000
If we then check the working state of the repository, we will see the following.
git status
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
models.R
nothing added to commit but untracked files present (use "git add" to track)
As expected, we now see that there is just one untracked file, models.R, in the directory. We first add this
file to put it in the staging area.
git add models.R
Then, we commit it with the following message.
Add new lm script named `models.R`
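As an aside, when the commit message is a short single line like this one, the editor can be bypassed altogether by supplying the message on the command line with the -m flag. The following is a minimal sketch in a hypothetical throwaway repository; the names are illustrative, and the git config lines may be skipped if already configured globally.

```shell
# Using `git commit -m` to supply a one-line message directly,
# so that no editor is opened.
mkdir demo_m && cd demo_m && git init -q
git config user.name "Jane Doe"     # skip if configured globally
git config user.email "[email protected]"
echo 'M <- lm(y ~ x, data = df)' > models.R
git add models.R
git commit -m "Add new lm script named models.R"
```

For longer, multi-paragraph messages, opening the editor remains the better option.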
If we look at the logs again, we will see the new commit listed first.
git log
Author: Mark Andrews <[email protected]>
Date: Sun Jul 17 10:38:43 2022 +0000
commit bdbf2b32ba526694945a0e2cda9bfa5fe2d6538b
Author: Mark Andrews <[email protected]>
Date: Sun Jul 17 10:38:43 2022 +0000
commit 203d4a176cc455e186460d0c32154fa160ad2003
Author: Mark Andrews <[email protected]>
Date: Sun Jul 17 09:38:43 2022 +0000
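The full git log output shown above is fairly verbose. A compact alternative is git log --oneline, which displays one commit per line with an abbreviated hash. The following sketch illustrates this in a hypothetical throwaway repository; the file contents and messages are illustrative.

```shell
# `git log --oneline`: one commit per line, abbreviated hashes,
# newest commit first.
mkdir demo_log && cd demo_log && git init -q
git config user.name "Jane Doe"     # skip if configured globally
git config user.email "[email protected]"
echo 'apply(X, 2, sum)'  > script.R && git add script.R
git commit -q -m "Add script.R"
echo 'apply(X, 2, mean)' > script.R && git add script.R
git commit -q -m "Change function used in the apply functional"
git log --oneline   # two lines, newest commit first
```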
Using remote repositories
Figure 5: A screenshot of a GitHub page for creating a new repository.
When creating the repository, we should leave the initialization options at their default settings. Specifically, do not add a readme, or a licence, or a .gitignore file. As important as these are, we
can add them later using Git commands.
After we click the “Create repository” button, we are then brought to a page that provides some Git commands for
different situations. We already have an existing repository, so we want the code listed under the heading
. . . or push an existing repository from the command line, which is
git remote add origin [email protected]:mark-andrews/project101.git
git push -u origin master
There are two important commands here, so we will look at them individually. The first command adds
[email protected]:mark-andrews/project101.git as a remote host of the repository that we have created. Within
our local repository, this remote repository will be named origin. The name origin is the default name for
the remote repository from which other repositories are cloned. However, we do not have to use this name; in
fact, another name, such as github, might be more useful, especially if we have multiple remote repositories.
The second command pushes the contents of the master branch of the local repository to the remote repository
named origin. When you run this second command, you will be asked for your GitHub password, unless you
have set up passwordless GitHub authentication using SSH keys, which is very convenient if you use GitHub
frequently.
Remember that we must run these commands in the Git shell inside the local repository, which in this example is
the project101 repository that we have been using above. After we run them, if we then browse in a web browser
to https://github.com/mark-andrews/project101, we will see that our repository, with all of its history and
other vital information, is now hosted there.
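Once the -u flag has set the upstream in this way, subsequent pushes need no arguments at all. The sketch below uses a local bare repository in place of GitHub, so that it runs offline; all directory names are hypothetical, and the git config lines may be skipped if already configured globally.

```shell
# After `git push -u origin master`, later pushes are just `git push`.
# A local bare repository stands in for GitHub here.
git init -q --bare /tmp/fake_remote.git
mkdir demo_push && cd demo_push && git init -q
git config user.name "Jane Doe"
git config user.email "[email protected]"
echo 'x <- 1' > script.R && git add script.R && git commit -q -m "Add script.R"
git branch -M master                 # ensure the branch is named master
git remote add origin /tmp/fake_remote.git
git push -q -u origin master         # first push: -u sets the upstream
echo 'x <- 2' > script.R && git add script.R && git commit -q -m "Edit script.R"
git push -q                          # subsequent pushes: no arguments
```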
Cloning remotes
If you create a public GitHub repository, as we did above, anyone can clone your repository. For example,
in a macOS or Linux terminal, or in the Windows Git Bash shell, anyone can clone project101 by typing the
following command.
git clone [email protected]:mark-andrews/project101.git
Alternatively, they could use this version of the clone command.
git clone https://github.com/mark-andrews/project101.git
The difference between these two versions is simply the protocol that is being used: the first uses SSH, which requires a GitHub account with SSH keys set up, while the second uses HTTPS, which for a public repository works for anyone. Having cloned
project101, they will have access to everything that you pushed to GitHub, i.e. the master branch and all
the history and other vital features of the repository. They will be able to do everything you can do with
the repository: view the logs, make changes, commit changes, undo changes, roll back history, etc. All the
logs, commit messages, commit hashes, etc., will be identical in the clone and in the original. However, as is
probably obvious, whatever actions they take, they will not be able to affect your local repository in any
way. Moreover, they will not be able to affect your remote repository on GitHub either. They would only be
able to push to your GitHub repository if they had a GitHub account and you explicitly gave them push
permission.
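To see that a clone really does carry the full history, the sketch below clones from a local path rather than from GitHub, so that it runs offline; the mechanics of git clone are the same in both cases, and all directory names here are hypothetical.

```shell
# A clone contains the complete history of the original repository.
# A local bare repository stands in for GitHub here.
git init -q --bare /tmp/project101_demo.git
mkdir demo_orig && cd demo_orig && git init -q
git config user.name "Jane Doe"     # skip if configured globally
git config user.email "[email protected]"
echo 'x <- 1' > script.R && git add script.R && git commit -q -m "Add script.R"
git push -q /tmp/project101_demo.git HEAD:master
cd ..
git clone -q /tmp/project101_demo.git project101_copy
git -C project101_copy log --oneline   # same commits, same hashes
```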
References
Chacon, Scott, and Ben Straub. 2014. Pro Git. 2nd ed. Apress.
Fecher, Benedikt, Sascha Friesike, and Marcel Hebing. 2015. “What Drives Academic Data Sharing?” PloS
One 10 (2): e0118053.
Gorgolewski, Krzysztof J, and Russell A Poldrack. 2016. “A Practical Guide for Improving Transparency
and Reproducibility in Neuroimaging Research.” PLoS Biology 14 (7): e1002506.
Grätzer, George. 2016. Math into LaTeX. 5th ed. Springer Science & Business Media.
Gruber, John. 2004. “Markdown.” https://daringfireball.net/projects/markdown/.
Hern, Alex. 2013. “Is Excel the Most Dangerous Piece of Software in the World?” New Statesman.
https://www.newstatesman.com/technology/2013/02/excel-most-dangerous-piece-software-world.
Houtkoop, Bobby Lee, Chris Chambers, Malcolm Macleod, Dorothy VM Bishop, Thomas E Nichols, and
Eric-Jan Wagenmakers. 2018. “Data Sharing in Psychology: A Survey on Barriers and Preconditions.”
Advances in Methods and Practices in Psychological Science 1 (1): 70–85.
Ioannidis, John PA. 2015. “How to Make More Published Research True.” Revista Cubana de Información
En Ciencias de La Salud (ACIMED) 26 (2): 187–200.
Iqbal, Shareen A, Joshua D Wallach, Muin J Khoury, Sheri D Schully, and John PA Ioannidis. 2016.
“Reproducible Research Practices and Transparency Across the Biomedical Literature.” PLoS Biology 14 (1):
e1002333.
Knuth, Donald Ervin. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111.
Landau, William Michael. 2018. “The Drake R Package: A Pipeline Toolkit for Reproducibility and
High-Performance Computing.” Journal of Open Source Software 3 (21). https://doi.org/10.21105/joss.00550.
MacFarlane, John. 2006. “Pandoc: A Universal Document Converter.” https://pandoc.org/.
Merton, Robert K. 1973. The Sociology of Science: Theoretical and Empirical Investigations. University of
Chicago Press.
Munafò, Marcus R, Brian A Nosek, Dorothy VM Bishop, Katherine S Button, Christopher D Chambers,
Nathalie Percie Du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J Ware, and John PA Ioannidis.
2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1): 0021.
Nosek, Brian A, George Alter, George C Banks, Denny Borsboom, Sara D Bowman, Steven J Breckler, Stuart
Buck, et al. 2015. “Promoting an Open Research Culture.” Science 348 (6242): 1422–5.
Shamir, Lior, John F Wallin, Alice Allen, Bruce Berriman, Peter Teuben, Robert J Nemiroff, Jessica
Mink, Robert J Hanisch, and Kimberly DuPrie. 2013. “Practices in Source Code Sharing in Astrophysics.”
Astronomy and Computing 1: 54–58.
Stodden, Victoria, Peixuan Guo, and Zhaokun Ma. 2013. “Toward Reproducible Computational Research:
An Empirical Analysis of Data and Code Policy Adoption by Journals.” PloS One 8 (6): e67111.
Tenopir, Carol, Suzie Allard, Kimberly Douglass, Arsev Umur Aydinoglu, Lei Wu, Eleanor Read, Maribeth
Manoff, and Mike Frame. 2011. “Data Sharing by Scientists: Practices and Perceptions.” PloS One 6 (6):
e21101.
Xie, Yihui. 2017. Dynamic Documents with R and Knitr. Chapman & Hall/CRC.