An Introduction To Bayesian Data Analysis
2023-02-18
Preface #
This book is intended to be a relatively gentle introduction to carrying out Bayesian data
analysis and cognitive modeling using the probabilistic programming language Stan
(Carpenter et al. 2017), and the front-end to Stan called brms (Bürkner 2019). Our target
audience is cognitive scientists (e.g., linguists and psychologists) who carry out planned
behavioral experiments, and who are interested in learning the Bayesian data analysis
methodology from the ground up and in a principled manner. Our aim is to make Bayesian
statistics a standard part of the data analysis toolkit for experimental linguistics,
psycholinguistics, psychology, and related disciplines.
Many excellent introductory textbooks already exist for Bayesian data analysis. Why write yet
another book? Our text is different from other attempts in two respects. First, our main focus is
on showing how to analyze data from planned experiments involving repeated measures; this
type of experimental data involves unique complexities. We provide many examples of data
sets involving time measurements (e.g., self-paced reading, eye-tracking-while-reading, voice
onset time), event-related potentials, pupil sizes, accuracies (e.g., recall tasks, yes-no
questions), categorical answers (e.g., picture naming), choice-reaction time (e.g, Stroop task,
motion detection task), etc. Second, from the very outset, we stress a particular workflow that
has as its centerpiece simulating data; we aim to teach a philosophy that involves thinking
hard about the assumed underlying generative process, even before the data are collected.
The data analysis approach that we hope to teach through this book involves a cycle of prior
predictive and posterior predictive checks, and model validation using simulated data. We try
to inculcate a sense of how inferences can be drawn from the posterior distribution of
theoretically interesting parameters without resorting to binary decisions like “significant” or
“not-significant”. We are hopeful that this will set a new standard for reporting and interpreting
results of data analyses in a more nuanced manner, and lead to more measured claims in the
published literature.
The target audience for this book is students and researchers who want to treat statistics as
an equal partner in their scientific work. We expect that the reader is willing to take the time to
both understand and run the computational analyses.
Any rigorous introduction to Bayesian data analysis requires at least a passive knowledge of
probability theory, calculus, and linear algebra. However, we do not require that the reader have
this background when they start the book. Instead, the relevant ideas are introduced informally
and just in time, as soon as they are needed. The reader is never required to have an active
ability to solve probability problems, to solve integrals or compute derivatives, or to carry out
relatively complex matrix computations (such as inverting matrices) by hand.
What we do expect is familiarity with arithmetic, basic set theory and elementary probability
theory (e.g., sum and product rules, conditional probability), simple matrix operations like
addition and multiplication, and simple algebraic operations. A quick look through chapter 1 of
Gill (2006) before starting this book is highly recommended. We also presuppose that, when
the need arises, the reader is willing to look up concepts that they might have forgotten
(e.g., logarithms).
We also expect that the reader already knows and/or is willing to learn enough of the
programming language R (R Core Team 2019) to reproduce the examples presented and to
carry out the exercises. If the reader is completely unfamiliar with R, before starting this book
they should first consult books like R for data science, and Efficient R programming.
We also assume that the reader has encountered simple linear modeling, and linear mixed
models (Bates, Mächler, et al. 2015a; Baayen, Davidson, and Bates 2008). What this means
in practice is that the reader should have used the lm() and lmer() functions in R. A
passing acquaintance with basic statistical concepts, like the correlation between two
variables, is also taken for granted.
This book is not appropriate for complete beginners to data analysis. Newcomers to data
analysis should start with a freely available textbook like Kerns (2014), and then read our
introduction to frequentist data analysis, which is also available freely online (Vasishth et al.
2021). This latter book will prepare the reader well for the material presented here.
One very important characteristic that the reader should bring to this book is a can-do spirit.
There will be many places where the going will get tough, and the reader will have to play
around with the material, or refresh their understanding of arithmetic or middle-school algebra.
The basic principles of such a can-do spirit are nicely summarized in the book by Burger and
Starbird (2012); also see Levy (2021). Although we cannot summarize the insights from these
books in a few words, inspired by the Burger and Starbird (2012) book, here is a short
enumeration of the kind of mindset the reader will need to cultivate:
Spend time on the basic, apparently easy material; make sure you understand it deeply.
Look for gaps in your understanding. Reading different presentations of the same material
(in different books or articles) can yield new insights.
Let mistakes and errors be your teacher. We instinctively recoil from our mistakes, but
errors are ultimately our friends; they have the potential to teach us more than our correct
answers can. In this sense, a correct solution can be less interesting than an incorrect
one.
When you are intimidated by some exercise or problem, give up and admit defeat
immediately. This relaxes the mind; you’ve already given up, there’s nothing more to do.
Then, after a while, try to solve a simpler version of the problem. Sometimes, it is useful
to break the problem down to smaller parts, each of which may be easier to solve.
Create your own questions. Don’t wait to be asked questions; develop your own problems
and then try to solve them.
Don’t expect to understand everything in the first pass. Just mentally note the gaps in
your understanding, and return to them later and work on these gaps.
Step back periodically to try to sketch out a broader picture of what you are learning.
Writing down what you know, without looking up anything, is one helpful way to achieve
this. Don’t wait for the teacher to give you bullet-point summaries of what you should have
learned; develop such summaries yourself.
Develop the art of finding information. When confronted with something you don’t know, or
with some obscure error message, use google to find some answers.
As instructors, we have noticed over the years that students with such a mindset generally do
very well. Some students already have that spirit, but others need to explicitly develop it. We
firmly believe that everyone can develop such a mindset; but one may have to work on
acquiring it.
In any case, such an attitude is absolutely necessary for a book of this sort.
The chapters in this book are intended to be read in sequence, but during the first pass
through the book, the reader should feel free to completely skip the boxes. These boxes
provide a more formal development (useful to transition to more advanced textbooks like
Gelman et al. 2014), or deal with tangential aspects of the topics presented in the chapter.
Here are some suggested paths through this book, depending on the reader’s goals:
For a short course for complete beginners, read chapters 1 to 5. We usually cover these
five chapters in a five-day summer school course that we teach annually. Most of the
material in these chapters is also covered in a free four-week course available online:
https://open.hpi.de/courses/bayesian-statistics2023.
For a course that focuses on regression models with the R package brms , read chapters
1 to 9 and, optionally, 15.
For an advanced course that focuses on complex models involving Stan, read chapters
10 to 20.
Some conventions used in this book:
All distribution names are lower-case unless they are also a proper name (e.g., Poisson,
Bernoulli).
The univariate normal distribution is parameterized by the mean and standard deviation
(not variance).
The code for figures is provided only in some cases, where we consider it to be
pedagogically useful. In other cases, the code remains hidden, but it can be found in the
web version of the book. Notice that all the R code from the book can be extracted from
the Rmd source files for each chapter, which are released with the book.
Online materials
The entire book, including all data and source code, is available online for free on
https://vasishth.github.io/bayescogsci/book. The solutions to exercises will be made available
on request.
Software needed
R and RStudio, or any other Integrated Development Environment that you prefer, such
as Visual Studio Code and Emacs Speaks Statistics.
The R package rstan . At the time of writing this book, the CRAN version of rstan lags
behind the latest developments in Stan, so it is recommended to install rstan from
https://mc-stan.org/r-packages/ as indicated in
https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started.
The R packages dplyr , purrr , tidyr , extraDistr , brms , hypr and lme4 are
used in many chapters of the book and can be installed the usual way:
install.packages(c("dplyr","purrr","tidyr", "extraDistr", "brms","hypr","lme4")) .
The following R packages are optional: tictoc , rootSolve , SHELF , cmdstanr , and
SBC .
Some packages, such as intoo , barsurf , bivariate , SIN , and rethinking , may
require manual installation from archived or GitHub versions.
The data and Stan models used in this book can be installed using
remotes::install_github("bnicenboim/bcogsci") . This command uses the function
install_github from the package remotes . (Thus this package should be in the system
as well.)
In every R session, load these packages, and set the options shown below for Stan.
library(MASS)
## be careful to load dplyr after MASS
library(dplyr)
library(tidyr)
library(purrr)
library(extraDistr)
library(ggplot2)
library(loo)
library(bridgesampling)
library(brms)
library(bayesplot)
library(tictoc)
library(hypr)
library(bcogsci)
library(lme4)
library(rstan)
library(rootSolve)
We are grateful to the many generations of students at the University of Potsdam, various
summer schools at ESSLLI, the LOT winter school, other short courses we have taught at
various institutions, and the annual summer school on Statistical Methods for Linguistics and
Psychology (SMLP) held annually at Potsdam, Germany. The participants in these courses
helped us considerably in improving the material presented here. A special thanks to Anna
Laurinavichyute, Paula Lissón, and Himanshu Yadav for co-teaching the Bayesian courses
at SMLP. We are also grateful to members of Vasishth lab, especially Dorothea Pregla, for
comments on earlier drafts of this book. We would also like to thank Christian Robert
(otherwise known as Xi’an), Robin Ryder, Nicolas Chopin, Michael Betancourt, Andrew
Gelman, and the Stan developers (especially Bob Carpenter and Paul-Christian Bürkner) for
their advice; Pavel Logačev for his feedback; and Athanassios Protopapas, Patricia
Mirabile, Masataka Ogawa, Alex Swiderski, Andrew Ellis, Jakub Szewczyk, Chi Hou Pau, Alec
Shaw, Patrick Wen, Riccardo Fusaroli, Abdulrahman Dallak, Elizabeth Pankratz, Jean-Pierre
Haeberly, Chris Hammill, Florian Wickelmaier, Ole Seeth, Jules Bouton, Siqi Zheng, Michael
Gaunt, Benjamin Senst, Chris Moreh, Richard Hatcher, and Noelia Stetie for catching typos,
unclear passages, and errors in the book. Thanks also go to Jeremy Oakley and other
statisticians at the School of Mathematics and Statistics, University of Sheffield, UK, for helpful
discussions, and ideas for exercises that were inspired by the MSc program taught online at
Sheffield.
This book would have been impossible to write without the following software: R (Version
4.2.2; R Core Team 2019) and the R-packages afex (Singmann et al. 2020), barsurf (Version
0.7.0; Spurdle 2020a), bayesplot (Version 1.9.0; Gabry and Mahr 2019), bcogsci (Version
0.0.0.9000; Nicenboim, Schad, and Vasishth 2020), bibtex (Version 0.5.0; Francois 2017),
bivariate (Version 0.7.0; Spurdle 2020b), bookdown (Version 0.28; Xie 2019a), bridgesampling
(Version 1.1.2; Gronau, Singmann, and Wagenmakers 2020), brms (Version 2.17.0; Bürkner
2019), citr (Aust 2019), cmdstanr (Version 0.5.3; Gabry and Češnovar 2021), cowplot (Version
1.1.1; Wilke 2020), digest (Version 0.6.31; Antoine Lucas et al. 2021), dplyr (Version 1.1.0;
Wickham, François, et al. 2019), DT (Version 0.24; Xie, Cheng, and Tan 2019), extraDistr
(Version 1.9.1; Wolodzko 2019), forcats (Version 1.0.0; Wickham 2019a), gdtools (Gohel et al.
2019), ggplot2 (Version 3.4.0; Wickham, Chang, et al. 2019), gridExtra (Version 2.3; Auguie
2017), htmlwidgets (Version 1.5.4; Vaidyanathan et al. 2018), hypr (Version 0.2.3; Schad et al.
2019; Rabe, Vasishth, Hohenstein, Kliegl, and Schad 2020a), intoo (Version 0.4.0; Spurdle
and Bode 2020), kableExtra (Version 1.3.4; Zhu 2019), knitr (Version 1.42; Xie 2019b), lme4
(Version 1.1.31; Bates, Mächler, et al. 2015b), loo (Version 2.5.1; Vehtari, Gelman, and Gabry
2017a; Yao et al. 2017), MASS (Version 7.3.58.2; Ripley 2019), Matrix (Version 1.5.3; Bates
and Maechler 2019), miniUI (Version 0.1.1.1; Cheng 2018), papaja (Version 0.1.1; Aust and
Barth 2020), pdftools (Version 3.3.2; Ooms 2021), purrr (Version 1.0.1; Henry and Wickham
2019), Rcpp (Version 1.0.10; Eddelbuettel et al. 2019), readr (Version 2.1.4; Wickham, Hester,
and Francois 2018), RefManageR (Version 1.4.0; McLean 2017), remotes (Version 2.4.2;
Hester et al. 2021), rethinking (Version 2.21; McElreath 2021), rmarkdown (Version 2.20;
Allaire et al. 2019), rootSolve (Version 1.8.2.3; Soetaert and Herman 2009), rstan (Version
2.26.13; Guo, Gabry, and Goodrich 2019), SBC (Version 0.1.1.9000; Kim et al. 2022), servr
(Version 0.24; Xie 2019c), SHELF (Version 1.8.0; Oakley 2021), SIN (Version 0.6; Drton
2013), StanHeaders (Version 2.26.13; Goodrich et al. 2019), stringr (Version 1.5.0; Wickham
2019b), texPreview (Sidi and Polhamus 2020), tibble (Version 3.1.8; Müller and Wickham
2020), tictoc (Version 1.0.1; Izrailev 2014), tidyr (Version 1.2.1; Wickham and Henry 2019),
tidyverse (Version 1.3.2; Wickham, Averick, et al. 2019), tinylabels (Version 0.2.3; Barth 2022),
and webshot (Version 0.5.3; Chang 2018).
Bruno Nicenboim (Tilburg, The Netherlands), Daniel Schad (Potsdam, Germany), Shravan
Vasishth (Potsdam, Germany)
References
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley
Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2019. rmarkdown: Dynamic
Documents for R. https://CRAN.R-project.org/package=rmarkdown.
Eddelbuettel, Dirk, with contributions by Antoine Lucas, Jarek Tuszynski, Henrik Bengtsson,
Simon Urbanek, Mario Frasca, Bryan Lewis, Murray Stokely, et al. 2021. digest: Create
Compact Hash Digests of R Objects. https://CRAN.R-project.org/package=digest.
Aust, Frederik. 2019. citr: RStudio Add-in to Insert Markdown Citations. https://CRAN.R-
project.org/package=citr.
Aust, Frederik, and Marius Barth. 2020. papaja: Create APA Manuscripts with R Markdown.
https://github.com/crsh/papaja.
Baayen, R Harald, Douglas J Davidson, and Douglas M Bates. 2008. “Mixed-Effects Modeling
with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language 59
(4). Elsevier: 390–412.
Barth, Marius. 2022. tinylabels: Lightweight Variable Labels. https://cran.r-
project.org/package=tinylabels.
Bates, Douglas M, and Martin Maechler. 2019. Matrix: Sparse and Dense Matrix Classes and
Methods. https://CRAN.R-project.org/package=Matrix.
Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015a. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.
Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.
Burger, Edward B, and Michael Starbird. 2012. The 5 Elements of Effective Thinking.
Princeton University Press.
Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael J.
Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A
Probabilistic Programming Language.” Journal of Statistical Software 76 (1). Columbia Univ.,
New York, NY (United States); Harvard Univ., Cambridge, MA (United States).
Cheng, Joe. 2018. miniUI: Shiny Ui Widgets for Small Screens. https://CRAN.R-
project.org/package=miniUI.
Drton, Mathias. 2013. SIN: A Sinful Approach to Selection of Gaussian Graphical Markov
Models. https://CRAN.R-project.org/package=SIN.
Eddelbuettel, Dirk, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell,
Douglas M Bates, and John Chambers. 2019. Rcpp: Seamless R and C++ Integration.
https://CRAN.R-project.org/package=Rcpp.
Gabry, Jonah, and Tristan Mahr. 2019. bayesplot: Plotting for Bayesian Models.
https://CRAN.R-project.org/package=bayesplot.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B.
Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC
Press.
Gill, Jeff. 2006. Essential Mathematics for Political and Social Research. Cambridge University
Press Cambridge.
Gohel, David, Hadley Wickham, Lionel Henry, and Jeroen Ooms. 2019. gdtools: Utilities for
Graphical Rendering. https://CRAN.R-project.org/package=gdtools.
Goodrich, Ben, Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Michael
Betancourt, Marcus Brubaker, et al. 2019. StanHeaders: C++ Header Files for Stan.
https://CRAN.R-project.org/package=StanHeaders.
Gronau, Quentin F., Henrik Singmann, and Eric-Jan Wagenmakers. 2017. “Bridgesampling: An
R Package for Estimating Normalizing Constants.” Arxiv. http://arxiv.org/abs/1710.08162.
Guo, Jiqiang, Jonah Gabry, and Ben Goodrich. 2019. rstan: R Interface to Stan.
https://CRAN.R-project.org/package=rstan.
Henry, Lionel, and Hadley Wickham. 2019. purrr: Functional Programming Tools.
https://CRAN.R-project.org/package=purrr.
Hester, Jim, Gábor Csárdi, Hadley Wickham, Winston Chang, Martin Morgan, and Dan
Tenenbaum. 2021. remotes: R Package Installation from Remote Repositories, Including
’Github’. https://CRAN.R-project.org/package=remotes.
Izrailev, Sergei. 2014. tictoc: Functions for Timing R Scripts, as Well as Implementations of
Stack and List Structures. https://CRAN.R-project.org/package=tictoc.
Kerns, G.J. 2014. Introduction to Probability and Statistics Using R. Second Edition.
Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2022. SBC: Simulation
Based Calibration for Rstan/Cmdstanr Models.
Levy, Dan. 2021. Maxims for Thinking Analytically: The Wisdom of Legendary Harvard
Professor Richard Zeckhauser. Dan Levy.
McLean, Mathew William. 2017. “RefManageR: Import and Manage Bibtex and Biblatex
References in R.” The Journal of Open Source Software. https://doi.org/10.21105/joss.00338.
Müller, Kirill, and Hadley Wickham. 2020. tibble: Simple Data Frames. https://CRAN.R-
project.org/package=tibble.
Nicenboim, Bruno, Daniel J. Schad, and Shravan Vasishth. 2020. bcogsci: Data and Models
for the Book “an Introduction to Bayesian Data Analysis for Cognitive Science”.
Oakley, Jeremy. 2021. SHELF: Tools to Support the Sheffield Elicitation Framework.
https://CRAN.R-project.org/package=SHELF.
Ooms, Jeroen. 2021. pdftools: Text Extraction, Rendering and Converting of Pdf Documents.
https://CRAN.R-project.org/package=pdftools.
Rabe, Maximilian M., Shravan Vasishth, Sven Hohenstein, Reinhold Kliegl, and Daniel J.
Schad. 2020a. “Hypr: An R Package for Hypothesis-Driven Contrast Coding.” The Journal of
Open Source Software. https://doi.org/10.21105/joss.02134.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna,
Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ripley, Brian. 2019. MASS: Support Functions and Datasets for Venables and Ripley’s Mass.
https://CRAN.R-project.org/package=MASS.
Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to
Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and
Language 110. https://doi.org/10.1016/j.jml.2019.104038.
Sidi, Jonathan, and Daniel Polhamus. 2020. texPreview: Compile and Preview Snippets of
“Latex”. https://CRAN.R-project.org/package=texPreview.
Singmann, Henrik, Ben Bolker, Jake Westfall, Frederik Aust, and Mattan S. Ben-Shachar.
2020. afex: Analysis of Factorial Experiments. https://CRAN.R-project.org/package=afex.
Soetaert, Karline, and Peter M.J. Herman. 2009. A Practical Guide to Ecological Modelling.
Using R as a Simulation Platform. Springer.
Spurdle, Abby. 2020a. Barsurf: Heatmap-Related Plots and Smooth Multiband Color
Interpolation. https://CRAN.R-project.org/package=barsurf.
Spurdle, Abby, and Emil Bode. 2020. Intoo: Minimal Language-Like Extensions.
https://CRAN.R-project.org/package=intoo.
Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, and Kenton Russell. 2018.
htmlwidgets: HTML Widgets for R. https://CRAN.R-project.org/package=htmlwidgets.
Vasishth, Shravan, Daniel J. Schad, Audrey Bürki, and Reinhold Kliegl. 2021. Linear Mixed
Models for Linguistics and Psychology: A Comprehensive Introduction. CRC Press.
https://vasishth.github.io/Freq_CogSci/.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017a. “Practical Bayesian Model Evaluation
Using Leave-One-Out Cross-Validation and Waic.” Statistics and Computing 27 (5): 1413–32.
https://doi.org/10.1007/s11222-016-9696-4.
Wickham, Hadley. 2019a. forcats: Tools for Working with Categorical Variables (Factors).
https://CRAN.R-project.org/package=forcats.
Wickham, Hadley. 2019b. stringr: Simple, Consistent Wrappers for Common String
Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan,
Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open
Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi,
Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. ggplot2: Create Elegant Data Visualisations
Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. dplyr: A Grammar of
Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Lionel Henry. 2019. tidyr: Tidy Messy Data. https://CRAN.R-
project.org/package=tidyr.
Wickham, Hadley, Jim Hester, and Romain Francois. 2018. readr: Read Rectangular Text
Data. https://CRAN.R-project.org/package=readr.
Wilke, Claus O. 2020. cowplot: Streamlined Plot Theme and Plot Annotations for ’Ggplot2’.
https://CRAN.R-project.org/package=cowplot.
Xie, Yihui. 2019a. bookdown: Authoring Books and Technical Documents with R Markdown.
https://CRAN.R-project.org/package=bookdown.
Xie, Yihui. 2019b. knitr: A General-Purpose Package for Dynamic Report Generation in R.
https://CRAN.R-project.org/package=knitr.
Xie, Yihui. 2019c. servr: A Simple Http Server to Serve Static Files or Dynamic Documents.
https://CRAN.R-project.org/package=servr.
Xie, Yihui, Joe Cheng, and Xianying Tan. 2019. DT: A Wrapper of the Javascript Library
’Datatables’. https://CRAN.R-project.org/package=DT.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2017. “Using Stacking to
Average Bayesian Predictive Distributions.” Bayesian Analysis. https://doi.org/10.1214/17-
BA1091.
Zhu, Hao. 2019. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.
https://CRAN.R-project.org/package=kableExtra.
Chapter 1 Introduction
The central idea we will explore in this book is this: given some data, how can we use Bayes’
theorem to quantify our uncertainty about a scientific question of interest? Before we get into
the details of the underlying theory and its application, some familiarity with the following
topics needs to be in place: the basic concepts behind probability, random variables,
probability distributions, and likelihood. We therefore turn to these topics first.
1.1 Probability
Informally, we all understand what the term probability means. We routinely talk about things
like the probability of it raining today. However, there are two distinct ways to think about
probability. One can think of the probability of an event with reference to the frequency with
which it might occur in repeated observations. Such a conception of probability is easy to
imagine in cases where an event can, at least in principle, occur repeatedly. An example
would be obtaining a 6 when tossing a die again and again. However, this frequentist view of
probability is difficult to justify when talking about certain one-of-a-kind events, such as
earthquakes. In such situations, probability expresses our uncertainty about the event
happening. Moreover, we could even be uncertain about exactly how probable the event in
question is; for example, we might say something like “I am 90% certain that the probability of
an earthquake happening in the next year is between 10 and 40%”. In this book, we will be
particularly interested in quantifying uncertainty in this way: we will always want to know how
unsure we are of the estimate we are interested in.
Both the frequency-based and the uncertain-belief perspective have their place in statistical
inference, and depending on the situation, we are going to rely on both ways of thinking.
Regardless of these differences in perspective, the probability of an event happening is
defined to be constrained in the following way. The statements below are not formal
statements of the axioms of probability theory; for more details (and more precise
formulations), see Ross (2002) or Kolmogorov (1933).
The probability of an event must lie between 0 and 1, where 0 means that the event is
impossible and cannot happen, and 1 means that the event is certain to happen.
For any two mutually exclusive events, the probability that one or the other occurs is the
sum of their individual probabilities.
Two events are independent if and only if the probability of both events happening is
equal to the product of the probabilities of each event happening.
The probabilities of all possible events in the entire sample space must sum up to 1.
The above definitions are based on the axiomatic definition of probability by Kolmogorov
(1933).
In the context of data analysis, we will talk about probability in the following way. Consider
some data that we might have collected. This could be discrete 0,1 responses in a question-
response accuracy task, or continuous measurements of reading times in milliseconds from an
eyetracking study, etc. In any such cases, we will say that the data are being generated from a
random variable, which we will designate with a capital letter such as Y .1
The actually observed data will be distinguished from the random variable that generated it by
using lower case y. We can call y an instance of Y ; every new set of data will be slightly
different due to random variability.
We can summarize the above informal concepts relating to random variables very compactly if
we re-state them in mathematical form. A mathematical statement has the advantage not only
of brevity but also of reducing ambiguity.
Somewhat more formally (following Blitzstein and Hwang 2014), we define a random variable
Y as a function from a sample space of possible outcomes S to the real number system:2
Y : S → R
The random variable associates to each outcome ω ∈ S exactly one number Y(ω) = y.
Suppose that y represents the possible values that the random variable generates; these
values belong to the support of Y: y ∈ S_Y.
Suppose, for example, that Y represents the number of correct responses in 10 trials of a
question-response accuracy task. This random variable will be assumed to have a parameter
θ that represents the probability of producing a correct response. In statistics, given some
observed data, typically our goal is to obtain an estimate of this parameter’s true (unknown)
value.
This discrete random variable Y has associated with it a function called a probability mass
function or PMF. This function, which is written p(y), gives us the probability of obtaining each
of the 11 possible outcomes (from 0 correct responses to 10). We are using lower-case p(⋅)
here, and it is distinct from P(⋅), which we will use to talk about probabilities.
We will write that this PMF p(y) depends on, or is conditional on, a particular fixed but
unknown value for θ; the PMF will be written p(y|θ) .3
In frequentist approaches to data analysis, only the observed data y are used to draw
inferences about θ. A typical question that we ask in the frequentist paradigm is: does θ have
a particular value θ0 ? One can obtain estimates of the unknown value of θ from the observed
data y, and then draw inferences about how different–or more precisely how far away–this
estimate is from the hypothesized θ0 . This is the essence of null hypothesis significance
testing. The conclusions from such a procedure are framed in terms of either rejecting the
hypothesis that θ has value θ0 , or failing to reject this hypothesis. Here, rejecting the null
hypothesis is the primary goal of the statistical hypothesis test.
Bayesian data analysis begins with a different question. What is common to the frequentist
paradigm is the assumption that the data are generated from a random variable Y and that
there is a function p(y|θ) that depends on the parameter θ. Where the Bayesian approach
diverges from the frequentist one is that an important goal is to express our uncertainty about
θ . In other words, we treat the parameter θ itself as a random variable, which means that we
assign a probability distribution p(θ) to this random variable. This distribution p(θ) is called
the prior distribution on θ; such a distribution could express our belief about the probability of
correct responses, before we observe the data y.
In later chapters, we will spend some time trying to understand how such a prior distribution
can be defined for a range of different research problems.
Given such a prior distribution and some data y, the end-product of a Bayesian data analysis
is what is called the posterior distribution of the parameter (or parameters) given the data:
p(θ|y) . This posterior distribution is the probability distribution of θ after conditioning on y, i.e.,
after the data has been observed and is therefore known. All our statistical inference is based
on this posterior distribution of θ; we can even carry out hypothesis tests that are analogous
(but not identical) to the likelihood ratio based frequentist hypothesis tests.
We already mentioned conditional probability above when discussing the probability of the
data given some parameter θ, which we wrote as the PMF p(y|θ) . Conditional probability is an
important concept in Bayesian data analysis, not least because it allows us to derive Bayes’
theorem. Let’s look at the definition of conditional probability next.
1.2 Conditional probability
Suppose that A stands for some discrete event; an example would be “the streets are wet.”
Suppose also that B stands for some other discrete event; an example is “it has been raining.”
We can talk about the probability of the streets being wet given that it has been raining; or
more generally, the probability of A given that B has happened.
This kind of statement is written as P rob(A|B) or more simply P (A|B) . This is the
conditional probability of event A given B . Conditional probability is defined as follows.
P(A|B) = \frac{P(A, B)}{P(B)}, \quad \text{where } P(B) > 0
The conditional probability of A given B is thus defined to be the joint probability of A and B,
divided by the probability of B. We can rearrange the above equation so that we can talk about
the joint probability of both events A and B happening. This joint probability can be computed
by first taking P(B), the probability that event B (it has been raining) happens, and multiplying
this by the probability that A happens conditional on B , i.e., the probability that the streets are
wet given it has been raining. This multiplication will give us P (A, B) , the joint probability of A
and B , i.e., that it has been raining and that the streets are wet. We will write the above
description as: P (A, B) = P (A|B)P (B) .
Now, since the probability of A and B happening is the same as the probability of B and A
happening, i.e., since P(B, A) = P(A, B), we can equate the two expansions,
P(A|B)P(B) = P(B|A)P(A); dividing both sides by P(B) yields:

P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}
The above statement is Bayes’ rule, and is the basis for all the statistical inference we will do
in this book.
1.3 The law of total probability
Related to the above discussion of conditional probability is the law of total probability.
Suppose that we have n distinct events A_1, …, A_n that are pairwise disjoint and that
together make up the entire sample space S; see Figure 1.1. Then P(B), the probability of an
event B, will be the sum of the probabilities P(B ∩ A_i), i.e., the sum of the joint probabilities
of B and each A_i occurring (the symbol ∩ is the “and” operator used in set theory).
Formally:

P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)
Thus, the probability of B is the sum of the conditional probabilities P (B|Ai ) weighted by the
probability P (Ai ) . We will see the law of total probability in action below when we talk about
marginal likelihood.
To make the discussion concrete, we will use an example of a discrete random variable, the
binomial. After discussing this discrete random variable, we present another example, this
time involving a continuous random variable, the normal random variable.
The binomial and normal cases serve as the canonical examples that we will need in the initial
stages of this book. We will introduce other random variables as needed: in particular, we will
need the uniform and beta distributions. In other textbooks, you will encounter distributions
like the Poisson, the gamma, and the exponential. The most commonly used distributions and
their properties are discussed in most textbooks on statistics (see Further Readings at the end
of this chapter).
Suppose that our research goal is to estimate the probability, call it θ, of the word “umbrella”
being produced as the completion of a sentence like “It’s raining, I’m going to take the …”,
versus any other word. If the sentence is completed with the word “umbrella”, we will refer to
it as a success; any other completion will be referred to as a failure.
This is an example of a binomial random variable: given n trials, there can be only two
possible outcomes in each trial, a success or a failure, and there is some true unknown
probability θ of success that we want to estimate. When the number of trials is one (n = 1), the
random variable is said to have a Bernoulli distribution. So the Bernoulli distribution is the binomial
distribution with n = 1 .
One way to empirically estimate this probability of success is to carry out a cloze task. In a
cloze task, subjects are asked to complete a fragment of the original sentence, such as “It’s
raining, I’m going to take the …”. The predictability or cloze probability of “umbrella” is then
calculated as the proportion of times that the target word “umbrella” was produced as an
answer by subjects.
Assume for simplicity that 10 subjects are asked to complete the above sentence; each
subject does this task only once. This gives us independent responses from 10 trials that are
either coded a success (“umbrella” was produced) or as a failure (some other word was
produced). We can sum up the number of successes to calculate how many of the 10 trials
had “umbrella” as a response. For example, if 8 instances of “umbrella” are produced in 10
trials, we would estimate the cloze probability of producing “umbrella” to be 8/10. We can
simulate such an experiment repeatedly in R; we will get a different number of successes
each time.
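The code hidden here is presumably a call to rbinom ; we assume 20 simulated experiments
of 10 trials each, with θ = 0.5 (a value consistent with the output below):

rbinom(20, size = 10, prob = 0.5)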
## [1] 7 3 4 7 5 2 5 5 5 4 5 8 4 4 8 4 2 5 10 9
The number of successes in each of the 20 simulated experiments above is being generated
by a discrete random variable Y with a probability distribution p(y|θ) called the binomial
distribution.
For discrete random variables such as the binomial, the probability distribution p(y|θ) is called
a probability mass function (PMF). The PMF defines the probability of each possible outcome.
In the above example, with n = 10 trials, there are 11 possible outcomes: y = 0, 1, 2, . . . , 10
successes. Which of these outcomes is most probable depends on the parameter θ in the
binomial distribution that represents the probability of success.
The left-hand side plot in Figure 1.2 shows an example of a binomial PMF with 10 trials, with
the parameter θ fixed at 0.5 . Setting θ to 0.5 leads to a PMF where the most probable
outcome is 5 successes out of 10 . If we had set θ to, say 0.1, then the most probable outcome
would be 1 success out of 10 ; and if we had set θ to 0.9 , then the most probable outcome
would be 9 successes out of 10 .
FIGURE 1.2: Probability mass functions of a binomial distribution assuming 10 trials, with
50%, 10%, and 90% probability of success.
The probability mass function for the binomial is written as follows.

\text{Binomial}(k \mid n, \theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k}

Here, n represents the total number of trials, k the number of successes (this could range
from 0 to 10), and θ the probability of success. The term \binom{n}{k}, pronounced n-choose-k,
represents the number of ways in which one can choose k successes out of n trials. For
example, 1 success out of 10 can occur in 10 possible ways: the very first trial could be a 1,
the second trial could be a 1, etc. The term \binom{n}{k} expands to \frac{n!}{k!(n-k)!}. In R, it
is computed using the function choose(n, k) , with n and k representing positive integer
values.
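For example, using the numbers from the running example:

choose(10, 1)
## [1] 10
choose(10, 7)
## [1] 120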
When we want to express the fact that the data is assumed to be generated from a binomial
random variable, we will write Y ∼ Binomial(n, θ) . If the data is generated from a random
variable that has some other probability distribution f (θ) , we will write Y ∼ f (θ) . In this book,
we use f (⋅) synonymously with p(⋅) to represent a probability distribution.
It is possible to analytically compute the mean (expectation) and variance of the PMF
associated with the binomial random variable Y .
The expectation of a discrete random variable Y with probability mass function f(y) is defined
as

E[Y] = \sum_y y \cdot f(y)
As a simple example, suppose that we toss a fair coin once. The possible outcomes are Tails
(represented as 0) and Heads (represented as 1), each with equal probability, 0.5. The
expectation is:
E[Y] = \sum_{y=0}^{1} y \cdot f(y) = 0 \cdot 0.5 + 1 \cdot 0.5 = 0.5
The expectation has the interpretation that if we were to do the experiment a large number of
times and calculate the sample mean of the observations, in the long run we would approach
the value 0.5. Another way to look at the above definition is that the expectation gives us the
weighted mean of the possible outcomes, weighted by the respective probabilities of each
outcome.
Without getting into the details of how these are derived mathematically (Kerns 2014), we just
state here that the mean of Y (the expectation E[Y ] ) and the variance of Y (written V ar(Y ) )
of a binomial distribution with parameter θ and n trials are E[Y ] = nθ and
V ar(Y ) = nθ(1 − θ) .
In the binomial example above, n is a fixed number because we decide on the total number of
trials before running the experiment. In the PMF \binom{n}{k} \theta^{k} (1-\theta)^{n-k}, θ is
also a fixed value; the only variable in a PMF is k. In real experimental situations, we never
know the true value of θ. But θ can be estimated from the data: from the observed data, we
can compute the estimate \hat{\theta} = k/n. The quantity \hat{\theta} is the observed
proportion of successes, and is called the maximum likelihood estimate of the true (but
unknown) parameter θ. Once we have estimated θ in this way, we can also obtain an estimate
of the variance by computing n\hat{\theta}(1 - \hat{\theta}). These estimates are then used for
statistical inference.
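As a small worked example (assuming k = 7 successes out of n = 10 trials, the values used
below):

n <- 10
k <- 7
(theta_hat <- k / n)
## [1] 0.7
n * theta_hat * (1 - theta_hat)
## [1] 2.1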
What does the term “maximum likelihood estimate” mean? In order to understand this term, it
is necessary to first understand what a likelihood function is. Recall that in the discussion
above, the PMF p(k|n, θ) assumes that θ and n are fixed, and k will vary from 0 to 10 when
the experiment is repeated multiple times.
The likelihood function refers to the PMF p(k|n, θ) , treated as a function of θ. Once we have
observed a particular value for k, this value is now fixed, along with n . Once k and n are
fixed, the function p(k|n, θ) only depends on θ. Thus, the likelihood function is the same
function as the PMF, but assumes that the data is fixed and only the parameter θ varies (from
0 to 1). The likelihood function is written L(θ|k, n) , or simply L(θ) .
For example, suppose that we record n = 10 trials, and observe k = 7 successes. The
likelihood function is:
L(\theta \mid k = 7, n = 10) = \binom{10}{7} \theta^{7} (1-\theta)^{10-7}
If we now plot the likelihood function for all possible values of θ ranging from 0 to 1, we get the
plot shown in Figure 1.3.
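A minimal sketch of how such a plot can be produced (the grid of θ values and the plotting
details are our own choices):

theta <- seq(0, 1, by = 0.001)
## dbinom is vectorized over prob, so this evaluates
## the likelihood over the whole grid of theta values:
plot(theta, dbinom(7, size = 10, prob = theta),
     type = "l", xlab = "theta", ylab = "Likelihood")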
FIGURE 1.3: Likelihood function.
What is important about this plot is that it shows that, given the data, the likelihood function
reaches its maximum at θ = 0.7, which corresponds to the estimate computed using the formula shown above:
k/n = 7/10 . Thus, the maximum likelihood estimate (MLE) gives us the most likely value that
the parameter θ has, given the data. In the binomial, the proportion of successes k/n can be
shown to be the maximum likelihood estimate of the parameter θ (e.g., see p. 339-340 of
Miller and Miller 2004).
A crucial point: the “most likely” value of the parameter is with respect to the data at hand. The
goal is to estimate an unknown parameter value from the data. This parameter value is
chosen such that the probability (discrete case) or probability density (continuous case) of
getting the sample values (i.e., the data) is a maximum. This parameter value is the maximum
likelihood estimate (MLE).
This MLE from a particular sample of data need not invariably give us an accurate estimate of
θ . For example, if we run our experiment for 10 trials and get 1 success out of 10 , the MLE is
0.10 . We could have just happened to observe only one success out of ten by chance, even if
the true θ were 0.7 . If we were to repeatedly run the experiment with increasing sample sizes,
as the sample size increases, the MLE would converge to the true value of the parameter.
Figure 1.4 illustrates this point. The key point here is that with a smaller sample size, the MLE
from a particular data-set may or may not point to the true value.
FIGURE 1.4: The plot shows the estimate of the mean proportion of successes sampled from
a binomial distribution with true probability of success 0.7, with increasing sample sizes. As
the sample size increases, the estimate converges to the true value of 0.7.
In Bayesian data analysis, we will constantly be asking the question: what information does a
probability distribution give us? In particular, we will treat each parameter θ as a random
variable; this will raise questions like: “what is the probability that the parameter θ lies between
two values a and b”; and “what is the range over which we can be 95% certain that the true
value of the parameter lies”? In order to be able to answer questions like these, we need to
know what information we can obtain once we have decided on a probability distribution that is
assumed to have generated the data, and how to extract this information using R. We
therefore discuss the different kinds of information we can obtain from a probability
distribution. For now we focus only on the binomial random variable introduced above.
1.4.2.1 Compute the probability of a particular outcome (discrete case
only)
The binomial distribution shown in Figure 1.2 already shows the probability of each possible
outcome under a different value for θ. In R, there is a built-in function that allows us to
calculate the probability of k successes out of n , given a particular value of k (this number
constitutes our data), the number of trials n , and given a particular value of θ; this is the
dbinom function. For example, the probability of 5 successes out of 10 when θ is 0.5 is:
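The code hidden here is presumably:

dbinom(5, size = 10, prob = 0.5)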
## [1] 0.246
The probabilities of success when θ is 0.1 or 0.9 can be computed by replacing 0.5 above by
each of these probabilities. One can just do this by giving dbinom a vector of probabilities:
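The hidden call is presumably the following (the values in the comment are the
corresponding results):

dbinom(5, size = 10, prob = c(0.1, 0.5, 0.9))
## [1] 0.00149 0.24609 0.00149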
Using the dbinom function, we can compute the cumulative probability of obtaining 1 or less,
2 or less successes etc. This is done through a simple summation procedure:
## the cumulative probability of obtaining
## 0, 1, or 2 successes out of 10,
## with theta = 0.5:
dbinom(0, size = 10, prob = 0.5) +
  dbinom(1, size = 10, prob = 0.5) +
  dbinom(2, size = 10, prob = 0.5)
## [1] 0.0547
Mathematically, what we computed above is the sum:

\sum_{k=0}^{2} \binom{n}{k} \theta^{k} (1-\theta)^{n-k}
An alternative to the cumbersome addition in the R code above is this more compact
statement, which is identical to the above mathematical expression:
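That statement, hidden in this rendering, is presumably:

sum(dbinom(0:2, size = 10, prob = 0.5))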
## [1] 0.0547
R has a built-in function called pbinom that does this summation for us. If we want to know
the probability of 2 or fewer than 2 successes as in the above example, we can write:
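The hidden call is presumably:

pbinom(2, size = 10, prob = 0.5, lower.tail = TRUE)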
## [1] 0.0547
The specification lower.tail = TRUE (the default value) ensures that the summation goes
from 2 to numbers smaller than 2 (which lie in the lower tail of the distribution in Figure 1.2). If
we wanted to know what the probability is of obtaining 3 or more successes out of 10 , we can
set lower.tail to FALSE :
pbinom(2, size = 10, prob = 0.5, lower.tail = FALSE)
## [1] 0.945
## equivalently:
## sum(dbinom(3:10,size = 10, prob = 0.5))
The cumulative distribution function or CDF can be plotted by computing the cumulative
probabilities for any value k or less than k, where k ranges from 0 to 10 in our running
example. The CDF is shown in Figure 1.5.
FIGURE 1.5: The cumulative distribution function for a binomial distribution assuming 10 trials,
with 50% probability of success.
1.4.2.3 Compute the inverse of the cumulative distribution function (the
quantile function)
We can also find out the value of the variable k (the quantile) such that the probability of
obtaining k or less than k successes is some specific probability value p. If we switch the x
and y axes of Figure 1.5, we obtain another very useful function, the inverse of the CDF.
The inverse of the CDF (known as the quantile function in R because it returns the quantile,
the value k) is available in R as the function qbinom . The usage is as follows: to find out what
the value k of the outcome is such that the probability of obtaining k or less successes is 0.37
, type:
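The hidden call is presumably:

qbinom(0.37, size = 10, prob = 0.5)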
## [1] 4
One can visualize the inverse CDF of the binomial as in Figure 1.6.
We can generate simulated data from a binomial distribution by specifying the number of trials
and the probability of success θ. In R, we do this as follows:
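The hidden call is presumably:

rbinom(1, size = 10, prob = 0.5)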
## [1] 7
The above code generates the number of successes in an experiment with 10 trials. If we run
the above code repeatedly, we will get a different number of successes each time.
As mentioned earlier, if there is only one trial, then instead of the binomial distribution, we
have a Bernoulli distribution. For example, if we have 10 observations from a Bernoulli
distribution, where the probability of success is 0.5, we can simulate data as follows using the
function rbern from the package extraDistr .
rbern(n=10,prob=0.5)
## [1] 0 1 1 0 1 1 0 0 1 1
The above kind of output can also be generated by using the rbinom function: rbinom(n =
10, size = 1, prob = 0.5) . When the data are generated using the rbinom function in this
way, one can calculate the number of successes by just summing up the vector, or computing
its mean and multiplying by the number of trials, here 10 :
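The hidden code presumably defines the vector y and prints it:

y <- rbinom(n = 10, size = 1, prob = 0.5)
y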
## [1] 0 1 1 1 0 1 0 0 0 0
mean(y) * 10
## [1] 4
sum(y)
## [1] 4
We will now revisit the idea of the random variable using a continuous distribution. Imagine
that we have a vector of reading time data y measured in milliseconds and coming from a
normal distribution. The normal distribution is defined in terms of two parameters: the location,
its mean value μ , which determines its center, and the scale, its standard deviation, σ, which
determines how much spread there is around this center point.
The probability density function (PDF) of the normal distribution is defined as follows:
\text{Normal}(y \mid \mu, \sigma) = f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y-\mu)^2}{2\sigma^2} \right)

(In R, the scale parameter is specified in terms of the standard deviation σ rather than the
variance σ².)
Figure 1.7 visualizes the normal distribution for particular values of μ and σ, as a PDF (using
dnorm ), a CDF (using pnorm ), and the inverse CDF (using qnorm ). It should be clear from
the figure that these are three different ways of looking at the same information.
FIGURE 1.7: The PDF, CDF, and inverse CDF for the Normal(μ = 500, σ = 100) .
As in the discrete example, the PDF, CDF, and inverse of the CDF allow us to ask questions
like:
What is the probability of observing values between a and b from a normal distribution
with mean μ and standard deviation σ? Using the above example, we can ask what the
probability is of observing values between 200 and 700 ms:
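The hidden call is presumably the following, assuming the Normal(500, 100) distribution
used in Figure 1.7:

pnorm(700, mean = 500, sd = 100) - pnorm(200, mean = 500, sd = 100)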
## [1] 0.976
The probability of any point value in a PDF is always 0. This is because the probability in a
continuous probability distribution is the area under the curve, and the area at any point on the
x-axis is always 0. The implication here is that it is only meaningful to ask about probabilities
between two different point values; e.g., the probability that Y lies between a and b, or
P (a < Y < b) . Notice that P (a < Y < b) is the same statement as P (a ≤ Y ≤ b) .
What is the quantile q such that the probability is p of observing that value q or a value
more extreme than q? For example, we can work out the quantile q such that the
probability of observing q or some value less than it is 0.975, in the Normal(500, 100)
distribution. Formally, we would write this as P(Y < q) = 0.975.
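The hidden call is presumably:

qnorm(0.975, mean = 500, sd = 100)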
## [1] 696
The above output says that the quantile value q such that P(Y < q) = 0.975 is q = 696.
Generate simulated data. Given a vector of n independent and identically distributed data
y , i.e., given that each data point is being generated independently from
Y ∼ Normal(μ, σ) for some values of the parameters, the sample mean and standard
deviation4 are:
\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}

sd(y) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}}
For example, we can generate 10 data points using the rnorm function, and then use the
simulated data to compute the mean and standard deviation:
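The hidden code presumably generates y as follows (we assume μ = 500 and σ = 100,
consistent with the quantiles shown further below):

y <- rnorm(10, mean = 500, sd = 100)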
mean(y)
## [1] 532
sd(y)
## [1] 80.6
Again, the sample mean and sample standard deviation computed from a particular (simulated
or real) data set need not necessarily be close to the true values of the respective parameters.
Especially when sample size is small, one can end up with mis-estimates of the mean and
standard deviation.
Incidentally, simulated data can be used to generate all kinds of statistics. For example, we
can compute the lower and upper quantiles such that 95% of the simulated data are contained
within these quantiles:
quantile(y, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 427 673
Later on, we will be using samples to produce summary statistics like the ones shown above.
In continuous distributions like the normal discussed above, it is important to understand that
the probability density function or PDF, p(y|μ, σ) defines a mapping from the y values (the
possible values that the data can have) to a quantity called the density of each possible value.
We can see this function in action when we use dnorm to compute, say, the density value
corresponding to y = 1 and y = 2 in the standard normal distribution.
## density:
dnorm(1, mean = 0, sd = 1)
## [1] 0.242
dnorm(2, mean = 0, sd = 1)
## [1] 0.054
If the density at a particular point value like 1 is high compared to some other value – 2 in the
above example – then this point value 1 has a higher likelihood than 2 in the standard normal
distribution.
The quantities computed for the values 1 and 2 are not the probabilities of observing 1 or 2 in this
distribution. As mentioned earlier, probability in a continuous distribution is defined as the area
under the curve, and this area will always be zero at any point value (because there are
infinitely many different possible values). If we want to know the probability of obtaining values
between an upper and lower bound b and a, i.e., P (a < Y < b) where these are two distinct
values, we must use the cumulative distribution function or CDF: in R, for the normal
distribution, this is the pnorm function. For example, the probability of observing a value
between +2 and -2 in a normal distribution with mean 0 and standard deviation 1 is:
pnorm(2, mean = 0, sd = 1) - pnorm(-2, mean = 0, sd = 1)
## [1] 0.954
The situation is different in discrete random variables. These have a probability mass function
(PMF) associated with them—an example is the binomial distribution that we saw earlier.
There, the PMF maps the possible y values to the probabilities of those values occurring. That
is why, in the binomial distribution, the probability of observing exactly 2 successes when
sampling from a Binomial(n = 10, θ = 0.5) distribution can be computed using either dbinom or pbinom:

dbinom(2, size = 10, prob = 0.5)
## [1] 0.0439
pbinom(2, size = 10, prob = 0.5) - pbinom(1, size = 10, prob = 0.5)
## [1] 0.0439
In the second line of code above, we are computing the cumulative probability of observing
two or less successes, minus the probability of observing one or less successes. This gives us
the probability of observing exactly two successes. The dbinom function gives us the same information directly.
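Since the CDF of a discrete random variable is simply the cumulative sum of the PMF, yet another equivalent check is to sum the dbinom values from 0 up to 2 and compare with pbinom:

sum(dbinom(0:2, size = 10, prob = 0.5))
## [1] 0.0547
pbinom(2, size = 10, prob = 0.5)
## [1] 0.0547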
1.5.2 Truncating a normal distribution
In the above discussion, the support for the normal distribution ranges from minus infinity to
plus infinity. One can define PDFs with a more limited support; an example would be a normal distribution whose PDF f(x) is truncated at 0 from below, allowing only positive values. In such a case, the area under the curve over the range minus infinity to zero, $\int_{-\infty}^{0} f(x)\, dx$, will be 0, because that range lies outside the support of the truncated normal distribution. Also,
if one truncates a probability density function like the standard normal (Normal(0, 1)) at 0, in
order to make the area between zero and plus infinity sum up to 1, we would have to multiply
the truncated distribution f(x) by some factor k such that the following integral sums to 1:
$$k \int_{0}^{\infty} f(x)\, dx = 1 \tag{1.1}$$

Clearly, this factor is $k = \frac{1}{\int_{0}^{\infty} f(x)\, dx}$. For the standard normal, this integral is easy to compute in R:

pnorm(0, mean = 0, sd = 1, lower.tail = FALSE)
## [1] 0.5
## alternatively:
1 - pnorm(0, mean = 0, sd = 1, lower.tail = TRUE)
## [1] 0.5
Also, if we had truncated the distribution at 0 to the right instead of the left (allowing only
negative values), we would have to find the factor k in the same way as above, except that we
would have to find k such that:
$$k \int_{-\infty}^{0} f(x)\, dx = 1$$
For the standard normal case, computing this factor in R requires the CDF:

pnorm(0, mean = 0, sd = 1)
## [1] 0.5
Later in this book, we will be using such truncated distributions when doing Bayesian
modeling, and when we use them, we will want to multiply the truncated distribution by the
factor k to ensure that it is still a proper PDF that sums to 1. Truncated normal distributions
will be discussed in more detail in Box 4.1.
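As a preview, here is one way to define the density of a standard normal truncated at 0 from below, using the factor k computed above, and to verify that the resulting density integrates to 1 (a sketch; the function name dtruncated is ours, not a library function):

## density of a Normal(0, 1) truncated at 0 from below:
dtruncated <- function(x) {
  ## normalizing factor k = 1 / P(Y > 0) = 2 for the standard normal:
  k <- 1 / pnorm(0, mean = 0, sd = 1, lower.tail = FALSE)
  ifelse(x >= 0, k * dnorm(x, mean = 0, sd = 1), 0)
}
## the area under the curve is now 1:
integrate(dtruncated, lower = 0, upper = Inf)$value
## [1] 1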
So far, we have only discussed univariate distributions; these are distributions that involve
only one variable. For example, when we talk about data generated from a Binomial
distribution, or from a Normal distribution, these are univariate distributions.
It is also possible to specify distributions with two or more dimensions. Some examples will
make it clear what a bivariate (or more generally, multivariate) distribution is.
Starting with the discrete case, consider the discrete bivariate distribution shown below. These
are data from an experiment where, inter alia, in each trial a Likert acceptability rating and a
question-response accuracy were recorded (the data are from a study by Laurinavichyute
(2020), used with permission here). Load the data by loading the R package bcogsci .
data("df_discreteagrmt")
Figure 1.8 shows the joint probability mass function of two random variables X and Y. The
random variable X consists of 7 possible values (this is the 1-7 Likert response scale), and the
random variable Y is question-response accuracy, with 0 representing an incorrect response,
and 1 representing a correct response. One can also display Figure 1.8 as a table; see Table
1.1.
FIGURE 1.8: Example of a discrete bivariate distribution. In these data, in every trial, two
pieces of information were collected: Likert responses and yes-no question responses. The
random variable X represents Likert scale responses on a scale of 1-7, and the random variable Y represents 0, 1 (incorrect, correct) responses to comprehension questions.
TABLE 1.1: The joint PMF for two random variables X and Y.
For each possible pair of values of X and Y, we have a joint probability pX,Y (x, y) . Given such
a bivariate distribution, there are two useful quantities we can compute: the marginal
distributions (pX and pY ), and the conditional distributions (pX|Y and pY |X ). Table 1.1 shows
the joint probability mass function pX,Y (x, y) .
The marginal distribution pY is defined as follows; SX is the support of X (all the possible values of X), and SY is, similarly, the support of Y.

$$p_Y(y) = \sum_{x \in S_X} p_{X,Y}(x,y) \quad\quad p_X(x) = \sum_{y \in S_Y} p_{X,Y}(x,y)$$
pY is computed by summing up the rows, and pX by summing up the columns. We can see why this is called the marginal distribution: the result appears in the margins of the table. In the code below, the object probs contains the bivariate PMF shown in Table 1.1.
# P(Y)
(PY <- rowSums(probs))
## y=0 y=1
## 0.291 0.709
sum(PY) ## sums to 1
## [1] 1
# P(X)
(PX <- colSums(probs))
sum(PX) ## sums to 1
## [1] 1
The marginal probabilities sum to 1, as they should. Table 1.2 shows the marginal
probabilities.
TABLE 1.2: The joint PMF for two random variables X and Y, along with the marginal
distributions of X and Y.
To compute the marginal distribution of X, one sums over all the values of Y; to compute the marginal distribution of Y, one sums over all the values of X. We say that we are marginalizing out the random variable that we are summing over. One can also visualize the two marginal
distributions using barplots (Figure 1.9).
FIGURE 1.9: The marginal distributions of the random variables X and Y, presented as
barplots.
1.6.1.2 Conditional distributions
For computing conditional distributions, recall that conditional probability (see section 1.2) is
defined as:
$$p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}, \quad \text{provided } p_Y(y) = P(Y = y) > 0$$

and

$$p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)}, \quad \text{provided } p_X(x) = P(X = x) > 0$$
As an example, let's consider how pX|Y would be computed. The possible values of y are 0 and 1, and so we have to find the conditional distribution (defined above) for each of these values; i.e., we have to find pX|Y(x | y = 0) and pX|Y(x | y = 1).
$$p_{X|Y}(1 \mid 0) = \frac{p_{X,Y}(1,0)}{p_Y(0)} = \frac{0.018}{0.291} = 0.062$$
This conditional probability value will occupy the cell X=1, Y=0 in Table 1.3 summarizing the
conditional probability distribution pX|Y . In this way, one can fill in the entire table, which will
then represent the conditional distributions pX|Y =0 and pX|Y =1 . The reader may want to take
a few minutes to complete Table 1.3.
TABLE 1.3: The conditional distributions pX|Y(x | y = 0) and pX|Y(x | y = 1) (to be completed).

                   x = 1   x = 2   x = 3   x = 4   x = 5   x = 6   x = 7
pX|Y(x | y = 0)    0.062
pX|Y(x | y = 1)
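In R, assuming that the joint PMF is stored as a matrix probs with rows y = 0 and y = 1 and columns x = 1, ..., 7 (as the rowSums()/colSums() calls above suggest), a whole row of Table 1.3 can be computed in one line; each conditional distribution must sum to 1:

## divide the y = 0 row of the joint PMF by the marginal P(Y = 0):
(PX_given_Y0 <- probs["y=0", ] / sum(probs["y=0", ]))
## the conditional probabilities sum to 1:
sum(PX_given_Y0)
## [1] 1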
Here, we briefly define the covariance and correlation of two discrete random variables. For detailed examples and discussion, see the references at the end of the chapter. Informally, if there is a high probability that large values of a random variable X are associated with large values of another random variable Y, we say that the covariance between the two random variables X and Y, written Cov(X, Y), is positive.
The covariance of two (discrete) random variables X and Y is defined as follows; E[⋅] refers to the expectation of a random variable.

$$Cov(X,Y) = E[(X - E[X])(Y - E[Y])] = \sum_{x}\sum_{y} (x - E[X])\,(y - E[Y])\, p_{X,Y}(x,y)$$
If the standard deviations of the two random variables are σX and σY, the correlation between the two random variables, ρXY, is defined as:

$$\rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$$
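To make these definitions concrete, here is a small worked example in R with a made-up 2 × 2 joint PMF (the values are hypothetical, not the rating data above):

## a hypothetical joint PMF; rows are y in {0, 1}, columns are x in {0, 1}:
probs_xy <- matrix(c(
  0.4, 0.1,
  0.1, 0.4
), nrow = 2, byrow = TRUE)
x <- c(0, 1)
y <- c(0, 1)
(EX <- sum(x * colSums(probs_xy))) ## E[X]
## [1] 0.5
(EY <- sum(y * rowSums(probs_xy))) ## E[Y]
## [1] 0.5
## Cov(X, Y): sum over x and y of (x - E[X]) (y - E[Y]) p(x, y):
(Cov <- sum(outer(y - EY, x - EX) * probs_xy))
## [1] 0.15
## the correlation; here sigma_X = sigma_Y = 0.5:
Cov / (0.5 * 0.5)
## [1] 0.6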
Consider now the continuous bivariate case; this time, we will use simulated data. Consider two normal random variables X and Y, each of which comes from, for example, a Normal(0, 1) distribution, with some correlation ρXY between the two random variables.
A bivariate distribution for two random variables X and Y , each of which comes from a normal
distribution, is expressed in terms of the means and standard deviations of each of the two
distributions, and the correlation ρXY between them. The standard deviations and correlation
are expressed in a special form of a 2 × 2 matrix called a variance-covariance matrix Σ . If
ρXY is the correlation between the two random variables, and σX and σY the respective
standard deviations, the variance-covariance matrix is written as:
$$\Sigma = \begin{pmatrix} \sigma_X^2 & \rho_{XY}\,\sigma_X\sigma_Y \\ \rho_{XY}\,\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}$$

The two random variables are then jointly distributed as the bivariate normal:

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N}_2 \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma \right)$$
The joint PDF is written with reference to the two variables fX,Y (x, y) . It has the property that
the area under the curve sums to 1—this sum-to-1 property is the same idea that we
encountered in the univariate cases (the normal and binomial distributions), except that we are
talking about a bivariate distribution here.
Formally, we would write the area under the curve as a double integral: we are summing up the volume under the curve for both X and Y (hence the two integrals):

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx\, dy = 1$$

Here, the terms dx and dy express the fact that we are summing the area under the curve along the X axis and the Y axis.
The joint CDF would be written as follows. The equation below gives us the probability of observing a value like (u, v) or some value smaller than that (i.e., some (u′, v′) such that u′ < u and v′ < v):

$$F_{X,Y}(u,v) = P(X < u, Y < v) = \int_{-\infty}^{v} \int_{-\infty}^{u} f_{X,Y}(x,y)\, dx\, dy$$
Just as in the discrete case, the marginal distributions can be derived by marginalizing out the other random variable:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy \quad\quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx$$

Here, the integral sign ∫ is the continuous equivalent of the summation sign ∑ in the discrete case. Luckily, we will never have to compute such integrals ourselves; but it is important to appreciate how a marginal distribution arises from a bivariate distribution: by integrating out, or marginalizing out, the other random variable.
A visualization will help. The figures below show a bivariate distribution with zero correlation
(Figure 1.10), a negative (Figure 1.11) and a positive correlation (Figure 1.12).
FIGURE 1.10: A bivariate normal distribution with zero correlation. Shown are four plots: the
top-right plot shows the three-dimensional bivariate density, the top-left plot the contour plot of
the distribution (seen from above). The lower plots show the cumulative distribution function
from two views, as a three-dimensional plot and as a contour plot.
FIGURE 1.11: A bivariate normal distribution with a negative correlation of -0.6. Shown are
four plots: the top-right plot shows the three-dimensional bivariate density, the top-left plot the
contour plot of the distribution (seen from above). The lower plots show the cumulative
distribution function from two views, as a three-dimensional plot and as a contour plot.
FIGURE 1.12: A bivariate normal distribution with a positive correlation of 0.6. Shown are four
plots: the top-right plot shows the three-dimensional bivariate density, the top-left plot the
contour plot of the distribution (seen from above). The lower plots show the cumulative
distribution function from two views, as a three-dimensional plot and as a contour plot.
In this book, we will make extensive use of such multivariate distributions, and it will soon become important to know how to generate simulated bivariate or multivariate data that are correlated. So let's look at that next.
Suppose we want to generate 100 pairs of correlated data, with correlation ρ = 0.6 . The two
random variables have mean 0, and standard deviations 5 and 10 respectively.
Here is how we would generate such data. First, define a variance-covariance matrix; then,
use the multivariate analog of the rnorm() function, mvrnorm() , from the MASS package to
generate 100 data points.
library(MASS)

## define a variance-covariance matrix:
Sigma <- matrix(c(5^2, 5 * 10 * .6, 5 * 10 * .6, 10^2),
byrow = FALSE, ncol = 2
)
## generate data:
u <- mvrnorm(
n = 100,
mu = c(0, 0),
Sigma = Sigma
)
head(u, n = 3)
## [,1] [,2]
## [1,] -5.32 -9.66
## [2,] 1.73 5.14
## [3,] 2.07 3.16
Figure 1.13 confirms that the simulated data are positively correlated.
FIGURE 1.13: The relationship between two positively correlated random variables, generated
by simulating data using the R function mvrnorm from the MASS library.
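We can also check the sample correlation of the simulated data directly; the exact value will differ from run to run because the data are randomly generated, but it should be near the true value of 0.6:

cor(u[, 1], u[, 2])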
One final useful fact about the variance-covariance matrix—one that we will need later—is that
it can be decomposed into the component standard deviations and an underlying correlation
matrix. For example, consider the matrix above:
Sigma
## [,1] [,2]
## [1,] 25 30
## [2,] 30 100
One can decompose the matrix as follows. The matrix can be seen as the product of a
diagonal matrix of the standard deviations and the correlation matrix:
$$\begin{pmatrix} 5 & 0 \\ 0 & 10 \end{pmatrix} \begin{pmatrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{pmatrix} \begin{pmatrix} 5 & 0 \\ 0 & 10 \end{pmatrix} = \begin{pmatrix} 25 & 30 \\ 30 & 100 \end{pmatrix}$$
## sds:
(sds <- c(5, 10))
## [1] 5 10
## diagonal matrix:
(sd_diag <- diag(sds))
## [,1] [,2]
## [1,] 5 0
## [2,] 0 10
## correlation matrix:
(corrmatrix <- matrix(c(1, 0.6, 0.6, 1), ncol = 2))
## [,1] [,2]
## [1,] 1.0 0.6
## [2,] 0.6 1.0
## the product of the diagonal matrix of standard deviations and the
## correlation matrix recovers the variance-covariance matrix:
sd_diag %*% corrmatrix %*% sd_diag
## [,1] [,2]
## [1,] 25 30
## [2,] 30 100
Here, we introduce a concept that will turn up many times in this book. The concept we
unpack here is called “integrating out a parameter”. We will need this when we encounter
Bayes’ rule in the next chapter, and when we use Bayes factors later in the book (chapter 15).
Integrating out a parameter refers to the following situation. The example used here discusses
the binomial distribution, but the approach is generally applicable for any distribution.
Suppose we have a binomial random variable Y with PMF p(Y ) . Suppose also that this PMF
is defined in terms of parameter θ that can have only three possible values, 0.1, 0.5, 0.9 , each
with equal probability. In other words, the probability that θ is 0.1, 0.5, or 0.9 is 1/3 each.
We stick with our earlier example of n = 10 trials and k = 7 successes. The likelihood
function then is
$$p(k = 7, n = 10 \mid \theta) = \binom{10}{7} \theta^{7} (1-\theta)^{3}$$

To compute the marginal likelihood p(k = 7, n = 10), we sum up the likelihood under each possible value of θ, weighting each term by the prior probability of that value. Writing the three possible values as θ1 = 0.1, θ2 = 0.5, and θ3 = 0.9:

$$p(k = 7, n = 10) = \binom{10}{7} \theta_1^{7} (1-\theta_1)^{3}\, p(\theta_1) + \binom{10}{7} \theta_2^{7} (1-\theta_2)^{3}\, p(\theta_2) + \binom{10}{7} \theta_3^{7} (1-\theta_3)^{3}\, p(\theta_3)$$

Plugging in the three values and their prior probabilities of 1/3 each:

$$p(k = 7, n = 10) = \binom{10}{7} 0.1^{7} (1-0.1)^{3} \times \frac{1}{3} + \binom{10}{7} 0.5^{7} (1-0.5)^{3} \times \frac{1}{3} + \binom{10}{7} 0.9^{7} (1-0.9)^{3} \times \frac{1}{3}$$

$$p(k = 7, n = 10) = \frac{1}{3} \binom{10}{7} \left[ 0.1^{7} (1-0.1)^{3} + 0.5^{7} (1-0.5)^{3} + 0.9^{7} (1-0.9)^{3} \right] = 0.058$$
Thus, a marginal likelihood is a kind of weighted sum of the likelihood, with the weights given by the prior probabilities of the possible values of the parameter.6
The above example was contrived, because we stated that the parameter θ has only three
possible values. In reality, because the parameter θ can have all possible values between 0
and 1, the summation has to be done over a continuous space [0, 1] . The way this summation
is expressed in mathematics is through the integral symbol:
$$p(k = 7, n = 10) = \int_{0}^{1} \binom{10}{7} \theta^{7} (1-\theta)^{3}\, d\theta$$
This statement is computing something similar to what we computed above with the three
discrete parameter values, except that the summation is being done over a continuous space
ranging from 0 to 1. We say that the parameter θ has been integrated out, or marginalized.
Integrating out a parameter will be a very common operation in this book, but we will never
have to do the calculation ourselves. For the above case, we can compute the integral in R:
integrate(function(theta) {
  choose(10, 7) * theta^7 * (1 - theta)^3
}, lower = 0, upper = 1)$value

## [1] 0.0909
The value that is output by the integrate function above is the marginal likelihood.
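For comparison, in the contrived discrete case above, where θ takes only the values 0.1, 0.5, and 0.9 with prior probability 1/3 each, the corresponding computation is a simple weighted sum:

sum(dbinom(7, size = 10, prob = c(0.1, 0.5, 0.9)) * (1 / 3))
## [1] 0.0582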
This completes our discussion of random variables and probability distributions. We now
summarize what we have learned so far.
Table 1.4 summarizes the different functions relating to PMFs and PDFs, using the binomial
and normal as examples.
TABLE 1.4: Important R functions relating to random variables.

                              Discrete (binomial)   Continuous (normal)
Density/mass function (d)     dbinom                dnorm
Cumulative distribution (p)   pbinom                pnorm
Quantile function (q)         qbinom                qnorm
Random generation (r)         rbinom                rnorm
Later on, we will use other distributions, such as the Uniform, Beta, etc., and each of these has its own set of d-p-q-r functions in R. One can look up these different distributions in,
for example, Blitzstein and Hwang (2014).
1.9 Summary
This chapter briefly reviewed some very basic concepts in probability theory, univariate
discrete and continuous random variables, and bivariate distributions. An important set of
functions we encountered are the d-p-q-r family of functions for different distributions; these
are very useful for understanding the properties of commonly used distributions, visualizing
distributions, and for simulating data. Distributions will play a central role in this book; for
example, knowing how to visualize distributions will be important for deciding on prior
distributions for parameters. Other important ideas we learned about were marginal and
conditional probability, marginal likelihood, and how to define multivariate distributions; these
concepts will play an important role in Bayesian statistics.
1.10 Further reading

A quick review of the mathematical foundations needed for statistics is available in the short
book by Fox (2009), as well as Gill (2006). Morin (2016) and Blitzstein and Hwang (2014) are
accessible introductions to probability theory. Ross (2002) is a more advanced treatment
which discusses random variable theory and illustrates applications of probability theory. A
good formal introduction to mathematical statistics (covering classical frequentist theory) is
Miller and Miller (2004). The freely available book by Kerns (2014) introduces frequentist and
Bayesian statistics from the ground up in a very comprehensive and systematic manner; the
source code for the book is available from https://github.com/gjkerns/IPSUR. The open-access
book, Probability and Statistics: a simulation-based introduction, by Bob Carpenter is also
worth studying: https://github.com/bob-carpenter/prob-stats. A thorough introduction to the
matrix algebra needed for statistics, with examples using R, is provided in Fieller (2016).
Commonly used probability distributions are presented in detail in Miller and Miller (2004),
Blitzstein and Hwang (2014), and Ross (2002).
1.11 Exercises
Given a normal distribution with mean 500 and standard deviation 100, use the pnorm
function to calculate the probability of obtaining values between 200 and 800 from this
distribution.
Calculate the following probabilities. Given a normal distribution with mean 800 and standard
deviation 150, what is the probability of obtaining:
Given a normal distribution with mean 600 and standard deviation 200, what is the probability
of obtaining:
Consider a normal distribution with mean 1 and standard deviation 1. Compute the lower and
upper boundaries such that:
the area (the probability) to the left of the lower boundary is 0.10.
the area (the probability) to the left of the upper boundary is 0.90.
Exercise 1.5 Practice using the qnorm function - Part 2
Given a normal distribution with mean 650 and standard deviation 125, there exist two
quantiles, the lower quantile q1 and the upper quantile q2, that are equidistant from the mean
650, such that the area under the curve of the normal between q1 and q2 is 80%. Find q1 and
q2.
Calculate the mean, variance, and the lower quantile q1 and the upper quantile q2, that are
equidistant and such that the range of probability between them is 80%.
This time we generate the data with a truncated normal distribution from the package
extraDistr . The details of this distribution will be discussed later in section 4.1 and in the
Box 4.1, but for now we can treat it as an unknown generative process:
Using the sample data, calculate the mean, variance, and the lower quantile q1 and the upper
quantile q2, such that the probability of observing values between these two quantiles is 80%.
Suppose that you have a bivariate distribution where one of the two random variables comes
from a normal distribution with mean μX = 600 and standard deviation σX = 100 , and the
other from a normal distribution with mean μY = 400 and standard deviation σY = 50 . The
correlation ρXY between the two random variables is 0.4 . Write down the variance-covariance
matrix of this bivariate distribution as a matrix (with numerical values, not mathematical
symbols), and then use it to generate 100 pairs of simulated data points. Plot the simulated
data such that the relationship between the random variables X and Y is clear. Generate two
sets of new data (100 pairs of data points each) with correlation −0.4 and 0, and plot these
alongside the plot for the data with correlation 0.4 .
References
Fieller, Nick. 2016. Basics of Matrix Algebra for Statistics with R. Boca Raton, FL: CRC Press.
Fox, John. 2009. A Mathematical Primer for Social Statistics. Vol. 159. Sage.
Gill, Jeff. 2006. Essential Mathematics for Political and Social Research. Cambridge: Cambridge University Press.
Kerns, G.J. 2014. Introduction to Probability and Statistics Using R. Second Edition.
Miller, I., and M. Miller. 2004. John E. Freund’s Mathematical Statistics with Applications.
Upper Saddle River, NJ: Prentice Hall.
Morin, David J. 2016. Probability: For the Enthusiastic Beginner. Createspace Independent
Publishing Platform.
Steyer, Rolf, and Werner Nagel. 2017. Probability and Conditional Expectation: Fundamentals
for the Empirical Sciences. Vol. 5. John Wiley & Sons.
1. Here, we use Y, but we could have used any letter, such as X, Z, etc. Later on, in some situations we will use Greek letters like θ, μ, σ to represent a random variable.↩
2. The actual formal definition of random variable is more complex, and it is based on
measure theory. A more rigorous definition can be found in, for example, Steyer and
Nagel (2017).↩
3. A notational aside: In frequentist treatments, the PMF would be written p(y; θ) , i.e., with a
semi-colon rather than the conditional distribution marked by the vertical bar. The semi-
colon is intended to indicate that in the frequentist paradigm, the parameters are fixed
point values; by contrast, in the Bayesian paradigm, parameters are random variables.
This has the consequence that for the Bayesian, the distribution of y, p(y), is really a
conditional distribution, conditional on a random variable, here θ. For the frequentist, p(y)
requires some point value for θ, but it cannot be a conditional distribution because θ is not
a random variable. We define conditional distributions later in this section.↩
4. R will compute the standard deviation by dividing by n − 1 , not n ; this is because dividing
by n gives a biased estimate (chapter 10 of Miller and Miller 2004). This is not an
important detail for our purposes, and in any case for large n it doesn’t really matter
whether one divides by n or n − 1 .↩
5. There is a built-in convenience function, sdcor2cov in the SIN package that does this
calculation, taking the vector of standard deviations (not the diagonal matrix) and the
correlation matrix to yield the variance-covariance matrix: sdcor2cov(stddev = sds, corr
= corrmatrix) .↩
6. Where does the above formula come from? It falls out from the law of total probability
discussed above!↩
Recall Bayes’ rule: When A and B are observable discrete events (such as “it has been
raining” or “the streets are wet”), we can state the rule as follows:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{2.1}$$
Given a vector of data y, Bayes’ rule allows us to work out the posterior distributions of the
parameters of interest, which we can represent as the vector of parameters Θ . This
computation is achieved by rewriting (2.1) as (2.2). What is different here is that Bayes’ rule is
written in terms of probability distributions. Here, p(⋅) is a probability density function
(continuous case) or a probability mass function (discrete case).
$$p(\Theta \mid y) = \frac{p(y \mid \Theta) \times p(\Theta)}{p(y)} \tag{2.2}$$

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Marginal Likelihood}}$$
The terms here have the following meaning. We elaborate on each point with an example
below.
The Posterior, p(Θ|y) , is the probability distribution of the parameters conditional on the
data.
The Likelihood, p(y|Θ), is as described in chapter 1: it is the PMF (discrete case) or the PDF (continuous case) expressed as a function of Θ.
The Prior, p(Θ) , is the initial probability distribution of the parameter(s), before seeing the
data.
The Marginal Likelihood, p(y) , was introduced in chapter 1 and standardizes the
posterior distribution to ensure that the area under the curve of the distribution sums to 1,
that is, it ensures that the posterior is a valid probability distribution.
Recall our cloze probability example from earlier in the chapter: subjects are shown a sentence to complete. Suppose that 100 subjects are asked to complete the sentence. If 80 out of 100 subjects complete the sentence with "umbrella," the estimated cloze probability or predictability (given the preceding context) would be 80/100 = 0.8. This is the maximum likelihood estimate of the probability of producing this word; we will designate the estimate with a "hat" on the parameter name: $\hat{\theta} = 0.8$. In the frequentist paradigm, $\hat{\theta} = 0.8$ is an estimate of an unknown point value θ "out there in nature."
A crucial point to notice here is that the proportion 0.80 that we estimated above from the data
can vary from one data set to another, and the variability in the estimate will be influenced by
the sample size. For example, assuming that the true value of the θ parameter is in fact 0.80,
if we repeatedly carry out the above experiment with say 10 participants, we will get some
variability in the estimated proportion. Let’s check this by carrying out 100 simulated
experiments and computing the variability of the estimated means under repeated sampling:
estimated_means <- rbinom(n = 100, size = 10, prob = 0.80) / 10
sd(estimated_means)
## [1] 0.145
The repeated runs of the (simulated) experiment are the sole underlying cause of the variability (shown by the output of the sd(estimated_means) command above) in the estimated proportion; the parameter θ = 0.80 itself is invariant here (we are repeatedly estimating this point value).
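Incidentally, this variability is roughly what the analytical standard error of a proportion, √(θ(1 − θ)/n), leads us to expect; the simulated value (0.145) differs somewhat because of sampling noise:

sqrt(0.8 * 0.2 / 10)
## [1] 0.126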
However, consider now an alternative radical idea: what if we treat θ as a random variable?
That is, suppose now that θ has a PDF associated with it. This PDF would now represent our
belief about possible values of θ, even before we have seen any data. For example, if at the
outset of the experiment, we believe that all possible values between 0 and 1 are equally
likely, we could represent that belief by stating that θ ∼ Uniform(0, 1) . The radical new idea
here is that we now have a way to represent our prior belief or knowledge about plausible
values of the parameter.
Now, if we were to run our simulated experiments again and again, there would be two
sources of variability in the estimate of the parameter: the data as well as the uncertainty
associated with θ.
theta <- runif(100, min = 0, max = 1)
estimated_means <- rbinom(n = 100, size = 10, prob = theta) / 10
sd(estimated_means)

## [1] 0.331
The higher standard deviation is now coming from the uncertainty associated with the θ
parameter. To see this, assume a “tighter” PDF for θ, say θ ∼ Uniform(0.3, 0.8) , then the
variability in the estimated means would again be smaller, but not as small as when we
assumed that θ was a point value:
theta <- runif(100, min = 0.3, max = 0.8)
estimated_means <- rbinom(n = 100, size = 10, prob = theta) / 10
sd(estimated_means)
## [1] 0.209
In other words, the greater the uncertainty associated with the parameter θ, the greater the
variability in the data.
The Bayesian approach to parameter estimation makes this radical departure from the
standard frequentist assumption that θ is a point value; in the Bayesian approach, θ is a
random variable with a probability density/mass function associated with it. This PDF is called
a prior distribution, and represents our prior belief or prior knowledge about possible values of
this parameter. Once we obtain data, these data serve to modify our prior belief about this
distribution; this updated probability density function of the parameter is called the posterior
distribution. These ideas are unpacked in the sections below.
Under the assumptions we have set up above, the responses follow a binomial distribution,
and so the PMF can be written as follows.
$$p(k \mid n, \theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \tag{2.3}$$
where k indicates the number of times “umbrella” is given as an answer, and n the total
number of answers given. Here, k can be any whole number going from 0 to 100 .
In a particular experiment that we carry out, if we collect 100 data points (n = 100 ) and it
turns out that k = 80 , these data are now a fixed quantity. The only variable now in the PMF
above is θ:
$$p(k = 80 \mid n = 100, \theta) = \binom{100}{80} \theta^{80} (1-\theta)^{20}$$

The above function is now a continuous function of the value θ, which has possible values
ranging from 0 to 1. Compare this to the PMF of the binomial, which treats θ as a fixed value
and defines a discrete distribution over the n+1 possible discrete values k that we can
observe (the possible number of successes).
Recall that the PMF and the likelihood are the same function seen from different points of
view. The only difference between the two is what is considered to be fixed and what is
varying. The PMF treats data as varying from experiment to experiment and θ as fixed,
whereas the likelihood function treats the data as fixed and the parameter θ as varying.
We now turn our attention back to our main goal, which is to find out, using Bayes’ rule, the
posterior distribution of θ given our data: p(θ|n, k) . In order to use Bayes’ rule to calculate this
posterior distribution, we need to define a prior distribution over the parameter θ. In doing so,
we are explicitly expressing our prior uncertainty about plausible values of θ.
For the choice of prior for θ in the binomial distribution, we need to assume that the parameter
θ is a random variable that has a PDF whose range lies within [0,1], the range over which θ
can vary (this is because θ represents a probability). The beta distribution, which is a PDF for
a continuous random variable, is commonly used as prior for parameters representing
probabilities. One reason for this choice is that its PDF ranges over the interval [0, 1] . The
other reason for this choice is that it makes the Bayes’ rule calculation remarkably easy.
The PDF of the beta distribution, with parameters a and b, is:

$$p(\theta \mid a, b) = \frac{1}{B(a,b)} \theta^{a-1} (1-\theta)^{b-1} \tag{2.4}$$
The term $B(a,b)$ expands to $\int_{0}^{1} \theta^{a-1} (1-\theta)^{b-1}\, d\theta$, and is a normalizing constant that ensures that the area under the curve sums to one. In some textbooks, you may see the PDF of the beta distribution written with the normalizing constant $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}$ (the expression $\Gamma(n)$ is defined as $(n-1)!$):

$$p(\theta \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1-\theta)^{b-1}$$

These two statements of the beta distribution are identical, because $B(a,b)$ can be shown to be equal to $\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ (Ross 2002).
The beta distribution’s parameters a and b can be interpreted as expressing our prior beliefs
about the probability of success; a represents the number of “successes”, in our case,
answers that are “umbrella” and b the number of failures, the answers that are not “umbrella”.
Figure 2.1 shows the different beta distribution shapes given different values of a and b.
FIGURE 2.1: Examples of beta distributions with different a and b parameter values; the panels include, among others, Beta(a = 1, b = 1) and Beta(a = 4, b = 4). Each panel plots the density against theta over the interval [0, 1].
The mean and variance of a beta distribution with parameters a and b are:

$$E[X] = \frac{a}{a+b} \quad\quad Var(X) = \frac{a \times b}{(a+b)^{2}\,(a+b+1)} \tag{2.5}$$
As an example, choosing a = 4 and b = 4 would mean that the answer “umbrella” is as likely
as a different answer, but we are relatively unsure about this. We could express our
uncertainty by computing the region over which we are 95% certain that the value of the
parameter lies; this is the 95% credible interval. For this, we would use the qbeta function in
R; the parameters a and b are called shape1 and shape2 in R.

qbeta(c(0.025, 0.975), shape1 = 4, shape2 = 4)
The credible interval chosen above is an equal-tailed interval: the area below the lower bound
and above the upper bound is the same (0.025 in the above case). One could define
alternative intervals; for example, in a distribution with only one mode (one peak; a unimodal
distribution), one could choose to use the narrowest interval that contains the mode. This is
called the highest posterior density interval (HDI). In skewed posterior distributions, the equal-
tailed credible interval and the HDI will not be identical, because the HDI will have unequal tail
probabilities. Some authors, such as Kruschke (2014), prefer to report the HDI. We will use
the equal-tailed interval in this book, simply because this is the standard output in Stan and
brms .
If we were to choose a = 10 and b = 10 , we would still be assuming that a priori the answer
“umbrella” is just as likely as some other answer, but now our prior uncertainty about this
mean is lower, as the 95% credible interval computed below shows.
qbeta(c(0.025, 0.975), shape1 = 10, shape2 = 10)
In Figure 2.1, we can also see the difference in uncertainty in these two examples graphically.
Which prior should we choose? In a real data analysis problem, the choice of prior would
depend on what prior knowledge we want to bring into the analysis (see chapter 6). If we don’t
have much prior information, we could use a = b = 1 ; this gives us a uniform prior (i.e.,
Uniform(0, 1) ). This kind of prior goes by various names, such as flat, non-informative prior,
or uninformative prior. By contrast, if we have a lot of prior knowledge and/or a strong belief
(e.g., based on a particular theory’s predictions, or prior data) that θ has a particular range of
plausible values, we can use a different set of a, b values to reflect our belief about the
parameter. Generally speaking, the larger our parameters a and b, the narrower the spread of
the distribution; i.e., the lower our uncertainty about the mean value of the parameter.
We will discuss prior specification in detail later in chapter 6. For the moment, just for
illustration, we choose the values a = 4 and b = 4 for the beta prior. Then, our prior for θ is
the following beta PDF:
$$p(\theta) = \frac{1}{B(4,4)} \theta^{3} (1-\theta)^{3}$$
Having chosen a likelihood, and having defined a prior on θ, we are ready to carry out our first
Bayesian analysis to derive a posterior distribution for θ.
Having specified the likelihood and the prior, we will now use Bayes’ rule to calculate p(θ|n, k)
. Using Bayes’ rule simply involves replacing the Likelihood and the Prior we defined above
into the equation we saw earlier:
Likelihood × Prior
Posterior =
Marginal Likelihood
Replacing the terms for the likelihood and the prior in this equation gives:

$$p(\theta \mid n = 100, k = 80) = \frac{\left[\binom{100}{80} \theta^{80} (1-\theta)^{20}\right] \times \left[\frac{1}{B(4,4)} \theta^{3} (1-\theta)^{3}\right]}{p(k = 80)}$$

where $p(k = 80)$ is $\int_{0}^{1} p(k = 80 \mid n = 100, \theta)\, p(\theta)\, d\theta$. This term will be a constant once the
number of successes k is known; this is the marginal likelihood we encountered in chapter 1.
In fact, once k is known, there are several constant values in the above equation; they are
constants because none of them depend on the parameter of interest, θ. We can collect all of
these together:
$$p(\theta \mid n = 100, k = 80) = \left[\frac{\binom{100}{80}}{B(4,4) \times p(k = 80)}\right] \left[\theta^{80} (1-\theta)^{20} \times \theta^{3} (1-\theta)^{3}\right] \tag{2.7}$$

The first term in square brackets, $\frac{\binom{100}{80}}{B(4,4) \times p(k = 80)}$, is all the constants collected together; it is the normalizing constant we have seen before, and it makes the posterior distribution p(θ | n = 100, k = 80) sum to one. Since it is a constant, we can ignore it for now and focus on the two other terms in the equation. Because we are ignoring the constant, we will now say that the posterior is proportional to the right-hand side:

$$p(\theta \mid n = 100, k = 80) \propto \theta^{80} (1-\theta)^{20} \times \theta^{3} (1-\theta)^{3} \tag{2.8}$$

Collecting the exponents together:

$$p(\theta \mid n = 100, k = 80) \propto \theta^{80+3} (1-\theta)^{20+3} = \theta^{83} (1-\theta)^{23} \tag{2.9}$$
The expression on the right-hand side corresponds to a beta distribution with parameters
a = 84 , and b = 24 . This becomes evident if we rewrite the right-hand side such that it
represents the core part of a beta PDF (see equation (2.4)). All that is missing is a normalizing
constant which would make the area under the curve sum to one.
$$\theta^{83} (1-\theta)^{23} = \theta^{84-1} (1-\theta)^{24-1}$$
This core part of any PDF or PMF is called the kernel of that distribution. Without a
normalizing constant, the area under the curve will not sum to one. Let’s check this:
(AUC <- integrate(function(theta) theta^84 * (1 - theta)^24,
  lower = 0, upper = 1
)$value)

## [1] 1.42e-26
So the area under the curve (AUC) is not 1—the posterior that we computed above is not a
proper probability distribution. What we have just done above is to compute the following
integral:
$$\int_{0}^{1} \theta^{84} (1-\theta)^{24}\, d\theta$$
We can use this integral to figure out what the normalizing constant is. Basically, we want to know the constant k such that the area under the curve sums to 1:

$$k \int_{0}^{1} \theta^{84} (1-\theta)^{24}\, d\theta = 1$$
We know what $\int_{0}^{1} \theta^{84} (1-\theta)^{24}\, d\theta$ is; we just computed that value (called AUC in the R code above). So, the normalizing constant is:

$$k = \frac{1}{\int_{0}^{1} \theta^{84} (1-\theta)^{24}\, d\theta} = \frac{1}{AUC}$$
So, what we have is the distribution of θ given the data, expressed as a PDF:
$$p(\theta \mid n = 100, k = 80) = \frac{1}{B(84, 24)} \theta^{84-1} (1-\theta)^{24-1}$$
integrate(function(theta) theta^84 * (1 - theta)^24 / AUC,
  lower = 0, upper = 1
)$value

## [1] 1
The above example is a case of a conjugate analysis: the posterior on the parameter has the
same form (belongs to the same family of probability distributions) as the prior. The above
combination of likelihood and prior is called the beta-binomial conjugate case. There are
several other such combinations of Likelihoods and Priors that yield a posterior that has a
PDF that belongs to the same family as the PDF on the prior; some examples will appear in
the exercises.
Formally, conjugacy is defined as follows: Given the likelihood p(y|θ) , if the prior p(θ) results
in a posterior p(θ|y) that has the same form as p(θ) , then we call p(θ) a conjugate prior.
For the beta-binomial conjugate case, we can derive a very general relationship between the likelihood, prior, and posterior. Given the binomial likelihood up to proportionality (ignoring the constant), $\theta^{k} (1-\theta)^{n-k}$, and given the prior, also up to proportionality, $\theta^{a-1} (1-\theta)^{b-1}$, their product will be:

$$\theta^{k} (1-\theta)^{n-k} \times \theta^{a-1} (1-\theta)^{b-1} = \theta^{a+k-1} (1-\theta)^{b+n-k-1}$$

Thus, given a Binomial(n, k|θ) likelihood, and a Beta(a, b) prior on θ, the posterior will be Beta(a + k, b + n − k).
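We can also check this general result numerically. The following sketch (our own check, not part of the derivation) compares the normalized product of likelihood and prior with the analytical Beta(a + k, b + n − k) density at an arbitrary value of θ; the two final lines should return the same value:

a <- 4; b <- 4; n <- 100; k <- 80
## unnormalized posterior: likelihood times prior
post_unnorm <- function(theta) {
  dbinom(k, size = n, prob = theta) * dbeta(theta, a, b)
}
## normalize by the marginal likelihood:
marg_lik <- integrate(post_unnorm, lower = 0, upper = 1)$value
post_unnorm(0.8) / marg_lik
## the analytically derived posterior density at the same point:
dbeta(0.8, a + k, b + n - k)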
We established in the example above that the posterior is a beta distribution with parameters
a = 84 , and b = 24 . We visualize the likelihood, prior, and the posterior alongside each other
in Figure 2.2.
FIGURE 2.2: The (scaled) likelihood, prior, and posterior in the beta-binomial conjugate
example. The likelihood is scaled to integrate to 1 to make it easier to compare to the prior
and posterior distributions.
We can summarize the posterior distribution either graphically as we did above, or summarize
it by computing the mean and the variance. The mean gives us an estimate of the cloze
probability of producing “umbrella” in that sentence (given the model, i.e., given the likelihood
and prior):
$$E[\hat{\theta}] = \frac{84}{84 + 24} = 0.78 \tag{2.10}$$

$$Var[\hat{\theta}] = \frac{84 \times 24}{(84 + 24)^{2}\,(84 + 24 + 1)} = 0.0016 \tag{2.11}$$
We could also display the 95% credible interval, the range over which we are 95% certain the
true value of θ lies, given the data and model.
qbeta(c(0.025, 0.975), shape1 = 84, shape2 = 24)
Typically, we would summarize the results of a Bayesian analysis by displaying the posterior
distribution of the parameter (or parameters) graphically, along with the above summary
statistics: the mean, the standard deviation or variance, and the 95% credible interval. You will
see many examples of such summaries later.
Just for the sake of illustration, let's take four different beta priors, each reflecting increasing certainty:

Beta(a = 2, b = 2)
Beta(a = 3, b = 3)
Beta(a = 6, b = 6)
Beta(a = 21, b = 21)
Each prior reflects a belief that θ = 0.5 , with varying degrees of (un)certainty. Given the
general formula we developed above for the beta-binomial case, we just need to plug in the
likelihood and the prior to get the posterior:
p(θ|n, k) ∝ p(k|n, θ)p(θ)
We can visualize each of these triplets of priors, likelihoods and posteriors; see Figure 2.3.
FIGURE 2.3: The (scaled) likelihood, prior, and posterior in the beta-binomial conjugate
example, for different uncertainties in the prior. The likelihood is scaled to integrate to 1 to
make its comparison easier.
Given some data and given a likelihood function, the tighter the prior, the greater the extent to
which the posterior orients itself towards the prior. In general, we can say the following about
the likelihood-prior-posterior relationship:
The posterior distribution of a parameter is a compromise between the prior and the
likelihood.
For a given set of data, the greater the certainty in the prior, the more heavily will the
posterior be influenced by the prior mean.
Conversely, for a given set of data, the greater the uncertainty in the prior, the more
heavily will the posterior be influenced by the likelihood.
Another important observation emerges if we increase the sample size from 100 to, say,
1000000 . Suppose we still get a sample mean of 0.8 here, so that k = 800000 . Now, the
posterior mean will be influenced almost entirely by the sample mean. This is because, in the
general form for the posterior Beta(a + k, b + n − k) that we computed above, the n and k
become very large relative to the a, b values, and dominate in determining the posterior mean.
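We can verify this with a quick computation of the posterior mean, (a + k)/(a + b + n), for a = b = 4:

(4 + 800000) / (4 + 4 + 1000000)
## [1] 0.8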
Whenever we do a Bayesian analysis, it is good practice to check whether the parameter you
are interested in estimating is sensitive to the prior specification. Such an investigation is
called a sensitivity analysis. Later in this book, we will see many examples of sensitivity
analyses in realistic data-analysis settings.
In the above example, we used an artificial example where we asked 100 subjects to
complete the sentence shown at the beginning of the chapter, and then we counted the
number of times that they produced “umbrella” vs. some other word as a continuation. Given
80 instances of “umbrella”, and using a Beta(4, 4) prior, we derived the posterior to be
Beta(84, 24) . We could now use this posterior as our prior for the next study. Suppose that
we were to carry out a second experiment, again with 100 subjects, and this time 60 produced
“umbrella”. We could now use our new prior (Beta(84, 24)) to obtain an updated posterior. We
have a = 84, b = 24, n = 100, k = 60 . This gives us as posterior:
Beta(a + k, b + n − k) = Beta(84 + 60, 24 + 100 − 60) = Beta(144, 64) .
Now, if we were to pool all our data that we have from the two experiments, then we would
have as data n = 200, k = 140 . Suppose that we keep our initial prior of a = 4, b = 4 . Then,
our posterior would be Beta(4 + 140, 4 + 200 − 140) = Beta(144, 64). This is exactly the same posterior that we got when we first analyzed the first 100 subjects' data, derived the posterior, and then used that posterior as a prior for the next 100 subjects' data.
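The arithmetic can be confirmed in a few lines of R:

## experiment 1: prior Beta(4, 4), data n = 100, k = 80:
a1 <- 4 + 80
b1 <- 4 + 100 - 80
## experiment 2: prior Beta(84, 24), data n = 100, k = 60:
c(a1 + 60, b1 + 100 - 60)
## [1] 144  64
## pooled data n = 200, k = 140, with the original Beta(4, 4) prior:
c(4 + 140, 4 + 200 - 140)
## [1] 144  64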
This toy example illustrates an important point that has great practical importance for cognitive
science. One can incrementally gain information about a research question by using
information from previous studies and deriving a posterior, and then use that posterior as a
prior. For practical examples from psycholinguistics showing how information can be pooled
from previous studies, see Jäger, Engelmann, and Vasishth (2017) and Nicenboim, Roettger,
and Vasishth (2018). Vasishth and Engelmann (2022) illustrate how the posterior from a previous study or collection of studies can be used to compute the posterior derived from new data.
2.3 Summary
In this chapter, we learned how to use Bayes’ rule in the specific case of a binomial likelihood,
and a beta prior on the θ parameter in the likelihood function. Our goal in any Bayesian
analysis will follow the path we took in this simple example: decide on an appropriate
likelihood function, decide on priors for all the parameters involved in the likelihood function,
and use this model (i.e., the likelihood and the priors) to derive the posterior distribution of
each parameter. Then we draw inferences about our research question based on the posterior
distribution of the parameter.
In the example discussed in this chapter, Bayesian analysis was easy. This was because we
considered the simple conjugate case of the beta-binomial. In realistic data-analysis settings,
our likelihood function will be very complex, and many parameters will be involved. Multiplying
the likelihood function and the priors will become mathematically difficult or impossible. For
such situations, we use computational methods to obtain samples from the posterior
distributions of the parameters.
Accessible introductions to conjugate Bayesian analysis are Lynch (2007), and Lunn et al.
(2012). Somewhat more demanding discussions of conjugate analysis are in Lee (2012),
Carlin and Louis (2008), Christensen et al. (2011), O’Hagan and Forster (2004) and Bernardo
and Smith (2009).
2.5 Exercises
Let A and B be two observable events. P(A) is the probability that A occurs, and P(B) is the probability that B occurs. P(A|B) is the conditional probability that A occurs given that B has occurred:

$$P(A \mid B) = \frac{P(A, B)}{P(B)}, \quad \text{where } P(B) > 0$$
Using the above definition, and using the fact that P (A, B) = P (B, A) (i.e., the probability of
A and B both occurring is the same as the probability of B and A both occurring), derive an
expression for P (B|A) . Show the steps clearly in the derivation.
Suppose you are given data k consisting of the number of successes, coming from a
Binomial(n, θ) distribution. Given k successes in n trials coming from a binomial distribution,
we define a Beta(a, b) prior on the parameter θ.
Write down the Beta distribution that represents the posterior, in terms of a, b, n, and k.
Practical application
We ask a subject 10 yes/no questions, and the subject returns 0 correct answers. We
assume a binomial likelihood function for these data. Also assume a Beta(1, 1) prior on the
parameter θ, which represents the probability of success. Use the result you derived above to
write down the posterior distribution of the θ parameter.
Suppose that we perform n independent trials until we get a success (e.g., a heads in a coin toss). For coin tosses, the possible outcomes would be H, TH, TTH, and so on. The probability of success in each trial is θ. Then, the Geometric random variable, call it X, gives us the probability of getting a success in n trials as follows:

$$P(X = n) = \theta\, (1-\theta)^{n-1}$$

where n = 1, 2, ….
Let the prior on θ be Beta(a, b) , a beta distribution with parameters a,b. The posterior
distribution is a beta distribution with parameters a* and b*. Determine these parameters in
terms of a, b, and n .
The Gamma distribution with parameters a and b has the following PDF:

$$Ga(y \mid a, b) = \frac{b^{a}\, y^{a-1} \exp\{-by\}}{\Gamma(a)}$$
Suppose that we have data x1, …, xn, with sample size n, that are exponentially distributed. The exponential likelihood function is:
$$p(x_1, \ldots, x_n \mid \lambda) = \lambda^{n} \exp\Big\{-\lambda \sum_{i=1}^{n} x_i\Big\}$$
It turns out that if we assume a Ga(a,b) prior distribution for λ and the above Exponential
likelihood, the posterior distribution of λ is a Gamma distribution. In other words, the
Gamma(a,b) prior on the λ parameter in the Exponential distribution will be written:
$$Ga(\lambda \mid a, b) = \frac{b^{a}\, \lambda^{a-1} \exp\{-b\lambda\}}{\Gamma(a)}$$
This is a contrived example. Suppose we are modeling the number of times that a speaker
says the word “I” per day. This could be of interest if we are studying, for example, how self-
oriented a speaker is. The number of times x that the word is uttered over a particular time period (here, one day) can be modeled by a Poisson distribution (x = 0, 1, 2, …):

$$f(x \mid \theta) = \frac{\exp(-\theta)\, \theta^{x}}{x!}$$
where the rate θ is unknown, and the numbers of utterances of the target word on each day
are independent given θ.
We are told that the prior mean of θ is 100 and prior variance for θ is 225. This information is
based on the results of previous studies on the topic. We will use the Gamma(a,b) density
(see previous question) as a prior for θ because this is a conjugate prior to the Poisson
distribution.
a. First, visualize the prior, a Gamma density prior for θ, based on the above information. [Hint: we know that for a Gamma density with parameters a, b, the mean is $\frac{a}{b}$ and the variance is $\frac{a}{b^{2}}$. Since we are given values for the mean and variance, we can solve for a and b.]
b. Next, derive the posterior distribution of the parameter θ up to proportionality, and write
down the posterior distribution in terms of the parameters of a Gamma distribution.
Practical application
Suppose we know that the number of “I” utterances from a particular individual is
115, 97, 79, 131 . Use the result you derived above to obtain the posterior distribution. In other
words, write down the parameters of the Gamma distribution (call them a∗, b∗ ) representing
the posterior distribution of θ.
Plot the prior and the posterior distributions alongside each other.
Now suppose you get one new data point: 200. Using the posterior Gamma(a∗, b∗) as your
prior, write down the updated posterior (in terms of the updated parameters of the Gamma
distribution) given this new data point. Add the updated posterior to the plot you made above.
Exercise 2.6 The posterior mean is a weighted mean of the prior mean and the MLE
(Poisson-Gamma conjugate case)
The number of times an event happens per unit time can be modeled using a Poisson
distribution, whose PMF is:
$$f(x \mid \theta) = \frac{\exp(-\theta)\, \theta^{x}}{x!}$$
Suppose that we define a Gamma(a,b) prior for the rate parameter θ. It is a fact (see
exercises above) that the posterior of the θ parameter is a Gamma(a∗, b∗) distribution,
where a∗ and b∗ are the updated parameters given the data: θ ∼ Gamma(a∗, b∗) .
Prove that the posterior mean is a weighted mean of the prior mean and the maximum likelihood estimate (the sample mean) of the Poisson-distributed data, $\bar{x} = \sum_{i=1}^{n} x_i / n$. [Hint: the mean of a Gamma distribution is $\frac{a}{b}$.]

$$\frac{a*}{b*} = \frac{a}{b} \times \frac{w_1}{w_1 + w_2} + \bar{x} \times \frac{w_2}{w_1 + w_2} \tag{2.12}$$

where $w_1 = 1$ and $w_2 = \frac{n}{b}$.
Given equation (2.12), show that as n increases (as sample size goes up), the maximum likelihood estimate x̄ dominates in determining the posterior mean, and when n gets smaller and smaller, the prior mean dominates in determining the posterior mean. Also show that as n increases, the posterior variance will get smaller and smaller (the uncertainty on the posterior will go down).
References
Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley &
Sons.
Carlin, Bradley P, and Thomas A Louis. 2008. Bayesian Methods for Data Analysis. CRC
Press.
Christensen, Ronald, Wesley Johnson, Adam Branscum, and Timothy Hanson. 2011.
“Bayesian Ideas and Data Analysis.” CRC Press.
Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference
in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of
Memory and Language 94: 316–39. https://doi.org/https://doi.org/10.1016/j.jml.2017.01.004.
Kruschke, John. 2014. Doing Bayesian Data Analysis: A tutorial with R, JAGS, and Stan.
Academic Press.
Lee, Peter M. 2012. Bayesian Statistics: An Introduction. John Wiley & Sons.
Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012.
The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.
Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for
Social Scientists. New York, NY: Springer.
Nicenboim, Bruno, Timo B. Roettger, and Shravan Vasishth. 2018. “Using Meta-Analysis for
Evidence Synthesis: The case of incomplete neutralization in German.” Journal of Phonetics
70: 39–55. https://doi.org/https://doi.org/10.1016/j.wocn.2018.06.001.
O’Hagan, Antony, and Jonathan Forster. 2004. “Kendall’s Advanced Theory of Statistics, Vol.
2B: Bayesian Inference.” Wiley.
$$p(\Theta \mid y) = \frac{p(y \mid \Theta) \cdot p(\Theta)}{p(y)} = \frac{p(y \mid \Theta) \cdot p(\Theta)}{\int_{\Theta} p(y \mid \Theta) \cdot p(\Theta)\, d\Theta} \tag{3.1}$$
Unless we are dealing with conjugate distributions, the solution will be extremely hard to
derive or there will be no analytical solution. This was the major bottleneck of Bayesian
analysis in the past, and required Bayesian practitioners to program an approximation method
by themselves before they could even begin the Bayesian analysis. Fortunately, many of the
probabilistic programming languages freely available today (see the next section for a listing)
allow us to define our models without having to acquire expert knowledge about the relevant
numerical techniques.
Let’s say that we want to derive the posterior of the model from section 2.2, that is, the
posterior distribution of the cloze probability of “umbrella”, θ, given the following data: a word
(e.g., “umbrella”) was answered 80 out of 100 times, and assuming a binomial distribution as
the likelihood function, and Beta(a = 4, b = 4) as a prior distribution for the cloze probability.
Even without an analytically derived posterior distribution, if we can obtain enough samples from the posterior distribution of θ, we will have a good approximation of that posterior distribution. Obtaining samples from the posterior will be the only viable option in the
models that we will discuss in this book. By “obtaining samples”, we are talking about a
situation analogous to when we use rbinom() or rnorm() to obtain samples from a
particular distribution. For more details about sampling algorithms, see the further readings
suggested in section 3.10.
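For the beta-binomial example, where the posterior is known analytically to be Beta(84, 24), we can mimic what a sampler gives us by drawing from this distribution with rbeta() and summarizing the samples (a sketch of the idea only; Stan obtains its samples differently):

theta_samples <- rbeta(20000, shape1 = 84, shape2 = 24)
## the sample-based summaries approximate the analytical ones:
mean(theta_samples)
quantile(theta_samples, probs = c(0.025, 0.975))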
FIGURE 3.1: Histogram of the samples of θ from the posterior distribution generated via
sampling. The black line shows the density plot of the analytically derived posterior.
3.2 Bayesian Regression Models using Stan: brms
The surge in popularity of Bayesian statistics is closely tied to the increase in computing
power and the appearance of probabilistic programming languages, such as WinBUGS (Lunn
et al. 2000), JAGS (Plummer 2016), PyMC3 (Salvatier, Wiecki, and Fonnesbeck 2016), Turing
(Ge, Xu, and Ghahramani 2018), and Stan (Carpenter et al. 2017); for a historical review, see
Plummer (2022).
These probabilistic programming languages allow the user to define models without having to
deal (for the most part) with the complexities of the sampling process. However, they require
learning a new language since the user has to fully specify the statistical model using a
particular syntax.7 Furthermore, some knowledge of the sampling process is needed to
correctly parameterize the models and to avoid convergence issues (these topics will be
covered in detail in chapter 11).
There are some alternatives that allow Bayesian inference in R without having to fully specify
the model “by hand”. The packages rstanarm (Goodrich et al. 2018) and brms (Bürkner 2019)
provide Bayesian equivalents of many popular R model-fitting functions, such as (g)lmer
(Bates, Mächler, et al. 2015b) and many others; both rstanarm and brms use Stan as the
back-end for estimation and sampling. The package R-INLA (Lindgren and Rue 2015) allows
for fitting a limited selection of likelihood functions and priors in comparison to rstanarm and
brms (R-INLA can fit models that can be expressed as latent Gaussian models). This package
uses the integrated nested Laplace approximation (INLA) method for approximating Bayesian inference, rather than the sampling algorithms used by the other probabilistic languages listed. Another alternative is JASP (JASP Team 2019), which provides a graphical user
interface for both frequentist and Bayesian modeling, and is intended to be an open-source
alternative to SPSS.
We will focus on brms in this part of the book, because it can ease the transition from frequentist models to their Bayesian equivalents. The package brms is not only powerful enough to satisfy the statistical needs of many cognitive scientists; it has the added benefit that the Stan code can be inspected (with the brms functions
make_stancode() and make_standata() ), allowing the user to customize their models or learn
from the code produced internally by brms to eventually transition to writing the models
entirely in Stan. We revisit the models of this and the following chapters in the introduction to
Stan in chapter 10.
3.2.1 A simple linear model: A single subject pressing a button
repeatedly (a finger tapping task)
We’ll use the following example of a finger tapping task (for a review, see Hubel et al. 2013) to
illustrate the basic steps for fitting a model. Suppose that a subject first sees a blank screen.
Then, after a certain amount of time (say 200 ms), the subject sees a cross in the middle of a
screen, and as soon as they see the cross, they tap on the space bar as fast as they can until
the experiment is over (361 trials). The dependent measure here is the time it takes in
milliseconds from one press of the space bar to the next one. The data in each trial are
therefore finger tapping times in milliseconds. Suppose that the research question is: how long
does it take for this particular subject to press a key? (As an aside, notice that the data are not
independent and identically distributed, but we are ignoring that detail for now; we will address
that issue later in the book.)
The model makes the following assumptions:
1. There is a true (unknown) underlying time, μ ms, that the subject needs to press the
space bar.
2. There is some noise in this process.
3. The noise is normally distributed (this assumption is questionable, given that finger
tapping times, like response times in general, are typically skewed; we will fix this assumption later).8
This means that the likelihood for each observation n will be:
t_n ∼ Normal(μ, σ)    (3.2)
The reader may have encountered the model shown in Equation (3.2) in the form shown in
Equation (3.3):
t_n = μ + ε_n, where ε_n ∼ iid Normal(0, σ)    (3.3)
When the model is written in this way, it should be understood as meaning that each data
point t_n has some variability around a mean value μ, and that variability has standard
deviation σ. The term “iid” (independent and identically distributed) implies that each data point
t_n is independently generated (is not correlated with any of the other data points), and is
coming from the same distribution (namely, Normal(μ, σ)).
For a frequentist model that will give us the maximum likelihood estimate (the sample mean)
of the time it takes to press the space bar, this would be enough information to write the
formula in R , t ~ 1 , and plug it into the function lm() together with the data: lm(t ~ 1,
data) . The meaning of the 1 here is that lm will estimate the intercept in the model, in our
case μ . If the reader is completely unfamiliar with linear models, the references in section 4.5
will be helpful.
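As a minimal sketch of this frequentist counterpart (assuming the df_spacebar data frame loaded below):
# The intercept-only model; the maximum likelihood estimate
# of mu is the sample mean of t:
fit_lm <- lm(t ~ 1, data = df_spacebar)
coef(fit_lm) # the estimated intercept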
For a Bayesian linear model, we will also need to define priors for the two parameters of our
model. Let’s say that we know for sure that the time it takes to press a key will be positive and
lower than a minute (or 60000 ms), but we don’t want to make a commitment regarding which
values are more likely. We encode what we know about the noise in the task in σ: we know
that this parameter must be positive and we’ll assume that any value below 2000 ms is
equally likely. These priors are in general strongly discouraged: A flat (or very wide) prior will
almost never be the best approximation of what we know. Prior specification will be discussed
in detail in chapter 6.
In this case, even if we know very little about the task, we know that pressing the spacebar will
take at most a couple of seconds. We’ll use flat priors in this section for pedagogical
purposes; the next sections will already show more realistic uses of priors.
μ ∼ Uniform(0, 60000)
σ ∼ Uniform(0, 2000)    (3.4)
First, load the data frame df_spacebar from the bcogsci package:
data("df_spacebar")
df_spacebar
## # A tibble: 361 × 2
## t trial
## <int> <int>
## 1 141 1
## 2 138 2
## 3 128 3
## # … with 358 more rows
It is always a good idea to plot the data before doing anything else; see Figure 3.2. As we
suspected, the data look a bit skewed, but we ignore this for the moment.
ggplot(df_spacebar, aes(t)) +
geom_density() +
xlab("Finger tapping times") +
ggtitle("Button-press data")
FIGURE 3.2: Density plot of the button-press data (finger tapping times in ms).
Fit the model defined by equations (3.2) and (3.4) with brms in the following way.
fit_press <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
    prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000
)
The brms code has some differences from a model fit with lm . At this beginning stage, we’ll
focus on the following options:
1. The term family = gaussian() makes it explicit that the underlying likelihood function is
a normal distribution (Gaussian and normal are synonyms). This detail is implicit in lm .
Other families and link functions are possible, exactly as in the glm function. The default
family for brms , which corresponds to the lm function, is gaussian() .
2. The term prior takes as argument a vector of priors. Although this specification of priors
is optional, the researcher should always explicitly specify each prior. Otherwise, brms
will define priors by default, which may or may not be appropriate for the research area. In
cases where the distribution has restricted coverage, that is, where not every value is valid
(e.g., values smaller than 0 or larger than 60000 are not valid for the intercept), we need to set
lower and upper boundaries with lb and ub .9
3. The term chains refers to the number of independent runs for sampling (by default four).
4. The term iter refers to the number of iterations that the sampler makes to sample from
the posterior distribution of each parameter (by default 2000 ).
5. The term warmup refers to the number of iterations from the start of sampling that are
eventually discarded (by default half of iter ).
The last three options, chains , iter , warmup determine the behavior of the sampling
algorithm: the No-U-Turn Sampler (NUTS; Hoffman and Gelman 2014) extension of
Hamiltonian Monte Carlo (Duane et al. 1987; Neal 2011). We will discuss sampling in a bit
more depth in chapter 10, but the basic process is explained next.
The code specification starts four chains independently from each other. Each chain
“searches” for samples of the posterior distribution in a multidimensional space, where each
parameter corresponds to a dimension. The shape of this space is determined by the priors
and the likelihood. The chains start at random locations, and in each iteration they take one
sample each. When sampling begins, the samples may or may not belong to the posterior
distributions of the parameters. Eventually, the chains end up in the vicinity of the posterior
distribution, and from that point onward the samples will belong to the posterior.
Thus, when sampling begins, the samples from the different chains can be far from each
other, but at some point they will “converge” and start delivering samples from the posterior
distributions. Although there are no guarantees that the number of iterations we run the chains
for will be sufficient for obtaining samples from the posteriors, the default values of brms (and
Stan) are in many cases sufficient to achieve convergence. When the default number of
iterations do not suffice, brms (actually, Stan) will print out warnings, with suggestions for
fixing the convergence problems. If all the chains converge to the same distribution, by
removing the “warmup” samples, we make sure that we do not get samples from the initial
path to the posterior distributions. The default in brms is that half of the total number of
iterations in each chain (which default to 2000) will count as “warmup”. So, if one runs a model
with four chains and the default number of iterations, we will obtain a total of 4000 samples
from the four chains, after discarding the warmup iterations.
Figure 3.3(a) shows the path of the chains from the warmup phase onwards. Such plots are
called trace plots. The warmup is shown only for illustration purposes; generally, one should
only inspect the chains after the point where convergence has (presumably) been achieved
(i.e., after the dashed line). After convergence has occurred, a visual diagnostic check is that
chains should look like a “fat hairy caterpillar.” Compare the trace plot of our model in Figure
3.3(a) with the trace plot of a model that did not converge, shown in Figure 3.3(b).
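Trace plots like these can also be produced directly with the brms function mcmc_plot() (a minimal sketch; the function acts as a wrapper for the bayesplot package):
# Trace plots of the post-warmup draws of the parameters:
mcmc_plot(fit_press, type = "trace")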
Trace plots are not always diagnostic as regards convergence. The trace plots might look fine,
but the model may not have converged. Fortunately, Stan automatically runs several
diagnostics with the information from the chains, and if there are no warnings after fitting the
model and the trace plots look fine, we can be reasonably sure that the model converged, and
assume that our samples are from the true posterior distribution. However, it is necessary to
run more than one chain (preferably four), with a couple of thousand iterations (at least), in
order for the diagnostics to work.
FIGURE 3.3: a. Trace plots of our brms model for the button-pressing data. All the chains
start from initial values above 200, which fall outside of the plot. b. Trace plots of a model that
did not converge. We can diagnose the non-convergence by observing that the chains do
not overlap: each chain seems to be sampling from a different distribution.
Once the model has been fit (and assuming that we got no warning messages about
convergence problems), we can print out the samples of the posterior distributions of each of
the parameters using as_draws_df() (which stores metadata about the chains) or with
as.data.frame() :
as_draws_df(fit_press) %>%
head(3)
The term b_Intercept in the brms output corresponds to our μ . We can ignore the last two
columns: lp__ is not really part of the posterior; it is the log-density of the unnormalized
posterior for each iteration ( lp__ will be discussed later in Box 10.1). lprior is the log-
density of the (joint) prior distribution; it is there for compatibility with the package
priorsense (https://github.com/n-kall/priorsense).
Plot the density and trace plot of each parameter after the warmup (Figure 3.4).
FIGURE 3.4: Density and trace plots of our brms model for the button-pressing data.
Printing the object with the brms fit provides a nice, if somewhat verbose, summary:
fit_press
# posterior_summary(fit_press) is also useful
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: t ~ 1
## Data: df_spacebar (Number of observations: 361)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 168.63 1.26 166.18 171.06 1.00 3234 2568
##
## ...
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The Estimate is just the mean of the posterior samples, Est.Error is the standard deviation
of the posterior and the CIs mark the lower and upper bounds of the 95% credible intervals (to
distinguish credible intervals from frequentist confidence intervals, the former will be
abbreviated as CrIs):
as_draws_df(fit_press)$b_Intercept %>% mean()
## [1] 169
as_draws_df(fit_press)$b_Intercept %>% sd()
## [1] 1.26
as_draws_df(fit_press)$b_Intercept %>%
quantile(c(0.025, .975))
## 2.5% 97.5%
## 166 171
Furthermore, the summary provides the Rhat , Bulk_ESS , and Tail_ESS of each parameter.
R-hat compares the between- and within-chain estimates of each parameter: R-hat is larger
than 1 when the chains have not mixed well. One can only rely on the model if the R-hat values
of all the parameters are less than 1.05 (warnings will appear otherwise). Bulk ESS (bulk
effective sample size) is a measure of sampling efficiency in the bulk of the posterior
distribution, that is the effective sample size for the mean and median estimates, whereas tail
ESS (tail effective sample size) indicates the sampling efficiency at the tails of the distribution,
that is the minimum of effective sample sizes for 5% and 95% quantiles. The effective sample
size is generally smaller than the number of post-warmup samples, because the samples from
the chains are not independent (they are correlated to some extent), and carry less
information about the posterior distribution in comparison to independent samples. In some
cases, however, such as with the Intercept here, the effective sample size is actually larger
than the number of post-warmup samples. This might happen for parameters with a normally
distributed posterior (in the unconstrained space, see Box (thm:target)) and low dependence
on the other parameters (Vehtari, Gelman, Simpson, Carpenter, and Bürkner 2019). A very low
effective sample size indicates sampling problems (and is accompanied by warnings), and
in general appears together with chains that have not mixed properly. As a rule of thumb, Vehtari,
Gelman, Simpson, Carpenter, and Bürkner (2019) suggest that a minimum effective sample
size of 400 is required for reliable statistical summaries.
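These diagnostics can also be computed directly with the posterior package, on which brms relies; a minimal sketch:
library(posterior)
# R-hat, bulk ESS, and tail ESS for every parameter:
summarise_draws(as_draws_df(fit_press), default_convergence_measures())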
We see that we can fit our model without problems, and we get some posterior distributions
for our parameters. However, we should ask ourselves the following questions:
1. What information are the priors encoding? Do the priors make sense?
2. Does the likelihood assumed in the model make sense for the data?
We’ll try to answer these questions by looking at the prior and posterior predictive
distributions, and by doing sensitivity analyses. This is explained in the following sections.
3.3 Prior predictive distribution
Consider again the priors that were defined for the model above:
μ ∼ Uniform(0, 60000)
σ ∼ Uniform(0, 2000)    (3.5)
These priors encode assumptions about the kind of data we would expect to see in a future
study. To understand these assumptions, we are going to generate data from the model; such
data, which is generated entirely by the prior distributions, is called the prior predictive
distribution. Generating prior predictive distributions repeatedly helps us to check whether the
priors make sense. What we want to know here is, do the priors generate realistic-looking
data?
Formally, we want to know the density p(·) of the data points y_pred_1, …, y_pred_N from a
data set y_pred of length N, given a vector of parameters Θ with its priors and our likelihood
p(· | Θ) (in our example, Θ = ⟨μ, σ⟩). The prior predictive density is written as follows:

p(y_pred_1, …, y_pred_N) = ∫ p(y_pred_1 | Θ) · p(y_pred_2 | Θ) ⋯ p(y_pred_N | Θ) · p(Θ) dΘ
In essence, the vector of parameters is integrated out. This yields the probability distribution of
possible data sets given the priors and the likelihood, before any observations are taken into
account.
The integration can be carried out computationally by generating samples from the prior
distribution.
normal_predictive_distribution <-
function(mu_samples, sigma_samples, N_obs) {
# empty data frame with headers:
df_pred <- tibble(
trialn = numeric(0),
t_pred = numeric(0),
iter = numeric(0)
)
# i iterates from 1 to the length of mu_samples,
# which we assume is identical to
# the length of the sigma_samples:
for (i in seq_along(mu_samples)) {
mu <- mu_samples[i]
sigma <- sigma_samples[i]
df_pred <- bind_rows(
df_pred,
tibble(
trialn = seq_len(N_obs), # 1, 2,... N_obs
t_pred = rnorm(N_obs, mu, sigma),
iter = i
)
)
}
df_pred
}
The following code produces 1000 samples of the prior predictive distribution of the model
that we defined in section 3.2.1. This means that it will produce 361000 predicted values (361
predicted observations for each of the 1000 simulations). Although this approach works, it’s
quite slow (a couple of seconds). See Box 3.1 for a more efficient version of this function.
Section 3.7.2 will show that it’s possible to use brms to sample from the priors, ignoring the
t in the data by setting sample_prior = "only" . However, since brms still depends on
Stan’s sampler, which uses Hamiltonian Monte Carlo, the prior sampling process can also fail
to converge, especially when one uses very uninformative priors, like the ones used in this
example. In contrast, our function above, which uses rnorm() , cannot have convergence
issues and will always produce multiple sets of prior predictive data y_pred_1, …, y_pred_N.
N_samples <- 1000
N_obs <- nrow(df_spacebar)
mu_samples <- runif(N_samples, 0, 60000)
sigma_samples <- runif(N_samples, 0, 2000)
prior_pred <- normal_predictive_distribution(
  mu_samples = mu_samples,
  sigma_samples = sigma_samples,
  N_obs = N_obs
)
prior_pred
## # A tibble: 361,000 × 3
## trialn t_pred iter
## <dbl> <dbl> <dbl>
## 1 1 16710. 1
## 2 2 16686. 1
## 3 3 17245. 1
## # … with 360,997 more rows
Box 3.1 A more efficient function for generating prior predictive distribution
A more efficient function can be created in the following way using the map2_dfr()
function from the purrr package. This function yields an approximately 10-fold increase
in speed. Although the distributions should be the same with both functions, the specific
numbers in the tables won’t be, due to the randomness in the process of sampling.
The purrr function map2_dfr() (which works similarly to the base R functions lapply()
and Map() ) essentially runs a for-loop and builds a data frame from the output. It iterates
over the values of two vectors (or lists) simultaneously, here, mu_samples and
sigma_samples and, in each iteration, it applies a function to each value of the two
vectors, here, mu and sigma . The output of each function is a data frame (or tibble in
this case) with N_obs observations which is bound in a larger data frame at the end of
the loop. Each of these data frames bound together represents an iteration in the
simulation, and we identify the iterations by setting .id = "iter" .
Although this method for generating prior predictive distributions is a bit involved, it
presents an advantage in comparison to the more straightforward use of predict() (or
posterior_predict() , which can also generate prior predictions) together with setting
sample_prior = "only" in the brms model (as we will do in section 3.7.2). Namely, here
we don’t depend on Stan’s sampler, and that means that no matter the number of
iterations in our simulation or how uninformative our priors, there will never be any
convergence problems.
library(purrr)
# Define the function:
normal_predictive_distribution <- function(mu_samples,
                                           sigma_samples,
                                           N_obs) {
  # iterate simultaneously over mu_samples and sigma_samples,
  # row-binding one tibble per iteration (identified by `iter`):
  map2_dfr(mu_samples, sigma_samples,
    function(mu, sigma) {
      tibble(trialn = seq_len(N_obs),
             t_pred = rnorm(N_obs, mu, sigma))
    },
    .id = "iter") %>%
    mutate(iter = as.numeric(iter))
}
library(tictoc) # for tic() and toc()
tic()
prior_pred <- normal_predictive_distribution(
  mu_samples = mu_samples,
  sigma_samples = sigma_samples,
  N_obs = N_obs
)
toc()
Figure 3.5 shows the first 18 samples of the prior predictive distribution (i.e., 18 independently
generated prior predicted data sets) with the code below.
prior_pred %>%
filter(iter <= 18) %>%
ggplot(aes(t_pred)) +
geom_histogram(aes(y = after_stat(density))) +
xlab("predicted t (ms)") +
theme(
axis.text.x =
element_text(angle = 40, vjust = 1, hjust = 1, size = 14)
) +
scale_y_continuous(
limits = c(0, 0.0005),
breaks = c(0, 0.00025, 0.0005)
) +
# one panel per simulated data set:
facet_wrap(~iter, ncol = 3)
FIGURE 3.5: Eighteen samples from the prior predictive distribution of the model defined in
section 3.2.1.
The prior predictive distribution in Figure 3.5 shows prior data sets that are not realistic. Apart
from the fact that the generated finger tapping time distributions are symmetrical (whereas we
know from prior experience with such data that they are generally right-skewed), some
data sets present finger tapping times that are unrealistically long. Worse yet, if we inspect
enough samples from the prior predicted data, it will become clear that a few data sets even
contain negative finger tapping times.
We can also look at the distribution of summary statistics in the prior predictive data. Even if
we don’t know beforehand what the data should look like, it’s very likely that we have some
expectations for possible mean, minimum, or maximum values. For example, in the button-
pressing example, it seems reasonable to assume that average finger tapping times are
between 200 and 600 ms; finger tapping times are very unlikely to be below 50 ms (given the
delays in keyboards), and even long lapses of attention won’t be greater than a couple of
seconds.10 Three distributions of summary statistics are shown in Figure 3.6.
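These summary statistics can be computed from the prior predictive samples generated earlier; a minimal sketch, assuming that dplyr is loaded (as elsewhere in this chapter):
# One mean, minimum, and maximum value per simulated data set:
prior_pred %>%
  group_by(iter) %>%
  summarize(mean_t = mean(t_pred),
            min_t = min(t_pred),
            max_t = max(t_pred))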
FIGURE 3.6: The prior predictive distributions of the mean, minimum, and maximum values of
the button-pressing model defined in section 3.2.1.
Figure 3.6 shows that we used much less prior information than what we could have: Our
priors were encoding the information that any mean between 0 and 60000 ms is equally likely.
It seems clear that a value close to 0 or to 60000 ms would be extremely surprising. This wide
range of mean values occurs because of the uniform prior on μ . Similarly, maximum values
are quite “uniform”, spanning a much wider range than what one would expect. Finally, in the
distribution of minimum values, negative finger tapping times occur. This might seem
surprising (our prior for μ excluded negative values), but the reason that negative values
appear is that the prior is interpreted together with the likelihood (Gelman, Simpson, and
Betancourt 2017), and the likelihood is a normal distribution, which will allow for negative
samples even if the location parameter μ has a positive value.
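A quick simulation illustrates this point (a sketch; the specific values are arbitrary):
# Even with a positive location, a normal likelihood with a large
# standard deviation generates some negative values:
mean(rnorm(100000, mean = 500, sd = 2000) < 0) # proportion below zero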
To summarize the above discussion, the priors used in the example are clearly not very
realistic given what we might know about finger tapping times for such a button pressing task.
This raises the question: what priors should we have chosen? In the next section, we consider
this question.
For most cases that we will encounter in this book, there are four main classes of priors that
we can choose from. In the Bayesian community, there is no fixed nomenclature for classifying
different kinds of priors. For this book, we have chosen specific names for each type of prior,
but this is just a convention that we follow for consistency. There are also other classes of
prior that we do not discuss in this book. An example is improper priors such as
Uniform(−∞, +∞) , which are not proper probability distributions because the area under
the curve does not sum to 1.
When thinking about priors, the reader should not get hung up on what precisely the name is
for a particular type of prior; they should rather focus on what that prior means in the context
of the research problem.
One option is to choose priors that are as uninformative as possible. The idea behind this
approach is to let the data “speak for itself” and to not bias the statistical inference with
“subjective” priors. There are several issues with this approach: First, the prior is as subjective
as the likelihood, and in fact, different choices of likelihood might have a much stronger impact
on the posterior than different choices of priors. Second, uninformative priors are in general
unrealistic because they give equal weight to all values within the support of the prior
distribution, ignoring the fact that usually there is some minimal information about the
parameters of interest. Usually, at the very least, the order of magnitude is known (response
times or finger tapping times will be on the order of milliseconds and not days, EEG signals on
the order of microvolts and not volts, etc.). Third, uninformative priors make the sampling slower and might
lead to convergence problems. Unless there is a large amount of data, it would be wise to
avoid such priors. Fourth, it is not always clear which parameterization of a given distribution
the flat priors should be assigned to. For example, the Normal distribution is sometimes
defined based on its standard deviation (σ), its variance (σ²), or its precision (1/σ²): a flat prior for
the standard deviation is not flat for the precision of the distribution. Although it is sometimes
possible to find an uninformative prior that is invariant under a change of parameterization (also
called a Jeffreys prior; Jaynes 2003, sec. 6.15; Jeffreys 1939, Chapter 3), this is not always the
case. Finally, if Bayes factors need to be computed, uninformative priors can lead to very
misleading conclusions (chapter 15).
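The parameterization point can be verified with a short simulation (a sketch; the bounds are arbitrary):
# A flat prior on the standard deviation implies a very non-flat
# prior on the precision 1/sigma^2:
sigma_prior <- runif(100000, min = 0.1, max = 10)
precision_prior <- 1 / sigma_prior^2
hist(precision_prior, breaks = 100) # the mass piles up near zero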
If there does not exist much prior information (and if this information cannot be worked out
through reasoning about the problem), and there is enough data (what “enough” means here
will presently become clear when we look at specific examples), it is fine to use regularizing
priors. These are priors that down-weight extreme values (that is, they provide regularization),
they are usually not very informative, and mostly let the likelihood dominate in determining the
posteriors. These priors are theory-neutral; that is, they usually do not bias the parameters to
values supported by any prior belief or theory. The idea behind this type of prior is to help to
stabilize computation. These priors are sometimes called weakly informative or mildly
informative priors in the Bayesian literature. For many applications, they perform well, but as
discussed in chapter 15, they tend to be problematic if Bayes factors need to be computed.
For the button-pressing example, a regularizing prior for σ could be a truncated normal
distribution such as Normal+(0, 1000), which concentrates most of its probability mass below
2000 ms or so. This is a regularizing prior because it rules out negative button-pressing times and
down-weights extreme values over 2000 ms.
The idea here is to have priors that encode all (or most of) the theory-neutral information that
the researcher has. Since one generally knows what one’s data do and do not look like, it is
possible to build priors that truly reflect the properties of potential data sets, using prior
predictive checks. In this book, many examples of this class of priors will come up.
There are cases where a lot of prior knowledge exists. In general, unless there are very good
reasons for having relatively informative priors (see chapter 15), it is not a good idea to let the
priors have too much influence on the posterior. An example where informative priors would
be important is when investigating a language-impaired population from which we can’t get
many subjects, but a lot of previously published papers exist on the research topic.
These four options constitute a continuum. The uniform prior from the last model (section
3.2.1) falls between flat, uninformative and regularizing priors. In practical data analysis
situations, we are mostly going to choose priors that fall between regularizing and principled.
Informative priors, in the sense defined above, will be used only relatively rarely; but they
become more important to consider when doing Bayes factor analyses (chapter 15).
What would happen if even wider priors were used for the model defined previously (in section
3.2.1)? Suppose that every mean between −10^6 and 10^6 ms is assumed to be equally likely.
This prior is clearly unrealistic and actually makes no sense at all: we are not expecting
negative finger tapping times. Regarding the standard deviation, one could assume that any
value between 0 and 10^6 is equally likely.11 The likelihood remains unchanged.

μ ∼ Uniform(−10^6, 10^6)
σ ∼ Uniform(0, 10^6)    (3.6)
# The default settings are used when they are not set explicitly:
fit_press_unif <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(-10^6, 10^6),
      class = Intercept,
      lb = -10^6,
      ub = 10^6
    ),
    prior(uniform(0, 10^6),
      class = sigma,
      lb = 0,
      ub = 10^6
    )
  ),
  iter = 3000,
  control = list(
    adapt_delta = .99,
    max_treedepth = 15
  )
)
Even with these extremely unrealistic priors, which require us to change the adapt_delta and
max_treedepth default values to achieve convergence, the output of the model is virtually
identical to that of fit_press :
fit_press_unif
## ...
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 168.65 1.34 166.02 171.28 1.00 3939 3345
##
FIGURE 3.7: Comparison of the posterior distributions from the model with extremely
unrealistic priors, fit_press_unif , against the previous model with more “realistically”
bounded uniform distributions (but still not recommended), fit_press .
Next, consider what happens if very informative priors are used. Assume that mean values
very close to 400 ms are the most likely, and that the standard deviation of the finger tapping
times is very close to 100 ms. Given that this is a model of button-pressing times, such an
informative prior seems wrong—200 ms seems like a more realistic mean button-pressing
time, not 400 ms. You can check this by doing an experiment yourself and looking at the
recorded times; software such as Linger (http://tedlab.mit.edu/~dr/Linger/) makes it easy to set
up such an experiment.
μ ∼ Normal(400, 10)
σ ∼ Normal+(100, 10)    (3.7)

The Normal+ notation indicates a normal distribution truncated at zero such that only positive
values are allowed (Box 4.1 discusses this type of distribution in detail). Even though the prior
for the standard deviation is restricted to be positive, we are not required to add lb = 0 to
the prior; this is automatically taken into account by brms .
fit_press_inf <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(normal(400, 10), class = Intercept),
    # `brms` knows that SDs need to be bounded
    # to exclude values below zero:
    prior(normal(100, 10), class = sigma)
  )
)
Despite these unrealistic but informative priors, the likelihood mostly dominates and the new
posterior means and credible intervals are just a couple of milliseconds away from the
previous estimates:
fit_press_inf
## ...
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 172.89 1.38 170.22 175.66 1.00 2452 2659
##
As a final example of a sensitivity analysis, choose some principled priors. Assuming that we
have some prior experience with previous similar experiments, suppose the mean reaction
time is expected to be around 200 ms, with a 95% probability of the mean ranging from 0 to
400 ms. This uncertainty is perhaps unreasonably large, but one might want to allow a bit
more uncertainty than one really thinks is reasonable (this kind of conservativity in allowing
somewhat more uncertainty is sometimes called Cromwell’s rule in Bayesian statistics; see
O’Hagan and Forster 2004, sec. 3.19). In such a case, one can decide on the prior
Normal(200, 100) . Given that the experiment involves only one subject and the task is very
simple, one might not expect the residual standard deviation σ to be very large: As an
example, one can settle on a location of 50 ms for a truncated normal distribution, but still
allow for relatively large uncertainty: Normal+(50, 50) . The prior specifications are
summarized below.
μ ∼ Normal(200, 100)
σ ∼ Normal+(50, 50)
Why are these priors principled? The designation “principled” here largely depends on our
domain knowledge. Chapter 6 discusses how one can use domain knowledge when specifying
priors.
One can achieve a better understanding of what a particular set of priors imply by visualizing
the priors graphically, and carrying out prior predictive checks. These steps are skipped here,
but these issues will be discussed in detail in chapters 6 and 7. These chapters will give more
detailed information about choosing priors and on developing a principled workflow for
Bayesian data analysis.
fit_press_prin <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(normal(200, 100), class = Intercept),
    prior(normal(50, 50), class = sigma)
  )
)
fit_press_prin
## ...
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 168.70 1.32 166.07 171.24 1.00 3533 2548
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 25.01 0.94 23.24 26.94 1.00 3171 2484
##
## ...
The above examples of using different priors should not be misunderstood to mean that priors
never matter. When there is enough data, the likelihood will dominate in determining the
posterior distributions. What constitutes “enough” data is also a function of the complexity of
the model; as a general rule, more complex models require more data.
Even in cases where there is enough data and the likelihood dominates in determining the
posteriors, regularizing, principled priors (i.e., priors that are more consistent with our a priori
beliefs about the data) will in general speed up model convergence.
In order to determine the extent to which the posterior is influenced by the priors, it is a good
practice to carry out a sensitivity analysis: try different priors and either verify that the posterior
doesn’t change drastically, or report how the posterior is affected by some specific priors (for
examples from psycholinguistics, see Vasishth et al. 2013; Vasishth and Engelmann 2022).
Chapter 15 will demonstrate that sensitivity analysis becomes crucial for reporting Bayes
factors; even in cases where the choice of priors does not affect the posterior distribution, it
generally affects the Bayes factor.
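One simple way to report such a sensitivity analysis is to tabulate the posterior of the parameter of interest under each prior; a sketch using the models fit above:
# Posterior mean, SD, and 95% CrI of the intercept under each prior:
rbind(
  uniform = posterior_summary(fit_press)["b_Intercept", ],
  very_wide = posterior_summary(fit_press_unif)["b_Intercept", ],
  informative = posterior_summary(fit_press_inf)["b_Intercept", ],
  principled = posterior_summary(fit_press_prin)["b_Intercept", ]
)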
The posterior predictive distribution is a collection of data sets generated from the model (the
likelihood and the priors). Having obtained the posterior distributions of the parameters after
taking into account the data, the posterior distributions can be used to generate future data
from the model. In other words, given the posterior distributions of the parameters of the
model, the posterior predictive distribution gives us some indication of what future data might
look like.
Once the posterior distributions p(Θ ∣ y) are available, the predictions based on these
distributions can be generated by integrating out the parameters:

p(y_pred ∣ y) = ∫ p(y_pred ∣ Θ, y) · p(Θ ∣ y) dΘ

Assuming that past and future observations are conditionally independent given Θ, i.e.,
p(y_pred ∣ Θ, y) = p(y_pred ∣ Θ), the above equation can be written as:

p(y_pred ∣ y) = ∫ p(y_pred ∣ Θ) · p(Θ ∣ y) dΘ    (3.8)

In Equation (3.8), we are conditioning y_pred only on y; we do not condition on what we don’t
know (Θ): the unknown parameters have been integrated out.
distribution has important differences from predictions obtained with the frequentist approach.
The frequentist approach gives a point estimate of each predicted observation given the
maximum likelihood estimate of Θ (a point value), whereas the Bayesian approach gives a
distribution of values for each predicted observation. As with the prior predictive distribution,
the integration can be carried out computationally by generating samples from the posterior
predictive distribution. The same function that we created before,
normal_predictive_distribution() , can be used here. The only difference is that instead of
sampling mu and sigma from the priors, the samples come from the posterior.
N_obs <- nrow(df_spacebar)
mu_samples <- as_draws_df(fit_press)$b_Intercept
sigma_samples <- as_draws_df(fit_press)$sigma
normal_predictive_distribution(
mu_samples = mu_samples,
sigma_samples = sigma_samples,
N_obs = N_obs
)
## # A tibble: 1,444,000 × 3
## iter trialn t_pred
The brms function posterior_predict() is a convenient function that delivers samples from
the posterior predictive distribution. Using the command posterior_predict(fit_press)
yields the predicted finger tapping times in a matrix, with the samples as rows and the
observations (data points) as columns. (Bear in mind that if a model is fit with sample_prior =
"only" , the dependent variable is ignored and posterior_predict() will yield samples from the
prior predictive distribution instead.)
The posterior predictive distribution can be used to examine the “descriptive adequacy” of the
model under consideration (Gelman et al. 2014, Chapter 6; Shiffrin et al. 2008). Examining the
posterior predictive distribution to establish descriptive adequacy is called posterior predictive
checks. The goal here is to establish that the posterior predictive data look more or less
similar to the observed data. Achieving descriptive adequacy means that the current data
could have been generated by the model. Although passing a test of descriptive adequacy is
not strong evidence in favor of a model, a major failure in descriptive adequacy can be
interpreted as strong evidence against a model (Shiffrin et al. 2008). For this reason,
comparing the descriptive adequacy of different models is not enough to differentiate between
their relative performance. When doing model comparison, it is important to consider the
criteria that Roberts and Pashler (2000) define. Although Roberts and Pashler (2000) are
more interested in process models and not necessarily Bayesian models, their criteria are
important for any kind of model comparison. Their main point is that it is not enough to have a
good fit to the data for a model to be convincing. One should check that the range of
predictions that the model makes is reasonably constrained; if a model can capture any
possible outcome, then the model fit to a particular data set is not so informative. In the
Bayesian modeling context, although posterior predictive checking is important, it is only a
sanity check to assess whether the model behavior is reasonable (for more on this point, see
chapter 7).
In many cases, one can simply use the plot functions from brms (that act as wrappers for
bayesplot functions). For example, the plotting function pp_check() takes as arguments the
model, the number of predicted data sets, and the type of visualization, and it can display
different visualizations of posterior predictive checks. In this type of plot, the observed data
are plotted as y and the predicted data as yrep . Below, we use pp_check() to investigate how
well our model fits the observed distribution of finger tapping times, based on 11 and 100
samples from the posterior predictive distribution (that is, simulated data sets); see
Figures 3.8 and 3.9.
pp_check(fit_press, ndraws = 11, type = "hist")
FIGURE 3.8: Histograms of eleven samples from the posterior predictive distribution of the
model fit_press (yrep ).
pp_check(fit_press, ndraws = 100, type = "dens_overlay")
FIGURE 3.9: A posterior predictive check that shows the fit of the model fit_press in
comparison to data sets from the posterior predictive distribution using an overlay of density
plots.
The data are slightly skewed and have no values smaller than 100 ms, but the predictive
distributions are centered and symmetrical; see Figures 3.8 and 3.9. This posterior predictive
check shows a slight mismatch between the observed and predicted data. Can we build a
better model? We’ll come back to this issue in the next section.
Finger tapping times (and response times in general) are not usually normally distributed. A
more realistic distribution is the log-normal. A random variable (such as time) that is log-
normally distributed takes only positive real values and is right-skewed. Although other
distributions can also produce data with such properties, the log-normal will turn out to be a
pretty reasonable distribution for finger tapping times and response times.
3.7.1 The log-normal likelihood
If y is log-normally distributed, this means that log(y) is normally distributed.12 The log-
normal distribution is also defined using the parameters location, μ , and scale, σ, but these
are on the log ms scale; they correspond to the mean and standard deviation of the logarithm
of the data y, log(y) , which will be normally distributed. Thus, when we model some data y
using the log-normal likelihood, the parameters μ and σ are on a different scale than the data
y . Equation (3.9) shows the relationship between the log-normal and the normal.
log(y) ∼ Normal(μ, σ)
y ∼ LogNormal(μ, σ)    (3.9)
We can obtain samples from the log-normal distribution, using the normal distribution by first
setting an auxiliary variable z, so that z = log(y) . This means that z ∼ Normal(μ, σ) . Then
we can just use exp(z) as samples from the LogNormal(μ, σ) , since
exp(z) = exp(log(y)) = y . The code below produces Figure 3.10.
mu <- 6
sigma <- 0.5
N <- 500000
# Two equivalent ways of generating log-normal samples:
sample_ln <- rlnorm(N, meanlog = mu, sdlog = sigma)
sample_exp <- exp(rnorm(N, mean = mu, sd = sigma))
FIGURE 3.10: Two log-normal distributions with the same parameters generated by either
generating samples from a log-normal distribution or exponentiating samples from a normal
distribution.
If we assume that finger tapping times are log-normally distributed, the likelihood function
changes as follows:
t_n ∼ LogNormal(μ, σ)
But now the scale of the priors needs to change! Let’s start with uniform priors for ease of
exposition, even though, as we mentioned earlier, these are not really appropriate here. (More
realistic priors are discussed below.)
μ ∼ Uniform(0, 11)
σ ∼ Uniform(0, 1)    (3.10)
Because the parameters are on a different scale than the dependent variable, their
interpretation changes and it is more complex than if we were dealing with a linear model that
assumes a normal likelihood (location and scale do not coincide with the mean and standard
deviation of the log-normal):
The location μ: In our previous linear model, μ represented the mean (in a normal
distribution, the mean happens to be identical to the median and the mode). But now, the
mean needs to be calculated as exp(μ + σ²/2). In other words, in the log-normal, the mean
depends on both μ and σ. The median is just exp(μ). Notice that the prior of μ is not on the
milliseconds scale, but rather on the log milliseconds scale.
The scale σ: This is the standard deviation of the normal distribution of log(y). The
standard deviation of a log-normal distribution with location μ and scale σ will be
exp(μ + σ²/2) × √(exp(σ²) − 1). Unlike the normal distribution, the spread of the log-
normal distribution depends on both μ and σ.
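These formulas can be checked numerically with a short simulation (a sketch):
mu <- 6
sigma <- 0.5
samples <- rlnorm(100000, meanlog = mu, sdlog = sigma)
# Sample-based estimates vs. the analytical values:
c(mean(samples), exp(mu + sigma^2 / 2))
c(median(samples), exp(mu))
c(sd(samples), exp(mu + sigma^2 / 2) * sqrt(exp(sigma^2) - 1))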
To understand the meaning of the priors on the millisecond scale, both the priors and the
likelihood need to be taken into account. Generating a prior predictive distribution will help in
interpreting the priors. This distribution can be generated by just exponentiating the samples
produced by normal_predictive_distribution() (or, alternatively, edit the function by
replacing rnorm() with rlnorm() ).
N_samples <- 1000
N_obs <- nrow(df_spacebar)
mu_samples <- runif(N_samples, 0, 11)
sigma_samples <- runif(N_samples, 0, 1)
prior_pred_ln <- normal_predictive_distribution(
  mu_samples = mu_samples,
  sigma_samples = sigma_samples,
  N_obs = N_obs
) %>%
  mutate(t_pred = exp(t_pred))
Next, plot the distribution of some representative statistics; see Figure 3.11.
FIGURE 3.11: The prior predictive distribution of the mean, median, minimum, and maximum
value of the log-normal model with priors defined in Equation (3.10), that is
μ ∼ Uniform(0, 11) and σ ∼ Uniform(0, 1) . The x-axis is log-transformed.
We cannot generate negative values any more, since exp(x) > 0 for any finite real number x. These
priors might work in the sense that the model might converge, but it would be better to have
regularizing priors for the model. An example of regularizing priors:
μ ∼ Normal(6, 1.5)
σ ∼ Normal+(0, 1)    (3.11)
The prior for σ here is a truncated distribution, and although its location is zero, this is not its
mean. We can calculate its approximate mean from a large number of random samples of the
prior distribution using the function rtnorm() from the package extraDistr . In this function,
we have to set the parameter a = 0 to express the fact that the normal distribution is
truncated from the left at 0. (Box 4.1 discusses this type of distribution in detail):
library(extraDistr)
mean(rtnorm(100000, 0, 1, a = 0))
## [1] 0.795
Even before generating the prior predictive distributions, we can calculate the values within
which we are 95% sure that the expected median of the observations will lie. We do this by
looking at what happens at two standard deviations away from the mean of the prior, μ , that is
6 − 2 × 1.5 and 6 + 2 × 1.5 , and exponentiating these values:
c(
  lower = exp(6 - 2 * 1.5),
  higher = exp(6 + 2 * 1.5)
)
## lower higher
## 20.1 8103.1
This means that the prior for μ is still not too informative (these are medians; the actual values
generated by the log-normal distribution can be much more spread out). Next, plot the
distribution of some representative statistics of the prior predictive distributions. brms allows
one to sample from the priors, ignoring the observed data t , by setting sample_prior =
"only" in the brm function.
If we want to use brms to generate prior predictive data in this manner even before we have
any data, (at least for now) we do need to have some non- NA values as the dependent
variable t . Setting sample_prior = "only" will ignore the data, but we still need to add a
data frame in the data = specification in the brm function: In this case, we add a vector of
1 as “data”. The family is specified as lognormal() ; recall that in the first example, the
family was specified as gaussian() .
df_spacebar_ref <- df_spacebar %>%
mutate(t = rep(1, n()))
fit_prior_press_ln <- brm(t ~ 1,
data = df_spacebar_ref,
family = lognormal(),
prior = c(
prior(normal(6, 1.5), class = Intercept),
prior(normal(0, 1), class = sigma)
),
sample_prior = "only",
  control = list(adapt_delta = .9)
)
To avoid the warnings, increase the adapt_delta parameter’s default value from 0.8 to 0.9 to
simulate the data. Since Stan samples from the prior distributions in the same way that it
samples from the posterior distribution, one should not ignore warnings; always ensure that
the model converged. In that respect, the custom function normal_predictive_distribution()
defined in section 3.3 has the advantage that it will always yield independent samples from the
prior distribution and will not experience any convergence problems. This is because it just
relies on the rnorm() function in R.
Plot the prior predictive distribution of means with the following code (the figure is not
produced here, to conserve space). In a prior predictive distribution, we generally want to
ignore the data; this requires setting prefix = "ppd" in pp_check() .
pp_check(fit_prior_press_ln,
  type = "stat",
  stat = "mean",
  prefix = "ppd"
) +
  scale_x_continuous("Finger tapping times [ms]",
    trans = "log",
    breaks = c(0.001, 1, 100, 1000, 10000, 100000),
    labels = c(
      "0.001", "1", "100", "1000", "10000",
      "100000"
    )
  ) +
  ggtitle("Prior predictive distribution of means")
To plot the distribution of minimum, and maximum values, just replace mean with min , and
max respectively. The distributions of the three statistics are displayed in Figure 3.12.
p1 <- pp_check(fit_prior_press_ln, type = "stat", stat = "mean", prefix = "ppd") +
  coord_cartesian(xlim = c(0.001, 300000)) +
  scale_x_continuous("Finger tapping times [ms]", trans = "log") +
  ggtitle("Prior predictive distribution of means")
p2 <- pp_check(fit_prior_press_ln, type = "stat", stat = "min", prefix = "ppd") +
  coord_cartesian(xlim = c(0.001, 300000)) +
  scale_x_continuous("Finger tapping times [ms]", trans = "log") +
  ggtitle("Prior predictive distribution of minimum values")
p3 <- pp_check(fit_prior_press_ln, type = "stat", stat = "max", prefix = "ppd") +
  coord_cartesian(xlim = c(0.001, 300000)) +
  scale_x_continuous("Finger tapping times [ms]", trans = "log") +
  ggtitle("Prior predictive distribution of maximum values")
# plot_grid() is from the cowplot package:
plot_grid(p1, p2, p3, nrow = 3, ncol = 1)
FIGURE 3.12: The prior predictive distributions of the mean, maximum, and minimum values
of the log-normal model with priors defined in equation (3.11). The prior predictive distributions
are labeled ypred . The x-axis shows values back-transformed from the log-scale.
Figure 3.12 shows that the priors used here are still quite uninformative. The tails of the prior
predictive distributions that correspond to our normal priors shown in Figure 3.12 are even
further to the right, reaching more extreme values than for the prior predictive distributions
generated by uniform priors (shown in Figure 3.11). The new priors are still far from
representing our prior knowledge. We could run more iterations of choosing priors and
generating prior predictive distributions until we have priors that generate realistic data.
However, given that the bulk of the distributions of the mean, maximum, and minimum values lies
roughly in the correct order of magnitude, these priors are going to be acceptable. In general,
summary statistics (e.g., mean, median, min, max) can be used to test whether the priors are
in a plausible range. This can be done by defining, for the particular research problem under
study, the extreme data that would be very implausible to ever observe (e.g., reading times on
a single word longer than one minute), and choosing priors such that these extreme finger tapping
times occur only very rarely in the prior predictive distribution.
Next, fit the model; recall that both the distribution family and prior change in comparison to
the previous example.
fit_press_ln <- brm(t ~ 1,
  data = df_spacebar,
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = sigma)
  )
)
When we look at the summary of the posterior, the parameters are on the log-scale:
fit_press_ln
## ...
## Population-Level Effects:
## ...
If the research goal is to find out how long it takes to press the space bar in milliseconds, we
need to transform the μ (or Intercept in the model) to milliseconds. Because the median of
the log-normal distribution is exp(μ) , the following returns the estimates in milliseconds:
estimate_ms <- exp(as_draws_df(fit_press_ln)$b_Intercept)
c(mean = mean(estimate_ms), quantile(estimate_ms, c(.025, .975)))
Next, check whether the predicted data sets look similar to the observed data set. See Figure
3.13; compare this with the earlier Figure 3.9.
pp_check(fit_press_ln, ndraws = 100, type = "dens_overlay")

FIGURE 3.13: A posterior predictive check of the log-normal model fit_press_ln , showing the
fit of the model in comparison to data sets from the posterior predictive distribution using an
overlay of density plots.
The key question is: Are the posterior predicted data now more similar to the observed data,
compared to the case where we had a Normal likelihood? According to Figure 3.13, it seems
so, but it’s not easy to tell.
Another way to examine the extent to which the predicted data looks similar to the observed
data would be to look at the distribution of some summary statistic. As with prior predictive
distributions, examine the distribution of representative summary statistics for the data sets
generated by different models. However, in contrast with what occurs with prior predictive
distributions, at this point we have a clear reference, our observations, and this means that we
can compare the summary statistics with the observed statistics from our data. We suspect
that the normal distribution would generate finger tapping times that are too fast (since it’s
symmetrical) and that the log-normal distribution may capture the long tail better than the
normal model. Based on this supposition, compute the distribution of minimum and maximum
values for the posterior predictive distributions, and compare them with the minimum and
maximum value respectively in the data. The function pp_check() implements this by
specifying stat as either "min" or "max" , for both fit_press and fit_press_ln ; an
example is shown below. The plots are shown in Figures 3.14 and 3.15.
pp_check(fit_press, type = "stat", stat = "min")
pp_check(fit_press_ln, type = "stat", stat = "min")
FIGURE 3.14: The distributions of minimum values in a posterior predictive check, using the
normal and log-normal probability density functions. The minimum in the data is 110 ms.
FIGURE 3.15: The distributions of maximum values in a posterior predictive check, using the
normal and log-normal probability density functions.
This completes our introduction to brms . We are now ready to learn about more regression
models.
The core brms function for fitting models, for generating prior predictive and posterior
predictive data:
fit_press <- brm(t ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
    prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000
  ## uncomment for prior predictive:
  ## , sample_prior = "only"
)
as_draws_df(fit_press)
plot(fit_press)
pp_check(fit_prior_press_ln, type = "stat", stat = "mean",
  prefix = "ppd")
3.9 Summary
This chapter showed how to fit and interpret a Bayesian model with a normal likelihood. We
looked at the effect of priors by investigating prior predictive distributions and by carrying out a
sensitivity analysis. We also looked at the fit of the posterior, by inspecting the posterior
predictive distribution (which gives us some idea about the descriptive adequacy of the
model). We also showed how to fit a Bayesian model with a log-normal likelihood, and how to
compare the predictive accuracy of different models.
3.10 Further reading
Sampling algorithms are discussed in detail in Gamerman and Lopes (2006). Also helpful are
the sections on sampling from the short open-source book by Bob Carpenter, Probability and
Statistics: a simulation-based introduction (https://github.com/bob-carpenter/prob-stats), and
the sections on sampling algorithms in Lambert (2018) and Lynch (2007). Introductory linear
modeling theory is covered in Dobson and Barnett (2011); more advanced treatments are in
Montgomery, Peck, and Vining (2012) and Seber and Lee (2003). Generalized linear models
are covered in detail in McCullagh and Nelder (2019). The reader may also benefit from our
own freely available online lecture notes on linear modeling: https://github.com/vasishth/LM.
3.11 Exercises
a. Fit the model fit_press with just a few iterations, say 50 iterations (set warmup to the
default of 25, and use four chains). Does the model converge?
b. Using normal distributions, choose priors that better represent your assumptions/beliefs
about finger tapping times. To think about a reasonable set of priors for μ and σ, you
should come up with your own subjective assessment of what you think a reasonable
range of values might be for μ and how much variability there might be. There is no correct
answer here; we’ll discuss priors in depth in chapter 6. Fit this model to the data. Do the
posterior distributions change?
a. Can you come up with very informative priors that influence the posterior in a noticeable
way (use normal distributions for priors, not uniform priors)? Again, there are no correct
answers here; you may have to try several different priors before you can noticeably
influence the posterior.
b. Generate and plot prior predictive distributions based on this prior and plot them.
c. Generate posterior predictive distributions based on this prior and plot them.
a. For the log-normal model fit_press_ln , change the prior of σ so that it is a log-normal
distribution with location (μ) of −2 and scale (σ) of 0.5 . What does such a prior imply
about your belief regarding button-pressing times in milliseconds? Is it a good prior?
Generate and plot prior predictive distributions. Do the new estimates change compared
to earlier models when you fit the model?
b. For the log-normal model, what is the mean (rather than the median) time that it takes to
press the space bar, and what is the standard deviation of the finger tapping times in milliseconds?
Would it make sense to use a “skew normal distribution” instead of the log-normal? The skew
normal distribution has three parameters: location ξ (the lower-case version of the Greek
letter Ξ, xi), scale ω (omega), and shape α . The distribution is right skewed if α > 0 ,
left skewed if α < 0 , and identical to the regular normal distribution if α = 0 . For fitting
this in brms , one needs to change family and set it to skew_normal() , and add a prior of
class = alpha (the location remains class = Intercept and the scale class = sigma ).
a. Fit this model with a prior that assigns approximately 95% of the prior probability of
alpha to be between 0 and 10.
b. Generate posterior predictive distributions and compare the posterior distribution of
summary statistics of the skew normal with the normal and log-normal.
References
Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.
Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael J.
Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A
Probabilistic Programming Language.” Journal of Statistical Software 76 (1).
Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models.
CRC press.
Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. “Hybrid Monte
Carlo.” Physics Letters B 195 (2): 216–22. https://doi.org/10.1016/0370-2693(87)91197-X.
Gamerman, Dani, and Hedibert F Lopes. 2006. Markov chain Monte Carlo: Stochastic
simulation for Bayesian inference. CRC Press.
Ge, Hong, Kai Xu, and Zoubin Ghahramani. 2018. “Turing: A Language for Flexible
Probabilistic Inference.” In Proceedings of Machine Learning Research, edited by Amos
Storkey and Fernando Perez-Cruz, 84:1682–90. Playa Blanca, Lanzarote, Canary Islands:
PMLR. http://proceedings.mlr.press/v84/ge18b.html.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B.
Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC
Press.
Gelman, Andrew, Daniel Simpson, and Michael J. Betancourt. 2017. “The Prior Can Often
Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555.
https://doi.org/10.3390/e19100555.
Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2018. “Rstanarm: Bayesian
Applied Regression Modeling via Stan.” http://mc-stan.org/.
Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively
Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15
(1): 1593–1623. http://dl.acm.org/citation.cfm?id=2627435.2638586.
Hubel, Kerry A, Bruce Reed, E William Yund, Timothy J Herron, and David L Woods. 2013.
“Computerized Measures of Finger Tapping: Effects of Hand Dominance, Age, and Sex.”
Perceptual and Motor Skills 116 (3): 929–52.
Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University Press.
Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. London, UK: Sage.
Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of
Statistical Software 63 (1): 1–25.
Luce, R Duncan. 1991. Response Times: Their Role in Inferring Elementary Mental
Organization. Oxford University Press.
Lunn, D.J., A. Thomas, N. Best, and D. Spiegelhalter. 2000. “WinBUGS-A Bayesian Modelling
Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4): 325–37.
Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for
Social Scientists. New York, NY: Springer.
McCullagh, Peter, and J.A. Nelder. 2019. Generalized Linear Models. Second Edition. Boca
Raton, Florida: Chapman; Hall/CRC.
Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain
Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Taylor
& Francis. https://doi.org/10.1201/b10905-10.
O’Hagan, Antony, and Jonathan Forster. 2004. “Kendall’s Advanced Theory of Statistics, Vol.
2B: Bayesian Inference.” Wiley.
Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on
Theory Testing.” Psychological Review 107 (2): 358–67.
Seber, George A. F., and Allen J. Lee. 2003. Linear Regression Analysis. 2nd Edition.
Hoboken, NJ: John Wiley; Sons.
Shiffrin, Richard, Michael D. Lee, Woojae Kim, and Eric-Jan Wagenmakers. 2008. “A Survey
of Model Evaluation Approaches with a Tutorial on Hierarchical Bayesian Methods.” Cognitive
Science: A Multidisciplinary Journal 32 (8): 1248–84.
https://doi.org/10.1080/03640210802414826.
Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10): 1–14.
Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner.
2019. “Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing
Convergence of MCMC.” arXiv Preprint arXiv:1903.08008.
7. The Python package PyMC3 and the Julia library Turing are recent exceptions since they
are fully integrated into their respective languages.↩
8. We refer to the time it takes for a subject to respond or react to a stimulus as response time
(response and reaction times are often used interchangeably, cf. Luce 1991). In this case,
however, there is no stimulus, and the subject only taps the space bar.↩
9. The problem here is that although the parameter for the intercept is assigned a uniform
distribution bounded between 0 and 60000 ms, the sampler might start from an initial value
outside this range, because initial values are chosen randomly (unless the user specifies
them explicitly). ↩
10. We’ll see later how to generate prior predictive distributions of statistics such as mean,
minimum, or maximum value in section 3.7.2 using brms and pp_check() .↩
11. Even though, in theory, one could use even wider priors, in practice, these are the widest
priors that achieve convergence.↩
We generally run experiments because we are interested in the relationship between two or
more variables. A regression will tell us how our dependent variable, also called the response
or outcome variable (e.g., pupil size, response times, accuracy), is affected by one or more
independent variables, also called predictors or explanatory variables. Predictors can be
categorical (e.g., male or female), ordinal (first, second, third, etc.), or continuous (e.g., age).
In this chapter we focus on simple regression models with different likelihood functions.
Let us look at the effect of cognitive processing on human pupil size to illustrate the use of
Bayesian linear regression models. Although pupil size is mostly related to the amount of light
that reaches the retina or the distance to a perceived object, pupil sizes are also
systematically influenced by cognitive processing: Increased cognitive load leads to an
increase in the pupil size (for a review, see Mathot 2018).
For this example, we’ll use pupil-size data from one subject in the control experiment of
Wahn et al. (2016), averaged by trial. The data are available as df_pupil in the package
bcogsci . In this experiment, the subject covertly tracks between zero and five objects among
several randomly moving objects on a computer screen. This task is called multiple object
tracking (or MOT; see Pylyshyn and Storm 1988). First, several objects appear on the screen,
and a subset of the objects are indicated as “targets” at the beginning. Then, the objects start
moving randomly across the screen and become indistinguishable. After several seconds, the
objects stop moving and the subject needs to indicate which objects were the targets. See
Figure 4.1. Our research goal is to examine how the number of moving objects being tracked
(that is, the attentional load) affects pupil size.
FIGURE 4.1: Flow of events in a trial where two objects need to be tracked. Adapted from
Blumberg, Peterson, and Parasuraman (2015); licensed under CC BY 4.0
(https://creativecommons.org/licenses/by/4.0/).
We will model pupil size as normally distributed, because we are not expecting a skew, and
we have no further information available about the distribution of pupil sizes. (Given the units
used here, pupil sizes cannot be zero or negative, so we know for sure that this choice is not
exactly right.) For simplicity, assume a linear relationship between load and pupil size:

p_size_n ∼ Normal(α + c_load_n ⋅ β, σ)

where n indicates the observation. We now need to set priors for the α, β, and σ parameters.
For setting plausible priors, some research needs to be done to find some information about
pupil sizes. Although we might know that pupil diameters range between 2 to 4 mm in bright
light to 4 to 8 mm in the dark (Spector 1990), this experiment was conducted with the Eyelink-
II eyetracker which measures the pupils in arbitrary units (Hayes and Petrov 2016). If this is
our first-ever analysis of pupil size, before setting up the priors, we’ll need to look at some
measures of pupil size. (If we had analyzed this type of data before, we could also look at
estimates from previous experiments). Fortunately, we have some measurements of the same
subject with no attentional load for the first 100 ms, measured every 10 ms, in the data frame
df_pupil_pilot from the package bcogsci. This will give us some idea about the order of
magnitude of the pupil sizes:
data("df_pupil_pilot")
df_pupil_pilot$p_size %>% summary()
With this information we can set a regularizing prior for α . Center the prior around 1000 to be
in the right order of magnitude.13 Since we don’t know how much pupil sizes are going to vary
by load yet, we include a rather wide prior by defining it as a normal distribution and setting its
standard deviation as 500 .
α ∼ Normal(1000, 500)
Given that our predictor load is centered, with the prior for α , we are saying that we suspect
that the average pupil size for the average load in the experiment will be in a 95% credible
interval limited by approximately 1000 ± 2 ⋅ 500 = [0, 2000] units. We can calculate this with
more precision in R using the qnorm function:
qnorm(c(0.025, 0.975), mean = 1000, sd = 500)
## [1] 20 1980
We know that the measurements of the pilot data are strongly correlated because they were
taken 10 milliseconds apart. For this reason, they won’t give us a realistic estimate of how
much the pupil size can vary. Accordingly, set up quite an uninformative prior for σ that
encodes this lack of precise information: σ is surely larger than zero and has to be in the order
of magnitude of the pupil size with no load.

σ ∼ Normal+(0, 1000)

With this prior for σ, we are saying that we expect the standard deviation of the pupil sizes to
lie in the following 95% credible interval.
qtnorm(c(0.025, 0.975), mean = 0, sd = 1000, a = 0)
# approximately 31 and 2241

In order to compute the 95% credible interval, we used qtnorm() from the extraDistr
package rather than qnorm(); the relevant command specification is qtnorm(..., a = 0),
where a = 0 indicates a normal distribution truncated at zero.
The mean of Normal+ , a normal distribution truncated at zero so as to allow for only positive
values, does not coincide with its location indicated with the parameter μ (and neither does
the standard deviation coincide with the scale, σ); see Box 4.1.
samples <- rtnorm(20000, mean = 0, sd = 1000, a = 0)
c(mean = mean(samples), sd = sd(samples))
## mean sd
## 794 595
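These simulated values agree with the analytical results for a half-normal distribution with
scale σ = 1000: its mean is σ ⋅ √(2/π) ≈ 798 and its standard deviation is σ ⋅ √(1 − 2/π) ≈ 603.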
We still need to set a prior for β, the change in pupil size produced by the attentional load.
Given that pupil size changes are not easily perceptible (we don’t usually observe changes in
pupil size in our day-to-day life), we expect them to be much smaller than the pupil size (which
we assume has mean 1000 units), so we use the following prior:
β ∼ Normal(0, 100)
With the prior of β, we are saying that we don’t really know if the attentional load will increase
or even decrease the pupil size (it is centered at zero), but we do know that one unit of load
(that is one more object to track) will potentially change the pupil size in a way that is
consistent with the following 95% credible interval.
qnorm(c(0.025, 0.975), mean = 0, sd = 100)
## [1] -196 196

That is, we don’t expect one unit of load to increase or decrease the pupil size by more than
about 200 units.
The priors we have specified here are relatively uninformative; as mentioned earlier, this is
because we are considering the situation where we don’t have much prior experience with
pupil size studies. In other settings, we might have more prior knowledge and experience; in
that case, one could use somewhat more principled priors. We will return to this point in the
chapter on priors (chapter 6) and on the Bayesian workflow (chapter 7).
Any distribution can be truncated. For a continuous distribution, the truncated version of
the original distribution will have non-zero probability density values for a continuous
subset of the original coverage. To make this more concrete, in our previous example, the
normal distribution has coverage for values between minus infinity to plus infinity, and our
truncated version N ormal+ has coverage between zero and plus infinity: all negative
values have a density of zero. Let’s see how we can generalize this to be able to
understand any truncation of any continuous distribution. (For the discrete case, we can
simply replace the integral with a sum, and replace PDF with PMF).
From the axiomatic definitions of probability, we know that the area below a PDF, f(x),
must be equal to one (section 1.1). More formally, this means that the integral of f over the
real line is one:

∫_{−∞}^{+∞} f(x) dx = 1
But if the distribution is truncated, f is going to be evaluated on some subset of its possible
values, a < X < b; in the specific case of Normal+, for example, a = 0 and b = ∞. In the
general case, this means that the integral of the PDF evaluated for a < X < b will be lower
than one, unless a = −∞ and b = +∞:

∫_a^b f(x) dx < 1
We want to ensure that we build a new PDF for the truncated distribution so that, even
though it has less coverage than the non-truncated version, it still integrates to one. To
achieve this, we divide the “unnormalized” PDF by the total area between a and b (recall
the discussion surrounding Equation (1.1)):

f_{[a,b]}(x) = f(x) / ∫_a^b f(x) dx

The denominator of the previous equation is the difference between the CDF evaluated at
X = b and the CDF evaluated at X = a; this can be written as F(b) − F(a):

f_{[a,b]}(x) = f(x) / (F(b) − F(a))    (4.1)
For the specific case where f(x) is Normal(x|0, σ) and we want the PDF of
Normal+(x|0, σ), the bounds are a = 0 and b = ∞, so that F(b) − F(a) = 1 − 1/2 = 1/2:

Normal+(x|0, σ) = Normal(x|0, σ) / (1/2) = 2 ⋅ Normal(x|0, σ)
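As a quick numerical check (a sketch, not the book's own code), the renormalized density
integrates to one:

integrate(function(x) 2 * dnorm(x, mean = 0, sd = 1), 0, Inf)$value
## [1] 1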
Unless the truncation of the normal distribution is symmetrical, the mean μ of the
truncated normal does not coincide with the mean of the parent (untruncated) normal
distribution; call this mean of the parent distribution μ̂. For any type of truncation, the
standard deviation σ of the truncated distribution does not coincide with the standard
deviation of the parent distribution; call this latter standard deviation σ̂. Confusingly
enough, the arguments of the family of truncated functions *tnorm keep the names used
in the family of functions *norm, namely mean and sd. So, when defining a truncated
normal distribution like dtnorm(mean = 300, sd = 100, a = 0, b = Inf), mean and
sd refer to the mean μ̂ and standard deviation σ̂ of the untruncated parent distribution.
Sometimes one needs to model observed data as coming from a truncated normal
distribution. An example would be a vector of observed standard deviations; perhaps one
wants to use these estimates to work out a truncated normal prior. In order to derive such
an empirically motivated prior, we have to work out what mean and standard deviation we
need to use in a truncated normal distribution. We could compute the mean and standard
deviation from the observed vector of standard deviations, and then use the procedure
shown below to work out the mean and standard deviation that we would need to put into
the truncated normal distribution. This approach is used in chapter 6, section 6.1.4 for
working out a prior based on standard deviation estimates from existing data.
The mean and standard deviation of the parent distribution of a truncated normal (μ̂ and σ̂)
with boundaries a and b, given the mean μ and standard deviation σ of the truncated
normal, are computed as follows (Johnson, Kotz, and Balakrishnan 1995). Here ϕ(X) is the
PDF of the standard normal (i.e., Normal(μ = 0, σ = 1)) evaluated at X, and Φ(X) is
the CDF of the standard normal evaluated at X. First define:

α = (a − μ̂)/σ̂
β = (b − μ̂)/σ̂
Then, the mean μ of the truncated distribution can be computed as follows based on the
parameters of the parent distribution:

μ = μ̂ − σ̂ ⋅ (ϕ(β) − ϕ(α)) / (Φ(β) − Φ(α))    (4.2)
The variance σ² of the truncated distribution is:

σ² = σ̂² ⋅ (1 − (β ⋅ ϕ(β) − α ⋅ ϕ(α)) / (Φ(β) − Φ(α)) − ((ϕ(α) − ϕ(β)) / (Φ(β) − Φ(α)))²)    (4.3)
Equations (4.2) and (4.3) contain two unknowns, so if one is given the values μ and σ of
the truncated distribution, one can solve (using algebra) for the mean and standard
deviation of the untruncated distribution, μ̂ and σ̂.
For example, suppose that a = 0 and b = 500, and that the mean and standard deviation
of the untruncated parent distribution are μ̂ = 300 and σ̂ = 200. We can simulate such a
situation and estimate the mean and standard deviation of the truncated distribution:
# draw a large sample from the truncated normal (sample size is illustrative):
x <- rtnorm(100000, mean = 300, sd = 200, a = 0, b = 500)
mean(x)
## [1] 271
sd(x)
## [1] 129
These simulated values are identical to the values computed using equations (4.2) and
(4.3):
a <- 0
b <- 500
bar_x <- 300
bar_sigma <- 200
alpha <- (a - bar_x) / bar_sigma
beta <- (b - bar_x) / bar_sigma
term1 <- (dnorm(beta) - dnorm(alpha)) /
  (pnorm(beta) - pnorm(alpha))
## computed analytically:
(mu <- bar_x - bar_sigma * term1)
## [1] 271
(sigma <- bar_sigma *
  sqrt(1 - (beta * dnorm(beta) - alpha * dnorm(alpha)) /
    (pnorm(beta) - pnorm(alpha)) - term1^2))
## [1] 129
The equations for the mean and variance of the truncated distribution (μ and σ) can also
be used to work out the mean and variance of the parent untruncated distribution (μ̂ and σ̂).
Suppose that we have observed data with mean μ = 271 and σ = 129. We want to
assume that the data are coming from a truncated normal with lower bound 0 and upper
bound 500. What are the mean and standard deviation of the parent distribution, μ̂ and σ̂?
Re-arranging Equation (4.2):

μ − μ̂ + σ̂ ⋅ (ϕ(β) − ϕ(α)) / (Φ(β) − Φ(α)) = 0    (4.4)
The corresponding equation for the variance σ² of the truncated distribution is:

σ² − σ̂² ⋅ (1 − (β ⋅ ϕ(β) − α ⋅ ϕ(α)) / (Φ(β) − Φ(α)) − ((ϕ(α) − ϕ(β)) / (Φ(β) − Φ(α)))²) = 0    (4.5)
Define the system of equations according to the specifications of multiroot() from the
package rootSolve: x for the unknowns (μ̂ and σ̂), and parms for the known
parameters: a, b, and the mean and standard deviation of the truncated normal.
Hide
sigma_hat *
sqrt((1 - ((beta) * dnorm(beta) - (alpha) * dnorm(alpha)) /
(pnorm(beta) - pnorm(alpha)) -
((dnorm(beta) - dnorm(alpha)) /
(pnorm(beta) - pnorm(alpha)))^2))
)
Solving the two equations using multiroot() from the package rootSolve gives us the
mean μ̂ and standard deviation σ̂ of the parent normal distribution. (Notice that x is a
required argument of the function above so that it works with multiroot(); outside of
the function, however, the variable x is the vector containing the samples of the truncated
normal distribution generated with rtnorm().)
Hide
soln <- multiroot(f = eq_system, start = c(1, 1),
parms = c(a = 0, b = 500,
mu = mean(x), sigma = sd(x)))
soln$root
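The two roots should come out close to 300 and 200; that is, multiroot() recovers (up to
simulation error) the parameters of the parent distribution that generated the samples in x.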
Before fitting the brms model of the effect of load on pupil size, load the data and center the
predictor load :
data("df_pupil")
(df_pupil <- df_pupil %>%
mutate(c_load = load - mean(load)))
## # A tibble: 41 × 5
## subj trial load p_size c_load
fit_pupil <- brm(p_size ~ 1 + c_load,
  data = df_pupil,
  family = gaussian(),
  prior = c(
    prior(normal(1000, 500), class = Intercept),
    prior(normal(0, 1000), class = sigma),
    prior(normal(0, 100), class = b, coef = c_load)
  )
)
The only difference from our previous models is that we now have a predictor in the formula
and in the priors. Priors for predictors are indicated with class = b, and the specific predictor
with coef = c_load. If we want to set the same prior for several predictors, we can omit the
argument coef. Even if we drop the 1 from the formula, brm() will fit the same model as
when we specify the 1 explicitly. If we really want to remove the intercept, this must be
indicated with 0 + ... or -1 + .... See also Box 4.2 for more details about the treatment of
intercepts by brms. The priors are normal distributions for the intercept (α) and the slope (β),
and a truncated normal distribution for the scale parameter σ, which coincides with the
standard deviation (because the likelihood is normal). brms automatically truncates the
prior specification for σ, allowing only positive values.
Next, inspect the output of the model. The posteriors and trace plots are shown in Figure 4.2;
this figure is generated by typing:
plot(fit_pupil)
FIGURE 4.2: The posterior distributions of the parameters in the brms model fit_pupil ,
along with the corresponding trace plots.
fit_pupil
## ...
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 701.38 20.28 662.06 741.42 1.00 3851 2949
##
## ...
In the next section, we discuss how one can communicate the relevant information from the
model.
When we set up a prior for the intercept in brms , we actually set a prior for an intercept
assuming that all the predictors are centered. This means that when predictors are not
centered (and only then), there is a mismatch between the interpretation of the intercept
as returned in the output of brms and the interpretation of the intercept with respect to its
prior specification. In this case, only the intercept in the output corresponds to the formula
in the brms call, that is, the intercept in the output corresponds to the non-centered
model. However, as we show below, when the intercept is much larger than the effects
that we are considering in the formula (what we generally call β), this discrepancy hardly
matters.
The reason for this mismatch when our predictors are uncentered is that brms increases
sampling efficiency by automatically centering all the predictors internally (that is the
population-level design matrix X is internally centered around its column means when
brms fits a model). This did not matter in our previous examples because we centered
our predictor (or we had no predictor), but it might matter if we want to have uncentered
predictors. In the design we are discussing, a non-centered predictor of load will mean
that the intercept, α, has a straightforward interpretation: α is the mean pupil size
when there is no attentional load. This is in contrast with the centered version presented
before, where the intercept α represents the pupil size for the average load of 2.44
( c_load is equal to 0). The difference between the non-centered model (below) and the
centered version presented before is depicted in Figure 4.3.
Suppose that we are quite sure that the prior values for the no-load condition (i.e., with
load non-centered) fall between 400 and 1200 units. In that case, the following prior could
be set for α: α ∼ Normal(800, 200). The model specification then becomes:
prior_nc <- c(
prior(normal(800, 200), class = b, coef = Intercept),
prior(normal(0, 1000), class = sigma),
prior(normal(0, 100), class = b, coef = load)
)
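The call fitting this non-centered model would look roughly as follows (a reconstructed
sketch; the model name and the exact call are not shown in the original):

fit_pupil_non_centered <- brm(p_size ~ 0 + Intercept + load,
  data = df_pupil,
  family = gaussian(),
  prior = prior_nc
)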
FIGURE 4.3: Regression lines for the non-centered and centered linear regressions. The
intercept (α) represented by a circle is positioned differently depending on the centering,
whereas the slope (β) represented by a vertical dashed line has the same magnitude in
both models.
When the predictor is non-centered as shown above, the regular centered intercept is
removed by adding 0 to the formula, and by replacing the intercept with the “actual”
intercept we want to set priors on with Intercept . The word Intercept is a reserved
word; we cannot name any predictor with this name. This new parameter is also of class
b, so its prior needs to be defined accordingly. Once we use 0 + Intercept + ..., the
intercept is no longer computed with predictors that are automatically centered.
The output below shows that, as expected, although the posterior for the intercept has
changed noticeably, the posterior for the effect of load remains virtually unchanged.
posterior_summary(fit_pupil_non_centered,
variable = c("b_Intercept", "b_load"))
Notice the following potential pitfall. A model like the one below will fit a non-centered load
predictor, but will assign a prior of Normal(800, 200) to the intercept of a model that
assumes a centered predictor, αcentered , and not the current intercept, α .
# A reconstructed sketch (the original call is only partially shown):
# the formula leaves load non-centered, but a prior of class Intercept
# is interpreted by brms as a prior on the centered intercept.
fit_pupil_pitfall <- brm(p_size ~ 1 + load,
  data = df_pupil,
  family = gaussian(),
  prior = c(
    prior(normal(800, 200), class = Intercept),
    prior(normal(0, 1000), class = sigma),
    prior(normal(0, 100), class = b, coef = load)
  )
)
What does it mean to set a prior on α_centered in a model that doesn’t include α_centered?
The fitted (expected) values of the non-centered model and the centered one are identical;
that is, the values of the response distribution without the residual error are identical for
both models:

α + load_n ⋅ β = α_centered + (load_n − mean(load)) ⋅ β    (4.6)

The left side of Equation (4.6) refers to the expected values based on our current non-
centered model, and the right side refers to the expected values based on the centered
model. We can re-arrange terms to understand what the effect is of a prior on α_centered in
our model that doesn’t include α_centered:

α + load_n ⋅ β = α_centered + load_n ⋅ β − mean(load) ⋅ β
α = α_centered − mean(load) ⋅ β
α + mean(load) ⋅ β = α_centered
That means that by setting a prior on α_centered, we are actually setting a prior on
α + mean(load) ⋅ β. When β is very small (or when the means of our predictors are very small
because they are “almost” centered), and the prior for α is very wide, we might
hardly notice the difference between setting a prior on α_centered and setting it on our actual
α in a non-centered model (especially if the likelihood dominates anyway). But it is important
to pay attention to what the parameters we are setting priors on actually represent.
To sum up, brms automatically centers all predictors for posterior estimation, and the
prior of the intercept is applied to the centered version of the model during model fitting.
However, when the predictors specified in the formula are not centered, then brms uses
the equations shown before to return in the output the posterior of the intercept for the
non-centered predictors.14
In our example analyses with brms in this book, we will always center our predictors.
We want to answer the research question “What is the effect of attentional load on the
subject’s pupil size?” To answer this question, we’ll need to examine what happens with the
posterior distribution of β, which is printed out as c_load in the summary of brms . The
summary of the posterior tells us that the most likely values of β will be around the mean of
the posterior, 33.9, and we can be 95% certain that the value of β, given the model and the
data, lies between 11.43 and 57.01.
The model tells us that as attentional load increases, the pupil size of the subject becomes
larger. If we want to determine how likely it is that the pupil size increased rather than
decreased, we can examine the proportion of samples above zero. (The intercept and the
slopes are always preceded by b_ in brms . One can see all the names of parameters being
estimated with variables() .)
mean(as_draws_df(fit_pupil)$b_c_load > 0)
## [1] 0.998
This high probability does not mean that the effect of load is non-zero. It means instead
that it’s much more likely that the effect is positive rather than negative. In order to claim that
the effect is likely to be non-zero, we would have to compare this model with an alternative
model that assumes the effect of load is 0. We’ll come back to this issue when we discuss
model comparison in chapter 14.
Sometimes it’s useful to customize the posterior predictive check to visualize the fit of our
model. We iterate over the different loads (e.g., 0 to 4), and we show the posterior predictive
distributions based on 100 simulations for each load together with the observed pupil sizes in
Figure 4.4. We don’t have enough data to derive a strong conclusion: both the predictive
distributions and our data look very widely spread out, and it’s hard to tell if the distribution of
the observations could have been generated by our model. For now we can say that it doesn’t
look too bad.
# (type and ndraws are reconstructed from the text: 100 simulations
# per load, shown as density overlays)
for (l in 0:4) {
  df_sub_pupil <- filter(df_pupil, load == l)
  p <- pp_check(fit_pupil,
    type = "dens_overlay",
    ndraws = 100,
    newdata = df_sub_pupil
  ) +
    geom_point(data = df_sub_pupil, aes(x = p_size, y = 0.0001)) +
    ggtitle(paste("load: ", l)) +
    coord_cartesian(xlim = c(400, 1000))
  print(p)
}
FIGURE 4.4: Posterior predictive distributions (y_rep) of pupil sizes together with the
observed pupil sizes (y), one panel per load (0 to 4).
for (l in 0:4) {
df_sub_pupil <- filter(df_pupil, load == l)
p <- pp_check(fit_pupil,
type = "stat",
ndraws = 1000,
newdata = df_sub_pupil,
stat = "mean"
  ) +
    ggtitle(paste("load: ", l))
  print(p)
}
FIGURE 4.5: Distribution of posterior predicted means in gray and observed pupil size means
in black lines by load.
Figure 4.5 shows that the observed means for no load and for a load of one fall in the
tails of the distributions. Although our model predicts a monotonic increase of pupil size, the
data might be indicating that the relevant difference is simply between no load, and some
load. However, given the uncertainty in the posterior predictive distributions and that the
observed means are contained somewhere in the predicted distributions, it could be the case
that with this model we are overinterpreting noise.
4.2 Log-normal model: Does trial affect finger tapping
times?
Let us revisit the small experiment from section 3.2.1, where a subject repeatedly taps the
space bar as fast as possible. Suppose that we want to know whether the subject tended to
speed up (a practice effect) or slow down (a fatigue effect) while pressing the space bar. We’ll
use the same data set df_spacebar as before, and we’ll center the column trial :
df_spacebar <- df_spacebar %>%
  mutate(c_trial = trial - mean(trial))
If we assume that finger tapping times are log-normally distributed, the likelihood becomes:

t_n ∼ LogNormal(α + c_trial_n ⋅ β, σ)    (4.7)
Use the same priors as in section 3.7.2 for α (which is equivalent to μ in the previous model)
and for σ.
α ∼ Normal(6, 1.5)
σ ∼ Normal+ (0, 1)
We still need a prior for β. Effects are multiplicative rather than additive when we assume a
log-normal likelihood, and that means that we need to take into account α in order to interpret
β; for details, see Box 4.3. We are going to try to understand how all our priors interact by
generating some prior predictive distributions. We start with the following prior centered on
zero, a prior agnostic regarding the direction of the effect, which allows for a slowdown (β > 0)
as well as a speedup (β < 0):

β ∼ Normal(0, 1)
# Ignore the dependent variable;
# use a vector of ones as a placeholder.
df_spacebar_ref <- df_spacebar %>%
mutate(t = rep(1, n()))
fit_prior_press_trial <- brm(t ~ 1 + c_trial,
data = df_spacebar_ref,
family = lognormal(),
prior = c(
prior(normal(6, 1.5), class = Intercept),
prior(normal(0, 1), class = sigma),
prior(normal(0, 1), class = b, coef = c_trial)
),
sample_prior = "only",
control = list(adapt_delta = .9)
)
In order to understand the type of data that we are assuming a priori with the prior of the
parameter β, plot the median difference between the finger tapping times at adjacent trials. As
the prior of β gets wider, larger differences are observed between adjacent trials. The
objective of the prior predictive checks is to calibrate the prior of β to obtain a plausible range
of differences. We are going to plot a distribution of medians because they are less affected
by the variance in the prior predicted distribution than the distribution of mean differences;
distributions of means will have much more spread. To make the distribution of means more
realistic, we would also need to find a more accurate prior for the scale σ. (Recall that the
mean of log-normally distributed values depends on both the location, μ, and the scale, σ, of the
distribution.) To plot the median effect, first define a function that calculates the difference
between adjacent trials, and then apply the median to the result. We use that function in
pp_check and show the result in Figure 4.6. As expected, the median effect is centered on
zero (as is our prior), but we see that the distribution of possible medians for the effect is too
widely spread out and includes values that are too extreme.
median_diff <- function(x) {
median(x - lag(x), na.rm = TRUE)
}
pp_check(fit_prior_press_trial,
type = "stat",
stat = "median_diff",
# show only prior predictive distributions
prefix = "ppd",
# each bin has a width of 500ms
binwidth = 500) +
  # cut the top of the plot to improve its scale
  # (the exact upper limit is reconstructed)
  coord_cartesian(ylim = c(0, 50))
FIGURE 4.6: The prior predictive distribution of the median effect of the model defined in
section 4.2 with β ∼ Normal(0, 1).
Repeat the same procedure with β ∼ Normal(0, 0.01) ; the resulting prior predictive
distribution is shown in Figure 4.7. The prior predictive distribution shows us that the prior is
still quite vague; it is, however, at least in the right order of magnitude.
FIGURE 4.7: The prior predictive distribution of the median difference in finger tapping times
between adjacent trials based on the model defined in section 4.2 with β ∼ Normal(0, 0.01) .
Prior selection might look daunting and can be a lot of work. However, this work is usually
done only the first time we start working with an experimental paradigm; besides, priors can
be informed by the estimates from previous experiments (even maximum likelihood estimates
from frequentist models can be useful). We will generally use very similar (or identical) priors
for analyses dealing with the same type of task. When in doubt, a sensitivity analysis (see
section 3.4) can tell us whether the posterior distribution depends unintentionally strongly on
our prior selection. We will return to the issue of prior selection in chapter 6.
LogNormal(y | μ, σ) = 1 / (√(2π) σ y) ⋅ exp(−(log(y) − μ)² / (2σ²))
As explained in section 3.7.1, the model from Equation (4.7) is equivalent to the following:

log(t_n) = Z_n ∼ Normal(α + c_trial_n ⋅ β, σ)
The family of normal distributions is closed under linear transformations: that is, if X is
normally distributed with mean μ and standard deviation σ, then (for any real numbers a
and b) aX + b is also normally distributed, with mean aμ + b and standard deviation
√(a²σ²) = |a|σ.
Exponentiating both sides, using the property of exponents that exp(x + y) = exp(x) ⋅ exp(y),
and setting Y = exp(Z), the median of Y is:

exp(α + c_trial_n ⋅ β) = exp(α) ⋅ exp(c_trial_n ⋅ β)

The last equation has two terms being multiplied: the first one, exp(α), tells us that we are
assuming that finger tapping times are log-normally distributed with a median of exp(α);
the second term, exp(c_trial_n ⋅ β), tells us that the effect of trial number is
multiplicative and grows or decays exponentially with the trial number. This has two
important consequences:
1. Different values of the intercept, α , given the same β, will affect the difference in
finger tapping or response times for two adjacent trials (compare this with what
happens with an additive model, such as when a normal likelihood is used); see
Figure 4.8. This is because, unlike in the additive case, the intercept doesn’t cancel
out:
Additive case:

(α + trial_n ⋅ β) − (α + trial_{n−1} ⋅ β) = α − α + (trial_n − trial_{n−1}) ⋅ β
= (trial_n − trial_{n−1}) ⋅ β
Multiplicative case:

exp(α + trial_n ⋅ β) − exp(α + trial_{n−1} ⋅ β) = exp(α) ⋅ (exp(trial_n ⋅ β) − exp(trial_{n−1} ⋅ β))

Here the intercept term exp(α) does not cancel out.
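A quick numerical illustration of this dependence on α (a sketch; the values of α and β are
chosen for illustration only):

beta <- 0.01
diff_adjacent <- function(alpha, n) {
  exp(alpha + n * beta) - exp(alpha + (n - 1) * beta)
}
diff_adjacent(alpha = 5, n = 100) # ~4 ms
diff_adjacent(alpha = 6, n = 100) # ~11 ms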
FIGURE 4.8: The fitted values of the difference in response time between two adjacent
trials, when β = 0.01 and α lies between 0.1 and 15. The graph shows how changes in
the intercept lead to changes in the difference in response times between trials, even if β
is fixed.
2. As the trial number increases, the same value of β will have a very different impact on
the original scale of the dependent variable: Any (fixed) negative value for β will lead
to exponential decay and any (fixed) positive value will lead to exponential growth;
see Figure 4.9.
FIGURE 4.9: The fitted values of the dependent variable (response times in ms) as a
function of trial number, when (A) β = −0.01 , exponential decay, and when (B) β = 0.01 ,
exponential growth.
Does exponential growth or decay make sense in this particular example? We need to
consider that, even if it does make sense, it will be an approximation valid only for a specific
range of values: at some point we expect a ceiling or a floor effect, since response times
cannot truly be 0 milliseconds, nor take several minutes. In our specific model, moreover,
exponential growth or decay by trial is probably a bad approximation: we would predict that
our subject takes an extremely long (if β > 0) or extremely short (if β < 0) time to press the
space bar after a relatively small number of trials. This doesn’t mean that the likelihood is
wrong by itself, but it does mean that we at least need to put a cap on the growth or decay
induced by our experimental manipulation. We can do this if the exponential growth or
decay is a function of, for example, log-transformed trial numbers, as sketched below.
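For instance (a sketch; α = 6 and β = ±0.01 are illustrative values):

trials <- 1:300
# growth is now capped: roughly 403 to 427 ms over 300 trials
rt_growth <- exp(6 + log(trials) * 0.01)
# and decay: roughly 403 down to 381 ms
rt_decay <- exp(6 - log(trials) * 0.01)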
FIGURE 4.10: Fitted value of the dependent variable (times in ms) as function of the
natural logarithm of the trial number, when (A) β = −0.01 , exponential decay, and when
(B) β = 0.01, exponential growth.
The normal distribution is most often assumed to describe the random variation that
occurs in the data from many scientific disciplines. However, most measurements actually
show skewed distributions. Limpert, Stahel, and Abbt (2001) discuss the log-normal
distribution in scientific disciplines and how diverse types of data, from lengths of latent
periods of infectious diseases to the distribution of mineral resources in the Earth’s crust,
and even body height (the quintessential example of a normal distribution), closely fit
the log-normal distribution.
Limpert, Stahel, and Abbt (2001) point out that because a random variable that results
from multiplying many independent variables has an approximate log-normal distribution,
the most basic indicator of the importance of the log-normal distribution may be very
general: Chemistry and physics are fundamental in life, and the prevailing operation in the
laws of these disciplines is multiplication rather than addition.
Furthermore, at many physiological and anatomical levels in the brain, the distribution of
numerous parameters is in fact strongly skewed with a heavy tail, suggesting that skewed
(typically log-normal) distributions are fundamental to structural and functional brain
organization. This might be explained given that the majority of interactions in highly
interconnected systems, especially in biological systems, are multiplicative and synergistic
rather than additive (Buzsáki and Mizuseki 2014).
Does the log-normal distribution make sense for response times? It has long been noticed
that the log-normal distribution often provides a good fit to response time distributions
(Brée 1975; Ulrich and Miller 1994). One advantage of assuming log-normally distributed
response times (but, in fact, this is true for many skewed distributions) is that it entails that
the standard deviation of the reaction time distribution will increase with the mean, as has
been observed in empirical distributions of response times (Wagenmakers, Grasman, and
Molenaar 2005). Interestingly, it turns out that log-normal response times are also easily
generated by certain process models. Ulrich and Miller (1993) show, for example, that
models in which response times are determined by a series of processes cascading
activation from an input level to an output level (usually passing through a number of
intervening processing levels along the way) can generate log-normally distributed
response times.
We are now relatively satisfied with the priors for our model, and we can fit the model of the
effect of trial on button-pressing times using brms. We need to specify that the family is
lognormal().
fit_press_trial <- brm(t ~ 1 + c_trial,
  data = df_spacebar,
  family = lognormal(),
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, 0.01), class = b, coef = c_trial)
  )
)
Instead of printing out the complete output from the model, look at the estimates from the
posteriors of the parameters α, β, and σ. These parameters are on the log scale:
posterior_summary(fit_press_trial,
variable = c("b_Intercept",
"b_c_trial",
"sigma"))
The posterior distributions can be plotted to obtain a graphical summary of all the parameters
in the model (Figure 4.11):
plot(fit_press_trial)
FIGURE 4.11: Posterior distributions of the model of the effect of trial on button-pressing.
Next, we turn to the question of what we can report as our results, and what we can conclude
from the data.
As shown above, the first step is to summarize the posteriors in a table or graphically (or
both). If the research question relates to the effect estimated by the model, the posterior of β
can be summarized in the following way: β̂ = 0.00052, 95% CrI = [0.0004, 0.00065].
The effect is easier to interpret in milliseconds. We can transform the estimates back to the
millisecond scale from the log scale, but we need to take into account that the scale is not
linear, and that the effect between two button presses will differ depending on where we are in
the experiment.
We will have a certain estimate if we consider the difference between response times in a trial
at the middle of the experiment (when the centered trial number is zero) and the previous one
(when the centered trial number is minus one).
alpha_samples <- as_draws_df(fit_press_trial)$b_Intercept
beta_samples <- as_draws_df(fit_press_trial)$b_c_trial
effect_middle_ms <- exp(alpha_samples) -
exp(alpha_samples - 1 * beta_samples)
## ms effect in the middle of the expt:
c(mean = mean(effect_middle_ms),
  quantile(effect_middle_ms, c(0.025, 0.975)))
We will obtain a different estimate if we consider the difference between the second and the
first trial:
# centered trial numbers of the first and second trial:
first_trial <- 1 - mean(df_spacebar$trial)
second_trial <- 2 - mean(df_spacebar$trial)
effect_beginning_ms <-
exp(alpha_samples + second_trial * beta_samples) -
exp(alpha_samples + first_trial * beta_samples)
## ms effect from first to second trial:
c(mean = mean(effect_beginning_ms),
  quantile(effect_beginning_ms, c(0.025, 0.975)))
So far we have converted the estimates to obtain median effects; that’s why we used exp(⋅).
If we want to obtain mean effects, we need to take σ into account, since the mean of the
log-normal is exp(⋅ + σ²/2); see the short simulation below.
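A minimal check of this relationship with simulated log-normal samples (the values of μ and
σ are illustrative):

mu <- 6
sigma <- 0.5
exp(mu) # the median: ~403
exp(mu + sigma^2 / 2) # the mean: ~457
mean(rlnorm(100000, mu, sigma)) # close to the analytical mean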
We can also use the built-in function fitted(), which calculates mean effects directly.
Consider again the difference between the second and the first trial, this time using
fitted().
First, define for which observations we want to obtain the fitted values in millisecond scale. If
we are interested in the difference between the second and first trial, create a data frame with
their centered versions.
# (the name new_data is illustrative; the original code is not shown)
new_data <- data.frame(c_trial = c(first_trial, second_trial))
Second, use fitted() on the brms object, including the new data, and setting the summary
parameter to FALSE . The first column contains the posterior samples transformed into
milliseconds of the first trial, and the second column of the second trial.
fitted_ms <- fitted(fit_press_trial,
  newdata = new_data,
  summary = FALSE)
head(fitted_ms, 2)
## [,1] [,2]
## [1,] 153 153
## [2,] 154 155
Last, calculate the difference between trials, and report mean and 95% quantiles.
effect_beginning_ms <- fitted_ms[, 2] - fitted_ms[, 1]
c(mean = mean(effect_beginning_ms),
  quantile(effect_beginning_ms, c(0.025, 0.975)))
The practical relevance of the effect for the research question can be important too. For
example, only after 100 button presses do we see a barely noticeable slowdown:
effect_100 <-
exp(alpha_samples + 100 * beta_samples) -
exp(alpha_samples)
c(mean = mean(effect_100),
quantile(effect_100, c(0.025, 0.975)))
We need to consider whether the estimated mean effect, and our uncertainty about it,
have any scientific relevance. Such relevance can be established by considering the previous
literature, predictions from a quantitative model, or other expert domain knowledge.
Sometimes, a quantitative meta-analysis is helpful; for examples, see Bürki, Alario, and
Vasishth (2022), Cox et al. (2022), Bürki et al. (2020), Jäger, Engelmann, and Vasishth (2017),
Mahowald et al. (2016), Nicenboim, Roettger, and Vasishth (2018), and Vasishth et al. (2013).
We will discuss meta-analysis later in the book, in chapter 13.
Sometimes, researchers are only interested in establishing that an effect is present or absent;
the magnitude and uncertainty of the estimate is of secondary interest. Here, the goal is to
argue that there is evidence of a slowdown. The word evidence has a special meaning in
statistics (Royall 1997), and in null hypothesis significance testing, a likelihood ratio test is the
standard way to argue that one has evidence for an effect. In the Bayesian data analysis
context, in order to answer such a question, a Bayes factor analysis must be carried out. We’ll
come back to this issue in the model comparison chapters 14-16.
4.2.4 Descriptive adequacy
We look now at the predictions of the model. Since we now know that trial effects are very
small, let’s examine predictions of the model for differences in response times between 100
button presses. Similarly as for prior predictive checks, we define a function,
median_diff100() , that calculates the median difference between a trial n and a trial n + 100
. This time we’ll compare the observed median difference against the range of predicted
differences based on the model and the data rather than only the model as we did for the prior
predictions. Below we use virtually the same code that we use for plotting prior predictive
checks, but since we now use the fitted model, we’ll obtain posterior predictive checks; this is
displayed in Figure 4.12.
median_diff100 <- function(x) {
  median(x - lag(x, 100), na.rm = TRUE)
}
pp_check(fit_press_trial,
  type = "stat",
  stat = "median_diff100")
FIGURE 4.12: The posterior predictive distribution of the median difference in response times
between a trial n and a trial n + 100 based on the model fit_press_trial and the observed
data.
From Figure 4.12, we can conclude that the model’s predictions for differences in response
times between trials are reasonable.
4.3 Logistic regression: Does set size affect free
recall?
In this section, we will learn how the principles we have learned so far can naturally extend to
generalized linear models (GLMs). We focus on one special case of GLMs that has wide
application in linguistics and psychology, logistic regression.
As an example data set, we look at a study investigating the capacity level of working memory.
The data are a subset of a data set created by Oberauer (2019). Each subject was presented
with word lists of varying lengths (2, 4, 6, and 8 elements), and then asked to recall a word
given its position on the list; see Figure 4.13. We will focus on the data from one subject.
FIGURE 4.13: The flow of events in a trial with memory set size 4 and free recall. Adapted
from Oberauer (2019); licensed under CC BY 4.0
(https://creativecommons.org/licenses/by/4.0/).
It is well established that as the number of items to be held in working memory increases,
performance (that is, accuracy) decreases (see Oberauer and Kliegl 2001, among others). We
will investigate this claim with data from only one subject.
The data can be found in df_recall in the package bcogsci . The code below loads the
data, centers the predictor set_size , and briefly explores the data set.
data("df_recall")
df_recall <- df_recall %>%
mutate(c_set_size = set_size - mean(set_size))
# Set sizes in the data set:
df_recall$set_size %>%
  unique()
## [1] 2 4 6 8
df_recall %>%
  group_by(set_size) %>%
  count()
## # A tibble: 4 × 2
## # Groups: set_size [4]
## set_size n
## <int> <int>
## 1 2 23
## 2 4 23
## 3 6 23
## # … with 1 more row
Here, the column correct records incorrect vs. correct responses as 0 vs. 1, and the
column c_set_size records the centered memory set size, which takes the values -3, -1, 1,
and 3; these are the centered versions of 2, 4, 6, and 8.
df_recall
## # A tibble: 92 × 8
## subj set_size correct trial session block tested c_set_size
## <chr> <int> <int> <int> <int> <int> <int> <dbl>
## 1 10 4 1 1 1 1 2 -1
## 2 10 8 0 4 1 1 8 3
## 3 10 2 1 9 1 1 2 -3
## # … with 89 more rows
We want to model the trial-by-trial accuracy and examine whether the probability of recalling a
word is related to the number of words in the set that the subject needs to remember.
Recall that the Bernoulli likelihood generates a 0 or 1 response with a particular probability θ.
For example, one can generate simulated data for 10 trials, with a 50% probability of getting a
1 using rbern from the package extraDistr .
rbern(10, prob = 0.5)
## [1] 0 1 1 1 1 1 0 1 0 1
We can therefore define each dependent value correct_n in the data as being generated
from a Bernoulli random variable with probability of success θn . Here, n = 1, … , N indexes
the trial, correct_n is the dependent variable (0 indicates an incorrect recall and 1 a correct
recall), and θn is the probability of correctly recalling a probe in a given trial n .
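In equation form:

correct_n ∼ Bernoulli(θ_n)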
Since θn is bounded to be between 0 and 1 (it is a probability), we cannot just fit a regression
model using the normal (or log-normal) likelihood as we did in the preceding examples. Such
a model would be inappropriate because it would assume that the data range from −∞ to
+∞ (or from 0 to +∞ ), rather than being limited to zeros and ones.
The generalized linear modeling framework solves this problem by defining a link function g(⋅)
that connects the linear model to the quantity to be estimated (here, the probabilities θn ). The
link function used for 0, 1 responses is called the logit link, and is defined as follows.
η_n = g(θ_n) = log(θ_n / (1 − θ_n))

The term θ_n / (1 − θ_n) is called the odds.15 The logit link function is therefore a log-odds;
it maps probability values ranging from [0, 1] to real numbers ranging from −∞ to +∞.
Figure 4.14 shows the logit link function, η = g(θ), and the inverse logit, θ = g⁻¹(η), which is
called the logistic function; the relevance of this logistic function will become clear in a
moment.
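For example, a probability of θ = 0.75 corresponds to odds of 0.75/0.25 = 3, and to log-odds
of log(3) ≈ 1.1.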
FIGURE 4.14: The logit link function, η = g(θ), and the inverse logit (logistic) function,
θ = g⁻¹(η).
The linear model is now fit not to the 0, 1 responses as the dependent variable, but to η_n,
i.e., log-odds, as the dependent variable:

η_n = log(θ_n / (1 − θ_n)) = α + c_set_size_n ⋅ β
Unlike in linear models, there is no residual error term (ε) in this model. Once η_n is
estimated, one can solve the above equation for θ_n (in other words, we compute the inverse
of the logit function and obtain the estimates on the probability scale). This gives the
above-mentioned logistic regression function:
θ_n = g⁻¹(η_n) = exp(η_n) / (1 + exp(η_n)) = 1 / (1 + exp(−η_n))

The last equality in the equation above arises by dividing both the numerator and the
denominator by exp(η_n).
In summary, the generalized linear model with the logit link fits the following Bernoulli
likelihood:

correct_n ∼ Bernoulli(θ_n)

The model is fit on the log-odds scale, η_n = α + c_set_size_n ⋅ β. Once η_n has been
estimated, the inverse logit or logistic function is used to compute the probability estimates,
θ_n = exp(η_n) / (1 + exp(η_n)). An example of this calculation will be shown in the next
section.
In order to decide on priors for α and β, we need to take into account that these parameters
do not represent probabilities or proportions, but log-odds, the x-axis in Figure 4.14 (right-
hand side figure). As shown in the figure, the relationship between log-odds and probabilities
is not linear.
There are two functions in R that implement the logit and inverse logit functions: qlogis(p)
for the logit function and plogis(x) for the inverse logit or logistic function.
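A quick round trip between the two scales:

qlogis(0.5) # probability 0.5 corresponds to 0 in log-odds
plogis(0) # and back: 0 log-odds is probability 0.5
plogis(qlogis(0.9)) # the two functions are inverses: 0.9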
Now we need to set priors for α and β. Given that we centered our predictor, the intercept, α,
represents the log-odds of correctly recalling one word in a random position for the average
set size of five (since (2 + 4 + 6 + 8)/4 = 5), which, incidentally, was not presented in the
experiment. This is one case where the intercept doesn’t have a clear interpretation if we
leave the predictor uncentered: with a non-centered set size, the intercept would be the
log-odds of recalling one word in a set of zero words, which obviously makes no sense.
The prior for α will depend on how difficult the recall task is. We could assume that the
probability of recalling a word for an average set size, α, is centered at 0.5 (a 50/50 chance)
with a great deal of uncertainty. The R command qlogis(.5) tells us that 0.5 corresponds to
zero in log-odds space. How do we include a great deal of uncertainty? We could look at
Figure 4.14, and decide on a standard deviation of 4 in a normal distribution centered at zero:

α ∼ Normal(0, 4)
Let’s plot this prior in log-odds and in probability scale by drawing random samples.
samples_logodds <- tibble(alpha = rnorm(100000, 0, 4))
samples_prob <- tibble(p = plogis(rnorm(100000, 0, 4)))
ggplot(samples_logodds, aes(alpha)) +
geom_density()
ggplot(samples_prob, aes(p)) +
geom_density()
FIGURE 4.15: The prior for α ∼ Normal(0, 4) in log-odds and in probability space.
Figure 4.15 shows that our prior assigns more probability mass to extreme probabilities of
recall than to intermediate values. Clearly, this is not what we intended.
We could try several values for the standard deviation of the prior until we find a prior that
makes sense. Reducing the standard deviation to 1.5 seems reasonable, as shown in
Figure 4.16.

α ∼ Normal(0, 1.5)
FIGURE 4.16: Prior for α ∼ Normal(0, 1.5) in log-odds and in probability space.
We need to decide now on the prior for β, the effect in log-odds of increasing the set size. We
could choose a normal distribution centered on zero, reflecting our lack of any commitment
regarding the direction of the effect. Let’s get some intuitions regarding different possible
standard deviations for this prior, by testing the following distributions as priors:
a. β ∼ Normal(0, 1)
b. β ∼ Normal(0, .5)
c. β ∼ Normal(0, .1)
d. β ∼ Normal(0, .01)
e. β ∼ Normal(0, .001)
In principle, we could produce the prior predictive distributions using brms with sample_prior
= "only" and then predict() . However, as mentioned before, brms also uses Stan’s
Hamiltonian sampler for sampling from the priors, and this can lead to convergence problems
when the priors are too uninformative (as in this case). We solve this issue by performing prior
predictive checks directly in R using the r* family of functions (e.g., rnorm() , rbinom() ,
etc.) together with loops. This method is not as simple as using the convenient functions
provided by brms , but it is very flexible and can be very powerful. We show the prior
predictive distributions in Figure 4.17; for the details on the implementation in R, see Box 4.4.
# The wrapper and the theta/correct_pred lines are reconstructed from
# the text; the original only showed the map2_dfr() skeleton.
logistic_model_pred <- function(alpha_samples,
                                beta_samples,
                                set_size,
                                N_obs) {
  map2_dfr(alpha_samples, beta_samples,
    function(alpha, beta) {
      tibble(
        set_size = set_size,
        # center size:
        c_set_size = set_size - mean(set_size),
        # inverse logit gives the Bernoulli probability:
        theta = plogis(alpha + c_set_size * beta),
        correct_pred = rbern(N_obs, prob = theta)
      )
    },
    .id = "iter"
  ) %>%
    # .id is always a string and needs
    # to be converted to a number
    mutate(iter = as.numeric(iter))
}
Let’s assume 800 observations, with 200 observations for each set size:

N_obs <- 800
set_size <- rep(c(2, 4, 6, 8), 200)

Draw samples from the prior of α, and generate prior predictive distributions for each of the
candidate priors of β (the number of prior samples is illustrative):

alpha_samples <- rnorm(1000, 0, 1.5)
sds_beta <- c(1, 0.5, 0.1, 0.01, 0.001)
prior_pred <- map_dfr(sds_beta, function(sd) {
  beta_samples <- rnorm(1000, 0, sd)
  logistic_model_pred(
    alpha_samples = alpha_samples,
    beta_samples = beta_samples,
    set_size = set_size,
    N_obs = N_obs
  ) %>%
    mutate(prior_beta_sd = sd)
})
Calculate the accuracy for each one of the priors we want to examine, for each iteration,
and for each set size.
mean_accuracy <-
prior_pred %>%
group_by(prior_beta_sd, iter, set_size) %>%
summarize(accuracy = mean(correct_pred)) %>%
mutate(prior = paste0("Normal(0, ", prior_beta_sd, ")"))
mean_accuracy %>%
ggplot(aes(accuracy)) +
geom_histogram() +
facet_grid(set_size ~ prior) +
scale_x_continuous(breaks = c(0, .5, 1))
It’s sometimes more useful to look at the predicted differences in accuracy between set
sizes. We calculate them as follows, and plot them in Figure 4.18.
# (reconstructed except for the final filter() call)
diff_accuracy <- mean_accuracy %>%
  arrange(set_size) %>%
  group_by(prior_beta_sd, iter) %>%
  mutate(diff_accuracy = accuracy - lag(accuracy),
         diffsize = paste(set_size, "-", lag(set_size))) %>%
  filter(set_size > 2)
diff_accuracy %>%
ggplot(aes(diff_accuracy)) +
geom_histogram() +
facet_grid(diffsize ~ prior) +
scale_x_continuous(breaks = c(-.5, 0, .5))
Figure 4.17 shows that, as expected, the priors are centered at zero. We see that the
distribution of possible accuracies for the prior that has a standard deviation of 1 is
problematic: There is too much probability concentrated near 0 and 1 for set sizes of 2 and 8.
It’s hard to tell the differences between the other priors, and it might be more useful to look at
the predicted differences in accuracy between set sizes in Figure 4.18.
FIGURE 4.17: The prior predictive distributions of mean accuracy of the model defined in
section 4.3, for different set sizes and different priors for β.
FIGURE 4.18: The prior predictive distributions of differences in mean accuracy between set
sizes of the model defined in section 4.3 for different priors for β.
If we are not sure whether the increase in set size could produce anything between a null
effect and a relatively large effect, we can choose the prior with a standard deviation of 0.1.
Under this reasoning, we settle on the following priors:
α ∼ Normal(0, 1.5)
β ∼ Normal(0, 0.1)
Having decided on the likelihood, the link function, and the priors, the model can now be fit
using brms . We need to specify that the family is bernoulli() , and the link is logit .
fit_recall <- brm(correct ~ 1 + c_set_size,
  data = df_recall,
  family = bernoulli(link = logit),
  prior = c(
    prior(normal(0, 1.5), class = Intercept),
    prior(normal(0, 0.1), class = b, coef = c_set_size)
  )
)
Next, look at the summary of the posteriors of each of the parameters. Keep in mind that the
parameters are in log-odds space:
posterior_summary(fit_recall,
variable = c("b_Intercept", "b_c_set_size"))
plot(fit_recall)
FIGURE 4.19: The posterior distributions of the parameters in the brms model fit_recall ,
along with the corresponding trace plots.
Next, we turn to the question of what we can report as our results, and what we can conclude
from the data.
Here, we are in a situation analogous to the one we saw earlier with the log-normal model. If
we want to talk about the effect estimated by the model in log-odds space, we summarize the
posterior of β in the following way: β̂ = −0.182, 95% CrI = [−0.343, −0.022].
However, the effect is easier to understand in proportions rather than in log-odds. Let’s look at
the average accuracy for the task first:
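A sketch of this computation: extract the posterior draws of the intercept and back-transform them with plogis() (the intermediate object names are assumptions):

alpha_samples <- as_draws_df(fit_recall)$b_Intercept
av_accuracy <- plogis(alpha_samples)
c(mean = mean(av_accuracy),
  quantile(av_accuracy, c(0.025, 0.975)))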
As before, to transform the effect of our manipulation to an easier to interpret scale (i.e.,
proportions), we need to take into account that the scale is not linear, and that the effect of
increasing the set size depends on the average accuracy, and the set size that we start from.
We can do the following calculation, similar to what we did for the trial effects experiment, to
find out the decrease in accuracy in proportions or probability scale:
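A sketch, reusing alpha_samples from above and extracting the slope draws:

beta_samples <- as_draws_df(fit_recall)$b_c_set_size
# accuracy at the average set size minus accuracy one
# unit below the average set size:
effect_middle <- plogis(alpha_samples) -
  plogis(alpha_samples - beta_samples)
c(mean = mean(effect_middle),
  quantile(effect_middle, c(0.025, 0.975)))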
Notice the interpretation here: if we increase the set size from the average set size minus one to the average set size, the accuracy of recall changes by −0.019, 95% CrI = [−0.038, −0.002]. Recall that the average set size, 5, was never actually presented to the subjects! We could alternatively look at the decrease in accuracy from a set size of 2 to 4:
four <- 4 - mean(df_recall$set_size)
two <- 2 - mean(df_recall$set_size)
effect_4m2 <-
plogis(alpha_samples + four * beta_samples) -
plogis(alpha_samples + two * beta_samples)
c(mean = mean(effect_4m2),
quantile(effect_4m2, c(0.025, 0.975)))
We can also back-transform to the probability scale using the function fitted() rather than plogis(). One advantage is that this works regardless of the type of link function; in this section we only discussed the logit link, but other link functions can be used in generalized linear models (e.g., the probit link; see Dobson and Barnett 2011).
Since the set size is the only predictor and it is centered, for estimating the average accuracy,
we can consider an imaginary observation where the c_set_size is zero. (If there were more
centered predictors, we would need to set all of them to zero). Now we can use the summary
argument provided by fitted() .
fitted(fit_recall,
newdata = data.frame(c_set_size = 0),
summary = TRUE)[,c("Estimate", "Q2.5","Q97.5")]
For estimating the difference in accuracy from the average set size minus one to the average
set size, and from a set size of two to four, first, define newdata with these set sizes.
new_sets <- data.frame(c_set_size = c(0, -1, four, two))
set_sizes <- fitted(fit_recall,
newdata = new_sets,
summary = FALSE)
Then calculate the appropriate differences considering that column one of set_sizes
corresponds to the average set size, column two to the average set size minus one and so
forth.
# assumed completion: compute the differences between the relevant
# columns of set_sizes (only the quantile calls survived extraction)
effect_middle_fitted <- set_sizes[, 1] - set_sizes[, 2]
c(mean = mean(effect_middle_fitted),
  quantile(effect_middle_fitted, c(0.025, 0.975)))
effect_4m2_fitted <- set_sizes[, 3] - set_sizes[, 4]
c(mean = mean(effect_4m2_fitted),
  quantile(effect_4m2_fitted, c(0.025, 0.975)))
As expected, we get exactly the same values with fitted() as when we calculate them “by hand.”
4.3.5 Descriptive adequacy
One potentially useful aspect of posterior distributions is that we can also make predictions for conditions that were not part of the actual experiment, such as set sizes that weren’t tested. We could then carry out a new experiment to investigate whether our model’s predictions hold up. To make predictions for other set sizes, we extend our data set, adding rows with set sizes of 3, 5, and 7. To be consistent with the data of the other set sizes in the experiment, we add 23 trials of each new set size (this is the number of trials per set size in the data set). Something important to notice is that we need to center our predictor based on the original mean set size. This is because we want to maintain our interpretation of the intercept. We extend the data as follows, and we summarize the data and plot it in Figure 4.20.
# assumed completion: extend the data with the untested set sizes,
# centering with the original mean set size
df_recall_ext <- df_recall %>%
  bind_rows(tibble(set_size = rep(c(3, 5, 7), 23),
                   c_set_size = set_size - mean(df_recall$set_size),
                   correct = 0))
pp_check(fit_recall,
  type = "stat_grouped",
  stat = "mean",
  group = "c_set_size",
  newdata = df_recall_ext,
  facet_args = list(
    ncol = 1, scales = "fixed",
    labeller = as_labeller(set_size)
  ),
  binwidth = 0.02
)
FIGURE 4.20: The distributions of posterior predicted mean accuracies for tested set sizes (2, 4, 6, and 8) and untested ones (3, 5, and 7), labeled yrep. The observed mean accuracies, y, are only relevant for the tested set sizes (2, 4, 6, and 8); the “observed” accuracies of the untested set sizes are represented as 0.
We could now gather new data in an experiment that also shows set sizes of 3, 5, and 7. These data would be held out from the model fit_recall, since the model was fit when those data were not available. Verifying that the new observations fall within the already-generated posterior predictive distribution would be a way to test genuine predictions of our model.
Having seen how we can fit simple regression models, we turn to hierarchical models in the
next chapter.
4.4 Summary
In this chapter, we learned how to fit and interpret simple linear regression models, models with a log-normal likelihood, and logistic regression models. We investigated the prior specification for these models using prior predictive checks, and their descriptive adequacy using posterior predictive checks.
4.5 Further reading
Linear regression is discussed in several classic textbooks; these have largely a frequentist
orientation, but the basic theory of linear modeling presented there can easily be extended to
the Bayesian framework. An accessible textbook is by Dobson and Barnett (2011). Other
useful textbooks on linear modeling are Harrell Jr (2015), Faraway (2016), Fox (2015), and
Montgomery, Peck, and Vining (2012).
4.6 Exercises
Exercise 4.1 A simple linear regression: Power posing and testosterone.
data("df_powerpose")
head(df_powerpose)
The data set, which was originally published in Carney, Cuddy, and Yap (2010) but released in
modified form by Fosse (2016), shows the testosterone levels of 39 different individuals,
before and after treatment, where treatment refers to each individual being assigned to a high
power pose or a low power pose. In the original paper by Carney, Cuddy, and Yap (2010), the
unit given for testosterone measurement (estimated from saliva samples) was picograms per
milliliter (pg/ml). One picogram per milliliter is 0.001 nanogram per milliliter (ng/ml).
The research hypothesis is that on average, assigning a subject a high power pose vs. a low
power pose will lead to higher testosterone levels after treatment. Assuming that you know
nothing about normal ranges of testosterone using salivary measurement, choose an
appropriate Cauchy prior (e.g., Cauchy(0, 2.5) ) for the target parameter(s).
Investigate this claim using a linear model and the default priors of brms . You’ll need to
estimate the effect of a new variable that encodes the change in testosterone.
Exercise 4.2 Another linear regression model: Revisiting attentional load effect on pupil size.
Here, we revisit the analysis shown in the chapter, on how attentional load affects pupil size.
a. Our priors for this experiment were quite arbitrary. What do the prior predictive distributions look like? Do they make sense?
b. Is our posterior distribution sensitive to the priors that we selected? Perform a sensitivity analysis to find out whether the posterior is affected by our choice of prior for σ.
c. Our data set also includes a column that indicates the trial number. Could it be that trial also has an effect on pupil size? As in lm, we indicate another main effect with a + sign. How would you communicate the new results?
Exercise 4.3 Log-normal model: Revisiting the effect of trial on finger tapping times.
a. Estimate the slowdown in milliseconds between the last two times the subject pressed the
space bar in the experiment.
b. How would you change your model (keeping the log-normal likelihood) so that it includes
centered log-transformed trial numbers or square-root-transformed trial numbers (instead
of centered trial numbers)? Does the effect in milliseconds change?
Exercise 4.4 Logistic regression: Revisiting the effect of set size on free recall.
Our data set also includes a column coded as tested that indicates the position of the queued word. (In Figure 4.13, tested would be 3.) Could it be that this position also has an effect on recall accuracy? How would you incorporate this in the model? (We indicate another main effect with a + sign.)
Exercise 4.5 Red is the sexiest color.
data("df_red")
head(df_red)
## risk age red pink redorpink
## 8 0 19 0 0 0
## 9 0 25 0 0 0
## 10 0 20 0 0 0
## 11 0 20 0 0 0
## 14 0 20 0 0 0
## 15 0 18 0 0 0
The data set is from a study (Beall and Tracy 2013) that contains information about the color of the clothing worn (red, pink, or red or pink) when the subject (female) was at risk of becoming pregnant (i.e., ovulating; self-reported). The broader issue being investigated is whether women wear red more often when they are ovulating (in order to attract a mate). Using logistic regressions, fit three different models to investigate whether ovulation increases the probability of wearing (a) red, (b) pink, or (c) either pink or red. Use priors that are reasonable (in your opinion).
References
Beall, Alec T., and Jessica L. Tracy. 2013. “Women Are More Likely to Wear Red or Pink at
Peak Fertility.” Psychological Science 24 (9). Sage Publications Sage CA: Los Angeles, CA:
1837–41.
Blumberg, Eric J., Matthew S. Peterson, and Raja Parasuraman. 2015. “Enhancing Multiple
Object Tracking Performance with Noninvasive Brain Stimulation: A Causal Role for the
Anterior Intraparietal Sulcus.” Frontiers in Systems Neuroscience 9: 3.
https://doi.org/10.3389/fnsys.2015.00003.
Buzsáki, György, and Kenji Mizuseki. 2014. “The Log-Dynamic Brain: How Skewed
Distributions Affect Network Operations.” Nature Reviews Neuroscience 15 (4): 264–78.
https://doi.org/10.1038/nrn3687.
Bürki, Audrey, Francois-Xavier Alario, and Shravan Vasishth. 2022. “When Words Collide:
Bayesian Meta-Analyses of Distractor and Target Properties in the Picture-Word Interference
Paradigm.” Quarterly Journal of Experimental Psychology.
Bürki, Audrey, Shereen Elbuy, Sylvain Madec, and Shravan Vasishth. 2020. “What Did We
Learn from Forty Years of Research on Semantic Interference? A Bayesian Meta-Analysis.”
Journal of Memory and Language. https://doi.org/10.1016/j.jml.2020.104125.
Carney, Dana R, Amy JC Cuddy, and Andy J Yap. 2010. “Power Posing: Brief Nonverbal
Displays Affect Neuroendocrine Levels and Risk Tolerance.” Psychological Science 21 (10).
Sage Publications Sage CA: Los Angeles, CA: 1363–8.
Cox, Christopher Martin Mikkelsen, Tamar Keren-Portnoy, Andreas Roepstorff, and Riccardo
Fusaroli. 2022. “A Bayesian Meta-Analysis of Infants’ Ability to Perceive Audio–Visual
Congruence for Speech.” Infancy 27 (1). Wiley Online Library: 67–96.
Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models.
CRC press.
Faraway, Julian J. 2016. Extending the Linear Model with R: Generalized Linear, Mixed
Effects and Nonparametric Regression Models. Chapman; Hall/CRC.
Fosse, Nathan E. 2016. “Replication Data for ‘Power Posing: Brief Nonverbal Displays Affect
Neuroendocrine Levels and Risk Tolerance’ by Carney, Cuddy, Yap (2010).” Harvard
Dataverse. https://doi.org/10.7910/DVN/FMEGS6.
Fox, John. 2015. Applied Regression Analysis and Generalized Linear Models. Sage
Publications.
Harrell Jr, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models,
Logistic and Ordinal Regression, and Survival Analysis. New York, NY: Springer.
Hayes, Taylor R., and Alexander A. Petrov. 2016. “Mapping and Correcting the Influence of
Gaze Position on Pupil Size Measurements.” Behavior Research Methods 48 (2): 510–27.
https://doi.org/10.3758/s13428-015-0588-x.
Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference
in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of
Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.
Limpert, Eckhard, Werner A. Stahel, and Markus Abbt. 2001. “Log-Normal Distributions Across
the Sciences: Keys and Clues.” BioScience 51 (5): 341. https://doi.org/10.1641/0006-
3568(2001)051[0341:LNDATS]2.0.CO;2.
Mahowald, Kyle, Ariel James, Richard Futrell, and Edward Gibson. 2016. “A Meta-Analysis of
Syntactic Priming in Language Production.” Journal of Memory and Language 91. Elsevier: 5–
27.
Nicenboim, Bruno, Timo B. Roettger, and Shravan Vasishth. 2018. “Using Meta-Analysis for
Evidence Synthesis: The case of incomplete neutralization in German.” Journal of Phonetics
70: 39–55. https://doi.org/10.1016/j.wocn.2018.06.001.
Oberauer, Klaus. 2019. “Working Memory Capacity Limits Memory for Bindings.” Journal of
Cognition 2 (1): 40. https://doi.org/10.5334/joc.86.
Oberauer, Klaus, and Reinhold Kliegl. 2001. “Beyond Resources: Formal Models of
Complexity Effects and Age Differences in Working Memory.” European Journal of Cognitive
Psychology 13 (1-2). Routledge: 187–215. https://doi.org/10.1080/09541440042000278.
Pylyshyn, Zenon W., and Ron W. Storm. 1988. “Tracking Multiple Independent Targets:
Evidence for a Parallel Tracking Mechanism.” Spatial Vision 3 (3): 179–97.
https://doi.org/10.1163/156856888X00122.
Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. New York: Chapman; Hall,
CRC Press.
Spector, Robert H. 1990. “The Pupils.” In Clinical Methods: The History, Physical, and
Laboratory Examinations, edited by H. Kenneth Walker, W. Dallas Hall, and J. Willis Hurst, 3rd
ed. Boston: Butterworths.
Ulrich, Rolf, and Jeff Miller. 1993. “Information Processing Models Generating Lognormally
Distributed Reaction Times.” Journal of Mathematical Psychology 37 (4): 513–25.
https://doi.org/10.1006/jmps.1993.1032.
Ulrich, Rolf, and Jeff Miller. 1994. “Effects of Truncation on Reaction Time Analysis.” Journal of
Experimental Psychology: General 123 (1): 34–80. https://doi.org/10/b8tsnh.
Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese
Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public
Library of Science: 1–14.
Wagenmakers, Eric-Jan, Raoul P. P. P. Grasman, and Peter C. M. Molenaar. 2005. “On the
Relation Between the Mean and the Variance of a Diffusion Model Response Time
Distribution.” Journal of Mathematical Psychology 49 (3): 195–204.
https://doi.org/10.1016/j.jmp.2005.02.003.
Wahn, Basil, Daniel P. Ferris, W. David Hairston, and Peter König. 2016. “Pupil Sizes Scale
with Attentional Load and Task Experience in a Multiple Object Tracking Task.” PLOS ONE 11
(12): e0168087. https://doi.org/10.1371/journal.pone.0168087.
13. The average pupil size will probably be higher than 800, since this measurement was with no load; but in any case the exact number won’t matter: any prior mean between 500 and 1500 would be fine if the standard deviation is large.↩
14. These transformations are visible when checking the generated Stan code using
make_stancode .↩
15. Odds are defined to be the ratio of the probability of success to the probability of failure. For example, the odds of obtaining a one in a fair six-sided die are (1/6)/(1 − 1/6) = 1/5. The odds of obtaining heads in a fair coin are 1/1. Do not confuse this technical term with the day-to-day usage of the word “odds” to mean probability.↩
Usually, experimental data in cognitive science contain “clusters”. These are natural groups
that contain observations that are more similar within the clusters than between them. The
most common examples of clusters in experimental designs are subjects and experimental
items (e.g., words, pictures, objects that are presented to the subjects). These clusters arise
because we have multiple (repeated) observations for each subject, and for each item. If we
want to incorporate this grouping structure in our analysis, we generally use a hierarchical
model (also called multi-level or a mixed model, Pinheiro and Bates 2000). This kind of
clustering and hierarchical modeling arises as a consequence of the idea of exchangeability.
Exchangeability is the Bayesian analog of the phrase “independent and identically distributed”
that appears regularly in classical (i.e., frequentist) statistics. Some connections and
differences between exchangeability and the frequentist concept of independent and
identically distributed (iid) are detailed in Box 5.1.
yn ∼ Normal(μsubj[n], σ)

The notation subj[n], which roughly follows Gelman and Hill (2007), identifies the subject index. Suppose that 20 subjects respond 50 times each. If the data are ordered by subject id, then subj[1] to subj[50] correspond to the subject with id i = 1, subj[51] to subj[100] to the subject with id i = 2, and so forth. This structure can be created in R as follows:
N_subj <- 20
N_obs_subj <- 50
N <- N_subj * N_obs_subj
df <- tibble(row = 1:N,
subj = rep(c(1:N_subj), each = N_obs_subj))
df
## # A tibble: 1,000 × 2
## row subj
## <int> <int>
## 1 1 1
## 2 2 1
## 3 3 1
## # … with 997 more rows
If the data yn are exchangeable, the parameters μi are also exchangeable. The fact that the
μi are exchangeable can be shown (Bernardo and Smith 2009) to be mathematically
equivalent to assuming that they come from a common distribution, for example:
μi ∼ Normal(μ, τ )
To make this more concrete, assume some completely arbitrary true values for the
parameters, and generate observations based on a hierarchical process in R.
mu <- 100
tau <- 15
sigma <- 4
mu_i <- rnorm(N_subj, mu, tau)
# assumed completion (truncated in extraction): simulate each
# observation around its subject's mean
df_h <- mutate(df, y = rnorm(N, mu_i[subj], sigma))
df_h
## ...
## 3     3     1  74.2
## # … with 997 more rows
The parameters μ and τ , called hyperparameters, are unknown and have prior distributions
(hyperpriors) defined for them. This fact leads to a hierarchical relationship between the
parameters: there is a common parameter μ for all the levels of a group, and the parameters
μi are assumed to be generated as a function of this common parameter μ. Here, τ represents the between-subject variability. Priors p(μ), p(τ), and p(σ) must be specified for all three parameters. Under this hierarchical structure, information about each μsubj[n] comes from two sources:
a. from each of the observed yn corresponding to the respective μsubj[n] parameter, and
b. from the parameters μ and τ that led to all the other yk (where k ≠ n) being generated.
Fit this model in brms in the following way. Intercept corresponds to μ , sigma to σ, and
sd to τ . For now the prior distributions are arbitrary.
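A sketch of the call, assuming the simulated data frame df_h from above; the wide priors here are placeholders, in line with the text:

fit_h <- brm(y ~ 1 + (1 | subj),
  data = df_h,
  prior = c(
    prior(normal(0, 500), class = Intercept),
    prior(normal(0, 500), class = sd),
    prior(normal(0, 500), class = sigma)
  )
)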
fit_h
## ...
## Group-Level Effects:
## ~subj (Number of levels: 20)
## [1] 72.5 115.3 97.1 86.0 62.5 97.0 104.0 88.5 119.3 82.6 67.8
## [12] 98.6 92.4 84.2 85.0 102.1 102.9 98.2 92.3 88.9
FIGURE 5.1: A directed acyclic graph illustrating a hierarchical model (partial pooling).
There are two other configurations possible that do not involve this hierarchical structure and
which represent two alternative, extreme scenarios.
One of these two configurations is called the complete pooling model. Here, the data yn are
assumed to be generated from a single distribution:
yn ∼ Normal(μ, σ).
Generate fake observations in a vector y based on arbitrary true values in R in the following
way.
sigma <- 4
mu <- 100
df_cp <- mutate(df, y = rnorm(N, mu, sigma))
df_cp
## # A tibble: 1,000 × 3
## row subj y
## <int> <int> <dbl>
## 1 1 1 94.1
## 2 2 1 100.
## 3 3 1 95.3
Fit it in brms .
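A sketch with placeholder priors (the model name fit_cp matches the summary printed below):

fit_cp <- brm(y ~ 1,
  data = df_cp,
  prior = c(
    prior(normal(0, 500), class = Intercept),
    prior(normal(0, 500), class = sigma)
  )
)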
fit_cp
## ...
## Population-Level Effects:
The other configuration is called the no pooling model; here, each yn is assumed to be
generated from an independent distribution:
yn ∼ Normal(μsubj[n] , σ)
Generate fake observations from the no pooling model in R with arbitrary true values.
sigma <- 4
mu_i <- c(156, 178, 95, 183, 147, 191, 67, 153, 129, 119, 195,
150, 172, 97, 110, 115, 78, 126, 175, 80)
df_np <- mutate(df, y = rnorm(N, mu_i[subj], sigma))
df_np
## # A tibble: 1,000 × 3
## row subj y
## <int> <int> <dbl>
## 1 1 1 159.
## 2 2 1 158.
## 3 3 1 157.
## # … with 997 more rows
Fit it in brms . By using the formula 0 + factor(subj) , we remove the common intercept and
force the model to estimate one intercept for each level of subj . The column subj is
converted to a factor so that brms does not interpret it as a number.
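A sketch with placeholder priors (the model name fit_np is an assumption):

fit_np <- brm(y ~ 0 + factor(subj),
  data = df_np,
  prior = c(
    prior(normal(0, 500), class = b),
    prior(normal(0, 500), class = sigma)
  )
)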
The summary now shows the 20 estimates of μi, labeled b_factorsubj, together with σ. (We ignore lp__ and lprior.)
Unlike the hierarchical model, the no pooling model has no common distribution that generates the μi. An important practical consequence of partial pooling is the idea of “borrowing strength from the mean”: if we have very sparse data from a particular member of a group (e.g., missing data from a particular subject), the estimate μi for that group member is determined largely by the parameter μ. In other words, when the data are sparse for group member n, the posterior estimate μi is determined largely by the prior p(μ). In this sense, even the frequentist hierarchical modeling software in R, lmer from the package lme4, is essentially Bayesian in formulation (except, of course, that there is no prior as such on μ).
So far, we have focused on the structure of μ, the location parameter of the likelihood. We could also have partial pooling, complete pooling, or no pooling with respect to σ, the scale parameter of the likelihood. More generally, any parameter of a likelihood can have any of these kinds of pooling. In the coming sections, we will look at each of these models in more detail, using realistic examples.
Formally, we say that the random variables Y1 , … , YN are finitely exchangeable if, for
any set of particular outcomes of an experiment y1 , … , yN , the probability p(y1 , … , yN )
that we assign to these outcomes is unaffected by permuting the labels given to the
variables. In other words, for any permutation π(n) , where n = 1, … , N ( π is a function
that takes as input the positive integer n and returns another positive integer; e.g., the
function takes a subject indexed as 1, and returns index 3), we can reasonably assume
that p(y1 , … , yN ) = p(yπ(1) , … , yπ(N ) ) . A simple example is a coin tossed twice.
Suppose the first coin toss is Y1 = 1 , a heads, and the second coin toss is Y2 = 0 , a tails.
If we are willing to assume that the probability of getting one heads is unaffected by
whether it appears in the first or the second toss, i.e.,
p(Y1 = 1, Y2 = 0) = p(Y1 = 0, Y2 = 1) , then we assume that the indices are
exchangeable.
Some important connections and differences between exchangeability and the frequentist
concept of independent and identically distributed (iid):
If the data are exchangeable, they are not necessarily iid. For example, suppose
you have a box with one black ball and two red balls in it. Your task is to repeatedly
draw a ball at random. Suppose that in your first draw, you draw one ball and get the
black ball. The probability of getting a black ball in the next two draws is now 0.
However, if in your first draw you had retrieved a red ball, then there is a non-zero
probability of drawing a black ball in the next two draws. The outcome in the first draw
affects the probability of subsequent draws: they are not independent. But the
sequence of random variables is exchangeable. To see this, consider the following: If
a red ball is drawn, count it as a 0, and if a black ball is drawn, then count it as 1.
Then, the three possible outcomes and their probabilities are:

1, 0, 0: P(X1 = 1, X2 = 0, X3 = 0) = 1/3 × 1 × 1 = 1/3
0, 1, 0: P(X1 = 0, X2 = 1, X3 = 0) = 2/3 × 1/2 × 1 = 1/3
0, 0, 1: P(X1 = 0, X2 = 0, X3 = 1) = 2/3 × 1/2 × 1 = 1/3
If the data are exchangeable, then they are identically distributed. For example,
in the box containing one black ball and two red balls, suppose we count the draw of
a black ball as a 1, and the draw of a red ball as a 0. Then P(X1 = 1) = 1/3 and P(X1 = 0) = 2/3; this is also true for X2 and X3. That is, these random variables are identically distributed.
If the data are iid in the standard frequentist sense, then they are exchangeable.
For example, suppose you have i = 1, … , n instances of a random variable X
whose PDF is f(x). Suppose also that the Xi are iid. The joint PDF (this can be discrete or continuous, i.e., a PMF or PDF) is

f(x1, … , xn) = f(x1) × f(x2) × ⋯ × f(xn).

Because the terms on the right-hand side can be permuted, the labels can be permuted on any of the xi. This means that X1, … , Xn are exchangeable.
5.2 A hierarchical model with a normal likelihood: The N400 effect
FIGURE 5.4: Typical ERP for the grand average across the N400 spatial window (central
parietal electrodes: Cz, CP1, CP2, P3, Pz, P4, POz) for high and low predictability nouns
(specifically from the constraining context of the experiment reported in Nicenboim, Vasishth,
and Rösler 2020a). The x-axis indicates time in seconds and the y-axis indicates voltage in
microvolts (unlike many EEG/ERP plots, the negative polarity is plotted downwards).
For example, in (1) below, the continuation ‘paint’ has higher predictability than the
continuation ‘dog’, and thus we would expect a more negative signal, that is, an N400 effect,
in ‘dog’ in (b) in comparison with ‘paint’ in (a). It is often the case that predictability is
measured with a cloze task (see section 1.4).
The EEG data are typically recorded in tens of electrodes every couple of milliseconds, but for
our purposes (i.e., for learning about Bayesian hierarchical models), we can safely ignore the
complexity of the data. A common way to simplify the high-dimensional EEG data when we are
dealing with the N400 is to focus on the average amplitude of the EEG signal at the typical
spatio-temporal window of the N400 (for example, see Frank et al. 2015).
For this example, we are going to focus on the N400 effect for critical nouns from a subset of
the data of Nieuwland et al. (2018). Nieuwland et al. (2018) presented a replication attempt of
an original experiment of DeLong, Urbach, and Kutas (2005) with sentences like (2).
We’ll ignore the goal of the original experiment (DeLong, Urbach, and Kutas 2005) and its
replication (Nieuwland et al. 2018). We are going to focus on the N400 at the final nouns in the
experimental stimuli. In example (2), for example, the final noun ‘kite’ has higher predictability
than ‘airplane’, and thus we would expect a more negative signal in ‘airplane’ in (b) in
comparison with ‘kite’ in (a).
To speed up computation, we restrict the data set of Nieuwland et al. (2018) to the subjects
from the Edinburgh lab. This subset of the data can be found in df_eeg in the bcogsci
package. Center the cloze probability before using it as a predictor.
data("df_eeg")
(df_eeg <- df_eeg %>%
mutate(c_cloze = cloze - mean(cloze)))
## # A tibble: 2,863 × 7
## subj cloze item n400 cloze_ans N c_cloze
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 7.08 0 44 -0.476
## 2 1 0.03 2 -0.68 1 44 -0.446
## 3 1 1 3 1.39 44 44 0.524
## # … with 2,860 more rows
# Number of subjects
df_eeg %>%
distinct(subj) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 37
One convenient aspect of using averages of EEG data is that they are roughly normally
distributed. This allows us to use the normal likelihood. Figure 5.5 shows the distribution of the
data.
df_eeg %>% ggplot(aes(n400)) +
geom_histogram(
binwidth = 4,
colour = "gray",
alpha = .5,
aes(y = after_stat(density))
) +
stat_function(fun = dnorm, args = list(
mean = mean(df_eeg$n400),
sd = sd(df_eeg$n400)
)) +
xlab("Average voltage in microvolts for
the N400 spatiotemporal window")
FIGURE 5.5: Histogram of the N400 averages for every trial, overlaid is a density plot of a
normal distribution.
5.2.1 Complete pooling model (Mcp )
We’ll start with the simplest model, which is essentially the linear regression we encountered in the preceding chapter. The model makes the following assumptions:
1. The EEG averages for the N400 spatiotemporal window are normally distributed.
2. Observations are independent.
3. There is a linear relationship between cloze and the EEG signal for the trial.
This model is incorrect for these data due to assumption (2) being violated.
With the last assumption, we are saying that the difference in the average signal when we
compare nouns with cloze probability of 0 and 0.1 is the same as the difference in the signal
when we compare nouns with cloze values of 0.1 and 0.2 (or 0.9 and 1). This is just an
assumption, and it may not necessarily be the case in the actual data. This means that we are
going to get a posterior for β conditional on the assumption that the linear relationship holds.
Even if it approximately holds, we still don’t know how much we deviate from this assumption.
The likelihood is

signaln ∼ Normal(α + c_clozen ⋅ β, σ)

where n = 1, … , N, and signal is the dependent variable (the average signal in the N400 spatiotemporal window, in microvolts). The variable N represents the total number of data points.
As always we need to rely on our previous knowledge and domain expertise to decide on
priors. We know that ERPs (signals time-locked to a stimulus) have mean amplitudes of a
couple of microvolts: This is easy to see in any plot of the EEG literature. This means that we
don’t expect the effect of our manipulation to exceed, say, 10μV . As before, a priori we’ll
assume that effects can be negative or positive. We can quantify our prior knowledge
regarding plausible values of β as normally distributed centered at zero with a standard
deviation of 10μV. (Other values, such as 5μV, would also have been reasonable: a standard deviation of 5 entails that 95% of the prior probability mass lies between −10μV and 10μV.)
If the signal for each ERP is baselined, that is, the mean signal of a time window before the
time window of interest is subtracted from the time window of interest, then the mean signal
would be relatively close to 0. Since we know that the ERPs were baselined in this study, we
expect that the grand mean of our signal should be relatively close to zero. Our prior for α is
then normally distributed centered in zero with a standard deviation of 10μV as well.
The standard deviation of our signal distribution is harder to guess. We know that EEG signals
are quite noisy, and that the standard deviation must be higher than zero. Our prior for σ is a
truncated normal distribution with location zero and scale 50. Recall that since we truncate the
distribution, the parameters location and scale do not correspond to the mean and standard
deviation of the new distribution; see Box 4.1.
We can draw random samples from this truncated distribution and calculate their mean and
standard deviation:
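A sketch of the sampling step, using rtnorm() from the extraDistr package (a = 0 truncates the distribution at zero; the object name samples_sigma is an assumption):

library(extraDistr)
samples_sigma <- rtnorm(20000, mean = 0, sd = 50, a = 0)
c(mean = mean(samples_sigma), sd = sd(samples_sigma))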
## mean sd
## 39.6 29.8
So we are essentially saying that we assume a priori that we will find the true standard
deviation of the signal in the following interval with 95% probability:
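Continuing the sketch:

quantile(samples_sigma, probs = c(0.025, 0.975))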
## 2.5% 97.5%
## 1.63 111.08
# Analytically:
# c(qtnorm(.025, 0, 50, a = 0), qtnorm(.975, 0, 50, a = 0))
α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
σ ∼ Normal+(0, 50)
A model such as Mcp is sometimes called a fixed-effects model: all the parameters are fixed
in the sense that they do not vary from subject to subject or from item to item. A similar frequentist
model would correspond to fitting a simple linear model using the lm function: lm(n400 ~ 1 +
cloze, data = df_eeg) .
We fit this model in brms as follows (the default family is gaussian() so we can omit it). As
with the lm function in R, by default an intercept is fitted and thus n400 ~ c_cloze is
equivalent to n400 ~ 1 + c_cloze :
# the brm() header is an assumed completion (truncated in
# extraction); the formula follows from the text above:
fit_N400_cp <- brm(n400 ~ c_cloze,
  prior =
    c(
      prior(normal(0, 10), class = Intercept),
      prior(normal(0, 10), class = b, coef = c_cloze),
      prior(normal(0, 50), class = sigma)
    ),
  data = df_eeg
)
For now, check the summary, and plot the posteriors of the model (Figure 5.6).
fit_N400_cp
## ...
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 3.66 0.22 3.22 4.08 1.00 3753 2742
## c_cloze 2.27 0.55 1.22 3.33 1.00 3614 2917
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 11.82 0.15 11.52 12.13 1.00 4339 3010
##
## ...
plot(fit_N400_cp)
FIGURE 5.6: The posterior distributions of the parameters in the brms model fit_N400_cp (b_Intercept, b_c_cloze, and sigma), along with the corresponding trace plots.
5.2.2 No pooling model (Mnp)

One of the assumptions of the previous model is clearly wrong: observations are not independent; they are clustered by subject (and also by the specific item, but we’ll ignore this until section 5.2.4). It is reasonable to assume that EEG signals are more similar within subjects than between them. The following model assumes that each subject is completely independent of the others.16 The assumptions are:
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Every subject’s model is fit independently of the other subjects; the subjects have no
parameters in common (an exception is the standard deviation, σ; this is the same for all
subjects in Equation (5.2)).
3. There is a linear relationship between cloze and the EEG signal for the trial.
As before, n represents each observation (the n-th row in the data frame, which has N rows), and now the index i identifies the subject. The notation subj[n], which roughly follows Gelman and Hill (2007), identifies the subject index; for example, if subj[10] = 3, then the 10th observation belongs to subject 3. The likelihood and priors are:

signaln ∼ Normal(αsubj[n] + c_clozen ⋅ βsubj[n], σ)

αi ∼ Normal(0, 10)
βi ∼ Normal(0, 10)
σ ∼ Normal+(0, 50)
In brms , such a model can be fit by removing the common intercept with the formula n400 ~
0 + factor(subj) + c_cloze:factor(subj) .
This formula forces the model to estimate one intercept and one slope for each level of
subj .17 The by-subject intercepts are indicated with factor(subj) and the by-subject
slopes with c_cloze:factor(subj) . It’s very important to specify that subject should be
treated as a factor and not as a number; we don’t assume that subject number 3 will show 3
times more positive (or negative) average signal than subject number 1! The model fits 37
independent intercepts and 37 independent slopes. By setting a prior to class = b and
omitting coef , we are essentially setting identical priors to all the intercepts and slopes of the
model. The parameters are independent from each other; it is only our previous knowledge (or
prior beliefs) about their possible values (encoded in the priors) that is identical. We can set
different priors to each intercept and slope, but that will mean setting 74 priors!
# assumed completion, following the text: one prior for all the
# by-subject coefficients (class = b, no coef), plus a sigma prior
fit_N400_np <- brm(n400 ~ 0 + factor(subj) + c_cloze:factor(subj),
  prior = c(
    prior(normal(0, 10), class = b),
    prior(normal(0, 50), class = sigma)
  ),
  data = df_eeg
)
For this model, printing a summary means printing the 75 parameters (α1,...,37 , β1,...,37 , and σ).
We could do this as always by printing out the model results: just type fit_N400_np .
It may be easier to understand the output of the model by plotting β1,..,37 using bayesplot .
( brms also includes a wrapper for this function called stanplot .) We can take a look at the
internal names that brms gives to the parameters with variables(fit_N400_np) ; they are
b_factorsubj , then the subject index and then :c_cloze . The code below changes the
subject labels back to their original numerical indices and plots them in Figure 5.7. The
subjects are ordered by the magnitude of their mean effects.
The model Mnp does not estimate a unique population-level effect; instead, a separate effect is estimated for each subject. However, given the posterior means from each subject, it is still possible to calculate the average of these estimates, β̂1,…,I, where I is the total number of subjects:
# parameter name of beta by subject:
ind_effects_np <- paste0(
  "b_factorsubj",
  unique(df_eeg$subj), ":c_cloze"
)
# assumed completion: average the by-subject posteriors draw by
# draw, then summarize with the mean and 95% quantiles
beta_across_subj <- as_draws_df(fit_N400_np) %>%
  select(all_of(ind_effects_np)) %>%
  rowMeans()
tibble(mean = mean(beta_across_subj),
       lq = quantile(beta_across_subj, .025),
       hq = quantile(beta_across_subj, .975))
## # A tibble: 1 × 3
##    mean    lq    hq
##   <dbl> <dbl> <dbl>
## 1  2.17  1.17  3.16
In Figure 5.7, the 95% credible interval of this overall mean effect is plotted as two vertical
lines together with the effect of cloze probability for each subject (ordered by effect size).
Here, rather than using a plotting function from brms , we can extract the summary of by-
subject effects, reorder them by magnitude, and then plot the summary with a custom plot
using ggplot2 .
# make a table of beta's by subject
beta_by_subj <- posterior_summary(fit_N400_np,
  variable = ind_effects_np
) %>%
  as.data.frame() %>%
  # assumed completion: attach subject ids, ordered by effect size
  mutate(subj = unique(df_eeg$subj)) %>%
  arrange(Estimate) %>%
  mutate(subj = factor(subj, levels = subj))
ggplot(
  beta_by_subj,
  aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subj)
) +
  geom_point() +
  geom_errorbarh() +
  xlab("By-subject effect of cloze probability in microvolts")
FIGURE 5.7: 95% credible intervals of the effect of cloze probability for each subject
according to the no pooling model, fit_N400_np . The solid vertical line represents the mean
over all the subjects; and the broken vertical lines mark the 95% credible interval for this
mean.
5.2.3 Varying intercepts and varying slopes model (Mv )
One major problem with the no-pooling model is that we completely ignore the fact that the
subjects were doing the same experiment. We fit each subject’s data ignoring the information
available in the other subjects’ data. The no-pooling model is very likely to overfit the
individual subjects’ data; we are likely to ignore the generalities of the data and we may end
up overinterpreting noisy estimates from each subject’s data. The model can be modified to
explicitly assume that the subjects have an overall effect common to all the subjects, with the
individual subjects deviating from this common effect.
In the model that we fit next, we will assume that there is an overall effect that is common to
the subjects and, importantly, that all subjects’ parameters originate from one common
(normal) distribution. This model specification will result in the estimation of posteriors for each
subject being also influenced by what we know about all the subjects together. We begin with
a hierarchical model with uncorrelated varying intercepts and slopes. The analogous
frequentist model can be fit using lmer from the package lme4 , using (1+c_cloze||subj)
or, equivalently, (c_cloze||subj) for the by-subject random effects.
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Each subject deviates to some extent (this is made precise below) from the grand mean
and from the mean effect of predictability. This implies that there is some between-subject
variability in the individual-level intercept and slope adjustments by subject.
3. There is a linear relationship between cloze and the EEG signal.
The likelihood now incorporates the assumption that both the intercept and slope are adjusted by subject:

signaln ∼ Normal(α + usubj[n],1 + c_clozen ⋅ (β + usubj[n],2), σ)

Priors:

α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
u1 ∼ Normal(0, τu1)
u2 ∼ Normal(0, τu2)
In this model, each subject has their own intercept adjustment, usubj,1, and slope adjustment, usubj,2.18 If usubj,1 is positive, the subject will have a more positive EEG signal than the grand
mean average. If usubj,2 is positive, the subject will have a more positive EEG response to a
change of one unit in c_cloze than the overall mean effect (i.e., there will be a more positive
effect of cloze probability on the N400). The parameters u are sometimes called random
effects and thus a model with fixed effects (α and β) and random effects is called a mixed
model. However, random effects have different meanings in different contexts. To avoid
ambiguity, brms calls these random-effects parameters group-level effects. Since we are
estimating α and u at the same time and we assume that the average of the u’s is 0 (since it
is assumed to be normally distributed with mean 0), what is common between the subjects,
the grand mean, is estimated as the intercept α , and the deviations of individual subjects’
means from this grand mean are the adjustments u1 . Similarly, the mean effect of cloze is
estimated as β, and the deviations of individual subjects’ mean effects of cloze from β are the
adjustment u2. The standard deviations of these two adjustment terms, τu1 and τu2, respectively, represent between-subject variability; see Box 5.2.
The by-subject adjustments u1 and u2 are parameters in the model, and therefore have priors
defined on them. By contrast, in the frequentist lmer model, the adjustments u1 and u2 are
not parameters; they are called conditional modes; see Bates, Mächler, et al. (2015b).
Parameters that appear in the prior specifications of other parameters, such as τu, are often called hyperparameters,19 and the priors on such hyperparameters are called hyperpriors. Thus, the parameter u1 has Normal(0, τu1) as a prior; τu1 is a hyperparameter, and the hyperprior on τu1 is Normal+(0, 20).20
We know that in general, in EEG experiments, the standard deviations for the by-subject
adjustments are smaller than the standard deviation of the observations (which is the within-
subjects standard deviation). That is, usually the between-subject variability in the intercepts
and slopes is smaller than the within-subjects variability in the data. For this reason, reducing
the scale of the truncated normal distribution to 20 (in comparison to 50 ) seems reasonable
for the priors of the τ parameters. As always, we can do a sensitivity analysis to verify that our
priors are reasonably uninformative (if we intended them to be uninformative).
αi ∼ Normal(α, τu1)
βi ∼ Normal(β, τu2)
Mostly by convention, the adjustments u are assumed to come from a normal distribution.
Another reason is that if we don’t know anything about the distribution besides its mean
and variance, the normal distribution is the most conservative assumption (see chapter 9
of McElreath 2020).
For now, we are assuming that there is no relationship (no correlation) between the by-subject
intercept and slope adjustments u1 and u2 ; this lack of correlation is indicated in brms using
the double pipe || . The double pipe is also used in the same way in lmer from the
package lme4 (in fact brms bases its syntax on that of the lme4 package).
In brms , we need to specify hyperpriors for τu1 and τu2 ; these are called sd in brms , to
distinguish these standard deviations from the standard deviation of the residuals σ. As with
the population-level effects, the by-subjects intercept adjustments are implicitly fit for the
group-level effects and thus (c_cloze || subj) is equivalent to (1 + c_cloze || subj) . If
we don’t want an intercept we need to explicitly indicate it with (0 + c_cloze || subj) or (-1
+ c_cloze || subj) . Such a removal of the intercept is not normally done.
prior_v <-
  c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b, coef = c_cloze),
    prior(normal(0, 50), class = sigma),
    # assumed completion: a common hyperprior for both
    # by-subject standard deviations (the taus)
    prior(normal(0, 20), class = sd)
  )
fit_N400_v <- brm(n400 ~ c_cloze + (c_cloze || subj),
  prior = prior_v, data = df_eeg)
When we print a brms fit, we first see the summaries of the posteriors of the standard deviations of the by-group intercepts and slopes, τu1 and τu2, as sd(Intercept) and sd(c_cloze); then, as with previous models, the population-level effects α and β as Intercept and c_cloze; and the scale of the likelihood, σ, as sigma. The full summary can be printed out by typing:
fit_N400_v
Because the above command will result in some pages of output, it is easier to understand the
summary graphically (Figure 5.8). Rather than the wrapper plot() , we use the original
function of the package bayesplot , mcmc_dens() , to only show density plots. We extract the
first 5 parameters of the model with variables(fit_N400_v)[1:5] .
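A sketch of the call; bayesplot’s mcmc_* functions accept any object with an as.array() method, which brms fits provide:

mcmc_dens(fit_N400_v, pars = variables(fit_N400_v)[1:5])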
FIGURE 5.8: Density plots of the posteriors of the first five parameters of the model fit_N400_v (among them sd_subj__c_cloze and sigma).
# make a table of u_2s
ind_effects_v <- paste0("r_subj[", unique(df_eeg$subj),
                        ",c_cloze]")
u_2_v <- posterior_summary(fit_N400_v, variable = ind_effects_v) %>%
  as_tibble() %>%
  # assumed completion: attach subject ids, ordered by effect size
  mutate(subj = unique(df_eeg$subj)) %>%
  arrange(Estimate) %>%
  mutate(subj = factor(subj, levels = subj))
ggplot(
  u_2_v,
  aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subj)
) +
  geom_point() +
  geom_errorbarh() +
  xlab("By-subject adjustment to the slope in microvolts")
FIGURE 5.9: 95% credible intervals of the adjustments to the effect of cloze probability for each subject (u2,1,…,37) according to the varying intercepts and varying slopes model, fit_N400_v. To obtain the effect of cloze probability for each subject, we would need to add the estimate of β to each adjustment.
There is an important difference between the no-pooling model and the varying intercepts and
slopes model we just fit. The no-pooling model fits each individual subject’s intercept and
slope independently for each subject. By contrast, the varying intercepts and slopes model
takes all the subjects’ data into account in order to compute the fixed effects α and β; and the
model “shrinks” (Pinheiro and Bates 2000) the by-subject intercept and slope adjustments
towards the fixed effects estimates. In Figure 5.10, we can see the shrinkage of the estimates
in the varying intercepts and varying slopes model by comparing them with the estimates of the no pooling model (Mnp).
# Extract parameter estimates from the no pooling model:
par_np <- posterior_summary(fit_N400_np, variable = ind_effects_np) %>%
  as_tibble() %>%
  mutate(
    model = "No pooling",
    subj = unique(df_eeg$subj)
  )
# For the hierarchical model, add the overall effect (beta) to each
# subject's adjustment (assumed completion; parts of this pipeline
# were truncated in extraction):
draws_v <- as_draws_df(fit_N400_v)
beta <- draws_v$b_c_cloze
adjustment <- draws_v %>%
  select(all_of(ind_effects_v))
# Remove all the draws meta data by using as.data.frame
by_subj_effect <- as.data.frame(beta + as.data.frame(adjustment))
# Summarize them by getting a table with the mean and the
# quantiles for each column and then binding them.
par_h <- lapply(by_subj_effect, function(x) {
  tibble(Estimate = mean(x),
         Q2.5 = quantile(x, .025),
         Q97.5 = quantile(x, .975))
}) %>%
  bind_rows() %>%
  # Add a column to identify the model,
  # and one with the subject labels:
  mutate(
    model = "Hierarchical",
    subj = unique(df_eeg$subj)
  )
# The mean and 95% CI of both models in one data frame:
by_subj_df <- bind_rows(par_h, par_np) %>%
  arrange(Estimate) %>%
  mutate(subj = factor(subj, levels = unique(.data$subj)))
ggplot(
by_subj_df,
aes(
ymin = Q2.5, ymax = Q97.5, x = subj, y = Estimate, color = model,
shape = model
)
) +
geom_errorbar(position = position_dodge(1)) +
geom_point(position = position_dodge(1)) +
# We'll also add the mean and 95% CrI of the overall difference
# to the plot:
geom_hline(
yintercept =
posterior_summary(fit_N400_v,
variable = "b_c_cloze")[, "Estimate"]
) +
geom_hline(
yintercept =
posterior_summary(fit_N400_v,
variable = "b_c_cloze")[, "Q2.5"],
linetype = "dotted", linewidth = .5
) +
geom_hline(
yintercept =
posterior_summary(fit_N400_v,
variable = "b_c_cloze")[, "Q97.5"],
linetype = "dotted", linewidth = .5
) +
xlab("N400 effect of predictability") +
coord_flip()
FIGURE 5.10: This plot compares the estimates of the effect of cloze probability for each
subject between (i) the no pooling, fit_N400_np and (ii) the varying intercepts and varying
slopes, hierarchical, model, fit_N400_v .
5.2.4 Correlated varying intercept varying slopes model (Mh)
The model Mv allowed for differences in intercepts (mean voltage) and slopes (effects of
cloze) across subjects, but it has the implicit assumption that these varying intercepts and
varying slopes are independent. It is in principle possible that subjects showing more negative
voltage may also show stronger effects (or weaker effects). Next, we fit a model that allows a
correlation between the intercepts and slopes. We model the correlation between varying
intercepts and slopes by defining a variance-covariance matrix Σ between the by-subject
varying intercepts and slopes, and by assuming that both adjustments (intercept and slope)
come from a multivariate (in this case, a bivariate) normal distribution.
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Some aspects of the mean signal voltage and of the effect of predictability depend on the
subject, and these two might be correlated, i.e., we assume group-level intercepts and
slopes, and allow a correlation between them by-subject.
3. There is a linear relationship between cloze and the EEG signal for the trial.
The likelihood remains identical to the model Mv , which assumes no correlation between
group-level intercepts and slopes (section 5.2.3):
The correlation is indicated in the priors on the adjustments for intercept u1 and slopes u2 .
Priors:
α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
(ui,1, ui,2)⊤ ∼ N((0, 0)⊤, Σu)
In this model, a bivariate normal distribution generates the varying intercepts and varying
slopes u ; this is an n × 2 matrix. The variance-covariance matrix Σu defines the standard
deviations of the varying intercepts and varying slopes, and the correlation between them.
Recall from section 1.6.2 that the diagonals of the variance-covariance matrix contain the
variances of the correlated random variables, and the off-diagonals contain the covariances.
In this example, the covariance Cov(u1 , u2 ) between two variables u1 and u2 is defined as
the product of their correlation ρu and their standard deviations τu1 and τu2. In other words, Cov(u1, u2) = ρu τu1 τu2.
Σu = ( τu1²         ρu τu1 τu2
       ρu τu1 τu2   τu2²       )
In order to specify a prior for Σu, we need priors for the standard deviations τu1 and τu2, and
also for their correlation, ρu . We can use the same priors for τ as before. For the correlation
parameter ρu (and the correlation matrix more generally), we use the LKJ prior. The basic idea
of the LKJ prior on the correlation matrix is that as its parameter, η (eta), increases, it will favor
correlations closer to zero.21 At η = 1 , the LKJ correlation distribution is uninformative (similar
to Beta(1, 1) ), at η < 1 , it favors extreme correlations (similar to Beta(a < 1, b < 1) ). We
set η = 2 so that we don’t favor extreme correlations, and we still represent our lack of
knowledge through the wide spread of the prior between −1 and 1. Thus, η = 2 gives us a
regularizing, relatively uninformative or mildly informative prior.
FIGURE 5.11: Visualization of the LKJ correlation distribution prior with four different values of
the η parameter.
τu1, τu2 ∼ Normal+(0, 20)
ρu ∼ LKJcorr(2)
In our brms model, we allow a correlation between the by-subject intercepts and slopes by
using a single pipe | instead of the double pipe || that we used previously. This
convention follows that in the frequentist lmer function. As before, the varying intercepts are
implicitly fit.
Because we have a new parameter, the correlation ρu, we need to add a prior for it; in brms, this is achieved by adding a prior for the parameter class cor.
# assumed completion: the first priors were truncated in
# extraction; they follow the pattern of prior_sih_full below
prior_h <- c(
  prior(normal(0, 10), class = Intercept),
  prior(normal(0, 10), class = b, coef = c_cloze),
  prior(normal(0, 50), class = sigma),
  prior(normal(0, 20),
    class = sd, coef = Intercept,
    group = subj
  ),
  prior(normal(0, 20),
    class = sd, coef = c_cloze,
    group = subj
  ),
  prior(lkj(2), class = cor, group = subj)
)
fit_N400_h <- brm(n400 ~ c_cloze + (c_cloze | subj),
  prior = prior_h,
  data = df_eeg
)
The estimates do not change much in comparison with the varying intercepts/slopes model, probably because the estimate of the correlation is quite poor (i.e., there is a lot of uncertainty). As before, we show the estimates graphically (Figure 5.12). One can access the complete summary, as always, by typing fit_N400_h.
plot(fit_N400_h, N = 6)
FIGURE 5.12: The posterior distributions of the parameters in the brms model fit_N400_h, along with the corresponding trace plots.
We are now halfway to what is sometimes called the “maximal” hierarchical model (Barr et al. 2013). This usually refers to a model with all the by-participant and by-item group-level variance components allowed by the experimental design, and a full variance-covariance matrix for all the group-level parameters. Not all variance components are allowed by the
matrix for all the group-level parameters. Not all variance components are allowed by the
experimental design: in particular, between-group manipulations cannot have variance
components. For example, even if we assume that the working memory capacity of the
subjects might affect the N400, we cannot measure how working memory affects the subjects
differently.
As we will see in section 5.2.6 and in chapter 7, “maximal” is a misnomer for Bayesian models, since the term mostly reflects limitations of the popular frequentist package for fitting such models, lme4. The next section spells out a model with a full variance-covariance matrix for both the by-subject and the by-item group-level effects.
5.2.5 By-subjects and by-items correlated varying intercept varying slopes model (Msih)

Our new model, Msih, will allow for differences in intercepts (mean voltage) and slopes (effects of predictability) across subjects and across items. In typical Latin square designs, subjects and items are said to be crossed random effects: each subject sees exactly one instance of each item. Here we assume a possible correlation between varying intercepts and slopes by subjects, and another one by items. The assumptions are:
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Some aspects of the mean signal voltage and of the effect of predictability depend on the
subject, i.e., we assume group-level intercepts, and slopes, and a correlation between
them by-subject.
3. Some aspects of the mean signal voltage and of the effect of predictability depend on the
item, i.e., we assume group-level intercepts, and slopes, and a correlation between them
by-item.
4. There is a linear relationship between cloze and the EEG signal for the trial.
Likelihood:

signaln ∼ Normal(α + usubj[n],1 + witem[n],1 + c_clozen ⋅ (β + usubj[n],2 + witem[n],2), σ)

Priors:

α ∼ Normal(0, 10)
β ∼ Normal(0, 10)
(ui,1, ui,2)⊤ ∼ N((0, 0)⊤, Σu)
(wj,1, wj,2)⊤ ∼ N((0, 0)⊤, Σw)
We have added the index j, which represents each item, as we did with subjects; item[n]
indicates the item that corresponds to the observation in the n -th row of the data frame.
Σu = ( τu1²         ρu τu1 τu2
       ρu τu1 τu2   τu2²       )

Σw = ( τw1²         ρw τw1 τw2
       ρw τw1 τw2   τw2²       )

τu1, τu2, τw1, τw2 ∼ Normal+(0, 20)
ρu ∼ LKJcorr(2)
ρw ∼ LKJcorr(2)
We set identical priors for the by-item group-level effects as for the by-subject group-level effects, but this is only because we don’t have any differentiated prior information about subject-level vs. item-level variation. Bear in mind, however, that the estimation for items is completely independent of the estimation for subjects. Although we wrote many more equations than before, the brms model is quite straightforward to extend:
prior_sih_full <-
c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = subj
),
prior(normal(0, 20),
class = sd, coef = c_cloze,
group = subj
),
prior(lkj(2), class = cor, group = subj),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = item
),
prior(normal(0, 20),
class = sd, coef = c_cloze,
group = item
),
prior(lkj(2), class = cor, group = item)
)
fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) +
(c_cloze | item),
prior = prior_sih_full,
data = df_eeg)
We can also simplify the call to brms , when we assign the same priors to the by-subject and
by-item parameters:
prior_sih <-
  c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b),
    prior(normal(0, 50), class = sigma),
    # assumed completion: one prior for all group-level SDs,
    # and one for all correlations
    prior(normal(0, 20), class = sd),
    prior(lkj(2), class = cor)
  )
fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) +
  (c_cloze | item),
  prior = prior_sih,
  data = df_eeg
)
We have new group-level effects in the summary, but again the estimate of the effect of cloze
remains virtually unchanged (Figure 5.13).
fit_N400_sih
plot(fit_N400_sih, N = 9)
FIGURE 5.13: The posterior distributions of the parameters in the model fit_N400_sih.
We can use posterior predictive checks to verify that our last model can capture the entire
signal distribution. This is shown in Figure 5.14.
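A minimal sketch of how such a check can be produced with brms's pp_check; the ndraws value is our choice, and dens_overlay is the default type:

pp_check(fit_N400_sih, type = "dens_overlay", ndraws = 100)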
[Figure: overlaid densities of the observed signal (y) and of draws from the posterior predictive distribution (y_rep).]
FIGURE 5.14: Overlay of densities from the posterior predictive distributions of the model fit_N400_sih.
However, we know that in ERP studies, large levels of impedance between the recording
electrodes and the skin tissue increase the noise in the recordings (Picton et al. 2000). Given
that skin tissue is different between subjects, it could be the case that the level of noise varies
by subject. It might be a good idea to verify that our model is good enough for capturing the
by-subject variability. The code below produces Figure 5.15.
ppc_dens_overlay_grouped(df_eeg$n400,
yrep =
posterior_predict(fit_N400_sih,
ndraws = 100
),
group = df_eeg$subj
) +
xlab("Signal in the N400 spatiotemporal window")
[Figure: by-subject panels overlaying the observed density (y) and 100 posterior predictive densities (y_rep); x-axis: Signal in the N400 spatiotemporal window.]
FIGURE 5.15: The plot shows 100 posterior predictive distributions (labeled y_rep) and the density of the observed average signal (labeled y) for each of the 37 subjects that participated in the experiment.
Figure 5.15 hints that we might be misfitting some subjects: Some of the by-subject observed
distributions of the EEG signal averages look much tighter than their corresponding posterior
predictive distributions (e.g., subjects 3, 5, 9, 10, 14), whereas some other by-subject
observed distributions look wider (e.g., subjects 25, 26, 27). Another approach to examine
whether we misfit the by-subject noise level is to plot posterior distributions of the standard
deviations and compare them with the observed standard deviation. This is achieved in the
following code, which groups the data by subject, and shows the distribution of standard
deviations. The result is shown in Figure 5.16. It is clear now that, for some subjects, the
observed standard deviation lies outside the distribution of predictive standard deviations.
pp_check(fit_N400_sih,
type = "stat_grouped",
ndraws = 1000,
group = "subj",
stat = "sd"
)
[Figure: by-subject panels showing the distribution of posterior predictive standard deviations, T(y_rep), with the observed standard deviation, T(y), marked as a vertical line.]
FIGURE 5.16: Distributions of posterior predicted standard deviations (gray) and the observed standard deviation (black vertical line), by subject.
Why is our “maximal” hierarchical model misfitting the by-subject distribution of data? This is because maximal models are, in general, implicitly models with the maximal group-level effect structure for the location parameter only (e.g., the mean, μ, in a normal model). Other parameters (e.g., scale or shape parameters) are estimated as auxiliary parameters, and are
assumed to be constant across observations and clusters. This assumption is so common that
researchers may not be aware that it is just an assumption. In the Bayesian framework, it is
easy to change such default assumptions if necessary. Changing the assumption that all
subjects have the same residual standard deviation leads to the distributional regression
model. Such models can be fit in brms; see also the brms vignette, https://cran.r-project.org/web/packages/brms/vignettes/brms_distreg.html.
We are going to change our previous likelihood so that the scale, σ, also has a group-level effect structure. We exponentiate the term inside σ to make sure that negative adjustments do not cause σ to become negative:

σ_n = exp(σ_α + σ_{u_{subj[n]}})
We just need to add priors for our new parameters (these replace the old prior for σ). We set the prior for the intercept of the standard deviation, σ_α, to be similar to our previous prior for σ. For the variance component of σ, τ_{σ_u}, we set a rather uninformative hyperprior. Recall that everything is exponentiated when it goes into the likelihood; that is why we use log(50) rather than 50 in the prior for σ_α.

σ_α ∼ Normal(0, log(50))
σ_u ∼ Normal(0, τ_{σ_u})
This model can be fit in brms using the function brmsformula() (or its shorter alias bf()). This is a powerful function that extends the formulas we have used so far, allowing us to set a hierarchical regression on any parameter of a model; here, it allows us to set a by-subject hierarchical structure on the parameter σ. We also need to set priors for the new parameters; these priors are identified by dpar = sigma.
prior_s <- c(
  prior(normal(0, 10), class = Intercept),
  prior(normal(0, 10), class = b),
  prior(normal(0, 20), class = sd),
  prior(lkj(2), class = cor),
  # priors on the new parameters of sigma (dpar = sigma); the scale of
  # the hyperprior for the by-subject sd of sigma is our assumption:
  prior(normal(0, log(50)), class = Intercept, dpar = sigma),
  prior(normal(0, 5), class = sd, group = subj, dpar = sigma)
)
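The model itself can then be fit with a formula created by bf(); a minimal sketch, given the population- and group-level structure introduced above:

fit_N400_s <- brm(
  bf(n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
     # brms uses a log link for sigma by default,
     # matching the exponentiation above:
     sigma ~ 1 + (1 | subj)),
  prior = prior_s,
  data = df_eeg
)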
Inspect the output of fit_N400_s; notice that our estimate for the effect of cloze remains very similar to that of the model fit_N400_sih.
ppc_dens_overlay_grouped(df_eeg$n400,
yrep =
posterior_predict(fit_N400_s,
ndraws = 100
),
group = df_eeg$subj
) +
xlab("Signal in the N400 spatiotemporal window")
[Figure: by-subject panels overlaying the observed density (y) and 100 posterior predictive densities (y_rep) from fit_N400_s; x-axis: Signal in the N400 spatiotemporal window.]
FIGURE 5.17: The gray density plots show 100 predicted distributions from a model that
includes a hierarchical structure for σ. The black density plots show the distribution of the
average signal data for the 37 subjects in the experiment.
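Figure 5.18 repeats the by-subject standard deviation check for the new model; a sketch mirroring the earlier pp_check call:

pp_check(fit_N400_s,
  type = "stat_grouped",
  ndraws = 1000,
  group = "subj",
  stat = "sd"
)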
[Figure: by-subject panels showing the distributions of posterior predictive standard deviations, T(y_rep), with the observed standard deviation, T(y), marked as a vertical line.]
FIGURE 5.18: The gray distributions show the posterior predicted standard deviations from a model that includes a hierarchical structure for σ; the black vertical lines show the observed standard deviation for each subject.
The model fit_N400_s raises the question: how much structure should we add to our
statistical model? Should we also assume that σ can vary by items, and also by our
experimental manipulation? Should we also have a maximal model for σ? Unfortunately, there
are no clear answers that apply to every situation. The amount of complexity that we can
introduce in a statistical model depends on (i) the answers we are looking for (we should
include parameters that represent what we want to estimate), (ii) the size of the data at hand
(more complex models require more data), (iii) our computing power (as complexity increases, models take longer to converge and require more resources to finish the computations in a feasible time frame), and (iv) our domain knowledge.
Whether certain effects should be included in a model also depends on whether they are
known to impact posterior inference or statistical testing (e.g., via Bayes factors). For
example, it is well known that estimating group-level effects for the location parameter can
have a strong influence on the test statistics for the corresponding population-level effect (Barr
et al. 2013; Schad et al. 2021). Given that population-level effects are often what researchers
care about, it is therefore important to consider group-level effects for the location parameter.
However, to our knowledge, it is not clear whether estimating group-level effects for the
standard deviation of the likelihood has an impact on inferences for the fixed effects. Maybe there is such an impact, but it is not widely known; statistical research would have to be conducted via simulations to assess whether such an influence can take place. The point here is that for
some effects, it’s crucial to include them in the model, because they are known to affect the
inferences that we want to draw from the data. Other model components may (presumably) be
less decisive. Which ones these are remains an open question for research.
Ultimately, all models are approximations (that's in the best case; often, they are plainly wrong), and we need to think carefully about which aspects of our data we have to account for and which aspects we can abstract away from.
In the context of cognitive modeling, James L. McClelland (2009a) argues that models should not focus on every single detail of the process they intend to explain. In order to be understood, a model needs to be simple enough. However, James L. McClelland (2009a) warns us that
one must bear in mind that oversimplification does have an impact on what we can conclude
from our analysis: A simplification can limit the phenomena that a model addresses, or can
even lead to incorrect predictions. There is a continuum between purely statistical models
(e.g., a linear regression) and computational cognitive models. For example, we can define
“hybrid” models such as the linear ballistic accumulator (Brown and Heathcote 2008; and see
Nicenboim 2018 for an implementation in Stan), where a great deal of cognitive detail is
sacrificed for tractability. The conclusions of James L. McClelland (2009a) apply to any type of
model in cognitive science: “Simplification is essential, but it comes at a cost, and real
understanding depends in part on understanding the effects of the simplification”.
Next, using data from Ebersole et al. (2016), we illustrate some of the issues that arise with a
log-normal likelihood in a hierarchical model. The data are from a Stroop task (Stroop 1935;
for a review, see MacLeod 1991). We will analyze a subset of the data of 3337 subjects that
participated in one variant of the Stroop task; this was part of a battery of tasks run in
Ebersole et al. (2016).
For this variant of the Stroop task, subjects were presented with one word at the center of the
screen (“red”, “blue”, or “green”). The word was written in either red, blue, or green color. In
one third of the trials, the word matched the color of the text (“congruent” condition); and in the
rest of the trials it did not match (“incongruent” condition). Subjects were instructed to only pay
attention to the color that the word was written in, and press 1 if the color was red, 2 if it
was blue, and 3 if it was green. In the incongruent condition, it is difficult to identify the color
when it mismatches the word that is written on the screen. For example, it is hard to respond
that the color is blue if the word written on the screen is green but the color it is presented in is
blue; naming the color blue here is difficult in comparison to a baseline condition (the
congruent condition), in which the word green appears in the color green. This increased
difficulty in the incongruent condition is called the Stroop effect; the effect is extremely robust
across variations in the task.
This task yields two measures: the accuracy of the decision made, and the time it took to
respond. For the Stroop task, accuracy is usually almost at ceiling; to simplify the model, we
will ignore accuracy. For a cognitive model that incorporates accuracy and response times into
a model to analyze these Stroop data, see Nicenboim (2018).
5.3.1 A correlated varying intercept varying slopes log-normal model
If our theory only focuses on the difference between the response times for the “congruent” vs.
“incongruent” condition, we can ignore the actual color presented and the word that was
written. We can simply focus on whether a trial was congruent or incongruent. Define a
predictor c_cond to represent these two conditions. For simplicity, we will also assume that
all subjects share the same variance (as we saw in section 5.2.6, changing this assumption
leads to distributional regression models).
The above assumptions mean that we are going to fit the data with the following likelihood. The likelihood function is identical to the one that we fit in section 5.2.4, except that here the location and scale are embedded in a log-normal likelihood rather than a normal likelihood. Equation (5.3) states that we are dealing with a hierarchical model with by-subject varying intercepts and varying slopes:

rt_n ∼ LogNormal(α + u_{subj[n],1} + c_cond_n · (β + u_{subj[n],2}), σ)    (5.3)
In chapter 8, we will discuss the sum-contrast coding of the two conditions (c_cond). For now, it suffices to say that we assign +1 to c_cond for the "incongruent" condition, and -1 for the "congruent" condition (i.e., a sum-contrast coding). Under this contrast coding, if
the posterior mean of the parameter β turns out to be positive, that would mean that the model
predicts that the incongruent condition has slower reaction times than the congruent one. This
is because on average the location of the log-normal likelihood for each condition will be as
follows. In Equation (5.4), μincongruent refers to the location of the incongruent condition, and
μcongruent to the location of the congruent condition.
μ_incongruent = α + 1 · β
μ_congruent = α + (−1) · β    (5.4)
We could have chosen to do the opposite contrast coding assignments: −1 for the
incongruent condition, and +1 for congruent condition. In that case, if the posterior mean of
the parameter β turns out to be positive, that would mean that the incongruent condition has a
faster reaction time than the congruent condition. Given that the Stroop effect is very robust,
we do not expect such an outcome. In order to make the β parameter easier to interpret, we
have chosen the contrast coding where a positive sign on the mean of β implies that the
incongruent condition has slower reaction times.
As always, we need priors for all the parameters in our model. For the population-level
parameters (or fixed effects), we use the same priors as we did when we were fitting a
regression with a log-normal likelihood in section 3.7.2.
α ∼ Normal(6, 1.5)
β ∼ Normal(0, 0.01)
σ ∼ Normal+ (0, 1)
Here, β represents, on the log scale, the change in the intercept α as a function of the
experimental manipulation. In this model, β will probably be larger in magnitude than for the
model that examined the difference in pressing the spacebar for two consecutive trials in
section 3.7.2. We might need to examine the prior for β with predictive distributions, but we
will delay this for now.
In contrast to our previous models, the intercept α is not the grand mean of the location. This is because the conditions were not balanced in the experiment (one-third of the trials were congruent and two-thirds incongruent): the average location across trials is (1/3)(α − β) + (2/3)(α + β) = α + β/3 rather than α. The intercept can be interpreted here as the time (on the log scale) it takes to respond if we ignore the experimental manipulation.
Next, we turn our attention to the prior specification for the group-level parameters (or random
effects). If we assume a possible correlation between by-subject intercepts and slopes, our
model will have the following structure. In particular, we have to define priors for the
parameters in the variance-covariance matrix Σu .
(u_{i,1}, u_{i,2})ᵀ ∼ N((0, 0)ᵀ, Σ_u)

Σ_u = ( τ_{u1}²               ρ_u τ_{u1} τ_{u2}
        ρ_u τ_{u1} τ_{u2}     τ_{u2}² )
In practice, this means that we need priors for the by-subject standard deviations and correlation. For the variance components, we will set a prior similar to the one for σ: we don't expect the by-subject adjustments to the intercept and slope to have more variance than the within-subject variability, so this prior is conservative while still allowing for a large range of prior uncertainty. We assign the same prior to the correlation as we did in section 5.2.5.

τ_{u1}, τ_{u2} ∼ Normal+(0, 1)
ρ_u ∼ LKJcorr(2)
We are now ready to fit the model. To speed up computation, we use a subset of 50 subjects from the original data set; both the subsetted data and the original data set can be found in the package bcogsci. If we were analyzing these data for publication in a journal article or the like, we would obviously not subset the data. We restrict ourselves to the correct trials only, and add a c_cond predictor, sum-coded as described earlier (a sketch of this step follows the data call below).
data("df_stroop")
df_stroop
## # A tibble: 3,058 × 5
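A minimal sketch of the sum-coding step; we assume here that the condition column of df_stroop is named condition and uses the labels given above:

library(dplyr)
df_stroop <- df_stroop %>%
  # +1 for incongruent, -1 for congruent (sum coding):
  mutate(c_cond = if_else(condition == "Incongruent", 1, -1))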
# log-normal likelihood with by-subject varying intercepts and slopes;
# RT is assumed to be the response-time column of df_stroop:
fit_stroop <- brm(RT ~ c_cond + (c_cond | subj),
  family = lognormal(),
  prior =
    c(
      prior(normal(6, 1.5), class = Intercept),
      prior(normal(0, .01), class = b),
      prior(normal(0, 1), class = sigma),
      prior(normal(0, 1), class = sd),
      prior(lkj(2), class = cor)
    ),
  data = df_stroop
)
We will focus on β (but you can verify that there is nothing surprising in the other parameters of the model fit_stroop).
As shown in Figure 5.19, if we overlay the density plots for the prior and posterior distributions
of β, it becomes evident that the prior might have been too restrictive: the posterior is
relatively far from the prior, and the prior strongly down-weights the values that the posterior is
centered around. Such a strong discrepancy between the prior and posterior can be
investigated with a sensitivity analysis.
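A figure like 5.19 can be produced along the following lines; this is a sketch that assumes brms's default parameter name b_c_cond and draws prior samples directly from the Normal(0, 0.01) prior used above:

library(ggplot2)
draws <- as_draws_df(fit_stroop)
n <- nrow(draws)
df_plot <- data.frame(
  beta = c(draws$b_c_cond, rnorm(n, mean = 0, sd = 0.01)),
  distribution = rep(c("posterior", "prior"), each = n)
)
ggplot(df_plot, aes(x = beta, fill = distribution)) +
  geom_density(alpha = 0.5)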
[Figure: overlaid density plots of the prior and the posterior of β.]
FIGURE 5.19: The discrepancy between the prior and the posterior distributions for the slope parameter in the model fit_stroop.
Here, the discrepancy evident in Figure 5.19 is investigated with a sensitivity analysis. We will examine what happens with the following priors for β. In the models we fit below, all the other parameters have the same priors as in the model fit_stroop; we vary only the prior for β (a sketch of one such refit follows the list of priors). The different priors are:
β ∼ Normal(0, 0.05)
β ∼ Normal(0, 0.1)
β ∼ Normal(0, 1)
β ∼ Normal(0, 2)
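For instance, one of these refits might look as follows; update() reuses the compiled Stan model where possible, and only the prior for β (class b) changes:

fit_stroop_wide <- update(fit_stroop,
  prior = c(
    prior(normal(6, 1.5), class = Intercept),
    prior(normal(0, 1), class = b),
    prior(normal(0, 1), class = sigma),
    prior(normal(0, 1), class = sd),
    prior(lkj(2), class = cor)
  )
)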
We can summarize the estimates of β given different priors as shown in Table 5.1.
TABLE 5.1: The summary (mean and 95% credible interval) for the posterior distribution of the
slope in the model fit_stroop, given different priors on the slope parameter.
It might be easier to see how much the posterior difference between conditions changes depending on the prior. In order to answer this question, we need to remember that the median difference between conditions (MedianRT_diff) can be calculated as the difference between the exponentiated locations of the two conditions:

MedianRT_diff = exp(α + β) − exp(α − β)    (5.5)

Equation (5.5) gives us the posterior distribution of the median difference between conditions for each of the different models. We calculate the median difference rather than the mean difference because the mean depends on the parameter σ, but the median doesn't: the mean of a log-normal distribution is exp(μ + σ²/2), and the median is simply exp(μ); see also section 3.7.2.
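A sketch of this computation on the posterior draws, again assuming brms's default parameter names:

draws <- as_draws_df(fit_stroop)
median_diff <- exp(draws$b_Intercept + draws$b_c_cond) -
  exp(draws$b_Intercept - draws$b_c_cond)
# summarize with the mean and a 95% credible interval:
c(mean = mean(median_diff),
  quantile(median_diff, probs = c(0.025, 0.975)))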
Table 5.2 summarizes the posterior distributions under different priors using the means of the difference in medians, along with 95% credible intervals. It's important to realize that the use of the mean to summarize the posterior distribution is orthogonal to our use of the median to summarize the response times by condition: in one case, we use the median to summarize a distribution of observations; in the other, we use the mean to summarize the samples from the posterior. We could have summarized the samples from the posterior with their median instead of the mean.
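Given the median_diff draws computed above, the two summaries are each one line:

c(mean = mean(median_diff), median = median(median_diff))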
TABLE 5.2: A summary, under a range of priors, of the posterior distributions of the difference in medians between the two conditions, back-transformed to the millisecond scale.
Table 5.2 shows us that the posterior changes substantially when we widen the prior, but it is relatively unaffected once the prior standard deviation is larger than 0.05. However, if we assume a priori that the effect of the manipulation must be
small, we will end up obtaining a posterior that is consistent with that belief. When we include
less information about the possible effect sizes by using a less informative prior, we allow the
data to influence the posterior more. A sensitivity analysis is always an important component
of a good-quality Bayesian analysis.
Which analysis should one report after carrying out a sensitivity analysis? In the above
example, the priors ranging from Normal(0, 0.05) to Normal(0, 2) show rather similar
posterior distributions for the mean difference. The most common approach in Bayesian
analysis is to report the results of such relatively uninformative priors (e.g., one could report
the posterior associated with the Normal(0, 2) here), because this kind of prior allows for a
broader range of possible effects and is relatively agnostic. However, if there is a good reason
to use a prior from a previous analysis, then of course it makes sense to report the analysis
with the informative prior alongside an analysis with an uninformative prior. Reporting only
informative priors in a Bayesian analysis is generally not a good idea. The issue is
transparency: the reader should know what the posterior looks like for both an informative and
an uninformative prior.
Another situation where posterior distributions associated with multiple priors should be
reported is when one is carrying out an adversarial sensitivity analysis (Spiegelhalter, Abrams,
and Myles 2004): one can take a group of agnostic, enthusiastic, and adversarial or skeptical
priors that, respectively, reflect a non-committal a priori position, an informed position based
on the researcher’s prior beliefs, and an adversarial position based on a scientific opponent’s
beliefs. In such a situation, analyses using all three priors can be reported, so that the reader
can determine how different prior beliefs influence the posterior. For an example of such an
adversarial analysis, see Vasishth and Engelmann (2022). Finally, when carrying out
hypothesis testing using Bayes factors, the choice of the prior on the parameter of interest
becomes critically important; in that situation, it is very important to report a sensitivity
analysis, showing the Bayes factor as well as a summary of the posterior distributions (Schad
et al. 2021); we return to this point in chapter 15, which covers Bayes factors.
Carrying out Bayesian data analysis clearly requires much more effort than fitting a frequentist model: we have to define priors, verify that our model works, and decide how to interpret the results. By comparison, fitting a linear mixed model using lme4 consists of only a single line of code (see the sketch at the end of this paragraph). But there is a hidden cost to the relatively high speed furnished by functions such as lmer. First, a model fit using lmer or the like makes many assumptions, but they are
hidden from the user. This is not a problem for the knowledgeable modeler, but very
dangerous for the naive user. A second conceptual problem is that the way frequentist models
are typically used is to answer a binary question: is the effect “significant” or not? If a result is
significant, the paper is considered worth publishing; if not, it is not. Although frequentist
models can quickly answer the question that the null hypothesis test poses, the frequentist
test answers the wrong question. For discussion, see Vasishth and Nicenboim (2016).
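For comparison, here is what that single line might look like for the Stroop data analyzed above (a sketch; lme4 is frequentist, so no priors are involved):

library(lme4)
# log-transform the response to mirror the log-normal likelihood:
fit_lmer <- lmer(log(RT) ~ c_cond + (c_cond | subj), data = df_stroop)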
Nevertheless, it is natural to ask why one should bother to go through all the trouble of fitting a
Bayesian model. An important reason is the flexibility in model specification. The approach we
have presented here can be used to extend essentially any parameter of any model. This
includes popular uses, such as logistic and Poisson regressions, and also useful models that
are relatively rarely used in cognitive science, such as multi-logistic regression (e.g., accuracy
in some task with more than two answers), ordered logistic (e.g., ratings, Bürkner and Vuorre
2018), models with a shifted log-normal distribution (see exercise 12.1 and chapter 20, which deals with a log-normal race model; see also Nicenboim, Logačev, et al. 2016; Rouder 2005),
and distributional regression models (as shown in 5.2.6). By contrast, a frequentist model,
although easy to fit quickly, forces the user to use an inflexible canned model, which may not
necessarily make sense for their data.
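As an illustration of this flexibility, the same hierarchical structure can be paired with different likelihoods in brms just by changing the family argument. A sketch, in which the outcome columns acc and rating and the data frame df_ratings are hypothetical:

# accuracy (0/1) with a Bernoulli likelihood and logit link:
brm(acc ~ c_cond + (c_cond | subj),
  family = bernoulli(), data = df_stroop_complete)
# ratings with an ordered-logistic (cumulative) likelihood:
brm(rating ~ c_cond + (c_cond | subj),
  family = cumulative(), data = df_ratings)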
This flexibility also allows us to go beyond the statistical models discussed before, and to develop complex hierarchical computational process models that are tailored to specific phenomena. One example is computational cognitive models: these can be extended hierarchically in a straightforward way; see Lee (2011b) and Lee and Wagenmakers (2014).
This is because, as we have seen with distributional regression models in section 5.2.6, any
parameter can have a group-level effect structure. Some examples of hierarchical
computational cognitive models in psycholinguistics are Logačev and Vasishth (2016),
Nicenboim and Vasishth (2018), Vasishth et al. (2017), Vasishth, Jaeger, and Nicenboim
(2017), Lissón et al. (2021), Logačev and Dokudan (2021), Paape et al. (2021), Yadav, Smith,
and Vasishth (2021a), and Yadav, Smith, and Vasishth (2021b). The hierarchical Bayesian
modeling approach can even be extended to process models that cannot be expressed as a
likelihood function, although in such cases one may have to write one’s own sampler; for an
example from psycholinguistics, see Yadav, Paape, et al. (2022). We discuss and implement in Stan some relatively simple computational cognitive models in chapters 17-20.
5.5 Summary
This chapter presents two very commonly used classes of hierarchical model: those with
normal and log-normal likelihoods. We saw several common variants of such models: varying
intercepts, varying intercepts and varying slopes with or without a correlation parameter, and
crossed random effects for subjects and items. We also experienced the flexibility of the Stan
modeling framework through the example of a model that assumes a different residual
standard deviation for each subject.
5.6 Further reading
Chapter 5 of Gelman et al. (2014) provides a rather technical but complete treatment of
exchangeability in Bayesian hierarchical models. Bernardo and Smith (2009) is a brief but
useful article explaining exchangeability, and Lunn et al. (2012) also has a helpful discussion
that we have drawn on in this chapter. Gelman and Hill (2007) is a comprehensive treatment
of hierarchical modeling, although it uses WinBUGS. Yarkoni (2020) discusses the importance
of modeling variability in variables that researchers clearly intend to generalize over (e.g.,
stimuli, tasks, or research sites), and how under-specification of group-level effects imposes
strong constraints on the generalizability of results. Sorensen, Hohenstein, and Vasishth
(2016) provides an introduction, using Stan, to the Laird-Ware style matrix formulation (Laird
and Ware 1982) of hierarchical models; this formulation has the advantage of flexibility and
efficiency when specifying models in Stan syntax.
5.7 Exercises
Exercise 5.1 A hierarchical model (normal likelihood) of cognitive load on pupil size.
As in section 4.1, we focus on the effect of cognitive load on pupil size, but this time we look at
all the subjects of Wahn et al. (2016):
data("df_pupil_complete")
df_pupil_complete
## # A tibble: 2,228 × 4
## subj trial load p_size
## <int> <int> <int> <dbl>
## 1 701 1 2 1021.
## 2 701 2 1 951.
## 3 701 3 5 1064.
## # … with 2,225 more rows
You should now be able to fit a "maximal" model (correlated varying intercepts and slopes for subjects) assuming a normal likelihood. Base your priors on those discussed in section 4.1.
a. Examine the effect of load on pupil size, and the average pupil size. What do you
conclude?
b. Do a sensitivity analysis for the prior on the intercept (α). What is the estimate of the
effect (β) under different priors?
c. Is the effect of load consistent across subjects? Investigate this visually.
Exercise 5.2 Are subject relatives easier to process than object relatives (log-normal
likelihood)?
We begin with a classic question from the psycholinguistics literature: Are subject relatives
easier to process than object relatives? The data come from Experiment 1 in a paper by
Grodner and Gibson (2005).
(1a) The reporter [who the photographer sent to the editor] was hoping for a good story.
(ORC)
(1b) The reporter [who sent the photographer to the editor] was hoping for a good story. (SRC)
The underlying explanation has to do with memory processes: Shorter linguistic dependencies
are easier to process due to either reduced interference or decay, or both. For implemented
computational models that spell this point out, see Lewis and Vasishth (2005) and Engelmann,
Jäger, and Vasishth (2020).
In the Grodner and Gibson data, the dependent measure is the reading time at the relative clause verb (e.g., sent) of sentences with either an ORC or an SRC. The dependent variable is in milliseconds and was measured in a self-paced reading task. Self-paced reading is a task where subjects read a sentence or a short text word-by-word or phrase-by-phrase, pressing a button to get each word or phrase displayed; the preceding word disappears every time the button is pressed. In section 6.1, we provide a more detailed explanation of this experimental method.
For this experiment, we are expecting longer reading times at the relative clause verbs of
ORC sentences in comparison to the relative clause verb of SRC sentences.
data("df_gg05_rc")
df_gg05_rc
## # A tibble: 672 × 7
## subj item condition RT residRT qcorrect experiment
## <int> <int> <chr> <int> <dbl> <int> <chr>
You should now be able to fit a "maximal" model (correlated varying intercepts and slopes for subjects and for items) assuming a log-normal likelihood.
a. Examine the effect of relative clause type (the predictor c_cond) on reading times RT (β).
b. Estimate the median difference between relative clause types in milliseconds,
and report the mean and 95% CI.
c. Do a sensitivity analysis. What is the estimate of the effect (β) under different priors?
What is the difference in milliseconds between conditions under different priors?
Exercise 5.3 Relative clause processing in Mandarin Chinese.
data("df_gibsonwu")
data("df_gibsonwu2")
The data are taken from two experiments that investigate (inter alia) the effect of relative
clause type on reading time in Chinese. The data are from Gibson and Wu (2013) and
Vasishth et al. (2013) respectively. The second data set is a direct replication attempt of the
Gibson and Wu (2013) experiment.
Chinese relative clauses are interesting theoretically because they are prenominal: the relative
clause appears before the head noun. For example, the English relative clauses shown above
would appear in the following order in Mandarin. The square brackets mark the relative
clause, and REL refers to the Chinese equivalent of the English relative pronoun who.
(2a) [The photographer sent to the editor] REL the reporter was hoping for a good story.
(ORC)
(2b) [sent the photographer to the editor] REL the reporter who was hoping for a good story.
(SRC)
As discussed in Gibson and Wu (2013), the consequence of Chinese relative clauses being
prenominal is that the distance between the verb in the relative clause and the head noun is larger in subject relatives than in object relatives. Hsiao and Gibson (2003) were the first to suggest
that the larger distance in subject relatives leads to longer reading time at the head noun.
Under this view, the prediction is that subject relatives are harder to process than object
relatives. If this is true, this is interesting and surprising because in most other languages that
have been studied, subject relatives are easier to process than object relatives; Chinese would then be a very unusual exception cross-linguistically.
The data provided are for the critical region (the head noun; here, reporter). The experimental method is self-paced reading, so we have reading times in milliseconds.
The research question is whether the difference in reading times between object and subject relative clauses is negative. For the first data set (df_gibsonwu), investigate this
question by fitting two “maximal” hierarchical models (correlated varying intercept and slopes
for subjects and items). The dependent variable in both models is the raw reading time in
milliseconds. The first model should use the normal likelihood in the model; the second model
should use the log-normal likelihood. In both models, use ±0.5 sum coding to model the effect
of relative clause type. You will need to decide on appropriate priors for the various
parameters.
a. Plot the posterior predictive distributions from the two models. What is the difference in
the posterior predictive distributions of the two models; and why is there a difference?
b. Examine the posterior distributions of the effect estimates (in milliseconds) in the two
models. Why are these different?
c. Given the posterior predictive distributions you plotted above, why is the log-normal
likelihood model better for carrying out inference and hypothesis testing?
Next, work out a normal approximation of the log-normal model’s posterior distribution for the
relative clause effect that you obtained from the above data analysis. Then use that normal
approximation as an informative prior for the slope parameter when fitting a hierarchical model
to the second data set. This is an example of incrementally building up knowledge by
successively using a previous study’s posterior as a prior for the next study; this is essentially
equivalent to pooling both data sets (check that pooling the data and using a Normal(0,1) prior
for the effect of interest, with a log-normal likelihood, gives you approximately the same
posterior as the informative-prior model fit above).
data("df_dillonE1")
dillonE1 <- df_dillonE1
head(dillonE1)
The data are taken from an experiment that investigates (inter alia) the effect of number similarity between a noun and the auxiliary verb in sentences like the following. There are two levels of a factor called Int(erference): low and high.
(3a) low: The key to the cabinet are on the table.
(3b) high: The key to the cabinets are on the table.
Here, in (3b), the auxiliary verb are is predicted to be read faster than in (3a), because the
plural marking on the noun cabinets leads the reader to think that the sentence is
grammatical. (Both sentences are ungrammatical.) This phenomenon, where the high
condition is read faster than the low condition, is called agreement attraction.
The data provided are for the critical region (the auxiliary verb are). The experiment method is
eye-tracking; we have total reading times in milliseconds.
The research question is whether the difference in reading times between high and low
conditions is negative.
a. First, using a log-normal likelihood, fit a hierarchical model with correlated varying intercepts and slopes for subjects and items. You will need to decide on the priors for the model.
b. By simply looking at the posterior distribution of the slope parameter β, what would you conclude about the theoretical claim relating to agreement attraction?
Exercise 5.5 The attentional blink (Bernoulli likelihood).
The data set df_ab is a subset of the data from an attentional blink paradigm, taken from a replication conducted by Grassi et al. (2021). In this subset, the probe was always present and the target was correctly identified. We want to find out how the lag affects the accuracy of the identification of the probe.
data("df_ab")
df_ab
## # A tibble: 2,101 × 4
Fit a logistic regression assuming a linear relationship between lag and accuracy (probe_correct). Assume a hierarchical structure with correlated varying intercepts and slopes for subjects. You will need to decide on the priors for this model.
a. How is the accuracy of the probe identification affected by the lag? Estimate this in log-
odds and percentages.
b. Is the linear relationship justified? Use posterior predictive checks to verify this.
c. Can you think of a better relationship between lag and accuracy? Fit a new model and
use posterior predictive checks to verify if the fit improved.
Exercise 5.6 Is there a Stroop effect in accuracy?
Instead of the response times of the correct answers, we want to find out whether accuracy
also changes by condition in the Stroop task. Fit the Stroop data with a hierarchical logistic
regression (i.e., a Bernoulli likelihood with a logit link). Use the complete data set,
df_stroop_complete, which also includes incorrect answers, and subset it by selecting the first 50 subjects.
Exercise 5.7 The Stroop effect revisited: a distributional regression model.
We will relax some of the assumptions of the model of Stroop presented in section 5.3. We will
no longer assume that all subjects share the same variance component, and, in addition, we’ll
investigate whether the experimental manipulation affects the scale of the response times. A
reasonable hypothesis could be that the incongruent condition is noisier than the congruent
one.
Assume the likelihood shown below, and fit the model with sensible priors (recall that our initial prior for β wasn't reasonable). (Priors for all the sigma parameters require us to set dpar = sigma.)

rt_n ∼ LogNormal(α + u_{subj[n],1} + c_cond_n · (β + u_{subj[n],2}), σ_n)
σ_n = exp(σ_α + σ_{u_{subj[n],1}} + c_cond_n · (σ_β + σ_{u_{subj[n],2}}))

In this likelihood, σ_n has both population- and group-level parameters: σ_α and σ_β are the intercept and slope of the population-level effects respectively, and σ_{u_{subj[n],1}} and σ_{u_{subj[n],2}} are the intercept and slope of the group-level effects.
Exercise 5.8 The grammaticality illusion.
data("df_english")
english <- df_english
data("df_dutch")
dutch <- df_dutch
(4a) The apartment that the maid who the service had sent over was cleaning every week was
well decorated.
(4b) *The apartment that the maid who the service had sent over — was well decorated
Based on results from English, Gibson and Thomas (1999) proposed that working-memory overload leads the comprehender to forget the prediction of the upcoming verb phrase (VP), which reduces working-memory load. This came to be known as the VP-forgetting hypothesis. The prediction is that the word immediately following the final verb should be harder to read in the grammatical condition (coded as +1 in the data frames) than in the ungrammatical condition (coded as -1).
The design shown above is set up to test this hypothesis using self-paced reading for English
(Vasishth et al. 2011), and for Dutch (Frank, Trompenaars, and Vasishth 2015). The data
provided are for the critical region (the noun phrase, labeled NP1, following the final verb); this
is the region for which the theory predicts differences between the two conditions. We have
reading times in log milliseconds.
a. First, fit a linear model with a full hierarchical structure by subjects and by items for the
English data. Because we have log milliseconds data, we can simply use the normal
likelihood (not the log-normal). What scale will the parameters be in: milliseconds or log milliseconds?
b. Second, using the posterior for the effect of interest from the English data, derive a prior
distribution for the effect in the Dutch data. Then fit two linear mixed models: (i) one model
with relatively uninformative priors for β (for example, Normal(0, 1)), and (ii) one model
with the prior for β you derived from the English data. Do the posterior distributions of the
Dutch data’s effect show any important differences given the two priors? If yes, why; if
not, why not?
c. Finally, just by looking at the English and Dutch posteriors, what can we say about the
VP-forgetting hypothesis? Are the posteriors of the effect from these two languages
consistent with the hypothesis?
References
Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects
Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and
Language 68 (3). Elsevier: 255–78.
Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear
Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.
https://doi.org/10.18637/jss.v067.i01.
Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley &
Sons.
Broadbent, Donald E., and Margaret H. P. Broadbent. 1987. “From Detection to Identification:
Response to Multiple Targets in Rapid Serial Visual Presentation.” Perception &
Psychophysics 42 (2): 105–13. https://doi.org/10.3758/BF03210