Generalized additive modeling
Martijn Wieling
University of Groningen
This lecture
· Introduction
- Generalized additive modeling
- Articulography
- Using articulography to study L2 pronunciation differences
· Design
· Methods: R code
· Results
· Discussion
3/55
Generalized additive modeling (1)
· Generalized additive model (GAM): relaxing assumption of linear relation between
dependent variable and predictor
· Relationship between individual predictors and (possibly transformed) dependent
variable is estimated by a non-linear smooth function:
g(y) = s(x1) + s(x2, x3) + β4x4 + …
- Multiple predictors can be combined in a (hyper)surface smooth (other lecture)
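In mgcv notation, the formula above corresponds roughly to the following sketch (y, x1–x4 and d are placeholder names, not the lecture data):

```r
library(mgcv)
# one 1-D smooth, one 2-D smooth, and one linear (parametric) term:
m <- gam(y ~ s(x1) + s(x2, x3) + x4, data = d)
```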
4/55
Question 1
Go to www.menti.com/870525
Are non-linear statistics useful?
· Yes, many patterns in real life are non-linear
· No, they are too difficult and enable overfitting
· ?
5/55
Generalized additive modeling (2)
· Advantage of GAM over manual specification of non-linearities: the optimal shape
of the non-linearity is determined automatically
· Appropriate degree of smoothness is determined automatically by minimizing
combined error and “wiggliness” (preventing overfitting)
· Maximum number of basis functions limits the maximum amount of non-linearity
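As a sketch of how this appears in mgcv (y, x and d are placeholder names): the maximum number of basis functions is set via the k argument of s(), while the degree of smoothing itself is estimated automatically (it can also be fixed manually via sp, though this is rarely needed):

```r
library(mgcv)
m_auto  <- gam(y ~ s(x, k = 10), data = d)            # smoothness estimated by penalization
m_fixed <- gam(y ~ s(x, k = 10, sp = 0.1), data = d)  # smoothing parameter fixed by hand
```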
6/55
First ten basis functions
7/55
Generalized additive modeling (3)
· Choosing a smoothing basis
- Single predictor or isotropic predictors: thin plate regression spline (this lecture)
- Efficient approximation of the optimal (thin plate) spline
- Combining non-isotropic predictors: tensor product spline
· Generalized Additive Mixed Modeling:
- Random effects can be treated as smooths as well (Wood, 2008)
- R: gam and bam (package mgcv)
· For more (mathematical) details, see Wood (2006) and Wood (2017)
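The smooth types above can be sketched in mgcv syntax (y, x, z, subj and d are placeholder names):

```r
library(mgcv)
m1 <- bam(y ~ s(x), data = d)                      # thin plate regression spline (default bs = "tp")
m2 <- bam(y ~ te(x, z), data = d)                  # tensor product for non-isotropic predictors
m3 <- bam(y ~ s(x) + s(subj, bs = "re"), data = d) # random intercept treated as a smooth
```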
8/55
Articulography
9/55
Obtaining data
[video, 1:16]
10/55
Recorded data
[video, 0:05]
11/55
Present study: goal and setup
· 19 native Dutch speakers from Groningen
· 22 native Standard Southern British English speakers from London
· Material: 10 minimal pairs [t]:[θ] repeated twice:
- 'tent'-'tenth', 'fate'-'faith', 'fort'-'forth', 'kit'-'kith', 'mitt'-'myth'
- 'tank'-'thank', 'team'-'theme', 'tick'-'thick', 'ties'-'thighs', 'tongs'-'thongs'
- Note that the sound [θ] does not exist in the Dutch language
· Goal: compare distinction between this sound contrast for both groups
· Preprocessing:
- Articulatory segmentation: gestural onset to offset (within /ə/ context)
- Positions z-transformed per axis and time normalized (from 0 to 1) per speaker
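The two normalization steps could be sketched as follows (a hypothetical illustration; the column names and grouping are assumptions, not the actual preprocessing code):

```r
# z-transform the position per speaker (and per axis, if more axes were included):
full$TTfront <- ave(full$TTfront, full$Participant,
                    FUN = function(x) (x - mean(x)) / sd(x))
# normalize time from 0 to 1 within each individual trajectory:
full$Time <- ave(full$Time, full$Participant, full$Trial,
                 FUN = function(t) (t - min(t)) / (max(t) - min(t)))
```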
12/55
Data: much individual variation and noisy data
13/55
Data overview
load("data/full.rda") # only includes anterior-posterior position of tongue tip sensor
head(full)
# Participant Group Word Sound Loc Trial Time TTfront
# 1 VENI_EN_1 EN tick T Init 1 0.0000 -0.392
# 2 VENI_EN_1 EN tick T Init 1 0.0161 -0.440
# 3 VENI_EN_1 EN tick T Init 1 0.0323 -0.440
# 4 VENI_EN_1 EN tick T Init 1 0.0484 -0.503
# 5 VENI_EN_1 EN tick T Init 1 0.0645 -0.513
# 6 VENI_EN_1 EN tick T Init 1 0.0806 -0.677
dim(full)
# [1] 126177 8
14/55
First model: tongue frontness for “tenth” and “tent”
(R version 4.2.2 Patched (2022-11-10 r83330), mgcv version 1.8.41, itsadug version 2.4.1)
library(mgcv)
library(itsadug)
tt <- droplevels(full[full$Word %in% c("tenth", "tent"), ])
# sort data per individual trajectory (necessary to detect autocorrelation)
tt <- tt[order(tt$Participant, tt$Trial, tt$Time), ] # sort data per trial
tt$start.event <- tt$Time == 0 # mark the start of every new trial (where Time equals 0)
# fit first GAM
m0 <- bam(TTfront ~ s(Time), data = tt)
15/55
Model summary
summary(m0)
#
# Family: gaussian
# Link function: identity
#
# Formula:
# TTfront ~ s(Time)
#
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.40119 0.00825 48.6 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time) 8.4 8.9 170 <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# R-sq.(adj) = 0.106 Deviance explained = 10.6%
# fREML = 17375 Scale est. = 0.87389 n = 12839
16/55
Visualizing the non-linear time pattern of the model
(Interpreting GAM results always involves visualization)
plot(m0, rug = F, shade = T, main = "Partial effect", ylab = "TTfront", ylim = c(-1, 1))
plot_smooth(m0, view = "Time", rug = F, main = "Full effect", ylim = c(-1, 1))
17/55
Check if number of basis functions is adequate
(a low p-value combined with edf close to k' indicates that k may be too low)
gam.check(m0)
#
# Method: fREML Optimizer: perf newton
# full convergence after 9 iterations.
# Gradient range [-1.25e-07,1.03e-07]
# (score 17375 & scale 0.874).
# Hessian positive definite, eigenvalue range [3.46,6419].
# Model rank = 10 / 10
#
# Basis dimension (k) checking results. Low p-value (k-index<1) may
# indicate that k is too low, especially if edf is close to k'.
#
# k' edf k-index p-value
# s(Time) 9.0 8.4 1 0.48
18/55
Increasing the number of basis functions with k
(if a higher k is needed, double it; k cannot exceed the number of data points)
m0b <- bam(TTfront ~ s(Time, k = 20), data = tt)
summary(m0b)$s.table # extract smooths from summary: new edf
# edf Ref.df F p-value
# s(Time) 11.2 13.5 112 0
summary(m0)$s.table # original edf
# edf Ref.df F p-value
# s(Time) 8.4 8.9 170 0
19/55
Effect of increasing k
plot_smooth(m0, view = "Time", rug = F, main = "m0 (k=10)")
plot_smooth(m0b, view = "Time", rug = F, main = "m0b (k=20)")
20/55
Individual variation: random effects in lme4 vs. mgcv
· (1|Participant): s(Participant, bs="re")
· (0 + WF|Participant): s(Participant, WF, bs="re")
· (1 + WF|Participant): not possible in mgcv
21/55
Including individual variation: random intercepts
m1 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re"), data = tt)
summary(m1) # starting from here, slides only show the relevant part of the summary
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.43 0.0648 6.65 3.11e-11 ***
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time) 8.49 8.93 206.1 <2e-16 ***
# s(Participant) 40.41 41.00 65.6 <2e-16 ***
#
# Deviance explained = 26.3%
· Deviance explained of model without random intercept (r²): 10.6%
22/55
Effect of including a random intercept
plot_smooth(m0, view = "Time", rug = F, main = "m0", ylim = c(-0.5, 1.5))
plot_smooth(m1, view = "Time", rug = F, main = "m1", ylim = c(-0.5, 1.5))
23/55
Including individual variation: random slopes
m2 <- bam(TTfront ~ s(Time) + s(Participant, bs = "re") + s(Participant, Time, bs = "re"),
data = tt)
summary(m2)
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.431 0.106 4.07 4.72e-05 ***
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time) 8.53 8.94 163 <2e-16 ***
# s(Participant) 39.91 41.00 9294 <2e-16 ***
# s(Participant,Time) 39.12 41.00 10291 <2e-16 ***
#
# Deviance explained = 31.7%
24/55
Effect of including a random slope
plot_smooth(m1, view = "Time", rug = F, main = "m1", ylim = c(-0.5, 1.5))
plot_smooth(m2, view = "Time", rug = F, main = "m2", ylim = c(-0.5, 1.5))
25/55
Including individual variation: factor smooths
(i.e. including a non-linear random effect instead of a random intercept and random slope)
m3 <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt)
summary(m3)
# Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated 1-d smooths
# of same variable.
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.431 0.0648 6.64 3.22e-11 ***
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time) 8.12 8.48 23.3 <2e-16 ***
# s(Time,Participant) 289.74 377.00 16.1 <2e-16 ***
#
# Deviance explained = 40.7%
26/55
Visualization of individual variation
plot(m3, select = 2)
27/55
Effect of including a random non-linear effect
plot_smooth(m2, view = "Time", rug = F, main = "m2", ylim = c(-0.5, 1.2))
plot_smooth(m3, view = "Time", rug = F, main = "m3", ylim = c(-0.5, 1.2))
28/55
Influence of random effects
https://eolomea.let.rug.nl/GAM/RandomEffects (login: f112300 and ShinyDem0)
Interactive demo (implementation: Martijn Wieling & Jacolien van Rij, 2015): choose the type
of random effect (random intercept, random slope, or random smooth) and a participant (1-22).
In the bottom panel, the black dashed line shows the main effect of Time (Intercept + s(Time)),
the red dotted line the partial random effect for the selected participant, and the red solid
line their summed effect.
29/55
Speeding up computation
(via discrete and nthreads, only for default method="fREML")
system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt))
# user system elapsed
# 24.93 34.21 2.87
system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, discrete = T))
# user system elapsed
# 5.502 7.611 0.491
system.time(bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, discrete = T,
nthreads = 2))
# user system elapsed
# 11.76 15.87 0.86
· Using nthreads is most effective for more complex models
30/55
Comparing “tenth” and “tent” in one model
(smooths are centered, so the factorial predictor also needs to be included in the fixed effects)
m4 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1), data = tt,
discrete = T, nthreads = 2)
summary(m4)
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.10 0.0684 1.47 0.142
# Wordtenth 0.69 0.0118 58.66 <2e-16 ***
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time):Wordtent 7.28 7.92 9.27 <2e-16 ***
# s(Time):Wordtenth 8.51 8.79 42.96 <2e-16 ***
# s(Time,Participant) 307.69 377.00 22.54 <2e-16 ***
· Does the word distinction improve the model?
31/55
Assessing model improvement
(comparing fixed effects, so method='ML' and no discrete=T)
m3.ml <- bam(TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1), data = tt, method = "ML")
m4.ml <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs", m = 1), data = tt,
method = "ML")
compareML(m3.ml, m4.ml) # model m4.ml is much better!
# m3.ml: TTfront ~ s(Time) + s(Time, Participant, bs = "fs", m = 1)
#
# m4.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, bs = "fs",
# m = 1)
#
# Chi-square test of ML scores
# -----
# Model Score Edf Difference Df p.value Sig.
# 1 m3.ml 15288 5
# 2 m4.ml 13294 8 1993.462 3.000 < 2e-16 ***
#
# AIC difference: 4105.95, model m4.ml has lower AIC.
32/55
Visualizing the two patterns
plot_smooth(m4, view = "Time", rug = F, plot_all = "Word", main = "m4", ylim = c(-0.5, 2))
33/55
Visualizing the difference
plot_diff(m4, view = "Time", comp = list(Word = c("tenth", "tent")), ylim = c(-0.5, 2))
34/55
Question 2
Go to www.menti.com/870525
What is wrong with the difference plot?
· Nothing
· Confidence bands seem too thin
· Wrong shape
· ?
35/55
Including individual variation for “tenth” vs. “tent”
m5 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs",
m = 1), data = tt, discrete = T, nthreads = 2)
summary(m5)
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.118 0.0947 1.25 0.213
# Wordtenth 0.622 0.1181 5.27 1.41e-07 ***
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time):Wordtent 7.56 7.97 9.6 <2e-16 ***
# s(Time):Wordtenth 8.40 8.56 22.5 <2e-16 ***
# s(Time,Participant):Wordtent 316.62 377.00 38.0 <2e-16 ***
# s(Time,Participant):Wordtenth 328.46 368.00 43.1 <2e-16 ***
· This approach is overly conservative when determining the (tenth-tent) difference
· An alternative approach is introduced in this lecture
36/55
More uncertainty in the difference
plot_diff(m4, view="Time", comp=list(Word=c("tenth","tent")), ylim=c(-0.5,2),
main="m4: difference") # anti-conservative
plot_diff(m5, view="Time", comp=list(Word=c("tenth","tent")), ylim=c(-0.5,2),
main="m5: difference") # overly conservative
37/55
Autocorrelation in the data is a problem!
(residuals should be independent, otherwise the standard errors and p-values are wrong)
m5acf <- acf_resid(m5) # show autocorr.
· See also supplementary material
38/55
Correcting for autocorrelation
rhoval <- m5acf[2] # correlation of residuals at time t with those at time t-1 (about 0.93)
m6 <- bam(TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word, bs = "fs",
m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m6)
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.118 0.090 1.31 0.192
# Wordtenth 0.623 0.113 5.51 3.66e-08 ***
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time):Wordtent 7.42 8.09 8.94 <2e-16 ***
# s(Time):Wordtenth 8.31 8.59 21.77 <2e-16 ***
# s(Time,Participant):Wordtent 229.53 377.00 2.84 <2e-16 ***
# s(Time,Participant):Wordtenth 269.34 368.00 3.81 <2e-16 ***
39/55
Autocorrelation has been removed
par(mfrow = c(1, 2))
acf_resid(m5, main = "ACF of m5")
acf_resid(m6, main = "ACF of m6")
40/55
Clear model improvement
# m5.ml and m6.ml: as m5 and m6, but fitted with method = "ML" (fitting not shown)
compareML(m5.ml, m6.ml)
# m5.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
# bs = "fs", m = 1)
#
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
# bs = "fs", m = 1)
#
# Model m6.ml preferred: lower ML score (12679.727), and equal df (0.000).
# -----
# Model Score Edf Difference Df
# 1 m5.ml 9381 10
# 2 m6.ml -3298 10 -12679.727 0.000
#
# AIC difference: 24349.22, model m6.ml has lower AIC.
41/55
Question 3
Go to www.menti.com/870525
How can autocorrelation be reduced?
· Better fitting model
· Setting an appropriate rho value
· Setting discrete=T
· Excluding random effects
42/55
Random effects and autocorrelation
https://eolomea.let.rug.nl/GAM/AutoCorrelation (login: f112300 and ShinyDem0)
Interactive demo (implementation: Martijn Wieling & Jacolien van Rij, 2015): choose the type
of random effect (none, subject intercept, subject smooth, subject smooth + event intercept,
or event smooths) and a rho value between 0 and 0.9; the plots show the main time effect and
the median absolute residual size.
43/55
Distinguishing the two speaker groups
tt$WordGroup <- interaction(tt$Word, tt$Group)
m7 <- bam(TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant, by = Word, bs = "fs",
m = 1), data = tt, rho = rhoval, AR.start = tt$start.event, discrete = T, nthreads = 2)
summary(m7)
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.065 0.120 -0.54 0.588
# WordGrouptenth.EN 0.744 0.152 4.89 1e-06 ***
# WordGrouptent.NL 0.385 0.175 2.21 0.027 *
# WordGrouptenth.NL 0.879 0.157 5.59 2.3e-08 ***
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time):WordGrouptent.EN 3.88 4.59 2.32 0.041 *
# s(Time):WordGrouptenth.EN 7.93 8.35 15.82 <2e-16 ***
# s(Time):WordGrouptent.NL 7.46 8.14 10.85 <2e-16 ***
# s(Time):WordGrouptenth.NL 7.72 8.19 11.87 <2e-16 ***
# s(Time,Participant):Wordtent 218.20 376.00 2.63 <2e-16 ***
# s(Time,Participant):Wordtenth 256.81 367.00 3.45 <2e-16 ***
44/55
Visualizing the patterns
(note that the difference curves are overly conservative)
plot_smooth(m7, view='Time', rug=F, plot_all='WordGroup', main='m7', ylim=c(-1,2))
plot_diff(m7, view='Time', comp=list(WordGroup=c('tenth.EN','tent.EN')), main='EN difference', ylim=c(-1,2))
plot_diff(m7, view='Time', comp=list(WordGroup=c('tenth.NL','tent.NL')), main='NL difference', ylim=c(-1,2))
45/55
Including language is necessary
compareML(m6.ml, m7.ml) # m6.ml and m7.ml fitted with ML instead of REML
# m6.ml: TTfront ~ s(Time, by = Word) + Word + s(Time, Participant, by = Word,
# bs = "fs", m = 1)
#
# m7.ml: TTfront ~ s(Time, by = WordGroup) + WordGroup + s(Time, Participant,
# by = Word, bs = "fs", m = 1)
#
# Chi-square test of ML scores
# -----
# Model Score Edf Difference Df p.value Sig.
# 1 m6.ml -3298 10
# 2 m7.ml -3314 16 15.218 6.000 3.247e-05 ***
#
# AIC difference: 4.51, model m7.ml has lower AIC.
46/55
The full tongue tip model: all words
(Word is now a random-effect factor, Sound distinguishes T from TH words)
full$SoundGroup <- interaction(full$Sound, full$Group)
# alternative function (from itsadug) for sorting and generating start.event column
full <- start_event(full, event = c("Participant", "Trial"))
# duration discrete=F: 105 sec., discrete=T: 54 s. (1 thr.) / 37 s. (2 thr.)
system.time(model <- bam(TTfront ~ s(Time, by = SoundGroup) + SoundGroup + s(Time, Participant,
by = Sound, bs = "fs", m = 1) + s(Time, Word, by = Group, bs = "fs", m = 1), data = full,
rho = rhoval, AR.start = full$start.event, discrete = T, nthreads = 4))
# user system elapsed
# 147.1 305.2 34.4
47/55
The full tongue tip model: results
summary(model, re.test = FALSE) # exclude random effects from summary (much faster)
# Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.002 0.132 -0.01 0.988
# SoundGroupTH.EN 0.699 0.201 3.48 0.000509 ***
# SoundGroupT.NL 0.180 0.216 0.84 0.404
# SoundGroupTH.NL 0.509 0.229 2.22 0.026 *
#
# Approximate significance of smooth terms:
# edf Ref.df F p-value
# s(Time):SoundGroupT.EN 5.16 5.50 2.96 0.014 *
# s(Time):SoundGroupTH.EN 6.61 6.86 6.55 3.04e-06 ***
# s(Time):SoundGroupT.NL 7.25 7.45 4.38 3.12e-05 ***
# s(Time):SoundGroupTH.NL 6.91 7.13 5.95 5.52e-06 ***
#
# Deviance explained = 53.5%
48/55
L1-based differences
(note that the difference curves are overly conservative)
plot_diff(model, view = 'Time', ylim=c(-0.5, 1.5), comp = list(SoundGroup = c('TH.EN', 'T.EN')))
plot_diff(model, view = 'Time', ylim=c(-0.5, 1.5), comp = list(SoundGroup = c('TH.NL', 'T.NL')))
49/55
Discussion
· Native English speakers appear to make the /t/-/θ/ distinction, whereas Dutch L2
speakers of English generally do not
- But this does not necessarily mean the difference between these distinctions is
significant (see this tutorial; Wieling, 2018)
· The approach used in this lecture to model non-linear random effects yielded
overly conservative difference smooths
- In this lecture about GAMs, a better approach is explained
50/55
Question 4
Go to www.menti.com/870525
Which additional category would improve the
model?
· Men vs. women
· Words starting with t/th vs. ending with t/th
· Tongue tip vs. tongue back
· ?
51/55
How to report?
· For an example of how to report this type of analysis, see:
- Wieling (2018), Journal of Phonetics: GAM tutorial (paper package: data and code)
- Wieling et al. (2016), Journal of Phonetics (paper package: data and code)
52/55
Recap
· We have applied GAMs to articulography data and learned how to:
- use s(Time) to model a non-linearity over time
- use the k parameter to control the number of basis functions
- use the plotting functions plot_smooth and plot_diff
- use the parameter setting bs="re" to add random intercepts and slopes
- add non-linear random effects using s(Time,...,bs="fs",m=1)
- use the by-parameter to obtain separate non-linearities
- Note that this factorial predictor also needs to be included in fixed effects!
- use compareML to compare models (comparing fixed effects: method='ML')
· Associated lab session:
- https://www.let.rug.nl/wieling/Statistics/GAM-Intro/lab
53/55
Evaluation
Go to www.menti.com/870525
Please provide your opinion about this lecture in at
most 3 words/phrases!
54/55
Questions?
Thank you for your attention!
http://www.martijnwieling.nl
[email protected]