CDA Course
Course Introduction
• This course focuses on methods for categorical response, or outcome, variables.
– Binary, e.g. success/failure, diseased/healthy
– Nominal, e.g. mode of transportation (car, train, ferry)
– Ordinal, e.g. disease severity (mild, moderate, severe)
– Discrete-valued (“interval”), e.g. counts of events
• Explanatory, or predictor, variables can be any type
• Very generally, we are trying to build models for the distribution of a categorical outcome as a function of the explanatory variables.
Consider n independent, identical Bernoulli trials Y1, ..., Yn with

P(Yi = 1) = π ,   P(Yi = 0) = 1 − π

The total number of successes (1s), Y = Σ_{i=1}^{n} Yi, has the binomial distribution, denoted by bin(n, π). The probability mass function for the possible outcomes y of Y is

p(y) = (n choose y) π^y (1 − π)^(n−y) ,   y = 0, 1, ..., n
Statistical inference
Inference is the use of sample data to estimate unknown parameters of the population. One method we will
focus on is maximum likelihood estimation (MLE).
[Diagram: a sample is drawn from the population; a sample statistic is used to estimate the corresponding population parameter.]
The log-likelihood is L(β) = log(l(β)), where l(β) is the likelihood function. The MLE is the parameter value under which the observed data have the highest probability of occurrence.
• Under regularity conditions, cov(β̂) is the inverse of the information matrix, which has elements

[I(β)]i,j = −E[ ∂²L(β) / (∂βi ∂βj) ]

• The standard errors are the square roots of the diagonal elements of the inverse of the information matrix. The greater the curvature of the log-likelihood function, the smaller the standard errors.
• For the binomial example, the log-likelihood (up to a constant) is

L(π) = y log(π) + (n − y) log(1 − π)

• Differentiating with respect to π and setting the derivative to 0 gives the MLE π̂ = y/n.
• The Fisher information is

I(π) = n / [π(1 − π)]

• The asymptotic distribution of the MLE π̂ is N(π, π(1 − π)/n).
Wald test
Consider the hypothesis

H0 : β = β0    vs.    H1 : β ≠ β0

Under H0 : β = β0, the Wald test statistic z = (β̂ − β0)/SE(β̂) is approximately standard normal. Therefore H0 is rejected if |z| > zα/2.
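As a quick illustration (not from the course materials), a minimal R sketch of the Wald test and interval for a binomial proportion, with made-up counts:

## Wald test and 95% CI for a binomial proportion (made-up y, n, and null value)
y <- 12; n <- 50; pi0 <- 0.5
pi.hat <- y / n
se <- sqrt(pi.hat * (1 - pi.hat) / n)      # SE evaluated at the MLE
z <- (pi.hat - pi0) / se                   # Wald statistic
c(z = z, p.value = 2 * pnorm(-abs(z)))
pi.hat + c(-1, 1) * qnorm(0.975) * se      # 95% Wald interval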
Likelihood-ratio test
The likelihood-ratio statistic is −2 log Λ = −2 log(l0 / l1), where l0 and l1 are the maximized likelihoods under H0 and under H0 ∪ H1. The null hypothesis is rejected if −2 log Λ > χ²α(df), where df is the difference in the dimensions of the parameter spaces under H0 ∪ H1 and under H0.
Figure 1: binomial likelihood
Score (a.k.a. Wilson) test
Score test, also called the Wilson or Lagrange multiplier test, is based on the slope and expected curvature
of the log-likelihood function L(β) at the null value β0 . It utilizes the size of the score function
u(β) = ∂L(β)/∂β
evaluated at β0 .
The score test statistic is

z = u(β0) / [I(β0)]^(1/2) = (π̂ − π0) / sqrt( π0(1 − π0)/n ) .
• Using the Score method, compute the 95% confidence interval for π (true proportion of vegetarians in
the population):
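The vegetarian counts are not reproduced in these notes, so the following sketch uses hypothetical values (y = 20 vegetarians out of n = 100); it computes the Wilson (score) interval directly and via prop.test():

## Score (Wilson) 95% CI for a proportion -- hypothetical y and n
y <- 20; n <- 100
z <- qnorm(0.975)
pi.hat <- y / n
## closed-form Wilson interval
center <- (pi.hat + z^2 / (2 * n)) / (1 + z^2 / n)
halfwd <- z * sqrt(pi.hat * (1 - pi.hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
c(center - halfwd, center + halfwd)
## prop.test() without continuity correction returns the same interval
prop.test(y, n, correct = FALSE)$conf.int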
Bayesian inference for binomial parameters
Bayesian analyses incorporate “prior information” about parameters using
• prior subjective belief about a parameter, or
• prior knowledge from other studies, or
• very little knowledge (a “weakly informative” prior)
The prior distribution (g) is combined with the likelihood (f) to create a posterior (h):

h(θ|y) = f(y|θ) g(θ) / f(y) ∝ f(y|θ) g(θ)
The beta distribution is a conjugate prior for a binomial parameter: with a beta(α1, α2) prior and y successes in n trials, the posterior is also a beta distribution, specifically beta(y + α1, n − y + α2).
Shiny app for Bayesian inference of a Binomial.
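A minimal sketch of the conjugate update in R, with placeholder data and a uniform beta(1, 1) prior:

## Conjugate beta-binomial update -- placeholder counts and prior
y <- 8; n <- 20                      # hypothetical successes and trials
alpha1 <- 1; alpha2 <- 1             # beta(1, 1) = uniform prior
post.a <- y + alpha1                 # posterior is beta(y + alpha1, n - y + alpha2)
post.b <- n - y + alpha2
qbeta(c(0.025, 0.5, 0.975), post.a, post.b)  # posterior median and 95% credible interval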
An exercise
1. Write down your prior belief about the probability that this coin will land heads.
2. Share it with the class
3. Use the prior probabilities to estimate a beta distribution.
library(MASS)   # for fitdistr()
x <- c(
  ## enter the class's prior probabilities here
)
## maximum-likelihood fit of a beta distribution; the list gives starting values for the shapes
fitdistr(x, "beta", start = list(shape1 = 1, shape2 = 1))
4. Use the app to see how the posterior changes as we flip the coin.
Intro to Contingency Tables
Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Distributions of categorical variables: Multinomial
Suppose that each of n independent and identical trials can have an outcome in any of c categories. Let

yij = 1 if trial i has outcome in category j, and yij = 0 otherwise.

Then nj = Σi yij counts the trials with outcome in category j, and the sample proportion estimates the category probability:

π̂j = nj / n
Distributions of categorical variables: Poisson
One simple distribution for count data that do not result from a fixed number of trials is the Poisson. The Poisson pmf is

p(y) = e^(−µ) µ^y / y! ,   y = 0, 1, 2, ...,   with E(Y) = Var(Y) = µ
For adult residents of Britain who visit France this year, let
I Y1 = number who fly there
I Y2 = number who travel there by train without a car
I Y3 = number who travel there by ferry without a car
I Y4 = number who take a car
A Poisson model for (Y1, Y2, Y3, Y4) treats these as independent Poisson random variables, with parameters (µ1, µ2, µ3, µ4). The total n = Σi Yi also has a Poisson distribution, with parameter Σi µi.
Distributions of categorical variables: Poisson
The conditional distribution of (Y1, Y2, Y3, Y4) given Σi Yi = n is multinomial(n, {πi = µi / Σj µj}).
Example: A survey of student characteristics
In the R data set survey, the Smoke column records the survey
response about the student’s smoking habit. As there are exactly
four proper responses in the survey: “Heavy”, “Regul” (regularly),
“Occas” (occasionally) and “Never”, the Smoke data is multinomial.
library(MASS)   # load the MASS package; the survey data frame is in MASS
## tabulate the Smoke responses (the counts shown come from table(); object name assumed)
smoke.freq <- table(survey$Smoke)
smoke.freq
##
## Heavy Never Occas Regul
##    11   189    19    17
Example: A survey of student characteristics
Suppose the campus smoking data are as shown above. We wish to test the null hypothesis that the frequency of smoking is the same in all of the groups on campus, i.e. H0 : πj = πj0 = 1/4, j = 1, ..., 4.

## chi-squared test against equal probabilities (object names smoke.freq / smoke.test assumed)
(smoke.test <- chisq.test(smoke.freq,
                          p = rep(1/length(smoke.freq), length(smoke.freq))))
##
## Chi-squared test for given probabilities
##
## data:  smoke.freq
## X-squared = 382.51, df = 3, p-value < 2.2e-16
Thus, there is strong evidence against the null hypothesis that all
groups are equally represented on campus (p<.0001).
Example (continued): expected and observed counts
smoke.test$observed    # observed counts
##
## Heavy Never Occas Regul
##    11   189    19    17
smoke.test$expected    # expected counts under H0: 236/4 = 59 for every category
Testing with estimated expected frequencies
In some applications, the hypothesized πj0 = πj0 (θ) are functions of
a smaller set of unknown parameters θ.
For example, consider a scenario (Table 1.1 in CDA) in which we are
studying the rates of infection in dairy calves. Some calves become
infected with pneumonia. A subset of those calves also develop a
secondary infection within two weeks of the first infection clearing
up. The goal of the study was to test whether the probability of
primary infection was the same as the conditional probability of
secondary infection, given that the calf got the primary infection.
Let π be the probability of primary infection. Fill in the following
2x2 table with the associated probabilites under the null hypothesis:
Secondary Infection
Primary Infection Yes No Total
Yes
No
Example continued
Let nab denote the number of observations in row a and column b.
Secondary Infection
Primary Infection Yes No Total
Yes n11 n12
No n21 n22
Under H0 the three possible outcomes have probabilities π², π(1 − π), and 1 − π, so maximizing the multinomial likelihood gives the MLE

π̂ = (2 n11 + n12) / (2 n11 + 2 n12 + n22) = 123/249 ≈ 0.494

The observed counts, with expected counts n π̂², n π̂(1 − π̂), and n(1 − π̂) in parentheses, are

                     Secondary Infection
Primary Infection    Yes          No
Yes                  30 (38.1)    63 (39.0)
No                   0            63 (78.9)

The Pearson statistic is X² ≈ 19.7. With 3 possible cells, 1 constraint, and 1 estimated parameter, df = 3 − 1 − 1 = 1, and the p-value is

P(χ²1 > 19.7) =
1-pchisq(19.7, df=1)
## [1] 9.060137e-06
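A sketch of how the expected counts and the Pearson statistic above can be reproduced in R from the observed counts (30, 63, 63):

## Calf infection example: expected counts and Pearson chi-squared under H0
obs    <- c(both = 30, primary.only = 63, neither = 63)
n      <- sum(obs)
pi.hat <- (2 * obs[1] + obs[2]) / (2 * obs[1] + 2 * obs[2] + obs[3])  # MLE of pi under H0
expct  <- n * c(pi.hat^2, pi.hat * (1 - pi.hat), 1 - pi.hat)          # expected counts
X2     <- sum((obs - expct)^2 / expct)
c(pi.hat = unname(pi.hat), X2 = X2, p.value = 1 - pchisq(X2, df = 1))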
Therefore, the data do not support the hypothesis that the probability of a primary infection equals the conditional probability of a secondary infection given a primary infection. Under H0, we would anticipate many more calves with secondary infections than were actually observed. “The researchers concluded that primary infection had an immunizing effect that reduced the likelihood of a secondary infection.”
Lecture 3: Contingency Tables
Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Contingency Tables (notation)
Imagine we have random variables X and Y. Let X and Y be categorical variables with I and J categories, respectively. We can draw a contingency table with I rows and J columns:

          Y = 1   Y = 2   ···   Y = j   ···   Y = J
X = 1     n11     n12     ···   n1j     ···   n1J     n1+
X = 2     n21     n22     ···   n2j     ···   n2J     n2+
...       ...     ...           ...           ...     ...
X = i     ni1     ni2     ···   nij     ···   niJ     ni+
...       ...     ...           ...           ...     ...
X = I     nI1     nI2     ···   nIj     ···   nIJ     nI+
          n+1     n+2     ···   n+j     ···   n+J     n
The corresponding table of joint cell probabilities, with marginal totals, is

          Y = 1   Y = 2   ···   Y = j   ···   Y = J
X = 1     π11     π12     ···   π1j     ···   π1J     π1+
X = 2     π21     π22     ···   π2j     ···   π2J     π2+
...       ...     ...           ...           ...     ...
X = i     πi1     πi2     ···   πij     ···   πiJ     πi+
...       ...     ...           ...           ...     ...
X = I     πI1     πI2     ···   πIj     ···   πIJ     πI+
          π+1     π+2     ···   π+j     ···   π+J     1
Contingency Tables - Marginal Probabilities
          Y = 1   Y = 2   ···   Y = j   ···   Y = J
X = 1     π11     π12     ···   π1j     ···   π1J     π1+
X = 2     π21     π22     ···   π2j     ···   π2J     π2+
...       ...     ...           ...           ...     ...
X = i     πi1     πi2     ···   πij     ···   πiJ     πi+
...       ...     ...           ...           ...     ...
X = I     πI1     πI2     ···   πIj     ···   πIJ     πI+
          π+1     π+2     ···   π+j     ···   π+J     1
Contingency Tables - Conditional Probabilities
          Mac    PC    Linux
DuBois
SciLi
Arnold
Total                              n
Sampling Methods - Multinomial (fixed n)
2. Multinomial Sampling (fixed n)
Another example: Cohort study. Enroll 5,000 people and track them
over a year. Measure video game playing (none / low / high) and
hospitalizations for repetitive strain injuries.
Sampling Methods - Multinomial (fixed row n / column n)
Case Control
SE_1 . . . . . .
SE_2 . . . . . .
SE_3 . . . . . .
Total 1000 1000
The χ² Statistic
* Eij is the expected count for each cell based on its probability under the null hypothesis
* Oij is the count observed for that cell
* We reject H0 if our χ² statistic is large - a larger test statistic means larger deviations from expected counts
* The test is two-sided by virtue of the squared term (...)²
* Compare to the (1 − α) quantile of the χ² distribution with the appropriate degrees of freedom
Goodness of Fit χ² Test
H0 : variables independent
Note: Because this is a 2x2 table with row and column totals known, n11 determines the other three counts.
Fisher’s Exact Test: Example
In this case, the P-value for Fisher’s exact test is the null probability
of this table and of tables having even more evidence in favor of her
claim.
The observed table, with t0 = 3 correct choices, has null probability

P(n11 = 3) = (4 choose 3)(4 choose 1) / (8 choose 4) = 16/70 ≈ 0.229.

The only table more extreme in the direction of Ha is n11 = 4, which has probability 1/70 ≈ 0.014. The P-value is P(n11 ≥ 3) = 0.243.
Despite the underwhelming evidence in this test, Bristol did
eventually convince Fisher that she could tell the difference.
For more, see Agresti p. 90 - 92.
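A sketch of these calculations in R, using the hypergeometric distribution and fisher.test(); the 2x2 table encodes 3 correct guesses out of 4 cups of each type:

## Tea-tasting example: exact null probabilities and Fisher's exact test
dhyper(3, m = 4, n = 4, k = 4)            # P(n11 = 3)
dhyper(4, m = 4, n = 4, k = 4)            # P(n11 = 4)
sum(dhyper(3:4, m = 4, n = 4, k = 4))     # one-sided P-value, P(n11 >= 3)
tea <- matrix(c(3, 1, 1, 3), nrow = 2,
              dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")))
fisher.test(tea, alternative = "greater")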
Birth Order and Gender
H0 : π1 = π2 = π3 = π4 = 0.25
Ha : at least one πi ≠ 0.25
First Child M F
Second Child M F M F Totals
Count 218 227 278 277 1000
Expected Value 250 250 250 250 1000
Birth Order and Gender
Chi-squared test for this data:
TS = Σ_{k=1}^{4} (n_obs,k − n_exp,k)² / n_exp,k

n = 1000, k = 4, df = k − 1 = 3
α = 0.05
## [1] 12.264
## [1] 0.006531413
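The R chunk behind these two numbers is not shown; a sketch that reproduces them from the counts in the table (cell labels assume first-child sex followed by second-child sex):

## Birth-order example: Pearson chi-squared statistic and p-value
obs      <- c(MM = 218, MF = 227, FM = 278, FF = 277)
expected <- rep(sum(obs) / length(obs), length(obs))   # 250 per cell under H0
TS <- sum((obs - expected)^2 / expected)
TS
1 - pchisq(TS, df = length(obs) - 1)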
Birth Order and Gender
With a p-value of 0.0065, we reject the null hypothesis at a significance level of 0.05. We have evidence to suggest that at least one πk ≠ 0.25.
Chi-Square Density Graph
[Figure: density of the chi-squared distribution with df = 3, dchisq(x, df = 3), plotted for x from 0 to 15.]
Intro to Contingency Tables
Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Independence
Definition:
Two categorical variables are independent iff

P(Y = j | X = i) = P(Y = j)   for all i and j.
Testing for independence (Two-way contingency table)
I Under H0 : πij = πi+ π+j , ∀ i, j, the expected cell counts are µij = n πi+ π+j , estimated by

    π̂i+ = ni+ / n ,   π̂+j = n+j / n   ⇒   µ̂ij = ni+ n+j / n

I The Pearson statistic is

    X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij − µ̂ij)² / µ̂ij

I µ̂ij requires estimating the πi+ and π+j , which have degrees of freedom I − 1 and J − 1, respectively. Notice the constraints Σi πi+ = Σj π+j = 1, so

    df = (IJ − 1) − (I − 1) − (J − 1) = (I − 1)(J − 1)

I X² is asymptotically χ²_(I−1)(J−1)
I It is helpful to look at the cell contributions (residuals),

    (O − E)² / E

  The residuals can give useful information about where the model is fitting well or not; a short R sketch follows.
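As a sketch of the corresponding R workflow on a made-up two-way table, chisq.test() returns X², df = (I − 1)(J − 1), the estimated expected counts, and the Pearson residuals:

## Test of independence for a two-way table (made-up counts)
tab <- matrix(c(30, 20, 25, 35, 15, 40), nrow = 2,
              dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2", "y3")))
fit <- chisq.test(tab)
fit                # X-squared, df = (2 - 1)(3 - 1) = 2, p-value
fit$expected       # muhat_ij = n_i+ n_+j / n
fit$residuals      # Pearson residuals (O - E) / sqrt(E)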
Measure of Diagnostic Tests
                  Diagnosis
Disease Status      +       −
D                  π11     π12
D̄                  π21     π22

I Sensitivity = P(+|D) = 0.86
I Specificity = P(−|D̄) = 0.88

PPV = P(D|+) = P(D ∩ +) / P(+)
    = P(+|D)P(D) / [ P(+|D)P(D) + P(+|D̄)P(D̄) ]
    = P(+|D)P(D) / [ P(+|D)P(D) + (1 − P(−|D̄))P(D̄) ]
    = Sensitivity × Prevalence / [ Sensitivity × Prevalence + (1 − Specificity) × (1 − Prevalence) ]

The same example, written as conditional probabilities of each test result given disease status:

                  Diagnosis
Disease Status      +       −
D                  0.86    0.14
D̄                  0.12    0.88
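A small sketch of the PPV formula in R; the sensitivity and specificity come from the example above, while the prevalence is an assumed value used only for illustration:

## Positive predictive value from sensitivity, specificity, and prevalence
ppv <- function(sens, spec, prev) {
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}
ppv(sens = 0.86, spec = 0.88, prev = 0.10)   # a 10% prevalence is assumed, not from the notes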
Response Y
Explanatory X Success Failure Row Total
group 1     n11 = x1     n12 = n1 − x1     n1
group 2     n21 = x2     n22 = n2 − x2     n2
I difference of proportions
I relative risk
I odds ratio
Difference of Proportions and Relative Risk

Response Y
Explanatory X    Success       Failure            Row Total
group 1          n11 = x1      n12 = n1 − x1      n1
group 2          n21 = x2      n22 = n2 − x2      n2

I Definition (difference of proportions): d = π1 − π2
I Definition (relative risk):
    r = π1 / π2
I Motivation: The difference between π1 = 0.010 and
π2 = 0.001 is more noteworthy than the difference between
π1 = 0.410 and π2 = 0.401. The “relative risk”
(0.010/0.001=10, 0.410/0.401=1.02) is more informative than
“difference of proportions” (0.009 for both).
I The estimate of r is
r̂ = π̂1 /π̂2
I The estimator converges to normality faster on the log scale.
I The estimated standard error of log r̂ is
    σ̂(log r̂) = sqrt( (1 − π̂1)/(n1 π̂1) + (1 − π̂2)/(n2 π̂2) )
I A CI for r is obtained by exponentiating the endpoints of log r̂ ± zα/2 σ̂(log r̂).
I Odds in group 1:
    φ1 = π1 / (1 − π1)
I Interpretation: φ1 = 3 means a success is three times as likely
as a failure in group 1
I Odds ratio:
    θ = φ1 / φ2 = [ π1/(1 − π1) ] / [ π2/(1 − π2) ]
I Interpretation: θ = 4 means the odds of success in group 1 are
four times the odds of success in group 2
I The estimate is
    θ̂ = (n11 n22) / (n12 n21)
I log(θ̂) converges to normality much faster than θ̂
I An estimate of the asymptotic standard error for log(θ̂) is
    σ̂(log θ̂) = sqrt( 1/n11 + 1/n12 + 1/n21 + 1/n22 )
This formula can be derived using the Delta method. Recall

log θ̂ = log(π̂1) − log(1 − π̂1) − log(π̂2) + log(1 − π̂2)

First, consider f(π1) = log(π1) − log(1 − π1). With

σ² = π1(1 − π1)/n1 ,   f′(π1) = 1/π1 + 1/(1 − π1) ,

we get

[f′(π1)]² σ² = 1/(n1 π1) + 1/(n1 (1 − π1)) ,

which is estimated by 1/n11 + 1/n12. Repeating for the second group and adding the two pieces gives the standard error formula above.
Lung Cancer
Smoker Cases Controls
Yes 688 650
No 21 59
Total 709 709
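A sketch computing the sample odds ratio and its 95% CI (on the log scale, then back-transformed) for the lung cancer table:

## Odds ratio and 95% CI for the smoking / lung cancer case-control table
n11 <- 688; n12 <- 650   # smokers: cases, controls
n21 <- 21;  n22 <- 59    # non-smokers: cases, controls
theta.hat <- (n11 * n22) / (n12 * n21)
se.log    <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
theta.hat
exp(log(theta.hat) + c(-1, 1) * qnorm(0.975) * se.log)   # CI on the odds-ratio scale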
The Delta method builds upon the Central Limit Theorem to allow
us to examine the convergence of the distribution of a function g of
a random variable X .
It is not too complicated to derive the Delta method in the
univariate case. We need to use Slutsky’s Theorem along the way; it
will be helpful to first review ideas of convergence in order to better
understand where Slutsky’s Theorem fits into the derivation.
Delta Method: Convergence of Random Variables
Suppose we know that √n (Xn − θ) →d N(0, σ²) and we are interested in the behavior of some function g(Xn) as n → ∞. If g′(θ) exists and is not zero, we can write g(Xn) ≈ g′(θ)(Xn − θ) + g(θ), using Taylor's approximation:

g(Xn) = g(θ) + g′(θ)(Xn − θ) + Σ_{i=2}^{∞} g^(i)(θ) (Xn − θ)^i / i!
Delta Method: Hand-wave-y Derivation
Multiplying through by √n,

√n g(Xn) = √n g(θ) + √n g′(θ)(Xn − θ) + √n Σ_{i=2}^{∞} g^(i)(θ)(Xn − θ)^i / i! ,

so, writing R(Xn) for the remainder sum,

√n (g(Xn) − g(θ)) = √n g′(θ)(Xn − θ) + √n R(Xn)
Delta Method: Hand-wave-y Derivation
Since √n (Xn − θ) →d N(0, σ²), we have

g′(θ) √n (Xn − θ) →d N(0, σ² (g′(θ))²) .

It can be shown that the remainder term √n R(Xn) →p 0 (see the Stephens link from McGill below for a proof). We now have the necessary setup to apply Slutsky’s Theorem, and we can conclude that

√n (g(Xn) − g(θ)) →d N(0, σ² (g′(θ))²) .
Delta Method: References
I [Link]
[Link]
I https://[Link]/wiki/Convergence_of_random_variables
I [Link]
I [Link]
I [Link]
[Link]
Lecture 5: Likelihood Ratio Confidence Intervals &
Bayesian Methods for Contingency Tables
Nick Reich / transcribed by Bianca Doone and Guandong Yang
Motivating example: for an iid Poisson(µ) sample y = (y1, ..., yn), the log-likelihood is

L(µ|y) = log( ∏_{i=1}^{n} e^(−µ) µ^(yi) / yi! ) = log( e^(−nµ) µ^(Σ yi) / ∏_{i=1}^{n} yi! )

       = −nµ + log(µ) Σ_{i=1}^{n} yi + C

where C = −log( ∏_{i=1}^{n} yi! ) does not depend on µ, so

L(µ|y) ∝ −nµ + log(µ) Σ_{i=1}^{n} yi
i=1
• Note that taking the first derivative and setting equal to zero provides us the MLE for the data:
    µ̂ = ȳ = (1/n) Σ_{i=1}^{n} yi
LR CI’s - Visualization of this likelihood
[Figure: the Poisson log-likelihood L(µ) plotted against µ from 0 to 8.]
• In general, adding more data lowers the value of the likelihood (it is a product of more probabilities, each at most 1) while making it more sharply peaked around the MLE.
• A likelihood-ratio (LR) confidence region is the set of parameter values Θ for which the LR test statistic −2[L(Θ) − L(Θ̂)] does not exceed the (1 − α)th quantile of χ²df, with Θ̂ as the fixed value at the MLE.
• The set of Θ where this holds is a 100(1 − α)% confidence region, with degrees of freedom equal to the number of parameters the likelihood is varying over (free parameters).
LR CI’s - Motivating Example Cont’d
Motivating Example: Poisson Sample-Mean Estimation (1-parameter Poisson Mean)
• Let Θ = µ. Then we can define
    LRTS(µ) = −2[L(µ) − L(µ̂)]
  and the 95% CI is the set
    { µ : L(µ) > L(µ̂) − χ²1(.95)/2 }
[Figure: the log-likelihood curve L(µ) with a horizontal cutoff at L(µ̂) − χ²1(.95)/2; the values of µ where the curve lies above the cutoff form the LR confidence interval.]
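A sketch of the construction with made-up Poisson counts: find the values of µ whose log-likelihood stays within χ²1(.95)/2 of the maximum.

## Likelihood-ratio 95% CI for a Poisson mean (made-up data)
y <- c(2, 4, 3, 5, 1, 3)                      # hypothetical counts
loglik <- function(mu) sum(dpois(y, mu, log = TRUE))
mu.hat <- mean(y)                             # MLE
cutoff <- loglik(mu.hat) - qchisq(0.95, df = 1) / 2
## the CI endpoints solve loglik(mu) = cutoff on either side of the MLE
lower <- uniroot(function(m) loglik(m) - cutoff, c(1e-6, mu.hat))$root
upper <- uniroot(function(m) loglik(m) - cutoff, c(mu.hat, 50))$root
c(lower, mu.hat, upper)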
Beta/Binomial Example - Handedness
Q: Are women and men left-handed at the same rate?
Gender RH LH Total
Men 43 9 n1 = 52
Women 44 4 n2 = 48
Total 77 13 100
In other words:
• Is there a difference in the proportions of men who are left-handed and women who are left-handed?
H0 : Pr(left-handed|male) = Pr(left-handed|female)
• Is the difference between left-handed men and left-handed women equal to zero?
H0 : Pr(left-handed|male) − Pr(left-handed|female) = 0
Bayesian Method - Computational Technique
1. Simulate N independent draws from p(π1 |y, n) and p(π2 |y, n)
2. Compute θi , i = 1, ..., N
3. Plot empirical posterior
4. Calculate summary statistics
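A sketch of this recipe applied to the handedness table above, assuming independent Beta(1, 1) priors on the two left-handedness probabilities:

## Posterior simulation for the difference in left-handedness rates (Beta(1,1) priors assumed)
set.seed(1)
N     <- 10000
pi.m  <- rbeta(N, 9 + 1, 43 + 1)    # men:   9 left-handed out of 52
pi.w  <- rbeta(N, 4 + 1, 44 + 1)    # women: 4 left-handed out of 48
theta <- pi.m - pi.w                # posterior draws of the difference
quantile(theta, c(0.025, 0.975))    # 95% credible interval
mean(theta > 0)                     # posterior probability men are more often left-handed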
Multinomial/Dirichlet - Example in R
Adapted from Bayesian Data Analysis 3
• A poll was conducted with n = 1447 participants, with the following results:
– Obama: y1 = 727
– Romney: y2 = 583
– Other: y3 = 137
• The estimand of interest is θ1 − θ2
• Assuming simple random sampling, we have

    (y1, y2, y3) ∼ Multinomial(n, (θ1, θ2, θ3))
• We now apply the same computational technique as in the univariate case. . .
Multinomial/Dirichlet - Example in R
## rdirichlet() is not in base R; it is available in e.g. the gtools package
library(gtools)
# data
y <- c(727, 583, 137)
# "uniform" hyperparameter
a <- c(1, 1, 1)
# prior draws
pri <- rdirichlet(100, a)
# generate posterior draws
postr <- rdirichlet(1000, y + a)
[Figure: marginal densities of the prior draws for Group 1, Group 2, and Group 3, plotted over (0, 1).]
Multinomial/Dirichlet - Visualization of Prior (3D)
[Figure: ternary (simplex) plot of the Dirichlet prior draws over V1, V2, V3.]
Multinomial/Dirichlet - Visualization of Posterior
[Figure: marginal densities of the posterior draws for Group 1, Group 2, and Group 3.]
Multinomial/Dirichlet - Visualization of Posterior (3D)
[Figure: ternary (simplex) plot of the Dirichlet posterior draws over V1, V2, V3.]
Histogram of difference between group 1 and group 2
[Figure: histogram of the posterior draws of poll_diff.]
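The chunk that defines poll_diff is not shown above; it was presumably something along these lines (column order assumed to be Obama, Romney, Other):

poll_diff <- postr[, 1] - postr[, 2]   # posterior draws of theta1 - theta2
hist(poll_diff)
mean(poll_diff)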
## [1] 0.09999157
### Posterior probability that theta1 > theta2
mean(poll_diff > 0)
## [1] 1
### 95% CI
quantile(poll_diff, c(.025, .975))
## 2.5% 97.5%
## 0.05538754 0.14811648
Lecture 6: GLMs
Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Generalized Linear Models (GLMs)
GLMs are extensions, or generalizations, of linear regression models that accommodate non-normal response distributions and model functions of the mean.
- Example for an ordinary LM:
We model E[Y|X] via

    Y = Xβ + ε ,   εi ~ iid N(0, σ²)

The best fit line on the following plot represents E(Y|X).
[Figure: scatter plot of y against x with the fitted line E(Y|X).]
Overview of GLMs
I Example 1: a normal response can be written in natural exponential family form, f(yi; θi) = a(θi) b(yi) exp{ yi Q(θi) }:

    f(yi | µi, σ²) = (1/sqrt(2πσ²)) exp{ −(yi − µi)² / (2σ²) }
                   = (1/sqrt(2πσ²)) exp{ −yi² / (2σ²) } · exp{ −µi² / (2σ²) } · exp{ yi µi / σ² }

I Where:
    I θ = µi
    I a(µi) = exp{ −µi² / (2σ²) }
    I b(yi) = (1/sqrt(2πσ²)) exp{ −yi² / (2σ²) }
    I Q(µi) = µi / σ²
Example 2: Poisson for count data
I Yi ∼ Pois(µi)
I
    f(yi | µi) = e^(−µi) µi^(yi) / yi!
               = e^(−µi) (1/yi!) exp{ yi log µi }
I Where:
    I θ = µi
    I a(µi) = e^(−µi)
    I b(yi) = 1/yi!
    I Q(µi) = log µi
Deviance
For a particular GLM for observations y = (y1 , ..., yN ), let L(µ|y )
denote the log-likelihood function expressed in terms of the means
µ = (µ1 , ..., µN ). The deviance of a Poisson or binomial GLM is
D = −2[L(µ̂|y ) − L(y |y )]
For a binary response with πi = Pr(Yi = 1), the logit link gives

    logit(πi) = log( πi / (1 − πi) )
              = logit(Pr(Yi = 1))
              = logit(E[Yi])
              = g(E[Yi])
              = Xβ
              = β0 + β1 xi ,

which implies Pr(Yi = 1) = e^(Xβ) / (1 + e^(Xβ)).
I g does not have to be a linear function (linear model means
linear with respect to β).
Logistic Regression (Cont.)
[Figure: fitted logistic relationships plotted on the log-odds scale, logit(πi), and on the probability scale, πi, for β1 > 0 and β1 < 0, over x from 0 to 20.]
Example: Linear Probability vs. Logistic Regression Models
## (Intercept) Var1
## -3.866245 0.397335
## (Intercept) Var1
## 0.01724645 0.01977784
log [ odds(πi | X1 = k + 1, X2 = z) / odds(πi | X1 = k, X2 = z) ] = log odds ratio = β1
[Figure: two panels on the log-odds scale, logit(πi), over x from 0 to 5, illustrating β1 as a comparison of log odds across one-unit changes in x.]
Lecture 7: GLMs: Score equations,
Residuals
The GLM likelihood (score) equations are

    Σ_{i=1}^{N} (yi − µi) xij / Var(Yi) · (∂µi/∂ηi) = 0 ,   j = 1, ..., p ,

where wi denotes

    wi = (1/Var(Yi)) (∂µi/∂ηi)²
Asymptotic Covariance Matrix of β̂:

             SLR                                GLM
Var(β̂i)      σ̂² / Σ_{i=1}^{N} (xi − x̄)²         the i-th main diagonal element of (X^T W X)^(-1)
Cov(β̂)       σ̂² (X^T X)^(-1)                    (X^T W X)^(-1)
Residual and Diagnostics
I Deviance Tests
I Measure of goodness of fit in GLM based on likelihood
I Most useful as a comparison between models (used as a
screening method to identify important covariates)
I Use the saturated model as a baseline for comparison with other
model fits
I For Poisson or binomial GLM: D = −2[L(µ̂|y ) − L(y |y )].
I Example of Deviance
−2[L(µ̂0 |y ) − L(µ̂1 |y )]
= −2[L(µ̂0 |y ) − L(y |y )] − {−2[L(µ̂1 |y ) − L(y |y )]}
= D(y |µ̂0 ) − D(y |µ̂1 )
Hypothesis test with differences in Deviance
I Pearson residuals:
    ei^P = (yi − µ̂i) / sqrt(V(µ̂i)) , where µ̂i = g^(-1)(η̂i) = g^(-1)(Xi β̂)
I Deviance residuals:
    ei^D = sign(yi − µ̂i) sqrt(di) , where di is the deviance contribution of the ith obs. and sign(x) = 1 if x > 0, −1 if x ≤ 0
I Standardized residuals:
    ri = ei / sqrt(1 − ĥi) , where ei = (yi − µ̂i)/sqrt(V(µ̂i)), ĥi is the measure of leverage, and ri ≈ N(0, 1)
Residual Plot
Problem: Residual plot is hard to interpret for logistic regression
[Figure: raw residuals versus expected values for a logistic regression; the points fall along two curves, one for each observed outcome value.]
Binned Residual Plot
I Group observations into ordered groups (by xj , ŷ or xij ), with
equal number of observations per group.
I Compute group-wise average for raw residuals
I Plot the average residuals vs predicted value. Each dot
represent a group.
[Figure: binned average residuals versus expected values.]
Binned Residual Plot (Part 2)
I Red lines indicate ± 2 standard-error bounds, within which one
would expect about 95% of the binned residuals to fall.
I R function available in the arm package:
library(arm)
binnedplot(x, y, nclass = ...)
# x: expected values; y: residual values; nclass: number of bins
[Figure: binned residual plot with ±2 standard-error bounds.]
Binned Residual Plot (Part 3)
[Figure: four example binned residual plots of average residuals versus expected values.]
logit(πi ) = α + βxi
Model Diagnostics via Residuals
[Figure: residuals versus SBP under an independence model and under a linear logit model, logit(πi) = α + βxi.]
Newton Raphson Algorithm
I We want to solve the score equations
    ∂L(β)/∂β = 0
I The score vector is
    u^T = ( ∂L(β)/∂β1 , ..., ∂L(β)/∂βp )
I The matrix of second derivatives is
    H = ( ∂²L(β) / (∂βi ∂βj) ) ,   ∀ i, j = 1, 2, ..., p
I where H is the Hessian matrix, also called the observed information.
Newton Raphson Algorithm
Fisher scoring replaces the observed information with the expected information

    J = ( −E[ ∂²L(β) / (∂βi ∂βj) ] ) ,   ∀ i, j = 1, 2, ..., p

and iterates

    β^(t+1) = β^(t) + (J^(t))^(-1) u^(t)
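A sketch of this algorithm for logistic regression on simulated data; with the canonical logit link, the expected information is X^T W X, and the iterations match glm():

## Fisher scoring (IRLS) for logistic regression, compared with glm()
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
X <- cbind(1, x)
beta <- c(0, 0)
for (iter in 1:25) {
  eta <- as.vector(X %*% beta)
  p   <- plogis(eta)
  u   <- t(X) %*% (y - p)               # score vector
  J   <- t(X) %*% (X * (p * (1 - p)))   # expected information X'WX
  beta <- beta + solve(J, u)            # beta_(t+1) = beta_(t) + J^{-1} u
}
cbind(fisher.scoring = as.vector(beta),
      glm = coef(glm(y ~ x, family = binomial)))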
Inference for GLMs
I Bayesian inference
I Sample directly from posterior multivariate distribution and
calculate the credible sets.
I Likelihood based inference
I Similarly, directly calculate confidence intervals from likelihood
function.
Inference for GLMs
    logit(µi) = α + βxi
    ⇒ Var(logit(µ̂i)) = Var(α̂ + β̂xi) = Var(α̂) + xi² Var(β̂) + 2xi Cov(α̂, β̂)
    ⇒ CI: logit(µ̂i) ± 1.96 SE(logit(µ̂i))
I We can then get a 95% CI for µi itself by back-transforming the CI for logit(µ̂i) (or by applying the Delta method).
Multivariate GLMs
Made available under the Creative Commons Attribution-ShareAlike 4.0 International License.
Overview: Models for Multinomial Responses
Note: This lecture focuses mainly on the Baseline Category Logit
Model (see Agresti Ch. 8), but for the exam we are responsible for
reading Chapter 8 of the text and being familiar with all types of
models for multinomial responses introduced there.
I Discrete-Choice Models
I Conditional Logit Models (and relationship to Multinomial Logit Model)
I Multinomial Probit Discrete-Choice Models
I Extension to Nested Logit and Mixed Logit Models
I Extension to Discrete Choice Model with Ordered Categories
Baseline Category Logit Model
The Baseline Category Logit (BCL) model is appropriate for
modeling nominal response data as a function of one or more
categorical or quantitative covariates.
Yi | xi ∼ Multinomial(1, {π1(xi), . . . , πJ(xi)})
Overview of Modeling Process
The model specifies log odds relative to a baseline category J; contrasts between any other two categories follow, since

    log( πa(xi) / πJ(xi) ) − log( πb(xi) / πJ(xi) ) = log( πa(xi) / πb(xi) )

As in other GLMs, we model a link function of the mean, g(µi), where µi = E[yi].
Link Function
    g(µi) = ( log(π1(xi)/πJ(xi)), log(π2(xi)/πJ(xi)), . . . , log(π(J−1)(xi)/πJ(xi)) )^T = Xi β

where Xi is a (J − 1) x (J − 1)(P + 1) block-diagonal matrix, with each block holding the covariate row for one non-reference category:

         | 1  xi1 ... xiP   0  0   ... 0     ...   0  0   ... 0   |
    Xi = | 0  0   ... 0     1  xi1 ... xiP   ...   0  0   ... 0   |
         | .  .       .     .  .       .           .  .       .   |
         | 0  0   ... 0     0  0   ... 0     ...   1  xi1 ... xiP |
Vector of Parameters
β is a column vector of dimension (J − 1)(P + 1) x 1, containing the category-specific coefficients αj and βjk for j ∈ {1, ..., J − 1} and k ∈ {1, ..., P}:

    β = (α1, β11, . . . , β1P, α2, β21, . . . , β2P, . . . , αJ−1, β(J−1)1, . . . , β(J−1)P)^T
Multivariate GLM : The Mechanics of Prediction
Let Xi^(j) refer to the j-th row vector of Xi. Then the dot product of Xi^(j) with the parameter vector β is the predicted log probability ratio for observation i and non-reference category Cj:

    yij = g(µi)j = log( πj(xi) / πJ(xi) ) = Xi^(j) · β
Multivariate GLM : Example of the Mechanics of Prediction
Suppose we wish to calculate yi1.
The first row vector of Xi is:

    Xi^(1) = (1, xi1, xi2, . . . , xiP, 0, 0, 0, . . . , 0)

The column vector of parameters β is the same for all i:

    β = (α1, β11, . . . , β1P, α2, β21, . . . , β2P, . . . , αJ−1, β(J−1)1, . . . , β(J−1)P)^T

Their dot product gives us the predicted yi1:

    yi1 = g(π1(xi)) = log( π1(xi) / πJ(xi) )
        = Xi^(1) · β
        = 1·α1 + xi1·β11 + · · · + xiP·β1P
          + 0·α2 + 0·β21 + · · · + 0·β2P
          + . . .
          + 0·αJ−1 + 0·β(J−1)1 + · · · + 0·β(J−1)P
        = α1 + xi1 β11 + · · · + xiP β1P
Response Probabilities
    log( πj(xi) / πJ(xi) ) = Xi^(j) β   =⇒   πj(xi) = exp(Xi^(j) β) / ( 1 + Σ_{n=1}^{J−1} exp(Xi^(n) β) )
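A sketch of fitting a baseline-category logit model in R with nnet::multinom() on made-up data (VGAM::vglm() with a multinomial family is an alternative):

## Baseline-category logit fit on made-up nominal data
library(nnet)
set.seed(7)
d <- data.frame(x = rnorm(300),
                y = factor(sample(c("A", "B", "C"), 300, replace = TRUE)))
fit <- multinom(y ~ x, data = d, trace = FALSE)
summary(fit)        # one (alpha_j, beta_j) pair per non-reference category
head(fitted(fit))   # predicted probabilities pi_j(x_i) for each category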
When all predictors in a model are categorical and the training data
can be represented in a contingency table that is not sparse, the
χ² or G² goodness-of-fit tests used earlier in the semester can be
used to assess whether or not the fitted BCL model is appropriate.
(generate “expected” contingency table from predicted results and
then “residuals” are expected-observed)
If some predictors are not categorical or the contingency table is
sparse, these statistics are “valid only for comparing nested models
differing by relatively few terms” (A. Agresti, Categorical Data
Analysis p. 294). This means that they cannot validly be used as a
model check overall, but they can be used to compare fit of full
vs. reduced models if the full model only has “relatively few” more
covariates than the reduced one(s).
Example: Using Symptoms to Classify Disease (Reich Lab
Research)
Motivating Question: Confirmatory clinical tests are expensive
and take time, meaning they are not a reasonable diagnostic option
in many public health settings. Can we instead design a model that
can use routine observable symptoms to classify sick individuals
accurately? (Adapted from work in progress by Brown et al.)
Outcome categories:              Example symptom/covariate data:
I C2 : Zika                      I Age
I C3 : Flu                       I Headache
I C4 : Chikungunya               I Rash
I C5 : Other                     I Conjunctivitis
I C6 : No Diagnosis              I ...
Using Symptoms to Classify Disease, Continued
Assume that each individual can only have one disease at once, and
let yi be a binary vector representing the gold-standard diagnosis of
the i th individual in the training set. For example, using the ordering
on the previous slide, the observation

    y1 = (0, 0, 1, 0, 0, 0)

indicates that individual 1's gold-standard diagnosis is the third category (flu). The BCL model for the diagnosis is then

    log( πj(xi) / π6(xi) ) = Xi β
Using Symptoms to Classify Disease: Visualization Method
We might use a graphic like the one below to represent resulting estimates
for β (these results are just randomly generated from a standard normal
distribution):
Using Symptoms to Classify Disease: Interpretation of
Graphic
    Uij = αj + βj^T xi + εij
Under the assumption that the εij are i.i.d. with the extreme value distribution, McFadden (1974) showed that this model is equivalent to the BCL model. In this setting, the interpretation of βj is the expected change in Uij for a one-unit change in the corresponding covariate, all other covariates held constant.
Recall that the extreme value distribution has CDF:
FX (x ) = exp(−exp(−x ))
What if the εij are not assumed to have this distribution?
Supplementary - Utility Functions and Probit Models
(Agresti p.299)
    Uij = αj + βj^T xi + εij
The multinomial probit model does not depend on the IIA axiom, and is
therefore an interesting approach for many applications, including voting
theory.
Example: In the 2016 election, if the only two candidates in the mix were
Hillary Clinton and Jill Stein, a voter might have chosen Jill Stein knowing
that Hillary was likely to win anyway but that a vote for Jill represented
their beliefs. However, introducing Donald Trump into the mix might have
convinced that voter that they should choose Hillary instead of Jill, since a
third-party vote for Jill would draw from Hillary’s chance. Thus the IIA
axiom is violated. The multinomial probit model can model this setting.
Example: Alvarez and Katz (2007) Multinomial Probit
Model for Election Choice in Chile in 2005
Yi ∼ Poisson(λi )
I Key Points:
I Log link implies multiplicative effect of covariates
    log(λ) = β0 + β1 X1 + β2 X2
    λ = e^(β0) e^(β1 X1) e^(β2 X2)
  - e^β is interpreted as a relative risk (rate ratio)
I With an exposure or offset ui ,
    Yi ∼ Poisson(ui λi)
    E(Yi) = ui λi
    log(E(Yi)) = log(ui) + log(λi) ,
  so log(ui) enters the linear predictor as an offset.
I Without an offset, µi = E(Yi) = λi and Var(Yi) = λi , i.e. the Poisson model forces the variance to equal the mean.
I The quasi-likelihood estimating equations are
    Σ_{i=1}^{N} (yi − µi) xij / Var(µi) · (∂µi/∂ηi) = 0 ,   j = 0, ..., p
I with weights
    wi = (∂µi/∂ηi)² / Var(Yi)
I Quasi-likelihood inflates the model-based covariance by the dispersion: covQL(β̂) = φ (X^T W X)^(-1) = φ cov(β̂)
I Standardized (Pearson-type) residuals:
    zi = (yi − ŷi) / sqrt(Var(ŷi)) = (yi − µ̂i) / sqrt(µ̂i) ≈ N(0, 1)
I and their sum of squares is approximately chi-squared,
    Σ_{i=1}^{n} zi² ∼ χ²_(n−k)
I where k is the number of parameters
I if the sum of the zi² is large (compared to the chi-squared distribution), we may need an overdispersion term φ
Is Overdispersion Term Needed in a Model?
    φ̂ = Σ_{i=1}^{n} zi² / (n − k)

    SEqp(β̂) = SEp(β̂) · sqrt(φ̂)
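A sketch of estimating φ and inflating the standard errors in R, using simulated overdispersed counts; glm's quasipoisson family applies the same adjustment automatically:

## Overdispersion check and quasi-Poisson standard errors (simulated data)
set.seed(3)
n  <- 200
x  <- rnorm(n)
mu <- exp(0.5 + 0.4 * x)
y  <- rnbinom(n, mu = mu, size = 1.5)           # negative binomial => overdispersed counts
fit.p   <- glm(y ~ x, family = poisson)
z       <- (y - fitted(fit.p)) / sqrt(fitted(fit.p))
phi.hat <- sum(z^2) / df.residual(fit.p)        # sum z_i^2 / (n - k)
phi.hat
fit.qp <- glm(y ~ x, family = quasipoisson)
summary(fit.qp)$dispersion                      # glm's own estimate of phi
sqrt(diag(vcov(fit.p))) * sqrt(phi.hat)         # inflated SEs, close to the quasi-Poisson SEs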
Lecture 11 : Smoothing: Penalized Likelihood Methods
and GAMs
Tyrell Richu
10/23/2018
Penalized Likelihood
• Consider an arbitrary model with a generic parameter β and a log-likelihood function L(β).
• Let s(β) denote a roughness penalty, which decreases as the values of β are smoother (i.e. uniformly closer to zero) and typically depends on a tuning parameter λ. The penalized log-likelihood is
    L*(β) = L(β) − s(β)
• Penalized likelihood methods are examples of regularization methods, a general approach for modifying ML methods to give sensible answers in unstable situations, such as modeling with data sets containing too many variables.
Pros/Cons of Penalized Likelihood
• L2-norm penalty (−): not useful for variable selection, because all the variables remain in the model.
• L1-norm penalty (+): allows us to plot estimates as a function of λ to summarize how explanatory variables βj drop out as λ increases, selecting individual indicators rather than entire factors.
• L1-norm penalty (−): may overly penalize βj that are truly large, introducing bias and making inference difficult. Solution: adjust the penalty function so that it includes both the L1 and L2 norms (the elastic net).
• Fitting a GAM assigns a deviance and an approximate degrees of freedom to each smooth term sj in the additive predictor, allowing inference about each term. The df helps determine how smooth the GAM fit looks (e.g. smooth functions with df = 4 look similar to cubic polynomials, which have 4 parameters); see the R sketch below.
• Like with GLMs, we can compare deviances for nested models to test whether a model gives a significantly better fit than a simpler model.
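A sketch of a binary-response GAM fit with the mgcv package on simulated data; summary() reports an effective df for each smooth term and anova() compares nested fits by deviance:

## GAM for a binary response with a smooth term (simulated data)
library(mgcv)
set.seed(11)
n <- 500
x <- runif(n, -2, 2)
y <- rbinom(n, 1, plogis(sin(2 * x)))
fit.glm <- gam(y ~ x, family = binomial)        # linear logit
fit.gam <- gam(y ~ s(x), family = binomial)     # smooth logit with estimated df
summary(fit.gam)                                # effective df for s(x)
anova(fit.glm, fit.gam, test = "Chisq")         # deviance comparison of nested models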
Final Notes
• GAMs and penalized likelihood methods extend GLMs while mimicking their structure: they assume a binomial distribution for a binary response and associate a df value with each explanatory effect, so GLM-style inference carries over.
Lecture 13: Clustered Data, An
Introduction to Applied Mixed
Models
j = 1, 2, ..., ni indexes observations within group i
i = 1, 2, ..., n indexes groups
N = Σi ni = total number of observations

Treatment indicator:
    Trtij = 1 if observation j in group i was treated, 0 otherwise
Example to begin with
Yij ∼ Poisson(λij , Pi )
log(λij ) = β0 + β1 Trtij + β2 Xi
I We want to make inference about β1 . What is the treatment
effect?
I Observations are not iid, need “adjustment”
I How can we account for variance within groups?
I 1. Marginal models with generalized estimating equations
(GEE) for variance adjustment
I 2. generalized linear mixed models (GLMM’s)
GLMM’s Models
I For a GLM with p fixed-effect parameters, the trace of the hat matrix is tr(H) = p.
I For a GLMM that also has q random effects, the effective degrees of freedom satisfy p ≤ tr(H) ≤ p + q.
Example
I Lets say we are looking at treatment(spinal implants) to relieve
back pain. Rows represent visit 1 and columns represent visit 2
[Figure 3: counts of outcomes at visit 1 (rows) by visit 2 (columns).]
Yi ∼ N(π, σy2 )
logit(π) = αi + βXij
αi ∼ N(β0 , σα2 )
β0 ∼ N(0, 100)
σα ∼ U(0, 5)
Example (interpretation)
A mixture distribution has the form

    f(x) = Σ_{k=1}^{K} wk gk(x)

where Σ_{k=1}^{K} wk = 1 and wk ≥ 0 for all k.
Mixture Distributions (cont’d)
Yi | πi ∼ Binomial(ni, πi)
πi ∼ Beta(α, β)

E[Yi] = ni α / (α + β)
Var(Yi) = ni αβ(α + β + ni) / [ (α + β)² (α + β + 1) ]
Shrinkage Effect for Beta-Binomial
I We can use the conjugacy of the Beta prior and Binomial
likelihood to derive a distributional form for the posterior πi |yi .
yi | πi ∼ Binomial(ni, πi)
πi | α, β ∼ Beta(α, β)

p(πi | yi, α, β) ∝ p(yi | πi) p(πi | α, β)
               ∝ πi^(yi) (1 − πi)^(ni − yi) · πi^(α−1) (1 − πi)^(β−1)
               ∝ πi^(yi + α − 1) (1 − πi)^(ni − yi + β − 1)

so πi | yi, α, β ∼ Beta(yi + α, ni − yi + β), with posterior mean

E(πi | yi, α, β) = (yi + α) / (ni + α + β)
Shrinkage Effect for Beta-Binomial (cont’d)
[Figure: posterior densities of πi under several priors — Beta(1,1), Beta(1,3), Beta(3,1), and Beta(.5,.5).]
Extensions to Models for Count Data
There are several ways to extend models for count data in order to
capture properties like overdispersion.
I Poisson model with adjustment for overdispersion (see previous
notes)
I Poisson-Gamma Model
I Generalized Linear Mixed Models (GLMMs)
Poisson-Gamma Model
I Here a multiplicative Gamma-distributed random effect is placed on the Poisson mean; as noted below, this choice leads to a negative binomial marginal distribution for Y.
Generalized Linear Mixed Models
I A GLMM instead adds a random intercept ui to the linear predictor, e.g. ui ∼ N(0, σ²). This example uses a log link and assumes the ui are normally distributed.
I Note that you need to use a link that transforms the linear
predictor to a non-negative value. For example, the identity link
leads to structural problems because a negative linear predictor
implies a negative expected count, which is impossible.
Other choices are possible for the distribution of ui :
I Assuming ui ∼ Gamma(1, γ) implies a negative binomially
distributed outcome Y .
I Another possible choice is assume ui follow a log-normal
distribution.
I Each choice implies a different structure for the random
intercepts.
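As a closing sketch, a Poisson GLMM with a log link and normal random intercepts can be fit with lme4::glmer(); the data here are simulated and the variable names are illustrative only:

## Poisson GLMM with a log link and normal random intercepts (simulated data)
library(lme4)
set.seed(8)
n.groups <- 30; n.per <- 10
grp <- rep(1:n.groups, each = n.per)
trt <- rbinom(n.groups * n.per, 1, 0.5)
u   <- rnorm(n.groups, 0, 0.5)                      # random intercepts u_i ~ N(0, sigma^2)
y   <- rpois(n.groups * n.per, exp(0.3 + 0.6 * trt + u[grp]))
d   <- data.frame(y, trt, grp = factor(grp))
fit <- glmer(y ~ trt + (1 | grp), data = d, family = poisson)
summary(fit)   # fixed effect for trt and the estimated random-intercept variance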