Week 10 Non Response and Missing Data
Acknowledgement
▶ The content of this module is heavily inspired by the following books and workshop
materials:
▶ John Ormerod’s workshop on Missing data: Seeing a spider isn’t the problem. It becomes a
problem when it disappears.
▶ After further investigation (i.e. not from the data above), the overall voting rate for the
persons who refused to participate in the survey was 81%. Voting rates varied among the
different types of non-response:
▶ refusal (81%),
▶ not-at-home (65%),
▶ As discussed previously, the best way of dealing with non-response is to prevent it.
▶ Like other kinds of non-sampling error, there are many ways that non-response bias can be
minimised in the design stage of a survey.
▶ Do not scrimp on the survey design; every hour spent on design may save weeks of
remorse later.
▶ Statistical Models: Predict (or impute) missing values using statistical models.
▶ Example: We run a taste study for 20 different drinks. Each subject is asked to rate 4 drinks
chosen uniformly at random.
▶ Example: In a survey, lower income subjects were overall less likely to answer a question about
drug use than higher income subjects.
▶ Example: Recording time until subjects have an accident, but only following them for three
years (censoring).
▶ In the next few slides, we will define the type of missingness in mathematical terms.
▶ We can imagine that there is a probability ϕj that population unit j would respond if an
attempt was made to measure that unit, i.e.
ϕj = P(Rj = 1).
▶ We also have a set of auxiliary variables Xj, which are known for every unit in the sample
(whether or not that unit provided a response to Yj).
▶ Assume that we have drawn a sample of n units from the population of size N, and that nR
of these have responded.
▶ This is when the probability of response ϕj is the same for all units, no matter what their
value of Yj and Xj.
▶ This is the best situation: the respondents and non-respondents do not differ in any
important way, and it is as if the non-respondents had been selected at random from the
sample.
▶ To analyse data where we have MCAR, we can simply ignore the non-responding units, and
proceed as if we had selected a sample of nR units from the population.
▶ This occurs when the probability of response ϕj depends on known quantities: the auxiliary
covariates Xj, but not on the unknown Yj, i.e.
P(Rj = 1 | Yj, Xj) = P(Rj = 1 | Xj)
for all sampled units.
▶ This is a slightly more complex situation than MCAR, but we can still fully account for
non-response.
▶ This is the unfortunate situation where the probability of response ϕj depends on the
outcome of interest Yj (and possibly Xj as well).
▶ If the data are MAR, we can test whether they are MCAR by comparing non-response rates
in subgroups defined by X (or by logistic regression, which you will learn more about in
STAT6175 & STAT8111) and testing for any dependence on X; see the sketch after these bullets.
▶ However it may not be possible to distinguish whether the data are MAR or whether there is
non-ignorable response.
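▶ As a hedged sketch of that logistic-regression check (the data and variable names here are
hypothetical, not from this deck):

# `sample_frame`, `responded`, `age_group`, and `income` are hypothetical names.
# Regress the response indicator R on the auxiliary variables X; under MCAR,
# none of the X coefficients should differ significantly from zero.
resp_model <- glm(responded ~ age_group + income,
                  data = sample_frame,
                  family = binomial)
summary(resp_model)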
▶ Just like starting any analysis, we should always explore the data first.
▶ In this case, we should determine if there is missing data and what types of missing data are
present.
▶ Doing so is great for the analysis and reporting of survey data, and can inform the survey
design in future studies.
▶ Example: large amounts of unexpected missing data may indicate the questions were unclear
or difficult to recall.
▶ When assessing the missing data, we recommend using a tibble or [Link] object
and not a survey object as most of the analysis is about patterns of records and so weights
are not necessary.
▶ Here is the code to create the design objects for each to use throughout this chapter.
▶ For ANES, we need to adjust the weight so it sums to the population instead of the sample.
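▶ A sketch of what that code could look like for ANES, using srvyr (the design variables follow
the names used in this deck; the targetpop value is a placeholder assumption):

library(srvyr)

targetpop <- 231592693  # placeholder: assumed target population size

anes_des <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop) %>%  # weights sum to the population
  as_survey_design(weights = Weight,
                   strata = Stratum,
                   ids = VarUnit,
                   nest = TRUE)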
▶ We can also use the count() function to get an understanding of the different types of
missing data on the original variables.
▶ For example, let’s look at the count of data for V202072, which corresponds to our
VotedPres2020 variable.
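▶ A sketch of that call, pairing the raw and derived variables as described:

anes_2020 %>%
  count(VotedPres2020, V202072)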
▶ It can be challenging to look at tables for every variable, so let’s start visualising!
▶ We can use the vis_miss() function, available in both the naniar and visdat packages,
to view the amount of missing data by variable.
anes_2020_derived <- anes_2020 %>%
  select(!starts_with("V2"), -CaseID, -InterviewMode,
         -Weight, -Stratum, -VarUnit)

anes_2020_derived %>%
  vis_miss(cluster = TRUE, show_perc = FALSE) +
  scale_fill_manual(
    values = ggcolour,
    labels = c("Present", "Missing"),
    name = "")

[Figure: vis_miss plot — observations (rows) by the derived variables (Age, AgeGroup, Education, EarlyVote2020, PartyID, VotedPres2016, VotedPres2016_selection, VotedPres2020, VotedPres2020_selection, Income, Income7, TrustGovernment, TrustPeople, Gender, CampaignInterest, RaceEth), clustered by missingness pattern; fill = Present/Missing.]

▶ What do you see?
▶ Even if we did not have the informative variable names, we would be able to deduce that
VotedPres2020, VotedPres2020_selection, and EarlyVote2020 are likely related
since their missing data patterns are similar.
▶ Additionally, we can also look at VotedPres2016_selection and see that there is a lot of
missing data in that variable.
▶ Most likely this is due to a skip pattern, and we can look at further graphics to see how it
might be related to other variables.
▶ We can dive deeper with functions such as gg_miss_fct() which looks at missing data for
all variables by levels of another variable.
gg_miss_fct(anes_2020_derived, VotedPres2016) +
  scale_fill_gradientn(
    colors = ggcolour
  ) +
  ylab("Variable")

[Figure: gg_miss_fct output — percentage of missing data (0–100%) in each variable (Age, AgeGroup, CampaignInterest, EarlyVote2020, Gender, Income, Income7, PartyID, TrustPeople, VotedPres2016_selection, VotedPres2020) by levels of VotedPres2016; x-axis "Voted for President in 2016".]
▶ We can see that if they did not vote for president in 2016 or did not answer that question,
then they were not asked about who they voted for in 2016 (the percentage of missing data
is 100%).
▶ We can see with this graphic that there is more missing data across all questions if they did
not provide an answer to VotedPres2016.
Visualising Missing Data (cont.)
▶ UpSet plots are a data visualisation method for showing set data with more than three
intersecting sets.
▶ UpSet shows intersections in a matrix, and the size of the sets and of the intersections are
shown as bar charts.
▶ Here we can use it to illustrate the combinations of missingness across cases.
gg_miss_upset(anes_2020,
              nsets = 10,        # nsets = number of variables to consider;
                                 # n_var_miss(anes_2020) gives the number of variables with missing values
              nintersects = 40   # nintersects = number of intersections to plot; NA means plot everything
)
[Figure: UpSet plot of missingness combinations in anes_2020. Set sizes are shown for RaceEth_NA, Education_NA, Age_NA, AgeGroup_NA, Income_NA, Income7_NA, EarlyVote2020_NA, VotedPres2020_selection_NA, VotedPres2020_NA, and VotedPres2016_selection_NA; the two largest intersections contain 720 and 695 cases.]
Visualising Missing Data (cont.)
▶ There are other graphics that work well
with numeric data.
▶ For this part and the rest of the module, we will use a new dataset.
library(mlbench)
data(PimaIndiansDiabetes)
PimaIndiansDiabetes <- PimaIndiansDiabetes %>%
as_tibble()
▶ How can we obtain the histogram of each numerical variable? Hint: use geom_histogram() and
facet_wrap().
▶ Using summary()
PimaIndiansDiabetes %>%
summary()
PimaIndiansDiabetes %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ name, scales = "free")

[Figure: faceted histograms (count vs value) of each numeric variable, including glucose, insulin, pressure, and triceps; most distributions are right skewed.]
▶ In the previous slides, the summary or histogram revealed implicit missing values in
glucose, pressure, triceps, insulin, and mass (since they can’t be zero!).
▶ Most of the covariates are right skewed, so let's convert these to the log scale so that the
distributional assumptions of the models we fit later are closer to being satisfied.
▶ Notice that we will take a log(1+x) transform instead of just log(x). Why?
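▶ A minimal sketch of this recoding and transformation (the construction of dat is an
assumption, reverse-engineered from the columns printed later in this deck):

library(dplyr)

dat <- PimaIndiansDiabetes %>%
  # zeros in these covariates are implicit missing values, so recode them to NA
  mutate(across(c(glucose, pressure, triceps, insulin, mass),
                ~ na_if(.x, 0))) %>%
  # log1p is defined at 0 (e.g. pregnant = 0), unlike log
  mutate(log1p_pregnant = log1p(pregnant),
         log1p_glucose  = log1p(glucose),
         log1p_insulin  = log1p(insulin),
         log1p_mass     = log1p(mass),
         log1p_pedigree = log1p(pedigree),
         log1p_age      = log1p(age)) %>%
  select(pressure, triceps, diabetes, starts_with("log1p_"))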
▶ Homework If you have some time, create an UpSet plot for this dataset. Comment on the
missingness of the dataset.
Visualising Missing Data (cont.)
▶ The example printed below shows how the shadow matrix works.
▶ Most values of triceps are !NA in triceps_NA.
▶ Those that are NA in triceps are still NA in triceps_NA.
▶ We can then use these new variables to plot the missing data along side the actual data.
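▶ As a sketch, the shadow matrix can be created with naniar's bind_shadow():

library(naniar)

# bind_shadow() appends a shadow copy of each column, suffixed _NA,
# whose levels are !NA (observed) and NA (missing)
dat_shadow <- dat %>%
  bind_shadow()

dat_shadow %>%
  count(triceps_NA)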
Visualising Missing Data (cont.)
▶ Let’s plot a histogram of the (log-transformed) glucose readings grouped by those that are
missing and not missing by the (log-transformed) insulin reading.
ggplot(dat_shadow,
       aes(x = log1p_glucose,
           fill = log1p_insulin_NA)) +
  geom_histogram() +
  scale_fill_manual(
    values = ggcolour,
    labels = c("Present", "Missing"),
    name = "Insulin reading") +
  ggtitle("Histogram of log-glucose by insulin missing data")

[Figure: the resulting histogram — counts of log1p_glucose, filled by whether the insulin reading is Present or Missing.]
▶ Here is another example plot based on the recs data (code can be found at the back of
this slide deck).

[Figure: "Histogram of Energy Cost by Heating Behavior Missing Data" — Number of Households vs Total Energy Cost, filled by Heating Behavior Present/Missing.]

▶ This plot indicates that respondents who did not provide a response for the heating
behaviour question may have a different distribution of total energy cost compared to
respondents who did provide a response.
▶ This view of the raw data and missingness could indicate some bias in the data.
▶ Normally it is hard to explore the relationship between two variables when values are
missing, as geom_point() will simply drop any rows containing missing data.
# using regular geom_point()
ggplot(dat_shadow,
       aes(x = log1p_insulin,
           y = log1p_glucose)) +
  geom_point()

## Warning: Removed 375 rows containing missing values (`geom_point()`).

[Figure: scatterplot of log1p_glucose vs log1p_insulin; rows with missing values are dropped.]
▶ One approach is to shift the missing values to just below the range of the observed data, so
they can be seen alongside the regular values on the graph, and to mark them with a
different colour. This is what geom_miss_point() does.
ggplot(dat_shadow,
       aes(x = log1p_insulin,
           y = log1p_glucose)) +
  geom_miss_point()

[Figure: scatterplot of log1p_glucose vs log1p_insulin with missing values drawn below the observed range and coloured Missing / Not Missing.]
Statistical tests and Modelling of missingness
▶ Before we move on to discuss imputation, we will finish this section by discussing a test of
missingness as well as a way to model missingness.
▶ naniar provides mcar_test() for Little (1988)’s statistical test for missing completely at
random (MCAR) data.
▶ The null hypothesis in this test is that the data is MCAR, and the test statistic is a
chi-squared value.
mcar_test(dat)
## # A tibble: 1 x 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 175. 69 3.17e-11 11
▶ Given the large test statistic and small p-value, we reject the null hypothesis and conclude
that the PimaIndiansDiabetes data are not missing completely at random.
Statistical tests and Modelling of missingness (cont.)
▶ One question most of you would like answered is: can we model missingness?
▶ Sometimes it can be impractical to explore all of the variables that have missing data.
▶ One approach, however, is to model missing data using methods from N. J. Tierney et al.
(2015).
▶ Here, the approach is to predict the proportion of missingness in a given case, using all
variables.
▶ There is a little helper function, add_prop_miss(), that adds a column with the proportion
of values missing in each case or row. This creates a column named prop_miss_all, which is
the proportion of missing values in that row.
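▶ A sketch of the call that could produce the rows printed below (using the dat tibble from
earlier):

dat %>%
  add_prop_miss()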
## # A tibble: 768 x 10
## pressure triceps diabetes log1p_pregnant log1p_glucose log1p_insulin
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 72 35 pos 1.95 5.00 NA
## 2 66 29 neg 0.693 4.45 NA
## 3 64 NA pos 2.20 5.21 NA
## 4 66 23 neg 0.693 4.50 4.55
## 5 40 35 pos 0 4.93 5.13
## log1p_mass log1p_pedigree log1p_age prop_miss_all
## <dbl> <dbl> <dbl> <dbl>
## 1 3.54 0.487 3.93 0.111
## 2 3.32 0.301 3.47 0.111
## 3 3.19 0.514 3.50 0.222
## 4 3.37 0.154 3.09 0
## 5 3.79 1.19 3.53 0
## # i 763 more rows
Statistical tests and Modelling of missingness (cont.)
▶ We can then use a model like decision trees to predict which variables and their values are
important for predicting the proportion of missingness:
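▶ A sketch of how such a tree could be fitted, following the naniar vignette workflow (the
rpart/rpart.plot calls and plot settings are assumptions):

library(rpart)
library(rpart.plot)

dat %>%
  add_prop_miss() %>%
  # regression tree predicting the proportion of missingness in each row
  rpart(prop_miss_all ~ ., data = .) %>%
  prp(type = 4, extra = 101, prefix = "Prop. Miss = ")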
[Figure: the fitted decision tree. Root node: Prop. Miss = 0.094, n = 768, 100%; terminal nodes range from Prop. Miss = 0.069 to Prop. Miss = 0.25.]
▶ Here we can see that this can produce quite a complex tree.
▶ This can be pruned back and the depth of the decision tree can be controlled.
▶ For more details on the workflow and theory underpinning naniar, read the vignette
Getting started with naniar. Click here.
▶ For a short primer on the data visualisation available in naniar, read the vignette Gallery of
Missing Data Visualisations. Click here.
▶ The simplest method for handling missing data is to eliminate rows or columns until a
dataset is entirely filled.
▶ The most common method is complete case analysis (also known as list-wise deletion), where
only data with complete information are considered.
▶ Another common method is available case analysis (also called pairwise deletion), where
each part of the analysis uses all of the cases that have complete data on the variables
involved in that part, rather than discarding every case with any missing data.
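▶ A minimal sketch of complete case deletion (res_glm1 is presumably the complete-case fit
reported in the table below; glm() applies list-wise deletion by default):

library(tidyr)

# keep only fully observed rows
dat_cc <- dat %>%
  drop_na()

# glm() drops incomplete rows implicitly via na.action = na.omit
res_glm1 <- glm(diabetes ~ ., data = dat, family = binomial)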
library(gtsummary)
tbl_merge(
map(list(res_glm1, res_glm2, res_glm3, res_glm4),
~ tbl_regression(.x) %>%
add_glance_table(include = c(AIC, nobs))),
tab_spanner = c("**res_glm1**", "**res_glm2**", "**res_glm3**", "**res_glm4**")) %>%
as_kable_extra(format = "latex", booktabs = TRUE)
Characteristic  | res_glm1               | res_glm2               | res_glm3                | res_glm4
                | log(OR)  95% CI     p  | log(OR)  95% CI     p  | log(OR)  95% CI      p  | log(OR)  95% CI      p
pressure        | 0.00  -0.03, 0.02  0.8 | 0.00  -0.03, 0.02  0.8 | -0.01  -0.03, 0.01  0.2 |
triceps         | 0.01  -0.03, 0.04  0.8 |                        |                         |
log1p_pregnant  | 0.09  -0.38, 0.56  0.7 | 0.09  -0.39, 0.56  0.7 | 0.37  0.07, 0.67  0.017 | 0.35  0.06, 0.64  0.018
log1p_glucose   | 4.5   3.1, 6.1  <0.001 | 4.6   3.4, 6.0  <0.001 | 4.5   3.6, 5.4  <0.001  | 4.5   3.7, 5.5  <0.001
log1p_insulin   | 0.06  -0.47, 0.58  0.8 |                        |                         |
log1p_mass      | 2.7   0.79, 4.8  0.007 | 3.0   1.4, 4.5  <0.001 | 3.5   2.4, 4.6  <0.001  | 3.3   2.3, 4.3  <0.001
log1p_pedigree  | 1.9   0.57, 3.3  0.006 | 2.0   0.61, 3.4  0.005 | 1.7   0.74, 2.7  <0.001 | 1.6   0.67, 2.6  <0.001
log1p_age       | 2.1   0.76, 3.4  0.002 | 2.1   0.80, 3.4  0.002 | 1.2   0.40, 1.9  0.003  | 0.92  0.21, 1.6  0.011
▶ The next step is to fill in the missing values with realistic values.
▶ Suppose we fill in all missing values for a particular covariate with the mean of the values in
that covariate then the variance of that covariate becomes artificially smaller.
▶ This means inferences based on these methods will be, in general, unreliable.
▶ However, if the goal is prediction then single imputation methods can still be reasonable.
▶ When imputing, keep in mind that some variables are restricted to be positive or bounded.
▶ We will consider a number of methods now that replace all missing values with a single
value.
▶ Mean/median substitution.
▶ Regression imputation:
  ▶ Linear regression models.
  ▶ Robust regression models.
  ▶ Penalized regression.
▶ Machine learning imputation:
  ▶ K-nearest neighbours.
  ▶ CART imputation.
  ▶ Random forest imputation.
▶ There is also hot-deck (even cold-deck) imputation (which we will not cover here) and
expectation maximisation.
# Impute all missing values by their mean value (an intercept-only
# linear model), then add normal noise
dat_imp3 <- dat %>%
  impute_lm(log1p_glucose ~ 1, add_residual = "normal") %>%
  impute_lm(pressure ~ 1, add_residual = "normal") %>%
  impute_lm(triceps ~ 1, add_residual = "normal") %>%
  impute_lm(log1p_insulin ~ 1, add_residual = "normal") %>%
  impute_lm(log1p_mass ~ 1, add_residual = "normal")

res_glm7 <- glm(diabetes ~ .,
                data = dat_imp3,
                family = binomial)

# Impute all missing values by their median value,
# then add observed residuals to the result
dat_imp4 <- dat %>%
  impute_median(log1p_glucose ~ 1, add_residual = "observed") %>%
  impute_median(pressure ~ 1, add_residual = "observed") %>%
  impute_median(triceps ~ 1, add_residual = "observed") %>%
  impute_median(log1p_insulin ~ 1, add_residual = "observed") %>%
  impute_median(log1p_mass ~ 1, add_residual = "observed")

res_glm8 <- glm(diabetes ~ .,
                data = dat_imp4,
                family = binomial)
Characteristic  | res_glm1                | res_glm7                | res_glm8
                | log(OR)  95% CI      p  | log(OR)  95% CI      p  | log(OR)  95% CI      p
pressure        | 0.00  -0.03, 0.02  0.8  | -0.01  -0.03, 0.01  0.2 | -0.01  -0.03, 0.01  0.2
triceps         | 0.01  -0.03, 0.04  0.8  | 0.00  -0.03, 0.03  >0.9 | 0.00  -0.03, 0.03  >0.9
log1p_pregnant  | 0.09  -0.38, 0.56  0.7  | 0.38  0.10, 0.67  0.010 | 0.38  0.10, 0.67  0.010
log1p_glucose   | 4.5   3.1, 6.1  <0.001  | 4.6   3.7, 5.6  <0.001  | 4.6   3.7, 5.6  <0.001
log1p_insulin   | 0.06  -0.47, 0.58  0.8  | -0.02  -0.47, 0.43 >0.9 | -0.01  -0.46, 0.44 >0.9
log1p_mass      | 2.7   0.79, 4.8  0.007  | 3.6   2.4, 4.8  <0.001  | 3.6   2.4, 4.8  <0.001
log1p_pedigree  | 1.9   0.57, 3.3  0.006  | 1.5   0.59, 2.5  0.002  | 1.5   0.59, 2.5  0.002
log1p_age       | 2.1   0.76, 3.4  0.002  | 1.0   0.30, 1.8  0.006  | 1.0   0.31, 1.8  0.005
[Figure: scatterplot panels of the imputed datasets — log1p_insulin and triceps plotted against other covariates, with imputed values highlighted (log1p_insulin_NA: !NA vs NA; ind_na: TRUE = imputed, FALSE = observed).]
▶ The process to do this is really, really complicated since there are two sources of variation:
1. Variation from estimation of model parameters (including any in the model for the covariates);
2. Variation from the random draws used to impute the missing values themselves.
▶ Many methods iterate between addressing these to form a Markov chain where the imputed
values mimic draws from the right (posterior) distribution.
▶ Usually a small number of completed datasets are needed, say m = 5 is often recommended,
with 20 iterations between estimating model parameters and imputing missing values.
▶ Symbolically, we impute a matrix X(mis) that contains missing values, and mice outputs m
complete datasets (containing no missing values):
X(1), . . . , X(m).
▶ Suppose that we want to test that θj = θj0 for a particular j. Rubin (1987) showed that

(θ̂j − θj0) / √(Var̂(θ̂)jj) ∼ tν (approximately),

where

ν = (m − 1) [1 + Wjj / ((1 + m⁻¹) Bjj)]²,

with Wjj the within-imputation variance and Bjj the between-imputation variance of θ̂j.
▶ The statistic

rj = (1 + m⁻¹) Bjj / Wjj

can be interpreted as the relative increase in variance due to missing data.
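▶ A sketch of the standard mice workflow that applies these pooling rules (the m, maxit, seed
values and the model formula are assumptions):

library(mice)

imp  <- mice(dat, m = 5, maxit = 20, seed = 1, printFlag = FALSE)
fits <- with(imp, glm(diabetes ~ log1p_glucose + log1p_mass + log1p_age,
                      family = binomial))
pool(fits)  # combines the m fits via Rubin's rules; the pooled output
            # includes riv, the relative increase in variance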
▶ The dataset airquality has two variables with missing values Ozone and Solar.R
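▶ A sketch of calls that could produce the printed values below, which appear to be the m = 5
imputed values for some of the missing Ozone entries (the arguments are assumptions):

library(mice)

imp_air <- mice(airquality, m = 5, seed = 1, printFlag = FALSE)
imp_air$imp$Ozone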
## 1 2 3 4 5
## 5 14 28 6 6 6
## 10 23 27 21 13 41
## 25 18 19 19 18 8
## 26 7 1 19 7 13
## 27 12 1 18 21 9
## 32 39 35 46 47 28
▶ Other R packages that implement imputation include:
  ▶ missForest
  ▶ Hmisc
  ▶ mi
# Code for the earlier "Histogram of Energy Cost by Heating Behavior Missing Data" plot
data(recs_2020)
recs_2020_shadow <- recs_2020 %>%
bind_shadow()
recs_2020_shadow %>%
filter(TOTALDOL < 5000) %>%
ggplot(aes(x = TOTALDOL, fill = HeatingBehavior_NA)) +
geom_histogram() +
scale_fill_manual(values = ggcolour,
labels = c("Present", "Missing"),
name = "Heating Behavior") +
xlab("Total Energy Cost (Truncated at $5000)") +
ylab("Number of Households") +
labs(title = "Histogram of Energy Cost by Heating Behavior Missing Data")
▶ See Sections 8.5 & 8.6 of Sharon L. Lohr (2021) to learn how to adjust sampling weights for
non-response.
▶ Go to this link here to learn more about how to recode missing data and also account for
skip patterns in your survey.
▶ See Section 8.7 of Sharon L. Lohr (2021) and Section 11 of Nguyen (2020), which you can
access here, to learn more about imputation.
Little, Roderick J. A. 1988. “A Test of Missing Completely at Random for Multivariate Data with Missing Values.”
Journal of the American Statistical Association 83 (404): 1198–1202. [Link]
Loo, Mark van der. 2022. “Simputation: Simple Imputation.” [Link]
Nguyen, Mike. 2020. A Guide on Data Analysis. Bookdown. [Link]
Rubin, Donald B. 1987. “Multiple Imputation for Nonresponse in Surveys.” In Multiple Imputation for Nonresponse
in Surveys, i–xxix. John Wiley & Sons, Ltd. [Link]
Sharon L. Lohr. 2021. Sampling: Design and Analysis. Chapman & Hall/CRC Texts in Statistical Science.
Boca Raton: Chapman and Hall/CRC. [Link]
Thomsen, Ib, and Erling Siring. 1980. “On the Causes and Effects of Non-response: Norwegian Experiences.”
Statistisk Sentralbyrå, no. 121: 52.
Tierney, Nicholas J, Fiona A Harden, Maurice J Harden, and Kerrie L Mengersen. 2015. “Using Decision Trees to
Understand Structure in Missing Data.” BMJ Open 5 (6): e007450. [Link]
Tierney, Nicholas, Sean Hughes, Mara Averick, Stuart Lee, Earo Wang, Nic Crane, and Christophe Regouby. 2023.
“Visdat: Preliminary Visualisation of Data.” [Link]
Zimmer, Stephanie, Rebecca J. Powell, and Isabella Velásquez. n.d. Exploring Complex Survey Data Analysis in R.
Accessed February 26, 2024. [Link]