Week 10: Non-Response and Missing Data
STAT8101: Sampling Design and Analysis

Non-response & Missing Data

Acknowledgement

▶ The content of this module is heavily inspired by the following books and workshop
materials:
▶ John Ormerod’s workshop on Missing data: Seeing a spider isn’t the problem. It becomes a
problem when it disappears.

▶ Nick Tierney’s vignette on the naniar package.

▶ Upcoming work of Zimmer, Powell, and Velásquez (n.d.).

SCHOOL OF MATHEMATICAL & PHYSICAL SCIENCES 2


Road Map

▶ What is non-response & missing data?


1. Missing Completely at Random (MCAR)

2. Missing at Random (MAR)

3. Non-ignorable Non-response or Missing not at random (MNAR).

▶ Explore and Visualise Missing data with naniar.

▶ Impute Missing data



Non-Response & Missing Data

▶ Non-response can lead to flawed inferences due to differences between respondents and
non-respondents.

▶ While it’s sometimes possible to construct models to predict the missing data, it is never as
good as observing them in the first place.

▶ Prevention is key: it’s best to prevent non-response whenever possible.

Figure 1: From [Link]



Example: Norwegian voting rate

▶ Thomsen and Siring (1983) studied voting behaviour in Norway.

▶ The survey included three calls followed by a mail survey, with a final response rate of
90.1%, which is quite high.

▶ The Norwegian voting register was utilised to track voting participation.

Table 1: Percentage of persons who voted, from Thomsen and Siring (1983).

                              Age group
                   All  20-24  25-29  30-49  50-69  70-79
  Non-respondents   71     59     56     72     78     74
  Selected sample   88     81     84     90     91     84

▶ Did the non-respondents differ from the respondents?


▶ What do you think?
▶ If there is no difference, what should the data look like?

▶ If there is a difference, where are they in particular?



Example: Norwegian voting rate (cont.)

▶ Results showed significant differences in voting rates between respondents and
non-respondents, especially among younger age groups.

▶ After further investigation (i.e. not from the data above), the overall voting rate for the
persons who refused to participate in the survey was 81%. Voting rates varied among
different types of non-response:
▶ refusal (81%),

▶ not-at-home (65%),

▶ mentally and physically ill (55%).

▶ Absence or illness identified as primary causes of non-response bias.



Types of Non-response

▶ Unit Non-response: the entire observation unit is missing.
  ▶ Example: a household can’t be contacted, or refuses to participate.

▶ Item Non-response: some measurements within the observation unit are missing.
  ▶ Example: skipping a question or not completing the survey.

Figure 2: Missingness. My dog ate my homework. From Reddit

Non-response can also be considered as a case of missing data.



Preventing Non-response

▶ As discussed previously, the best way of dealing with non-response is to prevent it.

▶ Like other kinds of non-sampling error, there are many ways that non-response bias can be
minimised in the design stage of a survey.

▶ If we paraphrase a quote from Ronald Fisher:


To call in the statistician after the survey is done may be no more than asking him to
perform a post-mortem examination: he may be able to say what the survey died of.

▶ Do not scrimp on the survey design; every hour spent on design may save weeks of
remorse later.



Preventing Non-response

▶ Non-response can be reduced by:


▶ Public education and publicity before a survey;
▶ Individual pre-notification letters sent to respondents prior to the survey telling them to expect
an interviewer to call;
▶ Providing incentives or compulsion for response;
▶ Providing clear information about the uses of the data and the confidentiality of the data;
▶ Interviewing at an appropriate time (of the day or of the year) to avoid inconvenient times;
▶ Well trained interviewers/data collectors;
▶ Data collection method: personal interviews may achieve a better response rate than
post-back questionnaires;
▶ Proxy responses: it may be appropriate to allow one person to respond on behalf of another;
▶ Questionnaire design: well designed questionnaires which look short, easy and clear may
achieve better response rates than large complex ones. Poorly worded questions may lead to
item non-response, or even to irritating a respondent into a total refusal;
▶ Call backs at different times to turn non-contacts into responses.



Dealing with Non-response

▶ Other than prevention, what can we do to deal with non-response?


▶ Sub-sampling Non-respondents: Use a representative sub-sample for inference.

▶ Statistical Models: Predict (or impute) missing values using statistical models.

▶ Adjustment Methods: Adjust sample weights.

▶ Ignoring Non-response: not recommended; it can lead to biased results.

▶ Which strategy to deploy depends on the type of non-response.



Type of Non-Response or Missingness

▶ There are then three types of missingness:


1. Missing Completely at Random (MCAR)

2. Missing at Random (MAR)

3. Non-ignorable Non-response or Missing not at random (MNAR).

▶ Missing completely at random (MCAR)


▶ The pattern of missingness is independent of the missing values and of the values of any
measured variables.

▶ Example: we run a taste study for 20 different drinks. Each subject is asked to rate 4 drinks
chosen uniformly at random.



Type of Non-Response or Missingness (cont.)

▶ Missing at random (MAR)


▶ Conditional on observed variables, missingness is independent of the missing value.

▶ Example: In a survey, lower income subjects were overall less likely to answer a question about
drug use than higher income subjects.

▶ Related to observed predictors (income) but not to drug use.

▶ Non-ignorable Non-response or Missing not at random


▶ The pattern of missingness is related to the missing value, even after correcting for measured variables.

▶ Example: High earners less likely to report their income.

▶ Example: Recording time until subjects have an accident but only follow for three years
(censoring)



Type of Non-Response or Missingness (cont.)

▶ In the next few slides, we will define the type of missingness in mathematical terms.

▶ Each unit in the sample may or may not respond.

▶ Suppose we use an indicator variable to represent this:

    Rj = 1, if unit j responds
    Rj = 0, if unit j does not respond

▶ We can imagine that there is a probability ϕj that population unit j would respond if an
attempt was made to measure that unit, i.e. ϕj = P(Rj = 1).

▶ ϕj is also called the response propensity or propensity score.
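To make the three mechanisms concrete, here is a minimal simulated sketch (not part of the module's datasets; all names are hypothetical) in which the propensity ϕj is constant (MCAR), depends on an observed Xj (MAR), or depends on Yj itself (MNAR):

```r
# Simulated illustration of MCAR / MAR / MNAR response propensities.
set.seed(1)
N <- 10000
x <- rnorm(N)            # auxiliary variable X_j, always observed
y <- x + rnorm(N)        # outcome Y_j, subject to non-response

phi_mcar <- rep(0.8, N)        # MCAR: same propensity for every unit
phi_mar  <- plogis(1 + x)      # MAR: propensity depends on X_j only
phi_mnar <- plogis(1 + y)      # MNAR: propensity depends on Y_j itself

R_mcar <- rbinom(N, 1, phi_mcar)   # R_j = 1 if unit j responds
R_mnar <- rbinom(N, 1, phi_mnar)

# Under MCAR the respondent mean of y is close to the full-sample mean;
# under MNAR respondents over-represent large y, so the mean is biased.
c(full = mean(y),
  mcar = mean(y[R_mcar == 1]),
  mnar = mean(y[R_mnar == 1]))
```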



Type of Non-Response or Missingness (cont.)

▶ We are interested in the values of the outcome variable Yj on each sample unit.

▶ We also have a set of auxiliary variables Xj, which are known for every unit in the sample
(whether or not that unit provided a response to Yj).

▶ The variables Xj include the design variables from the frame.

▶ Assume that we have drawn a sample of n units from the population of size N, and that nR
of these have responded.



Missing Completely at Random (MCAR)

▶ This is when the probability of response ϕj is the same for all units, no matter what their
values of Yj and Xj.

▶ This is the best situation: the respondents and non-respondents do not differ in any
important way, and it is as if the non-respondents had been selected at random from the
sample.

▶ To analyse data where we have MCAR, we can simply ignore the non-responding units, and
proceed as if we had selected a sample of nR units from the population.



Missing at Random (MAR)

▶ This is also known as missing at random given covariates or, alternatively, ignorable
non-response.

▶ This occurs when the probability of response ϕj depends on known quantities, namely the
auxiliary covariates Xj, but not on the unknown Yj, i.e.

    P(Rj = 1 | Yj, Xj) = P(Rj = 1 | Xj),

for all sampled units.

▶ We therefore assume that if two units have the same values of Xj, then their likelihood of
response is the same.

▶ This is a slightly more complex situation than MCAR, but we can still fully account for
non-response.
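One standard way to account for MAR non-response is to estimate ϕj from Xj with a logistic regression and weight respondents by 1/ϕ̂j. A hedged sketch on simulated data (hypothetical names, not the module's datasets):

```r
# Inverse-propensity weighting under MAR (simulated illustration).
set.seed(2)
n <- 5000
x <- rnorm(n)
y <- 2 + x + rnorm(n)
R <- rbinom(n, 1, plogis(1 + 1.5 * x))  # response depends on x only: MAR

fit <- glm(R ~ x, family = binomial)    # estimate response propensities
phi_hat <- fitted(fit)

c(naive    = mean(y[R == 1]),           # biased towards high-x respondents
  weighted = weighted.mean(y[R == 1], w = 1 / phi_hat[R == 1]),
  truth    = mean(y))
```

The weighted estimate recovers (approximately) the full-sample mean, which is the idea behind the weight-adjustment methods mentioned earlier.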



Non-ignorable Non-response

▶ This is the unfortunate situation where the probability of response ϕj depends on the
outcome of interest Yj (and possibly Xj as well).

▶ MAR and MCAR are strong assumptions.

▶ If the data are MAR, we can test whether they are MCAR by comparing non-response rates
in subgroups defined by X (or by logistic regression, which you will learn more about in
STAT6175 & STAT8111) and testing for any dependence on X.

▶ However, it may not be possible to distinguish whether the data are MAR or whether there is
non-ignorable non-response.

▶ We of course exclude from consideration any data which is missing by design.


▶ These are data items which are missing because the respondent is not asked that particular
question (e.g. we do not ask for the voting record of children.)
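The subgroup comparison of non-response rates mentioned above can be sketched as follows (simulated data; group names are hypothetical):

```r
# Compare non-response rates across subgroups of an observed X.
# A significant difference is evidence against MCAR (the data may still be MAR).
set.seed(3)
grp  <- sample(c("younger", "older"), 2000, replace = TRUE)
resp <- rbinom(2000, 1, ifelse(grp == "younger", 0.70, 0.90))  # response indicator R_j

prop.test(table(grp, resp))   # tests equality of response rates across groups
```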



Explore & Visualise Missing Data

▶ Just like starting any analysis, we should always explore the data first.

▶ In this case, we should determine if there is missing data and what types of missing data are
present.
▶ Great for analysis and reporting of survey data and can inform the survey design in future
studies.

▶ Example: large amounts of unexpected missing data may indicate the questions were unclear
or difficult to recall.

▶ When assessing the missing data, we recommend using a tibble or data.frame object
and not a survey object, as most of the analysis is about patterns of records and so weights
are not necessary.



Setup: Packages

▶ In this section, we will introduce a new package called naniar.
  ▶ naniar builds on the visdat package of N. Tierney et al. (2023), which
  provides preliminary exploratory visualisation of data, to focus on
  exploring missing data.

# pak::pak(c("naniar", "mlbench", "haven")) # install the packages


library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(visdat)
library(naniar)
library(haven)
library(mlbench)
data(PimaIndiansDiabetes)
Setup: ANES Data
▶ We will be using data from ANES that we introduced during the tidy analysis of survey data
module.
▶ We will introduce a new dataset a bit later.

▶ Here is the code to create the design objects to use throughout this chapter.
▶ For ANES, we need to adjust the weight so it sums to the population instead of the sample.

targetpop <- 231592693

data(anes_2020)
anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )
Summarise data
▶ Let’s look at a few analytic variables on the ANES 2020 data using summary(), which will
illuminate NA values in the data:

anes_2020 %>%
  select(V202051:EarlyVote2020) %>%
  summary()

##     V202051           V202066          V202072        VotedPres2020
##  Min.   :-9.0000   Min.   :-9.000   Min.   :-9.0000   Yes :5952
##  1st Qu.:-1.0000   1st Qu.: 4.000   1st Qu.: 1.0000   No  :  77
##  Median :-1.0000   Median : 4.000   Median : 1.0000   NA's:1424
##  Mean   :-0.7259   Mean   : 3.402   Mean   : 0.6234
##  3rd Qu.:-1.0000   3rd Qu.: 4.000   3rd Qu.: 1.0000
##  Max.   : 3.0000   Max.   : 4.000   Max.   : 2.0000
##     V202073          V202109x          V202110x       VotedPres2020_selection
##  Min.   :-9.0000   Min.   :-2.0000   Min.   :-9.0000   Biden:3509
##  1st Qu.: 1.0000   1st Qu.: 1.0000   1st Qu.: 1.0000   Trump:2567
##  Median : 1.0000   Median : 1.0000   Median : 1.0000   Other: 158
##  Mean   : 0.9415   Mean   : 0.8578   Mean   : 0.9902   NA's :1219
##  3rd Qu.: 2.0000   3rd Qu.: 1.0000   3rd Qu.: 2.0000
##  Max.   :12.0000   Max.   : 1.0000   Max.   : 5.0000
##  EarlyVote2020
##  Yes : 371
##  No  :5949
##  NA's:1133

▶ What do you see?

▶ Notice that we can only include so much output on a slide. You can see more if you run the
code in your console.
Summarise data (cont.)
▶ We see that there are
▶ NA values in several of the derived variables (those not beginning with “V”) and

▶ negative values in the original variables (those beginning with “V”).

▶ We can also use the count() function to get an understanding of the different types of
missing data on the original variables.

▶ For example, let’s look at the count of data for V202072, which corresponds to our
VotedPres2020 variable.

anes_2020 %>%
  count(VotedPres2020, V202072)

## # A tibble: 5 x 3
##   VotedPres2020 V202072                                  n
##   <fct>         <dbl+lbl>                            <int>
## 1 Yes            1 [1. Yes, voted for President]      5952
## 2 No             2 [2. No, didn't vote for President]   77
## 3 <NA>          -9 [-9. Refused]                         2
## 4 <NA>          -6 [-6. No post-election interview]      4
## 5 <NA>          -1 [-1. Inapplicable]                 1418

▶ What do you see here?
Visualising Missing Data

▶ It can be challenging to look at tables for every variable, so let’s start visualising!

▶ We can use the vis_miss() function, available in both the naniar and visdat packages,
to view the amount of missing data by variable.

anes_2020_derived <- anes_2020 %>%
  select(!starts_with("V2"), -CaseID, -InterviewMode, -Weight,
         -Stratum, -VarUnit)

ggcolour <- scales::hue_pal()(3)[c(1, 3)]

anes_2020_derived %>%
  vis_miss(cluster = TRUE, show_perc = FALSE) +
  scale_fill_manual(
    values = ggcolour,
    labels = c("Present", "Missing"),
    name = "")

[Figure: vis_miss() plot showing present and missing observations for each derived ANES 2020 variable, with cases clustered by missingness pattern.]

▶ What do you see?


Visualising Missing Data (cont.)

▶ Even if we did not have the informative variable names, we would be able to deduce that
VotedPres2020, VotedPres2020_selection, and EarlyVote2020 are likely related
since their missing data patterns are similar.

▶ Additionally, we can look at VotedPres2016_selection and see that there is a lot of
missing data in that variable.

▶ Most likely this is due to a skip pattern, and we can look at further graphics to see how it
might be related to other variables.

▶ We can dive deeper with functions such as gg_miss_fct() which looks at missing data for
all variables by levels of another variable.



Visualising Missing Data (cont.)

anes_2020_derived %>%
  gg_miss_fct(VotedPres2016) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "% Miss",
    colors = ggcolour
  ) +
  ylab("Variable") +
  xlab("Voted for President in 2016")

[Figure: gg_miss_fct() heatmap of the percentage of missing data in each variable, by levels of VotedPres2016.]

▶ We can see that if they did not vote for president in 2016 or did not answer that question,
then they were not asked about who they voted for in 2016 (the percentage of missing data
is 100%).

▶ We can see with this graphic that there is more missing data across all questions if they did
not provide an answer to VotedPres2016.
Visualising Missing Data (cont.)

▶ UpSet plots are a data visualisation method for showing set data with more than three
intersecting sets.
▶ UpSet shows intersections in a matrix, and the size of the sets and of the intersections are
shown as bar charts.
▶ Here we can use it to illustrate the combinations of missingness across cases.

gg_miss_upset(anes_2020,
              nsets = 10,       # number of variables to consider;
                                # n_var_miss(anes_2020) gives the number
                                # of variables with missing values
              nintersects = 40) # number of intersections to plot;
                                # NA means plot everything



Visualising Missing Data (cont.)
▶ So what do you see?
[Figure: UpSet plot of missingness for ten ANES 2020 variables, showing the size of each variable’s missingness set and the sizes of their intersections; the three largest intersections contain 720, 695, and 271 cases.]
Visualising Missing Data (cont.)
▶ There are other graphics that work well with numeric data.

▶ For example, in the RECS 2020 data we can plot two continuous variables and the missing
data associated with them to see if there are any patterns to the missingness.

▶ To do this, we can use naniar::bind_shadow().

▶ This creates a nabular (a combination of na with tabular), which features the original
columns followed by the same number of columns with a specific NA format.

▶ These NA columns are indicators of whether the value in the original data is missing or not.

Figure 3: Illustration of data structures for facilitating visualisation of missings and not missings
Setup: PimaIndiansDiabetes Data

▶ For this part and the rest of the module, we will use a new dataset.

▶ PimaIndiansDiabetes: the Pima-Indians diabetes (PID) dataset is an immensely popular


dataset originally collected by the National Institute of Diabetes and Digestive and Kidney
Diseases.
▶ This dataset is part of the mlbench package

▶ Some variables in the dataset


▶ pregnant: Number of times pregnant (count);
▶ glucose: Plasma glucose concentration (glucose tolerance test) (continuous);
▶ pressure: Diastolic blood pressure (mm Hg) (continuous);
▶ triceps: Triceps skin fold thickness (mm) (continuous);
▶ insulin: 2-Hour serum insulin (mu U/ml) (continuous);
▶ mass: Body mass index (weight in kg / (height in m)^2) (continuous);
▶ pedigree: Diabetes pedigree function (continuous values);
▶ age: Age (years) (count); and
▶ diabetes: a class variable (test for diabetes) with values “pos” and “neg” (binary).
Setup: PimaIndiansDiabetes Data (cont.)

▶ Our code should start with something like this:

library(mlbench)
data(PimaIndiansDiabetes)
PimaIndiansDiabetes <- PimaIndiansDiabetes %>%
as_tibble()

▶ Can you then have a go and


▶ obtain the summary() of each variable? Hint: See slide 22.

▶ obtain the histogram of each numerical variable? Hint: use geom_histogram() and
facet_wrap().

▶ Comment on anything unusual with your results.



Setup: PimaIndiansDiabetes Data (cont.)

▶ Using summary()

PimaIndiansDiabetes %>%
summary()

## pregnant glucose pressure triceps


## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin mass pedigree age diabetes
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00 neg:500
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00 pos:268
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00



Setup: PimaIndiansDiabetes Data (cont.)

▶ Histogram with facet_wrap()

PimaIndiansDiabetes %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ name, scales = "free")

[Figure: Faceted histograms of each numeric variable (age, glucose, insulin, mass, pedigree, pregnant, pressure, triceps), each on its own scale.]


Setup: PimaIndiansDiabetes Data (cont.)

▶ In the previous slides, the summary or histogram revealed implicit missing values in
glucose, pressure, triceps, insulin, and mass (since they can’t be zero!).

▶ So we convert these explicitly to missing values.

# Replace implicit missing values with NAs


# Use across() to apply the same operation to multiple variables
# Use ifelse() to do the conversion
dat <- PimaIndiansDiabetes %>%
  mutate(across(c("glucose", "pressure", "triceps", "insulin", "mass"),
                ~ ifelse(.x == 0, NA, .x)))



Setup: PimaIndiansDiabetes Data (cont.)
▶ Later some of the methods we will look at will assume that the missing covariates are
normally distributed.

▶ Most of the covariates are right skewed, so let’s convert these to log-scale so the
assumption is closer to being satisfied.

▶ Notice that we will take a log(1+x) transform instead of just log(x). Why?

# a list of variables that we would like to log-transform
logTransVars <- c("pregnant","glucose","insulin","mass","pedigree","age")

dat <- dat %>%
  mutate(across(all_of(logTransVars), ~ log(1 + .x),
                .names = "log1p_{.col}")) %>%
  select(-all_of(logTransVars))
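As to the "why" above: some of these variables (pregnant, for example) contain genuine zeros, and log(0) is -Inf, so a log(1 + x) transform keeps zeros finite. A quick check:

```r
# Why log(1 + x) rather than log(x)? Some variables (e.g. pregnant)
# contain genuine zeros, and log(0) is -Inf.
log(0)      # -Inf
log1p(0)    # 0, identical to log(1 + 0)
```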

▶ Homework If you have some time, create an UpSet plot for this dataset. Comment on the
missingness of the dataset.
Visualising Missing Data (cont.)

▶ We can now get back to creating a shadow matrix.

▶ The example printed below shows how the shadow matrix works.
▶ Values that are observed in triceps are marked !NA in triceps_NA.
▶ Values that are NA in triceps are marked NA in triceps_NA.

dat_shadow <- dat %>%
  bind_shadow()
c(ncol(dat), ncol(dat_shadow))

## [1]  9 18

dat_shadow %>%
  count(triceps, triceps_NA) %>%
  tail()

## # A tibble: 6 x 3
##   triceps triceps_NA     n
##     <dbl> <fct>      <int>
## 1      54 !NA            2
## 2      56 !NA            1
## 3      60 !NA            1
## 4      63 !NA            1
## 5      99 !NA            1
## 6      NA NA           227

▶ We can then use these new variables to plot the missing data alongside the actual data.
Visualising Missing Data (cont.)

▶ Let’s plot a histogram of the (log-transformed) glucose readings grouped by those that are
missing and not missing by the (log-transformed) insulin reading.

dat_shadow %>%
  ggplot(aes(x = log1p_glucose,
             fill = log1p_insulin_NA)) +
  # the shadow matrix provides a variable for grouping
  geom_histogram() +
  scale_fill_manual(
    values = ggcolour,
    labels = c("Present", "Missing"),
    name = "Insulin reading") +
  ggtitle("Histogram of log-glucose by
          insulin missing data")

[Figure: Histogram of log1p_glucose, filled by whether the insulin reading is present or missing.]

▶ So what can you comment on?


Visualising Missing Data (cont.)
▶ The plot on the previous slide indicates that respondents who didn’t provide a response for
insulin seem to have the same distribution of glucose as respondents who did
provide a response.

▶ It suggests the missing data mechanism is ignorable.

▶ Here is another example plot based on the recs data (code can be found at the back of
this slide deck).

[Figure: Histogram of total energy cost (truncated at $5000) by whether the heating behaviour response is present or missing.]

▶ This plot indicates that respondents who did not provide a response for the heating
behaviour question may have a different distribution of total energy cost compared to
respondents who did provide a response.

▶ This view of the raw data and missingness could indicate some bias in the data.

▶ Researchers take these different bias aspects into account when calculating weights, and we
need to make sure that the weights are incorporated when analysing the data.
Visualising Missing Data (cont.)

▶ Insulin seems to have the most missing data in this dataset.

▶ Normally it’s hard to explore the relationship between two variables, as geom_point() will
just ignore any missing data.

# using regular geom_point()
ggplot(dat_shadow,
       aes(x = log1p_insulin,
           y = log1p_glucose)) +
  geom_point()

## Warning: Removed 375 rows containing missing values (`geom_point()`).

[Figure: Scatterplot of log1p_glucose against log1p_insulin; rows with missing values are silently dropped.]


Visualising Missing Data (cont.)
▶ There are a few different ways to explore different missing data mechanisms and
relationships.

▶ One approach involves shifting missing values to where they can be seen alongside regular
values on a graph, marking them with a different colour.

▶ This is implemented with geom_miss_point().

ggplot(dat_shadow,
       aes(x = log1p_insulin,
           y = log1p_glucose)) +
  geom_miss_point()

[Figure: Scatterplot of log1p_glucose against log1p_insulin using geom_miss_point(); missing values are shifted below the minimum of each axis and coloured separately from the "Not Missing" points.]
Statistical tests and Modelling of missingness
▶ Before we move on to discuss imputation, we will finish this section by discussing a test of
missingness as well as a way to model missingness.

▶ naniar provides mcar_test() for Little (1988)’s statistical test for missing completely at
random (MCAR) data.

▶ The null hypothesis in this test is that the data is MCAR, and the test statistic is a
chi-squared value.

mcar_test(dat)

## # A tibble: 1 x 4
##   statistic    df  p.value missing.patterns
##       <dbl> <dbl>    <dbl>            <int>
## 1      175.    69 3.17e-11               11

▶ Given the high statistic value and low p-value, we can conclude that the
PimaIndiansDiabetes data is not missing completely at random.
Statistical tests and Modelling of missingness (cont.)

▶ One question many of you may have: can we model missingness?

▶ Sometimes it can be impractical to explore all of the variables that have missing data.

▶ One approach, however, is to model missing data using methods from N. J. Tierney et al.
(2015).

▶ Here, the approach is to predict the proportion of missingness in a given case, using all
variables.

▶ There is a little helper function, add_prop_miss(), that adds a column with the proportion
of values missing in each case or row. This creates a column named prop_miss_all, which
is the proportion of missing values in that row.



Statistical tests and Modelling of missingness (cont.)
dat %>%
add_prop_miss() %>%
print(n = 5, width = Inf)

## # A tibble: 768 x 10
## pressure triceps diabetes log1p_pregnant log1p_glucose log1p_insulin
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 72 35 pos 1.95 5.00 NA
## 2 66 29 neg 0.693 4.45 NA
## 3 64 NA pos 2.20 5.21 NA
## 4 66 23 neg 0.693 4.50 4.55
## 5 40 35 pos 0 4.93 5.13
## log1p_mass log1p_pedigree log1p_age prop_miss_all
## <dbl> <dbl> <dbl> <dbl>
## 1 3.54 0.487 3.93 0.111
## 2 3.32 0.301 3.47 0.111
## 3 3.19 0.514 3.50 0.222
## 4 3.37 0.154 3.09 0
## 5 3.79 1.19 3.53 0
## # i 763 more rows
Statistical tests and Modelling of missingness (cont.)

▶ We can then use a model like decision trees to predict which variables and their values are
important for predicting the proportion of missingness:
# pak::pak(c("rpart", "rpart.plot"))
library(rpart)
library(rpart.plot)

dat %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data = .) %>%
  prp(type = 4, extra = 101,
      prefix = "Prop. Miss = ")

[Figure: Decision tree predicting the proportion of missingness per row; the first split is on log1p_pedigree, with further splits on log1p_age and log1p_pregnant.]

▶ Here we can see that this can produce quite a complex tree.

▶ This can be pruned back, and the depth of the decision tree can be controlled.


Want to learn more about naniar

▶ For more details on the workflow and theory underpinning naniar, read the vignette
"Getting started with naniar".

▶ For a short primer on the data visualisations available in naniar, read the vignette "Gallery
of Missing Data Visualisations".



To impute or not impute?
Deletion methods

▶ The simplest method for handling missing data is to eliminate rows or columns until a
dataset is entirely filled.
▶ The most common method is complete case analysis (also known as list-wise deletion), where
only data with complete information are considered.

▶ Another common method is available case analysis (also called pairwise deletion), where
variables with many missing values are deleted first, followed by removal of any remaining
samples containing missing data.

▶ Both of these methods assume Missing Completely at Random (MCAR) or Missing at
Random (MAR).



Complete and available case fits
# complete case analysis
# rows with any NAs are removed
dat_complete <- dat %>%
  drop_na()

# diabetes ~ everything
res_glm1 <- glm(diabetes ~ .,
                data = dat_complete,
                family = binomial)

# diabetes ~ everything except triceps & insulin
res_glm2 <- glm(diabetes ~ . - triceps - log1p_insulin,
                data = dat_complete,
                family = binomial)

dat_available1 <- dat %>%
  select(-triceps, -log1p_insulin) %>%
  drop_na()

dat_available2 <- dat %>%
  select(-triceps, -log1p_insulin, -pressure) %>%
  drop_na()

# complete case analysis can be a bit wasteful
rbind(dim(dat_complete),
      dim(dat_available1),
      dim(dat_available2))

##      [,1] [,2]
## [1,]  392    9
## [2,]  724    7
## [3,]  752    6

res_glm3 <- glm(diabetes ~ .,
                data = dat_available1,
                family = binomial)

res_glm4 <- glm(diabetes ~ .,
                data = dat_available2,
                family = binomial)
Comparing complete and available case analyses

library(gtsummary)
tbl_merge(
map(list(res_glm1, res_glm2, res_glm3, res_glm4),
~ tbl_regression(.x) %>%
add_glance_table(include = c(AIC, nobs))),
tab_spanner = c("**res_glm1**", "**res_glm2**", "**res_glm3**", "**res_glm4**")) %>%
as_kable_extra(format = "latex", booktabs = TRUE)

res_glm1 res_glm2 res_glm3 res_glm4

Characteristic log(OR) 95% CI p-value log(OR) 95% CI p-value log(OR) 95% CI p-value log(OR) 95% CI p-value

pressure 0.00 -0.03, 0.02 0.8 0.00 -0.03, 0.02 0.8 -0.01 -0.03, 0.01 0.2
triceps 0.01 -0.03, 0.04 0.8
log1p_pregnant 0.09 -0.38, 0.56 0.7 0.09 -0.39, 0.56 0.7 0.37 0.07, 0.67 0.017 0.35 0.06, 0.64 0.018
log1p_glucose 4.5 3.1, 6.1 <0.001 4.6 3.4, 6.0 <0.001 4.5 3.6, 5.4 <0.001 4.5 3.7, 5.5 <0.001
log1p_insulin 0.06 -0.47, 0.58 0.8

log1p_mass 2.7 0.79, 4.8 0.007 3.0 1.4, 4.5 <0.001 3.5 2.4, 4.6 <0.001 3.3 2.3, 4.3 <0.001
log1p_pedigree 1.9 0.57, 3.3 0.006 2.0 0.61, 3.4 0.005 1.7 0.74, 2.7 <0.001 1.6 0.67, 2.6 <0.001
log1p_age 2.1 0.76, 3.4 0.002 2.1 0.80, 3.4 0.002 1.2 0.40, 1.9 0.003 0.92 0.21, 1.6 0.011

AIC 358 355 678 708


No. Obs. 392 392 724 752
1 OR = Odds Ratio, CI = Confidence Interval



Single imputation methods

▶ The next step is to fill in the missing values with realistic values.

▶ This should be done with plenty of care.

▶ Suppose we fill in all missing values for a particular covariate with the mean of the observed
values in that covariate; the variance of that covariate then becomes artificially small.

▶ This means inferences based on these methods will, in general, be unreliable.

▶ However, if the goal is prediction then single imputation methods can still be reasonable.

▶ When imputing, keep in mind that some variables are restricted to be positive or bounded.
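To see the variance shrinkage concretely, here is a minimal base-R sketch on simulated data (not the diabetes data used in these slides):

```r
set.seed(1)
x <- rnorm(200)               # a fully observed covariate
x_mis <- x
x_mis[sample(200, 60)] <- NA  # make 30% of the values missing (MCAR)

# Mean imputation: every missing entry receives the same value
x_imp <- ifelse(is.na(x_mis), mean(x_mis, na.rm = TRUE), x_mis)

var(x)      # variance of the complete data
var(x_imp)  # artificially smaller after mean imputation
```

The imputed covariate keeps the mean of the observed values but has a deflated variance, which is exactly why standard errors computed from singly imputed data tend to be too small.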



Single imputation methods (cont.)

▶ We will now consider a number of methods that replace each missing value with a single
value:
  ▶ Mean/median substitution.
  ▶ Regression imputation:
    ▶ Linear regression models.
    ▶ Robust regression models.
    ▶ Penalised regression.
  ▶ Machine learning imputation:
    ▶ k-nearest neighbours.
    ▶ CART imputation.
    ▶ Random forest imputation.

▶ There is also hot-deck (and even cold-deck) imputation (which we will not cover here) and
expectation maximisation.



Single imputation methods (cont.)
▶ Below is an overview of the imputation methods supported by the simputation package of Loo
(2022).

function method library


impute_rlm M-estimation MASS
impute_en ridge/elasticnet/lasso glmnet
impute_cart CART rpart
impute_rf random forest randomForest
impute_rhd random hot deck VIM (optional)
impute_shd sequential hot deck VIM (optional)
impute_knn k nearest neighbours VIM (optional)
impute_mf missForest missForest
impute_em mv-normal norm

▶ The general syntax and the formula are of the form


# general syntax
impute_<model>(data, formula, [model-specific options])
# formula syntax
IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ]
# where the part between [] is optional.
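To make the optional GROUPING part concrete, here is a base-R sketch of what grouped median imputation (something like impute_median(dat, y ~ 1 | g)) does, using toy data rather than the simputation package itself:

```r
# Toy data: y has one missing value in each group of g
d <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                y = c(1, 3, NA, 10, NA, 30))

# Within-group medians of the observed values, replicated to each row
grp_med <- ave(d$y, d$g, FUN = function(v) median(v, na.rm = TRUE))

# Replace each NA by the median of its own group
d$y_imp <- ifelse(is.na(d$y), grp_med, d$y)
d$y_imp  # 1 3 2 10 20 30
```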
Mean/Median substitution

# pak::pak("simputation")
library(simputation)

# Impute all missing values by their mean value
# (an intercept-only linear model imputes the mean)
dat_imp1 <- dat %>%
  impute_lm(log1p_glucose ~ 1) %>%
  impute_lm(pressure ~ 1) %>%
  impute_lm(triceps ~ 1) %>%
  impute_lm(log1p_insulin ~ 1) %>%
  impute_lm(log1p_mass ~ 1)

# Impute all missing values by their median value
dat_imp2 <- dat %>%
  impute_median(log1p_glucose ~ 1) %>%
  impute_median(pressure ~ 1) %>%
  impute_median(triceps ~ 1) %>%
  impute_median(log1p_insulin ~ 1) %>%
  impute_median(log1p_mass ~ 1)

res_glm5 <- glm(diabetes ~ .,
                data = dat_imp1,
                family = binomial)

res_glm6 <- glm(diabetes ~ .,
                data = dat_imp2,
                family = binomial)



Mean/Median substitution with added noise

# Impute all missing values by their mean value,
# then add normally distributed noise
dat_imp3 <- dat %>%
  impute_lm(log1p_glucose ~ 1,
            add_residual = "normal") %>%
  impute_lm(pressure ~ 1,
            add_residual = "normal") %>%
  impute_lm(triceps ~ 1,
            add_residual = "normal") %>%
  impute_lm(log1p_insulin ~ 1,
            add_residual = "normal") %>%
  impute_lm(log1p_mass ~ 1,
            add_residual = "normal")

# Impute all missing values by their median value,
# then add a randomly drawn observed residual
dat_imp4 <- dat %>%
  impute_median(log1p_glucose ~ 1,
                add_residual = "observed") %>%
  impute_median(pressure ~ 1,
                add_residual = "observed") %>%
  impute_median(triceps ~ 1,
                add_residual = "observed") %>%
  impute_median(log1p_insulin ~ 1,
                add_residual = "observed") %>%
  impute_median(log1p_mass ~ 1,
                add_residual = "observed")

res_glm7 <- glm(diabetes ~ .,
                data = dat_imp3,
                family = binomial)

res_glm8 <- glm(diabetes ~ .,
                data = dat_imp4,
                family = binomial)



Comparing the results
res_glm1 res_glm5 res_glm6

Characteristic log(OR) 95% CI p-value log(OR) 95% CI p-value log(OR) 95% CI p-value

pressure 0.00 -0.03, 0.02 0.8 -0.01 -0.03, 0.01 0.2 -0.01 -0.03, 0.01 0.2
triceps 0.01 -0.03, 0.04 0.8 0.00 -0.03, 0.03 >0.9 0.00 -0.03, 0.03 >0.9
log1p_pregnant 0.09 -0.38, 0.56 0.7 0.38 0.10, 0.67 0.010 0.38 0.10, 0.67 0.010
log1p_glucose 4.5 3.1, 6.1 <0.001 4.6 3.7, 5.6 <0.001 4.6 3.7, 5.6 <0.001
log1p_insulin 0.06 -0.47, 0.58 0.8 -0.02 -0.47, 0.43 >0.9 -0.01 -0.46, 0.44 >0.9

log1p_mass 2.7 0.79, 4.8 0.007 3.6 2.4, 4.8 <0.001 3.6 2.4, 4.8 <0.001
log1p_pedigree 1.9 0.57, 3.3 0.006 1.5 0.59, 2.5 0.002 1.5 0.59, 2.5 0.002
log1p_age 2.1 0.76, 3.4 0.002 1.0 0.30, 1.8 0.006 1.0 0.31, 1.8 0.005

AIC 358 724 724


No. Obs. 392 768 768
1 OR = Odds Ratio, CI = Confidence Interval

res_glm7 res_glm8

Characteristic log(OR) 95% CI p-value log(OR) 95% CI p-value

pressure -0.01 -0.03, 0.01 0.3 -0.01 -0.03, 0.01 0.2


triceps 0.01 -0.01, 0.03 0.5 0.00 -0.03, 0.03 >0.9
log1p_pregnant 0.39 0.10, 0.68 0.008 0.38 0.10, 0.67 0.010
log1p_glucose 4.4 3.5, 5.4 <0.001 4.6 3.7, 5.6 <0.001
log1p_insulin 0.10 -0.18, 0.38 0.5 -0.01 -0.46, 0.44 >0.9

log1p_mass 3.5 2.3, 4.6 <0.001 3.6 2.4, 4.8 <0.001


log1p_pedigree 1.6 0.62, 2.5 0.001 1.5 0.59, 2.5 0.002
log1p_age 1.0 0.28, 1.7 0.007 1.0 0.31, 1.8 0.005

AIC 724 724


No. Obs. 768 768
1 OR = Odds Ratio, CI = Confidence Interval
Regression imputation and its visualisation
summary(lm(log1p_insulin ~ .,
           data = dat_complete))

##
## Call:
## lm(formula = log1p_insulin ~ ., data = dat_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.07402 -0.32786 -0.00332 0.34950 1.50801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.4284331 0.8424177 -6.444 3.5e-10 ***
## pressure -0.0034987 0.0023442 -1.493 0.136387
## triceps -0.0027364 0.0034601 -0.791 0.429521
## diabetespos -0.0050517 0.0693882 -0.073 0.942000
## log1p_pregnant 0.0009177 0.0481307 0.019 0.984798
## log1p_glucose 1.6601886 0.1275366 13.017 < 2e-16 ***
## log1p_mass 0.6199578 0.1837378 3.374 0.000816 ***
## log1p_pedigree 0.0607200 0.1369102 0.444 0.657653
## log1p_age 0.1268410 0.1366074 0.929 0.353729
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
Regression imputation and its visualisation (cont.)

imp1 <- dat_shadow %>%


impute_lm(log1p_insulin ~ log1p_glucose + log1p_mass)
imp2 <- dat_shadow %>%
impute_lm(log1p_insulin ~ log1p_glucose + log1p_mass,
add_residual = "normal")
g1 <- imp1 %>%
ggplot(aes(x = log1p_glucose, y = log1p_insulin, colour = log1p_insulin_NA )) +
geom_point() +
ggtitle("lm impute")
g2 <- imp2 %>%
ggplot(aes(x = log1p_glucose, y = log1p_insulin, colour = log1p_insulin_NA )) +
geom_point() +
ggtitle("lm impute + noise")
library(patchwork)
g1 + g2



Regression imputation and its visualisation (cont.)

[Figure: two scatterplots of log1p_insulin against log1p_glucose, coloured by log1p_insulin_NA (!NA vs NA); left panel "lm impute", right panel "lm impute + noise".]


Nearest neighbour imputation and its visualisation
library(DMwR2)
dat_imp1 <- knnImputation(dat, k = 1)
dat_imp2 <- knnImputation(dat, k = 2)
dat_imp3 <- knnImputation(dat, k = 3)

ind_na <- dat %>%
  mutate(any_na = rowSums(is.na(.)) > 0) %>%
  pull(any_na)

g3 <- dat_imp1 %>%
  ggplot(aes(x = log1p_glucose, y = log1p_insulin,
             colour = ind_na)) +
  geom_point() + ggtitle("k = 1")

g4 <- dat_imp2 %>%
  ggplot(aes(x = log1p_glucose, y = log1p_insulin,
             colour = ind_na)) +
  geom_point() + ggtitle("k = 2")

g5 <- dat_imp3 %>%
  ggplot(aes(x = log1p_glucose, y = log1p_insulin,
             colour = ind_na)) +
  geom_point() + ggtitle("k = 3")

g6 <- dat_imp1 %>%
  ggplot(aes(x = log1p_mass, y = triceps,
             colour = ind_na)) +
  geom_point() + ggtitle("k = 1")

g7 <- dat_imp2 %>%
  ggplot(aes(x = log1p_mass, y = triceps,
             colour = ind_na)) +
  geom_point() + ggtitle("k = 2")

g8 <- dat_imp3 %>%
  ggplot(aes(x = log1p_mass, y = triceps,
             colour = ind_na)) +
  geom_point() + ggtitle("k = 3")

patchwork::wrap_plots(
  list(g3, g4, g5, g6, g7, g8), ncol = 3)
Nearest neighbour imputation and its visualisation (cont.)
[Figure: six scatterplots in two rows for k = 1, 2, 3; the top row plots log1p_insulin against log1p_glucose and the bottom row plots triceps against log1p_mass, with points coloured by ind_na (FALSE = fully observed row, TRUE = row contained a missing value).]



Multiple imputation

▶ When inference is important, the data should be imputed multiple times.

▶ Doing this properly is complicated, because there are two sources of variation:
1. Variation from estimation of the model parameters (including any parameters in the model for the covariates);

2. Variation from the missing values themselves.

▶ Many methods iterate between these two steps, forming a Markov chain in which the imputed
values mimic draws from the correct (posterior) distribution.

▶ Usually only a small number of completed datasets is needed (m = 5 is often recommended),
with around 20 iterations between estimating the model parameters and imputing the missing values.



Combining the results

▶ Symbolically, we impute a matrix $X^{(\text{mis})}$ that contains missing values, and mice outputs
m complete datasets (containing no missing values)
$$X^{(1)}, \dots, X^{(m)}.$$

▶ Suppose that the estimates of $\theta$ corresponding to $X^{(1)}, \dots, X^{(m)}$ are
$$\hat{\theta}^{(1)}, \dots, \hat{\theta}^{(m)}$$
where the sampling variances for each of the estimates are
$$\hat{V}^{(1)}, \dots, \hat{V}^{(m)}.$$

▶ Then the commonly used point estimate for $\theta$ is
$$\hat{\theta} = \frac{1}{m} \sum_{i=1}^{m} \hat{\theta}^{(i)}.$$


Combining the results (cont.)

▶ The estimated variance of $\hat{\theta}$ is
$$\widehat{\operatorname{Var}}(\hat{\theta}) = \left(1 + \frac{1}{m}\right) B + W$$
where the between and within variances are
$$B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{\theta}^{(i)} - \hat{\theta})(\hat{\theta}^{(i)} - \hat{\theta})^{\mathsf{T}}$$
and
$$W = \frac{1}{m} \sum_{i=1}^{m} \hat{V}^{(i)}.$$



Testing scalars

▶ Suppose that we want to test $\theta_j = \theta_{j0}$ for a particular j. Rubin (1987) showed that
$$\frac{\hat{\theta}_j - \theta_{j0}}{\sqrt{\widehat{\operatorname{Var}}(\hat{\theta})_{jj}}} \overset{\text{approx}}{\sim} t_{\nu}$$
where
$$\nu = (m-1)\left(1 + \frac{W_{jj}}{(1 + m^{-1}) B_{jj}}\right)^{2}.$$

▶ The statistic
$$r_j = (1 + m^{-1}) B_{jj} / W_{jj}$$
can be interpreted as the relative increase in variance due to missing data.
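The degrees of freedom and the relative increase in variance are a one-liner each; here is a base-R sketch (the helper name mi_df is invented):

```r
# Approximate t degrees of freedom and relative increase in variance
# for coefficient j, following Rubin (1987)
mi_df <- function(m, B_jj, W_jj) {
  r  <- (1 + 1 / m) * B_jj / W_jj   # relative increase in variance
  nu <- (m - 1) * (1 + 1 / r)^2     # = (m - 1) * (1 + W_jj / ((1 + 1/m) * B_jj))^2
  c(r = r, df = nu)
}

mi_df(m = 5, B_jj = 0.0125, W_jj = 0.045)  # r = 1/3, df = 64
```

When the between-imputation variance is small relative to the within-imputation variance, r is small and the degrees of freedom become large, so the t reference distribution approaches a normal.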



Multiple imputation with mice
▶ We will use the package mice for multiple imputation and the dataset airquality
for demonstration.

▶ The dataset airquality has two variables with missing values: Ozone and Solar.R.

# pak::pak("mice")
library(mice)
data(airquality)

airquality_imputed <-
  mice(airquality, m = 5,
       maxit = 50, method = 'pmm',
       seed = 1, printFlag = FALSE)
summary(airquality_imputed)

## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day
##   "pmm"   "pmm"      ""      ""      ""      ""
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   1
## Solar.R     1       0    1    1     1   1
## Wind        1       1    0    1     1   1
## Temp        1       1    1    0     1   1
## Month       1       1    1    1     0   1
## Day         1       1    1    1     1   0
Multiple imputation with mice (cont.)

# Get the m = 5 different missing values


head(airquality_imputed$imp$Ozone)

## 1 2 3 4 5
## 5 14 28 6 6 6
## 10 23 27 21 13 41
## 25 18 19 19 18 8
## 26 7 1 19 7 13
## 27 12 1 18 21 9
## 32 39 35 46 47 28



Multiple imputation with mice (cont.)

▶ We can then fit multiple models using each complete dataset

fit <- with(airquality_imputed,
            lm(Ozone ~ Solar.R + Wind + Temp))
# combine results of all 5 models using Rubin's rule
combine <- pool(fit)
knitr::kable(summary(combine),
             booktabs = TRUE, digits = 3) %>%
  kableExtra::kable_styling()

term         estimate  std.error  statistic       df  p.value

(Intercept)   -67.238     20.527     -3.276   80.885    0.002
Solar.R         0.060      0.020      2.950  126.594    0.004
Wind           -2.967      0.581     -5.107   96.095    0.000
Temp            1.633      0.232      7.036   61.808    0.000
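The pooled df column feeds a $t_\nu$ reference distribution; for example, a 95% confidence interval for the Temp coefficient can be recovered from the rounded values in the table above (a sketch, not mice output):

```r
est <- 1.633; se <- 0.232; df <- 61.808   # values from the pooled table
est + c(-1, 1) * qt(0.975, df) * se       # approx. (1.17, 2.10)
```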



Other multiple imputation packages

The main alternatives to mice are:


▶ Amelia

▶ missForest

▶ Hmisc

▶ mi



Code for recs data histogram

data(recs_2020)
recs_2020_shadow <- recs_2020 %>%
bind_shadow()
recs_2020_shadow %>%
filter(TOTALDOL < 5000) %>%
ggplot(aes(x = TOTALDOL, fill = HeatingBehavior_NA)) +
geom_histogram() +
scale_fill_manual(values = ggcolour,
labels = c("Present", "Missing"),
name = "Heating Behavior") +
xlab("Total Energy Cost (Truncated at $5000)") +
ylab("Number of Households") +
labs(title = "Histogram of Energy Cost by Heating Behavior Missing Data")



Want to learn more?

▶ See Sections 8.5 & 8.6 of Sharon L. Lohr (2021) to learn how to adjust sampling weights for
non-response.

▶ Go to this link here to learn more about how to recode missing data and about accounting for
skip patterns in your survey.

▶ See Section 8.7 of Sharon L. Lohr (2021) and Section 11 of Nguyen (2020), which you can
access here, to learn more about imputation.



References I

Little, Roderick J. A. 1988. “A Test of Missing Completely at Random for Multivariate Data with Missing Values.”
Journal of the American Statistical Association 83 (404): 1198–1202.
[Link]
Loo, Mark van der. 2022. “Simputation: Simple Imputation.”
[Link]
Nguyen, Mike. 2020. A Guide on Data Analysis. Bookdown. [Link]
Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Ltd. [Link]
Lohr, Sharon L. 2021. Sampling: Design and Analysis. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: Chapman and Hall/CRC. [Link]
Thomsen, Ib, and Erling Siring. 1980. “On the Causes and Effects of Non-response: Norwegian Experiences.”
Statistisk Sentralbyrå, no. 121: 52.
Tierney, Nicholas J, Fiona A Harden, Maurice J Harden, and Kerrie L Mengersen. 2015. “Using Decision Trees to
Understand Structure in Missing Data.” BMJ Open 5 (6): e007450.
[Link]



References II

Tierney, Nicholas, Sean Hughes, Mara Averick, Stuart Lee, Earo Wang, Nic Crane, and Christophe Regouby. 2023. "Visdat: Preliminary Visualisation of Data." [Link]

Zimmer, Stephanie, Rebecca J. Powell, and Isabella Velásquez. n.d. Exploring Complex Survey Data Analysis in R.
Accessed February 26, 2024. [Link]

