Missing data

Missing completely at random (MCAR)
The pattern of missingness is random, unrelated to any variable in the data.
Think: aliens.

Missing at random (MAR)
The pattern of missingness on variable A is related to some other observed variable B.
Think: raccoons.

Missing not at random (MNAR)
The pattern of missingness on variable A is related to the values of variable A itself.
Think: criminal mastermind.
From RM:
The fact that a variable has an unobserved value is still an observation. It is data, just with a very special value. The meaning of this value depends upon the context. Consider for example a questionnaire on personal income. If some people refuse to fill in their income, this may be associated with low (or high) income. Therefore a model that tries to predict the missing values can be enlightening. In ecology, the absence of an observation of a species is a subtle kind of observation. It could mean the species isn’t there. Or it could mean it is there but you didn’t see it. An entire category of models, occupancy models, exists to take this duality into account. Missing values are always produced by some process, and thinking about that process can sometimes solve big problems.
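These three mechanisms are easy to simulate. Here is a minimal sketch with hypothetical variables S (e.g., studying) and H (homework quality), echoing the S and H nodes in the DAGs that follow; the thresholds and sample size are arbitrary:

```r
# Minimal simulation (hypothetical data) contrasting the three mechanisms.
set.seed(1)
n <- 1e4
S <- rnorm(n)            # studying
H <- rnorm(n, mean = S)  # homework quality depends on studying

H_mcar <- ifelse(rbinom(n, 1, 0.5) == 1, NA, H)  # MCAR: deleted at random
H_mar  <- ifelse(S > 0, NA, H)                   # MAR: deletion depends on S
H_mnar <- ifelse(H < 0, NA, H)                   # MNAR: deletion depends on H itself

mean(H, na.rm = TRUE)       # full-data benchmark, ~0
mean(H_mcar, na.rm = TRUE)  # ~0: MCAR leaves the distribution intact
mean(H_mar, na.rm = TRUE)   # shifted, but recoverable by conditioning on S
mean(H_mnar, na.rm = TRUE)  # shifted, and knowing S cannot rescue it
```

The point of the sketch: under MCAR the surviving values are still a fair sample; under MAR they are biased but fixable by conditioning on the observed cause of missingness; under MNAR the missingness process itself must be modeled.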
# panel a
dag_coords <-
  tibble(name = c("S", "H", "Hs", "D"),
         x = c(1, 2, 2, 1),
         y = c(2, 2, 1, 1))

p1 <-
  dagify(H ~ S,
         Hs ~ H + D,
         coords = dag_coords) %>%
  tidy_dagitty() %>%
  mutate(color = ifelse(name == "H", "a", "b")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color = color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S")) + # labels follow alphabetical node order
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

# panel b
p2 <-
  dagify(H ~ S,
         Hs ~ H + D,
         D ~ S,
         coords = dag_coords) %>%
  tidy_dagitty() %>%
  mutate(color = ifelse(name == "H", "a", "b")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color = color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S")) + # labels follow alphabetical node order
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

# panel c - coordinates for 5 nodes
dag_coords_c <-
  tibble(name = c("S", "H", "Hs", "D", "X"),
         x = c(1, 2, 2, 1, 1.5),
         y = c(2, 2, 1, 1, 1.5))

p3 <-
  dagify(H ~ S + X,
         Hs ~ H + D,
         D ~ X,
         coords = dag_coords_c) %>%
  tidy_dagitty() %>%
  mutate(color = ifelse(name %in% c("H", "X"), "a", "b")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color = color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S", "X")) + # labels follow alphabetical node order
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

# panel d
p4 <-
  dagify(H ~ S,
         Hs ~ H + D,
         D ~ H,
         coords = dag_coords) %>%
  tidy_dagitty() %>%
  mutate(color = ifelse(name == "H", "a", "b")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color = color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S")) + # labels follow alphabetical node order
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

(p1 + p2 + p3 + p4) +
  plot_annotation(tag_levels = "a", tag_prefix = "(", tag_suffix = ")") &
  theme(plot.margin = margin(0.15, 0.15, 0.15, 0.15, "in"))

age – participants’ age
bmi – body mass index
hyp – hypertensive
chl – total serum cholesterol

     vars  n   mean    sd median   min   max range  skew kurtosis    se
age 1 25 1.76 0.83 2.00 1.0 3.0 2.0 0.44 -1.47 0.17
bmi 2 16 26.56 4.22 26.75 20.4 35.3 14.9 0.39 -0.83 1.05
hyp 3 17 1.24 0.44 1.00 1.0 2.0 1.0 1.14 -0.73 0.11
chl 4 15 191.40 45.22 187.00 113.0 284.0 171.0 -0.09 -0.43 11.67
age bmi hyp chl
1 1 NA NA NA
2 2 22.7 1 187
3 1 NA 1 187
4 3 NA NA NA
5 1 20.4 1 113
6 3 NA NA 184
7 1 22.5 1 118
8 1 30.1 1 187
9 2 22.0 1 238
10 2 NA NA NA
11 1 NA NA NA
12 2 NA NA NA
13 3 21.7 1 206
14 2 28.7 2 204
15 1 29.6 1 NA
16 1 NA NA NA
17 3 27.2 2 284
18 2 26.3 2 199
19 1 35.3 1 218
20 3 25.5 2 NA
21 1 NA NA NA
22 1 33.2 1 229
23 1 27.5 1 131
24 3 24.9 1 NA
25 2 27.4 1 186
Let’s say we’re interested in studying the causal effect of cholesterol on BMI.
# Define coordinates for the nodes
dag_coords <- tibble(
  name = c("age", "chl", "bmi"),
  x = c(1, 2, 3),
  y = c(2, 2, 1)
)

# Create the DAG
dagify(
  bmi ~ age + chl,
  chl ~ age,
  coords = dag_coords
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 10) +
  geom_dag_text(size = 4) +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, "pt"), type = "closed")) +
  theme_dag()

 Family: gaussian
Links: mu = identity; sigma = identity
Formula: bmi ~ age + chl + age:chl
Data: d (Number of observations: 13)
Draws: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
total post-warmup draws = 6000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 13.25 7.22 -0.98 27.19 1.00 2512 2969
age -1.08 4.74 -10.48 8.42 1.00 2589 2700
chl 0.11 0.04 0.04 0.19 1.00 2700 3356
age:chl -0.02 0.02 -0.06 0.02 1.00 2387 2695
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 2.98 0.61 2.02 4.40 1.00 2447 2805
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Now we return to the mi() function from before. When brms sees mi(), it interprets it as a signal that a value is missing and should be imputed. In the previous lecture, we used it to indicate that the true value of our outcome (divorce rate) was unobserved, but could be imputed from our measurement and its standard error. Here, we’ll use it to flag values that are missing entirely.
m2 <- brm(
  data = d,
  family = gaussian,
  bf(bmi | mi() ~ age * chl),
  prior = c(prior(normal(0, 10), class = Intercept),
            prior(normal(0, 10), class = b),
            prior(exponential(1), class = sigma)),
  iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 1,
  save_pars = save_pars(latent = T),
  file = here("files/models/m101.2")
)

Note that we imputed values for only two of our participants. That’s because we threw out the participants who were also missing values for cholesterol.
Estimate Est.Error Q2.5 Q97.5
b_Intercept 13.66084019 7.37313173 -0.71736117 28.57811312
b_age -1.36387746 4.84825569 -11.13526859 8.24436780
b_chl 0.11042045 0.03877522 0.03323725 0.18556068
b_age:chl -0.01813567 0.02157951 -0.06056509 0.02507034
sigma 2.99552652 0.61771493 2.04345882 4.53120985
Intercept 26.10433116 0.88077406 24.28372927 27.81724992
Ymi[2] 29.54989319 3.27712282 22.97137909 36.06398022
Ymi[4] 19.83850975 4.01928004 11.91997753 27.91179381
lprior -19.41956131 0.69864255 -21.07188515 -18.36397422
lp__ -56.81789108 2.29046184 -62.50340221 -53.53969572
But we can estimate those missing cholesterol values as well. Now that we have multiple outcomes, we have to specify which outcome (resp) each prior belongs to.
m3 <- brm(
  data = d,
  family = gaussian,
  bf(bmi | mi() ~ age * mi(chl)) +
    bf(chl | mi() ~ 1) +
    set_rescor(FALSE),
  prior = c(prior(normal(20, 10), class = Intercept, resp = bmi, lb = 0),
            prior(normal(100, 50), class = Intercept, resp = chl, lb = 0),
            prior(normal(0, 10), class = b),
            prior(exponential(1), class = sigma, resp = bmi),
            prior(exponential(1), class = sigma, resp = chl)),
  iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 1,
  save_pars = save_pars(latent = T),
  file = here("files/models/m101.3b")
)

 Family: MV(gaussian, gaussian)
Links: mu = identity; sigma = identity
mu = identity; sigma = identity
Formula: bmi | mi() ~ age * mi(chl)
chl | mi() ~ 1
Data: d (Number of observations: 25)
Draws: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
total post-warmup draws = 6000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
bmi_Intercept 11.00 7.72 -4.16 26.06 1.00 2328 3022
chl_Intercept 190.28 6.77 176.68 203.56 1.00 3805 3992
bmi_age 2.82 4.70 -6.75 11.86 1.00 2142 3047
bmi_michl 0.11 0.04 0.04 0.19 1.00 2207 3462
bmi_michl:age -0.03 0.02 -0.07 0.01 1.00 2110 3022
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_bmi 2.96 0.57 2.07 4.26 1.00 2514 3853
sigma_chl 26.95 2.61 22.40 32.48 1.00 4736 4363
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Hooray! Now we’ve not only imputed values for cholesterol, but we’ve also imputed more values for BMI.
Estimate Est.Error Q2.5 Q97.5
b_bmi_Intercept 11.00292211 7.71766893 -4.15889940 26.05787440
b_chl_Intercept 190.27580054 6.77316464 176.67764330 203.55783373
b_bmi_age 2.81947981 4.70402189 -6.75040021 11.85982329
bsp_bmi_michl 0.11419749 0.04040533 0.03695908 0.19269591
bsp_bmi_michl:age -0.03154068 0.02227456 -0.07465667 0.01281869
sigma_bmi 2.96167622 0.57206550 2.07438660 4.25900928
sigma_chl 26.95003583 2.60940066 22.39710883 32.48262761
Intercept_bmi 15.96520658 3.89977587 8.24836695 23.76109013
Intercept_chl 190.27580054 6.77316464 176.67764330 203.55783373
Ymi_bmi[1] 29.52505935 4.03300188 21.59564129 37.51233733
Ymi_bmi[3] 29.21871415 3.24683113 22.95113041 35.78827551
Ymi_bmi[4] 23.17133992 3.62520077 15.75464866 30.27312629
Ymi_bmi[6] 23.04989623 3.49206657 16.27517841 29.80444468
Ymi_bmi[10] 26.36546858 3.50293440 19.38320450 33.25983865
Ymi_bmi[11] 29.63427529 3.98270171 21.81064930 37.46774725
Ymi_bmi[12] 26.38061879 3.49128305 19.44938351 33.41758240
Ymi_bmi[16] 29.44956970 3.98224787 21.78025029 37.23174419
Ymi_bmi[21] 29.61676627 4.09280209 21.53134499 37.81530406
Ymi_chl[1] 189.86705130 27.99295690 134.89401657 245.19030981
Ymi_chl[4] 190.18782593 27.73979217 134.39094208 244.45507836
Ymi_chl[10] 189.90638441 27.89791446 134.49973729 245.09103336
Ymi_chl[11] 191.36232236 27.51154040 137.43193901 245.90780320
Ymi_chl[12] 190.59849711 28.07768981 135.37395776 246.73913511
Ymi_chl[15] 190.32240331 22.37205761 147.41314647 235.59196919
Ymi_chl[16] 189.78630428 27.96298792 136.15320332 245.76551613
Ymi_chl[20] 195.81275676 27.45515789 140.77752722 249.78113679
Ymi_chl[21] 190.75896739 27.51943382 136.77215916 244.18524529
Ymi_chl[24] 194.75081534 27.11424947 140.50579698 248.38187913
lprior -39.71471543 2.69032134 -45.50902046 -34.92718490
lp__ -221.17872027 4.52524758 -231.03823697 -213.61918386
And once more, we can improve our model, because we can use information from age to better impute cholesterol.
m4 <- brm(
  data = d,
  family = gaussian,
  bf(bmi | mi() ~ age * mi(chl)) +
    bf(chl | mi() ~ age) +
    set_rescor(FALSE),
  prior = c(prior(normal(20, 10), class = Intercept, resp = bmi, lb = 0),
            prior(normal(100, 50), class = Intercept, resp = chl, lb = 0),
            prior(normal(0, 10), class = b),
            prior(exponential(1), class = sigma, resp = bmi),
            prior(exponential(1), class = sigma, resp = chl)),
  iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 1,
  save_pars = save_pars(latent = T),
  file = here("files/models/m101.4b")
)

# Extract imputed cholesterol values
post3_chl = post3 |>
  select(starts_with("Ymi_chl")) |>
  pivot_longer(everything(), names_to = "imputation", values_to = "value") |>
  mutate(model = "m3")

post4_chl = post4 |>
  select(starts_with("Ymi_chl")) |>
  pivot_longer(everything(), names_to = "imputation", values_to = "value") |>
  mutate(model = "m4")

# Combine the data
combined_chl = bind_rows(post3_chl, post4_chl) |>
  with_groups(c(model, imputation), summarise, value = mean(value)) |>
  mutate(row = as.numeric(str_extract(imputation, "[0-9]{1,2}")))

ages = tibble(row = 1:nrow(d), age = d$age)

combined_chl |>
  inner_join(ages) |>
  ggplot(aes(x = age, y = value, color = model)) +
  geom_smooth(se = F) +
  geom_jitter() +
  scale_color_manual(values = c("m3" = "#5e8485", "m4" = "#0f393a")) +
  theme(legend.position = "bottom")

Much like when we incorporated measurement error, we may find that our coefficient estimates change.
post2 = as_draws_df(m2) |> mutate(model = "m2")

extract_coef = function(post){
  post |>
    select(starts_with("b"), model) |>
    pivot_longer(-model,
                 names_to = c("prefix", "outcome", "parameter"),
                 names_sep = "_",
                 values_to = "value") |>
    mutate(
      parameter = ifelse(is.na(parameter), outcome, parameter),
      outcome = ifelse(outcome == parameter, "bmi", outcome),
      parameter = str_remove(parameter, "mi"),
      parameter = ifelse(parameter == "age:chl", "chl:age", parameter)
    ) |>
    group_by(model, outcome, parameter) |>
    summarise(
      mean = mean(value),
      lower = quantile(value, 0.025),
      upper = quantile(value, 0.975),
      .groups = "drop"
    )
}

bind_rows(
  post2 |> extract_coef(),
  post3 |> extract_coef(),
  post4 |> extract_coef()
) |>
  mutate(estimate = sprintf("%.2f [%.2f, %.2f]", mean, lower, upper)) |>
  select(model, outcome, parameter, estimate) |>
  pivot_wider(names_from = model, values_from = estimate) |>
  knitr::kable(booktabs = T) |>
  kableExtra::kable_styling()

| outcome | parameter | m2 | m3 | m4 |
|---|---|---|---|---|
| bmi | Intercept | 13.66 [-0.72, 28.58] | 11.00 [-4.16, 26.06] | 13.87 [-1.52, 29.07] |
| bmi | age | -1.36 [-11.14, 8.24] | 2.82 [-6.75, 11.86] | -0.01 [-9.76, 10.08] |
| bmi | chl | 0.11 [0.03, 0.19] | 0.11 [0.04, 0.19] | 0.11 [0.03, 0.18] |
| bmi | chl:age | -0.02 [-0.06, 0.03] | -0.03 [-0.07, 0.01] | -0.02 [-0.07, 0.02] |
| chl | Intercept | NA | 190.28 [176.68, 203.56] | 140.18 [110.21, 170.09] |
| chl | age | NA | NA | 29.27 [13.31, 44.94] |
Multiple imputation is a statistical technique for handling missing data that creates several complete datasets by filling in missing values with plausible estimates. Rather than simply deleting cases with missing data or using a single imputation method, this approach generates multiple versions of the dataset where each missing value is replaced with different reasonable estimates based on the observed data patterns. Traditionally, the number of imputed datasets ranges from 5 to 20, depending on the amount of missing data and the specific analysis requirements. (However, you should plan to do many more than that as a Bayesian.)
Multiple imputation is particularly valuable because it preserves the relationships between variables while acknowledging uncertainty about the true values of missing data. Unlike single imputation methods that can underestimate standard errors and lead to overly confident results, multiple imputation provides more accurate statistical inferences by incorporating the additional uncertainty from not knowing the missing values. This makes it especially useful in research contexts where maintaining statistical power and avoiding bias from missing data are crucial for drawing valid conclusions.
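Rubin’s classic pooling rules make this extra uncertainty concrete: the total variance of a pooled estimate adds a between-imputation component to the average within-imputation variance. A numeric sketch, using made-up estimates from five hypothetical fits:

```r
# Rubin's rules, sketched with hypothetical per-imputation results.
est <- c(0.11, 0.09, 0.12, 0.10, 0.13)       # point estimates, one per imputation
se  <- c(0.040, 0.042, 0.039, 0.041, 0.040)  # their standard errors

m    <- length(est)
qbar <- mean(est)                 # pooled point estimate
ubar <- mean(se^2)                # average within-imputation variance
b    <- var(est)                  # between-imputation variance
total <- ubar + (1 + 1/m) * b     # Rubin's total variance

sqrt(total)  # pooled SE: larger than any single-imputation SE
```

The (1 + 1/m) factor is what a single imputation omits, which is exactly why single-imputation standard errors come out too small.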
Five imputations is probably too few. Depending on the data, you may need as many as 100 imputations ([Zhou & Reiter, 2010](https://www.tandfonline.com/doi/abs/10.1198/tast.2010.09109)).
For each imputation, we have a full dataset. The guesses for each previously-missing value are informed by the other values in the dataset.
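The imputed values printed below appear to come from the mice package (an assumption on my part; the exact method and seed here are illustrative, and `nhanes` is my guess at the dataset given that it matches the 25-row age/bmi/hyp/chl table above):

```r
library(mice)

# Sketch: generate five completed datasets by chained equations.
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

imp$imp                 # the imputed values only, one column per imputation
complete(imp, action = 1)  # the first fully completed dataset
```

`imp$imp` is a list with one data frame per variable, rows indexed by the originally-missing cases, which is the layout of the output that follows.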
$age
[1] 1 2 3 4 5
<0 rows> (or 0-length row.names)
$bmi
1 2 3 4 5
1 29.6 30.1 21.7 22.0 27.2
3 28.7 30.1 22.0 30.1 28.7
4 20.4 21.7 22.5 27.4 20.4
6 22.7 21.7 24.9 25.5 25.5
10 20.4 28.7 27.2 22.5 26.3
11 22.5 28.7 30.1 27.2 26.3
12 25.5 30.1 22.7 22.0 27.5
16 30.1 29.6 25.5 29.6 35.3
21 30.1 29.6 35.3 26.3 27.5
$hyp
1 2 3 4 5
1 1 1 1 1 1
4 1 2 1 2 2
6 2 1 1 2 1
10 1 2 1 1 1
11 1 1 1 1 1
12 2 1 1 1 1
16 1 1 1 1 1
21 1 1 2 1 1
$chl
1 2 3 4 5
1 229 238 187 131 238
4 218 204 131 284 199
10 187 206 186 187 229
11 238 238 238 187 238
12 187 218 131 131 206
15 204 187 238 187 238
16 204 131 238 187 229
20 186 184 184 218 184
21 184 131 284 118 187
24 206 284 186 284 204
Look at how many draws!
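The call that produced the summary below isn’t shown in these notes. A plausible reconstruction (assuming a mice `mids` object named `imp` and the same priors as the earlier models) uses brm_multiple(), which fits the model to each completed dataset and stacks all the chains:

```r
# Hypothetical reconstruction of m5: one fit per imputed dataset, combined.
m5 <- brm_multiple(
  data = imp,          # a mice::mids object with 5 imputations
  family = gaussian,
  bmi ~ age * chl,
  prior = c(prior(normal(0, 10), class = Intercept),
            prior(normal(0, 10), class = b),
            prior(exponential(1), class = sigma)),
  iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 1
)
# 5 imputations x 4 chains = 20 chains, i.e. 30000 post-warmup draws.
```

One caveat worth knowing: the inflated Rhat values in the summary are expected with brm_multiple(), because chains fit to different imputed datasets target slightly different posteriors even when each individual fit converged.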
Family: gaussian
Links: mu = identity; sigma = identity
Formula: bmi ~ age * chl
Data: imp (Number of observations: 25)
Draws: 20 chains, each with iter = 2000; warmup = 500; thin = 1;
total post-warmup draws = 30000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 15.75 7.70 1.44 31.77 1.20 67 324
age 1.00 4.34 -8.29 8.96 1.12 107 291
chl 0.08 0.04 -0.00 0.16 1.25 57 404
age:chl -0.02 0.02 -0.06 0.02 1.14 95 397
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 3.08 0.53 2.18 4.27 1.19 73 322
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```{r, results='asis'}
#| code-fold: true
post5 = as_draws_df(m5) |> mutate(model = "m5")

bind_rows(
  post2 |> extract_coef(),
  post3 |> extract_coef(),
  post4 |> extract_coef(),
  post5 |> extract_coef()
) |>
  mutate(estimate = sprintf("%.2f [%.2f, %.2f]", mean, lower, upper)) |>
  select(model, outcome, parameter, estimate) |>
  pivot_wider(names_from = model, values_from = estimate) |>
  knitr::kable(booktabs = T) |>
  kableExtra::kable_styling()
```
We’ve covered two methods of handling missing data: full Bayesian imputation, in which missing values are treated as unknown parameters and estimated inside the model (the mi() approach), and multiple imputation, in which we generate several completed datasets, fit the model to each, and combine the draws.
These methods approach the problem of missing data from different angles; at their core, the distinction is between causal inference and prediction. Neither is without its flaws. For the first to work, you must have the correct causal model and the right variables observed in your data. The second is more flexible, but there is circularity in the imputation/estimation process. Note how our estimates for model 5 were not dissimilar from our estimates with complete cases only. The question is always: what assumptions are you willing to make, and what do you want to prioritize?