week 10: final countdown

missing data

library(tidyverse)
library(psych)
library(cowplot)
library(patchwork)
library(here)
library(brms) 
library(tidybayes)
library(ggdag)
library(mice) # new package!

missing understandable terms

  • Missing completely at random (MCAR)

    • Pattern of missingness is random.

    • Think: aliens.

  • Missing at random (MAR)

the pattern of missingness on variable A is related to another observed variable, B.

    • Think: raccoons.

  • Missing not at random (MNAR)

    • missingness of variable A is related to the values of variable A.

    • Think: criminal mastermind.
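Before the causal treatment, a quick simulation can make the three mechanisms concrete. This is a minimal sketch with made-up variables: b is always observed, and a goes missing under each mechanism in turn.

set.seed(10)
n <- 1000
sim <- tibble(
  b = rnorm(n),            # always-observed companion variable
  a = rnorm(n, mean = b)   # the variable that will go missing
) |>
  mutate(
    a_mcar = ifelse(rbinom(n, 1, 0.3) == 1, NA, a),        # aliens: pure chance
    a_mar  = ifelse(rbinom(n, 1, plogis(b)) == 1, NA, a),  # raccoons: depends on b
    a_mnar = ifelse(rbinom(n, 1, plogis(a)) == 1, NA, a)   # mastermind: depends on a itself
  )

# complete-case means of a: fine under MCAR, biased under MAR until you
# condition on b, and biased under MNAR in a way the observed data can't reveal
sim |> summarise(across(starts_with("a_"), \(x) mean(x, na.rm = TRUE)))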

missing data as a causal problem

From RM:

The fact that a variable has an unobserved value is still an observation. It is data, just with a very special value. The meaning of this value depends upon the context. Consider for example a questionnaire on personal income. If some people refuse to fill in their income, this may be associated with low (or high) income. Therefore a model that tries to predict the missing values can be enlightening. In ecology, the absence of an observation of a species is a subtle kind of observation. It could mean the species isn’t there. Or it could mean it is there but you didn’t see it. An entire category of models, occupancy models, exists to take this duality into account. Missing values are always produced by some process, and thinking about that process can sometimes solve big problems.

DAG ate my homework

# panel a
dag_coords <-
  tibble(name = c("S", "H", "Hs", "D"),
         x    = c(1, 2, 2, 1),
         y    = c(2, 2, 1, 1))

p1 <-
  dagify(H  ~ S,
         Hs ~ H + D,
         coords = dag_coords) %>%
  tidy_dagitty() %>% 
  mutate(color = ifelse(name == "H", "a", "b")) %>% 
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color=color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S")) +  # relabel nodes (alphabetical order) so Hs prints as H*
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

# panel b
p2 <-
  dagify(H  ~ S,
         Hs ~ H + D,
         D  ~ S,
         coords = dag_coords) %>%
  tidy_dagitty() %>% 
  mutate(color = ifelse(name == "H", "a", "b")) %>% 
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color=color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S")) +  # relabel nodes (alphabetical order) so Hs prints as H*
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

# panel c - coordinates for 5 nodes
dag_coords_c <-
  tibble(name = c("S", "H", "Hs", "D", "X"),
         x    = c(1, 2, 2, 1, 1.5),
         y    = c(2, 2, 1, 1, 1.5))

p3 <-
  dagify(H  ~ S + X,
         Hs ~ H + D,
         D  ~ X,
         coords = dag_coords_c) %>%
  tidy_dagitty() %>% 
  mutate(color = ifelse(name %in% c("H", "X"), "a", "b")) %>% 
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color=color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S", "X")) +  # relabel nodes (alphabetical order) so Hs prints as H*
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

# panel d
p4 <-
  dagify(H  ~ S,
         Hs ~ H + D,
         D  ~ H,
         coords = dag_coords) %>%
  tidy_dagitty() %>% 
  mutate(color = ifelse(name == "H", "a", "b")) %>% 
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(aes(color=color), size = 7, show.legend = F) +
  geom_dag_text(label = c("D", "H", "H*", "S")) +  # relabel nodes (alphabetical order) so Hs prints as H*
  geom_dag_edges() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

(p1 + p2 + p3 + p4) +
  plot_annotation(tag_levels = "a", tag_prefix = "(", tag_suffix = ")") &
  theme(plot.margin = margin(0.15, 0.15, 0.15, 0.15, "in"))
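To see why panel (d) is the nasty one, here's a hypothetical simulation in its spirit: studying (S) improves homework (H), and dogs (D) preferentially eat good homework, so missingness depends on the value of H itself.

set.seed(501)
n  <- 1000
S  <- rnorm(n)
H  <- rbinom(n, size = 10, prob = plogis(S))          # homework quality rises with studying
D  <- rbinom(n, size = 1, prob = plogis(2 * (H - 5))) # good homework attracts dogs
Hs <- ifelse(D == 1, NA, H)                           # H* = homework that survives

coef(lm(H ~ S))   # full data recovers the S -> H effect
coef(lm(Hs ~ S))  # complete cases attenuate it, and nothing observed can fix that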

an example: nhanes

  • age – age group (1 = 20–39, 2 = 40–59, 3 = 60+)
  • bmi – body mass index (kg/m²)
  • hyp – hypertensive (1 = no, 2 = yes)
  • chl – total serum cholesterol (mg/dL)
data(nhanes, package="mice")
d <- nhanes
psych::describe(d, fast=T)
    vars  n   mean    sd median   min   max range  skew kurtosis    se
age    1 25   1.76  0.83   2.00   1.0   3.0   2.0  0.44    -1.47  0.17
bmi    2 16  26.56  4.22  26.75  20.4  35.3  14.9  0.39    -0.83  1.05
hyp    3 17   1.24  0.44   1.00   1.0   2.0   1.0  1.14    -0.73  0.11
chl    4 15 191.40 45.22 187.00 113.0 284.0 171.0 -0.09    -0.43 11.67
d
   age  bmi hyp chl
1    1   NA  NA  NA
2    2 22.7   1 187
3    1   NA   1 187
4    3   NA  NA  NA
5    1 20.4   1 113
6    3   NA  NA 184
7    1 22.5   1 118
8    1 30.1   1 187
9    2 22.0   1 238
10   2   NA  NA  NA
11   1   NA  NA  NA
12   2   NA  NA  NA
13   3 21.7   1 206
14   2 28.7   2 204
15   1 29.6   1  NA
16   1   NA  NA  NA
17   3 27.2   2 284
18   2 26.3   2 199
19   1 35.3   1 218
20   3 25.5   2  NA
21   1   NA  NA  NA
22   1 33.2   1 229
23   1 27.5   1 131
24   3 24.9   1  NA
25   2 27.4   1 186

Let’s say we’re interested in studying the causal effect of cholesterol on BMI.

# Define coordinates for the nodes
dag_coords <- tibble(
  name = c("age", "chl", "bmi"),
  x    = c(1, 2, 3),
  y    = c(2, 2, 1)
)

# Create the DAG
dagify(
  bmi ~ age + chl,
  chl ~ age,
  coords = dag_coords
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 10) +
  geom_dag_text(size = 4) +
  geom_dag_edges(arrow_directed = grid::arrow(length = grid::unit(10, "pt"), type = "closed")) +
  theme_dag() 
First, a complete-case model. brm() silently drops any row with a missing value on the variables in the formula, which here leaves only 13 of the 25 participants.

m1 <- brm(
  data = d,
  family = gaussian,
  bmi ~ age*chl,
  prior = c( prior(normal(0,10), class=Intercept),
             prior(normal(0,10), class=b),
             prior(exponential(1), class=sigma) ),
  iter=2000, warmup=500, chains=4, cores=4, seed=1,
  file = here("files/models/m101.1")           
)
m1
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: bmi ~ age + chl + age:chl 
   Data: d (Number of observations: 13) 
  Draws: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
         total post-warmup draws = 6000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    13.25      7.22    -0.98    27.19 1.00     2512     2969
age          -1.08      4.74   -10.48     8.42 1.00     2589     2700
chl           0.11      0.04     0.04     0.19 1.00     2700     3356
age:chl      -0.02      0.02    -0.06     0.02 1.00     2387     2695

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     2.98      0.61     2.02     4.40 1.00     2447     2805

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Now we return to the mi() function from before. When brms sees mi(), it interprets this to mean that some values of that variable are missing and should be treated as parameters to estimate. We used it in the previous lecture to indicate that the true value of our outcome (divorce rate) was unobserved but could be constrained by the observed value and its standard error. Here, we use it to say that some values are missing entirely.

m2 <- brm(
  data = d,
  family = gaussian,
  bf(bmi | mi() ~ age*chl),
  prior = c( prior(normal(0,10), class=Intercept),
             prior(normal(0,10), class=b),
             prior(exponential(1), class=sigma) ),
  iter=2000, warmup=500, chains=4, cores=4, seed=1,
  save_pars = save_pars(latent=T),
  file = here("files/models/m101.2")           
)

Note that we imputed values for only two of our participants. That's because rows missing cholesterol, a predictor, are still dropped; mi() on the outcome only rescues participants who are missing bmi alone.
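A quick check of who those two are; once the chl-missing rows are dropped, rows 3 and 6 of the original data should be the ones that appear below as Ymi[2] and Ymi[4].

d |>
  mutate(row = row_number()) |>
  filter(is.na(bmi), !is.na(chl))   # bmi missing but predictors observed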

posterior_summary(m2)
                Estimate  Est.Error         Q2.5        Q97.5
b_Intercept  13.66084019 7.37313173  -0.71736117  28.57811312
b_age        -1.36387746 4.84825569 -11.13526859   8.24436780
b_chl         0.11042045 0.03877522   0.03323725   0.18556068
b_age:chl    -0.01813567 0.02157951  -0.06056509   0.02507034
sigma         2.99552652 0.61771493   2.04345882   4.53120985
Intercept    26.10433116 0.88077406  24.28372927  27.81724992
Ymi[2]       29.54989319 3.27712282  22.97137909  36.06398022
Ymi[4]       19.83850975 4.01928004  11.91997753  27.91179381
lprior      -19.41956131 0.69864255 -21.07188515 -18.36397422
lp__        -56.81789108 2.29046184 -62.50340221 -53.53969572
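Those Ymi rows are full posterior distributions over the missing BMIs, not point guesses. A minimal sketch to see that, using tidybayes (loaded above):

as_draws_df(m2) |>
  select(starts_with("Ymi")) |>
  pivot_longer(everything(), names_to = "case", values_to = "bmi") |>
  ggplot(aes(x = bmi, y = case)) +
  stat_halfeye()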

But we can estimate those missing cholesterol values as well. Now that we have multiple outcomes, we have to specify which outcome (resp) each prior belongs to.

m3 <- brm(
  data = d,
  family = gaussian,
  bf(bmi | mi() ~ age * mi(chl)) + 
  bf(chl | mi() ~ 1) + 
  set_rescor(FALSE),
  prior = c( prior(normal(20,10), class=Intercept, resp=bmi, lb=0),
             prior(normal(100,50), class=Intercept, resp=chl, lb=0),
             prior(normal(0,10), class=b),
             prior(exponential(1), class=sigma, resp=bmi),
             prior(exponential(1), class=sigma, resp=chl) ),
  iter=2000, warmup=500, chains=4, cores=4, seed=1,
  save_pars = save_pars(latent=T),
  file = here("files/models/m101.3b")           
)
m3
 Family: MV(gaussian, gaussian) 
  Links: mu = identity; sigma = identity
         mu = identity; sigma = identity 
Formula: bmi | mi() ~ age * mi(chl) 
         chl | mi() ~ 1 
   Data: d (Number of observations: 25) 
  Draws: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
         total post-warmup draws = 6000

Regression Coefficients:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
bmi_Intercept    11.00      7.72    -4.16    26.06 1.00     2328     3022
chl_Intercept   190.28      6.77   176.68   203.56 1.00     3805     3992
bmi_age           2.82      4.70    -6.75    11.86 1.00     2142     3047
bmi_michl         0.11      0.04     0.04     0.19 1.00     2207     3462
bmi_michl:age    -0.03      0.02    -0.07     0.01 1.00     2110     3022

Further Distributional Parameters:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_bmi     2.96      0.57     2.07     4.26 1.00     2514     3853
sigma_chl    26.95      2.61    22.40    32.48 1.00     4736     4363

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Hooray! Now we’ve not only imputed values for cholesterol, but we’ve also imputed more values for BMI.

posterior_summary(m3)
                       Estimate   Est.Error          Q2.5         Q97.5
b_bmi_Intercept     11.00292211  7.71766893   -4.15889940   26.05787440
b_chl_Intercept    190.27580054  6.77316464  176.67764330  203.55783373
b_bmi_age            2.81947981  4.70402189   -6.75040021   11.85982329
bsp_bmi_michl        0.11419749  0.04040533    0.03695908    0.19269591
bsp_bmi_michl:age   -0.03154068  0.02227456   -0.07465667    0.01281869
sigma_bmi            2.96167622  0.57206550    2.07438660    4.25900928
sigma_chl           26.95003583  2.60940066   22.39710883   32.48262761
Intercept_bmi       15.96520658  3.89977587    8.24836695   23.76109013
Intercept_chl      190.27580054  6.77316464  176.67764330  203.55783373
Ymi_bmi[1]          29.52505935  4.03300188   21.59564129   37.51233733
Ymi_bmi[3]          29.21871415  3.24683113   22.95113041   35.78827551
Ymi_bmi[4]          23.17133992  3.62520077   15.75464866   30.27312629
Ymi_bmi[6]          23.04989623  3.49206657   16.27517841   29.80444468
Ymi_bmi[10]         26.36546858  3.50293440   19.38320450   33.25983865
Ymi_bmi[11]         29.63427529  3.98270171   21.81064930   37.46774725
Ymi_bmi[12]         26.38061879  3.49128305   19.44938351   33.41758240
Ymi_bmi[16]         29.44956970  3.98224787   21.78025029   37.23174419
Ymi_bmi[21]         29.61676627  4.09280209   21.53134499   37.81530406
Ymi_chl[1]         189.86705130 27.99295690  134.89401657  245.19030981
Ymi_chl[4]         190.18782593 27.73979217  134.39094208  244.45507836
Ymi_chl[10]        189.90638441 27.89791446  134.49973729  245.09103336
Ymi_chl[11]        191.36232236 27.51154040  137.43193901  245.90780320
Ymi_chl[12]        190.59849711 28.07768981  135.37395776  246.73913511
Ymi_chl[15]        190.32240331 22.37205761  147.41314647  235.59196919
Ymi_chl[16]        189.78630428 27.96298792  136.15320332  245.76551613
Ymi_chl[20]        195.81275676 27.45515789  140.77752722  249.78113679
Ymi_chl[21]        190.75896739 27.51943382  136.77215916  244.18524529
Ymi_chl[24]        194.75081534 27.11424947  140.50579698  248.38187913
lprior             -39.71471543  2.69032134  -45.50902046  -34.92718490
lp__              -221.17872027  4.52524758 -231.03823697 -213.61918386

And once more, we can improve our model, because we can use information from age to better impute cholesterol.

m4 <- brm(
  data = d,
  family = gaussian,
  bf(bmi | mi() ~ age * mi(chl)) + 
  bf(chl | mi() ~ age) + 
  set_rescor(FALSE),
  prior = c( prior(normal(20,10), class=Intercept, resp=bmi, lb=0),
             prior(normal(100,50), class=Intercept, resp=chl, lb=0),
             prior(normal(0,10), class=b),
             prior(exponential(1), class=sigma, resp=bmi),
             prior(exponential(1), class=sigma, resp=chl) ),
  iter=2000, warmup=500, chains=4, cores=4, seed=1,
  save_pars = save_pars(latent=T),
  file = here("files/models/m101.4b")           
)
post3 = as_draws_df(m3) |> mutate(model="m3")
post4 = as_draws_df(m4) |> mutate(model="m4")
# Extract imputed cholesterol values
post3_chl = post3 |> 
  select(starts_with("Ymi_chl")) |> 
  pivot_longer(everything(), names_to = "imputation", values_to = "value") |>
  mutate(model = "m3")

post4_chl = post4 |> 
  select(starts_with("Ymi_chl")) |> 
  pivot_longer(everything(), names_to = "imputation", values_to = "value") |>
  mutate(model = "m4")

# Combine the data
combined_chl = bind_rows(post3_chl, post4_chl) |> 
  with_groups(c(model, imputation), summarise, value=mean(value)) |> 
  mutate(row = as.numeric(str_extract(imputation, "[0-9]{1,2}")))

ages = tibble(row = 1:nrow(d), age = d$age)

combined_chl |> 
  inner_join(ages, by = "row") |> 
  ggplot( aes(x=age, y=value, color=model)) +
  geom_smooth(se=F) +
  geom_jitter() +
  scale_color_manual(values = c("m3" = "#5e8485", "m4" = "#0f393a")) +
  theme(legend.position = "bottom")

Much like when we incorporated measurement error, we may find that our coefficient estimates change.

post2 = as_draws_df(m2) |> mutate(model="m2")

extract_coef = function(post){
  post |> 
    select(starts_with("b"), model) |> 
    pivot_longer(-model, 
                 names_to = c("prefix", "outcome", "parameter"), 
                 names_sep = "_",
                 values_to = "value") |>
    mutate(
      # single-outcome models (m2) lack the outcome infix, so shift names over
      parameter = ifelse(is.na(parameter), outcome, parameter),
      outcome   = ifelse(outcome == parameter, "bmi", outcome),
      # strip brms's "mi" prefix and harmonize the interaction label
      parameter = str_remove(parameter, "mi"),
      parameter = ifelse(parameter == "age:chl", "chl:age", parameter)
    ) |> 
    group_by(model, outcome, parameter) |> 
    summarise( 
      mean  = mean(value), 
      lower = quantile(value, 0.025), 
      upper = quantile(value, 0.975), 
      .groups = "drop" )
}

bind_rows( 
  post2 |> extract_coef(),
  post3 |> extract_coef(), 
  post4 |> extract_coef()
  ) |> 
  mutate( 
    estimate = sprintf("%.2f [%.2f, %.2f]", mean, lower, upper) ) |> 
  select(model, outcome, parameter, estimate) |> 
  pivot_wider(names_from = model, values_from = estimate) |> 
  knitr::kable(booktabs=T) |> 
  kableExtra::kable_styling()
outcome  parameter  m2                    m3                       m4
bmi      Intercept  13.66 [-0.72, 28.58]  11.00 [-4.16, 26.06]     13.87 [-1.52, 29.07]
bmi      age        -1.36 [-11.14, 8.24]  2.82 [-6.75, 11.86]      -0.01 [-9.76, 10.08]
bmi      chl        0.11 [0.03, 0.19]     0.11 [0.04, 0.19]        0.11 [0.03, 0.18]
bmi      chl:age    -0.02 [-0.06, 0.03]   -0.03 [-0.07, 0.01]      -0.02 [-0.07, 0.02]
chl      Intercept  NA                    190.28 [176.68, 203.56]  140.18 [110.21, 170.09]
chl      age        NA                    NA                       29.27 [13.31, 44.94]

multiple imputation

Multiple imputation is a statistical technique for handling missing data that creates several complete datasets by filling in missing values with plausible estimates. Rather than simply deleting cases with missing data or using a single imputation method, this approach generates multiple versions of the dataset where each missing value is replaced with different reasonable estimates based on the observed data patterns. Traditionally, the number of imputed datasets ranges from 5 to 20, depending on the amount of missing data and the specific analysis requirements. (However, you should plan to do many more than that as a Bayesian.)

Multiple imputation is particularly valuable because it preserves the relationships between variables while acknowledging uncertainty about the true values of missing data. Unlike single imputation methods that can underestimate standard errors and lead to overly confident results, multiple imputation provides more accurate statistical inferences by incorporating the additional uncertainty from not knowing the missing values. This makes it especially useful in research contexts where maintaining statistical power and avoiding bias from missing data are crucial for drawing valid conclusions.

steps of MI

  1. Imputation: Create multiple complete datasets by filling in missing values with plausible estimates.
  2. Analysis: Run the analysis on each of the imputed datasets.
  3. Pooling: Combine the results from each analysis to obtain a single set of estimates and their uncertainty (a minimal sketch of all three steps follows this list).
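The classic, non-Bayesian route through these three steps uses mice's own helpers, fitting a simple model to each completed dataset and pooling with Rubin's rules. A minimal sketch (lm() purely for illustration; imp_sketch is a throwaway name):

imp_sketch <- mice(nhanes, m = 5, print = FALSE)  # 1. impute
fits <- with(imp_sketch, lm(bmi ~ age * chl))     # 2. analyze each dataset
summary(pool(fits))                               # 3. pool via Rubin's rules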

Five imputations is probably too few. Depending on the data, you may need as many as 100 imputations (Zhou & Reiter, 2010, https://www.tandfonline.com/doi/abs/10.1198/tast.2010.09109).

# number of imputations
m <- 5

imp <- mice(nhanes, m = m, print=F)
class(imp)
[1] "mids"

For each imputation, we have a full dataset. The guesses for each previously-missing value are informed by the other values in the dataset.

imp$imp
$age
[1] 1 2 3 4 5
<0 rows> (or 0-length row.names)

$bmi
      1    2    3    4    5
1  29.6 30.1 21.7 22.0 27.2
3  28.7 30.1 22.0 30.1 28.7
4  20.4 21.7 22.5 27.4 20.4
6  22.7 21.7 24.9 25.5 25.5
10 20.4 28.7 27.2 22.5 26.3
11 22.5 28.7 30.1 27.2 26.3
12 25.5 30.1 22.7 22.0 27.5
16 30.1 29.6 25.5 29.6 35.3
21 30.1 29.6 35.3 26.3 27.5

$hyp
   1 2 3 4 5
1  1 1 1 1 1
4  1 2 1 2 2
6  2 1 1 2 1
10 1 2 1 1 1
11 1 1 1 1 1
12 2 1 1 1 1
16 1 1 1 1 1
21 1 1 2 1 1

$chl
     1   2   3   4   5
1  229 238 187 131 238
4  218 204 131 284 199
10 187 206 186 187 229
11 238 238 238 187 238
12 187 218 131 131 206
15 204 187 238 187 238
16 204 131 238 187 229
20 186 184 184 218 184
21 184 131 284 118 187
24 206 284 186 284 204
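Any one of the completed datasets can be extracted with complete(); for example, the first:

mice::complete(imp, action = 1) |> head()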
m5 <- 
  brm_multiple(
    family = gaussian,
    data = imp,
    bmi ~ age*chl,
    prior = c( prior(normal(0,10), class=Intercept),
               prior(normal(0,10), class=b),
               prior(exponential(1), class=sigma) ),
    iter=2000, warmup=500, chains=4, cores=4, seed=1,
    file = here("files/models/m101.5")
  )

Look at how many draws!

m5
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: bmi ~ age * chl 
   Data: imp (Number of observations: 25) 
  Draws: 20 chains, each with iter = 2000; warmup = 500; thin = 1;
         total post-warmup draws = 30000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    15.75      7.70     1.44    31.77 1.20       67      324
age           1.00      4.34    -8.29     8.96 1.12      107      291
chl           0.08      0.04    -0.00     0.16 1.25       57      404
age:chl      -0.02      0.02    -0.06     0.02 1.14       95      397

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     3.08      0.53     2.18     4.27 1.19       73      322

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
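Don't panic at those Rhat values: brm_multiple() stacks chains that were fit to different imputed datasets, so between-dataset variability inflates the pooled diagnostics. The within-dataset Rhats are stored on the fitted object and should look ordinary:

round(m5$rhats, 2)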

post5 = as_draws_df(m5) |> mutate(model="m5")

bind_rows( 
  post2 |> extract_coef(), 
  post3 |> extract_coef(), 
  post4 |> extract_coef(),
  post5 |> extract_coef()
  ) |> 
  mutate( estimate = sprintf("%.2f [%.2f, %.2f]", mean, lower, upper) ) |> 
  select(model, outcome, parameter, estimate) |> 
  pivot_wider(names_from = model, values_from = estimate) |> 
  knitr::kable(booktabs=T) |> 
  kableExtra::kable_styling()

We’ve covered two methods of handling missing data:

  1. Causal inference. We can estimate the values of the missing data using the causal model and observed data.
  2. Multiple imputation. We can create multiple complete datasets by filling in missing values with plausible estimates.

These methods approach the problem of missing data from different angles. At their core, the distinction is between causal inference and prediction. Neither is without flaws. For the first to work, you must have the correct causal model and the right variables observed in your data. The second is more flexible, but there's circularity in the imputation/estimation process. Note how our estimates for model 5 were not dissimilar from our estimates with complete cases only. The question is always: what assumptions are you willing to make, and what do you want to prioritize?