library(tidyverse)
library(cowplot)
library(brms)
library(tidybayes)
library(patchwork)
Start at treatment (X)
Look for any arrows coming INTO X
Follow all possible paths to outcome (Y)
A valid adjustment set blocks all backdoor paths
But be careful not to control for colliders!
Paths:
Exercise → Health (direct/front-door) Exercise ← Motivation → Health (backdoor) Exercise ← Motivation → Diet ← Age → Health (backdoor)
Valid adjustment sets:
Copy the code below.
Run the base simulation and observe results
Modify the simulation parameters:
set.seed(9)
#number of sims
N = 1000
# Generate data
U <- rnorm(N) # Unobserved confounder
X <- rnorm(N, mean = 0.5 * U) # Treatment affected by U
Y <- rnorm(N, mean = 0.8 * U) # Outcome affected by U
Z <- rnorm(N, mean = 0.6 * U) # Observed variable that captures U
d <- data.frame(X, Y, Z)
# Fit models
flist1 <- alist(
Y ~ dnorm(mu, sigma),
mu <- a + bX*X,
a ~ dnorm(0, .5),
bX ~ dnorm(0, .25),
sigma ~ dexp(1)
)
m1 <- brm(
data=d,
family=gaussian,
Y ~ 1 + X,
prior = c( prior(normal(0, .50), class=Intercept),
prior(normal(0, .25), class=b),
prior(exponential(1), class=sigma)),
iter=2000, warmup=1000, seed=3, chains=1
)
posterior_summary(m1)
m2 <- brm(
data=d,
family=gaussian,
Y ~ 1 + X + Z,
prior = c( prior(normal(0, .50), class=Intercept),
prior(normal(0, .25), class=b),
prior(exponential(1), class=sigma)),
iter=2000, warmup=1000, seed=3, chains=1
)
posterior_summary(m2)
post.1 <- as_draws_df(m1)
post.2 <- as_draws_df(m2)
results_df = data.frame(naive = post.1$b_X,
adjusted = post.2$b_X)
results_df %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, fill = name)) +
geom_density(alpha = .5) +
geom_vline(aes(xintercept = 0), linetype = "dashed")
“Bad controls” can create bias in three main ways:
Warning signs of bad controls:
Use this code to simulate new variables:
n = 100
# Z affects X but is not a confounder
Z <- rnorm(n)
X <- rnorm(n, mean = Z)
Y <- rnorm(n, mean = X) # True effect of X on Y is 1
Using different sample sizes (n = 50, 100, 1000), test two models exploring the relationship between X (exposure) and Y (outcome). For each sample size, compare:
How does sample size affect the impact of the precision parasite (Z)? Under what conditions is the precision loss most severe?
Use this code to simulate new variables:
n = 100
conf_strength = 1
# U is unmeasured confounder
U <- rnorm(n)
Z <- rnorm(n)
X <- rnorm(n, mean = Z + conf_strength * U)
Y <- rnorm(n, mean = conf_strength * U) # No true effect of X
Using different confounder strengths (0.5, 1, 2), test two models exploring the relationship between X (exposure) and Y (outcome). For each sample size, compare:
Questions: * What happens to the bias when you control for Z? * How does the strength of the confounding affect the amount of bias amplification? * Can you explain why this happens using the DAG?
Use the simulation code provided in the last two exercises to create a new scenario with both a precision parasite variable (Z1) and a bias amplification variable (Z2).
Questions:
What happens to our estimates when we control for both variables?
Is it better to:
How can we use DAGs to decide which controls to include?
The table 2 fallacy, first described by Westreich and Greenland in 2013, refers to a common misinterpretation in epidemiology and statistics when researchers present multiple adjusted effect estimates in a single table (often “Table 2” in academic papers).
The fallacy occurs when researchers interpret all coefficients in a multiple regression model as total effects, when in fact some are direct effects conditional on the other variables in the model. This can lead to incorrect causal interpretations, particularly when some variables are mediators (lying on the causal pathway between exposure and outcome).
For example, imagine studying how education affects income, with job type as a mediator:
If you include both education and job type in the same regression model, the coefficient for education represents only its direct effect on income (not mediated through job type), not its total effect. However, researchers often mistakenly interpret it as the total effect.
This fallacy becomes particularly problematic when:
To avoid this fallacy, researchers should:
As you know, the covariates in a statistical analysis can have a variety of different roles from a causal inference perspective: they can be mediators, confounders, proxy confounders, or competing exposures. If a suitable set of covariates can be identified that removes confounding, we may proceed to estimate our causal effect using a multivariable regression model. In linear regression models, there are only two types of variables: the dependent variable (DV) and independent variables (IVs, or predictors). No further distinction is made between the IVs – specifically, the exposure is by no means a “special” IV and is treated just like any other IV. Thus, as you can see, there is a conceptual mismatch between causal theory (DAG) that leads us to formulate a multivariable regression model (that highlights the exposure-outcome relationship and associated statistical adjustment for confounding) and the regression model itself. This conceptual mismatch can easily lead to misinterpretation of the results from a multivariable regression model.
One particularly widespread misconception is known as mutual adjustment, recently called the Table 2 fallacy since the first table in most epidemiological articles usually describes the study data, and the second table reports the results of a multivariable regression model where the erroneous efforts to illustrate mutual adjustment often appear. To illustrate the fallacy, let us assume that we estimate the effect of X on Y. We know (e.g. from a DAG) that there is only one confounder, Z, so we run the regression Y~X+Z. If our background knowledge and the statistical assumptions of the regression (e.g. normality) hold, then the coefficient of X estimates the causal effect of X on Y. The ‘Table 2 fallacy’ is the belief that we can also interpret the coefficient of Z as the effect of Z on Y; indeed, in larger models, the fallacy is the belief that all coefficients have a similar interpretation with respect to Y.
Visit https://dagitty.net/learn/graphs/table2-fallacy.html) and scroll down to the first Test Your Knowledge section. Try this exercise as many times as you need to get the correct answer twice in a row.
Draw out the DAG model for this research question using dagitty.net. (You can model “personality” instead of C, N, and their interaction.)
Assuming there are no unobserved confounds, which of these coefficients are total effects and which are direct effects?
If personality is the primary exposure variable, are there any covariates in here that should not be included?
Are there any unobserved confounds?