generalized linear madness
Today’s class is less of a coherent topic and more of a plea to think about your data. Complex statistics are never a substitute for understanding the relationships between your constructs and drawing out the causal model.
It’s quite easy to learn about a model or technique and look for a reason to throw it at data. (I’m guilty of that. I’m probably doing that right now.) A problem with that approach is that we typically start with the model or framework and bend the data to fit that, or at least, start by assuming the assumptions of the model. But the best models are the ones that come from the question and theory. We should start with the assumptions of our questions and theory, then find a model that fits with those.
McElreath provides several examples in this lecture/chapter about how to start from the theory and work towards a model that actually makes sense, instead of throwing a standard model at a problem. Along the way, he uses Stan code, so it’s as good a time as any to look at that. We’ll use two ways to integrate Stan with brms.
Recall the developmental psychology experiment from the lecture. There are 629 children in this dataset, between the ages of 4 and 14. The outcome y here takes the values 1, 2, and 3. It indicates which of the three options was chosen, where 1 indicates the unchosen color, 2 indicates the majority demonstrated color, and 3 indicates the minority demonstrated color.
mean sd 5.5% 94.5% histogram
y 2.1208267 0.7279860 1 3 ▃▁▁▁▇▁▁▁▁▅
gender 1.5055644 0.5003669 1 2 ▇▁▁▁▁▁▁▁▁▇
age 8.0302067 2.4979055 5 13 ▇▃▅▃▃▃▂▂▂▁
majority_first 0.4848967 0.5001696 0 1 ▇▁▁▁▁▁▁▁▁▇
culture 3.7519873 1.9603189 1 8 ▃▂▁▇▁▂▁▂▁▂▁▁▁▁
# wrangle
d %>%
mutate(color = factor(y,
levels = 1:3,
labels = c("unchosen", "majority", "minority"))) %>%
mutate(y = str_c(y, " (", color, ")")) %>%
count(y) %>%
mutate(percent = (100 * n / sum(n))) %>%
mutate(label = str_c(round(percent, digits = 1), "%")) %>%
# plot
ggplot(aes(y = y)) +
geom_col(aes(x = n)) +
geom_text(aes(x = n + 4, label = label), hjust = 0) +
scale_x_continuous(expression(italic(n)), expand = expansion(mult = c(0, 0.1))) +
labs(title = "The criterion variable y indexes three kinds of color choices",
subtitle = "Children tended to prefer the 'majority' color.",
y = NULL) +
theme(axis.text.y = element_text(hjust = 0),
      axis.ticks.y = element_blank())

Consider a group of children in which half of them choose at random and the other half follow the majority. If we simulate choices for these children, we can figure out how often we might see the “2” choice, the one that indicates the majority color.
set.seed(5)
N <- 30 # number of children
# half are random: sample from 1,2,3 at random for each
y1 <- sample( 1:3 , size=N/2 , replace=TRUE )
# half follow majority
y2 <- rep( 2 , N/2 )
# combine and shuffle y1 and y2
y <- sample( c(y1,y2) )
# count the 2s
sum(y==2)/N
[1] 0.6333333
About two-thirds of the choices are for the majority color, but only half the children are actually following the majority.
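We can confirm this with a quick back-of-the-envelope calculation: half the children always choose the majority color, and the random half choose it one-third of the time.

```r
# expected proportion of majority ("2") choices:
# half always choose it, half choose it 1/3 of the time
0.5 * 1 + 0.5 * (1/3)
#> [1] 0.6666667
```

So even when only half the sample follows the majority, two-thirds of the choices land on the majority color.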
We’ll consider 5 different strategies children might use.
- Follow the Majority: Copy the majority demonstrated color.
- Follow the Minority: Copy the minority demonstrated color.
- Maverick: Choose the color that no demonstrator chose.
- Random: Choose a color at random, ignoring the demonstrators.
- Follow First: Copy the color that was demonstrated first. This was either the majority color (when majority_first equals 1) or the minority color (when 0).
An important step here is enumerating the likelihood of each choice (1, 2, 3) for each strategy.
| Strategy | Choice 1 | Choice 2 | Choice 3 |
|---|---|---|---|
| Majority | 0 | 1 | 0 |
| Minority | 0 | 0 | 1 |
| Maverick | 1 | 0 | 0 |
| Random | 1/3 | 1/3 | 1/3 |
| Follow First | 0* | 1 or 0** | 0 or 1** |
Notes:
- \* The Follow First strategy never chooses option 1, so its probability for Choice 1 is always 0.
- \*\* The Follow First strategy depends on majority_first[i]: if majority_first[i] == 1, then Choice 2 = 1 and Choice 3 = 0; otherwise Choice 2 = 0 and Choice 3 = 1.
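To make the table concrete, here is one way to express it as a small R function (the function name is just for illustration); each row gives $\Pr(y \mid s)$ for one strategy:

```r
# probability of each choice (1, 2, 3) under each of the five strategies;
# follow-first depends on whether the majority color was demonstrated first
strategy_probs <- function(majority_first) {
  rbind(
    majority     = c(0, 1, 0),
    minority     = c(0, 0, 1),
    maverick     = c(1, 0, 0),
    random       = c(1/3, 1/3, 1/3),
    follow_first = if (majority_first == 1) c(0, 1, 0) else c(0, 0, 1)
  )
}
strategy_probs(majority_first = 1)
```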
We can’t directly measure each child’s strategy. It is an unobserved variable. But each strategy has a specific probability of producing each choice. We can use that fact to compute the probability of each choice, given parameters which specify the probability of each strategy. Then we let Bayes loose and get the posterior distribution of each strategy back. Before we can let Bayes loose, we’ll need to enumerate the parameters, assign priors to each, and also figure out some technical issues for coding. For the prior on the strategy proportions, we’ll use a Dirichlet distribution, which we’ve used before when studying categorical outcomes.
$$p \sim \text{Dirichlet}([4, 4, 4, 4, 4])$$
library(gtools) # for rdirichlet function
set.seed(123)
# Draw samples from both distributions
samples_high <- rdirichlet(30, c(4,4,4,4,4))
samples_low <- rdirichlet(30, c(2,2,2,2,2))
# Convert to data frames and add identifiers
df_high <- data.frame(samples_high) %>%
mutate(sample_id = 1:30,
distribution = "Dirichlet([4,4,4,4,4])") %>%
setNames(c(paste0("X", 1:5), "sample_id", "distribution"))
df_low <- data.frame(samples_low) %>%
mutate(sample_id = 1:30,
distribution = "Dirichlet([2,2,2,2,2])") %>%
setNames(c(paste0("X", 1:5), "sample_id", "distribution"))
# Combine data frames
df_combined <- rbind(df_high, df_low)
# Reshape to long format for plotting
df_long <- df_combined %>%
pivot_longer(cols = X1:X5,
names_to = "component",
values_to = "probability") %>%
mutate(component = as.numeric(gsub("X", "", component)))
# Create the plot
ggplot(df_long, aes(x = component, y = probability, group = sample_id)) +
geom_line(alpha = 0.6, color = "steelblue") +
geom_hline(yintercept = 0.2, linetype = "dashed", color = "red", alpha = 0.7) +
facet_wrap(~ distribution, ncol = 2) +
labs(title = "Comparison of Dirichlet Distribution Samples",
subtitle = "30 samples from each distribution (red line shows uniform expectation)",
x = "Component",
y = "Probability") +
scale_x_continuous(breaks = 1:5) +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
strip.text = element_text(size = 12, face = "bold"))

Great, now we need to express the likelihood. For each observed choice $y_i$, each strategy $s$ implies a probability of seeing $y_i$.
$$\Pr(y_i) = \sum_{s=1}^{5} p_s \Pr(y_i \mid s)$$
Read this as: the probability of $y_i$ is the weighted average of the probabilities of $y_i$ conditional on each strategy $s$. This expression is a mixture, as in earlier chapters. Sometimes you’ll read that this marginalizes out the unknown strategy. This just means averaging over the strategies, using some probability of each strategy to get its weight in the average. Above, the values in $p$ provide these weights.
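As a quick numeric illustration with hypothetical, equal strategy weights (and assuming the majority color was demonstrated first), the mixture works like this:

```r
# weights on the five strategies: majority, minority, maverick, random, follow-first
p <- rep(0.2, 5)

# Pr(y | s) for each strategy (rows) and each choice 1, 2, 3 (columns),
# assuming the majority color was demonstrated first
Pr_y_s <- rbind(c(0,   1,   0),
                c(0,   0,   1),
                c(1,   0,   0),
                c(1/3, 1/3, 1/3),
                c(0,   1,   0))

# weighted average over strategies = probability of each choice
as.vector(p %*% Pr_y_s)
#> [1] 0.2666667 0.4666667 0.2666667
```

Even with equal strategy weights, the majority choice (“2”) comes out most probable, because two of the five strategies point at it.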
We can put all these pieces together in a (very simple) statistical model.
$$
\begin{aligned}
y_i &\sim \text{Categorical}(\theta) \\
\theta &= \sum_{s=1}^{5} p_s \Pr(y_i \mid s) \\
p &\sim \text{Dirichlet}([4, 4, 4, 4, 4])
\end{aligned}
$$
On to the Stan code…
First, we list the data. Recall our outcome is y, and we’re using the variable majority_first to indicate the order in which children saw the two groups of peers. We also declare the sample size N, which Stan uses to index the loop over observations.
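Here is the data block (it also appears in the full listing below):

```stan
data{
    int N;
    array[N] int y;
    array[N] int majority_first;
}
```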
Next, we list our parameters. In this case, we’re trying to estimate the proportion of children who use each strategy:
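The parameters block declares p as a simplex, so the five proportions are constrained to be non-negative and sum to 1:

```stan
parameters{
    simplex[5] p;
}
```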
Now for the model. This bit is chunky.
model{
vector[5] phi;
// prior
p ~ dirichlet( rep_vector(4,5) );
// probability of data
for ( i in 1:N ) {
if ( y[i]==2 ) phi[1]=1; else phi[1]=0; // majority
if ( y[i]==3 ) phi[2]=1; else phi[2]=0; // minority
if ( y[i]==1 ) phi[3]=1; else phi[3]=0; // maverick
phi[4]=1.0/3.0; // random
if ( majority_first[i]==1 ) // follow first
if ( y[i]==2 ) phi[5]=1; else phi[5]=0;
else
if ( y[i]==3 ) phi[5]=1; else phi[5]=0;
// compute log( p_s * Pr(y_i|s) )
for ( j in 1:5 ) phi[j] = log(p[j]) + log(phi[j]);
// compute average log-probability of y_i
target += log_sum_exp( phi );
}
}

To run this code in R, add all of it to a character string. The function cat() allows you to see the string with the formatting (line breaks, etc.).
boxes_stan = "
data{
int N;
array[N] int y;
array[N] int majority_first;
}
parameters{
simplex[5] p;
}
model{
vector[5] phi;
// prior
p ~ dirichlet( rep_vector(4,5) );
// probability of data
for ( i in 1:N ) {
if ( y[i]==2 ) phi[1]=1; else phi[1]=0; // majority
if ( y[i]==3 ) phi[2]=1; else phi[2]=0; // minority
if ( y[i]==1 ) phi[3]=1; else phi[3]=0; // maverick
phi[4]=1.0/3.0; // random
if ( majority_first[i]==1 ) // follow first
if ( y[i]==2 ) phi[5]=1; else phi[5]=0;
else
if ( y[i]==3 ) phi[5]=1; else phi[5]=0;
// compute log( p_s * Pr(y_i|s) )
for ( j in 1:5 ) phi[j] = log(p[j]) + log(phi[j]);
// compute average log-probability of y_i
target += log_sum_exp( phi );
}
}
"
cat(boxes_stan)
data{
int N;
array[N] int y;
array[N] int majority_first;
}
parameters{
simplex[5] p;
}
model{
vector[5] phi;
// prior
p ~ dirichlet( rep_vector(4,5) );
// probability of data
for ( i in 1:N ) {
if ( y[i]==2 ) phi[1]=1; else phi[1]=0; // majority
if ( y[i]==3 ) phi[2]=1; else phi[2]=0; // minority
if ( y[i]==1 ) phi[3]=1; else phi[3]=0; // maverick
phi[4]=1.0/3.0; // random
if ( majority_first[i]==1 ) // follow first
if ( y[i]==2 ) phi[5]=1; else phi[5]=0;
else
if ( y[i]==3 ) phi[5]=1; else phi[5]=0;
// compute log( p_s * Pr(y_i|s) )
for ( j in 1:5 ) phi[j] = log(p[j]) + log(phi[j]);
// compute average log-probability of y_i
target += log_sum_exp( phi );
}
}
Let’s integrate this with brms so we can make use of some of the helper functions.
First, we’ll create an empty brmsfit object.
Next, we prepare the data for Stan:
The function stan_model will convert our character string to something that Stan can recognize.
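Here is a sketch of that workflow with rstan (object names are illustrative, and I’m assuming the boxes data live in a data frame called d with columns y and majority_first):

```r
library(rstan)

# the data as a named list, matching the Stan data block
stan_data <- list(N              = nrow(d),
                  y              = d$y,
                  majority_first = d$majority_first)

# compile the model from the character string
boxes_mod <- stan_model(model_code = boxes_stan)

# sample from the posterior
fit <- sampling(boxes_mod, data = stan_data, chains = 4, cores = 4, seed = 16)
```

From there, you can drop the result into the empty brmsfit from above (e.g., boxes_empty$fit <- fit, followed by brms::rename_pars(boxes_empty)), which is what lets the brms helper functions find the draws.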
And that’s it! You can now use (some) of the usual functions to evaluate your model.
Note that some functions will work, but others will not.
Estimate Est.Error Q2.5 Q97.5
p[1] 0.2571846 0.03657030 0.18430188 0.3266894
p[2] 0.1379903 0.03226541 0.07350504 0.1988042
p[3] 0.1474092 0.03055084 0.08339295 0.2033196
p[4] 0.1981705 0.08014053 0.06273866 0.3693370
p[5] 0.2592454 0.03187899 0.19698956 0.3205356
lp__ -667.1548403 1.50296677 -670.99662701 -665.3105596
We’ve already discussed the phenomenon of time, and we included some models that regressed an outcome (like emotion) on the most recent emotion from the same individual. This is an example of AUTOREGRESSION. In an autoregressive model, the value of the outcome in the previous time step is called a lag variable and added as a predictor to the right side of a linear model.
$$\text{Positive Emotion}_{\text{today}} = \alpha + \beta \, (\text{Positive Emotion}_{\text{yesterday}})$$
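A minimal version of that lag model in R, with simulated data (purely illustrative):

```r
library(dplyr)

set.seed(1)
# simulate 100 days of an AR(1) "emotion" series with true coefficient 0.5
d <- tibble(emotion = as.numeric(arima.sim(list(ar = 0.5), n = 100)))

# add yesterday's value as a lagged predictor, then regress on it
d <- mutate(d, emotion_lag = lag(emotion))
coef(lm(emotion ~ emotion_lag, data = d))
```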
These are extremely common, especially in the social sciences. And, they have some problems.
- It’s not really causal.
- If the states are measured with error (they probably are), then the model propagates that error.
- There is no biological, economic, or physical interpretation of the parameters.
What should we do instead? Ordinary Differential Equations (ODE)
Rows: 21
Columns: 3
$ Year <int> 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910,…
$ Lynx <dbl> 4.0, 6.1, 9.8, 35.2, 59.4, 41.7, 19.0, 13.0, 8.3, 9.1, 7.4, 8.0, …
$ Hare <dbl> 30.0, 47.2, 70.2, 77.4, 36.3, 20.6, 18.1, 21.4, 22.0, 25.4, 27.1,…
Based on example by Bob Carpenter (2018) (worth reading!).
# for annotation
text <-
tibble(name = c("Hare", "Lynx"),
label = c("Lepus", "Lynx"),
Year = c(1913.5, 1915.5),
value = c(78, 52))
# wrangle
Lynx_Hare %>%
pivot_longer(-Year) %>%
# plot!
ggplot(aes(x = Year, y = value)) +
geom_line(aes(color = name),
linewidth = 1/4) +
geom_point(aes(fill = name),
size = 3, shape = 21, color = "white") +
geom_text(data = text,
aes(label = label, color = name),
hjust = 0) +
scale_fill_grey(start = 0, end = .5) +
scale_color_grey(start = 0, end = .5) +
scale_y_continuous("thousands of pelts", breaks = 0:4 * 20, limits = c(0, 90)) +
theme(legend.position = "none")

Let’s start by modeling hares. You’ll be tempted to model the number of hares, $H$, at a given time point, but consider modeling the rate of change in hares instead:
$$\frac{dH}{dt} = H_t \times (\text{birth rate}) - H_t \times (\text{death rate})$$
For the sake of simplicity, let’s say the birth and death rates are constants.
$$\frac{dH}{dt} = H_t b_H - H_t m_H = H_t (b_H - m_H)$$
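As a sanity check, with constant rates this ODE has the familiar closed-form solution: exponential growth when $b_H > m_H$ and exponential decay when $b_H < m_H$.

```latex
H_t = H_0 \, e^{(b_H - m_H) \, t}
```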
And the presence of lynx modifies the mortality rate:

$$\frac{dH}{dt} = H_t b_H - H_t L_t m_H = H_t (b_H - L_t m_H)$$
Similar logic gives us the rate of change for lynx:

$$\frac{dL}{dt} = L_t (H_t b_L - m_L)$$
Now, this model is powerful because we can learn from it without data!
sim_lynx_hare <- function(n_steps, init, theta, dt = 0.002) {
#empty vectors to hold the lynx and hares
L <- rep(NA, n_steps)
H <- rep(NA, n_steps)
# set initial values
L[1] <- init[1]
H[1] <- init[2]
for (i in 2:n_steps) {
H[i] <- H[i - 1] + dt * H[i - 1] * (theta[1] - theta[2] * L[i - 1])
L[i] <- L[i - 1] + dt * L[i - 1] * (theta[3] * H[i - 1] - theta[4])
}
# return a tibble
tibble(t = 1:n_steps,
H = H,
L = L)
}

# set the four theta values
theta <- c(0.5, 0.05, 0.025, 0.5)
# simulate
z <- sim_lynx_hare(n_steps = 1e4,
init = c(filter(Lynx_Hare, Year == 1900) %>% pull("Lynx"),
filter(Lynx_Hare, Year == 1900) %>% pull("Hare")),
theta = theta)
# what did we do?
glimpse(z)
Rows: 10,000
Columns: 3
$ t <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…
$ H <dbl> 30.00000, 30.01800, 30.03600, 30.05401, 30.07203, 30.09005, 30.10807…
$ L <dbl> 4.000000, 4.002000, 4.004005, 4.006014, 4.008028, 4.010046, 4.012069…
z %>%
pivot_longer(-t) %>%
ggplot(aes(x = t, y = value, color = name)) +
geom_line(linewidth = 1/4) +
scale_color_grey(start = 0, end = .5) +
scale_x_continuous(expression(time~(italic(t))), breaks = NULL) +
scale_y_continuous("number (thousands)", breaks = 0:4 * 10, limits = c(0, 45)) +
theme(legend.position = "none")

Anyone notice a problem with our data? We don’t have the number of hares or lynx – we have the number of pelts! So we have a different causal model.
We want a statistical model that can connect $h_t$ to $H_t$ and connect $l_t$ to $L_t$. Part of the model would include the probability a hare was trapped in a given year, $p_H$, and a similar probability for lynx, $p_L$. To make things worse, further imagine the number of pelts for each, in a given year, was rounded to the nearest 100 and divided by 1,000. Those are our values.
We can do this, though. Suppose, for example, there is a population of $H_t = 10^4$ hares. Suppose also that the annual trapping rate varies according to a beta distribution, $p_H \sim \text{Beta}(2, 18)$. This means the average is 10%, but it is very rarely double that. We get a binomial count of pelts sampled from the population of hares, and then that is rounded to the nearest 100 and divided by 1,000. Let’s see what this sort of distribution looks like:
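A quick simulation of that measurement process, following the description above (the constants come from the text; the number of simulated measurements is my choice):

```r
set.seed(16)
n_sim <- 1e4   # number of simulated measurements
Ht    <- 1e4   # true number of hares

# trapping rate varies year to year: Beta(2, 18), mean 0.1
p <- rbeta(n_sim, 2, 18)

# binomial count of pelts, rounded to the nearest 100, rescaled to thousands
h <- rbinom(n_sim, size = Ht, prob = p)
h <- round(h / 100) * 100 / 1000
```

Plotting a density of h shows the strictly positive, right-skewed shape that motivates the Log-Normal likelihood below.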
This is our distribution for the observed pelt counts. We should consider what kind of likelihood function approximates it. A Log-Normal does a good job: the right constraint (not below 0) and the right skew (positive).
$$h_t \sim \text{Log-Normal}(\log(p_H H_t), \sigma_H)$$
Here $h_t$ is the observed number of pelts, and $p_H H_t$ is the expected number of hares trapped.
An important fact about this measurement process is that there is no good way to estimate p, not without lots of data at least. So we’re going to just fix it with a strong prior. If this makes you uncomfortable, notice that the model has forced us to realize that we cannot do any better than relative population estimates, unless we have a good way to know p. A typical time series model would just happily spin on its epicycles, teaching us nothing this useful.
probabilities of observed variables
$$
\begin{aligned}
h_t &\sim \text{Log-Normal}(\log(p_H H_t), \sigma_H) \\
l_t &\sim \text{Log-Normal}(\log(p_L L_t), \sigma_L)
\end{aligned}
$$
unobserved variables

$$
\begin{aligned}
H_1 &\sim \text{Log-Normal}(\log 10, 1) \\
L_1 &\sim \text{Log-Normal}(\log 10, 1) \\
H_{T>1} &= H_1 + \int_1^T H_t (b_H - L_t m_H) \, dt \\
L_{T>1} &= L_1 + \int_1^T L_t (H_t b_L - m_L) \, dt
\end{aligned}
$$
priors
$$
\begin{aligned}
\sigma_H &\sim \text{Exponential}(1) \\
\sigma_L &\sim \text{Exponential}(1) \\
p_H &\sim \text{Beta}(\alpha_H, \beta_H) \\
p_L &\sim \text{Beta}(\alpha_L, \beta_L) \\
b_H &\sim \text{Half-Normal}(1, 0.5) \\
b_L &\sim \text{Half-Normal}(0.05, 0.05) \\
m_H &\sim \text{Half-Normal}(0.05, 0.05) \\
m_L &\sim \text{Half-Normal}(1, 0.5)
\end{aligned}
$$
For this example, we’ll transform our data into a long-format to fit with some constraints on brms.
Lynx_Hare_long <-
Lynx_Hare %>%
pivot_longer(-Year,
names_to = "population",
values_to = "pelts") %>%
mutate(delta = if_else(population == "Lynx", 1, 0),
t = Year - min(Year) + 1) %>%
arrange(delta, Year)
# what did we do?
Lynx_Hare_long |> arrange(Year) |> head()
# A tibble: 6 × 5
Year population pelts delta t
<int> <chr> <dbl> <dbl> <dbl>
1 1900 Hare 30 0 1
2 1900 Lynx 4 1 1
3 1901 Hare 47.2 0 2
4 1901 Lynx 6.1 1 2
5 1902 Hare 70.2 0 3
6 1902 Lynx 9.8 1 3
Instead of writing the entire Stan model, we’ll just write the bit that tells Stan how to fit this kind of model, a Lotka-Volterra model.
LotkaVolterra <- "
// Specify dynamical system (ODEs)
array[] real ode_LV(real t, // time
array[] real y, // the system rate
array[] real theta, // the parameters (i.e., the birth and mortality rates)
array[] real x_r, // data constant, not used here
array[] int x_i) { // data constant, not used here
// the outcome
array[2] real dydt;
// differential equations
dydt[1] = (theta[1] - theta[2] * y[2]) * y[1]; // Hare process
dydt[2] = (theta[3] * y[1] - theta[4]) * y[2]; // Lynx process
return dydt; // return a 2-element array
}
// Integrate ODEs and prepare output
real LV(real t, real Hare0, real Lynx0,
real brHare, real mrHare,
real brLynx, real mrLynx,
real delta) {
array[2] real y0; // Initial values
array[4] real theta; // Parameters
array[1, 2] real y; // ODE solution
// Set initial values
y0[1] = Hare0;
y0[2] = Lynx0;
// Set parameters
theta[1] = brHare;
theta[2] = mrHare;
theta[3] = brLynx;
theta[4] = mrLynx;
// Solve ODEs
y = integrate_ode_rk45(ode_LV,
y0, 0, rep_array(t, 1), theta,
rep_array(0.0, 0), rep_array(1, 1),
0.001, 0.001, 100); // tolerances, steps
// Return relevant population values based on our dummy-variable coding method
return (y[1, 1] * (1 - delta) +
y[1, 2] * delta);
}
"

Now, we can incorporate this function into brms. For the sake of making everything fit, we’ll do this in pieces. First, we have our formula.
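Reconstructed from the formula echoed in the model summary below, the bf() call might look like this (a sketch; lv_formula is just my name for the object):

```r
library(brms)

lv_formula <- bf(
  # observed pelts are Log-Normal with median eta * p
  pelts ~ log(eta * p),
  # eta comes from the Lotka-Volterra ODE solver defined above
  nlf(eta ~ LV(t, H1, L1, bh, mh, bl, ml, delta)),
  # intercept-only non-linear parameters
  H1 ~ 1, L1 ~ 1, bh ~ 1, mh ~ 1, bl ~ 1, ml ~ 1,
  # separate residual scale and trapping probability for each population
  sigma ~ 0 + population,
  p ~ 0 + population,
  nl = TRUE
)
```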
And our priors:
lv_priors <- c(
prior(lognormal(log(10), 1), nlpar = H1, lb = 0),
prior(lognormal(log(10), 1), nlpar = L1, lb = 0),
prior(normal(1, 0.5), nlpar = bh, lb = 0),
prior(normal(0.05, 0.05), nlpar = bl, lb = 0),
prior(normal(0.05, 0.05), nlpar = mh, lb = 0),
prior(normal(1, 0.5), nlpar = ml, lb = 0),
prior(exponential(1), dpar = sigma, lb = 0),
prior(beta(40, 200), nlpar = p, lb = 0, ub = 1)
)

And now we fit:
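The fitting call might look something like this (a sketch: lv_formula stands in for the model formula above, and the sampler settings are read off the summary output below):

```r
m2 <- brm(
  formula = lv_formula,
  data    = Lynx_Hare_long,
  family  = lognormal(),
  prior   = lv_priors,
  # make the Lotka-Volterra functions available to Stan
  stanvars = stanvar(scode = LotkaVolterra, block = "functions"),
  chains = 4, cores = 4, iter = 2000, warmup = 1000, seed = 16
)
```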
Importantly, each of these parameters has a biological meaning. That’s not only cool, it’s useful. If only more of our models had meaningful parameters.
Family: lognormal
Links: mu = identity; sigma = identity
Formula: pelts ~ log(eta * p)
eta ~ LV(t, H1, L1, bh, mh, bl, ml, delta)
H1 ~ 1
L1 ~ 1
bh ~ 1
mh ~ 1
bl ~ 1
ml ~ 1
sigma ~ 0 + population
p ~ 0 + population
Data: Lynx_Hare_long (Number of observations: 42)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
H1_Intercept 132.51 21.36 96.59 181.13 1.00 2460
L1_Intercept 38.75 6.90 27.37 54.80 1.00 2564
bh_Intercept 0.55 0.06 0.43 0.67 1.00 2096
mh_Intercept 0.00 0.00 0.00 0.01 1.00 2099
bl_Intercept 0.00 0.00 0.00 0.01 1.00 1810
ml_Intercept 0.80 0.09 0.64 1.01 1.00 2119
p_populationHare 0.18 0.02 0.13 0.23 1.00 2260
p_populationLynx 0.17 0.02 0.13 0.22 1.00 2523
sigma_populationHare 0.25 0.05 0.18 0.36 1.00 3555
sigma_populationLynx 0.26 0.05 0.18 0.37 1.00 3347
Tail_ESS
H1_Intercept 2610
L1_Intercept 2488
bh_Intercept 2487
mh_Intercept 2263
bl_Intercept 2395
ml_Intercept 2613
p_populationHare 2858
p_populationLynx 2127
sigma_populationHare 2479
sigma_populationLynx 1983
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The parameters cooperate to produce population dynamics that are not easy to read in the table. Luckily, we can get the posterior predictive distributions.
expose_functions(m2, vectorize=TRUE)
pred = predicted_draws(m2, newdata=Lynx_Hare_long)
pred |>
filter(.draw <= 20) |>
ggplot(aes(x = Year, y = .prediction)) +
geom_line(aes(group = interaction(.draw, population), color = population),
linewidth = 1/3, alpha = 1/2) +
scale_color_manual(values = c("#1c5253" , "#e07a5f")) +
scale_y_continuous("thousands of pelts", breaks = 0:6 * 20) +
scale_x_continuous(NULL) +
coord_cartesian(ylim = c(0, 120)) +
theme(legend.position = "top")

If we want to get the population of animals, rather than the number of pelts, we have to work backwards. From our predictions we have the number of pelts, which is:
$$\text{pelts} = \text{population} \times p_{\text{trapped}}$$
We estimated the value of $p$ for each population – albeit with a tight prior – in the model.
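The post object below holds the posterior draws from the fitted model; one way to create it:

```r
library(brms)

# one row per posterior draw; columns include
# b_p_populationHare and b_p_populationLynx
post <- as_draws_df(m2)
```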
post |>
select(starts_with("b_p_")) |>
rowid_to_column(var = ".draw") |>
pivot_longer(
-.draw,
names_to = "population",
names_prefix = "b_p_population",
values_to = "ptrapped") |>
full_join(pred) |>
mutate(pop_est = .prediction/ptrapped) |>
ggplot(aes(x = Year, y = pop_est)) +
geom_line(aes(group = interaction(.draw, population), color = population),
linewidth = 1/3, alpha = 1/2) +
scale_color_manual(values = c("#1c5253" , "#e07a5f")) +
scale_y_continuous("thousands of animals") +
scale_x_continuous(NULL) +
theme(legend.position = "top")