library(tidyverse)
library(psych)
library(cowplot)
library(patchwork)
library(here)
library(brms)
library(tidybayes)
library(bayesplot) ## NEW PACKAGE
Office hours will be on Tuesdays (10-12) starting next week.
Imagine 10 islands arranged in a loop. Island 2 is twice as big as island 1, island 3 is three times as big as island 1, and so on.
## record current position
current = 5
## flip coin to generate proposal
(coin = sample( c(-1,1) , size=1 ))
## [1] -1
(proposal = current + coin)
## [1] 4
## move?
(prob_move <- proposal/current)
## [1] 0.8
(current <- ifelse( runif(1) < prob_move , proposal , current ))
## [1] 4
num_weeks <- 1e5
positions <- rep(0,num_weeks)
current <- 10
for ( i in 1:num_weeks ) {
## record current position
positions[i] <- current
## flip coin to generate proposal
proposal <- current + sample( c(-1,1) , size=1 )
## now make sure he loops around the archipelago
if ( proposal < 1 ) proposal <- 10
if ( proposal > 10 ) proposal <- 1
## move?
prob_move <- proposal/current
current <- ifelse( runif(1) < prob_move , proposal , current )
}
data.frame(weeks = 1:1e5,positions) %>%
filter(weeks <=100) %>%
ggplot( aes(x=weeks, y=positions)) +
geom_line()
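As a sanity check (a sketch re-running the loop above with a fixed seed): since island sizes are proportional to their numbers, the long-run share of weeks spent on island \(k\) should be \(k/55\) (the sizes sum to \(1 + 2 + \dots + 10 = 55\)).

```r
set.seed(11)
num_weeks <- 1e5
positions <- rep(0, num_weeks)
current <- 10
for (i in 1:num_weeks) {
  positions[i] <- current
  proposal <- current + sample(c(-1, 1), size = 1)
  if (proposal < 1) proposal <- 10
  if (proposal > 10) proposal <- 1
  current <- ifelse(runif(1) < proposal / current, proposal, current)
}
## observed visit share per island vs. the theoretical share k/55
observed <- as.numeric(table(factor(positions, levels = 1:10))) / num_weeks
round(rbind(observed, expected = (1:10) / 55), 3)
```

With 1e5 weeks the two rows should agree closely, even though the chain only ever moves one island at a time.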
Gibbs sampling works by drawing samples from the conditional distribution of each parameter, given the current values of all other parameters. The process updates one parameter at a time, eventually converging to the joint posterior distribution. Gibbs sampling is particularly effective for high-dimensional problems where direct sampling is difficult, and it’s computationally efficient because it only requires the conditional distributions rather than the full joint distribution.
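A toy illustration (not from the lecture): for a standard bivariate normal with correlation rho, each conditional is itself normal, e.g. \(x \mid y \sim \text{Normal}(\rho y, \sqrt{1-\rho^2})\), so a Gibbs sampler just alternates two rnorm() draws.

```r
set.seed(1)
rho <- 0.8
n_iter <- 5000
x <- y <- numeric(n_iter)   # both chains start at 0
for (i in 2:n_iter) {
  ## draw x given the current y, then y given the new x
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
}
cor(x, y)  ## should land near rho
```

Note that we never evaluate the joint density; the two conditionals are enough.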
However, both Metropolis and Gibbs sampling are inefficient as models become more complex. For that reason, we’ll skip right ahead to Hamiltonian Monte Carlo sampling.
During Stan’s warmup period, the sampler figures out what the number of leapfrog steps and the step size should be. It does this using an algorithm called the No-U-Turn Sampler (NUTS).
Let’s return to the height and weight data.
data(Howell1, package = "rethinking")
d <- Howell1
library(measurements)
d$height <- conv_unit(d$height, from = "cm", to = "feet")
d$weight <- conv_unit(d$weight, from = "kg", to = "lbs")
describe(d, fast = T)
## vars n mean sd median min max range skew kurtosis se
## height 1 544 4.54 0.91 4.88 1.77 5.88 4.1 -1.26 0.58 0.04
## weight 2 544 78.51 32.45 88.31 9.37 138.87 129.5 -0.54 -0.94 1.39
## age 3 544 29.34 20.75 27.00 0.00 88.00 88.0 0.49 -0.56 0.89
## male 4 544 0.47 0.50 0.00 0.00 1.00 1.0 0.11 -1.99 0.02
d <- d[d$age >= 18, ]
d$height_c <- d$height - mean(d$height)
\[\begin{align*} w_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta (h_i - \bar{h}) \\ \alpha &\sim \text{Normal}(130, 20) \\ \beta &\sim \text{Normal}(0, 25) \\ \sigma &\sim \text{Uniform}(0, 50) \\ \end{align*}\]
m1 <- brm(
data = d,
family = gaussian,
weight ~ 1 + height_c,
prior = c( prior( normal(130,20), class = Intercept),
prior( normal(0,25), class = b),
prior( uniform(0,50), class = sigma, ub = 50)
),
iter = 5000, warmup = 1000, chains = 4,
seed = 3,
file = here("files/data/generated_data/m1"))
brm() is the core function for fitting Bayesian models
using brms. This function translates your model above into
a Stan model, which in turn defines the sampler and does the hard work.
Stan runs in the background on your computer. There are many
programs that can be used to run Stan, and you can even write Stan code
directly if you want.
brms::stancode(m1)
## // generated with brms 2.22.0
## functions {
## }
## data {
## int<lower=1> N; // total number of observations
## vector[N] Y; // response variable
## int<lower=1> K; // number of population-level effects
## matrix[N, K] X; // population-level design matrix
## int<lower=1> Kc; // number of population-level effects after centering
## int prior_only; // should the likelihood be ignored?
## }
## transformed data {
## matrix[N, Kc] Xc; // centered version of X without an intercept
## vector[Kc] means_X; // column means of X before centering
## for (i in 2:K) {
## means_X[i - 1] = mean(X[, i]);
## Xc[, i - 1] = X[, i] - means_X[i - 1];
## }
## }
## parameters {
## vector[Kc] b; // regression coefficients
## real Intercept; // temporary intercept for centered predictors
## real<lower=0,upper=50> sigma; // dispersion parameter
## }
## transformed parameters {
## real lprior = 0; // prior contributions to the log posterior
## lprior += normal_lpdf(b | 0, 25);
## lprior += normal_lpdf(Intercept | 130, 20);
## lprior += uniform_lpdf(sigma | 0, 50)
## - 1 * log_diff_exp(uniform_lcdf(50 | 0, 50), uniform_lcdf(0 | 0, 50));
## }
## model {
## // likelihood including constants
## if (!prior_only) {
## target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
## }
## // priors including constants
## target += lprior;
## }
## generated quantities {
## // actual population-level intercept
## real b_Intercept = Intercept - dot_product(means_X, b);
## }
m1 <- brm(
data = d,
family = gaussian,
weight ~ 1 + height_c,
prior = c( prior( normal(130,20), class = Intercept),
prior( normal(0,25), class = b),
prior( uniform(0,50), class = sigma, ub = 50)
),
iter = 5000, warmup = 1000, chains = 4,
seed = 3,
file = here("files/data/generated_data/m1"))
family specifies the distribution of the outcome variable.
In many examples, we’ll use a gaussian (normal) distribution, but there
are many, many options for this.
m1 <- brm(
data = d,
family = gaussian,
weight ~ 1 + height_c,
prior = c( prior( normal(130,20), class = Intercept),
prior( normal(0,25), class = b),
prior( uniform(0,50), class = sigma, ub = 50)
),
iter = 5000, warmup = 1000, chains = 4,
seed = 3,
file = here("files/data/generated_data/m1"))
The formula argument works as you would expect from the
lm() and lmer() functions you have seen in the
past. The benefit of brms is that this formula syntax can easily
handle complex and non-linear terms. We’ll be playing with this more in
future classes.
m1 <- brm(
data = d,
family = gaussian,
weight ~ 1 + height_c,
prior = c( prior( normal(130,20), class = Intercept),
prior( normal(0,25), class = b),
prior( uniform(0,50), class = sigma, ub = 50)
),
iter = 5000, warmup = 1000, chains = 4,
seed = 3,
file = here("files/data/generated_data/m1"))
Here we set our priors. Class b refers to slope
parameters, or beta coefficients. Again, this argument can
become very detailed, specific, and flexible, and we’ll play more with
it later.
m1 <- brm(
data = d,
family = gaussian,
weight ~ 1 + height_c,
prior = c( prior( normal(130,20), class = Intercept),
prior( normal(0,25), class = b),
prior( uniform(0,50), class = sigma, ub = 50)
),
iter = 5000, warmup = 1000, chains = 4,
seed = 3,
file = here("files/data/generated_data/m1"))
Hamiltonian MCMC runs for a set number of iterations, throws away the first portion (the warmup), and repeats that process multiple times (the number of chains).
warmup defaults to iter/2, but you can set
it to something else. Remember, the warmup is not simply a burn-in
period. It’s used to figure out the appropriate leapfrog and step size
settings. So it’s worth allowing this to be large-ish. Also note that
the warmup period will generally run more slowly than the sampling
period.
In general, it’s recommended that you use one short chain to debug, and four longer chains for estimation, verification, and inference.
m1 <- brm(
data = d,
family = gaussian,
weight ~ 1 + height_c,
prior = c( prior( normal(130,20), class = Intercept),
prior( normal(0,25), class = b),
prior( uniform(0,50), class = sigma, ub = 50)
),
iter = 5000, warmup = 1000, chains = 4,
seed = 3,
file = here("files/data/generated_data/m1"))
Remember, these are random walks through parameter space, so set a seed for reproducibility. Also, these models can take a while to run, especially when you are developing more complex ones. If you specify a file, the output of the model will automatically be saved. Even better, the next time you run this code, R will check for that file and load it into your workspace instead of re-running the model. (Just be sure to delete the model file if you make changes to any other part of the code.)
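One practical note (my addition; the path below is the one used in the model call above): brms saves the fit to the file argument with a ".rds" extension appended automatically, so "deleting the model file" looks like this.

```r
## brms appends ".rds" to the `file` argument, so the cached fit for m1 lives
## at files/data/generated_data/m1.rds (relative to the project root).
## Delete it to force a re-fit after changing the model, priors, or data:
fit_path <- file.path("files", "data", "generated_data", "m1.rds")
if (file.exists(fit_path)) file.remove(fit_path)
```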
m1 <- brm(
data = d,
family = gaussian,
weight ~ 1 + height_c,
prior = c( prior( normal(130,20), class = Intercept),
prior( normal(0,25), class = b),
prior( uniform(0,50), class = sigma, ub = 50)
),
iter = 5000, warmup = 1000, chains = 4,
cores = 4,
seed = 3,
file = here("files/data/generated_data/m1"))
You can parallelize your chains (run them in parallel instead of in sequence) by defining how many cores you want to use. Your computer (probably) has at least 4 cores, so 4 is a good choice here. But be warned that your computer may slow down elsewhere while the code is running, and may even run a bit hot.
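If you’re not sure how many cores you have, R can tell you. A small sketch (my addition): brm()’s cores argument defaults to getOption("mc.cores", 1), so you can also set a session-wide default once instead of passing cores = 4 to every call.

```r
## how many cores does this machine report?
n_cores <- parallel::detectCores()
n_cores

## set a session-wide default that brm() will pick up automatically
options(mc.cores = 4)
```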
summary(m1)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: weight ~ 1 + height_c
## Data: d (Number of observations: 352)
## Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
## total post-warmup draws = 16000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 99.21 0.50 98.23 100.18 1.00 14807 12164
## height_c 42.05 1.95 38.22 45.91 1.00 15921 12552
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 9.38 0.36 8.72 10.12 1.00 15626 12366
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
lprior is the log-prior density, i.e., the logarithm of the
prior probability density of the model parameters. lp__ is
the (unnormalized) log posterior density: lp__ = lprior +
log-likelihood.
To evaluate your model, the first question to ask is “did your chains mix?”
library(bayesplot)
mcmc_trace(m1)
mcmc_rank_overlay(m1, pars=vars(b_Intercept:sigma)) +ylim(150, NA)
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
This is not \(N\) like the number of participants in your sample. This is the number of draws from the posterior. The effective number is an estimate of how many independent samples from the posterior your draws are worth. Markov chains are typically autocorrelated, so sequential samples are not entirely independent. This happens when chains explore the posterior slowly, as in a Metropolis algorithm. Autocorrelation reduces the effective number of samples. Think of the effective sample size as the length of a Markov chain with no autocorrelation that would give an estimate of the same quality as yours.
brms gives you two estimates for effective sample size:
bulk and tail.
How many do you need? It depends on how skewed your posterior is.
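To build intuition about how autocorrelation eats into your sample size (a sketch, not course code): for an AR(1) chain with lag-1 autocorrelation \(\phi\), a standard approximation is \(\text{ESS} \approx n \, \frac{1-\phi}{1+\phi}\), so a chain with \(\phi = 0.9\) carries only about 5% as much information as independent draws.

```r
set.seed(1)
n <- 1e4
phi <- 0.9                 # lag-1 autocorrelation of the simulated chain
x <- numeric(n)
for (i in 2:n) x[i] <- phi * x[i - 1] + rnorm(1)

rho1 <- acf(x, plot = FALSE)$acf[2]   # estimated lag-1 autocorrelation
c(n = n, ess_approx = n * (1 - rho1) / (1 + rho1))
```

Stan’s Bulk_ESS and Tail_ESS estimates are more sophisticated than this one-lag approximation, but the intuition is the same: stickier chains, fewer effective samples.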
summary(m1)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: weight ~ 1 + height_c
## Data: d (Number of observations: 352)
## Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
## total post-warmup draws = 16000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 99.21 0.50 98.23 100.18 1.00 14807 12164
## height_c 42.05 1.95 38.22 45.91 1.00 15921 12552
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 9.38 0.36 8.72 10.12 1.00 15626 12366
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The Gelman-Rubin convergence metric \((\hat{R})\) is a comparison of between-chain variance to within-chain variance. You want this number to be as close to 1 as you can.
Note that a value of 1 doesn’t mean that you’re fine. It’s a sign of danger, but not a sign of safety.
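To see what \(\hat{R}\) is comparing, here is a simplified sketch (my own; Stan’s actual version also splits each chain in half and rank-normalizes the draws first):

```r
## simplified Rhat: pooled vs. within-chain variance across chains
rhat_basic <- function(chains) {   # chains: iterations x chains matrix
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))       # within-chain variance
  B <- n * var(colMeans(chains))         # between-chain variance
  sqrt((((n - 1) / n) * W + B / n) / W)
}

set.seed(1)
good <- matrix(rnorm(4000), ncol = 4)        # four well-mixed chains
bad  <- sweep(good, 2, c(0, 2, 4, 6), "+")   # chains stuck at different means
round(c(good = rhat_basic(good), bad = rhat_basic(bad)), 2)
```

When the chains agree, between-chain variance adds almost nothing and the ratio is near 1; when each chain is stuck in its own region, the between-chain term inflates it well above 1.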
summary(m1)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: weight ~ 1 + height_c
## Data: d (Number of observations: 352)
## Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
## total post-warmup draws = 16000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 99.21 0.50 98.23 100.18 1.00 14807 12164
## height_c 42.05 1.95 38.22 45.91 1.00 15921 12552
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 9.38 0.36 8.72 10.12 1.00 15626 12366
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
A rule in physics is that energy must be conserved. Remember that HMC is literally a physics simulation. Energy won’t be conserved in the system if your skateboarder zings out of the bowl and into outer space. That’s a divergent transition.
Two primary ways to handle divergent transitions are increasing
the adapt_delta parameter and reparameterizing the model.
adapt_delta specifies the target average acceptance
probability during the adaptation phase of sampling (warmup). The higher
the delta, the smaller the step size. This makes the model more
conservative, slower, and also much less likely to encounter problems.
Its default is .8. Try increasing it to .9 or .99 if you have divergent
transitions.
adapt_delta is like adjusting how smooth and controlled
the skateboarder’s ride should be.

* A higher adapt_delta (e.g., 0.99) means the skateboarder is super cautious, preferring smooth, stable rides (i.e., high acceptance rates). This leads to smaller step sizes and more leapfrog steps per iteration.
* A lower adapt_delta (e.g., 0.8) lets the skateboarder be riskier, zooming around the bowl faster, but at a higher risk of crashing (divergent transitions).
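In brms, you pass adapt_delta through the control argument. A sketch reusing the model above (the name m1_careful and its file path are mine; only the control line is new):

```r
m1_careful <- brm(
  data = d,
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
  ),
  iter = 5000, warmup = 1000, chains = 4,
  control = list(adapt_delta = 0.99),   # default is 0.8
  seed = 3,
  file = here("files/data/generated_data/m1_careful"))
```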