50  Hierarchical modeling - M4L11

Bayesian Statistics: Techniques and Models

An overview of hierarchical modeling in the context of Bayesian statistics.
Author

Oren Bochman

Keywords

hierarchical modeling, Bayesian statistics, R programming, statistical modeling, multilevel models, group-level effects, random effects, hierarchical priors, Bayesian inference

50.1 Introduction to Hierarchical Modeling

Figure 50.1: Introduction to Hierarchical modeling
Tip: What about correlated data?

While Section 50.2 below shows the presence of correlation within groups, I wonder whether we really need the iid assumptions below, particularly since (McElreath 2015) tells us that hierarchical models can handle correlated data.

Throughout the last few lessons, we have assumed that all the observations were independent. But more often than not, this is too strong an assumption to make about our data. There is often a natural grouping to our data points, which leads us to believe that some observation pairs should be more similar to each other than to others.

Let’s look at an example. In the previous course, we talked about using a Poisson model to count chocolate chips in cookies.

Let’s suppose that you own a company that produces chocolate chip cookies. In an experiment, you’re going to produce 150 test cookies: 30 from your own location and 30 from each of four other factory locations, that is, 30 from each of 5 locations. We’re going to assume that all the locations use the same recipe when they make their chocolate chip cookies.

Should we expect a cookie from your location, batch one, to be more similar to another cookie from that batch than to a cookie from another location’s batch?

I’d say probably.

There’s a natural grouping to the cookies. By making it a hierarchical model, we can account for the likely differences between locations in our Poisson model.

The original, fully independent Poisson model for the number of chips in the cookies would have looked like this.

N=150 \quad \text{number of cookies}

B=30 \quad \text{cookies from each location}

50.1.1 Fully independent model

Fitting a single model to all data: This approach treats all observations as independent, ignoring any potential differences or relationships between groups. In the cookie example, fitting a single Poisson model to all 150 cookies would “be ignoring potential differences between locations and the fact that cookies from the same location are likely to be more similar to each other.”

This kind of model is called a fully independent model or a completely pooled model, as we treat all the cookies as belonging to a single group.

y_i \mid \lambda \stackrel{iid}{\sim} \mathrm{Pois}(\lambda) \qquad i=1, \ldots, N \qquad \tag{50.1}

Where \lambda is the expected number of chips per cookie.

At the other extreme, we could fit a separate model for each group. This kind of model is called an unpooled model.
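
To make the contrast concrete, here is a minimal R sketch, using the cookies.dat data loaded later in this lesson, that computes the maximum-likelihood estimate of \lambda under each extreme: a single completely pooled estimate versus five unpooled, location-specific estimates.

dat = read.table(file="data/cookies.dat", header=TRUE)

## completely pooled: one lambda for all 150 cookies (the Poisson MLE is the overall mean)
(lam_pooled = mean(dat$chips))

## unpooled: a separate lambda for each location, estimated only from its own 30 cookies
(lam_unpooled = tapply(dat$chips, dat$location, mean))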

50.1.2 Location-dependent model

Hierarchical models offer a powerful solution, acting as a “good compromise” between the two extreme approaches described above. They allow for the acknowledgment of natural groupings in data while still leveraging information across those groups. This is also called a partially pooled model, as it allows for some sharing of information between groups while still allowing for group-specific parameters.

In (Gelman et al. 2013, secs. 4.5, 5.5, 7.3) the authors describe the trade-offs between these options and how hierarchical models provide a good compromise. Hierarchical models can incorporate enough parameters to fit the data well, while simultaneously using a population distribution to introduce dependence among the parameters, thereby avoiding overfitting. This approach allows the parameters of the prior, or population, distribution to be estimated from the data itself.

Also, in (McElreath 2015), the author describes how shrinkage estimators arise in hierarchical models and improve parameter estimates. To keep things simple, we can say that shrinkage is how the parameters of groups with less data are “pulled” toward the overall mean, while groups with more data are less affected by this pull; the sketch after Equation 50.2 illustrates this.

\begin{aligned} y_i \mid l_i, \lambda_{l_i} & \stackrel{iid}{\sim} \mathrm{Pois}(\lambda_{l_i}) & i=1, \ldots, N \\ \lambda_{l} \mid \alpha, \beta & \stackrel{iid}{\sim} \mathrm{Gamma}(\alpha, \beta) & l = 1, \ldots, 5 \\ \alpha &\sim \mathrm{Prior}(\alpha) & \text{(hyperprior)} \\ \beta &\sim \mathrm{Prior}(\beta) & \text{(hyperprior)} \end{aligned} \tag{50.2}
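
To see shrinkage concretely, here is a minimal R sketch of the gamma-Poisson layer of Equation 50.2, assuming (for illustration only) fixed hyperparameters \alpha = 20 and \beta = 2, so the prior mean is \alpha/\beta = 10 chips per cookie. Under this conjugate setup, the conditional posterior for a location mean is \mathrm{Gamma}(\alpha + \sum_i y_i, \beta + n), whose mean is a compromise between the prior mean and the sample mean, so a location with only a few cookies is pulled more strongly toward 10 than one with many.

## illustrative (hypothetical) hyperparameter values: prior mean alpha/beta = 10 chips
alpha = 20.0
beta = 2.0

## posterior mean of lambda_l given that location's counts y, under a Gamma(alpha, beta) prior
shrunk_mean = function(y) {
  n = length(y)
  (alpha + sum(y)) / (beta + n)
}

set.seed(42)
y_small = rpois(3, lambda=14.0)   # a location with very little data
y_large = rpois(30, lambda=14.0)  # a location with much more data

c(prior_mean = alpha/beta,
  small_sample_mean = mean(y_small), small_shrunk = shrunk_mean(y_small),
  large_sample_mean = mean(y_large), large_shrunk = shrunk_mean(y_large))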

50.1.3 Graphical Model

The structure of a hierarchical model can be visualized as follows:

  • Top Level (Hyperparameters): Independent parameters like \alpha and \beta.
  • Second Level (Group Parameters): Parameters for each group (e.g., \lambda_1, \lambda_2, \dots, \lambda_5) that are dependent on the hyperparameters.
  • Lowest Level (Observations): Individual observations (e.g., cookies) are grouped by their respective locations, and their distributions depend on the specific group parameter.

This multi-level structure allows for the estimation of the hyperparameters from the group-specific parameters.

import daft
import matplotlib.pyplot as plt
import warnings
import logging

# Suppress font warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib.font_manager")

# Also lower logging level for matplotlib font manager
logging.getLogger('matplotlib.font_manager').setLevel(logging.ERROR)

pgm = daft.PGM([6.5, 4.5], node_unit=1.2)

# Hyperparameters
pgm.add_node(daft.Node("alpha", r"$\alpha$", 2, 4));
pgm.add_node(daft.Node("beta", r"$\beta$", 4, 4));

pgm.add_node(daft.Node("lambda_1", r"$\lambda_1$", 1, 3));
pgm.add_node(daft.Node("lambda_2", r"$\lambda_2$", 2, 3));
pgm.add_node(daft.Node("lambda_3", r"$\lambda_3$", 3, 3));
pgm.add_node(daft.Node("lambda_4", r"$\lambda_4$", 4, 3));
pgm.add_node(daft.Node("lambda_5", r"$\lambda_5$", 5, 3));

# Observed data y_i, grouped under lambda_l

pgm.add_node(daft.Node("y_1", r"$y_i$", 1, 2, observed=True));
pgm.add_node(daft.Node("y_2", r"$y_i$", 2, 2, observed=True));
pgm.add_node(daft.Node("y_3", r"$y_i$", 3, 2, observed=True));
pgm.add_node(daft.Node("y_4", r"$y_i$", 4, 2, observed=True));
pgm.add_node(daft.Node("y_5", r"$y_i$", 5, 2, observed=True));

# Edges
pgm.add_edge("alpha", "lambda_1");
pgm.add_edge("alpha", "lambda_2");
pgm.add_edge("alpha", "lambda_3");
pgm.add_edge("alpha", "lambda_4");
pgm.add_edge("alpha", "lambda_5");

pgm.add_edge("beta", "lambda_1");
pgm.add_edge("beta", "lambda_2");
pgm.add_edge("beta", "lambda_3");
pgm.add_edge("beta", "lambda_4");
pgm.add_edge("beta", "lambda_5");

pgm.add_edge("lambda_1", "y_1");
pgm.add_edge("lambda_2", "y_2");
pgm.add_edge("lambda_3", "y_3");
pgm.add_edge("lambda_4", "y_4");
pgm.add_edge("lambda_5", "y_5");

# Plates
pgm.add_plate(daft.Plate([0.6, 1.5, 0.8, 1], label=r"$i=l_i=1$", shift=-0.1));
pgm.add_plate(daft.Plate([1.6, 1.5, 0.8, 1], label=r"$i=l_i=2$", shift=-0.1));
pgm.add_plate(daft.Plate([2.6, 1.5, 0.8, 1], label=r"$i=l_i=3$", shift=-0.1));
pgm.add_plate(daft.Plate([3.6, 1.5, 0.8, 1], label=r"$i=l_i=4$", shift=-0.1));
pgm.add_plate(daft.Plate([4.6, 1.5, 0.8, 1], label=r"$i=l_i=5$", shift=-0.1));


pgm.render()
plt.show()
Figure 50.2: Hierarchical model for chocolate chip cookies

A primary advantage of hierarchical models is their ability to share information, or borrow strength, from all the data.

  • Indirect Information Sharing: Even though groups might have their own specific parameters (e.g., your location’s \lambda_1), the common distribution from which these parameters are drawn means that “information about another location’s cookies might provide information about your cookies, at least indirectly.”
  • Improved Parameter Estimation: This shared information leads to more robust and accurate parameter estimates. In the cookie example, “your lambda is not only estimated directly from your 30 cookies, but also indirectly from the other 120 cookies leveraging this hierarchical structure.”

The overarching benefit of hierarchical models is that “being able to account for relationships in the data while estimating everything with a single model is a primary advantage of using hierarchical models.” This provides a flexible and powerful framework for analyzing data with inherent group structures, leading to more realistic and informative inferences.

50.2 Correlations within the Normal hierarchical model 🗒️

This section is based on the handout titled Correlation from hierarchical models, which takes a closer look at correlated quantities within hierarchical models by considering the special case of a Normal model.

50.2.1 Correlated data

In the supplementary material from Module 2, we introduced covariance and correlation.

Recall that covariance between two random variables is defined as:

\sigma_{xy} = \mathbb{C}ov(X,Y) = \mathbb{E}[(X − \mu_x)(Y − \mu_y)]

where:

  • \mu_x = \mathbb{E}(X) and
  • \mu_y = \mathbb{E}(Y).

And that Correlation between X and Y is defined as: \rho_{xy} = \mathbb{C}or(X,Y) = \frac{\mathbb{C}ov(X,Y)}{\sqrt{\mathbb{V}ar(X) \cdot \mathbb{V}ar(Y)}} = \frac{\sigma_{xy}}{\sqrt{\sigma^2_x \cdot \sigma^2_y}}.

Correlation measures the strength of linear relationship between two variables. Covariance has a useful mathematical property which we will use below.

If a, b, c, d are constants, and X,Y are random variables then

\begin{aligned} \mathbb{C}ov(a + bX, c + dY) &= \mathbb{C}ov(a, c) + \mathbb{C}ov(a, dY) + \mathbb{C}ov(bX, c) + \mathbb{C}ov(bX, dY) \\ &= 0 + 0 + 0 + b \cdot \mathbb{C}ov(X,Y) \cdot d \\ &= b \cdot d \cdot \sigma_{xy} , \end{aligned}

where the 0 terms are due to the fact that constants do not co-vary with anything.
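
We can check this property numerically. The following R sketch uses arbitrary constants a, b, c, d and a simulated correlated pair (X, Y), and compares the empirical covariance of a + bX and c + dY with b \cdot d \cdot \mathbb{C}ov(X,Y).

set.seed(1)
n = 1e6
x = rnorm(n)
y = 0.5*x + rnorm(n)   # X and Y are correlated by construction
a = 2.0; b = 3.0; c = -1.0; d = 0.5

cov(a + b*x, c + d*y)  # empirical Cov(a + bX, c + dY)
b * d * cov(x, y)      # b * d * Cov(X, Y); should match closely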

In the examples from this lesson, we used hierarchical models when the data were grouped in some way, so that two observations from the same group were assumed to be more similar to each other than two observations from different groups. We would therefore expect two observations from the same group to be correlated. It turns out that hierarchical models do induce this correlation, as we will demonstrate with a normal hierarchical model.

50.2.2 Normal hierarchical model

Suppose our data come from a normal distribution, where each group has its own mean.

In the second stage, we assume that all the group means come from a common normal distribution.

Let’s write this hierarchical model. We let y_{i,j} denote the i-th observation in group j, with mean \theta_j. Thus we get: \begin{aligned} y_{i,j} \mid \theta_j &\stackrel{iid}{\sim} \mathcal{N}(\theta_j, \sigma^2) \\ \theta_j &\stackrel{iid}{\sim} \mathcal{N}(\mu, \tau^2) \end{aligned}

where \sigma^2, \tau^2, and \mu are known constants. To get the marginal distribution of y_{i,j} only, we must compute:

\begin{aligned} \mathbb{P}r(y_{i,j}) &= \int \mathbb{P}r(y_{i,j}, \theta_j) d\theta_j \\ &= \int \mathbb{P}r(y_{i,j} \mid \theta_j) \mathbb{P}r(\theta_j) d\theta_j \end{aligned}

With normally distributed variables, it is often easier to work with the following equivalent formulation of the model

\begin{aligned} \theta_j &= \mu + \nu_j, & \nu_j &\stackrel{iid}{\sim} \mathcal{N}(0, \tau^2) \\ y_{i,j} &= \theta_j + \varepsilon_{i,j}, & \varepsilon_{i,j} &\stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2) \end{aligned}

with all \nu_j and \varepsilon_{i,j} independent. This allows us to substitute \theta_j into the expression for y_{i,j} to get y_{i,j} = \mu + \nu_j + \varepsilon_{i,j}.

One nice property of normal random variables is that if X and Y are both normally distributed (correlated or uncorrelated), then the new variable X + Y will also be normally distributed. Hence \mathbb{P}r(y_{i,j}) is a normal distribution with mean

\begin{aligned} \mathbb{E}(y_{i,j}) &= \mathbb{E}(\mu + \nu_j + \varepsilon_{i,j}) \\ &= \mathbb{E}(\mu) + \mathbb{E}(\nu_j) + \mathbb{E}(\varepsilon_{i,j}) \\ &= \mu + 0 + 0 \\ &= \mu \end{aligned}

and variance

\begin{aligned} \mathbb{V}ar(y_{i,j}) =& \mathbb{V}ar(\mu + \nu_j + \varepsilon_{i,j} ) \\ =& \mathbb{C}ov(\mu + \nu_j + \varepsilon_{i,j} , \mu + \nu_j + \varepsilon_{i,j} ) \\ =& \mathbb{C}ov(\mu,\mu) + \mathbb{C}ov(\nu_j ,\nu_j ) + \mathbb{C}ov(\varepsilon_{i,j} ,\varepsilon_{i,j} ) \\ & + 2 \cdot \mathbb{C}ov(\mu,\nu_j ) + 2 \cdot \mathbb{C}ov(\mu,\varepsilon_{i,j} ) \\ & + 2 \cdot \mathbb{C}ov(\nu_j ,\varepsilon_{i,j} ) \\ =& 0 + \mathbb{V}ar(\nu_j ) + \mathbb{V}ar(\varepsilon_{i,j} ) + 0 + 0 + 0 &&& \text{(since } \nu_j \text{ and } \varepsilon_{i,j} \text{ are independent)} \\ =& \tau^2 + \sigma^2 . \end{aligned}

Now, we want to show that observations in the same group are correlated under this hierarchical model.

Let’s take, for example, observations 1 and 2 from group j, y_{1,j} and y_{2,j}. It does not matter which two you select, as long as they are from the same group. We know that \mathbb{V}ar(y_{1,j}) = \mathbb{V}ar(y_{2,j}) = \tau^2 + \sigma^2. What about their covariance?

\begin{aligned} \mathbb{C}ov(y_{1,j},y_{2,j}) &= \mathbb{C}ov(\mu + \nu_j + \varepsilon_{1,j}, \mu + \nu_j + \varepsilon_{2,j}) \\ &= \mathbb{C}ov(\mu,\mu) + \mathbb{C}ov(\nu_j,\nu_j) + \mathbb{C}ov(\varepsilon_{1,j},\varepsilon_{2,j}) + 2 \cdot \mathbb{C}ov(\mu,\nu_j) + \mathbb{C}ov(\mu,\varepsilon_{1,j}) + \mathbb{C}ov(\mu,\varepsilon_{2,j}) + \mathbb{C}ov(\nu_j,\varepsilon_{1,j}) + \mathbb{C}ov(\nu_j,\varepsilon_{2,j}) \\ &= 0 + \mathbb{V}ar(\nu_j) + 0 + 2 \cdot 0 + 0 + 0 + 0 + 0 \qquad \text{(as } \varepsilon_{1,j} \text{ and } \varepsilon_{2,j} \text{ are independent)} \\ &= \tau^2 \end{aligned}

which gives us correlation:

\begin{aligned} \mathbb{C}or(y_{1,j},y_{2,j}) &= \mathbb{C}ov(y_{1,j},y_{2,j})/\sqrt{\mathbb{V}ar(y_{1,j}) \cdot \mathbb{V}ar(y_{2,j})} \\ &= \tau^2/\sqrt{(\tau^2 + \sigma^2) \cdot (\tau^2 + \sigma^2)} \\ &= \tau^2/\sqrt{(\tau^2 + \sigma^2)^2} \\ &= \frac{\tau^2}{\tau^2 + \sigma^2}. \end{aligned}

Finally, let’s check the covariance between observations in different groups. Let’s take observation i from groups 1 and 2 (again, our choices do not matter), y_{i,1} and y_{i,2}. Their covariance is

\begin{aligned} \mathbb{C}ov(y_{i,1},y_{i,2}) &= \mathbb{C}ov(\mu + \nu_1 + \varepsilon_{i,1}, \mu + \nu_2 + \varepsilon_{i,2}) \\ &= \mathbb{C}ov(\mu,\mu) + \mathbb{C}ov(\nu_1,\nu_2) + \mathbb{C}ov(\varepsilon_{i,1},\varepsilon_{i,2}) \\ &\quad + \mathbb{C}ov(\mu,\nu_1) + \mathbb{C}ov(\mu,\nu_2) + \mathbb{C}ov(\mu,\varepsilon_{i,1}) + \mathbb{C}ov(\mu,\varepsilon_{i,2}) \\ &\quad + \mathbb{C}ov(\nu_1,\varepsilon_{i,2}) + \mathbb{C}ov(\nu_2,\varepsilon_{i,1}) \\ &= 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 \qquad \text{(as all } \varepsilon \text{ and } \nu \text{ are independent)} \\ &= 0, \end{aligned}

which obviously yields correlation 0.

Thus, we have demonstrated that observations from the same group are correlated while observations from different groups are uncorrelated in the marginal distribution for observations.
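
A quick simulation corroborates both results. The R sketch below uses arbitrary values of \mu, \tau, and \sigma, draws many group means, then compares the empirical correlation of two observations that share a group mean (which should be close to \tau^2 / (\tau^2 + \sigma^2)) with that of two observations from different groups (which should be close to 0).

set.seed(2)
n_rep = 1e5
mu = 5.0; tau = 2.0; sig = 1.0

theta = rnorm(n_rep, mean=mu, sd=tau)    # group means theta_j
y1 = rnorm(n_rep, mean=theta, sd=sig)    # y_{1,j}
y2 = rnorm(n_rep, mean=theta, sd=sig)    # y_{2,j}, same group as y1
y_other = rnorm(n_rep, mean=rnorm(n_rep, mean=mu, sd=tau), sd=sig)  # observation from a different group

cor(y1, y2)              # empirical within-group correlation
tau^2 / (tau^2 + sig^2)  # theoretical value: 0.8
cor(y1, y_other)         # between-group correlation, approximately 0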

50.3 Applications of hierarchical modeling

Handout: Common applications of Bayesian hierarchical models

50.4 Prior predictive simulation

50.4.1 Data

Let’s fit our hierarchical model for counts of chocolate chips. The data can be found in cookies.dat.

dat = read.table(file="data/cookies.dat", header=TRUE)
head(dat)
  chips location
1    12        1
2    12        1
3     6        1
4    13        1
5    12        1
6    12        1
table(dat$location)

 1  2  3  4  5 
30 30 30 30 30 

We can also visualize the distribution of chips by location.

hist(dat$chips)
Figure 50.3
boxplot(chips ~ location, data=dat)
Figure 50.4

50.4.2 Prior predictive checks

Before implementing the model, we need to select prior distributions for \alpha and \beta, the hyperparameters governing the gamma distribution for the \lambda parameters. First, think about what the \lambda’s represent. For location j, \lambda_j is the expected number of chocolate chips per cookie. Hence, \alpha and \beta control the distribution of these means across locations. The mean of this gamma distribution represents the overall mean number of chips per cookie, while its variance controls the variability between locations: if the variance is high, the mean number of chips will vary widely from location to location; if it is small, the mean number of chips will be nearly the same from location to location.

To see the effects of different priors on the distribution of \lambda’s, we can simulate. Suppose we try independent exponential priors for \alpha and \beta.

set.seed(112)
n_sim = 500
alpha_pri = rexp(n_sim, rate=1.0/2.0)
beta_pri = rexp(n_sim, rate=5.0)
mu_pri = alpha_pri/beta_pri
sig_pri = sqrt(alpha_pri/beta_pri^2)

summary(mu_pri)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0213    2.9829    9.8522   61.1271   29.9801 4858.7861 
summary(sig_pri)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1834    3.3663    8.5488   41.8137   22.2219 2865.6461 

After simulating from the priors for \alpha and \beta, we can use those samples to simulate further down the hierarchy:

lam_pri = rgamma(n=n_sim, shape=alpha_pri, rate=beta_pri)
summary(lam_pri)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.000     1.171     7.668    83.062    28.621 11005.331 

Or for a prior predictive reconstruction of the original data set:

(lam_pri = rgamma(n=5, shape=alpha_pri[1:5], rate=beta_pri[1:5]))
[1] 66.444084  9.946688  6.028319 15.922568 47.978587
(y_pri = rpois(n=150, lambda=rep(lam_pri, each=30)))
  [1] 63 58 64 63 70 62 61 48 71 73 70 77 66 60 72 77 69 62 66 71 49 80 66 75 74
 [26] 55 62 90 65 57 12  9  7 10 12 10 11  7 14 13  9  6  6 13  7 10 12  9  9 10
 [51]  7  8  6  9  7 10 13 13  8 12  6 10  3  6  7  4  6  7  5  5  4  3  6  2  8
 [76]  4  8  4  5  7  1  4  5  3  8  8  3  1  7  3 16 14 13 17 17 12 13 13 16 16
[101] 15 14 11 10 13 17 16 19 16 17 15 16  7 17 21 16 12 15 14 13 52 44 51 46 39
[126] 40 40 44 46 59 45 49 58 42 31 52 43 47 53 41 48 57 35 60 51 58 36 34 41 59

Because these priors have high variance and are somewhat noninformative, they produce unrealistic predictive distributions. Still, enough data would overwhelm the prior, resulting in useful posterior distributions. Alternatively, we could tweak and simulate from these prior distributions until they adequately represent our prior beliefs. Yet another approach would be to re-parameterize the gamma prior, which we’ll demonstrate as we fit the model.
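
As a preview of that re-parameterization, here is a sketch of a prior predictive simulation in which the prior is placed directly on the mean \mu and standard deviation \sigma of the gamma distribution for the \lambda’s, with \alpha = \mu^2/\sigma^2 and \beta = \mu/\sigma^2. The particular priors mirror those used in the JAGS model below.

set.seed(112)
n_sim = 500

mu_pri  = rgamma(n_sim, shape=2.0, rate=1.0/5.0)  # prior mean of 10 chips per cookie
sig_pri = rexp(n_sim, rate=1.0)                   # between-location standard deviation

alpha_pri = mu_pri^2 / sig_pri^2
beta_pri  = mu_pri / sig_pri^2

lam_pri = rgamma(n_sim, shape=alpha_pri, rate=beta_pri)
summary(lam_pri)  # compare with the earlier, much more diffuse prior draws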

50.5 JAGS Model

library("rjags")
mod_string = " model {
for (i in 1:length(chips)) {
  chips[i] ~ dpois(lam[location[i]])
}

for (j in 1:max(location)) {
  lam[j] ~ dgamma(alpha, beta)
}

alpha = mu^2 / sig^2
beta = mu / sig^2

mu ~ dgamma(2.0, 1.0/5.0)
sig ~ dexp(1.0)

} "

set.seed(113)

data_jags = as.list(dat)

params = c("lam", "mu", "sig")

mod = jags.model(textConnection(mod_string), data=data_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 150
   Unobserved stochastic nodes: 7
   Total graph size: 315

Initializing model
update(mod, 1e3)

mod_sim = coda.samples(model=mod,
                       variable.names=params,
                       n.iter=5e3)
mod_csim = as.mcmc(do.call(rbind, mod_sim))
## convergence diagnostics
par(mar = c(2.5, 1, 2.5, 1))
plot(mod_sim)
Figure 50.5
Figure 50.6
gelman.diag(mod_sim)
Potential scale reduction factors:

       Point est. Upper C.I.
lam[1]          1          1
lam[2]          1          1
lam[3]          1          1
lam[4]          1          1
lam[5]          1          1
mu              1          1
sig             1          1

Multivariate psrf

1
autocorr.diag(mod_sim)
             lam[1]       lam[2]      lam[3]       lam[4]       lam[5]
Lag 0   1.000000000  1.000000000 1.000000000  1.000000000 1.0000000000
Lag 1   0.024194365  0.124021047 0.019111607  0.000355262 0.0664428799
Lag 5  -0.015704807 -0.003810331 0.005958220 -0.005751762 0.0105149694
Lag 10 -0.002111030  0.008210715 0.024357785 -0.002885581 0.0001820758
Lag 50 -0.004695945 -0.001131281 0.001797284  0.003237898 0.0075204542
                 mu           sig
Lag 0  1.0000000000  1.0000000000
Lag 1  0.3747466092  0.5475759223
Lag 5  0.0254688975  0.0748964816
Lag 10 0.0012197344  0.0197119156
Lag 50 0.0003065432 -0.0001996277
par(mar = c(2.5, 1, 2.5, 1))
autocorr.plot(mod_sim)
Figure 50.7
Figure 50.8
Figure 50.9
effectiveSize(mod_sim)
   lam[1]    lam[2]    lam[3]    lam[4]    lam[5]        mu       sig 
14123.659 10855.385 14305.054 15257.466 13129.698  6154.835  4073.565 
## compute DIC
dic = dic.samples(mod, n.iter=1e3)

50.5.1 Model checking

After assessing convergence, we can check the fit via residuals. With a hierarchical model, there are now two levels of residuals: the observation level and the location mean level. To simplify, we’ll look at the residuals associated with the posterior means of the parameters.

First, we have observation residuals, based on the estimates of location means.

## observation level residuals
(pm_params = colMeans(mod_csim))
   lam[1]    lam[2]    lam[3]    lam[4]    lam[5]        mu       sig 
 9.284372  6.222039  9.526465  8.955319 11.755696  9.094434  2.074664 
yhat = rep(pm_params[1:5], each=30)
resid = dat$chips - yhat
plot(resid)
Figure 50.10
plot(jitter(yhat), resid)
Figure 50.11
var(resid[yhat<7])
[1] 6.447126
var(resid[yhat>11])
[1] 13.72414

Also, we can look at how the location means differ from the overall mean \mu.

## location level residuals
lam_resid = pm_params[1:5] - pm_params["mu"]
plot(lam_resid)
abline(h=0, lty=2)

We don’t see any obvious violations of our model assumptions.

50.5.2 Results

summary(mod_sim)

Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

         Mean     SD Naive SE Time-series SE
lam[1]  9.284 0.5420 0.004425       0.004568
lam[2]  6.222 0.4611 0.003765       0.004428
lam[3]  9.526 0.5478 0.004473       0.004583
lam[4]  8.955 0.5313 0.004338       0.004302
lam[5] 11.756 0.6218 0.005077       0.005427
mu      9.094 0.9746 0.007957       0.012423
sig     2.075 0.6911 0.005643       0.010852

2. Quantiles for each variable:

         2.5%    25%    50%    75%  97.5%
lam[1]  8.248  8.913  9.275  9.640 10.387
lam[2]  5.347  5.903  6.215  6.530  7.156
lam[3]  8.474  9.147  9.515  9.895 10.623
lam[4]  7.953  8.594  8.935  9.306 10.029
lam[5] 10.572 11.330 11.746 12.164 13.011
mu      7.222  8.476  9.065  9.691 11.117
sig     1.093  1.585  1.946  2.436  3.708

50.6 Posterior predictive simulation

Just as we did with the prior distribution, we can use these posterior samples to obtain Monte Carlo estimates of quantities of interest from the posterior predictive distribution.

For example, we can use draws from the posterior distribution of \mu and \sigma to simulate the posterior predictive distribution of the mean for a new location.

(n_sim = nrow(mod_csim))
[1] 15000
lam_pred = rgamma(n=n_sim, shape=mod_csim[,"mu"]^2/mod_csim[,"sig"]^2, 
                  rate=mod_csim[,"mu"]/mod_csim[,"sig"]^2)
hist(lam_pred)
Figure 50.12
mean(lam_pred > 15)
[1] 0.0172

Using these \lambda draws, we can go to the observation level and simulate the number of chips per cookie, which takes into account the uncertainty in \lambda:

y_pred = rpois(n=n_sim, lambda=lam_pred)
hist(y_pred)

mean(y_pred > 15)
[1] 0.0582
hist(dat$chips)
Figure 50.13

Finally, we could answer questions like: what is the posterior probability that the next cookie produced in Location 1 will have fewer than seven chips?

y_pred1 = rpois(n=n_sim, lambda=mod_csim[,"lam[1]"])
hist(y_pred1)
Figure 50.14
mean(y_pred1 < 7)
[1] 0.187

50.7 Random intercept linear model

We can extend the linear model for the Leinhardt data on infant mortality by incorporating the region variable. We’ll do this with a hierarchical model, where each region has its own intercept.

library("car")
data("Leinhardt")
?Leinhardt
str(Leinhardt)
'data.frame':   105 obs. of  4 variables:
 $ income: int  3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
 $ infant: num  26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
 $ region: Factor w/ 4 levels "Africa","Americas",..: 3 4 4 2 4 4 4 4 4 4 ...
 $ oil   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
pairs(Leinhardt)
Figure 50.15
head(Leinhardt)
          income infant   region oil
Australia   3426   26.7     Asia  no
Austria     3350   23.7   Europe  no
Belgium     3346   17.0   Europe  no
Canada      4751   16.8 Americas  no
Denmark     5029   13.5   Europe  no
Finland     3312   10.1   Europe  no

Previously, we worked with infant mortality and income on the logarithmic scale. Recall also that we had to remove some missing data.

dat = na.omit(Leinhardt)
dat$logincome = log(dat$income)
dat$loginfant = log(dat$infant)
str(dat)
'data.frame':   101 obs. of  6 variables:
 $ income   : int  3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
 $ infant   : num  26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
 $ region   : Factor w/ 4 levels "Africa","Americas",..: 3 4 4 2 4 4 4 4 4 4 ...
 $ oil      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ logincome: num  8.14 8.12 8.12 8.47 8.52 ...
 $ loginfant: num  3.28 3.17 2.83 2.82 2.6 ...
 - attr(*, "na.action")= 'omit' Named int [1:4] 24 83 86 91
  ..- attr(*, "names")= chr [1:4] "Iran" "Haiti" "Laos" "Nepal"

Now we can fit the proposed model:

library("rjags")
mod_string = " model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu[i], prec)
    mu[i] = a[region[i]] + b[1]*log_income[i] + b[2]*is_oil[i]
  }
  
  for (j in 1:max(region)) {
    a[j] ~ dnorm(a0, prec_a)
  }
  
  a0 ~ dnorm(0.0, 1.0/1.0e6)
  prec_a ~ dgamma(1/2.0, 1*10.0/2.0)
  tau = sqrt( 1.0 / prec_a )
  
  for (j in 1:2) {
    b[j] ~ dnorm(0.0, 1.0/1.0e6)
  }
  
  prec ~ dgamma(5/2.0, 5*10.0/2.0)
  sig = sqrt( 1.0 / prec )
} "
set.seed(116)
data_jags = list(y=dat$loginfant, log_income=dat$logincome,
                  is_oil=as.numeric(dat$oil=="yes"), region=as.numeric(dat$region))
data_jags$is_oil
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
table(data_jags$is_oil, data_jags$region)
   
     1  2  3  4
  0 31 20 24 18
  1  3  2  3  0
params = c("a0", "a", "b", "sig", "tau")

mod = jags.model(textConnection(mod_string), data=data_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 101
   Unobserved stochastic nodes: 9
   Total graph size: 622

Initializing model
update(mod, 1e3) # burn-in

mod_sim = coda.samples(model=mod,
                       variable.names=params,
                       n.iter=5e3)

mod_csim = as.mcmc(do.call(rbind, mod_sim)) # combine multiple chains

## convergence diagnostics
par(mar = c(2.5, 1, 2.5, 1))
plot(mod_sim)

gelman.diag(mod_sim)
Potential scale reduction factors:

     Point est. Upper C.I.
a[1]       1.04       1.13
a[2]       1.04       1.13
a[3]       1.04       1.12
a[4]       1.04       1.13
a0         1.01       1.04
b[1]       1.04       1.13
b[2]       1.00       1.00
sig        1.00       1.01
tau        1.00       1.00

Multivariate psrf

1.03
autocorr.diag(mod_sim)
            a[1]      a[2]      a[3]      a[4]         a0      b[1]       b[2]
Lag 0  1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000 1.00000000
Lag 1  0.9236793 0.9230317 0.9210966 0.9382209 0.25672516 0.9801634 0.13886787
Lag 5  0.8506661 0.8543049 0.8526044 0.8665871 0.24657389 0.9066750 0.02593193
Lag 10 0.7702622 0.7754805 0.7747608 0.7809924 0.22260691 0.8207662 0.03150141
Lag 50 0.2856952 0.2867057 0.2886358 0.2828899 0.07009041 0.3009284 0.02435816
                sig          tau
Lag 0   1.000000000  1.000000000
Lag 1   0.060094445  0.275457817
Lag 5  -0.003795647 -0.001307823
Lag 10  0.011436618  0.018025788
Lag 50 -0.011981321  0.016768961
autocorr.plot(mod_sim)

effectiveSize(mod_sim)
      a[1]       a[2]       a[3]       a[4]         a0       b[1]       b[2] 
  156.9881   162.2592   158.4710   158.5000   708.7858   147.7942  7831.2017 
       sig        tau 
12620.3554  8548.4739 

50.7.1 Results

Convergence looks okay, so let’s compare this with the old model from Lesson 7 using DIC:

dic.samples(mod, n.iter=1e3)
Mean deviance:  213.5 
penalty 6.664 
Penalized deviance: 220.1 
### nonhierarchical model: 230.1

It appears that this model is an improvement over the non-hierarchical one we fit earlier. Notice that the penalty term, which can be interpreted as the “effective” number of parameters, is less than the actual number of parameters (nine). There are fewer “effective” parameters because they are “sharing” information or “borrowing strength” from each other in the hierarchical structure. If we had skipped the hierarchy and fit one intercept, there would have been four parameters. If we had fit separate, independent intercepts for each region, there would have been seven parameters (which is close to what we ended up with).

Finally, let’s look at the posterior summary.

summary(mod_sim)

Iterations = 1001:6000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

        Mean      SD  Naive SE Time-series SE
a[1]  6.6229 0.57886 0.0047264      0.0454324
a[2]  6.0977 0.72975 0.0059584      0.0566921
a[3]  5.9261 0.64592 0.0052739      0.0505992
a[4]  5.6460 0.89153 0.0072793      0.0701273
a0    6.0715 1.35832 0.0110907      0.0515712
b[1] -0.3549 0.11024 0.0009001      0.0089712
b[2]  0.6554 0.34901 0.0028497      0.0040367
sig   0.9203 0.06576 0.0005369      0.0005856
tau   2.0513 1.04738 0.0085518      0.0113345

2. Quantiles for each variable:

         2.5%     25%     50%     75%   97.5%
a[1]  5.51110  6.2251  6.6166  7.0036  7.7577
a[2]  4.69866  5.5904  6.0871  6.5878  7.5529
a[3]  4.70357  5.4805  5.9194  6.3583  7.1967
a[4]  3.95552  5.0333  5.6349  6.2329  7.4161
a0    3.40307  5.2655  6.0664  6.8752  8.7627
b[1] -0.57277 -0.4277 -0.3534 -0.2785 -0.1463
b[2] -0.01538  0.4183  0.6502  0.8896  1.3402
sig   0.80332  0.8739  0.9166  0.9621  1.0596
tau   0.98329  1.4067  1.7883  2.3771  4.6476

In this particular model, the intercepts do not have a meaningful interpretation on their own, because they correspond to the mean response for a country that does not produce oil and has a log-income of 0 (that is, an income of one dollar per capita). We can interpret a_0 as the overall mean intercept and \tau as the standard deviation of intercepts across regions.

50.7.2 Other models

We have not investigated adding interaction terms, which might be appropriate. We only considered adding hierarchy on the intercepts, but nothing prevents us from doing the same for other terms in the model, such as the coefficients for income and oil (see the sketch below). We could try any or all of these alternatives and see how the DIC changes for those models. This, together with the other model checking techniques we have discussed, can be used to identify the best model for making inferences and predictions.
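
For instance, here is a sketch of a JAGS model string that also gives each region its own income coefficient (a random slope) in addition to its own intercept. The variable names follow the model fit above; the priors shown are illustrative choices and the model is not fit here.

mod_string_slopes = " model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu[i], prec)
    mu[i] = a[region[i]] + b_inc[region[i]]*log_income[i] + b_oil*is_oil[i]
  }

  for (j in 1:max(region)) {
    a[j] ~ dnorm(a0, prec_a)        # hierarchical intercepts, as before
    b_inc[j] ~ dnorm(b0, prec_b)    # hierarchical income coefficients (new)
  }

  a0 ~ dnorm(0.0, 1.0/1.0e6)
  b0 ~ dnorm(0.0, 1.0/1.0e6)
  b_oil ~ dnorm(0.0, 1.0/1.0e6)

  prec_a ~ dgamma(1/2.0, 1*10.0/2.0)
  prec_b ~ dgamma(1/2.0, 1*10.0/2.0)
  prec ~ dgamma(5/2.0, 5*10.0/2.0)

  tau_a = sqrt(1.0/prec_a)
  tau_b = sqrt(1.0/prec_b)
  sig = sqrt(1.0/prec)
} "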

50.8 Mixture models

Histograms of data often reveal that they do not follow any standard probability distribution. Sometimes we have explanatory variables (or covariates) to account for the different values, and normally distributed errors are adequate, as in normal regression. However, if we only have the data values themselves and no covariates, we might have to fit a non-standard distribution to the data. One way to do this is by mixing standard distributions.

Mixture distributions are just a weighted combination of probability distributions. For example, we could take an exponential distribution with mean 1 and a normal distribution with mean 3 and variance 1 (although typically the two mixture components would have the same support; here the exponential component must be non-negative while the normal component can be positive or negative). Suppose we give them weights: 0.4 for the exponential distribution and 0.6 for the normal distribution. We could write the PDF for this distribution as

\mathbb{P}r(y) = 0.4 \cdot \exp(-y) \cdot \mathbb{I}_{(y \ge 0)} + 0.6 \cdot \frac{1}{\sqrt{2 \pi}} \exp\left(- \frac{1}{2} (y - 3)^2\right) \tag{50.3}

where \mathbb{I}_{(y \ge 0)} is the indicator function that is 1 if y \ge 0 and 0 otherwise. This PDF is a mixture of the exponential and normal distributions, with the weights indicating the relative contribution of each distribution to the overall PDF.

The PDF of this mixture distribution would look like this:

curve( 0.4*dexp(x, 1.0) + 0.6*dnorm(x, 3.0, 1.0), from=-2.0, to=7.0, ylab="density", xlab="y", main="40/60 mixture of exponential and normal distributions", lwd=2)
Figure 50.16

We could think of these two distributions as governing two distinct populations, one following the exponential distribution and the other following the normal distribution.

Let’s draw the weighted PDFs for each population.

curve( 0.4*dexp(x, 1.0) + 0.6*dnorm(x, 3.0, 1.0), from=-2.0, to=7.0, ylab="density", xlab="y", main="40/60 mixture of exponential and normal distributions", lwd=2)
curve( 0.4*dexp(x, 1.0), from=-2.0, to=7.0, col="red", lty=2, add=TRUE)
curve( 0.6*dnorm(x, 3.0, 1.0), from=-2.0, to=7.0, col="blue", lty=2, add=TRUE)
Figure 50.17

The general form for a discrete mixture of distributions is as follows:

\mathbb{P}r(y) = \sum_{j=1}^J \omega_j \cdot f_j (y) \tag{50.4}

where the \omega’s are positive weights that add up to 1 (they are probabilities) and each of the J f_j(y) functions is a PDF for some distribution. In the example above, the weights were 0.4 and 0.6, f_1 was an exponential PDF and f_2 was a normal PDF.

One way to simulate from a mixture distribution is with a hierarchical model. We first simulate an indicator for which “population” the next observation will come from using the weights \omega. Let’s call this z_i. In the example above, z_i would take the value 1 (indicating the exponential distribution) with probability 0.4 and 2 (indicating the normal distribution) with probability 0.6. Next, simulate the observation y_i from the distribution corresponding to z_i.

Let’s simulate from our example mixture distribution.

set.seed(117)
n = 1000
z = numeric(n)
y = numeric(n)
for (i in 1:n) {
  z[i] = sample.int(2, 1, prob=c(0.4, 0.6)) # returns a 1 with probability 0.4, or a 2 with probability 0.6
  if (z[i] == 1) {
    y[i] = rexp(1, rate=1.0)
  } else if (z[i] == 2) {
    y[i] = rnorm(1, mean=3.0, sd=1.0)
  }
}
hist(y, breaks=30)

If we keep only the y values and throw away the z values, we have a sample from the mixture model above. To see that they are equivalent, we can marginalize the joint distribution of y and z:

\begin{aligned} \mathbb{P}r(y) &= \sum_{j=1}^2 \mathbb{P}r(y, z=j) \\ &= \sum_{j=1}^2 \mathbb{P}r(z=j) \cdot \mathbb{P}r(y \mid z=j) \\ &= \sum_{j=1}^2 \omega_j \cdot f_j(y) \end{aligned} \tag{50.5}

50.8.1 Bayesian inference for mixture models

When we fit a mixture model to data, we usually only have the y values and do not know which “population” they belong to. Because the z variables are unobserved, they are called latent variables. We can treat them as parameters in a hierarchical model and perform Bayesian inference for them. The hierarchical model might look like this:

\begin{aligned} y_i \mid z_i, \theta & \overset{\text{ind}}{\sim} f_{z_i}(y \mid \theta) \, , \quad i = 1, \ldots, n \\ \text{Pr}(z_i = j \mid \omega) &= \omega_j \, , \quad j=1, \ldots, J \\ \omega &\sim \mathbb{P}r(\omega) \\ \theta &\sim \mathbb{P}r(\theta) \end{aligned} \tag{50.6}

where we might use a Dirichlet prior (see the review of distributions in the supplementary material) for the weight vector \omega and conjugate priors for the population-specific parameters in \theta. With this model, we could obtain posterior distributions for z (population membership of the observations), \omega (population weights), and \theta (population-specific parameters in f_j). Next, we will look at how to fit a mixture of two normal distributions in JAGS.

50.9 Example with JAGS

50.9.1 Data

For this example, we will use the data in the attached file mixture.csv.

dat = read.csv("data/mixture.csv", header=FALSE)
y = dat$V1
(n = length(y))
[1] 200

Let’s visualize these data.

hist(y, breaks=20)
Figure 50.18
plot(density(y))
Figure 50.19

It appears that we have two populations, but we do not know which population each observation belongs to. We can learn this, along with the mixture weights and population-specific parameters, with a Bayesian hierarchical model.

We will use a mixture of two normal distributions with a common (unknown) variance and different (and unknown) means.

50.9.2 Model

library("rjags")
mod_string = " model {
    for (i in 1:length(y)) {
        y[i] ~ dnorm(mu[z[i]], prec)
      z[i] ~ dcat(omega)
    }
  
  mu[1] ~ dnorm(-1.0, 1.0/100.0)
    mu[2] ~ dnorm(1.0, 1.0/100.0) T(mu[1],) # ensures mu[1] < mu[2]

    prec ~ dgamma(1.0/2.0, 1.0*1.0/2.0)
  sig = sqrt(1.0/prec)
    
    omega ~ ddirich(c(1.0, 1.0))
} "

set.seed(11)

data_jags = list(y=y)

params = c("mu", "sig", "omega", "z[1]", "z[31]", "z[49]", "z[6]") # Select some z's to monitor

mod = jags.model(textConnection(mod_string), data=data_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 200
   Unobserved stochastic nodes: 204
   Total graph size: 614

Initializing model
update(mod, 1e3)

mod_sim = coda.samples(model=mod,
                        variable.names=params,
                        n.iter=5e3)
mod_csim = as.mcmc(do.call(rbind, mod_sim))
## convergence diagnostics
par(mar = c(2.5, 1, 2.5, 1))
plot(mod_sim, ask=TRUE)
Figure 50.20
Figure 50.21
Figure 50.22
autocorr.diag(mod_sim)
              mu[1]       mu[2]    omega[1]    omega[2]          sig
Lag 0   1.000000000 1.000000000 1.000000000 1.000000000  1.000000000
Lag 1   0.538254886 0.326772821 0.292807794 0.292807794  0.416449528
Lag 5   0.083997317 0.057020925 0.062593661 0.062593661  0.017149923
Lag 10  0.019465377 0.010151346 0.007230271 0.007230271 -0.002850070
Lag 50 -0.005585322 0.006517438 0.009811168 0.009811168 -0.004975608
               z[1]        z[31]       z[49] z[6]
Lag 0   1.000000000  1.000000000 1.000000000  NaN
Lag 1  -0.003091796  0.044509351 0.026699093  NaN
Lag 5   0.004059621  0.005297103 0.011071694  NaN
Lag 10 -0.010121736 -0.021714603 0.015178298  NaN
Lag 50  0.004090286  0.003683641 0.003789031  NaN
effectiveSize(mod_sim)
    mu[1]     mu[2]  omega[1]  omega[2]       sig      z[1]     z[31]     z[49] 
 4058.434  5597.945  5970.649  5970.649  5680.320 15000.000 13445.002 13233.898 
     z[6] 
 5000.000 

50.9.3 Results

summary(mod_sim)

Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

            Mean       SD  Naive SE Time-series SE
mu[1]    -2.1224 0.167391 1.367e-03      2.628e-03
mu[2]     1.4899 0.126256 1.031e-03      1.691e-03
omega[1]  0.3872 0.040461 3.304e-04      5.240e-04
omega[2]  0.6128 0.040461 3.304e-04      5.240e-04
sig       1.1363 0.075378 6.155e-04      1.001e-03
z[1]      1.0100 0.099502 8.124e-04      8.125e-04
z[31]     1.5779 0.493906 4.033e-03      4.264e-03
z[49]     1.8077 0.394095 3.218e-03      3.437e-03
z[6]      1.9999 0.008165 6.667e-05      6.667e-05

2. Quantiles for each variable:

            2.5%     25%     50%     75%   97.5%
mu[1]    -2.4459 -2.2360 -2.1230 -2.0100 -1.7924
mu[2]     1.2382  1.4064  1.4914  1.5755  1.7314
omega[1]  0.3094  0.3598  0.3867  0.4140  0.4682
omega[2]  0.5318  0.5860  0.6133  0.6402  0.6906
sig       1.0038  1.0840  1.1308  1.1824  1.3013
z[1]      1.0000  1.0000  1.0000  1.0000  1.0000
z[31]     1.0000  1.0000  2.0000  2.0000  2.0000
z[49]     1.0000  2.0000  2.0000  2.0000  2.0000
z[6]      2.0000  2.0000  2.0000  2.0000  2.0000
## for the population parameters and the mixing weights

par(mar = c(2.5, 1, 2.5, 1))
par(mfrow=c(3,2))
densplot(mod_csim[,c("mu[1]", "mu[2]", "omega[1]", "omega[2]", "sig")])
Figure 50.23
## for the z's
par(mfrow=c(2,2))
par(mar = c(2.5, 1, 2.5, 1))
densplot(mod_csim[,c("z[1]", "z[31]", "z[49]", "z[6]")])
Figure 50.24
table(mod_csim[,"z[1]"]) / nrow(mod_csim) ## posterior probabilities for z[1], the membership of y[1]

   1    2 
0.99 0.01 
table(mod_csim[,"z[31]"]) / nrow(mod_csim) ## posterior probabilities for z[31], the membership of y[31]

        1         2 
0.4220667 0.5779333 
table(mod_csim[,"z[49]"]) / nrow(mod_csim) ## posterior probabilities for z[49], the membership of y[49]

        1         2 
0.1922667 0.8077333 
table(mod_csim[,"z[6]"]) / nrow(mod_csim) ## posterior probabilities for z[6], the membership of y[6]

           1            2 
6.666667e-05 9.999333e-01 
y[c(1, 31, 49, 6)]
[1] -2.2661749 -0.3702666  0.0365564  3.7548080

If we look back to the y values associated with these z variables we monitored, we see that y_1 is clearly in Population 1’s territory, y_{31} is ambiguous, y_{49} is ambiguous but is closer to Population 2’s territory, and y_6 is clearly in Population 2’s territory. The posterior distributions for the z variables closely reflect our assessment.