48  Poisson regression - M4L10

Bayesian Statistics: Techniques and Models

An overview of Poisson regression in the context of Bayesian statistics.
Author

Oren Bochman

Keywords

Poisson regression, Bayesian statistics, R programming, statistical modeling, count data, Poisson distribution, overdispersion, zero-inflated models

Poisson regression is the preferred method for count data, where the response takes non-negative integer values, including zero. The gist of the method is that we apply a logarithmic link function to the mean of the response rather than transforming the response itself, so zero counts remain admissible. One limitation of this approach, mentioned below, is that the Poisson distribution uses a single parameter \lambda for both its expected value and its variance. A deeper solution to this problem comes in the next course, on mixture models; in this course we only consider the more limited fix of extending the Poisson regression with the Negative Binomial distribution, which allows us to model over-dispersed data.

Note: Overdispersion

When the variance needs to be a separate feature of the model, an alternative model with additional free parameters may provide a better fit. For count data, a Poisson mixture model such as the Negative Binomial distribution can be posited instead: the mean of the Poisson distribution is itself treated as a random variable drawn from a Gamma distribution, thereby introducing an additional free parameter.

The lectures in this course do not treat over-dispersion in depth (the next course, on mixture models, does), but Section 48.5 below sketches an over-dispersed extension.
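As a quick illustration (not part of the lesson), the following R simulation shows how letting the Poisson mean vary according to a Gamma distribution inflates the variance relative to the mean; the sample size, shape, and rate below are arbitrary choices.

# Illustrative only: a Gamma-mixed Poisson is overdispersed.
set.seed(42)
n   = 1e5
lam = rgamma(n, shape=2.0, rate=1.0)  # random Poisson means, E[lam] = 2
y   = rpois(n, lambda=lam)            # marginally a Negative Binomial draw
mean(y)                               # close to 2
var(y)                                # close to 2 + 2 = 4, exceeding the mean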

48.1 Introduction to Poisson regression

Figure 48.1: Introduction to Poisson regression

We now have experience fitting regression models when the response is continuous and when it is binary. What about when we have count data? We could fit a normal linear regression, but it has a couple of drawbacks here: counts cannot be negative, and their variance is typically not constant. The Poisson distribution provides a natural likelihood for count data.

y_i\mid \lambda_i \stackrel{\text{ind}}{\sim} \mathrm{Pois}(\lambda_i) \qquad i=1, \ldots, n

Here, \lambda_i conveniently represents the expected value of y_i, \mathbb{E}[y_i]. It turns out that \lambda_i is also the variance, \mathbb{V}\text{ar}[y_i]. So if we expect a count to be higher, we also expect its variability to be higher.

We saw this earlier with the warp breaks data.

Suppose we modeled the mean directly, as we did with linear regression; that is, the expected value of y_i modeled with a linear form:

\mathbb{E}[y_i] = \beta_0 + \beta_1x_i \qquad \text{(linear regression)}

We would run into the same problem we did with logistic regression. The expected value has to be greater than zero in the Poisson distribution. To naturally deal with that restriction, we’re going to use the logarithmic link function.

That is, we set the log of \lambda_i equal to the linear predictor:

\log(\lambda_i) = \beta_0+\beta_1x_i \qquad \text{(log link)} \tag{48.1}

in contrast to modeling the mean directly, \mathbb{E}[y_i]=\beta_0+\beta_1x_i, as in linear regression.

From this, we can easily recover the expression for the mean itself. That is, we can invert this link function to get the expected value of y_i,

\implies \mathbb{E}[y_i] = \lambda_i = e^{\beta_0+\beta_1x_i} \tag{48.2}

It might seem like this model is equivalent to fitting a normal linear regression to \log(y), but there are a few key differences. In a normal regression on \log(y) we model the expected value of \log(y) directly, whereas here we model the log of the expected value of y:

\mathbb{E}[\log(y)]\ne \log(\mathbb{E}[y])

These are not equal; they are usually similar, but they are not the same. Another difference is that a normal regression has a separate, independent parameter for the variance. In Poisson regression the variance is automatically equal to \lambda, which may not always be appropriate, as we will see in an upcoming example.
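A tiny numerical check (illustrative values only) makes the distinction concrete; since the log is concave, Jensen's inequality guarantees that the mean of the logs cannot exceed the log of the mean.

# Illustrative only: E[log(y)] is not log(E[y]).
set.seed(1)
y = rpois(1e5, lambda=3) + 1   # +1 only to avoid log(0) in this toy example
mean(log(y))                   # E[log(y)], smaller by Jensen's inequality
log(mean(y))                   # log(E[y])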

As usual, we can add more explanatory x variables to the Poisson regression framework. They can be continuous, categorical, or they could be counts themselves.

If we have three predictor variables x_i = ( x_{1,i}, x_{2,i}, x_{3,i} ), the likelihood part of the hierarchical representation of a Poisson regression with logarithmic link is

y_i \mid x_i, \beta_0, \ldots, \beta_3 \stackrel{\text{ind}}{\sim} \mathrm{Pois}\left( e^{\beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i}} \right) \qquad i=1,\ldots,n

Here we incorporated the (inverse) link function directly into the likelihood rather than writing it on two lines.

48.2 Poisson regression - JAGS model

For an example of Poisson regression, we’ll use the badhealth data set from the COUNT package in R.

doctor visits
library("COUNT")
Loading required package: msme
Loading required package: MASS
Loading required package: lattice
Loading required package: sandwich
data("badhealth")
#?badhealth
head(badhealth)
  numvisit badh age
1       30    0  58
2       20    0  54
3       16    0  44
4       20    0  57
5       15    0  33
6       15    0  28

According to the data description:

Note: Data Card for badhealth

1,127 observations from a 1998 German survey with 3 variables:

  1. numvisit - number of visits to the doctor in 1998 (response)
  2. badh - \begin{cases} 1 \qquad \text{ patient claims to be in bad health} \\ 0 \qquad \text{ patient does not claim to be in bad health} \end{cases}
  3. age - age of patient
any(is.na(badhealth))
[1] FALSE
1. check for missing values: there are none, so nothing needs to be removed.

As usual, let’s visualize these data.

hist(badhealth$numvisit, breaks=20)
Figure 48.2: Histogram of number of doctor visits
plot(jitter(log(numvisit)) ~ jitter(age), data=badhealth, subset=badh==0, xlab="age", ylab="log(visits)")
points(jitter(log(numvisit)) ~ jitter(age), data=badhealth, subset=badh==1, col="red")
Figure 48.3

48.2.1 Doctor Visits Model

It appears that both age and bad health are related to the number of doctor visits. We should include model terms for both variables. If we believe the age/visits relationship is different between healthy and non-healthy populations, we should also include an interaction term. We will fit the full model here and leave it to you to compare it with the simpler additive model.
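As an optional sanity check (not part of the lesson), the same interaction model can be fit by maximum likelihood with glm(); with the vague priors used below, the posterior means should land close to these estimates. The object name glm_ref is just for this sketch.

# Optional reference fit: maximum-likelihood Poisson regression.
glm_ref = glm(numvisit ~ badh * age, data=badhealth, family=poisson(link="log"))
summary(glm_ref)$coefficients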

doctor visits
library("rjags")
Loading required package: coda
Linked to JAGS 4.3.2
Loaded modules: basemod,bugs
mod_string = " model {
    for (i in 1:length(numvisit)) {
        numvisit[i] ~ dpois(lam[i])
        log(lam[i]) = int + b_badh*badh[i] + b_age*age[i] + b_intx*age[i]*badh[i]
    }
    
    int ~ dnorm(0.0, 1.0/1e6)
    b_badh ~ dnorm(0.0, 1.0/1e4)
    b_age ~ dnorm(0.0, 1.0/1e4)
    b_intx ~ dnorm(0.0, 1.0/1e4)
} "

set.seed(102)

data_jags = as.list(badhealth)

params = c("int", "b_badh", "b_age", "b_intx")

mod = jags.model(textConnection(mod_string), data=data_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1127
   Unobserved stochastic nodes: 4
   Total graph size: 3665

Initializing model
update(mod, 1e3)

mod_sim = coda.samples(model=mod,  variable.names=params, n.iter=5e3)
mod_csim = as.mcmc(do.call(rbind, mod_sim))

## convergence diagnostics
par(mar = c(2.5, 1, 2.5, 1))
plot(mod_sim)

gelman.diag(mod_sim)
Potential scale reduction factors:

       Point est. Upper C.I.
b_age        1.04       1.14
b_badh       1.03       1.11
b_intx       1.04       1.12
int          1.04       1.14

Multivariate psrf

1.04
autocorr.diag(mod_sim)
           b_age    b_badh    b_intx       int
Lag 0  1.0000000 1.0000000 1.0000000 1.0000000
Lag 1  0.9562791 0.9653642 0.9661073 0.9539499
Lag 5  0.8344328 0.8647549 0.8687001 0.8296236
Lag 10 0.7132532 0.7589190 0.7636995 0.7081269
Lag 50 0.2477757 0.3154168 0.3265712 0.2452060
autocorr.plot(mod_csim)

effectiveSize(mod_sim)
   b_age   b_badh   b_intx      int 
257.3945 205.2655 195.8040 258.4887 
## compute DIC
dic = dic.samples(mod, n.iter=1e3)

48.2.2 Model checking - Residuals

“While inexact models may mislead, attempting to allow for every contingency a priori is impractical. Thus models must be built by an iterative feedback process in which an initial parsimonious model may be modified when diagnostic checks applied to residuals indicate the need.” —G. E. P. Box

To get a general idea of the model’s performance, we can look at predicted values and residuals as usual. Don’t forget that we must apply the inverse of the link function to get predictions for \lambda .

X = as.matrix(badhealth[,-1])
X = cbind(X, with(badhealth, badh*age))
head(X)
1. We drop the first column since it is the column for our response y.
2. We add a third column with \mathbb{I}_{badh}\times age.
     badh age  
[1,]    0  58 0
[2,]    0  54 0
[3,]    0  44 0
[4,]    0  57 0
[5,]    0  33 0
[6,]    0  28 0
(pmed_coef = apply(mod_csim, 2, median))
1. These are the column-wise medians of the posterior samples of the coefficients.
       b_age       b_badh       b_intx          int 
 0.008597515  1.573413745 -0.011091209  0.342728633 
llam_hat = pmed_coef["int"] + X %*% pmed_coef[c("b_badh", "b_age", "b_intx")]
lam_hat = exp(llam_hat)

hist(lam_hat)
1. The intercept plus X \vec b gives the linear part (on the log scale).
2. \hat\lambda_i = e^{\text{int} + (X \vec b)_i}: we apply the inverse link function.

resid = badhealth$numvisit - lam_hat
plot(resid) # the data were ordered

This plot looks bad, but the data set is presorted, so the apparent pattern against the observation index is an artifact of that ordering and we can set this issue aside.

plot(lam_hat, badhealth$numvisit)
abline(0.0, 1.0)
Figure 48.4
plot(lam_hat[which(badhealth$badh==0)], resid[which(badhealth$badh==0)], xlim=c(0, 8), ylab="residuals", xlab=expression(hat(lambda)), ylim=range(resid))
points(lam_hat[which(badhealth$badh==1)], resid[which(badhealth$badh==1)], col="red")
Figure 48.5

It is not surprising that the variability increases with higher predicted values, since in the Poisson distribution the mean equals the variance. However, observations predicted to have about two visits should have variance about two, and observations predicted to have about six visits should have variance about six.

var(resid[which(badhealth$badh==0)])
[1] 7.022486
var(resid[which(badhealth$badh==1)])
[1] 41.1961

For these data the residual variances are far larger than that. This indicates that either the model fits poorly (meaning the covariates don’t explain enough of the variability in the data), or the data are “overdispersed” for the Poisson likelihood we have chosen. This is a common issue with count data. If the data are more variable than the Poisson likelihood suggests, a good alternative is the negative binomial distribution; the lectures do not pursue it, but Section 48.5 below sketches such a model.
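Although the lesson stops at the Poisson model, MASS (already loaded as a dependency of COUNT) provides glm.nb() for a quick non-Bayesian check of the overdispersion; this is an aside, and the object name nb_ref is just for this sketch. A small estimated dispersion parameter theta indicates substantial overdispersion relative to the Poisson.

# Aside: maximum-likelihood negative binomial fit as an overdispersion check.
nb_ref = MASS::glm.nb(numvisit ~ badh * age, data=badhealth)
nb_ref$theta   # small theta => heavy overdispersion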

48.3 Predictive distributions

Assuming the model fit is adequate, we can interpret the results.

summary(mod_sim)

Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

            Mean       SD  Naive SE Time-series SE
b_age   0.008566 0.002082 1.700e-05      0.0001320
b_badh  1.575435 0.183920 1.502e-03      0.0129623
b_intx -0.011097 0.004249 3.469e-05      0.0003062
int     0.343699 0.081153 6.626e-04      0.0051234

2. Quantiles for each variable:

            2.5%       25%       50%       75%     97.5%
b_age   0.004374  0.007174  0.008598  0.009993  0.012555
b_badh  1.228106  1.448707  1.573414  1.699820  1.937777
b_intx -0.019539 -0.013948 -0.011091 -0.008208 -0.003083
int     0.186948  0.288840  0.342729  0.397622  0.506098

The intercept is not necessarily interpretable here because it corresponds to the number of doctor visits for a healthy 0-year-old. While the number of visits for a newborn sounds like interesting information, the youngest person in the data set is 20 years old. In such cases we should avoid extrapolating and regard the intercept simply as an artifact of the model.

  • For healthy individuals, it appears that age is associated with an increase in the expected number of doctor visits.

  • Bad health is associated with an increase in expected number of visits.

  • The interaction coefficient is interpreted as an adjustment to the age coefficient for people in bad health. Hence, for people in bad health, age is essentially unassociated with the number of visits (a quick posterior check of this follows below).
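One quick way to see this (not shown in the lesson) is to look at the posterior of the effective age slope for the bad-health group, b_age + b_intx, computed from the combined samples:

# Posterior of the effective age slope for people in bad health.
age_slope_badh = mod_csim[,"b_age"] + mod_csim[,"b_intx"]
quantile(age_slope_badh, probs=c(0.025, 0.5, 0.975))
mean(age_slope_badh > 0)   # posterior probability that the slope is positive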

48.3.1 Predictive distributions

Let’s say we have two people aged 35, one in good health and the other in poor health.

Q. What is the posterior probability that the individual with poor health will have more doctor visits?

This goes beyond the posterior probabilities we have calculated comparing expected responses in previous lessons. Here we will create Monte Carlo samples for the responses themselves. This is done by taking the Monte Carlo samples of the model parameters, and for each of those, drawing a sample from the likelihood.

Let’s walk through this.

First, we need the x values for each individual. We’ll say the healthy one is Person 1 and the unhealthy one is Person 2. Their x values are:

x1 = c(0, 35, 0)
x2 = c(1, 35, 35)
1. The good-health person's data (badh = 0, age = 35, age × badh = 0).
2. The bad-health person's data (badh = 1, age = 35, age × badh = 35).

The posterior samples of the model parameters are stored in mod_csim:

head(mod_csim)
Markov Chain Monte Carlo (MCMC) output:
Start = 1 
End = 7 
Thinning interval = 1 
          b_age   b_badh      b_intx       int
[1,] 0.01104922 1.786340 -0.01616279 0.2529443
[2,] 0.01040047 1.855573 -0.01669443 0.2526289
[3,] 0.01024503 1.848712 -0.01664210 0.2566747
[4,] 0.01129367 1.852566 -0.01825171 0.2444922
[5,] 0.01176085 1.845686 -0.01698446 0.2342671
[6,] 0.01141745 1.853138 -0.01737064 0.2307088
[7,] 0.01154839 1.822791 -0.01628769 0.2323831

Then we compute the linear part of the predictor:

loglam1 = mod_csim[,"int"] + mod_csim[,c(2,1,3)] %*% x1
loglam2 = mod_csim[,"int"] + mod_csim[,c(2,1,3)] %*% x2

Next we’ll apply the inverse link:

lam1 = exp(loglam1)
lam2 = exp(loglam2)

The final step is to use these samples for the \lambda parameter for each individual and simulate actual number of doctor visits using the likelihood:

(n_sim = length(lam1))
[1] 15000

We now have 15,000 Monte Carlo samples of \lambda for each person.

y1 = rpois(n=n_sim, lambda=lam1)
y2 = rpois(n=n_sim, lambda=lam2)

plot(table(factor(y1, levels=0:18))/n_sim, pch=2, ylab="posterior prob.", xlab="visits")
points(table(y2+0.1)/n_sim, col="red")
Figure 48.6

Finally, we can answer the original question: What is the probability that the person with poor health will have more doctor visits than the person with good health?

mean(y2 > y1)
[1] 0.9216667

Because we used our posterior samples of the model parameters in the simulation (the loglam1 and loglam2 step above), this posterior predictive distribution for the number of visits of these two new individuals naturally accounts for our uncertainty in the model estimates. This is a more honest/realistic distribution than we would get by fixing the model parameters at their MLEs or posterior means and simulating data for the new individuals.
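For contrast, here is a sketch (not from the lesson) of that overconfident alternative: plugging in the single set of posterior median coefficients computed earlier (pmed_coef) and ignoring parameter uncertainty.

# Illustrative only: predictions using fixed posterior-median coefficients.
lam1_fixed = exp(pmed_coef["int"] + sum(pmed_coef[c("b_badh", "b_age", "b_intx")] * x1))
lam2_fixed = exp(pmed_coef["int"] + sum(pmed_coef[c("b_badh", "b_age", "b_intx")] * x2))
y1_fixed = rpois(n=n_sim, lambda=lam1_fixed)
y2_fixed = rpois(n=n_sim, lambda=lam2_fixed)
mean(y2_fixed > y1_fixed)   # typically close to, but not the same as, mean(y2 > y1)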

48.4 Prior sensitivity analysis

When communicating results from any analysis, a responsible statistician will report and justify modeling decisions, especially assumptions. In a Bayesian analysis, there is an additional assumption that is open to scrutiny: the choices of prior distributions. In the models considered so far in this course, there are an infinite number of prior distributions we could have chosen from.

Q. How do you justify the model you choose?

If your priors truly represent your beliefs about the parameters before the analysis, and the model is appropriate, then the posterior distribution truly represents your updated beliefs. If you don't have any strong beliefs beforehand, there are often default, reference, or non-informative prior options, and you will have to select one. However, a collaborator or a boss (indeed, somebody somewhere) may not agree with your choice of prior. One way to increase the credibility of your results is to repeat the analysis under a variety of priors, and report how the results differ as a result. This process is called prior sensitivity analysis.

Q. How do you justify the priors you choose?

At a minimum you should always report your choice of model and prior. If you include a sensitivity analysis, select one or more alternative priors and describe how the results of the analysis change. If they are sensitive to the choice of prior, you will likely have to explain both sets of results, or at least explain why you favor one prior over another. If the results are not sensitive to the choice of prior, this is evidence that the data are strongly driving the results. It suggests that different investigators coming from different backgrounds should come to the same conclusions.

If the purpose of your analysis is to establish a hypothesis, it is often prudent to include a “skeptical” prior which does not favor the hypothesis. Then, if the posterior distribution still favors the hypothesis despite the unfavorable prior, you will be able to say that the data substantially favor the hypothesis. This is the approach we will take in the following example, continued from the previous lesson.

48.4.1 Poisson regression example

Let’s return to the example of number of doctor visits. We concluded from our previous analysis of these data that both bad health and increased age are associated with more visits. Suppose the burden of proof that bad health is actually associated with more visits rests with us, and we need to convince a skeptic.

doctor visits

First, let’s re-run the original analysis and remind ourselves of the posterior distribution for the badh (bad health) indicator.

library("COUNT")
library("rjags")

data("badhealth")
mod_string = " model {
    for (i in 1:length(numvisit)) {
        numvisit[i] ~ dpois(lam[i])
        log(lam[i]) = int + b_badh*badh[i] + b_age*age[i] + b_intx*age[i]*badh[i]
    }
    
    int ~ dnorm(0.0, 1.0/1e6)
    b_badh ~ dnorm(0.0, 1.0/1e4)
    b_age ~ dnorm(0.0, 1.0/1e4)
    b_intx ~ dnorm(0.0, 1.0/1e4)
} "

set.seed(102)

data_jags = as.list(badhealth)

params = c("int", "b_badh", "b_age", "b_intx")

mod = jags.model(textConnection(mod_string), data=data_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1127
   Unobserved stochastic nodes: 4
   Total graph size: 3665

Initializing model
update(mod, 1e3)

mod_sim = coda.samples(model=mod,
                        variable.names=params,
                        n.iter=5e3)
mod_csim = as.mcmc(do.call(rbind, mod_sim))
plot(density(mod_csim[,"b_badh"]))
Figure 48.7

Essentially all of the posterior probability mass is above 0, suggesting that this coefficient is positive (and consequently that bad health is associated with more visits). We obtained this result using a relatively noninformative prior. What if we use a prior that strongly favors values near 0? Let’s repeat the analysis with a normal prior on the badh coefficient that has mean 0 and standard deviation 0.2, so that the prior probability that the coefficient is less than 0.6 is >0.998 . We’ll also use a small variance on the prior for the interaction term involving badh (standard deviation 0.01 because this coefficient is on a much smaller scale).
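The prior probability quoted above can be verified with a one-line check:

pnorm(0.6, mean=0.0, sd=0.2)   # prior P(b_badh < 0.6), roughly 0.9987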

mod2_string = " model {
    for (i in 1:length(numvisit)) {
        numvisit[i] ~ dpois(lam[i])
        log(lam[i]) = int + b_badh*badh[i] + b_age*age[i] + b_intx*age[i]*badh[i]
    }
    
    int ~ dnorm(0.0, 1.0/1e6)
    b_badh ~ dnorm(0.0, 1.0/0.2^2)
    b_age ~ dnorm(0.0, 1.0/1e4)
    b_intx ~ dnorm(0.0, 1.0/0.01^2)
} "

mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1127
   Unobserved stochastic nodes: 4
   Total graph size: 3672

Initializing model
update(mod2, 1e3)

mod2_sim = coda.samples(model=mod2,
                        variable.names=params,
                        n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))

How did the posterior distribution for the coefficient of badh change?

curve(dnorm(x, mean=0.0, sd=sqrt(1e4)), from=-3.0, to=3.0, ylim=c(0.0, 3.0), lty=2,
      main="b_badh", ylab="density", xlab="b_badh")
curve(dnorm(x, mean=0.0, sd=0.2), from=-3.0, to=3.0, col="red", lty=2, add=TRUE)
lines(density(mod_csim[,"b_badh"]))
lines(density(mod2_csim[,"b_badh"]), col="red")
legend("topleft", legend=c("noninformative prior", "posterior", "skeptical prior", "posterior"),
       lty=c(2,1,2,1), col=rep(c("black", "red"), each=2), bty="n")
Figure 48.8

Under the skeptical prior, the posterior distribution for b_badh has shifted down substantially, to roughly between 0.6 and 1.1. Although the strong prior influenced our inference about the magnitude of the bad-health effect on visits, it did not change the fact that the coefficient is clearly above 0. In other words, even under the skeptical prior, bad health is associated with more visits, with posterior probability near 1.
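That last statement can be quantified directly from the skeptical-prior samples (a quick check, not shown in the lesson):

mean(mod2_csim[,"b_badh"] > 0)   # posterior P(b_badh > 0) under the skeptical prior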

We should also check the effect of our skeptical prior on the interaction term involving both age and health.

curve(dnorm(x, mean=0.0, sd=sqrt(1e4)), from=-0.05, to=0.05, ylim=c(0.0, 140.0), lty=2,
      main="b_intx", ylab="density", xlab="b_intx")
curve(dnorm(x, mean=0.0, sd=0.01), from=-0.05, to=0.05, col="red", lty=2, add=TRUE)
lines(density(mod_csim[,"b_intx"]))
lines(density(mod2_csim[,"b_intx"]), col="red")
legend("topleft", legend=c("noninformative prior", "posterior", "skeptical prior", "posterior"),
       lty=c(2,1,2,1), col=rep(c("black", "red"), each=2), bty="n")
Figure 48.9
mean(mod2_csim[,"b_intx"] > 0) # posterior probability that b_intx is positive
[1] 0.9400667

The result here is interesting: our estimate of the interaction coefficient has gone from negative under the noninformative prior to positive under the skeptical prior, so this result is sensitive to the choice of prior. In this case, because the skeptical prior shrinks away much of the bad-health main effect, the interaction term likely absorbs some of the positive effect of bad health on visits. Thus, despite some observed prior sensitivity, our conclusion that bad health is positively associated with more visits remains unchanged.

48.5 Overdispersed model

Recall that the Negative Binomial can be used to model overdispersed count data.

Stan has three parameterizations of the Negative Binomial distribution.

The first looks similar to a binomial-style parameterization:

\text{NegBinomial}(y~|~\alpha,\beta) = \binom{y + \alpha - 1}{\alpha - 1} \, \left( \frac{\beta}{\beta+1} \right)^{\!\alpha} \, \left( \frac{1}{\beta + 1} \right)^{\!y} \!.

\mathbb{E}[y] = \frac{\alpha}{\beta} \ \ \text{ and } \ \ \text{Var}[Y] = \frac{\alpha}{\beta^2} (\beta + 1).

We can sample from this in Stan with the following statement:

n ~ neg_binomial(alpha, beta)

But this parameterization is not a natural match for the Poisson regression model, so we move on.

The second parameterization uses a mean \mu \in \mathbb{R}^+ and a dispersion \phi \in \mathbb{R}^+:

\mathrm{NegBinomial2}(n \mid \mu, \phi) = \binom{n + \phi - 1}{n} \, \left( \frac{\mu}{\mu+\phi} \right)^{\!n} \, \left( \frac{\phi}{\mu+\phi} \right)^{\!\phi}

\mathbb{E}[n] = \mu \ \ \text{ and } \ \ \ \mathbb{V}\text{ar}[n] = \mu + \frac{\mu^2}{\phi}

We can sample from this in Stan with the following statement:

n ~ neg_binomial_2(mu, phi)

There is a third, log-scale parameterization:

\mathrm{NegBinomial2Log}(y\mid\eta,\phi) = \mathrm{NegBinomial2}(y\mid \exp(\eta),\phi)

We can sample from this in Stan with the following statement:

y ~ neg_binomial_2_log(eta, phi)

JAGS has just one parameterization:

f(y \mid r, p) = \frac{\Gamma(y+r)}{\Gamma(r)\Gamma(y+1)}p^r(1-p)^y

We can think of the Negative Binomial distribution as giving the probability of observing y failures before the r-th success in a sequence of independent Bernoulli trials, where a success is drawing (with replacement) a white ball from an urn of white and black balls with success probability p.

\mathbb{E}[Y] = \mu = { r(1-p) \over p } \qquad \text{ and } \qquad \mathbb{V}\text{ar}[Y] = \mu + \frac{\mu^2}{r}
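As an aside, R's dnbinom() uses this same (p, r) parameterization (y counts the failures before the r-th success), so the density formula above can be checked numerically; the values of y0, r0, and p0 below are arbitrary.

# Aside: check the dnegbin density formula against R's dnbinom.
y0 = 4; r0 = 2.5; p0 = 0.4
dnbinom(y0, size=r0, prob=p0)
gamma(y0 + r0) / (gamma(r0) * gamma(y0 + 1)) * p0^r0 * (1 - p0)^y0   # same value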

48.5.1 Transformations

Since we want a model corresponding to Poisson regression, we reparameterize as follows.

If we set p = \frac{r}{r + \lambda}, then the mean becomes \lambda.

If we also set r = \frac{\lambda^2}{\omega}, equivalently \omega = \frac{\lambda^2}{r}, then the variance becomes \lambda + \omega, where \omega is our overdispersion term.

\begin{aligned} \mathbb{E}[Y] &= \frac{r(1-p)}{p} = \frac{r}{p} - r \\ &\stackrel{\text{sub } p}{=} \frac{r(r+\lambda)}{r} - r = \lambda \\[4pt] \mathbb{V}\text{ar}[Y] &= \frac{r(1-p)}{p^2} = \frac{\lambda}{p} \\ &\stackrel{\text{sub } p}{=} \lambda \, \frac{r+\lambda}{r} = \lambda + \frac{\lambda^2}{r} \\ &\stackrel{\text{sub } \omega}{=} \lambda + \omega \end{aligned}

where we interpret \lambda as the mean and \omega as the overdispersion.
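A quick numerical check of this reparameterization (an aside, with arbitrary values of \lambda and \omega):

# Sanity check of the (lambda, omega) reparameterization; values arbitrary.
lambda0 = 3.2; omega0 = 1.5
r0 = lambda0^2 / omega0
p0 = r0 / (r0 + lambda0)
r0 * (1 - p0) / p0      # mean: equals lambda0
r0 * (1 - p0) / p0^2    # variance: equals lambda0 + omega0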

library("rjags")
library("COUNT")
data("badhealth")
mod3_string = "
model {
    for (i in 1:length(numvisit)) { 
        mu[i]       = b0 + b_badh*badh[i] + b_age*age[i] + b_intx*age[i]*badh[i]
        lambda[i]   = exp(mu[i])
        p[i]        = r / (r + lambda[i])
        numvisit[i] ~ dnegbin(p[i], r)
        resid[i]       = numvisit[i] - p[i]
    }
    ## Priors
    b0        ~ dnorm(0.0, 1.0/1e6)
    b_badh    ~ dnorm(0.0, 1.0/0.2^2)
    b_age     ~ dnorm(0.0, 1.0/1e4)
    b_intx    ~ dnorm(0.0, 1.0/0.01^2)
    r ~ dunif(0,50)

    ## extra deterministic parameters
    omega      <-  pow(mean(lambda),2)/2
    #theta      <- pow(1/mean(p),2)
    #scale      <- mean((1-p)/p)
}"
data3_jags = as.list(badhealth)
mod3 = jags.model(textConnection(mod3_string), data=data3_jags, n.chains=3)
update(mod3, 1e3)
params3 = c("b_intx", "b_badh", "b_age", "b0", "omega", "r")
mod3_sim = coda.samples(model=mod3,  variable.names=params3, n.iter=5e3)
mod3_csim = as.mcmc(do.call(rbind, mod3_sim))
(dic3 = dic.samples(mod3, n.iter=1e3))
1. the linear part of the model
2. lambda corresponds to the rate parameter used in the Poisson regression
3. p is the success parameter
4. we draw numvisit[i] from the Negative Binomial distribution
5. sampling uses the (p, r) parameterization of the Negative Binomial distribution
6. normal prior for the intercept b0
7. normal prior for b_badh
8. normal prior for b_age
9. normal prior for b_intx
10. uniform prior for r; at the upper limit of 50 the NegBin converges to a Poisson, see (Jackman 2009, 280)
11. theta param (commented out)
12. scale param (commented out)
13. burn-in
14. sample
15. stack the samples from the chains
16. estimate the DIC
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1127
   Unobserved stochastic nodes: 5
   Total graph size: 4204

Initializing model

Mean deviance:  4478 
penalty 4.693 
Penalized deviance: 4483 
gelman.diag(mod3_sim )
Potential scale reduction factors:

       Point est. Upper C.I.
b0              1          1
b_age           1          1
b_badh          1          1
b_intx          1          1
omega           1          1
r               1          1

Multivariate psrf

1
raftery.diag(mod3_sim)
[[1]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 
                                              
        Burn-in  Total Lower bound  Dependence
        (M)      (N)   (Nmin)       factor (I)
 b0     28       29580 3746         7.90      
 b_age  24       24360 3746         6.50      
 b_badh 9        9892  3746         2.64      
 b_intx 12       14032 3746         3.75      
 omega  2        3930  3746         1.05      
 r      5        5577  3746         1.49      


[[2]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 
                                              
        Burn-in  Total Lower bound  Dependence
        (M)      (N)   (Nmin)       factor (I)
 b0     30       30584 3746         8.16      
 b_age  33       35550 3746         9.49      
 b_badh 12       13382 3746         3.57      
 b_intx 12       15768 3746         4.21      
 omega  3        4267  3746         1.14      
 r      5        5973  3746         1.59      


[[3]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 
                                              
        Burn-in  Total Lower bound  Dependence
        (M)      (N)   (Nmin)       factor (I)
 b0     48       52160 3746         13.90     
 b_age  20       20144 3746          5.38     
 b_badh 9        10098 3746          2.70     
 b_intx 14       15120 3746          4.04     
 omega  3        4129  3746          1.10     
 r      5        5871  3746          1.57     
autocorr.diag(mod3_sim)
              b0     b_age      b_badh       b_intx         omega            r
Lag 0  1.0000000 1.0000000  1.00000000  1.000000000  1.0000000000  1.000000000
Lag 1  0.9436188 0.9431087  0.77898397  0.788592589  0.0067402011  0.239142676
Lag 5  0.7847874 0.7841002  0.33811676  0.355473616 -0.0019071799  0.009177638
Lag 10 0.6265807 0.6270416  0.11169413  0.127104787  0.0007097493 -0.008166353
Lag 50 0.1173577 0.1125394 -0.01567314 -0.009216688  0.0148101740  0.003385658
effectiveSize(mod3_sim)
        b0      b_age     b_badh     b_intx      omega          r 
  351.4081   353.8235  1601.5942  1527.7504 15000.0000  9360.4345 
summary(mod3_sim)

Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

           Mean       SD  Naive SE Time-series SE
b0     0.464384 0.132188 1.079e-03      0.0070495
b_age  0.005724 0.003453 2.819e-05      0.0001835
b_badh 0.372397 0.166535 1.360e-03      0.0041627
b_intx 0.015025 0.004280 3.495e-05      0.0001098
omega  2.779121 0.219241 1.790e-03      0.0017902
r      0.986804 0.068431 5.587e-04      0.0007074

2. Quantiles for each variable:

             2.5%     25%      50%      75%   97.5%
b0      0.2077872 0.37512 0.464201 0.555332 0.72077
b_age  -0.0009516 0.00334 0.005728 0.008074 0.01250
b_badh  0.0482848 0.25919 0.372467 0.486122 0.69876
b_intx  0.0068151 0.01205 0.015025 0.017924 0.02344
omega   2.3821992 2.62603 2.767314 2.919679 3.23466
r       0.8607627 0.94022 0.984471 1.030820 1.12922
par(mar = c(2.5, 1, 2.5, 1))
plot(mod3_sim, auto.layout = FALSE)
Figures 48.10–48.21: Trace and density plots for the parameters of the overdispersed model (b0, b_age, b_badh, b_intx, omega, r).
autocorr.plot(mod3_csim,auto.layout = FALSE)
Figures 48.22–48.27: Autocorrelation plots for the six model parameters.
X = as.matrix(badhealth[,-1])
X = cbind(X, with(badhealth, badh*age))
(pmed_coef = apply(mod3_csim, 2, median))
         b0       b_age      b_badh      b_intx       omega           r 
0.464200784 0.005728099 0.372466918 0.015024677 2.767314048 0.984471272 
(r = pmed_coef["r"] )
        r 
0.9844713 
mu_hat = pmed_coef["b0"] + X %*% pmed_coef[c("b_badh", "b_age", "b_intx")]
lambda_hat = exp(mu_hat)
p_hat = r / (r + lambda_hat)
hist(lambda_hat)

hist(p_hat)

residuals

resid = badhealth$numvisit - p_hat
head(resid)
         [,1]
[1,] 29.69255
[2,] 19.68765
[3,] 15.67522
[4,] 19.69133
[5,] 14.66125
[6,] 14.65481
plot(resid) # the data were ordered
Figure 48.28: Plot of residuals
Table 48.1: First few rows of mod3_csim
head(mod3_csim)
Markov Chain Monte Carlo (MCMC) output:
Start = 1 
End = 7 
Thinning interval = 1 
            b0       b_age    b_badh     b_intx    omega         r
[1,] 0.2879316 0.010364354 0.3934882 0.01463083 2.804355 0.8865159
[2,] 0.2538401 0.011643202 0.4727899 0.01351479 2.943405 0.9366237
[3,] 0.2404547 0.010677055 0.4433926 0.01301887 2.582592 0.9236367
[4,] 0.3180986 0.009188870 0.5082574 0.01402866 2.838893 1.0248221
[5,] 0.3373588 0.009508351 0.3355875 0.01453021 2.803520 0.9749444
[6,] 0.3297522 0.007936156 0.3450995 0.01581101 2.521498 0.9674048
[7,] 0.3872685 0.008064847 0.3154034 0.01799401 2.965153 0.9911909
plot(p_hat, badhealth$numvisit)
abline(0.0, 1.0)
Figure 48.29: Plot of p_hat vs numvisit
plot(p_hat[which(badhealth$badh==0)], resid[which(badhealth$badh==0)], xlim=range(p_hat), ylab="residuals", xlab=expression(hat(p)), ylim=range(resid))
points(p_hat[which(badhealth$badh==1)], resid[which(badhealth$badh==1)], col="red")
Figure 48.30: Plot of p_hat vs residuals, colored by health status
var(resid[which(badhealth$badh==0)])
[1] 7.061266
var(resid[which(badhealth$badh==1)])
[1] 41.21237