16  M3L7 - Binomial Data

Bayesian Statistics: From Concept to Data Analysis

Outline of probability
Bayesian Statistics
Author

Oren Bochman

Keywords

Binomial Data, Binomial Distribution, Bernoulli Distribution, Beta Distribution

16.1 Bernoulli/Binomial likelihood with a uniform prior

Figure 16.1: Binomial likelihood with a Uniform prior

When we use a uniform prior for a Bernoulli likelihood, we get a beta posterior.

The Bernoulli likelihood of \vec Y \mid \theta is

{\color{green}f(\vec Y \mid \theta) = {\theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}}} \qquad \text{Bernoulli Likelihood}

Our prior for \theta is just a Uniform distribution

{\color{red}f(\theta) = \mathbb{I}_{\{0 \le \theta \le 1\}} }\qquad \text {Uniform prior}

Thus our posterior for \theta is \begin{aligned} f(\theta \mid y) & = \frac{f(y \mid \theta) f(\theta)}{\int f(y \mid \theta)f(\theta) \, d\theta} & \text{Bayes law} \\ & = \frac{\theta^{\sum{y_i}} (1 - \theta)^{n - \sum{y_i}} \mathbb{I}_{\{0 \le \theta \le 1\}}}{\int_0^1 \theta^{\sum{y_i}}(1 - \theta)^{n - \sum{y_i}} \mathbb{I}_{\{0 \le \theta \le 1\}} \, d\theta} & \text{subst. Likelihood \& Prior} \\ & = \frac{\theta^{\sum{y_i}} (1-\theta)^{n - \sum{y_i}} \mathbb{I}_{\{0 \le \theta \le 1\}}}{\frac{\Gamma(\sum{y_i} + 1)\Gamma(n - \sum{y_i} + 1)}{\Gamma(n + 2)} \cancel{\int_0^1 \frac{\Gamma(n + 2)}{\Gamma(\sum{y_i} + 1) \Gamma(n - \sum{y_i} + 1)} \theta^{\sum{y_i}} (1 - \theta)^{n - \sum{y_i}} \, d\theta}} & \text{Beta PDF integrates to 1} \\ & = \frac{\Gamma(n + 2)}{\Gamma(\sum{y_i}+ 1) \Gamma(n - \sum{y_i}+ 1)} \theta^{\sum{y_i}}(1 - \theta)^{n - \sum{y_i}} \mathbb{I}_{\{0 \le \theta \le 1\}} & \text{simplifying} \\ & = \mathrm{Beta} \left (\sum{y_i} + 1, n - \sum{y_i} + 1 \right ) \end{aligned}

Here we used the trick of recognizing the integrand in the denominator as an unnormalized Beta density (Equation C.7): multiplying and dividing by the Beta normalizing constant puts the integrand into the exact form of a Beta density, which integrates to 1, so the integral cancels. From here we can see that the posterior follows a Beta distribution

\theta \mid y \sim \mathrm{Beta}\left (\sum{y_i} + 1, n - \sum{y_i} + 1 \right )
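
As a quick sanity check in R, here is a minimal sketch with made-up data (7 successes out of n = 10 Bernoulli trials) that plots the resulting posterior density:

# Hypothetical data: 7 successes out of n = 10 Bernoulli trials
n = 10
sum_y = 7

# Under a uniform prior the posterior is Beta(sum_y + 1, n - sum_y + 1)
theta = seq(from = 0, to = 1, by = 0.01)
plot(theta, dbeta(theta, sum_y + 1, n - sum_y + 1), type = 'l',
     ylab = 'density', main = 'Posterior under a uniform prior')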

Figure 16.2: R.A. Fisher
Historical Note on Sir Ronald Aylmer Fisher FRS

R.A. Fisher’s objection to the Bayesian approach, that “The theory of inverse probability is founded upon an error, and must be wholly rejected” (Fisher 1925), was specifically referring to this example of a “Binomial with a Uniform prior”. The gist of it is that the posterior depends on the parametrization of the prior (Aldrich 2008). Sir Harold Jeffreys FRS, who corresponded with Fisher, went on to develop his eponymous priors, which are invariant to reparametrization; we will consider them in Section 25.2.

16.2 Conjugate Priors

Figure 16.3: Conjugate Priors

The Uniform distribution is \mathrm{Beta}(1, 1)

Any beta distribution is conjugate for the Bernoulli distribution. Any beta prior will give a beta posterior.

f(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1-\theta)^{\beta -1}\mathbb{I}_{\{0 \le \theta \le 1\}}

f(\theta \mid y) \propto f(y \mid \theta)f(\theta) = \theta^{\sum{y_i}}(1-\theta)^{n - \sum{y_i}}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}\mathbb{I}_{\{0 \le \theta \le 1\}}

f(y \mid\theta)f(\theta) \propto \theta^{\alpha + \sum{y_i}-1}(1-\theta)^{\beta + n - \sum{y_i} - 1}

Thus we see that this is a beta distribution

\theta \mid y \sim \mathrm{Beta}(\alpha + \sum{y_i}, \beta + n - \sum{y_i})

When \alpha and \beta are both one, as in the uniform prior, we recover the same result as earlier.
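
Since the update just adds counts to the prior parameters, it is easy to wrap in a small helper. Below is a minimal sketch (the function name beta_update and the example counts are mine, not from the lecture):

# Beta-Bernoulli conjugate update: prior Beta(a, b), data = successes out of n trials
beta_update = function(a, b, successes, n) {
  c(alpha = a + successes, beta = b + n - successes)
}

# With the uniform prior Beta(1, 1) and, say, 7 successes in 10 trials
beta_update(1, 1, successes = 7, n = 10)  # Beta(8, 4), matching the formula above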

This whole process where we choose a particular form of prior that works with a likelihood is called using a conjugate family.

A family of distributions is referred to as conjugate if when you use a member of that family as a prior, you get another member of that family as your posterior.

The beta distribution is conjugate for the Bernoulli distribution. It’s also conjugate for the binomial distribution. The only difference in the binomial likelihood is that there is a combinatorics term. Since that does not depend on \theta, we get the same posterior.

We often use conjugate priors because they make life much simpler: sticking to conjugate families allows us to get closed-form solutions easily.

If the family is flexible enough, then you can find a member of that family that closely represents your beliefs.

  • the Uniform distribution can be written as the \mathrm{Beta}(1,1) prior.
  • Any Beta prior will result in a Beta posterior.
  • Beta is conjugate for Binomial and for Bernoulli

16.3 Posterior mean and effective sample size

Figure 16.4: Effective Sample Size

Returning to the Beta posterior model, it is clear how both the prior and the data contribute to the posterior.

For a prior \mathrm{Beta}(\alpha,\beta) we can say that the effective sample size of the prior is

\alpha + \beta \qquad \text {(ESS)} \tag{16.1}

Recall that the expected value or mean of a Beta distribution is \frac{\alpha}{\alpha + \beta}

Therefore we can derive the posterior mean as

\begin{aligned} \text{posterior mean} &= \frac{\alpha + \sum{y_i}}{\alpha + \sum{y_i}+\beta + n - \sum{y_i}} \\ &= \frac{\alpha+\sum{y_i}}{\alpha + \beta + n} \\ &= \frac{\alpha + \beta}{\alpha + \beta + n}\cdot\frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n}\cdot\frac{\sum{y_i}}{n} \\ &= (\text{prior weight} \times \text{prior mean}) + (\text{data weight} \times \text{data mean}) \end{aligned} \tag{16.2}

i.e. The posterior mean is a weighted average of the prior mean and the data mean.
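
We can verify this decomposition numerically. The sketch below uses a Beta(8, 4) prior and 33 successes in 40 trials, the same numbers that appear in the worked example in Section 16.4:

a = 8; b = 4          # prior Beta(8, 4)
n = 40; sum_y = 33    # data: 33 successes in 40 trials

prior_mean   = a / (a + b)
data_mean    = sum_y / n
prior_weight = (a + b) / (a + b + n)
data_weight  = n / (a + b + n)

# both expressions give the same posterior mean, 41/52
prior_weight * prior_mean + data_weight * data_mean
(a + sum_y) / (a + b + n)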

This effective sample size gives you an idea of how much data you would need to make sure that your prior does not have much influence on your posterior.

If \alpha + \beta is small compared to n then the posterior will largely just be driven by the data. If \alpha + \beta is large relative to n then the posterior will be largely driven by the prior.

We can make a 95% credible interval using our posterior distribution for \theta . We can find an interval that has 95 \% probability of containing \theta.

Using Bayesian Statistics we can do sequential analysis by doing a sequential update every time we get new data. We can get a new posterior, and we just use our previous Posterior as a Prior for doing another update using Bayes’ theorem.
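
Here is a minimal sketch of both ideas (the batch sizes below are made up): updating in two sequential batches, with each posterior serving as the prior for the next batch, gives the same answer as a single update on all the data, and qbeta gives an equal-tailed 95% credible interval from the final posterior.

# start from a uniform prior Beta(1, 1)
a = 1; b = 1

# batch 1: 12 successes in 20 trials; the posterior becomes the new prior
a = a + 12; b = b + (20 - 12)

# batch 2: 9 successes in 15 trials
a = a + 9; b = b + (15 - 9)

c(a, b)  # Beta(22, 15), the same as one update with 21 successes in 35 trials

# equal-tailed 95% credible interval for theta
qbeta(c(0.025, 0.975), a, b)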

Figure 16.5: ESS algorithms
  • for a Beta prior, its effective sample size is \alpha + \beta
  • if n >> \alpha+\beta the posterior will be predominantly determined by the data
  • if n << \alpha+\beta the posterior will be predominantly determined by the prior
  • the effective sample size of the prior is a useful concept to work with.
  • (Wiesenfarth and Calderazzo 2020)
Figure 16.6: 200 year meteorological record
Figure 16.7: 800K year ice core record
Figure 16.8: 5M year deep sea record

Exercise 16.1 (Discussion on Prior elicitation)  

Suppose we are interested in global temperatures, and that we have a summary measure that represents the average global temperature for each year. Now we could ask “What is the probability that next year will have a higher average global temperature than this year?” What would be your choice of prior and why? Be specific about the distribution and its parameters. You may use any other information that you want to bring into this problem.

It is possible to get historical estimates using:

  • Meteorological and satellite records for the last 200 years, Figure 16.6

  • Ice cores for the last 800,000 years, Figure 16.7

  • Deep sea sediment oxygen-18 isotope fractionation for the last 5 million years, Figure 16.8

In particular, we can use yearly temperature data from 1850 until today based on meteorological readings, or the Greenland ice core data covering 800,000 years.

One simple way is to model the yearly temperature as a random walk: each year is a Bernoulli trial where success means the temperature got warmer than the previous year. We can then use the historical record to estimate \theta, the probability that a given year is warmer than the last.

I suppose we can use a Beta prior whose parameters come from the historical counts: \alpha the number of years in which the temperature increased and \beta the number of years in which it did not, so that the prior mean is the historical proportion of years that were hotter than the previous one, i.e. our prior estimate of \theta.
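
A minimal sketch of this elicitation in R (the counts below are purely hypothetical and stand in for whatever the historical record actually shows):

# hypothetical historical record: 120 of 170 year-to-year transitions were warmer
warmer = 120
not_warmer = 50

# prior for theta = P(next year is warmer than this year)
alpha = warmer; beta = not_warmer
alpha / (alpha + beta)               # prior mean: historical proportion of warmer years
qbeta(c(0.025, 0.975), alpha, beta)  # equal-tailed 95% prior interval for theta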

16.4 Data Analysis Example in R

Suppose we’re giving two students a multiple-choice exam with 40 questions, where each question has four choices. We don’t know how much the students have studied for this exam, but we think that they’ll do better than just guessing randomly.

  1. What are the parameters of interest?

The parameters of interest are \theta_1, the true probability that the first student will answer a question correctly, and \theta_2, the true probability that the second student will answer a question correctly.

  2. What is our likelihood?

The likelihood is \mathrm{Binomial}(40, \theta) if we assume that each question is independent and that the probability a student gets each question right is the same for all questions for that student.

  3. What prior should we use?

The conjugate prior is a Beta distribution. We can plot its density with dbeta.

theta = seq(from = 0, to = 1, by = 0.01)  # fine grid so the density curves plot smoothly
# Uniform prior, Beta(1, 1)
plot(theta, dbeta(theta, 1, 1), type = 'l')

# Prior mean 2/3
plot(theta, dbeta(theta, 4, 2), type = 'l')

# Prior mean 2/3 but larger effective sample size (more concentrated at the mean)
plot(theta, dbeta(theta, 8, 4), type = 'l')

  4. What are the prior probabilities \mathbb{P}r(\theta > 0.25)? \mathbb{P}r(\theta > 0.5)? \mathbb{P}r(\theta > 0.8)?
1 - pbeta(0.25, 8, 4)
[1] 0.9988117
1 - pbeta(0.5, 8, 4)
[1] 0.8867188
1 - pbeta(0.8, 8, 4)
[1] 0.1611392
  5. Suppose the first student gets 33 questions right. What is the posterior distribution for \theta_1 ? \mathbb{P}r(\theta > 0.25) ? \mathbb{P}r(\theta > 0.5) ? \mathbb{P}r(\theta > 0.8) ? What is the 95% posterior credible interval for \theta_1?

\text{Posterior} \sim \mathrm{Beta}(8 + 33, 4 + 40 - 33) = \mathrm{Beta}(41, 11)

With a posterior mean of \frac{41}{41+11} = \frac{41}{52}

We can plot the posterior distribution with the prior

plot(theta, dbeta(theta, 41, 11), type = 'l')
lines(theta, dbeta(theta, 8 ,4), lty = 2) #Dashed line for prior

Posterior probabilities

1 - pbeta(0.25, 41, 11)
[1] 1
1 - pbeta(0.5, 41, 11)
[1] 0.9999926
1 - pbeta(0.8, 41, 11)
[1] 0.4444044

Equal-tailed 95% credible interval

qbeta(0.025, 41, 11)
[1] 0.6688426
qbeta(0.975, 41, 11)
[1] 0.8871094

There is a 95% posterior probability that \theta_1 is between 0.67 and 0.89.

  6. Suppose the second student gets 24 questions right. What is the posterior distribution for \theta_2? \mathbb{P}r(\theta > 0.25)? \mathbb{P}r(\theta > 0.5)? \mathbb{P}r(\theta > 0.8)? What is the 95% posterior credible interval for \theta_2?

\text{Posterior} \sim \mathrm{Beta}(8 + 24, 4 + 40 - 24) = \mathrm{Beta}(32, 20)

With a posterior mean of \frac{32}{32+20} = \frac{32}{52}

We can plot the posterior distribution with the prior

plot(theta, dbeta(theta, 32, 20), type = 'l')
lines(theta, dbeta(theta, 8 ,4), lty = 2) #Dashed line for prior

Posterior probabilities

1 - pbeta(0.25, 32, 20)
[1] 1
1 - pbeta(0.5, 32, 20)
[1] 0.9540427
1 - pbeta(0.8, 32, 20)
[1] 0.00124819

Equal-tailed 95% credible interval

qbeta(0.025, 32, 20)
[1] 0.4808022
qbeta(0.975, 32, 20)
[1] 0.7415564

There is a 95% posterior probability that \theta_2 is between 0.48 and 0.74.

  7. What is the posterior probability that \theta_1 > \theta_2?

i.e., that the first student has a better chance of getting a question right than the second student?

Estimate by simulation: draw a large number of samples (100,000 below) from each posterior and see how often we observe \theta_1 > \theta_2.

theta1 = rbeta(100000, 41, 11)
theta2 = rbeta(100000, 32, 20)
mean(theta1 > theta2)
[1] 0.97561