---
title: 'The MCMC algorithm for Zero-Inflated Mixtures - M4L1HW1'
subtitle: 'Bayesian Statistics: Mixture Models'
author: 'Oren Bochman'
categories:
- Bayesian Statistics
keywords:
- Mixture Models
- Homework
- Zero-Inflated Data
- Zero-Inflated Poisson
- MCMC
---
::::: {.content-visible unless-profile="HC"}
::: {.callout-caution}
Section omitted to comply with the Honor Code
:::
:::::
::::: {.content-hidden unless-profile="HC"}
## HW - MCMC for a Zero-Inflated Poisson mixture model
In the lessons we mentioned that Zero-Inflated Poisson (ZIP) models arise naturally in biology when analyzing nest-size data, since many birds fail to mate or to lay eggs and therefore contribute nests with zero eggs.
A plain Poisson model tends to underestimate the number of empty nests in such data and to overestimate the number of nests with one or two eggs. A negative binomial model can mitigate this by adding a tunable parameter that controls the dispersion of the counts, but it is not necessarily a good fix for zero-inflated data.
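Before fitting anything, a quick simulation (with made-up values of $w$ and $\lambda$, purely for illustration) shows why a single Poisson with the same mean cannot reproduce the excess of zeros:

```{r}
#| label: lbl-zip-vs-poisson-zeros
# Toy illustration: a ZIP sample has many more zeros than a Poisson sample
# with the same mean (w.toy and lambda.toy are arbitrary, not estimates)
set.seed(42)
w.toy = 0.4; lambda.toy = 2; n.toy = 10000
zip = ifelse(runif(n.toy) < w.toy, 0, rpois(n.toy, lambda.toy))
poi = rpois(n.toy, mean(zip))   # Poisson matched to the ZIP sample mean
c(zip.zeros = mean(zip == 0), poisson.zeros = mean(poi == 0))
```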
In this exercise, we fit a ZIP mixture model to a dataset of nest sizes using an MCMC (Gibbs) sampler.
::: {.callout-note}
### Instructions
- A biologist is interested in characterizing the number of eggs laid by a particular bird species. To do this, they sample $n = 300$ nests at a site in Southern California. The observations are contained in the attached file `data/nestsize.csv`.
- Generate a barplot with the empirical frequencies of all the integers included in the sample.
As you can see, the Poisson distribution underestimates the number of empty nests in the data and overestimates the number of nests with either 1 or 2 eggs. To address this, you are asked to modify the implementation of the MCMC algorithm contained in the Reading "Sample code for MCMC example 1" so that you can fit a mixture between a point mass at zero and a Poisson distribution (we call this a "Zero-Inflated Poisson" distribution):
$$
f(x) = w \delta_0(x) + (1-w) \frac{e^{-\lambda} \lambda^x}{x!}, \quad x \in \{0, 1, 2, \ldots\}
$$ {#eq-zip-mixture}
where $w$ is the weight associated with the point mass at zero, $\lambda$ is the parameter of the Poisson distribution, and $\delta_0(x)$ represents the degenerate distribution placing all of its mass at zero.
- You should then run your algorithm with the data contained in `nestsize.csv` and report the values of the estimates that you obtain, rounded to two decimal places.
:::
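For reference, the pmf in @eq-zip-mixture is easy to evaluate directly. The tiny helper below is only a sketch to fix ideas (the name `dzip` is my own, not part of the course materials): it adds the point mass at zero to a down-weighted Poisson pmf, and its values sum to one over the support.

```{r}
#| label: lbl-dzip-helper
# Zero-Inflated Poisson pmf from @eq-zip-mixture (helper name is mine)
dzip <- function(x, w, lambda) {
  w * (x == 0) + (1 - w) * dpois(x, lambda)
}
# Quick sanity check with arbitrary values: the probabilities sum to ~1
sum(dzip(0:100, w = 0.3, lambda = 2.5))
```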
::: {.callout-note}
### Grading overview
The code you generate should follow the same structure as "Sample code for MCMC example 1". Peer reviewers will be asked to check whether the different pieces of code have been adequately modified to reflect the fact that
1. you provided a reasonable initial point for your algorithm,
2. the full conditional probabilities $v_{i,k}$ used to sample the indicators are computed correctly,
3. the full conditional distributions for the weight $w$ and the rate $\lambda$ are correct,
4. the number of iterations and the burn-in period are appropriate, and
5. the numerical values that you obtain are correct.
To simplify the peer-review process, assume that component 1 corresponds to the point mass at zero, while component 2 corresponds to the Poisson distribution.
There are two things that make this problem more challenging than the ones we have used for illustrations so far:
1. the two components in the mixture belong to different families, and
2. each component has a very different support.
Keep these two circumstances in mind when working on your answer.
:::
## Full conditional distributions {#sec-zip-mixture-fc}
The full conditional distributions for the indicators of the ZIP model are given by:
$$
\mathbb{P}r(c_i = 1 \mid \cdots) \propto
\begin{cases}
w & x_i=0 \\
0 & \mbox{otherwise}
\end{cases}
$$ {#eq-zip-mixture-fc-c1}
$$
\mathbb{P}r(c_i = 2 \mid \cdots) \propto
\begin{cases}
(1-w) \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} & x_i=0 \\
1 & \mbox{otherwise}
\end{cases}
$$ {#eq-zip-mixture-fc-c2}
where $c_i$ is the latent indicator for observation $i$, and $x_i$ is the observed data.
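In particular, for an observation with $x_i = 0$, normalizing the two cases above gives

$$
\mathbb{P}r(c_i = 1 \mid \cdots) = \frac{w}{w + (1-w)e^{-\lambda}}, \qquad
\mathbb{P}r(c_i = 2 \mid \cdots) = \frac{(1-w)e^{-\lambda}}{w + (1-w)e^{-\lambda}},
$$

while for $x_i > 0$ the observation is assigned to the Poisson component with probability one.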
Assuming a uniform Beta(1,1) prior on $w$, its full conditional distribution is given by:
$$
w \mid \cdots \sim \mbox{Beta}\left(m(\mathbf{c})+1, n-m(\mathbf{c})+1\right)
$$ {#eq-zip-mixture-fc-w}
where $m(\mathbf{c})$ is the number of observations with $c_i=1$.
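To see this, combine the uniform prior with the fact that, given the indicators, exactly $m(\mathbf{c})$ of the $n$ observations are assigned to the point mass:

$$
p(w \mid \cdots) \propto \underbrace{1}_{\text{Beta}(1,1) \text{ prior}} \times \, w^{m(\mathbf{c})} (1-w)^{n - m(\mathbf{c})},
$$

which is the kernel of the Beta distribution above.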
Similarly, assuming a Gamma(1,1) prior on the rate $\lambda$, its full conditional is
$$
\lambda \mid \cdots \sim \mbox{Gamma}\left( 1 + \sum_{i : c_i = 2} x_i , 1 + n-m(\mathbf{c}) \right)
$$ {#eq-zip-mixture-fc-lambda}
where $n - m(\mathbf{c})$ is the number of observations assigned to the Poisson component ($c_i = 2$).
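Combining this prior with the Poisson likelihood of the observations assigned to component 2 gives

$$
p(\lambda \mid \cdots) \propto e^{-\lambda} \prod_{i:\, c_i = 2} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}
\propto \lambda^{\sum_{i:\, c_i = 2} x_i} \, e^{-\lambda \left(1 + n - m(\mathbf{c})\right)},
$$

which is the kernel of the Gamma distribution above.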
:::{.callout-note collapse="true"}
## Sample code for MCMC example 1
```{r}
#| label: lbl-sample-code-for-MCMC-example-1
#### Example of an MCMC algorithm for fitting a location mixture of 2 Gaussian components
#### The algorithm is tested using simulated data
## Clear the environment and load required libraries
rm(list=ls())
library(MCMCpack)
set.seed(81196) # So that results are reproducible
## Generate data from a mixture with 2 components
KK = 2
w.true = 0.6 # True weights associated with the components
mu.true = rep(0, KK)
mu.true[1] = 0 # True mean for the first component
mu.true[2] = 5 # True mean for the second component
sigma.true = 1 # True standard deviation of all components
n = 120 # Number of observations to be generated
cc.true = sample(1:KK, n, replace=T, prob=c(w.true,1-w.true))
x = rep(0, n)
for(i in 1:n){
  x[i] = rnorm(1, mu.true[cc.true[i]], sigma.true)
}
# Plot the true density
par(mfrow=c(1,1))
xx.true = seq(-8,11,length=200)
yy.true = w.true*dnorm(xx.true, mu.true[1], sigma.true) +
          (1-w.true)*dnorm(xx.true, mu.true[2], sigma.true)
plot(xx.true, yy.true, type="l", xlab="x", ylab="True density", lwd=2)
points(x, rep(0,n), col=cc.true)
## Initialize the parameters
w = 1/2 #Assign equal weight to each component to start with
mu = rnorm(KK, mean(x), sd(x)) #Random cluster centers randomly spread over the support of the data
sigma = sd(x) #Initial standard deviation
# Plot the initial guess for the density
xx = seq(-8,11,length=200)
yy = w*dnorm(xx, mu[1], sigma) + (1-w)*dnorm(xx, mu[2], sigma)
plot(xx, yy, type="l", ylim=c(0, max(yy)), xlab="x",
     ylab="Initial density", lwd=2)
points(x, rep(0,n), col=cc.true)
## The actual MCMC algorithm starts here
# Priors
aa = rep(1,KK) # Uniform prior on w
eta = 0 # Mean 0 for the prior on mu_k
tau = 5 # Standard deviation 5 on the prior for mu_k
dd = 2
qq = 1
# Number of iterations of the sampler
rrr = 6000
burn = 1000
# Storing the samples
cc.out = array(0, dim=c(rrr, n))
w.out = rep(0, rrr)
mu.out = array(0, dim=c(rrr, KK))
sigma.out = rep(0, rrr)
logpost = rep(0, rrr)
# MCMC iterations
for(s in 1:rrr){
  # Sample the indicators
  cc = rep(0,n)
  for(i in 1:n){
    v = rep(0,KK)
    v[1] = log(w) + dnorm(x[i], mu[1], sigma, log=TRUE)    #Compute the log of the weights
    v[2] = log(1-w) + dnorm(x[i], mu[2], sigma, log=TRUE)  #Compute the log of the weights
    v = exp(v - max(v))/sum(exp(v - max(v)))
    cc[i] = sample(1:KK, 1, replace=TRUE, prob=v)
  }
  # Sample the weights
  w = rbeta(1, aa[1] + sum(cc==1), aa[2] + sum(cc==2))
  # Sample the means
  for(k in 1:KK){
    nk = sum(cc==k)
    xsumk = sum(x[cc==k])
    tau2.hat = 1/(nk/sigma^2 + 1/tau^2)
    mu.hat = tau2.hat*(xsumk/sigma^2 + eta/tau^2)
    mu[k] = rnorm(1, mu.hat, sqrt(tau2.hat))
  }
  # Sample the variances
  dd.star = dd + n/2
  qq.star = qq + sum((x - mu[cc])^2)/2
  sigma = sqrt(rinvgamma(1, dd.star, qq.star))
  # Store samples
  cc.out[s,] = cc
  w.out[s] = w
  mu.out[s,] = mu
  sigma.out[s] = sigma
  for(i in 1:n){
    if(cc[i]==1){
      logpost[s] = logpost[s] + log(w) + dnorm(x[i], mu[1], sigma, log=TRUE)
    }else{
      logpost[s] = logpost[s] + log(1-w) + dnorm(x[i], mu[2], sigma, log=TRUE)
    }
  }
  logpost[s] = logpost[s] + dbeta(w, aa[1], aa[2], log = T)
  for(k in 1:KK){
    logpost[s] = logpost[s] + dnorm(mu[k], eta, tau, log = T)
  }
  logpost[s] = logpost[s] + log(dinvgamma(sigma^2, dd, 1/qq))
  if(s/500==floor(s/500)){
    print(paste("s =",s))
  }
}
## Plot the logposterior distribution for various samples
par(mfrow=c(1,1))
par(mar=c(4,4,1,1)+0.1)
plot(logpost, type="l", xlab="Iterations", ylab="Log posterior")
xx = seq(-8,11,length=200)
density.posterior = array(0, dim=c(rrr-burn,length(xx)))
for(s in 1:(rrr-burn)){
  density.posterior[s,] = density.posterior[s,] + w.out[s+burn]*dnorm(xx,mu.out[s+burn,1],sigma.out[s+burn]) +
                                                  (1-w.out[s+burn])*dnorm(xx,mu.out[s+burn,2],sigma.out[s+burn])
}
density.posterior.m = apply(density.posterior , 2, mean)
density.posterior.lq = apply(density.posterior, 2, quantile, 0.025)
density.posterior.uq = apply(density.posterior, 2, quantile, 0.975)
par(mfrow=c(1,1))
par(mar=c(4,4,1,1)+0.1)
plot(xx, density.posterior.m, type="n",ylim=c(0,max(density.posterior.uq)), xlab="x", ylab="Density")
polygon(c(xx,rev(xx)), c(density.posterior.lq, rev(density.posterior.uq)), col="grey", border="grey")
lines(xx, density.posterior.m, lwd=2)
points(x, rep(0,n), col=cc.true)
```
:::
## My Solution code
```{r}
#| label: lbl-load-data
# Load the nest size data
rm(list=ls())
library(MCMCpack)
set.seed(81196) # So that the MCMC results are reproducible
nestsize <- read.csv("data/nestsize.csv",header=FALSE)
colnames(nestsize) <- c("n")
x <- nestsize$n
n <- length(x) # Number of observations
# How many rows are in the data?
nrow(nestsize)
# How many zeros are in x?
sum(x == 0)
# Almost half of the observations are zeros!
# Barplot of the empirical frequencies of all the integers in the sample
par(mfrow=c(1,1))
empirical.freq = table(factor(x, levels=0:max(x)))/n
barplot(empirical.freq, xlab="Number of eggs", ylab="Relative frequency",
        main="Empirical distribution of nest sizes")
```
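As a quick check (not part of the required deliverables), we can compare the observed fraction of empty nests with the zero probability implied by a single Poisson model fitted by maximum likelihood; as the assignment notes, the plain Poisson falls short of the observed proportion of zeros.

```{r}
#| label: lbl-check-zero-inflation
# Zero probability under a single Poisson with lambda set to the sample mean
lambda.mle = mean(x)   # MLE of lambda for a plain Poisson model
c(observed.zeros = mean(x == 0), poisson.zeros = dpois(0, lambda.mle))
```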
```{r}
#| label: lbl-zip-mix-1
# MCMC (Gibbs) algorithm for fitting a Zero-Inflated Poisson mixture
## Number of iterations of the sampler and burn-in period
iterations <- 6000
burn <- 1000
## Priors
aa = c(1, 1)             # Uniform (Beta(1,1)) prior on w
alpha = 1                # Gamma(alpha, beta) prior on lambda
beta = 1
## Initialize the parameters
w = 0.5                  # Initial weight of the point mass at zero
lambda = mean(x[x > 0])  # Initial lambda from the nonzero observations
cc = rep(2, n)           # Start every observation in the Poisson component ...
cc[x == 0] = sample(1:2, sum(x == 0), replace = TRUE)  # ... except zeros, which may come from either
# Storing the samples
w.out = numeric(iterations)
lambda.out = numeric(iterations)
cc.out = array(0, dim=c(iterations, n))
# MCMC iterations
for (s in 1:iterations) {
  # Sample the latent indicators c_i from their full conditional
  cc = numeric(n)
  for(i in 1:n){
    v = rep(0, 2)
    if(x[i] == 0){
      # A zero can come from the point mass (component 1) or the Poisson (component 2)
      v[1] = log(w)
      v[2] = log(1-w) + dpois(x[i], lambda, log=TRUE)
      v = exp(v - max(v))/sum(exp(v - max(v)))
    }else{
      # A nonzero count can only come from the Poisson component
      v[1] = 0
      v[2] = 1
    }
    cc[i] = sample(1:2, 1, replace=TRUE, prob=v)
  }
  # Sample the weight w from its Beta full conditional
  w = rbeta(1, aa[1] + sum(cc==1), aa[2] + sum(cc==2))
  # Sample the Poisson rate lambda from its Gamma full conditional
  lambda = rgamma(1, shape = alpha + sum(x[cc==2]), rate = beta + sum(cc==2))
  # Store samples
  w.out[s] = w
  lambda.out[s] = lambda
  cc.out[s,] = cc
}
```
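Before summarizing the posterior, it is worth eyeballing the chains. A minimal set of trace plots for $w$ and $\lambda$ (not required by the rubric, just a convergence sanity check) is sketched below; the dashed line marks the end of the burn-in period.

```{r}
#| label: lbl-trace-plots
# Trace plots of the sampled parameters to assess mixing and convergence
par(mfrow=c(2,1), mar=c(4,4,1,1)+0.1)
plot(w.out, type="l", xlab="Iteration", ylab="w")
abline(v=burn, lty=2)
plot(lambda.out, type="l", xlab="Iteration", ylab=expression(lambda))
abline(v=burn, lty=2)
```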
```{r}
#| label: lbl-posterior-means
# Posterior summaries
w.post = w.out[-(1:burn)]
lambda.post = lambda.out[-(1:burn)]
cat("Posterior mean of w:", round(mean(w.post),2), "\n")
cat("Posterior mean of lambda:", round(mean(lambda.post),2),"\n")
```
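Finally, mirroring the posterior density plot in the sample code, we can overlay the ZIP pmf evaluated at the posterior means (a simple plug-in summary rather than a full posterior predictive) on the empirical frequencies of the counts.

```{r}
#| label: lbl-fitted-vs-empirical
# Overlay the plug-in ZIP pmf (posterior means of w and lambda) on the
# empirical relative frequencies of the observed counts
xx = 0:max(x)
fitted.pmf = mean(w.post)*(xx == 0) + (1 - mean(w.post))*dpois(xx, mean(lambda.post))
bp = barplot(table(factor(x, levels=xx))/n, xlab="Number of eggs",
             ylab="Relative frequency", main="Empirical frequencies vs plug-in ZIP pmf")
points(bp, fitted.pmf, pch=19)   # dots: fitted ZIP probabilities
```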
:::::