76 Old Faithful eruptions density estimation with the EM algorithm - M4L7HW1

Caution

Section omitted to comply with the Honor Code

--- title : 'Old Faithful eruptions density estimation with the EM algorithm - M4L7HW1' subtitle : 'Bayesian Statistics: Mixture Models' categories: - Bayesian Statistics keywords: - Mixture Models - Homework - Density Estimation - EM algorithm - Location and scale of 2 Gaussians --- ::::: {.content-visible unless-profile="HC"} ::: {.callout-caution} Section omitted to comply with the Honor Code ::: ::::: ::::: {.content-hidden unless-profile="HC"} ## HW - The EM algorithm and density estimation In this exercise, we implement the EM algorithm to fit a location-and-scale mixture of 2 univariate Gaussian distributions to the duration of eruptions of the Old Faithful geyser. We will then use the fitted model to generate a density estimate of the marginal distribution of the duration of the eruptions. ::: {.callout-note} ### Instructions \index{dataset! old faithful} - The R dataset faithful contains data on waiting time between eruptions (the column named waiting) and the duration of the eruption (the column named eruptions) for the famous Old Faithful geyser in Yellowstone National Park, Wyoming, USA. - You are asked to modify the EM algorithm provided in "Sample code for density estimation problems" to provide a density estimate the marginal distribution of the duration of the eruptions using a location-and-scale mixture of 2 univariate Gaussian distributions (as opposed to the location mixture of 6 univariate Gaussian distributions that we used for the galaxies dataset). ::: ::: {.callout-note} ### Grading overview Reviewers will check whether the code has been modified correctly, and whether the density estimate you generate appears correct. Please remember that you are being asked to use a **location-and-scale mixture** to generate the density estimate, so the "Sample code for density estimation problems" cannot be used directly and requires some modification. Before submitting your answer, it might be useful to compare the density estimate generated by your algorithm against a kernel density estimate generated by the R function density(). While they should not be identical, they should be similar. ::: ```{r} #| label: lbl-location-scale-mixture ###### Setup data x = faithful$eruptions n = length(x) ###### EM algorithm to fit the location-and-scale mixture of 2 Gaussians w = 0.5 mu = c(mean(x)-sd(x), mean(x)+sd(x)) sigma = rep(sd(x)/4,2) s = 0 sw = FALSE QQ = -Inf QQ.out = NULL epsilon = 10^(-5) while(!sw){ ## E step v = array(0, dim=c(n,2)) for(i in 1:n){ v[i,1] = log(w) + dnorm(x[i], mu[1], sigma[1], log=T) v[i,2] = log(1-w) + dnorm(x[i], mu[2], sigma[2], log=T) v[i,] = exp(v[i,] - max(v[i,]))/sum(exp(v[i,] - max(v[i,]))) } ## M step # Weights w = mean(v[,1]) # Means mu = rep(0, 2) for(k in 1:2){ for(i in 1:n){ mu[k] = mu[k] + v[i,k]*x[i] } mu[k] = mu[k]/sum(v[,k]) } # Variances sigma = rep(0,2) for(k in 1:2){ for(i in 1:n){ sigma[k] = sigma[k] + v[i,k]*(x[i] - mu[k])^2 } sigma[k] = sqrt(sigma[k]/sum(v[,k])) } ## Check convergence QQn = 0 for(i in 1:n){ QQn = QQn + v[i,1]*(log(w) + dnorm(x[i],mu[1],sigma[1],log=TRUE)) + v[i,2]*(log(1-w) + dnorm(x[i],mu[2],sigma[2],log=TRUE)) } if(abs(QQn-QQ)/abs(QQn)<epsilon){ sw=TRUE } QQ = QQn QQ.out = c(QQ.out, QQ) s = s + 1 print(paste(s, QQn)) } ``` ```{r} #| label: lbl-density-estimate xx = seq(0,7,length=150) nxx = length(xx) density.EM = rep(0, nxx) for(s in 1:nxx){ density.EM[s] = density.EM[s] + w*dnorm(xx[s], mu[1], sigma[1]) + (1-w)*dnorm(xx[s], mu[2], sigma[2]) } ``` ```{r} #| label: fig-density-estimate #| fig-cap: "Density estimate of the duration of the eruptions using a location-and-scale mixture of 2 univariate Gaussian distributions" plot(xx, density.EM, col="red", lwd=2, type="l", xlab="Eruptions") points(x,rep(0,n)) title(main="Mixture of 2 Gaussians") ``` :::::