Caution
Section omitted to comply with the Honor Code
Bayesian Statistics: Mixture Models
Oren Bochman
Mixture Models, Homework, Density Estimation, EM algorithm, Location and scale of 2 Gaussians
Section omitted to comply with the Honor Code
---
title : 'Old Faithful eruptions density estimation with the EM algorithm - M4L7HW1'
subtitle : 'Bayesian Statistics: Mixture Models'
categories:
- Bayesian Statistics
keywords:
- Mixture Models
- Homework
- Density Estimation
- EM algorithm
- Location and scale of 2 Gaussians
---
::::: {.content-visible unless-profile="HC"}
::: {.callout-caution}
Section omitted to comply with the Honor Code
:::
:::::
::::: {.content-hidden unless-profile="HC"}
## HW - The EM algorithm and density estimation
In this exercise, we implement the EM algorithm to fit a location-and-scale mixture of 2 univariate Gaussian distributions to the duration of eruptions of the Old Faithful geyser. We will then use the fitted model to generate a density estimate of the marginal distribution of the duration of the eruptions.
::: {.callout-note}
### Instructions
\index{dataset! old faithful}
- The R dataset faithful contains data on waiting time between eruptions (the column named waiting) and the duration of the eruption (the column named eruptions) for the famous Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
- You are asked to modify the EM algorithm provided in "Sample code for density estimation problems" to provide a density estimate the marginal distribution of the duration of the eruptions using a location-and-scale mixture of 2 univariate Gaussian distributions (as opposed to the location mixture of 6 univariate Gaussian distributions that we used for the galaxies dataset).
:::
::: {.callout-note}
### Grading overview
Reviewers will check whether the code has been modified correctly, and whether the density estimate you generate appears correct. Please remember that you are being asked to use a **location-and-scale mixture** to generate the density estimate, so the "Sample code for density estimation problems" cannot be used directly and requires some modification. Before submitting your answer, it might be useful to compare the density estimate generated by your algorithm against a kernel density estimate generated by the R function density(). While they should not be identical, they should be similar.
:::
```{r}
#| label: lbl-location-scale-mixture
###### Setup data
x = faithful$eruptions
n = length(x)
###### EM algorithm to fit the location-and-scale mixture of 2 Gaussians
w = 0.5
mu = c(mean(x)-sd(x), mean(x)+sd(x))
sigma = rep(sd(x)/4,2)
s = 0
sw = FALSE
QQ = -Inf
QQ.out = NULL
epsilon = 10^(-5)
while(!sw){
## E step
v = array(0, dim=c(n,2))
for(i in 1:n){
v[i,1] = log(w) + dnorm(x[i], mu[1], sigma[1], log=T)
v[i,2] = log(1-w) + dnorm(x[i], mu[2], sigma[2], log=T)
v[i,] = exp(v[i,] - max(v[i,]))/sum(exp(v[i,] - max(v[i,])))
}
## M step
# Weights
w = mean(v[,1])
# Means
mu = rep(0, 2)
for(k in 1:2){
for(i in 1:n){
mu[k] = mu[k] + v[i,k]*x[i]
}
mu[k] = mu[k]/sum(v[,k])
}
# Variances
sigma = rep(0,2)
for(k in 1:2){
for(i in 1:n){
sigma[k] = sigma[k] + v[i,k]*(x[i] - mu[k])^2
}
sigma[k] = sqrt(sigma[k]/sum(v[,k]))
}
## Check convergence
QQn = 0
for(i in 1:n){
QQn = QQn + v[i,1]*(log(w) + dnorm(x[i],mu[1],sigma[1],log=TRUE)) +
v[i,2]*(log(1-w) + dnorm(x[i],mu[2],sigma[2],log=TRUE))
}
if(abs(QQn-QQ)/abs(QQn)<epsilon){
sw=TRUE
}
QQ = QQn
QQ.out = c(QQ.out, QQ)
s = s + 1
print(paste(s, QQn))
}
```
```{r}
#| label: lbl-density-estimate
xx = seq(0,7,length=150)
nxx = length(xx)
density.EM = rep(0, nxx)
for(s in 1:nxx){
density.EM[s] = density.EM[s] + w*dnorm(xx[s], mu[1], sigma[1]) +
(1-w)*dnorm(xx[s], mu[2], sigma[2])
}
```
```{r}
#| label: fig-density-estimate
#| fig-cap: "Density estimate of the duration of the eruptions using a location-and-scale mixture of 2 univariate Gaussian distributions"
plot(xx, density.EM, col="red", lwd=2, type="l", xlab="Eruptions")
points(x,rep(0,n))
title(main="Mixture of 2 Gaussians")
```
:::::