## Density Estimation using Mixture Models 🎥 {#sec-density-estimation-using-mixture-models}

### KDE

- The typical method for estimating the density of a random variable is the kernel density estimator (KDE).
- The KDE is a non-parametric method that estimates the density of a random variable by averaging the contributions of a set of kernel functions, one centered at each data point:

$$
X_1, \ldots, X_n \sim f(x)
$$

$$
\tilde{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h}\, g \left( \frac{x - X_i}{h} \right)
$$ {#eq-kde-definition-for-density-estimation}

where $h$ is the bandwidth of the kernel and $g$ is a kernel function. The kernel function is a non-negative function that integrates to 1 and is symmetric around 0.
For example, the Gaussian kernel is given by:

$$
g(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}
$$ {#eq-gaussian-kernel}

giving us:

$$
\tilde{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h\sqrt{2\pi}} e^{-\frac{(x - X_i)^2}{2h^2}}
$$ {#eq-kde}
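To make @eq-kde concrete, here is a minimal sketch (not part of the lecture material) that evaluates the Gaussian KDE by hand on a toy sample and compares it with R's built-in `density()`; the toy data and the bandwidth `h` are arbitrary illustrative choices.

```{r}
# Hand-rolled Gaussian KDE on a toy bimodal sample (illustrative values only)
set.seed(42)
x.toy = c(rnorm(50, -2, 1), rnorm(50, 3, 0.5))   # toy data
h     = 0.5                                      # illustrative bandwidth
grid  = seq(-6, 6, length.out = 200)

# (1/n) * sum_i (1/h) * g((x - X_i)/h) with g the standard Gaussian density
f.tilde = sapply(grid, function(x0) mean(dnorm((x0 - x.toy)/h))/h)

plot(grid, f.tilde, type = "l", xlab = "x", ylab = "Density")
lines(density(x.toy, bw = h), lty = 2)           # built-in KDE, same bandwidth
```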
### Mixture of K components

- A mixture of $K$ components is a parametric method that estimates the density of a random variable by averaging the contributions of $K$ kernel functions, each centered at a different location and, possibly, with a different scale.
- The mixture model is given by:

$$
X_1, \ldots, X_n \sim \hat{f}(x) = \sum_{k=1}^K \hat{w}_k\, g(x \mid \hat{\theta}_k)
$$ {#eq-mixture-density-estimation}

where $\hat{w}_k$ is the weight of the $k$-th component, $\hat{\theta}_k$ collects the location and scale parameters of the $k$-th component, and $g(x \mid \hat{\theta}_k)$ is the kernel function with parameters $\hat{\theta}_k$. The weights are non-negative and sum to 1.
For example, a location mixture of $K$ Gaussian distributions is given by:

$$
X_1, \ldots, X_n \sim \hat{f}(x) = \sum_{k=1}^K \hat{w}_k \frac{1}{\sqrt{2\pi}\hat{\sigma}} e^{-\frac{(x - \hat{\mu}_k)^2}{2\hat{\sigma}^2}}
$$ {#eq-mixture-density-estimation-gaussian-location}

where $\hat{w}_k$ is the weight of the $k$-th component, $\hat{\mu}_k$ is the mean of the $k$-th component, and $\hat{\sigma}$ is the standard deviation common to all components.
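As a small illustration (the weights, means, and common standard deviation below are made-up values, not estimates from any dataset), this sketch evaluates and plots a location mixture of three Gaussians:

```{r}
# Location mixture of K = 3 Gaussians with common sigma (illustrative values)
w.hat     = c(0.5, 0.3, 0.2)   # weights, non-negative and summing to 1
mu.hat    = c(-3, 0, 4)        # component means
sigma.hat = 1                  # common standard deviation

# Mixture density: sum_k w_k * N(x | mu_k, sigma)
f.hat = function(x) sapply(x, function(x0) sum(w.hat * dnorm(x0, mu.hat, sigma.hat)))

curve(f.hat(x), from = -8, to = 8, ylab = "Density")
```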
We can see that the two methods are quite similar, but the mixture model is more flexible and can capture more complex shapes in the data:

- The KDE can be viewed as a special case of the mixture model in which there is one component per data point, every component has the same scale (the bandwidth $h$), and every weight is fixed at $1/n$.
- The KDE therefore needs as many components as there are data points, while the mixture model can make do with far fewer; for the galaxies data below, the KDE uses $n = 82$ components whereas the mixture uses only $K = 6$.
- The KDE uses a single fixed bandwidth, whereas mixture density estimation (MDE) can adapt: each component has its own estimated weight, and the scale parameter controls the width of the kernel.
- MDE tends to use far fewer components, so each estimated weight is substantial (on the order of $1/K$) rather than the small fixed weight $1/n$ that every KDE component receives.
The above model can be improved by:

1. using a scale-location mixture model, in which both the scale and the location of each component are estimated from the data (see the sketch after this list).
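As a sketch of that extension (keeping the notation above), a scale-location mixture of $K$ Gaussians replaces the common $\hat{\sigma}$ with a component-specific $\hat{\sigma}_k$:

$$
\hat{f}(x) = \sum_{k=1}^K \hat{w}_k \frac{1}{\sqrt{2\pi}\hat{\sigma}_k} e^{-\frac{(x - \hat{\mu}_k)^2}{2\hat{\sigma}_k^2}}
$$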
### Density Estimation Example 🎥 {#sec-density-estimation-example}
We use the `galaxies` dataset to illustrate the differences between the two methods. The dataset, available in the `MASS` package, contains the velocities (in km/s) of 82 galaxies from a survey of the Corona Borealis region.
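Before fitting any model, a quick exploratory look at the data (a minimal sketch, not part of the course script; the number of histogram breaks is an arbitrary choice):

```{r}
# Quick look at the galaxy velocities
library(MASS)
data(galaxies)
summary(galaxies)
hist(galaxies, breaks = 20, xlab = "Velocity (km/s)", main = "Galaxy velocities")
rug(galaxies)   # mark the individual observations
```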
### Sample code for Density Estimation Problem
```{r}
#| label: "lbl-density-estimation-example"
## Using mixture models for density estimation in the galaxies dataset
## Compare kernel density estimation, and estimates from mixtures of KK=6
## components obtained using both frequentist and Bayesian procedures
rm(list=ls())

### Loading data and setting up global variables
library(MASS)
library(MCMCpack)
data(galaxies)
KK = 6   # Based on the description of the dataset
x  = galaxies
n  = length(x)
set.seed(781209)

### First, compute the "Maximum Likelihood" density estimate associated with
### a location mixture of 6 Gaussian distributions using the EM algorithm
## Initialize the parameters
w       = rep(1,KK)/KK
mu      = rnorm(KK, mean(x), sd(x))
sigma   = sd(x)/KK
epsilon = 0.000001
s  = 0
sw = FALSE
KL = -Inf
KL.out = NULL

while(!sw){
  ## E step
  v = array(0, dim=c(n,KK))
  for(k in 1:KK){
    v[,k] = log(w[k]) + dnorm(x, mu[k], sigma, log=TRUE)
  }
  for(i in 1:n){
    v[i,] = exp(v[i,] - max(v[i,]))/sum(exp(v[i,] - max(v[i,])))
  }

  ## M step
  # Weights
  w = apply(v, 2, mean)
  mu = rep(0, KK)
  for(k in 1:KK){
    for(i in 1:n){
      mu[k] = mu[k] + v[i,k]*x[i]
    }
    mu[k] = mu[k]/sum(v[,k])
  }
  # Standard deviations
  sigma = 0
  for(i in 1:n){
    for(k in 1:KK){
      sigma = sigma + v[i,k]*(x[i] - mu[k])^2
    }
  }
  sigma = sqrt(sigma/sum(v))

  ## Check convergence
  KLn = 0
  for(i in 1:n){
    for(k in 1:KK){
      KLn = KLn + v[i,k]*(log(w[k]) + dnorm(x[i], mu[k], sigma, log=TRUE))
    }
  }
  if(abs(KLn-KL)/abs(KLn) < epsilon){
    sw = TRUE
  }
  KL = KLn
  KL.out = c(KL.out, KL)
  s = s + 1
  print(paste(s, KLn))
}
```
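As a quick sanity check (not part of the original script), the objective stored in `KL.out` should be essentially non-decreasing across EM iterations:

```{r}
# Verify that the EM objective increases monotonically
plot(KL.out, type = "b", xlab = "EM iteration", ylab = "Objective")
all(diff(KL.out) >= -1e-8)   # should be TRUE up to numerical error
```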
```{r}
xx  = seq(5000, 37000, length=300)
nxx = length(xx)
density.EM = rep(0, nxx)
for(s in 1:nxx){
  for(k in 1:KK){
    density.EM[s] = density.EM[s] + w[k]*dnorm(xx[s], mu[k], sigma)
  }
}

### Get a "Bayesian" kernel density estimator based on the same location mixture of 6 normals
## Priors set up using an "empirical Bayes" approach
aa  = rep(1,KK)
eta = mean(x)
tau = sqrt(var(x))
dd  = 2
qq  = var(x)/KK

## Initialize the parameters
w     = rep(1,KK)/KK
mu    = rnorm(KK, mean(x), sd(x))
sigma = sd(x)/KK
cc    = sample(1:KK, n, replace=T, prob=w)

## Number of iterations of the sampler
rrr  = 12000
burn = 2000

## Storing the samples
cc.out    = array(0, dim=c(rrr, n))
w.out     = array(0, dim=c(rrr, KK))
mu.out    = array(0, dim=c(rrr, KK))
sigma.out = rep(0, rrr)
logpost   = rep(0, rrr)

for(s in 1:rrr){
  # Sample the indicators
  for(i in 1:n){
    v = rep(0,KK)
    for(k in 1:KK){
      v[k] = log(w[k]) + dnorm(x[i], mu[k], sigma, log=TRUE)  # Compute the log of the weights
    }
    v = exp(v - max(v))/sum(exp(v - max(v)))
    cc[i] = sample(1:KK, 1, replace=TRUE, prob=v)
  }

  # Sample the weights
  w = as.vector(rdirichlet(1, aa + tabulate(cc, nbins=KK)))

  # Sample the means
  for(k in 1:KK){
    nk    = sum(cc==k)
    xsumk = sum(x[cc==k])
    tau2.hat = 1/(nk/sigma^2 + 1/tau^2)
    mu.hat   = tau2.hat*(xsumk/sigma^2 + eta/tau^2)
    mu[k]    = rnorm(1, mu.hat, sqrt(tau2.hat))
  }

  # Sample the variances
  dd.star = dd + n/2
  qq.star = qq + sum((x - mu[cc])^2)/2
  sigma   = sqrt(1/rgamma(1, dd.star, qq.star))

  # Store samples
  cc.out[s,]   = cc
  w.out[s,]    = w
  mu.out[s,]   = mu
  sigma.out[s] = sigma
  for(i in 1:n){
    logpost[s] = logpost[s] + log(w[cc[i]]) + dnorm(x[i], mu[cc[i]], sigma, log=TRUE)
  }
  logpost[s] = logpost[s] + log(ddirichlet(w, aa))
  for(k in 1:KK){
    logpost[s] = logpost[s] + dnorm(mu[k], eta, tau, log=TRUE)
  }
  logpost[s] = logpost[s] + dgamma(1/sigma^2, dd, qq, log=TRUE) - 4*log(sigma)
  if(s/500==floor(s/500)){
    print(paste("s =", s))
  }
}

## Compute the samples of the density over a dense grid
density.mcmc = array(0, dim=c(rrr-burn, length(xx)))
for(s in 1:(rrr-burn)){
  for(k in 1:KK){
    density.mcmc[s,] = density.mcmc[s,] + w.out[s+burn,k]*dnorm(xx, mu.out[s+burn,k], sigma.out[s+burn])
  }
}
density.mcmc.m = apply(density.mcmc, 2, mean)

## Plot Bayesian estimate with pointwise credible bands along with kernel
## density estimate and frequentist point estimate
colscale = c("black", "blue", "red")
yy = density(x)
density.mcmc.lq = apply(density.mcmc, 2, quantile, 0.025)
density.mcmc.uq = apply(density.mcmc, 2, quantile, 0.975)
plot(xx, density.mcmc.m, type="n", ylim=c(0, max(density.mcmc.uq)),
     xlab="Velocity", ylab="Density")
polygon(c(xx, rev(xx)), c(density.mcmc.lq, rev(density.mcmc.uq)),
        col="grey", border="grey")
lines(xx, density.mcmc.m, col=colscale[1], lwd=2)
lines(xx, density.EM, col=colscale[2], lty=2, lwd=2)
lines(yy, col=colscale[3], lty=3, lwd=2)
points(x, rep(0,n))
legend(27000, 0.00017, c("KDE","EM","MCMC"), col=colscale[c(3,2,1)],
       lty=c(3,2,1), lwd=2, bty="n")
```
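As an optional follow-up (not part of the original script), one can inspect the mixing of the sampler through the trace of the log-posterior stored in `logpost`, using the `coda` package that `MCMCpack` loads:

```{r}
# Trace plot and effective sample size of the log-posterior after burn-in
library(coda)
traceplot(mcmc(logpost[-(1:burn)]))
effectiveSize(mcmc(logpost[-(1:burn)]))
```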