So far we have considered the number of components in the mixture model as a known and fixed parameter. However, in practice the number of components is often unknown and needs to be estimated from the data. We will cover two approaches to estimating the number of components in a mixture model:

1. Bayesian Information Criteria (BIC): a model selection criterion that penalizes the likelihood of the model based on the number of parameters.
2. Bayesian approach: a probabilistic approach that estimates the number of components from the posterior distribution of the model parameters.
81.1 Bayesian Information Criteria - BIC 🎥
Figure 81.1: BIC
Figure 81.2: BIC
We can view the choice of the number of components in a mixture model as a model selection problem. In this context, we have a collection of J models, each with a different number of components, and the goal is to select the model that best fits the data while avoiding overfitting.
A common approach to model selection that is useful for mixture models is the Bayesian Information Criteria (BIC). Given a collection of J models to be compared, the BIC for model j is given by the formula:
BIC_j = - 2\log L_j(\hat{\eta}_j) + r_j \log (n)
\tag{81.1}
where L_j is the likelihood of model j, \hat{\eta}_j is the maximum likelihood estimate of its parameters, r_j is the number of effective (independent) parameters in model j, and n is the number of observations. The model with the lowest BIC value is considered the best model.
We can interpret the first term in Equation 81.1 as a measure of the goodness of fit of the model, while the second term penalizes the model for its complexity. The BIC is a trade-off between the goodness of fit and the complexity of the model.
In the case of mixture models, model j corresponds to a mixture with K components, and \eta_K collects the parameters of the model, namely the weights and the parameters of the component distributions, i.e. \eta_K = (w_1,\ldots,w_K, \theta_1, \ldots, \theta_K). The likelihood of a mixture model with K components, evaluated at the maximum likelihood estimates, is:

L_K(\hat{w}_1,\ldots,\hat{w}_K, \hat{\theta}_1,\ldots,\hat{\theta}_K) = \prod_{i=1}^n \sum_{k=1}^K \hat{w}_k g(x_i \mid \hat{\theta}_k)

and the number of effective parameters is given by:

r_K = K - 1 + \sum_{k=1}^K \dim(\theta_k)

since only K - 1 of the weights are independent (they must add up to one).
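To make Equation 81.1 concrete, here is a minimal sketch (an illustration, not part of the course code) that computes the BIC of a fitted location mixture of K Gaussians with a common variance, for which r_K = (K-1) + K + 1 = 2K:

```r
## Minimal sketch: BIC for a location mixture of K Gaussians with a
## common variance; w, mu (length K) and sigma are assumed to be the
## maximum likelihood estimates.
mixture.BIC = function(x, w, mu, sigma){
  n = length(x)
  K = length(w)
  loglik = 0
  for(i in 1:n){
    loglik = loglik + log(sum(w*dnorm(x[i], mu, sigma)))
  }
  rK = (K - 1) + K + 1    # K-1 weights, K means, one variance
  -2*loglik + rK*log(n)   # smaller is better
}
```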
81.2 Bayesian Information Criteria Example 🎥
This is a walkthrough of the code in the next section.
81.3 Sample code: Bayesian Information Criteria 🗒️
This code sample illustrates the use of BIC to estimate the number of components of a mixture model using the galaxies dataset.
Listing 81.1: Illustrating the use of BIC to estimate the number of components of a Mixture Model using the galaxies dataset
```r
## Using the galaxies dataset
rm(list=ls())                                # <1>
library(MASS)                                # <2>
data(galaxies)                               # <2>
x = galaxies                                 # <3>
n = length(x)                                # <3>
set.seed(781209)                             # <4>

KKmax = 20                                   # <5>
BIC = rep(0, KKmax-1)                        # <6>
w.sum = vector("list", KKmax-1)              # <7>
mu.sum = vector("list", KKmax-1)             # <8>
sigma.sum = rep(0, KKmax-1)                  # <9>

for(KK in 2:KKmax){                          # <10>
  ### First, compute the "Maximum Likelihood" density estimate
  ### associated with a location mixture of KK Gaussian distributions
  ### using the EM algorithm                 # <11>
  w     = rep(1, KK)/KK                      # <12>
  mu    = rnorm(KK, mean(x), sd(x))          # <12>
  sigma = sd(x)/KK                           # <12>
  epsilon = 0.000001                         # <13>
  s  = 0                                     # <13>
  sw = FALSE                                 # <13>
  KL = -Inf                                  # <13>
  KL.out = NULL                              # <13>

  while(!sw){                                # <14>
    # E step                                 # <15>
    v = array(0, dim=c(n, KK))
    for(k in 1:KK){
      v[,k] = log(w[k]) + dnorm(x, mu[k], sigma, log=TRUE)
    }
    for(i in 1:n){
      v[i,] = exp(v[i,] - max(v[i,]))/sum(exp(v[i,] - max(v[i,])))
    }

    # M step                                 # <16>
    w = apply(v, 2, mean)                    # <17>
    mu = rep(0, KK)                          # <18>
    for(k in 1:KK){
      for(i in 1:n){
        mu[k] = mu[k] + v[i,k]*x[i]          # <19>
      }
      mu[k] = mu[k]/sum(v[,k])               # <20>
    }
    sigma = 0                                # <21>
    for(i in 1:n){
      for(k in 1:KK){
        sigma = sigma + v[i,k]*(x[i] - mu[k])^2   # <22>
      }
    }
    sigma = sqrt(sigma/sum(v))               # <23>

    # Check convergence                      # <24>
    KLn = 0
    for(i in 1:n){
      for(k in 1:KK){
        KLn = KLn + v[i,k]*(log(w[k]) + dnorm(x[i], mu[k], sigma, log=TRUE))
      }
    }
    if(abs(KLn - KL)/abs(KLn) < epsilon){
      sw = TRUE
    }
    KL = KLn
    KL.out = c(KL.out, KL)
    s = s + 1
    if(s/20 == floor(s/20)){
      print(paste(s, KLn))
    }
  }

  w.sum[[KK-1]]   = w
  mu.sum[[KK-1]]  = mu
  sigma.sum[KK-1] = sigma

  # Compute the BIC for this number of components   # <25>
  for(i in 1:n){
    BIC[KK-1] = BIC[KK-1] - 2*log(sum(w*dnorm(x[i], mu, sigma)))
  }
  BIC[KK-1] = BIC[KK-1] + ((KK-1) + 1 + KK)*log(n)
  ### KK-1 independent weights, one variance, and KK means
}
```
1. Clear the environment
2. Loading data
3. Setting up global variables
4. Setting a random seed for reproducibility
5. Setting the maximum number of components to consider
6. Initializing the BIC vector to store the BIC values for each number of components
7. Initializing the list to store the weights for each number of components
8. Initializing the list to store the means for each number of components
9. Initializing the vector to store the standard deviations for each number of components
10. Looping over the number of components from 2 to KKmax
11. Fitting a mixture model using the EM algorithm for each number of components
12. Initializing the parameters of the mixture model
13. Initializing the convergence criteria
14. Running the EM algorithm until convergence
15. E-step: computing the responsibilities for each component
16. M-step: updating the weights, means, and standard deviations
17. Weights are updated based on the responsibilities
18. Initializing the means for the mixture model
19. Updating the means based on the responsibilities
20. Normalizing the means
21. Computing the standard deviations
22. Updating the standard deviations based on the responsibilities
23. Normalizing the standard deviations
24. Checking convergence of the EM algorithm
25. Computing the BIC for the current number of components
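The results can then be visualized: the BIC curve as a function of K, comparisons of the density estimates for several values of K, and the behavior of the fitted bandwidth \hat{\sigma} as K increases.

```r
## Plot of BIC as a function of K
par(mar=c(4,4,1,1) + 0.1)
plot(seq(2, KKmax), BIC, type="l", xlab="K", ylab="BIC", lwd=2)
abline(v=6, lty=3)

## Computing density estimates for various values of K
density.est = function(xx, w, mu, sigma){
  KK = length(w)
  density.EM = rep(0, length(xx))
  for(s in 1:length(xx)){
    for(k in 1:KK){
      density.EM[s] = density.EM[s] + w[k]*dnorm(xx[s], mu[k], sigma)
    }
  }
  return(density.EM)
}
xx = seq(5000, 37000, length=300)
KK = 8; mdeKK8 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 7; mdeKK7 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 6; mdeKK6 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 5; mdeKK5 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 4; mdeKK4 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])

## Comparing density estimates for K = 4, 5 and 6
par(mar=c(4,4,1,1) + 0.1)
plot(xx, mdeKK6, type="n", ylim=c(0, max(c(mdeKK4, mdeKK5, mdeKK6, mdeKK7))),
     xlab="Velocity", ylab="Density")
lines(xx, mdeKK6, col="black", lty=1, lwd=2)
lines(xx, mdeKK5, col="red",   lty=2, lwd=2)
lines(xx, mdeKK4, col="blue",  lty=3, lwd=2)
points(x, rep(0, n))
legend(26000, 0.00022, c("K = 6", "K = 5", "K = 4"), lty=c(1,2,3),
       col=c("black", "red", "blue"), bty="n")

## Comparing density estimates for K = 6, 7 and 8
plot(xx, mdeKK6, type="n", ylim=c(0, max(c(mdeKK6, mdeKK7, mdeKK8))),
     xlab="Velocity", ylab="Density")
lines(xx, mdeKK6, col="black", lty=1, lwd=2)
lines(xx, mdeKK7, col="red",   lty=2, lwd=2)
lines(xx, mdeKK8, col="blue",  lty=3, lwd=2)
points(x, rep(0, n))
legend(26000, 0.00022, c("K = 6", "K = 7", "K = 8"), lty=c(1,2,3),
       col=c("black", "red", "blue"), bty="n")

## Fitted standard deviation (bandwidth) as K increases
plot(seq(2, KKmax), sigma.sum, type="l", xlab="K",
     ylab=expression(hat(sigma)), lwd=2)
abline(v=6, lty=3)
```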
Figure 81.4: What happens with the variance (bandwidth) as K increases
81.3.1 Estimating the number of components in Bayesian settings 🎥
Is BIC Bayesian?
Despite the word Bayesian in its name, the BIC is not considered a Bayesian method. It is a frequentist criterion that uses the likelihood of the model and the number of parameters to choose the number of components. In contrast, Bayesian methods use the posterior distribution of the model parameters to estimate the number of components.
What we want, then, is a posterior estimate of the number of components. We can obtain one by placing a Dirichlet process prior on the weights of the mixture model. The Dirichlet process is a nonparametric prior that allows for an infinite number of components, of which only a finite number will be occupied in the posterior distribution.
81.3.2 Dirichlet priors on the mixture weights
Figure 81.5: Estimating the number of components
Figure 81.6: Estimating the number of components
- K = maximum number of components
- K* = number of components that really generated the data
with K^* \ll K. So far we have used the uniform prior

\tilde{w} \sim \mathrm{Dir}(1, \ldots, 1)
\tag{81.2}

but this won't work, because the number of weights in the prior increases with K and so the prior has increasing influence on the posterior. We need a prior whose influence on the posterior shrinks as K increases, such as:
\tilde{w} \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K)
\tag{81.3}
where \alpha is a hyperparameter that controls the strength of the prior. The larger the value of \alpha, the more influence the prior has on the posterior distribution.
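As a quick illustration (a simulation sketch, not part of the course code), draws from \mathrm{Dir}(\alpha/K, \ldots, \alpha/K) put non-negligible mass on only a handful of components even when K is large, which is exactly the behavior we want from this prior:

```r
## Simulation sketch: Dir(alpha/K, ..., alpha/K) draws concentrate mass
## on a few components, so increasing K does not increase the effective
## influence of the prior.
library(MCMCpack)   # for rdirichlet
set.seed(1)
alpha = 1.5
for(K in c(10, 100)){
  w = as.vector(rdirichlet(1, rep(alpha/K, K)))
  print(paste("K =", K, "-> weights above 0.01:", sum(w > 0.01)))
}
```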
If (w_1,\ldots,w_K) \sim \mathrm{Dir}(\alpha/K,\ldots,\alpha/K), then the expected number of occupied components K^* satisfies:

\lim_{K \to \infty} \mathbb{E}[K^*] = \sum_{i=1}^n \frac{\alpha}{\alpha+i-1} \approx \int_1^n \frac{\alpha}{\alpha+x-1}\,dx = \alpha \log\left(\frac{n+\alpha-1}{\alpha}\right)
\tag{81.4}
leaving us with just a single parameter \alpha to tune. This is a very useful result because it allows us to control the expected number of occupied components without having to specify the number of components in advance.
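A quick numerical check of this approximation against the exact finite sum, using the galaxies sample size; this is the same relationship that the sample code below inverts with uniroot to pick \alpha:

```r
## Exact sum vs. the logarithmic approximation for E[K*],
## with n = 82 and alpha = 1.5 (the values used in the sample code below)
n = 82; alpha = 1.5
sum(alpha/(alpha + (1:n) - 1))       # exact: about 6.6
alpha*log((n + alpha - 1)/alpha)     # approximation: about 6.0
```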
81.3.3 Sample code for estimating the number of components and the partition structure in Bayesian models 🗒️
```r
## Full Bayesian estimation of a mixture model for density estimation
## in the galaxies dataset
rm(list=ls())

### Loading data and setting up global variables
library(MASS)
library(MCMCpack)
```
```r
data(galaxies)
x = galaxies
n = length(x)
set.seed(781209)

### Fitting a Bayesian mixture model with KK components
KK = 30   ## In this formulation, KK should be interpreted as the
          ## maximum number of components allowed

## Finding the value of alpha consistent with 6 expected components a priori
ff = function(alpha) alpha*log((82 + alpha - 1)/alpha) - 6
alph = uniroot(ff, c(0.01, 20))
alph$root   # 1.496393
```
[1] 1.496393
```r
## Priors set up using an "empirical Bayes" approach
aa  = rep(1.5/KK, KK)   # We approximate 1.496393 by 1.5
eta = mean(x)
tau = sqrt(var(x))
dd  = 2
qq  = var(x)/KK

## Initialize the parameters
w     = rep(1, KK)/KK
mu    = rnorm(KK, mean(x), sd(x))
sigma = sd(x)/KK
cc    = sample(1:KK, n, replace=TRUE, prob=w)

## Number of iterations of the sampler
rrr  = 25000
burn = 5000

## Storing the samples
cc.out    = array(0, dim=c(rrr, n))
w.out     = array(0, dim=c(rrr, KK))
mu.out    = array(0, dim=c(rrr, KK))
sigma.out = rep(0, rrr)   # a single common sigma is used
logpost   = rep(0, rrr)

for(s in 1:rrr){
  # Sample the indicators
  for(i in 1:n){
    v = rep(0, KK)
    for(k in 1:KK){
      v[k] = log(w[k]) + dnorm(x[i], mu[k], sigma, log=TRUE)  # Compute the log of the weights
    }
    v = exp(v - max(v))/sum(exp(v - max(v)))
    cc[i] = sample(1:KK, 1, replace=TRUE, prob=v)
  }

  # Sample the weights
  w = as.vector(rdirichlet(1, aa + tabulate(cc, nbins=KK)))

  # Sample the means
  for(k in 1:KK){
    nk    = sum(cc==k)
    xsumk = sum(x[cc==k])
    tau2.hat = 1/(nk/sigma^2 + 1/tau^2)
    mu.hat   = tau2.hat*(xsumk/sigma^2 + eta/tau^2)
    mu[k] = rnorm(1, mu.hat, sqrt(tau2.hat))
  }

  # Sample the variances
  dd.star = dd + n/2
  qq.star = qq + sum((x - mu[cc])^2)/2
  sigma = sqrt(1/rgamma(1, dd.star, qq.star))

  # Store samples
  cc.out[s,]   = cc
  w.out[s,]    = w
  mu.out[s,]   = mu
  sigma.out[s] = sigma
  for(i in 1:n){
    logpost[s] = logpost[s] + log(w[cc[i]]) + dnorm(x[i], mu[cc[i]], sigma, log=TRUE)
  }
  logpost[s] = logpost[s] + log(ddirichlet(w, aa))
  for(k in 1:KK){
    logpost[s] = logpost[s] + dnorm(mu[k], eta, tau, log=TRUE)
  }
  logpost[s] = logpost[s] + dgamma(1/sigma^2, dd, qq, log=TRUE) - 4*log(sigma)
  if(s/500 == floor(s/500)){
    print(paste("s =", s))
  }
}
```
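After burn-in, we summarize the posterior distribution of the number of occupied components K^* and construct the pairwise co-clustering matrix; the heatmapplot helper defined here is reused at the end of the section to display the optimal partition.

```r
## Posterior distribution of the number of occupied components
nunique = function(x) length(unique(x))
Kstar = apply(cc.out[-seq(1,burn),], 1, nunique)
par(mar=c(4,4,1,1) + 0.1)
barplot(table(Kstar)/sum(table(Kstar)), xlab=expression(K^"*"), ylab="Frequency")

## Construct pairwise co-clustering matrix for this dataset
pairwise = matrix(0, nrow=n, ncol=n)
for(s in 1:(rrr-burn)){
  for(i in 1:n){
    for(j in i:n){
      pairwise[i,j] = pairwise[i,j] + as.numeric(cc.out[s+burn,i]==cc.out[s+burn,j])
      pairwise[j,i] = pairwise[i,j]
    }
  }
}
DD = pairwise/max(pairwise)

## Heatmap plot of the co-clustering matrix
heatmapplot = function(DD, alab, subsetaxis, llc=FALSE){
  n = dim(DD)[1]
  colorscale = c("white", rev(heat.colors(100)))
  nf = layout(matrix(c(1,2), nrow=1, ncol=2), c(7,1), TRUE)
  par(mar=c(4,3,1,0.5))
  ### Display heat-map
  image(seq(1,n), seq(1,n), DD, axes=FALSE, xlab="", ylab="",
        col=colorscale[seq(floor(min(100*DD)), floor(max(100*DD))) + 1])
  axis(1, at=subsetaxis, labels=alab[subsetaxis], las=2, cex.axis=1)
  axis(2, at=subsetaxis, labels=alab[subsetaxis], las=2, cex.axis=1)
  box()
  abline(v = llc + 0.5)
  abline(h = llc + 0.5)
  ### Display color scale
  par(mar=c(3,0,0,0))
  plot(1:100, 1:100, xlim=c(0,2), ylim=c(0,100), type="n", axes=FALSE,
       xlab="", ylab="")
  yposr = 1:100
  rect(0, yposr - .5, 0.5, yposr + .5, col=colorscale, border=FALSE)
  rect(0, .5, 0.5, 100.5, col="transparent")
  text(0.42, c(yposr[1], yposr[25], yposr[50], yposr[75], yposr[100]),
       c("0.00", "0.25", "0.50", "0.75", "1.00"), pos=4, cex=1.1)
}
heatmapplot(DD, seq(1,n), seq(1,n,by=3))
```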
```r
## Plot Bayesian estimate with pointwise credible bands along with
## kernel density estimate and frequentist point estimate

## Compute the samples of the density over a dense grid
xx = seq(5000, 37000, length=300)
density.mcmc = array(0, dim=c(rrr-burn, length(xx)))
for(s in 1:(rrr-burn)){
  for(k in 1:KK){
    density.mcmc[s,] = density.mcmc[s,] +
      w.out[s+burn,k]*dnorm(xx, mu.out[s+burn,k], sigma.out[s+burn])
  }
}
density.mcmc.m = apply(density.mcmc, 2, mean)
yy = density(x)
colscale = c("black", "red")
density.mcmc.lq = apply(density.mcmc, 2, quantile, 0.025)
density.mcmc.uq = apply(density.mcmc, 2, quantile, 0.975)
par(mfrow=c(1,1))
par(mar=c(4,4,1,1) + 0.1)
plot(xx, density.mcmc.m, type="n", ylim=c(0, max(density.mcmc.uq)),
     xlab="Velocity", ylab="Density")
polygon(c(xx, rev(xx)), c(density.mcmc.lq, rev(density.mcmc.uq)),
        col="grey", border="grey")
lines(xx, density.mcmc.m, col=colscale[1], lwd=2)
lines(yy, col=colscale[2], lty=2, lwd=2)
points(x, rep(0, n))
legend(27000, 0.00017, c("MCMC", "KDE"), col=colscale[c(1,2)],
       lty=c(1,2), lwd=2, bty="n")
```
```r
##### Finding optimal partition according to Binder's loss function

## Function that computes the loss at a particular configuration
Lstst = function(cch, DD, Dbar){
  z = 0
  for(i in 1:(n-1)){
    for(j in (i+1):n){
      if(cch[i]==cch[j]){
        z = z + (DD[i,j] - Dbar)
      }
    }
  }
  return(z)
}

## Initial value of the algorithm is the last iteration of the sampler
## Using as.numeric(factor()) is a cheap way to force the cluster labels
## to be sequential starting at 1
cch = as.numeric(factor(cc))

## Setup parameters for the iterative algorithm
Dbar = 0.50
optLstst.old = -Inf
optLstst.new = Lstst(cch, DD, Dbar=Dbar)
maxiter = 50
niter = 1
while((optLstst.old != optLstst.new) & (niter <= maxiter)){
  for(i in 1:n){
    nq = max(cch) + 1
    q = rep(0, nq)
    for(s in 1:nq){
      ccht = cch
      ccht[i] = s
      q[s] = Lstst(ccht, DD, Dbar=Dbar)
    }
    cch[i] = which.max(q)
    cch = as.numeric(factor(cch))
  }
  optLstst.old = optLstst.new
  optLstst.new = Lstst(cch, DD, Dbar=Dbar)
  niter = niter + 1
}
#print(nunique(cch))

## Create another heatmap plot of the co-clustering matrix in which the
## optimal clusters are represented
cchlo = as.numeric(as.character(factor(cch, labels=order(unique(cch)))))
cchlotab = table(cchlo)
llc = cumsum(cchlotab[-length(cchlotab)])
heatmapplot(DD, seq(1,n), seq(1,n,by=3), llc=llc)
```