81  Determining the number of components - M5L9

Bayesian Statistics: Mixture Models

Bayesian Statistics
Author

Oren Bochman

Keywords

Mixture Models, Maximum Likelihood Estimation, BIC, Bayesian Information Criteria, Multimodality, Numerical Instability, notes

So far we have treated the number of components in the mixture model as a known and fixed parameter. However, in practice, the number of components is often unknown and needs to be estimated from the data. We will cover two approaches to estimate the number of components in a mixture model:

  1. Bayesian Information Criteria (BIC): A model selection criterion that penalizes the likelihood of the model based on the number of parameters.
  2. Bayesian approach: A probabilistic approach that estimates the number of components based on the posterior distribution of the model parameters.

81.1 Bayesian Information Criteria - BIC 🎥

Figure 81.1: BIC
Figure 81.2: BIC

We can view the choice of the number of components in a mixture model as a model selection problem. In this context, we have a collection of J models, each with a different number of components. The goal is to select the model that best fits the data while avoiding overfitting.

A common approach to model selection that is useful for mixture models is the Bayesian Information Criteria (BIC). Given a collection of J models to be compared, the BIC for model j is given by the formula: BIC_j = - 2\log L_j(\hat{\eta}_j) + r_j \log (n) \qquad \tag{81.1}

where L_j is the likelihood of model j, \hat{\eta}_j is the maximum likelihood estimate of its parameters, r_j is the number of effective (independent) parameters in model j, and n is the number of observations. The model with the lowest BIC value is considered the best model.

We can interpret the first term in Equation 81.1 as a measure of the goodness of fit of the model, while the second term penalizes the model for its complexity. The BIC is a trade-off between the goodness of fit and the complexity of the model.

In the case of mixture models, model j corresponds to a mixture with K components, and \eta_K corresponds to the parameters of the model, which include the weights and the parameters of the component distributions, i.e. \eta_K = (w_1,\ldots,w_K, \theta_1, \ldots, \theta_K). The likelihood of the mixture model with K components, evaluated at the maximum likelihood estimates, is: L_K(\hat{w}_1,\ldots,\hat{w}_K, \hat{\theta}_1,\ldots,\hat{\theta}_K) = \prod_{i=1}^n \sum_{k=1}^K \hat{w}_k g(x_i\mid\hat{\theta}_k)

Furthermore, the number of effective parameters is given by:

r_K = K - 1 + \sum_{k=1}^K \dim(\theta_k)

where the K - 1 accounts for the weights, which are constrained to sum to one.
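
For the location mixture of Gaussians with a common variance used in the example below, \theta_k = \mu_k and the shared standard deviation contributes one extra parameter, so r_K = (K-1) + K + 1. The following sketch (not part of the lecture code) shows how the BIC could be computed once maximum likelihood estimates are available; w.hat, mu.hat and sigma.hat are placeholder names for estimates obtained, for example, with the EM algorithm of the next section.

bic.location.mixture = function(x, w.hat, mu.hat, sigma.hat){
  ## BIC for a location mixture of K Gaussians with a common variance,
  ## evaluated at (hypothetical) maximum likelihood estimates
  n = length(x)
  K = length(w.hat)
  ## log-likelihood: sum over observations of the log mixture density
  loglik = sum(sapply(x, function(xi) log(sum(w.hat*dnorm(xi, mu.hat, sigma.hat)))))
  r = (K - 1) + K + 1   ## K-1 free weights, K means, one common variance
  -2*loglik + r*log(n)
}

Plugging in the estimates produced by the EM code below should reproduce, up to numerical error, the corresponding entry of the BIC vector computed there.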

81.2 Bayesian Information Criteria Example 🎥

This is a walkthrough of the code in the next section.

81.3 Sample code: Bayesian Information Criteria 🗒️

This code sample illustrates the use of BIC to estimate the number of components of a mixture model using the galaxies dataset.

Listing 81.1: Illustrating the use of BIC to estimate the number of components of a Mixture Model using the galaxies dataset
## using the galaxies dataset
rm(list=ls())
library(MASS)
data(galaxies)
x  = galaxies
n  = length(x)
set.seed(781209)

KKmax = 20
BIC   = rep(0, KKmax-1)

w.sum  = vector("list", KKmax-1)
mu.sum = vector("list", KKmax-1)
sigma.sum = rep(0, KKmax-1)

for(KK in 2:KKmax){
  ### First, compute the "Maximum Likelihood" density estimate 
  ### associated with a location mixture of KK Gaussian distributions 
  ### using the EM algorithm 

  w     = rep(1,KK)/KK
  mu    = rnorm(KK, mean(x), sd(x))
  sigma = sd(x)/KK
  
  epsilon = 0.000001
  s       = 0
  sw      = FALSE
  KL      = -Inf
  KL.out  = NULL
  
  while(!sw){

    v = array(0, dim=c(n,KK))
    for(k in 1:KK){
      v[,k] = log(w[k]) + dnorm(x, mu[k], sigma,log=TRUE)  
    }
    for(i in 1:n){
      v[i,] = exp(v[i,] - max(v[i,]))/sum(exp(v[i,] - max(v[i,])))
    }
    

    w = apply(v,2,mean)
    mu = rep(0, KK)
    for(k in 1:KK){ 
      for(i in 1:n){ 
        mu[k] = mu[k] + v[i,k]*x[i]
      }
      mu[k] = mu[k]/sum(v[,k])
    }

    sigma = 0
    for(i in 1:n){     
      for(k in 1:KK){  
        sigma = sigma + v[i,k]*(x[i] - mu[k])^2
      }
    }
    sigma = sqrt(sigma/sum(v))
    

    KLn = 0
    for(i in 1:n){
      for(k in 1:KK){
        KLn = KLn + v[i,k]*(log(w[k]) + 
                        dnorm(x[i], mu[k], sigma, log=TRUE))
      }
    }
    if(abs(KLn-KL)/abs(KLn)<epsilon){
      sw=TRUE
    }
    KL = KLn
    KL.out = c(KL.out, KL)
    s = s + 1
    if(s/20==floor(s/20)){      
      print(paste(s, KLn))
    }
  }
  
  w.sum[[KK-1]]  = w
  mu.sum[[KK-1]] = mu
  sigma.sum[KK-1] = sigma
  
  

  for(i in 1:n){
    BIC[KK-1] = BIC[KK-1] - 2*log(sum(w*dnorm(x[i], mu, sigma)))
  }
  BIC[KK-1] = BIC[KK-1] + ((KK-1) + 1 + KK)*log(n)  ### KK-1 independent weights, one variance, and KK means
}
Code annotations:

  1. Clear the environment
  2. Loading data
  3. Setting up global variables
  4. Setting a random seed for reproducibility
  5. Setting the maximum number of components to consider
  6. Initializing the BIC vector to store the BIC values for each number of components
  7. Initializing list to store the weights for each number of components
  8. Initializing list to store the means for each number of components
  9. Initializing vector to store the standard deviations for each number of components
  10. Looping over the number of components from 2 to KKmax
  11. Fitting a mixture model using the EM algorithm for each number of components
  12. Initializing the parameters of the mixture model
  13. Initializing the convergence criteria
  14. Running the EM algorithm until convergence
  15. E-step: computing the responsibilities for each component
  16. M-step: updating the weights, means, and standard deviations
  17. Weights are updated based on the responsibilities
  18. Initializing the means for the mixture model
  19. Updating the means based on the responsibilities
  20. Normalizing the means
  21. Computing the standard deviations
  22. Updating the standard deviations based on the responsibilities
  23. Normalizing the standard deviations
  24. Checking convergence of the EM algorithm
  25. Computing the BIC for the current number of components
[1] "20 -861.300160340851"
[1] "40 -861.845102704334"
[1] "60 -862.021825056483"
[1] "80 -862.107312512661"
[1] "100 -862.156909986849"
[1] "120 -862.188905642997"
[1] "140 -862.211044984583"
[1] "20 -892.435584724028"
[1] "40 -893.104888016642"
[1] "60 -893.40498202712"
[1] "80 -893.591204291568"
[1] "100 -893.723710425507"
[1] "120 -893.825831831938"
[1] "140 -893.908860452777"
[1] "160 -893.979019408269"
[1] "180 -894.040054525798"
[1] "200 -894.094370451017"
[1] "220 -894.143587647454"
[1] "240 -894.188838741371"
[1] "260 -894.230936144577"
[1] "280 -894.27047113134"
[1] "300 -894.307873877557"
[1] "320 -894.343449688044"
[1] "340 -894.37739943714"
[1] "360 -894.40982826946"
[1] "380 -894.440744086025"
[1] "400 -894.470045342745"
[1] "420 -894.497495562189"
[1] "440 -894.522679009391"
[1] "460 -894.544927117938"
[1] "20 -833.432143007169"
[1] "40 -818.168044247337"
[1] "20 -826.416964018383"
[1] "40 -770.789119074513"
[1] "20 -828.321379092958"
[1] "40 -795.942449657444"
[1] "20 -794.662374351299"
[1] "20 -809.439563423673"
[1] "40 -808.663042378389"
[1] "20 -801.80909130307"
[1] "40 -799.673095387316"
[1] "60 -795.809414492802"
[1] "80 -796.314077121678"
[1] "100 -796.933079286827"
[1] "120 -797.365852316129"
[1] "140 -797.658242104202"
[1] "160 -797.855651509745"
[1] "180 -797.990205412934"
[1] "200 -798.083160070137"
[1] "220 -798.148337498443"
[1] "240 -798.194729342538"
[1] "260 -798.228231798405"
[1] "280 -798.252756139525"
[1] "300 -798.270931855528"
[1] "20 -798.319463648812"
[1] "40 -794.845009758997"
[1] "20 -826.983331709186"
[1] "40 -821.239623208658"
[1] "60 -819.799267130562"
[1] "20 -816.295987712457"
[1] "40 -806.709549732948"
[1] "60 -799.9518751137"
[1] "80 -799.129266973443"
[1] "100 -799.002745451039"
[1] "120 -798.945377403758"
[1] "140 -798.91051104072"
[1] "160 -798.887897813767"
[1] "20 -815.12255712431"
[1] "40 -815.861463325247"
[1] "60 -816.65277780204"
[1] "80 -817.367420905963"
[1] "100 -817.981695199728"
[1] "120 -818.482121987687"
[1] "140 -818.87830886769"
[1] "160 -819.188917828566"
[1] "180 -819.43207261762"
[1] "200 -819.622680437914"
[1] "220 -819.772400789357"
[1] "240 -819.890235329328"
[1] "260 -819.983130882009"
[1] "280 -820.05646679529"
[1] "300 -820.114425855522"
[1] "320 -820.160272929299"
[1] "340 -820.196564739275"
[1] "360 -820.225308853616"
[1] "380 -820.248085081872"
[1] "20 -851.834637132759"
[1] "40 -846.408634750858"
[1] "60 -845.373438543406"
[1] "80 -845.241334679279"
[1] "100 -845.152403322103"
[1] "120 -844.93698044634"
[1] "140 -844.478918069205"
[1] "160 -843.859272082294"
[1] "180 -843.396830866521"
[1] "200 -843.157791850473"
[1] "220 -842.987198075861"
[1] "240 -842.746387043908"
[1] "260 -842.311020162512"
[1] "280 -841.525460525532"
[1] "300 -840.13307886189"
[1] "320 -837.637646283637"
[1] "340 -834.447395567086"
[1] "360 -833.024355453823"
[1] "380 -832.778629503937"
[1] "20 -830.712619627954"
[1] "40 -831.59431933264"
[1] "60 -831.760529398233"
[1] "20 -839.780784325688"
[1] "40 -831.549890278729"
[1] "60 -799.604532650141"
[1] "80 -798.987352669928"
[1] "100 -794.989269879538"
[1] "120 -789.077330165171"
[1] "140 -782.992659155106"
[1] "160 -779.078509255756"
[1] "180 -777.844806633688"
[1] "200 -777.752544725934"
[1] "20 -833.977026832454"
[1] "40 -829.411289375974"
[1] "60 -807.265963558671"
[1] "80 -801.845362767957"
[1] "100 -796.131513754935"
[1] "120 -795.383807283214"
[1] "140 -793.411136510273"
[1] "160 -790.669333999831"
[1] "180 -787.910419504715"
[1] "200 -786.399810943812"
[1] "220 -785.118217074068"
[1] "240 -784.899208579399"
[1] "20 -833.733328938719"
[1] "40 -812.315430568933"
[1] "60 -808.877039049294"
[1] "80 -799.484504043974"
[1] "100 -781.392623208429"
[1] "120 -779.207456613408"
[1] "140 -778.924812359465"
## Plot of BIC as a function of K
par(mar=c(4,4,1,1) + 0.1)
plot(seq(2,KKmax), BIC, type="l", xlab="K", ylab="BIC", lwd=2)
abline(v=6, lty=3)
Figure 81.3: BIC as a function of the number of components K

Computing density estimates for various values of K

density.est = function(xx, w, mu, sigma){
  KK  = length(w)
  nxx = length(xx)
  density.EM = rep(0, nxx)
  for(s in 1:nxx){
    for(k in 1:KK){
      density.EM[s] = density.EM[s] + w[k]*dnorm(xx[s], mu[k], sigma)
    }
  }
  return(density.EM)
}

xx  = seq(5000,37000,length=300)
KK = 8
mdeKK8 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 7
mdeKK7 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 6
mdeKK6 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 5
mdeKK5 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])
KK = 4
mdeKK4 = density.est(xx, w.sum[[KK-1]], mu.sum[[KK-1]], sigma.sum[KK-1])

## Comparing density estimates for K=4, 5 and 6
par(mar=c(4,4,1,1)+0.1)
plot(xx, mdeKK6, type="n",ylim=c(0,max(c(mdeKK4,mdeKK5,mdeKK6,mdeKK7))), 
     xlab="Velocity", ylab="Density")
lines(xx, mdeKK6, col="black", lty=1, lwd=2)
lines(xx, mdeKK5, col="red", lty=2, lwd=2)
lines(xx, mdeKK4, col="blue", lty=3, lwd=2)
points(x, rep(0,n))
legend(26000, 0.00022, c("K = 6","K = 5","K = 4"), 
       lty=c(1,2,3), col=c("black","red","blue"), bty="n")

## Comparing density estimates for K=6, 7 and 8
par(mar=c(4,4,1,1)+0.1)
plot(xx, mdeKK6, type="n",ylim=c(0,max(c(mdeKK6,mdeKK7,mdeKK8))), 
     xlab="Velocity", ylab="Density")
lines(xx, mdeKK6, col="black", lty=1, lwd=2)
lines(xx, mdeKK7, col="red", lty=2, lwd=2)
lines(xx, mdeKK8, col="blue", lty=3, lwd=2)
points(x, rep(0,n))
legend(26000, 0.00022, c("K = 6","K = 7","K = 8"), 
       lty=c(1,2,3), col=c("black","red","blue"), bty="n")

par(mar=c(4,4,1,1) + 0.1)
plot(seq(2,KKmax), sigma.sum, type="l", xlab="K", 
     ylab=expression(hat(sigma)), lwd=2)
abline(v=6, lty=3)
Figure 81.4: What happens with the variance (bandwidth) as K increases

81.3.1 Estimating the number of components in Bayesian settings 🎥

Is BIC Bayesian?

Despite having the term Bayesian in its name, the BIC is not a fully Bayesian method. It arises from a large-sample approximation related to the marginal likelihood, but in practice it is a penalized-likelihood criterion: it uses the maximized likelihood and the number of parameters, and it ignores the prior. In contrast, Bayesian methods estimate the number of components from the posterior distribution of the model parameters.

So what we want is a posterior distribution over the number of components. We can obtain one by placing a prior on the weights of the mixture model that approximates a Dirichlet process. The Dirichlet process is a nonparametric prior that allows for an infinite number of components, but only a finite number of them are occupied in the posterior distribution.

81.3.2 A sparse Dirichlet prior for the number of components

Figure 81.5: Estimating the number of components
Figure 81.6: Estimating the number of components
  • K = maximum number of components allowed in the mixture
  • K* = number of components that actually generated the data (the occupied components)

We assume that K^* \ll K.

So far we have used a uniform prior on the weights,

\tilde{w} \sim \mathrm{Dir}(1, \ldots ,1), \tag{81.2}

which is the uniform distribution on the simplex. This will not work here, because the number of weights grows with K and the prior exerts an increasing influence on the posterior. We need a prior whose influence on the posterior decreases as K increases, such as

\tilde{w} \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K), \tag{81.3}

where \alpha is a hyperparameter that controls the strength of the prior: the larger the value of \alpha, the stronger the prior and, as shown below, the more components we expect to be occupied a priori.
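
As a quick illustration (a sketch, not part of the lecture code), the snippet below draws weight vectors from \mathrm{Dir}(\alpha/K, \ldots, \alpha/K) for increasing values of K, using rdirichlet() from the MCMCpack package (the same function used in the sample code below), and reports the average number of weights larger than 0.01. The threshold and the values of K are arbitrary choices made only for this illustration.

library(MCMCpack)   ## provides rdirichlet()
set.seed(42)
alpha = 1.5
for(K in c(10, 50, 200)){
  ## 500 draws from the prior on the weights; each row is a K-vector of weights
  w = rdirichlet(500, rep(alpha/K, K))
  print(paste("K =", K, "; average number of weights above 0.01 =",
              round(mean(rowSums(w > 0.01)), 2)))
}

The number of non-negligible weights should remain roughly stable as K grows rather than increasing with K, which is precisely the behavior we want from the prior.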

If (w_1,\ldots,w_K) \sim \mathrm{Dir}(\alpha/K,\ldots,\alpha/K), then the expected number of occupied components satisfies:

\begin{aligned} \lim_{K \to \infty} \mathbb{E}[K^*] &= \sum_{i=1}^n \frac{\alpha}{\alpha+i-1} \\ & \approx \int_1^n \frac{\alpha}{\alpha + x - 1} \, dx \qquad \text{(Riemann sum approximation)} \\ & = \alpha \log\left(\frac{n+\alpha-1}{\alpha}\right) \end{aligned} \qquad \tag{81.4}

so that

\mathbb{E}[K^*] \approx \alpha \log\left(\frac{n+\alpha-1}{\alpha}\right), \tag{81.5}

leaving us with just a single parameter, \alpha, to tune. This is a very useful result because it allows us to estimate the number of components in the mixture without having to specify it in advance.
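
As a sanity check (again a sketch, not part of the lecture code), we can verify Equation 81.4 and Equation 81.5 by simulation: draw weights from \mathrm{Dir}(\alpha/K, \ldots, \alpha/K) with a large K, allocate n observations to components, and count how many components are occupied. The values \alpha = 1.5 and n = 82 match the galaxies example below; the value of K, the number of replicates and the variable names are choices made only for this illustration.

library(MCMCpack)   ## provides rdirichlet()
set.seed(1)
alpha = 1.5
n     = 82      ## sample size of the galaxies dataset
K     = 200     ## large upper bound, to approximate the limit in Equation 81.4
nsim  = 2000
Kstar = rep(0, nsim)
for(s in 1:nsim){
  w        = as.vector(rdirichlet(1, rep(alpha/K, K)))  ## sparse Dirichlet weights
  cc       = sample(1:K, n, replace=TRUE, prob=w)       ## component indicators
  Kstar[s] = length(unique(cc))                         ## number of occupied components
}
mean(Kstar)                        ## Monte Carlo estimate of E[K*]
sum(alpha/(alpha + 1:n - 1))       ## exact sum in Equation 81.4
alpha*log((n + alpha - 1)/alpha)   ## approximation in Equation 81.5 (about 6)

The three numbers should be close to one another (around six occupied components), with small differences due to the finite value of K, Monte Carlo error, and the Riemann-sum approximation.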

81.3.3 Sample code for estimating the number of components and the partition structure in Bayesian models 🗒️

## Full Bayesian estimation of a mixture model for density estimation in the galaxies dataset
rm(list=ls())

### Loading data and setting up global variables
library(MASS)
library(MCMCpack)
data(galaxies)
x  = galaxies
n  = length(x)
set.seed(781209)

### Fitting a Bayesian mixture model with KK components
KK = 30   ## In this formulation, it should be interpreted as the 
          ## maximum number of components allowed

## Finding the value of alpha consistent with 6 expected components a priori
ff = function(alpha)  alpha*log((82+alpha-1)/alpha) - 6
alph = uniroot(ff, c(0.01, 20))
alph$root  # 1.496393
[1] 1.496393
## Priors set up using an "empirical Bayes" approach
aa  = rep(1.5/KK,KK)  # We approximate 1.496393 by 1.5
eta = mean(x)    
tau = sqrt(var(x))
dd  = 2
qq  = var(x)/KK

## Initialize the parameters
w     = rep(1,KK)/KK
mu    = rnorm(KK, mean(x), sd(x))
sigma = sd(x)/KK
cc    = sample(1:KK, n, replace=T, prob=w)

## Number of iterations of the sampler
rrr   = 25000
burn  = 5000

## Storing the samples
cc.out    = array(0, dim=c(rrr, n))
w.out     = array(0, dim=c(rrr, KK))
mu.out    = array(0, dim=c(rrr, KK))
sigma.out = array(0, dim=c(rrr, KK))
logpost   = rep(0, rrr)


for(s in 1:rrr){
  # Sample the indicators
  for(i in 1:n){
    v = rep(0,KK)
    for(k in 1:KK){
      v[k] = log(w[k]) + dnorm(x[i], mu[k], sigma, log=TRUE)  #Compute the log of the weights
    }
    v = exp(v - max(v))/sum(exp(v - max(v)))
    cc[i] = sample(1:KK, 1, replace=TRUE, prob=v)
  }
  
  # Sample the weights
  w = as.vector(rdirichlet(1, aa + tabulate(cc, nbins=KK)))
  
  # Sample the means
  for(k in 1:KK){
    nk    = sum(cc==k)
    xsumk = sum(x[cc==k])
    tau2.hat = 1/(nk/sigma^2 + 1/tau^2)
    mu.hat  = tau2.hat*(xsumk/sigma^2 + eta/tau^2)
    mu[k]   = rnorm(1, mu.hat, sqrt(tau2.hat))
  }
  
  # Sample the variances
  dd.star = dd + n/2
  qq.star = qq + sum((x - mu[cc])^2)/2
  sigma = sqrt(1/rgamma(1, dd.star, qq.star))
  
  # Store samples
  cc.out[s,]   = cc
  w.out[s,]    = w
  mu.out[s,]   = mu
  sigma.out[s] = sigma
  for(i in 1:n){
    logpost[s] = logpost[s] + log(w[cc[i]]) + dnorm(x[i], mu[cc[i]], sigma, log=TRUE)
  }
  logpost[s] = logpost[s] + log(ddirichlet(w, aa))
  for(k in 1:KK){
      logpost[s] = logpost[s] + dnorm(mu[k], eta, tau, log=TRUE)
  }
  logpost[s] = logpost[s] + dgamma(1/sigma^2, dd, qq, log=TRUE) - 4*log(sigma)
  if(s/500==floor(s/500)){
    print(paste("s =",s))
  }
}
[1] "s = 500"
[1] "s = 1000"
[1] "s = 1500"
[1] "s = 2000"
[1] "s = 2500"
[1] "s = 3000"
[1] "s = 3500"
[1] "s = 4000"
[1] "s = 4500"
[1] "s = 5000"
[1] "s = 5500"
[1] "s = 6000"
[1] "s = 6500"
[1] "s = 7000"
[1] "s = 7500"
[1] "s = 8000"
[1] "s = 8500"
[1] "s = 9000"
[1] "s = 9500"
[1] "s = 10000"
[1] "s = 10500"
[1] "s = 11000"
[1] "s = 11500"
[1] "s = 12000"
[1] "s = 12500"
[1] "s = 13000"
[1] "s = 13500"
[1] "s = 14000"
[1] "s = 14500"
[1] "s = 15000"
[1] "s = 15500"
[1] "s = 16000"
[1] "s = 16500"
[1] "s = 17000"
[1] "s = 17500"
[1] "s = 18000"
[1] "s = 18500"
[1] "s = 19000"
[1] "s = 19500"
[1] "s = 20000"
[1] "s = 20500"
[1] "s = 21000"
[1] "s = 21500"
[1] "s = 22000"
[1] "s = 22500"
[1] "s = 23000"
[1] "s = 23500"
[1] "s = 24000"
[1] "s = 24500"
[1] "s = 25000"
nunique = function(x)   length(unique(x))
Kstar = apply(cc.out[-seq(1,burn),],1,nunique)
par(mar=c(4,4,1,1) + 0.1)
barplot(table(Kstar)/sum(table(Kstar)), xlab=expression(K^"*"), ylab="Frequency")

#dev.print(file="postKstaralpha2.pdf", dev=pdf)


## Construct pairwise co-clustering matrix for this dataset
pairwise = matrix(0, nrow=n, ncol=n)
for(s in 1:(rrr-burn)){
  for(i in 1:n){
    for(j in i:n){
      pairwise[i,j] = pairwise[i,j] + as.numeric(cc.out[s+burn,i]==cc.out[s+burn,j])
      pairwise[j,i] = pairwise[i,j]
    }
  }  
}
DD = pairwise/max(pairwise)

heatmapplot = function(DD, alab, subsetaxis, llc=FALSE){
  n = dim(DD)[1]
  #colorscale = rev(gray(0:100 / 100))
  colorscale = c("white", rev(heat.colors(100)))
  nf = layout(matrix(c(1,2),nrow=1,ncol=2), c(7,1), TRUE)
  par(mar=c(4,3,1,0.5))
  
  ###Display heat-map
  image(seq(1,n), seq(1,n), DD, axes=F, xlab="", ylab="", 
        col=colorscale[seq(floor(min(100*DD)), floor(max(100*DD))) + 1])
  axis(1,at=subsetaxis,labels=alab[subsetaxis],las=2,cex.axis=1)
  axis(2,at=subsetaxis,labels=alab[subsetaxis],las=2,cex.axis=1)
  box()
  abline(v = llc+0.5)
  abline(h = llc+0.5)
  
  ###Display color scale
  par(mar=c(3,0,0,0))
  plot(1:100,1:100,xlim=c(0,2),ylim=c(0,100),type="n",axes=F,xlab ="",ylab ="")
  yposr = 1:100
  rect(0, yposr-.5, 0.5, yposr+.5,col = colorscale, border=F)
  rect(0, .5, 0.5, 100.5,col = "transparent")
  text(0.42,c(yposr[1],yposr[25],yposr[50],yposr[75],yposr[100]),c("0.00","0.25","0.50","0.75","1.00"),pos=4,cex=1.1)
}
heatmapplot(DD, seq(1,n), seq(1,n,by=3))

## Plot Bayesian estimate with pointwise credible bands along with kernel density estimate and frequentist point estimate
## Compute the samples of the density over a dense grid
xx  = seq(5000,37000,length=300)
density.mcmc = array(0, dim=c(rrr-burn,length(xx)))
for(s in 1:(rrr-burn)){
  for(k in 1:KK){
    density.mcmc[s,] = density.mcmc[s,] + w.out[s+burn,k]*dnorm(xx,mu.out[s+burn,k],sigma.out[s+burn])
  }
}
density.mcmc.m = apply(density.mcmc , 2, mean)

yy = density(x)
colscale = c("black", "red")
density.mcmc.lq = apply(density.mcmc, 2, quantile, 0.025)
density.mcmc.uq = apply(density.mcmc, 2, quantile, 0.975)
par(mfrow=c(1,1))
par(mar=c(4,4,1,1)+0.1)
plot(xx, density.mcmc.m, type="n",ylim=c(0,max(density.mcmc.uq)),xlab="Velocity", ylab="Density")
polygon(c(xx,rev(xx)), c(density.mcmc.lq, rev(density.mcmc.uq)), col="grey", border="grey")
lines(xx, density.mcmc.m, col=colscale[1], lwd=2)
lines(yy, col=colscale[2], lty=2, lwd=2)
points(x, rep(0,n))
legend(27000, 0.00017, c("MCMC","KDE"), col=colscale[c(1,2)], lty=c(1,2), lwd=2, bty="n")

##### Finding optimal partition according to Binder's loss function
##
## Function that computes the loss at a particular configuration
Lstst = function(cch, DD, Dbar){
  z = 0
  for(i in 1:(n-1)){
    for(j in (i+1):n){
      if(cch[i]==cch[j]){
        z = z + (DD[i,j]-Dbar)
      }
    }
  }
  return(z)
}

## Initial value of the algorithm is the last iteration of the sampler
## Using as.numeric(factor()) is a cheap way to force the cluster labels 
## to be sequential starting at 1
cch = as.numeric(factor(cc))

## Set up parameters for the iterative algorithm
Dbar = 0.50
optLstst.old  = -Inf
optLstst.new = Lstst(cch, DD, Dbar=Dbar)
maxiter = 50
niter   = 1
while((optLstst.old!=optLstst.new)&(niter<=maxiter)){
  for(i in 1:n){
    nq   = max(cch) + 1
    q    = rep(0, nq)
    for(s in 1:nq){
      ccht    = cch
      ccht[i] = s
      q[s] = Lstst(ccht, DD, Dbar=Dbar)
    }
    cch[i] = which.max(q)
    cch = as.numeric(factor(cch))
  }
  optLstst.old = optLstst.new
  optLstst.new = Lstst(cch, DD, Dbar=Dbar)
  niter = niter+1
}
#print(nunique(cch))

## Create another heatmap plot of the co-clustering matrix in which the 
## optimal clusters are represented.
cchlo    = as.numeric(as.character(factor(cch, labels=order(unique(cch)))))
cchlotab = table(cchlo)
llc      = cumsum(cchlotab[-length(cchlotab)])
heatmapplot(DD, seq(1,n), seq(1,n,by=3), llc=llc)

#dev.print(file="galaxiesheatmap50.pdf", dev=pdf)