Bayesian Non-Parametric Models

Seminar on Advanced Topics in Bayesian Statistics
Bayesian Statistics
Author

Oren Bochman

Published

July 2, 2025

Keywords

Bayesian NonParametrics, BNP

The specialization has ended - I plan to revisit the material in the future and redo the exercises, as well as extra assignments drawn from the recommended readings. However, I now want to cover some more advanced topics in Bayesian statistics that we only touched on briefly, if at all, in the specialization. I find that starting a new topic in mathematics is very challenging: there are new concepts, definitions, and notation, and often little in the way of connections to what I already know.

So I want to start by mapping out the new territory, finding tutorials and courses on the web that ease one into these new topics: something like a virtual seminar with guest lecturers and hands-on workshops. The best of these should let one build enough intuition to dive deeper into an online course, a textbook, or even research papers if the material is very new.

Last I checked there isn’t a course on Bayesian Non-parametric models on Coursera. At least not in English. But learning never ends.

This is not available as a course on Coursera and isn’t part of the specialization, which ended with the previous course. So these are my own personal notes, gathered from tutorials and courses I found on the web.

Also, there are a number of very gifted people teaching and researching BNP. Some teach very formal courses with lemmas and theorems; others take a more informal, hands-on approach, presenting ideas and recent results from research. Seeing theorems and their proofs is important, but I find it boring and hard to follow without a good deal of motivation and intuition. So I wish to prioritize the more informal and intuitive approach and build up from there.

Motivational Questions

Let’s start with some motivational questions:

We often see that BNP methods require jumping to an infinite-dimensional space only to marginalize it out and get back to a finite-dimensional distribution – a meditation on random measures.

Here are some questions extracted from my Bayesian Feynman Notebook.

Exercise 1 (Gaussian Processes) A problem with parametric regression models is that they impose a global structure on the data that might not be appropriate. One example we saw with mixtures is a bimodal distribution of heights; a two-component mixture of Gaussians is a good model for this. However, what if we need a more flexible model that can adapt to local structure in the data?

How can we do regression that adapts to model our data locally? That is, a regression curve with many parameters (\mu_i, \sigma_i) at different x_i that can change as flexibly as our DLM time series models, which have an almost uncanny ability to fit complex data. We would probably also need to handle both regions with sparse data and regions with dense data.

Solution 1. So far we have introduced flexible mixtures for density estimation, clustering, and regression. Mixture models are more flexible than any single distribution, but they are still parametric. What we want is a model for regression that can adapt to represent the data locally. We saw one efficient approach to this for mixture density estimation, but we need something more general that does not force us to fix the number of components and that is easier to use for regression.

This is the idea behind Gaussian Processes (GPs). GPs are a type of stochastic process that can be used to define a distribution over functions. This allows us to model complex relationships in the data without having to specify a fixed number of parameters.

There are a number of ways to think about GPs: as a generalization of the multivariate normal distribution to infinite dimensions, or as a distribution over functions. We should therefore look at how GPs can be used for regression and how they fit into the Bayesian framework.
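
To make this concrete, here is a minimal sketch of GP regression with a squared exponential kernel, in the spirit of the tutorials listed below (the toy data, kernel hyperparameters, and noise level are my own assumptions, not taken from any of those sources):

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared exponential (RBF) covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

# Toy data (assumed for illustration): noisy observations of a smooth function.
rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=12)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(12)
x_test = np.linspace(-4.0, 4.0, 200)
noise_var = 0.1 ** 2  # assumed observation-noise variance

# Condition the joint Gaussian over (train, test) values on the observed targets.
K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
K_s = sq_exp_kernel(x_train, x_test)
K_ss = sq_exp_kernel(x_test, x_test)

# Cholesky solve for numerical stability instead of inverting K directly.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
post_mean = K_s.T @ alpha                 # posterior mean at the test inputs
v = np.linalg.solve(L, K_s)
post_var = np.diag(K_ss - v.T @ v)        # pointwise posterior variance
```

The posterior mean adapts locally to wherever the data are dense, while the posterior variance grows in regions with sparse data, which is exactly the behaviour asked for in the exercise.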

One well-known method to do this is non-parametric regression using Gaussian Processes (GPs). Will we learn about Gaussian Processes or Neural Networks in this course?

  • These are Bayesian non-parametric models, and we don’t cover them in the specialization.

Exercise 2 (What are some good resources to learn about Gaussian Processes?)  

Solution 2.

Figure 1: Gaussian Processes in Practice Workshop, Bletchley Park 2006 by David MacKay. Notes: Chapter 45 of his book

  • My starting point is Tamara Broderick’s Gaussian Processes for Regression tutorials from tutorial_2016_mlss_cadiz (slides, video, and code).
Figure 2: AMSTAT 2025 by Tamara Broderick. Slides

Figure 3: Statistical Rethinking 2023 by Richard McElreath. Material is based on his book Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Second Edition, CRC Press, Taylor & Francis Group.

  • Here is another lecture on Gaussian Processes by Kilian Weinberger as part of a general Machine Learning course at Cornell University. The notes are available online: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote15.html

Figure 4: Machine Learning Lecture 26 “Gaussian Processes” - Cornell CS4780 SP17

The way Gaussian process regression is typically explained, there is a very beautiful side to it, which is basically that it is a prior over infinite-dimensional functions. It is so beautiful that anyone who describes it goes off for pages and pages about infinite-dimensional functions and priors in that space, which makes it really hard to understand. I’m trying to cut all that stuff out. I haven’t found anywhere else an explanation of Gaussian processes that manages to resist the temptation to talk about how beautiful it is to have priors over infinite-dimensional functions. So make sure you pay good attention today, because it’s really, really hard. If you want to read the secondary reading, it will always go in that direction, just because it is very beautiful, but it’s hard if you haven’t heard it before. – Kilian Weinberger

Although he claims to be a frequentist, this lecture presents some Bayesian ideas and demonstrates how well the speaker has internalized this material and how clearly he can communicate it. I enjoyed his approach of not leaving any student behind, repeating the fundamental ideas and building up from there without letting this slow him down!


  • A deeper dive into Gaussian Processes comes from Dr. Philipp Hennig. His treatment is more abstract and mathematical than most, and is thus a great reference to visit once you are no longer worried by notation and are looking for greater abstraction.
Figure 5

  • The key reference is the book by Carl Edward Rasmussen and Christopher K. I. Williams, “Gaussian Processes for Machine Learning” (Rasmussen and Williams 2006). The book is available online for free here.

  • Even more advanced is the GPSS YouTube channel with 60+ videos at different levels

  • I found a couple of talks by Rasmussen from the 2009 Machine Learning Summer School in the UK that may be of interest:

    • Gaussian Processes - part 1
    • Gaussian Processes - part 2
    • He says that it took him about two years to understand Gaussian Processes.
    • Why do we need the infinite-dimensional view of GPs when in the end we work with finite-dimensional distributions?
      • Because we need to find where the random variables of interest lie.
      • There is a statistical reason: the finite-dimensional distributions are just projections of the infinite-dimensional distribution.

Bottom line: I find Tamara Broderick’s tutorials to be the most accessible and practical introduction to Gaussian Processes. I think that after covering them I am quite ready to dive deeper into the MacKay and Rasmussen & Williams books.

Exercise 3 (Fully Bayesian Mixture Models)  

  1. How can we include the size of a mixture (the number of components) as part of the inference in a Bayesian mixture model?

Solution 3. My initial instinct is to use a Cauchy prior and use HMC for inference. It seems that this might not work too well: the number of components is discrete, which HMC cannot sample directly, and HMC and NUTS are not good at jumping between modes.

So the second idea is to go back to the Gibbs sampler and add a step to sample the number of components. This time a Cauchy prior might not be such a good idea (it has no moments), so a Poisson prior or a Negative Binomial prior might be better. Note though that both the Poisson and the Negative Binomial require setting parameters that place the mass at some mean value. What if we don’t know this value and it might be very big? Or worse, what if we are streaming in data (perhaps using a conditionally conjugate model) and we keep seeing new clusters, e.g. when modeling topics of Wikipedia articles?

I think this might work fine, but I suspect it isn’t ideal either, as this is not the setup you see mentioned if you do a bit of research on the web. In (Miller and Harrison 2018) the authors review the two main approaches to this problem and find a number of parallels between them. They cover mixture of finite mixtures (MFM) and Dirichlet process mixture (DPM) models.

My idea fits into the MFM approach, but they point out that MFM models require sophisticated sampling setups based on reversible jump Markov chain Monte Carlo, and it can be nontrivial to design good reversible jump moves, especially in high-dimensional spaces. So it might be neat to implement this for the toy examples in the course, like the galaxy data set, but it is unlikely to be useful for more complex models.
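
In contrast, the DPM approach effectively puts a prior on the partition itself through the Chinese restaurant process, so the number of clusters grows with the data instead of being fixed in advance. Here is a minimal simulation of that growth (the concentration parameter and sample sizes are my own assumptions, chosen purely for illustration):

```python
import numpy as np

def crp_partition(n, alpha, rng):
    """Sample a partition of n points from a Chinese restaurant process with
    concentration alpha: each point joins an existing cluster with probability
    proportional to its size, or starts a new cluster with probability
    proportional to alpha."""
    counts = [1]  # the first point starts the first cluster
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)      # a brand-new cluster appears
        else:
            counts[table] += 1
    return counts

rng = np.random.default_rng(1)
for n in (100, 1_000, 10_000):
    counts = crp_partition(n, alpha=1.0, rng=rng)
    # The expected number of clusters grows roughly like alpha * log(n).
    print(n, "points ->", len(counts), "clusters")
```

This slow, roughly logarithmic growth is what makes DPM models attractive for streaming settings like the Wikipedia topic example above, where we never want to commit to a fixed number of components.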

Exercise 4 What are the main mathematical results we need to understand BNP models? Can we further organize these as prerequisites and results that are developed for BNP models?

Solution 4 (Prerequisites).

  • Doob’s theorem
    • This is a key result in Bayesian statistics that shows that under certain conditions, the posterior distribution converges to the true parameter value as the sample size increases.
    • This is important for understanding the asymptotic behavior of BNP models.
  • Kolmogorov extension theorem
    • This is a key result in probability theory that shows that under certain conditions, a collection of finite-dimensional distributions can be extended to a consistent infinite-dimensional distribution.
    • This is important for understanding how BNP models can be defined in terms of finite-dimensional distributions; a minimal statement of the consistency conditions follows this list.
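
As a minimal sketch (my own notation, not taken from any of the sources cited here): for indices t_1, \dots, t_k the finite-dimensional distributions \mu_{t_1,\dots,t_k} must satisfy

$$
\mu_{t_{\pi(1)},\dots,t_{\pi(k)}}\big(A_{\pi(1)} \times \cdots \times A_{\pi(k)}\big) = \mu_{t_1,\dots,t_k}\big(A_1 \times \cdots \times A_k\big)
$$

for every permutation \pi of \{1, \dots, k\}, and

$$
\mu_{t_1,\dots,t_k,t_{k+1}}\big(A_1 \times \cdots \times A_k \times \mathbb{R}\big) = \mu_{t_1,\dots,t_k}\big(A_1 \times \cdots \times A_k\big).
$$

If both conditions hold, there is a unique stochastic process whose finite-dimensional marginals are exactly these distributions. This is what lets us define a GP (and, with some extra measure-theoretic care, priors like the Dirichlet process) while only ever writing down finite-dimensional distributions.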

Key Concepts and Results

So far I have found the following key concepts and results that are important to understand BNP models:

  • Exchangeability and De Finetti’s theorem
    • This is a key result that shows that any infinite exchangeable sequence of random variables can be represented as a mixture of i.i.d. sequences. This is important for understanding how BNP models can be used to model complex data.
    • Tamara Broderick points out a number of papers in which de Finetti’s theorem was proved and extended by various researchers over the years to cover more general sequences of random variables.
    • In Abel Rodriguez’s notes he points out that there are a number of stronger and weaker forms of exchangeability. These ideas are intuitive, but formalizing them is more challenging, and knowing the terminology can make our thinking more precise.
      • One idea that is more or less obvious from the canonical first example of clustering is that we can exchange the labels of the clusters. This is called label switching and is a common problem in Bayesian mixture models.
      • Changing the order in which the points are sampled shouldn’t matter.
      • However, we still want points within a cluster to be more similar to each other than to points in other clusters.
      • More generally, for constructs like trees, graphs, and time series, we want the benefits of de Finetti’s theorem while respecting the dependency structure of the data.
  • Stochastic Processes
    • A stochastic process generalizes a random variable to a collection of random variables indexed by time or space. This index can be discrete or continuous.
    • Some Examples of stochastic processes we will cover include the Gaussian process, the Dirichlet process, the Chinese restaurant process, the Indian buffet process, and the Pitman-Yor process.
  • Conjugacy
    • This is a key concept in Bayesian statistics that allows us to update our beliefs about a parameter in a computationally efficient way. Many BNP models are based on conjugate priors, which makes inference easier.
  • Measure Theory
    • This is the branch of mathematics that formalizes the intuitive notion of probability mass spread over the support of a distribution.
    • We will need just enough of it to define random measures, which let us view BNP models through an abstraction that makes most of them look very similar, so that understanding one (say, the Dirichlet process) makes it easier to understand many others; a minimal stick-breaking sketch follows this list.
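
To make the random-measure idea concrete, here is a minimal stick-breaking sketch of a single draw from a Dirichlet process (the concentration parameter, standard normal base measure, and truncation level are my own assumptions for illustration):

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw G = sum_k w_k * delta_{theta_k} from DP(alpha, G0).
    The weights come from breaking a unit-length stick with Beta(1, alpha) fractions;
    the atom locations theta_k are i.i.d. draws from the base measure G0."""
    fractions = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - fractions)[:-1]))
    weights = fractions * remaining
    atoms = base_sampler(n_atoms)
    return weights, atoms

rng = np.random.default_rng(2)
weights, atoms = stick_breaking_dp(
    alpha=2.0,
    base_sampler=lambda k: rng.standard_normal(k),  # G0 = N(0, 1), an assumption
    n_atoms=500,  # truncation level; large enough that the leftover mass is tiny
    rng=rng,
)
print("probability mass captured by the truncation:", weights.sum())
```

Each draw is itself a discrete probability measure, and the same few lines reappear with small variations for the Pitman-Yor process and other BNP priors, which is exactly the payoff of the random-measure abstraction.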

Overview

  • In this course we will build the following skills:
    • Probability distributions (Dirichlet, Beta)
    • Stochastic processes (Gaussian Process, Poisson Process, Dirichlet Process, Chinese Restaurant Process, Indian Buffet Process, Pitman-Yor Process, Polya Tree)
  • There are currently six modules planned in this course:
    1. Gaussian Processes for Regression: We will focus on Gaussian processes as a flexible prior distribution for regression problems, allowing us to capture complex relationships in the data.
      1. Gaussian process model slides
      2. Gaussian process regression slides
      3. Squared exponential kernel and observation noise slides
      4. What uncertainty are we quantifying? slides
      5. A list of resources: slide
    2. Dirichlet process: We will explore the Dirichlet process as a prior distribution over probability measures, allowing for flexible modeling of unknown distributions.
    • The Beta distribution
    • The Dirichlet Distribution
    • Dirichlet process
    • Polya urn scheme
    • Stick breaking representation
    • Dirichlet process mixture models
    • Hierarchical Dirichlet processes
    3. Chinese restaurant process: We will introduce the Chinese restaurant process as a metaphor for the Dirichlet process, providing an intuitive understanding of how it works.
    4. Indian buffet process: We will discuss the Indian buffet process as a model for representing the distribution of features in a dataset, allowing for flexible and scalable modeling of complex data structures.
    5. Pitman-Yor process: We will examine the Pitman-Yor process as a generalization of the Dirichlet process, providing a more flexible framework for modeling distributions with power-law behavior (see the sketch after this list).
    6. Polya tree: We will explore the Polya tree as a nonparametric prior distribution for modeling probability distributions, allowing for flexible and adaptive modeling of complex data structures.
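
As referenced in the Pitman-Yor item above, here is a minimal sketch of the two-parameter Chinese restaurant seating rule induced by the Pitman-Yor process (the discount and concentration values are my own assumptions, chosen only to show the heavier tail of cluster sizes compared to the plain CRP):

```python
import numpy as np

def pitman_yor_partition(n, discount, concentration, rng):
    """Two-parameter CRP: a point joins cluster j with probability proportional
    to (size_j - discount) and starts a new cluster with probability proportional
    to (concentration + discount * number_of_clusters).
    Setting discount = 0 recovers the ordinary Chinese restaurant process."""
    counts = [1]
    for _ in range(1, n):
        weights = np.array([c - discount for c in counts]
                           + [concentration + discount * len(counts)])
        weights /= weights.sum()
        table = rng.choice(len(weights), p=weights)
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
    return counts

rng = np.random.default_rng(3)
dp_counts = pitman_yor_partition(10_000, discount=0.0, concentration=1.0, rng=rng)
py_counts = pitman_yor_partition(10_000, discount=0.5, concentration=1.0, rng=rng)
print("Dirichlet process clusters:", len(dp_counts))   # grows roughly like log(n)
print("Pitman-Yor clusters:", len(py_counts))          # grows roughly like n**discount
```

The power-law growth in the number and sizes of clusters is what makes the Pitman-Yor process a better fit than the Dirichlet process for data such as word frequencies in natural language.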

Prerequisite skill checklist 🗒️

  • Bayesian Statistics
  • Mixture Models
  • Time Series Analysis

Some References:

  1. All of Nonparametric Bayesian Statistics (Wasserman 2006)
  2. Gaussian Processes (Rasmussen and Williams 2006)
  3. Surrogates (Gramacy 2020)
  4. Bayesian Optimization (Garnett 2023)
  5. A Tutorial on Bayesian Optimization (Frazier 2018)
  • “It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample.”
  • It reviews three common acquisition functions (a minimal expected-improvement sketch follows this reference list):
    • expected improvement
    • entropy search
    • knowledge gradient
  6. Bayesian Nonparametrics via Neural Networks (Lee 2004)
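
As flagged in the Frazier entry above, here is a minimal sketch of the expected improvement acquisition function under a GP surrogate, written for minimization (the closed form is standard; the function names and toy inputs are my own assumptions rather than any particular library’s API):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Expected improvement at candidate points with GP posterior mean `mu` and
    standard deviation `sigma`; `f_best` is the lowest observed objective value
    so far and `xi` is an optional exploration bonus."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improvement = f_best - mu - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    # With zero predictive uncertainty the improvement is deterministic.
    return np.where(sigma > 0, ei, np.maximum(improvement, 0.0))

# Toy usage with made-up posterior summaries at three candidate points.
print(expected_improvement(mu=[0.2, 0.5, 0.9], sigma=[0.3, 0.1, 0.4], f_best=0.4))
```

Bayesian optimization then evaluates the objective at the candidate with the largest expected improvement, refits the GP surrogate, and repeats.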

So after going over the first two tutorials by Tamara Broderick, I feel like I have a good understanding of Gaussian Processes and the Dirichlet Process. This seems like a very good starting point for learning more.

After that, I felt more comfortable listening to and reading Larry Wasserman on BNP and also on frequentist non-parametrics. It seems useful to go over frequentist non-parametric methods as well, since they are different from Bayesian non-parametric methods. (For parametric models there is a good case that Bayesian and frequentist methods are not only similar, but that if one uses a reference prior they can lead to the same results.) Wasserman makes a strong point, when explaining the non-parametric CDF, that the Bayesian and frequentist approaches are doing different things and that the distinction is interesting.

Papers

  • Most of the work on BNP is covered in papers.

  • In fact, (Wasserman 2006) points out that he doesn’t give proofs for many of the results and points us to the original papers.

  • Tamara Broderick’s tutorial notes provide a bibliography with the papers that are referenced in her tutorials.

    • Some of these she emphasizes as being particularly important; others are just sources for examples. However, in retrospect, an example for a problem like topic modeling is going to be quite different from one for time series, so having examples covering BNP for different types of data is central to understanding how to use these models in practice.
    • She also points out that there were two eras of research in BNP: a first era with many theoretical results, and a second era after MCMC methods became strong enough to handle the computational complexity of BNP models. In the second era people focused more on using MCMC to do inference with many of the models developed in the first era.
    • In another tutorial she points out that for many of the models there are still many open problems, and that these may need to be answered before one can use these models for inference on parameters of interest for a specific problem and data set.
  • At this point I want to get good enough to do inference on some of these models and then to adapt them to a few problems I have in mind, like Lewis signaling, elasticity-based pricing, and Bayesian RL agents.

  • So while I don’t think these are easy reads, I will:

    1. First try to get a better understanding of the key concepts and ideas.
    2. Try to do the exercises in the notes and tutorials.
    3. Try to implement some of the models in R and Python.
    4. Try to use AI to breeze through the most important papers.

So far (Ferguson 1973) and (Sudderth and Jordan 2008) seem the most interesting papers to start with for Dirichlet Processes. There is also (J. Kingman 1967) which is a key paper on random measures and (J. F. C. Kingman 1978) with the Kingman paintbox process.

Also, the papers on de Finetti’s theorem and exchangeability are important yet very challenging. Tamara Broderick’s tutorial talks about this result. On the other hand, Larry Wasserman blurted out that you can’t use Bayes’ theorem to do inference when the measure is not sigma-finite, so we need to be careful when using these models. This burst a bunch of bubbles. But, like how I felt about Pappus’s hexagon theorem remaining correct in projective geometry even when we discard the parallel postulate, I feel like this is a good thing. We see once again that Bayes’ theorem is not the result around which all of Bayesian statistics revolves; the power of conjugacy and the definition of joint distributions via conditional probabilities are more fundamental. We also see that MCMC can help us do inference in many cases where we can’t use Bayes’ theorem and where even conjugacy breaks down: we can use sampling to get a good enough approximation to the posterior even when the measure we want to consider is not sigma-finite. Finally, MCMC is also very useful for RL, and one of my goals is to use BNP methods to build better RL agents.

Additional Resources:

  1. Abel Rodriguez from UCSC, the instructor of the third course (on mixture models), has a website.
  2. Herbert Lee from UCSC wrote a monograph (Lee 2004) on the subject of neural networks and Bayesian non-parametrics.
  3. Athanasios Kottas from UCSC has made notes available on his website.
  4. Tamara Broderick from MIT has made the following resources available:
  • 2024 slides
  • A tutorial on the Dirichlet Process from MLSS 2016 at the University of Cádiz in Cádiz, Spain link

Also, I noticed that there are some reference books on non-parametric statistics that people used to go to to get started. It might be worthwhile to list these along with what they cover.

Also, a number of younger researchers have made notes and tutorials available on the web. On the other hand, some of the more senior people who invented these methods have also made notes available.

  1. Peter Orbanz from University College London: Tutorials on Bayesian Nonparametrics
  2. Yee Whye Teh from the University of Oxford (home page) has written some papers on topics I am interested in:
  • (Teh and Jordan 2010) “Hierarchical Bayesian Nonparametric Models with Applications” offers a whirlwind tour of BNP models and their applications, similar to Tamara Broderick’s segment on other models in her tutorial but 48 pages worth. Topics include:
    • Chinese Restaurant Franchise
    • Hidden Markov Models with Infinite State Spaces - could help RL agents use finite, data-bounded approximations to infinite state spaces.
    • Hierarchical Pitman-Yor Processes for language modeling - I think this one is a great intro for rolling your own hierarchical BNP models, with strong intuition for topic modeling.
  • (Hinton, Osindero, and Teh 2006) “A Fast Learning Algorithm for Deep Belief Nets”, because RL agents are overconfident and could use a fast learning algorithm that factors in uncertainty.
  • (Teh 2006) “A Hierarchical Bayesian Language Model based on Pitman-Yor Processes” might serve as the basis for a language model that uses a Lewis signaling solution as its support.
  • (Wood et al. 2011) “The Sequence Memoizer”